EvalPlus

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • HumanEval+: 80x more tests than the original HumanEval!
  • MBPP+: 35x more tests than the original MBPP!
  • EvalPerf: evaluating the efficiency of LLM-generated code!
  • Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.


Model (pass@1) 🏆 | HumanEval+ | HumanEval | MBPP+ | MBPP
O1 Preview (Sept 2024) | 89 | 96.3 | 80.2 | 95.5
O1 Mini (Sept 2024) | 89 | 96.3 | 78.8 | 93.1
GPT 4o (Aug 2024) | 87.2 | 92.7 | 72.2 | 87.6
Qwen2.5-Coder-32B-Instruct | 87.2 | 92.1 | 77 | 90.5
DeepSeek-V3 (Nov 2024) | 86.6 | 91.5 | 73 | 87.6
GPT-4-Turbo (April 2024) | 86.6 | 90.2 | - | -
DeepSeek-V2.5 (Nov 2024) | 83.5 | 90.2 | 74.1 | 87.6
GPT 4o Mini (July 2024) | 83.5 | 88.4 | 72.2 | 85.4
DeepSeek-Coder-V2-Instruct | 82.3 | 85.4 | 75.1 | 89.4
Claude Sonnet 3.5 (June 2024) | 81.7 | 87.2 | 74.3 | 89.4
GPT-4-Turbo (Nov 2023) | 81.7 | 85.4 | 73.3 | 85.7
Grok Beta | 80.5 | 88.4 | 65.6 | 86
Gemini 1.5 Pro 002 | 79.3 | 89 | 74.6 | 89.7
GPT-4 (May 2023) | 79.3 | 88.4 | - | -
CodeQwen1.5-7B-Chat | 78.7 | 83.5 | 69 | 79.4
claude-3-opus (Mar 2024) | 77.4 | 82.9 | 73.3 | 89.4
OpenCoder-8B-Instruct | 77.4 | 81.7 | 71.4 | 82
Gemini 1.5 Flash 002 | 75.6 | 82.3 | 67.5 | 84.7
DeepSeek-Coder-33B-instruct | 75 | 81.1 | 70.1 | 80.4
Codestral-22B-v0.1 | 73.8 | 79.9 | 61.9 | 72.5
OpenCodeInterpreter-DS-33B | 73.8 | 79.3 | 68.5 | 80.2
WizardCoder-33B-V1.1 | 73.2 | 79.9 | - | -
Artigenz-Coder-DS-6.7B | 72.6 | 75.6 | 69.6 | 80.7
Llama3-70B-instruct | 72 | 77.4 | 69 | 82.3
OpenCodeInterpreter-DS-6.7B | 72 | 77.4 | 66.4 | 76.5
speechless-codellama-34B-v2.0 | 72 | 77.4 | 61.4 | 73.8
Mixtral-8x22B-Instruct-v0.1 | 72 | 76.2 | 64.3 | 73.8
Magicoder-S-DS-6.7B | 71.3 | 76.8 | 69 | 79.4
DeepSeek-Coder-7B-instruct-v1.5 | 71.3 | 75.6 | 62.2 | 75.2
DeepSeek-Coder-6.7B-instruct | 71.3 | 74.4 | 65.6 | 74.9
starchat2-15b-v0.1 | 71.3 | 73.8 | 64.6 | 74.9
GPT-3.5-Turbo (Nov 2023) | 70.7 | 76.8 | 69.7 | 82.5
code-millenials-34B | 70.7 | 74.4 | 64.6 | 76.2
databricks/dbrx-instruct | 70.1 | 75 | 55.8 | 67.2
XwinCoder-34B | 69.5 | 75.6 | 64.8 | 77
WaveCoder-Ultra-6.7B | 69.5 | 75 | 63.5 | 74.9
claude-3-haiku (Mar 2024) | 68.9 | 76.8 | 68.8 | 80.2
OpenChat-3.5-7B-0106 | 67.7 | 72.6 | 54.5 | 63.8
Magicoder-S-CL-7B | 67.7 | 70.7 | 60.1 | 70.6
Phind-CodeLlama-34B-v2 | 67.1 | 71.3 | - | -
GPT-3.5 (May 2023) | 66.5 | 73.2 | - | -
WhiteRabbitNeo-33B-v1 | 65.9 | 72 | 66.9 | 79.4
CodeLlama-70B-Instruct | 65.9 | 72 | - | -
speechless-coder-ds-6.7B | 65.9 | 71.3 | 64.4 | 75.9
WizardCoder-Python-34B-V1.0 | 64.6 | 73.2 | 63.2 | 75.1
claude-3-sonnet (Mar 2024) | 64 | 70.7 | 69.3 | 83.6
Llama3.1-8B-instruct | 62.8 | 69.5 | 55.6 | 68.3
speechless-starcoder2-15b | 62.8 | 67.1 | 62.4 | 73.5
Mistral Large (Mar 2024) | 62.2 | 69.5 | 59.5 | 72.8
claude-2 (Mar 2024) | 61.6 | 69.5 | - | -
Gemini Pro 1.5 | 61 | 68.3 | - | -
starcoder2-15b-instruct-v0.1 | 60.4 | 67.7 | 65.1 | 78
DeepSeek-Coder-1.3B-instruct | 60.4 | 65.9 | 54.8 | 65.3
Code-290k-6.7B-Instruct | 59.7 | 64.6 | - | -
Qwen1.5-72B-Chat | 59.1 | 68.3 | 61.6 | 72.5
Phi-3-mini-4k-instruct | 59.1 | 64.6 | 54.2 | 65.9
dolphin-2.6-mixtral-8x7b | 57.3 | 64 | 59 | 70.6
Command-R+ | 56.7 | 64 | 63.5 | 74.3
Llama3-8B-instruct | 56.7 | 61.6 | 54.8 | 64.6
Gemini Pro 1.0 | 55.5 | 63.4 | 61.4 | 75.4
Code-13B | 52.4 | 56.1 | - | -
codegemma-7b-it | 51.8 | 60.4 | 56.9 | 70.4
speechless-starcoder2-7b | 51.8 | 56.1 | 56.3 | 66.7
claude-instant-1 (Mar 2024) | 50.6 | 57.3 | - | -
WizardCoder-15B-V1.0 | 50.6 | 56.7 | 54.2 | 64.3
CodeLlama-70B | 50.6 | 55.5 | - | -
speechless-coding-7B-16k-tora | 50.6 | 54.9 | 50.6 | 64.2
Code-33B | 49.4 | 54.9 | - | -
OpenHermes-2.5-Code-290k-13B | 48.8 | 54.3 | 45.8 | 52.4
CodeQwen1.5-7B | 45.7 | 51.8 | 60.8 | 73.5
WizardCoder-Python-7B-V1.0 | 45.1 | 50.6 | 49.5 | 58.5
phi-2-2.7B | 45.1 | 49.4 | 54.2 | 64
DeepSeek-Coder-33B-base | 44.5 | 51.2 | - | -
CodeLlama-34B | 43.9 | 51.8 | 56.3 | 69.3
Mistral-codealpaca-7B | 42.1 | 48.2 | - | -
MistralHermes-CodePro-7B-v1 | 42.1 | 47.6 | 46.4 | 57.4
speechless-code-mistral-7B-v1.0 | 41.5 | 48.2 | 48.7 | 57.4
codegemma-7b | 41.5 | 44.5 | 52.4 | 65.1
DeepSeek-Coder-6.7B-base | 39.6 | 47.6 | 58.7 | 72
Mixtral-8x7B-Instruct-v0.1 | 39.6 | 45.1 | 49.7 | 59.5
CodeLlama-13B | 38.4 | 42.7 | 52.6 | 63.5
StarCoder2-15B | 37.8 | 46.3 | - | -
SOLAR-10.7B-Instruct-v1.0 | 37.2 | 43.3 | 36.2 | 43.9
Mistral-7B-Instruct-v0.2 | 36 | 42.1 | 37 | 44.7
gemma-1.1-7b-it | 35.4 | 42.7 | 45 | 57.1
CodeLlama-7B | 35.4 | 37.8 | 46.8 | 59.5
xDAN-L1-Chat-RL-v1-7B | 32.9 | 40.2 | 41.3 | 50.3
Python-Code-13B | 30.5 | 32.9 | - | -
StarCoder2-7B | 29.9 | 35.4 | - | -
StarCoder-15B | 29.3 | 34.1 | 46.1 | 55.1
Llama3-8B-base | 29.3 | 33.5 | 51.6 | 61.4
gemma-7b | 28.7 | 35.4 | 43.4 | 52.6
CodeGen-16B | 28 | 32.9 | 45.5 | 54.2
StarCoder2-3B | 27.4 | 31.7 | - | -
CodeT5+-16B | 26.8 | 31.7 | 47.1 | 56.6
stable-code-3B | 25.6 | 29.3 | 45.8 | 54.8
CodeGen-6B | 25.6 | 29.3 | 42.9 | 50.8
DeepSeek-Coder-1.3B-base | 25.6 | 28.7 | 47.9 | 56.9
gemma-7b-it | 25 | 28.7 | 36.8 | 47.1
CodeT5+-6B | 24.4 | 29.3 | 41.5 | 52.9
Mistral-7B | 23.8 | 28.7 | 42.1 | 51.9
Zephyr β-7B | 23.2 | 30 | 34.7 | 42.1
CodeGen-2B | 22.6 | 24.4 | 36 | 46.3
CodeT5+-2B | 22 | 25 | 38.1 | 48.4
StarCoderBase-7B | 21.3 | 24.4 | - | -
codegemma-2b | 20.7 | 26.8 | 46.6 | 55.6
gemma-2b | 20.7 | 25 | 34.1 | 41.8
gemma-1.1-2b-it | 17.7 | 22.6 | 23.3 | 29.8

ℹ️ Note

  1. Models are ranked by pass@1 computed with greedy decoding (see the sketch below this note).
  2. Both MBPP and MBPP+ referred to in our leaderboard use a 399-task subset of hand-verified problems from MBPP-sanitized (427 tasks), to ensure each programming task is well-formed and unambiguous.
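
For intuition: with greedy decoding each task gets exactly one sample, so pass@1 reduces to the fraction of tasks whose single greedy sample passes every test. A minimal sketch in Python (the results dict below is a made-up example, not EvalPlus's output schema):

# pass@1 under greedy decoding = passed tasks / total tasks.
# `results` is hypothetical, not the real EvalPlus result format.
results = {
    "HumanEval/0": True,   # greedy sample passed all base + extra tests
    "HumanEval/1": False,  # failed at least one test
    "HumanEval/2": True,
}
pass_at_1 = sum(results.values()) / len(results)
print(f"pass@1 = {pass_at_1:.1%}")  # -> pass@1 = 66.7%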

About

Why EvalPlus? What do the EvalPlus datasets bring to you?

  • Precise evaluation: See our leaderboard for the latest LLM rankings before and after rigorous evaluation.
  • Coding rigorousness: Look at the score differences, especially before and after using the EvalPlus tests! A smaller drop means the model generates more rigorous code, while a larger drop means the generated code tends to be fragile (see the worked example after this list).
  • Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
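
To make the "score drop" concrete, the sketch below computes the HumanEval → HumanEval+ drop for two rows copied from the leaderboard above:

# Worked example: absolute and relative drop from HumanEval to HumanEval+,
# using two rows from the leaderboard above.
rows = {
    "GPT-4-Turbo (Nov 2023)": (85.4, 81.7),     # (HumanEval, HumanEval+)
    "claude-3-sonnet (Mar 2024)": (70.7, 64.0),
}
for model, (base, plus) in rows.items():
    drop = base - plus
    print(f"{model}: -{drop:.1f} points ({drop / base:.1%} relative)")
# A smaller drop suggests the generated code also survives the extra EvalPlus
# tests; a larger drop suggests it overfits the original, weaker test suite.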

Want to know more details? Read our papers & materials!

🔥 Quick Start

Code Correctness Evaluation: HumanEval(+) or MBPP(+)

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy
🛡️ Safe code execution within Docker
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval                    \
                 --backend vllm                         \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval                                     \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
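
If you want to spot-check generations before running the evaluator, the samples are plain JSONL files under evalplus_results/. A minimal sketch, assuming each line carries "task_id" and "solution" fields (adjust if your EvalPlus version uses a different schema):

# Peek at the first generated sample in the JSONL file.
import json

path = "evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl"
with open(path) as f:
    first = json.loads(next(f))
print(first["task_id"])          # e.g., "HumanEval/0"
print(first["solution"][:200])   # first 200 characters of the generated code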

Code Efficiency Evaluation: EvalPerf (*nix only)

pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
🛡️ Safe code execution within Docker
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf                     \
                 --backend vllm                         \
                 --temperature 1.0                      \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl

🚀 LLM Backends

HuggingFace models

  • transformers backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend hf                           \
                  --greedy

ℹ️ Note

EvalPlus uses different prompts for base and chat models. By default, the model type is detected via tokenizer.chat_template when using the hf or vllm backends; other backends only support chat mode.

Therefore, if your base model ships a tokenizer.chat_template, add --force-base-prompt to avoid evaluating it in chat mode.
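
A quick way to check whether a tokenizer ships a chat_template (and would therefore be treated as a chat model by default) is a one-off script with transformers, sketched below:

# Check whether the tokenizer defines a chat template.
# If it does but the model is actually a base model, pass --force-base-prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
print("has chat_template:", getattr(tok, "chat_template", None) is not None)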

Enable Flash Attention 2
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you hit installation problems, consider using the pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B"         \
                  --dataset [humaneval|mbpp]                     \
                  --backend hf                                   \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
  • vllm backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --tp [TENSOR_PARALLEL_SIZE]            \
                  --greedy
  • OpenAI-compatible servers (e.g., vLLM):
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06"  \
                  --dataset [humaneval|mbpp]   \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat"              \
                  --dataset [humaneval|mbpp]           \
                  --base-url https://api.deepseek.com  \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta"             \
                  --dataset [humaneval|mbpp]      \
                  --base-url https://api.x.ai/v1  \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --base-url http://localhost:8000/v1    \
                  --backend openai --greedy

  • gptqmodel backend:
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp]                                          \
                  --backend gptqmodel --greedy

OpenAI models

export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o"            \
                  --dataset [humaneval|mbpp]  \
                  --backend openai            \
                  --greedy

Anthropic models

export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp]        \
                  --backend anthropic               \
                  --greedy

Google Gemini models

export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro"    \
                  --dataset [humaneval|mbpp]  \
                  --backend google            \
                  --greedy

Amazon Bedrock models

export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp]                          \
                  --backend bedrock                                   \
                  --greedy

You can check out the generations and results at evalplus_results/[humaneval|mbpp]/.

⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
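
With the repo on PYTHONPATH you can also drive the datasets programmatically. A minimal sketch using the evalplus.data helpers (get_human_eval_plus, write_jsonl); generate() is a placeholder for your own generation function:

# Generate solutions for HumanEval+ tasks and write a samples file that
# `evalplus.evaluate --samples ...` can consume.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate(prompt: str) -> str:
    # Placeholder: call your own model here and return the code it produces.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate(task["prompt"])}
    for task_id, task in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)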

📜 Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}

🙏 Acknowledgement