EvalPlus

EvalPlus β€’ June 2, 2024

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • ✨ HumanEval+: 80x more tests than the original HumanEval!
  • ✨ MBPP+: 35x more tests than the original MBPP!
  • ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

File a request to add your models to our leaderboard!

πŸ”₯ Quick Start β€’ πŸ’» LLM-generated code β€’ πŸ”¨ Useful tools

| Model (pass@1) πŸ† | HumanEval+ | HumanEval | MBPP+ | MBPP |
|---|---:|---:|---:|---:|
| GPT-4-Turbo (April 2024) | 86.6 | 90.2 | | |
| DeepSeek-Coder-V2-Instruct | 82.3 | 85.4 | 75.1 | 89.4 |
| GPT-4-Turbo (Nov 2023) | 81.7 | 85.4 | 73.3 | 85.7 |
| GPT-4 (May 2023) | 79.3 | 88.4 | | |
| CodeQwen1.5-7B-Chat | 78.7 | 83.5 | 69 | 79.4 |
| claude-3-opus (Mar 2024) | 77.4 | 82.9 | 73.3 | 89.4 |
| DeepSeek-Coder-33B-instruct | 75 | 81.1 | 70.1 | 80.4 |
| OpenCodeInterpreter-DS-33B | 73.8 | 79.3 | 68.5 | 80.2 |
| WizardCoder-33B-V1.1 | 73.2 | 79.9 | | |
| Artigenz-Coder-DS-6.7B | 72.6 | 75.6 | 69.6 | 80.7 |
| Llama3-70B-instruct | 72 | 77.4 | 69 | 82.3 |
| OpenCodeInterpreter-DS-6.7B | 72 | 77.4 | 66.4 | 76.5 |
| speechless-codellama-34B-v2.0 | 72 | 77.4 | 61.4 | 73.8 |
| Mixtral-8x22B-Instruct-v0.1 | 72 | 76.2 | 64.3 | 73.8 |
| Magicoder-S-DS-6.7B | 71.3 | 76.8 | 69 | 79.4 |
| DeepSeek-Coder-7B-instruct-v1.5 | 71.3 | 75.6 | 62.2 | 75.2 |
| DeepSeek-Coder-6.7B-instruct | 71.3 | 74.4 | 65.6 | 74.9 |
| starchat2-15b-v0.1 | 71.3 | 73.8 | 64.6 | 74.9 |
| GPT-3.5-Turbo (Nov 2023) | 70.7 | 76.8 | 69.7 | 82.5 |
| code-millenials-34B | 70.7 | 74.4 | 64.6 | 76.2 |
| databricks/dbrx-instruct | 70.1 | 75 | 55.8 | 67.2 |
| XwinCoder-34B | 69.5 | 75.6 | 64.8 | 77 |
| WaveCoder-Ultra-6.7B | 69.5 | 75 | 63.5 | 74.9 |
| claude-3-haiku (Mar 2024) | 68.9 | 76.8 | 68.8 | 80.2 |
| OpenChat-3.5-7B-0106 | 67.7 | 72.6 | 54.5 | 63.8 |
| Magicoder-S-CL-7B | 67.7 | 70.7 | 60.1 | 70.6 |
| Phind-CodeLlama-34B-v2 | 67.1 | 71.3 | | |
| GPT-3.5 (May 2023) | 66.5 | 73.2 | | |
| WhiteRabbitNeo-33B-v1 | 65.9 | 72 | 66.9 | 79.4 |
| CodeLlama-70B-Instruct | 65.9 | 72 | | |
| speechless-coder-ds-6.7B | 65.9 | 71.3 | 64.4 | 75.9 |
| WizardCoder-Python-34B-V1.0 | 64.6 | 73.2 | 63.2 | 75.1 |
| claude-3-sonnet (Mar 2024) | 64 | 70.7 | 69.3 | 83.6 |
| speechless-starcoder2-15b | 62.8 | 67.1 | 62.4 | 73.5 |
| Mistral Large (Mar 2024) | 62.2 | 69.5 | 59.5 | 72.8 |
| claude-2 (Mar 2024) | 61.6 | 69.5 | | |
| Gemini Pro 1.5 | 61 | 68.3 | | |
| starcoder2-15b-instruct-v0.1 | 60.4 | 67.7 | 65.1 | 78 |
| DeepSeek-Coder-1.3B-instruct | 60.4 | 65.9 | 54.8 | 65.3 |
| Code-290k-6.7B-Instruct | 59.7 | 64.6 | | |
| Qwen1.5-72B-Chat | 59.1 | 68.3 | 61.6 | 72.5 |
| Phi-3-mini-4k-instruct | 59.1 | 64.6 | 54.2 | 65.9 |
| dolphin-2.6-mixtral-8x7b | 57.3 | 64 | 59 | 70.6 |
| Command-R+ | 56.7 | 64 | 63.5 | 74.3 |
| Llama3-8B-instruct | 56.7 | 61.6 | 59.3 | 70.1 |
| Gemini Pro 1.0 | 55.5 | 63.4 | 61.4 | 75.4 |
| Code-13B | 52.4 | 56.1 | | |
| codegemma-7b-it | 51.8 | 60.4 | 56.9 | 70.4 |
| speechless-starcoder2-7b | 51.8 | 56.1 | 56.3 | 66.7 |
| claude-instant-1 (Mar 2024) | 50.6 | 57.3 | | |
| WizardCoder-15B-V1.0 | 50.6 | 56.7 | 54.2 | 64.3 |
| CodeLlama-70B | 50.6 | 55.5 | | |
| speechless-coding-7B-16k-tora | 50.6 | 54.9 | 50.6 | 64.2 |
| Code-33B | 49.4 | 54.9 | | |
| OpenHermes-2.5-Code-290k-13B | 48.8 | 54.3 | 45.8 | 52.4 |
| CodeQwen1.5-7B | 45.7 | 51.8 | 60.8 | 73.5 |
| WizardCoder-Python-7B-V1.0 | 45.1 | 50.6 | 49.5 | 58.5 |
| phi-2-2.7B | 45.1 | 49.4 | 54.2 | 64 |
| DeepSeek-Coder-33B-base | 44.5 | 51.2 | | |
| CodeLlama-34B | 43.9 | 51.8 | 56.3 | 69.3 |
| Mistral-codealpaca-7B | 42.1 | 48.2 | | |
| MistralHermes-CodePro-7B-v1 | 42.1 | 47.6 | 46.4 | 57.4 |
| speechless-code-mistral-7B-v1.0 | 41.5 | 48.2 | 48.7 | 57.4 |
| codegemma-7b | 41.5 | 44.5 | 52.4 | 65.1 |
| DeepSeek-Coder-6.7B-base | 39.6 | 47.6 | 58.7 | 72 |
| Mixtral-8x7B-Instruct-v0.1 | 39.6 | 45.1 | 49.7 | 59.5 |
| CodeLlama-13B | 38.4 | 42.7 | 52.6 | 63.5 |
| StarCoder2-15B | 37.8 | 46.3 | | |
| SOLAR-10.7B-Instruct-v1.0 | 37.2 | 43.3 | 36.2 | 43.9 |
| Mistral-7B-Instruct-v0.2 | 36 | 42.1 | 37 | 44.7 |
| gemma-1.1-7b-it | 35.4 | 42.7 | 45 | 57.1 |
| CodeLlama-7B | 35.4 | 37.8 | 46.8 | 59.5 |
| xDAN-L1-Chat-RL-v1-7B | 32.9 | 40.2 | 41.3 | 50.3 |
| Python-Code-13B | 30.5 | 32.9 | | |
| StarCoder2-7B | 29.9 | 35.4 | | |
| StarCoder-15B | 29.3 | 34.1 | 46.1 | 55.1 |
| Llama3-8B-base | 29.3 | 33.5 | 51.6 | 61.4 |
| gemma-7b | 28.7 | 35.4 | 43.4 | 52.6 |
| CodeGen-16B | 28 | 32.9 | 45.5 | 54.2 |
| StarCoder2-3B | 27.4 | 31.7 | | |
| CodeT5+-16B | 26.8 | 31.7 | 47.1 | 56.6 |
| stable-code-3B | 25.6 | 29.3 | 45.8 | 54.8 |
| CodeGen-6B | 25.6 | 29.3 | 42.9 | 50.8 |
| DeepSeek-Coder-1.3B-base | 25.6 | 28.7 | 47.9 | 56.9 |
| gemma-7b-it | 25 | 28.7 | 36.8 | 47.1 |
| CodeT5+-6B | 24.4 | 29.3 | 41.5 | 52.9 |
| Mistral-7B | 23.8 | 28.7 | 42.1 | 51.9 |
| Zephyr β-7B | 23.2 | 30 | 34.7 | 42.1 |
| CodeGen-2B | 22.6 | 24.4 | 36 | 46.3 |
| CodeT5+-2B | 22 | 25 | 38.1 | 48.4 |
| StarCoderBase-7B | 21.3 | 24.4 | | |
| codegemma-2b | 20.7 | 26.8 | 46.6 | 55.6 |
| gemma-2b | 20.7 | 25 | 34.1 | 41.8 |
| gemma-1.1-2b-it | 17.7 | 22.6 | 23.3 | 29.8 |

πŸ“ Notes

  1. Samples are generated from scratch and are post-processed by our sanitizer script. We also run syntax checkers to avoid trivial syntactical errors.
  2. Models are ranked according to pass@1 using greedy decoding. Setup details can be found here.
  3. Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure the programming tasks are well-formed and unambiguous.

About

Why EvalPlus? What does using EvalPlus datasets bring to you?

  • ✨ Reliable ranking: See our leaderboard for the latest LLM ranking before and after rigorous evaluation.
  • ✨ Coding rigor: Look at the score differences before and after applying the EvalPlus tests! A smaller drop is better, as it means more rigor and less laxity in code generation, while a big drop means the generated code tends to be fragile.
  • ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models – no need to re-run the expensive benchmarks!

Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!

πŸ”₯ Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

Code generation

Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the code) and save the samples to samples.jsonl:

from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
πŸ€” Structure of `problem`? :: click to expand ::
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • canonical_solution is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • base_input is the test inputs from the original HumanEval
  • plus_input is the test inputs brought by EvalPlus
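The generation loop above can be made runnable without evalplus installed by stubbing the moving parts; here GEN_SOLUTION is a hypothetical placeholder and write_jsonl is a minimal local stand-in for evalplus.data.write_jsonl (not the library's actual code):

```python
import json

def write_jsonl(path, rows):
    # Minimal stand-in for evalplus.data.write_jsonl
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def GEN_SOLUTION(prompt):
    # Hypothetical stub; a real implementation would query an LLM
    return prompt + "    return 1\n"

# Toy problems dict mimicking the shape of get_human_eval_plus()
problems = {"HumanEval/0": {"prompt": "def f():\n", "entry_point": "f"}}

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("demo_samples.jsonl", samples)
```

Each line of the output file is one JSON object with a task_id and a self-contained solution, which is exactly the schema the evaluator expects.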

[!Note]

Expected Schema of samples.jsonl

  1. task_id: Task ID, matching the keys of get_[human_eval|mbpp]_plus()
  2. solution (optional): Self-contained solution (usually including the prompt)
    • Example: {"task_id": "HumanEval/?", "solution": "def f():\n return 1"}
  3. completion (optional): Function body without prompt
    • Example: {"task_id": "HumanEval/?", "completion": " return 1"}

Only one of solution and completion is required. If both are provided, solution will be used. We also accept solutions in the form of a directory, i.e., --samples ${SAMPLE_DIR}, where ${SAMPLE_DIR} is organized as ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py (${TASK_ID} = task_id.replace("/", "_")).
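The directory convention above can be sketched as a small conversion script. This is illustrative only: the file name 0.py stands in for the {SAMPLE_ID}.py placeholder, and the input/output paths are made up for the example:

```python
import json
import pathlib

def jsonl_to_dirs(jsonl_path, sample_dir):
    # Write each solution to ${SAMPLE_DIR}/${TASK_ID}/0.py,
    # with ${TASK_ID} = task_id.replace("/", "_")
    root = pathlib.Path(sample_dir)
    with open(jsonl_path) as f:
        for line in f:
            row = json.loads(line)
            task_dir = root / row["task_id"].replace("/", "_")
            task_dir.mkdir(parents=True, exist_ok=True)
            (task_dir / "0.py").write_text(row["solution"])

# Toy input file for demonstration
with open("dir_demo.jsonl", "w") as f:
    f.write(json.dumps({"task_id": "HumanEval/0",
                        "solution": "def f():\n    return 1\n"}) + "\n")
jsonl_to_dirs("dir_demo.jsonl", "dir_demo_samples")
```

After running this, the layout is dir_demo_samples/HumanEval_0/0.py, matching the ${SAMPLE_DIR}/${TASK_ID}/... pattern the evaluator accepts.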

Code evaluation

We strongly recommend using a sandbox such as Docker:

docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl

…Or, if you want to run it locally despite the risks ⚠️:

evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl

[!Warning]

Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where:

  • $T_{base}$ is the minimal timeout (configurable via --min-time-limit; defaults to 1s);
  • $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
  • $k$ is a configurable factor, --gt-time-limit-factor (defaults to 4).

If your machine is too slow and you are getting high-variance results, try a larger $k$ and $T_{base}$.
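The timeout rule can be sketched directly from the formula, with defaults taken from the flags above (the function name is ours, not EvalPlus's API):

```python
def per_task_timeout(gt_runtime, min_time_limit=1.0, gt_time_limit_factor=4.0):
    # T = max(T_base, T_gt * k): never below the minimal timeout,
    # and scaled with the profiled ground-truth runtime.
    return max(min_time_limit, gt_runtime * gt_time_limit_factor)

print(per_task_timeout(0.05))  # fast ground truth: the 1.0s floor applies
print(per_task_timeout(0.5))   # slow ground truth: 0.5 * 4 = 2.0s
```

So a slow machine inflates $T_{gt}$ (and hence $T$) automatically, but only up to the factor $k$; raising --min-time-limit covers tasks whose ground truth is fast yet whose scheduling jitter is high.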

Additionally, do NOT over-stress your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or running other heavy workloads during evaluation, are bad ideas…

πŸ€” Evaluate with local GitHub repo? :: click to expand ::
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: by default half of the cores
  • --base-only (store_true): only run base HumanEval tests
  • --i-just-wanna-run: force a re-run

The output should look like the following (a GPT-4 greedy-decoding example):

Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
  • Base is the pass@k for the original HumanEval
  • Base + Extra is the pass@k for our HumanEval+ (with extra tests)
  • “k” ranges over [1, 10, 100]; only k values no larger than the number of samples are used
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
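For reference, pass@k is conventionally computed with the unbiased estimator introduced alongside the original HumanEval; the sketch below is our illustration of that estimator, not EvalPlus's exact code:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k),
    # where n = total samples, c = correct samples, k = budget.
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is one sample per task, so pass@1 is
# simply the fraction of tasks whose single sample passes all tests.
print(pass_at_k(1, 1, 1))            # 1.0
print(round(pass_at_k(10, 3, 1), 4)) # 0.3
```

This also explains the “k values <= the sample size” rule: C(n, k) is undefined as a sampling experiment when k exceeds n.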
πŸ€” How long would it take? :: click to expand ::

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds. When running 200 samples x 164 tasks x ~700+ tests, it can take around 2–10 minutes using --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do NOT use --test-details if you just want pass@k quickly: --test-details runs all tests (700+ on average per task), while without it testing for a sample stops as soon as it fails its first test.
  • Use our pre-evaluated results (see LLM-generated code)
  • Use HumanEval+ Mini

[!Note]

πŸš€ Try out HumanEvalPlus-Mini! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the --mini flag; it can run 23+% faster! (Even faster if you evaluate all tests without fail-stop via --test-details.)

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini

πŸ’» LLM-generated code

We also share pre-generated code samples from LLMs we have evaluated:

Each sample file is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip it into a folder named ${model_name}_temp_${temperature} and run the evaluation from scratch with:

evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}

πŸ”¨ Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check the LLM-produced code and answer the following questions:

  1. Is generation complete for all samples / all problems in the dataset?
  2. Is the LLM-generated code compilable? (If not, something could be wrong and you should double-check.)
# Set PYTHONPATH to run local Python files
export PYTHONPATH=$PYTHONPATH:$(pwd)

python tools/checker.py --samples samples.jsonl --dataset [humaneval|mbpp]
# --samples can also be a directory organized as: ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py
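The compilability question can be approximated in a few lines with the standard library; this is an illustration of the idea, not the checker's actual implementation:

```python
import ast

def is_compilable(code):
    # A sample that does not even parse cannot pass any test.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_compilable("def f():\n    return 1\n"))  # True
print(is_compilable("def f(:\n"))                 # False
```

A dataset-wide sweep with such a predicate quickly surfaces systematic generation problems (e.g., truncated outputs or stray markdown fences) before you spend time on full evaluation.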

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them are easy to fix with simple post-processing. This tool makes the LLM-generated code cleaner and more compilable through post-processing such as trimming at extra EOF markers and removing garbage non-code tokens.

# Set PYTHONPATH to run local Python files
export PYTHONPATH=$PYTHONPATH:$(pwd)

# πŸ’‘ If you are storing code in a jsonl file:
python tools/sanitize.py --samples samples.jsonl --dataset [humaneval|mbpp]
# Sanitized code will be produced to `samples-sanitized.jsonl`

# πŸ’‘ If you are storing code in directories:
python tools/sanitize.py --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`

You should then further check the validity of the sanitized code with tools/checker.py. Sometimes (e.g., with chat models) there may be natural-language lines that break compilation. You can use --rm-prefix-lines to cut those NL lines sharing a prefix (e.g., --rm-prefix-lines "Here's").
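A minimal stand-in showing the effect of --rm-prefix-lines (our illustration, not the tool's actual code):

```python
def rm_prefix_lines(code, prefix):
    # Drop natural-language lines starting with the given prefix,
    # e.g. chat-model chatter like "Here's the solution:".
    return "\n".join(
        line for line in code.splitlines()
        if not line.startswith(prefix)
    )

raw = "Here's the solution:\ndef f():\n    return 1"
print(rm_prefix_lines(raw, "Here's"))
```

After removing the prefixed line, the remaining text parses as plain Python, which is what checker.py then verifies.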

Render pass@k results to rich and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`

Perform test input generation from scratch (TBD)

Name convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset augmented by evalplus.

πŸ“œ Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

πŸ™ Acknowledgement