EvalPlus

EvalPlus β€’ June 2, 2024

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • ✨ HumanEval+: 80x more tests than the original HumanEval!
  • ✨ MBPP+: 35x more tests than the original MBPP!
  • ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

File a request to add your models to our leaderboard!

πŸ”₯ Quick Start β€’ πŸ’» LLM-generated code β€’ πŸ”¨ Useful tools

| Model (pass@1) πŸ† | HumanEval+ | HumanEval | MBPP+ | MBPP |
|---|---:|---:|---:|---:|
| GPT-4-Turbo (April 2024) | 86.6 | 90.2 | | |
| DeepSeek-Coder-V2-Instruct | 82.3 | 85.4 | 75.1 | 89.4 |
| GPT-4-Turbo (Nov 2023) | 81.7 | 85.4 | 73.3 | 85.7 |
| GPT-4 (May 2023) | 79.3 | 88.4 | | |
| CodeQwen1.5-7B-Chat | 78.7 | 83.5 | 69 | 79.4 |
| claude-3-opus (Mar 2024) | 77.4 | 82.9 | 73.3 | 89.4 |
| DeepSeek-Coder-33B-instruct | 75 | 81.1 | 70.1 | 80.4 |
| OpenCodeInterpreter-DS-33B | 73.8 | 79.3 | 68.5 | 80.2 |
| WizardCoder-33B-V1.1 | 73.2 | 79.9 | | |
| Artigenz-Coder-DS-6.7B | 72.6 | 75.6 | 69.6 | 80.7 |
| Llama3-70B-instruct | 72 | 77.4 | 69 | 82.3 |
| OpenCodeInterpreter-DS-6.7B | 72 | 77.4 | 66.4 | 76.5 |
| speechless-codellama-34B-v2.0 | 72 | 77.4 | 61.4 | 73.8 |
| Mixtral-8x22B-Instruct-v0.1 | 72 | 76.2 | 64.3 | 73.8 |
| Magicoder-S-DS-6.7B | 71.3 | 76.8 | 69 | 79.4 |
| DeepSeek-Coder-7B-instruct-v1.5 | 71.3 | 75.6 | 62.2 | 75.2 |
| DeepSeek-Coder-6.7B-instruct | 71.3 | 74.4 | 65.6 | 74.9 |
| starchat2-15b-v0.1 | 71.3 | 73.8 | 64.6 | 74.9 |
| GPT-3.5-Turbo (Nov 2023) | 70.7 | 76.8 | 69.7 | 82.5 |
| code-millenials-34B | 70.7 | 74.4 | 64.6 | 76.2 |
| databricks/dbrx-instruct | 70.1 | 75 | 55.8 | 67.2 |
| XwinCoder-34B | 69.5 | 75.6 | 64.8 | 77 |
| WaveCoder-Ultra-6.7B | 69.5 | 75 | 63.5 | 74.9 |
| claude-3-haiku (Mar 2024) | 68.9 | 76.8 | 68.8 | 80.2 |
| OpenChat-3.5-7B-0106 | 67.7 | 72.6 | 54.5 | 63.8 |
| Magicoder-S-CL-7B | 67.7 | 70.7 | 60.1 | 70.6 |
| Phind-CodeLlama-34B-v2 | 67.1 | 71.3 | | |
| GPT-3.5 (May 2023) | 66.5 | 73.2 | | |
| WhiteRabbitNeo-33B-v1 | 65.9 | 72 | 66.9 | 79.4 |
| CodeLlama-70B-Instruct | 65.9 | 72 | | |
| speechless-coder-ds-6.7B | 65.9 | 71.3 | 64.4 | 75.9 |
| WizardCoder-Python-34B-V1.0 | 64.6 | 73.2 | 63.2 | 75.1 |
| claude-3-sonnet (Mar 2024) | 64 | 70.7 | 69.3 | 83.6 |
| speechless-starcoder2-15b | 62.8 | 67.1 | 62.4 | 73.5 |
| Mistral Large (Mar 2024) | 62.2 | 69.5 | 59.5 | 72.8 |
| claude-2 (Mar 2024) | 61.6 | 69.5 | | |
| Gemini Pro 1.5 | 61 | 68.3 | | |
| starcoder2-15b-instruct-v0.1 | 60.4 | 67.7 | 65.1 | 78 |
| DeepSeek-Coder-1.3B-instruct | 60.4 | 65.9 | 54.8 | 65.3 |
| Code-290k-6.7B-Instruct | 59.7 | 64.6 | | |
| Qwen1.5-72B-Chat | 59.1 | 68.3 | 61.6 | 72.5 |
| Phi-3-mini-4k-instruct | 59.1 | 64.6 | 54.2 | 65.9 |
| dolphin-2.6-mixtral-8x7b | 57.3 | 64 | 59 | 70.6 |
| Command-R+ | 56.7 | 64 | 63.5 | 74.3 |
| Llama3-8B-instruct | 56.7 | 61.6 | 59.3 | 70.1 |
| Gemini Pro 1.0 | 55.5 | 63.4 | 61.4 | 75.4 |
| Code-13B | 52.4 | 56.1 | | |
| codegemma-7b-it | 51.8 | 60.4 | 56.9 | 70.4 |
| speechless-starcoder2-7b | 51.8 | 56.1 | 56.3 | 66.7 |
| claude-instant-1 (Mar 2024) | 50.6 | 57.3 | | |
| WizardCoder-15B-V1.0 | 50.6 | 56.7 | 54.2 | 64.3 |
| CodeLlama-70B | 50.6 | 55.5 | | |
| speechless-coding-7B-16k-tora | 50.6 | 54.9 | 50.6 | 64.2 |
| Code-33B | 49.4 | 54.9 | | |
| OpenHermes-2.5-Code-290k-13B | 48.8 | 54.3 | 45.8 | 52.4 |
| CodeQwen1.5-7B | 45.7 | 51.8 | 60.8 | 73.5 |
| WizardCoder-Python-7B-V1.0 | 45.1 | 50.6 | 49.5 | 58.5 |
| phi-2-2.7B | 45.1 | 49.4 | 54.2 | 64 |
| DeepSeek-Coder-33B-base | 44.5 | 51.2 | | |
| CodeLlama-34B | 43.9 | 51.8 | 56.3 | 69.3 |
| Mistral-codealpaca-7B | 42.1 | 48.2 | | |
| MistralHermes-CodePro-7B-v1 | 42.1 | 47.6 | 46.4 | 57.4 |
| speechless-code-mistral-7B-v1.0 | 41.5 | 48.2 | 48.7 | 57.4 |
| codegemma-7b | 41.5 | 44.5 | 52.4 | 65.1 |
| DeepSeek-Coder-6.7B-base | 39.6 | 47.6 | 58.7 | 72 |
| Mixtral-8x7B-Instruct-v0.1 | 39.6 | 45.1 | 49.7 | 59.5 |
| CodeLlama-13B | 38.4 | 42.7 | 52.6 | 63.5 |
| StarCoder2-15B | 37.8 | 46.3 | | |
| SOLAR-10.7B-Instruct-v1.0 | 37.2 | 43.3 | 36.2 | 43.9 |
| Mistral-7B-Instruct-v0.2 | 36 | 42.1 | 37 | 44.7 |
| gemma-1.1-7b-it | 35.4 | 42.7 | 45 | 57.1 |
| CodeLlama-7B | 35.4 | 37.8 | 46.8 | 59.5 |
| xDAN-L1-Chat-RL-v1-7B | 32.9 | 40.2 | 41.3 | 50.3 |
| Python-Code-13B | 30.5 | 32.9 | | |
| StarCoder2-7B | 29.9 | 35.4 | | |
| StarCoder-15B | 29.3 | 34.1 | 46.1 | 55.1 |
| Llama3-8B-base | 29.3 | 33.5 | 51.6 | 61.4 |
| gemma-7b | 28.7 | 35.4 | 43.4 | 52.6 |
| CodeGen-16B | 28 | 32.9 | 45.5 | 54.2 |
| StarCoder2-3B | 27.4 | 31.7 | | |
| CodeT5+-16B | 26.8 | 31.7 | 47.1 | 56.6 |
| stable-code-3B | 25.6 | 29.3 | 45.8 | 54.8 |
| CodeGen-6B | 25.6 | 29.3 | 42.9 | 50.8 |
| DeepSeek-Coder-1.3B-base | 25.6 | 28.7 | 47.9 | 56.9 |
| gemma-7b-it | 25 | 28.7 | 36.8 | 47.1 |
| CodeT5+-6B | 24.4 | 29.3 | 41.5 | 52.9 |
| Mistral-7B | 23.8 | 28.7 | 42.1 | 51.9 |
| Zephyr β-7B | 23.2 | 30 | 34.7 | 42.1 |
| CodeGen-2B | 22.6 | 24.4 | 36 | 46.3 |
| CodeT5+-2B | 22 | 25 | 38.1 | 48.4 |
| StarCoderBase-7B | 21.3 | 24.4 | | |
| codegemma-2b | 20.7 | 26.8 | 46.6 | 55.6 |
| gemma-2b | 20.7 | 25 | 34.1 | 41.8 |
| gemma-1.1-2b-it | 17.7 | 22.6 | 23.3 | 29.8 |

πŸ“ Notes

  1. Samples are generated from scratch and are post-processed by our sanitizer script. We also run syntax checkers to avoid trivial syntactical errors.
  2. Models are ranked according to pass@1 using greedy decoding. Setup details can be found here.
  3. Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure the programming tasks are well-formed and unambiguous.

About

Why EvalPlus? What does using EvalPlus datasets bring to you?

  • ✨ Reliable ranking: See our leaderboard for the latest LLM ranking before and after rigorous evaluation.
  • ✨ Coding rigor: Look at the score differences before and after applying the EvalPlus tests! A smaller drop is better, as it means more rigor and less laxity in code generation, while a big drop means the generated code tends to be fragile.
  • ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models – no need to re-run the expensive benchmarks!

Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!

πŸ”₯ Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

Code generation

Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the code) and save the samples to samples.jsonl:

from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
πŸ€” Structure of `problem`? :: click to expand ::
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • canonical_solution is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • base_input is the test inputs from the original HumanEval
  • plus_input is the test inputs brought by EvalPlus
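The generation loop above can be made runnable without evalplus installed by stubbing the moving parts; here GEN_SOLUTION is a hypothetical placeholder and write_jsonl is a minimal local stand-in for evalplus.data.write_jsonl (not the library's actual code):

```python
import json

def write_jsonl(path, rows):
    # Minimal stand-in for evalplus.data.write_jsonl
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def GEN_SOLUTION(prompt):
    # Hypothetical stub; a real implementation would query an LLM
    return prompt + "    return 1\n"

# Toy problems dict mimicking the shape of get_human_eval_plus()
problems = {"HumanEval/0": {"prompt": "def f():\n", "entry_point": "f"}}

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("demo_samples.jsonl", samples)
```

Each line of the output file is one JSON object with a task_id and a self-contained solution, which is exactly the schema the evaluator expects.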

[!Note]

Expected Schema of samples.jsonl

  1. task_id: Task ID, matching the keys of get_[human_eval|mbpp]_plus()
  2. solution (optional): Self-contained solution (usually including the prompt)
    • Example: {"task_id": "HumanEval/?", "solution": "def f():\n return 1"}
  3. completion (optional): Function body without prompt
    • Example: {"task_id": "HumanEval/?", "completion": " return 1"}

Only one of solution and completion is required. If both are provided, solution will be used. We also accept solutions in the form of a directory, i.e., --samples ${SAMPLE_DIR}, where ${SAMPLE_DIR} is organized as ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py (${TASK_ID} = task_id.replace("/", "_")).
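The directory convention above can be sketched as a small conversion script. This is illustrative only: the file name 0.py stands in for the {SAMPLE_ID}.py placeholder, and the input/output paths are made up for the example:

```python
import json
import pathlib

def jsonl_to_dirs(jsonl_path, sample_dir):
    # Write each solution to ${SAMPLE_DIR}/${TASK_ID}/0.py,
    # with ${TASK_ID} = task_id.replace("/", "_")
    root = pathlib.Path(sample_dir)
    with open(jsonl_path) as f:
        for line in f:
            row = json.loads(line)
            task_dir = root / row["task_id"].replace("/", "_")
            task_dir.mkdir(parents=True, exist_ok=True)
            (task_dir / "0.py").write_text(row["solution"])

# Toy input file for demonstration
with open("dir_demo.jsonl", "w") as f:
    f.write(json.dumps({"task_id": "HumanEval/0",
                        "solution": "def f():\n    return 1\n"}) + "\n")
jsonl_to_dirs("dir_demo.jsonl", "dir_demo_samples")
```

After running this, the layout is dir_demo_samples/HumanEval_0/0.py, matching the ${SAMPLE_DIR}/${TASK_ID}/... pattern the evaluator accepts.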

Code evaluation

We strongly recommend using a sandbox such as Docker:

docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl

…Or, if you want to run it locally despite the risks ⚠️:

evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl

[!Warning]

Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where:

  • $T_{base}$ is the minimal timeout (configurable via --min-time-limit; defaults to 1s);
  • $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
  • $k$ is a configurable factor, --gt-time-limit-factor (defaults to 4).

If your machine is too slow and you are getting high-variance results, try a larger $k$ and $T_{base}$.
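The timeout rule can be sketched directly from the formula, with defaults taken from the flags above (the function name is ours, not EvalPlus's API):

```python
def per_task_timeout(gt_runtime, min_time_limit=1.0, gt_time_limit_factor=4.0):
    # T = max(T_base, T_gt * k): never below the minimal timeout,
    # and scaled with the profiled ground-truth runtime.
    return max(min_time_limit, gt_runtime * gt_time_limit_factor)

print(per_task_timeout(0.05))  # fast ground truth: the 1.0s floor applies
print(per_task_timeout(0.5))   # slow ground truth: 0.5 * 4 = 2.0s
```

So a slow machine inflates $T_{gt}$ (and hence $T$) automatically, but only up to the factor $k$; raising --min-time-limit covers tasks whose ground truth is fast yet whose scheduling jitter is high.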

Additionally, do NOT over-stress your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or running other heavy workloads during evaluation, are bad ideas…

πŸ€” Evaluate with local GitHub repo? :: click to expand ::
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: by default half of the cores
  • --base-only (store_true): only run base HumanEval tests
  • --i-just-wanna-run: force a re-run

The output should look like the following (a GPT-4 greedy-decoding example):

Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
  • Base is the pass@k for the original HumanEval
  • Base + Extra is the pass@k for our HumanEval+ (with extra tests)
  • “k” ranges over [1, 10, 100]; only k values no larger than the number of samples are used
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
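For reference, pass@k is conventionally computed with the unbiased estimator introduced alongside the original HumanEval; the sketch below is our illustration of that estimator, not EvalPlus's exact code:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k),
    # where n = total samples, c = correct samples, k = budget.
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is one sample per task, so pass@1 is
# simply the fraction of tasks whose single sample passes all tests.
print(pass_at_k(1, 1, 1))            # 1.0
print(round(pass_at_k(10, 3, 1), 4)) # 0.3
```

This also explains the “k values <= the sample size” rule: C(n, k) is undefined as a sampling experiment when k exceeds n.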
πŸ€” How long would it take? :: click to expand ::

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds. When running 200 samples x 164 tasks x ~700+ tests, it can take around 2–10 minutes using --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do NOT use --test-details if you just want pass@k quickly: --test-details runs all tests (700+ on average per task), while without it testing for a sample stops as soon as it fails its first test.
  • Use our pre-evaluated results (see LLM-generated code)
  • Use HumanEval+ Mini

[!Note]

πŸš€ Try out HumanEvalPlus-Mini! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the --mini flag; it can run 23+% faster! (Even faster if you evaluate all tests without fail-stop via --test-details.)

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini

πŸ’» LLM-generated code

We also share pre-generated code samples from LLMs we have evaluated:

Each sample file is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip it into a folder named ${model_name}_temp_${temperature} and run the evaluation from scratch with:

evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}

πŸ”¨ Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check the LLM-produced code and answer the following questions:

  1. Is generation complete for all samples / all problems in the dataset?
  2. Is the LLM-generated code compilable? (If not, something could be wrong and you should double-check.)
# Set PYTHONPATH to run local Python files
export PYTHONPATH=$PYTHONPATH:$(pwd)

python tools/checker.py --samples samples.jsonl --dataset [humaneval|mbpp]
# --samples can also be a directory organized as: ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py
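The compilability question can be approximated in a few lines with the standard library; this is an illustration of the idea, not the checker's actual implementation:

```python
import ast

def is_compilable(code):
    # A sample that does not even parse cannot pass any test.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_compilable("def f():\n    return 1\n"))  # True
print(is_compilable("def f(:\n"))                 # False
```

A dataset-wide sweep with such a predicate quickly surfaces systematic generation problems (e.g., truncated outputs or stray markdown fences) before you spend time on full evaluation.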

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them are easy to fix with simple post-processing. This tool makes the LLM-generated code cleaner and more compilable through post-processing such as trimming at extra EOF markers and removing garbage non-code tokens.

# Set PYTHONPATH to run local Python files
export PYTHONPATH=$PYTHONPATH:$(pwd)

# πŸ’‘ If you are storing code in a jsonl file:
python tools/sanitize.py --samples samples.jsonl --dataset [humaneval|mbpp]
# Sanitized code will be produced to `samples-sanitized.jsonl`

# πŸ’‘ If you are storing code in directories:
python tools/sanitize.py --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`

You should then further check the validity of the sanitized code with tools/checker.py. Sometimes (e.g., with chat models) there may be natural-language lines that break compilation. You can use --rm-prefix-lines to cut those NL lines sharing a prefix (e.g., --rm-prefix-lines "Here's").
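A minimal stand-in showing the effect of --rm-prefix-lines (our illustration, not the tool's actual code):

```python
def rm_prefix_lines(code, prefix):
    # Drop natural-language lines starting with the given prefix,
    # e.g. chat-model chatter like "Here's the solution:".
    return "\n".join(
        line for line in code.splitlines()
        if not line.startswith(prefix)
    )

raw = "Here's the solution:\ndef f():\n    return 1"
print(rm_prefix_lines(raw, "Here's"))
```

After removing the prefixed line, the remaining text parses as plain Python, which is what checker.py then verifies.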

Render pass@k results to rich and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`

Perform test input generation from scratch (TBD)

Name convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset augmented by evalplus.

πŸ“œ Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

πŸ™ Acknowledgement