## 1. Introduction
We provide a test script to evaluate the performance of the **deepseek-coder** model on the [**MBPP**](https://huggingface.co/datasets/mbpp) code generation benchmark in a 3-shot setting.
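
To make the 3-shot setting concrete, here is a minimal sketch of how such a prompt could be assembled: three solved demonstration tasks are placed before the target task, which is left open for the model to complete. The template wording, the `[BEGIN]`/`[DONE]` markers, the helper names, and the example tasks are illustrative assumptions, not necessarily what the evaluation script uses. The field names `text`, `code`, and `test_list` follow the MBPP dataset records.

```python
# Hypothetical few-shot prompt builder. The template wording and the
# [BEGIN]/[DONE] markers are assumptions made for illustration.

def format_example(task, include_solution=True):
    """Render one task; leave the solution open for the target task."""
    tests = "\n".join(task["test_list"])
    prompt = (
        f"Task: {task['text']}\n"
        f"Your code should pass these tests:\n{tests}\n"
        "[BEGIN]\n"
    )
    if include_solution:
        prompt += f"{task['code']}\n[DONE]\n"
    return prompt


def build_few_shot_prompt(shots, target):
    """Concatenate the solved demonstrations, then the unsolved target task."""
    parts = [format_example(shot) for shot in shots]
    parts.append(format_example(target, include_solution=False))
    return "\n".join(parts)


# Three made-up demonstration tasks (not taken from MBPP).
SHOTS = [
    {"text": "Write a function to add two numbers.",
     "code": "def add(a, b):\n    return a + b",
     "test_list": ["assert add(1, 2) == 3"]},
    {"text": "Write a function to square a number.",
     "code": "def square(x):\n    return x * x",
     "test_list": ["assert square(3) == 9"]},
    {"text": "Write a function to reverse a string.",
     "code": "def reverse(s):\n    return s[::-1]",
     "test_list": ["assert reverse('ab') == 'ba'"]},
]
TARGET = {"text": "Write a function to negate a number.",
          "code": "",
          "test_list": ["assert negate(2) == -2"]}

prompt = build_few_shot_prompt(SHOTS, TARGET)
print(prompt)
```

The prompt ends right after the open `[BEGIN]` marker, so the model's continuation is read as its solution to the target task.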
## 2. Setup
```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
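
Before running the evaluation, it can be useful to confirm that the packages are importable in the active environment. The snippet below is a small sketch of such a check; the helper name `missing_packages` is hypothetical. Note that PyTorch is imported under the name `torch`.

```python
import importlib.util

# Import names of the packages listed in the setup step.
# PyTorch is imported as "torch".
REQUIRED = ["accelerate", "attrdict", "transformers", "torch"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```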
## 3. Evaluation
We provide a sample script, **eval.sh**, that shows how to evaluate the **deepseek-coder-1.3b-base** model on the MBPP dataset using **8** GPUs.
```bash
# Model to evaluate (Hugging Face model ID or local path) and dataset directory
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
# Launch the evaluation through accelerate with the settings in test_config.yaml
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir "${MODEL_NAME_OR_PATH}" --dataroot "${DATASET_ROOT}"
```
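
For readers unfamiliar with how MBPP solutions are scored, the following sketch shows the basic idea: a generated solution counts as correct only if every assertion in the task's `test_list` passes. This is a simplified illustration under our own assumptions, not the logic of `eval_pal.py`; the helper `passes_tests` is hypothetical, and a real harness should run untrusted code in an isolated process with a timeout.

```python
def passes_tests(solution_code, test_list):
    """Return True if the solution defines code that satisfies every test.

    Simplified illustration: no sandboxing and no timeout, so this must
    not be used on untrusted model output as-is.
    """
    namespace = {}
    try:
        exec(solution_code, namespace)   # define the candidate function(s)
        for test in test_list:
            exec(test, namespace)        # each test is an `assert ...` line
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

print(passes_tests(good, tests))
print(passes_tests(bad, tests))
```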
## 4. Experimental Results
We report results for several models below. The maximum input length is **4096** tokens, the maximum output length is **500** tokens, and decoding uses the **greedy search strategy**.
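
The settings above can be sketched as follows. The returned dictionary uses keyword arguments accepted by the `generate()` method in `transformers`, where `do_sample=False` with `num_beams=1` selects greedy decoding. The helper names are hypothetical, and truncating over-long prompts from the left is an assumption on our part; the evaluation script may handle long inputs differently.

```python
MAX_INPUT_TOKENS = 4096   # maximum input length from the text above
MAX_NEW_TOKENS = 500      # maximum output length from the text above


def truncate_left(token_ids, max_len=MAX_INPUT_TOKENS):
    """Keep only the last `max_len` tokens (assumed truncation side)."""
    return token_ids[-max_len:]


def greedy_generation_kwargs():
    """Keyword arguments for `model.generate(...)` that select greedy decoding."""
    return {
        "max_new_tokens": MAX_NEW_TOKENS,
        "do_sample": False,   # no sampling
        "num_beams": 1,       # no beam search, so decoding is greedy
    }


# Example with a dummy token sequence that is longer than the limit.
dummy_ids = list(range(5000))
print(len(truncate_left(dummy_ids)))
print(greedy_generation_kwargs())
```

With a loaded model and tokenized prompt, these would be passed as `model.generate(**inputs, **greedy_generation_kwargs())`.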
#### (1) Multilingual Base Models
| Model               | Size | Pass@1    |
|---------------------|------|-----------|
| CodeShell           | 7B   | 38.6%     |
| CodeGeeX2           | 6B   | 36.2%     |
| StarCoder           | 16B  | 42.8%     |
| CodeLlama-Base      | 7B   | 38.6%     |
| CodeLlama-Base      | 13B  | 47.0%     |
| CodeLlama-Base      | 34B  | 55.0%     |
|                     |      |           |
| DeepSeek-Coder-Base | 1.3B | 46.8%     |
| DeepSeek-Coder-Base | 5.7B | 57.2%     |
| DeepSeek-Coder-Base | 6.7B | 60.6%     |
| DeepSeek-Coder-Base | 33B  | **66.0%** |
#### (2) Instruction-Tuned Models
| Model                   | Size | Pass@1    |
|-------------------------|------|-----------|
| GPT-3.5-Turbo           | -    | 70.8%     |
| GPT-4                   | -    | **80.0%** |
|                         |      |           |
| DeepSeek-Coder-Instruct | 1.3B | 49.4%     |
| DeepSeek-Coder-Instruct | 6.7B | 65.4%     |
| DeepSeek-Coder-Instruct | 33B  | **70.0%** |
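
With greedy decoding there is one completion per problem, so Pass@1 reduces to the fraction of problems whose single completion passes all of its tests. The sketch below illustrates that calculation; the function name and the per-problem outcomes are made up for the example and do not correspond to any row in the tables.

```python
def pass_at_1(results):
    """Fraction of problems solved, given one pass/fail outcome per problem."""
    if not results:
        raise ValueError("results must contain at least one outcome")
    return sum(results) / len(results)


# Made-up outcomes for four problems: three solved, one failed.
outcomes = [True, True, False, True]
print(f"Pass@1: {100 * pass_at_1(outcomes):.1f}%")  # prints "Pass@1: 75.0%"
```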