---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
---

# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of pretrained and instruction-tuned generative code models in 150M and 400M parameter sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction-tuned TinyCodeLM models are optimized for Python code synthesis and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://arxiv.org/abs/2410.02749).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
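
The snippet below shows one way to load a TinyCodeLM checkpoint and sample a completion with the `transformers` library. It is a minimal sketch: the repository id and the generation settings are illustrative assumptions, not values pinned to this release.

```python
# Minimal sketch: load a TinyCodeLM checkpoint and greedily sample a completion.
# The repository id below is a hypothetical placeholder; substitute the id of
# the checkpoint you actually intend to use.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TinyCodeLM-400M"  # hypothetical id; check the Hub for the real one
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```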

**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters), each available in pretrained and edit sequence instruction-tuned variants.

**Input** Text only.

**Output** Models generate text and code. Instruction-tuned models generate code via sequences of "diffs"; a sketch of resolving such a sequence into a program appears below.
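
To make the "diffs" output format concrete, the sketch below resolves a sequence of line-insertion edits into a final program. The `(index, lines)` edit representation here is an illustrative stand-in; the serialization actually emitted by the instruction-tuned checkpoints may differ in its details.

```python
# Illustrative sketch only: resolve a sequence of line-insertion edits into a
# final program. The (index, lines) edit format is a stand-in for whichever
# diff serialization the instruction-tuned models actually emit.
Edit = tuple[int, list[str]]  # (insertion index, lines to insert)

def resolve_edits(edits: list[Edit]) -> str:
    """Start from an empty file and apply each insertion edit in order."""
    lines: list[str] = []
    for index, new_lines in edits:
        lines[index:index] = new_lines
    return "\n".join(lines)

edits = [
    (0, ["def add(a, b):"]),
    (1, ["    return a + b"]),
]
print(resolve_edits(edits))
```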

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.
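
For a rough sense of scale, the sketch below records GPT-2 small/medium-style dimensions, which models of roughly these parameter counts typically use. These numbers are assumptions inferred from the GPT-2 reference configurations, not the released TinyCodeLM hyperparameters, which may differ (for example in vocabulary size and in the OLMo-style architectural changes).

```python
# Assumed, GPT-2-style reference dimensions for ~150M and ~400M parameter models.
# Illustrative only; these are NOT the official TinyCodeLM configurations.
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int
    d_model: int
    n_heads: int

ASSUMED_CONFIGS = {
    "TinyCodeLM-150M (assumed)": TransformerConfig(n_layers=12, d_model=768, n_heads=12),
    "TinyCodeLM-400M (assumed)": TransformerConfig(n_layers=24, d_model=1024, n_heads=16),
}

for name, cfg in ASSUMED_CONFIGS.items():
    print(name, cfg)
```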

**Instruction Tuning Data** TinyCodeLMs are instruction-tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instructions and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
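
The LintSeq procedure itself is described in the linked paper; the sketch below is a heavily simplified illustration of the idea, not the authors' implementation: walk backwards from a finished program by deleting lines while keeping each intermediate version lint-clean, then reverse the chain and serialize consecutive diffs as the training edit sequence. The `lints_ok` check here is a stand-in that only tests whether the code still parses.

```python
# Heavily simplified sketch of the LintSeq idea, not the authors' implementation:
# delete lines from a program while keeping intermediate versions lint-clean,
# then reverse the chain and serialize consecutive unified diffs as edits.
import difflib
import random

def lints_ok(source: str) -> bool:
    """Stand-in lint check: here, just 'does the code still parse as Python'."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False

def sample_edit_sequence(program: str, rng: random.Random) -> list[str]:
    states = [program.splitlines()]
    # Backward pass: repeatedly delete one line such that the remainder lints.
    while states[-1]:
        lines = states[-1]
        candidates = [
            lines[:i] + lines[i + 1:]
            for i in range(len(lines))
            if lints_ok("\n".join(lines[:i] + lines[i + 1:]))
        ]
        if not candidates:
            break
        states.append(rng.choice(candidates))
    if states[-1]:
        states.append([])  # finish at the empty program
    # Forward pass: reverse the chain and emit consecutive diffs as text.
    states.reverse()
    edits = []
    for before, after in zip(states, states[1:]):
        diff = "\n".join(difflib.unified_diff(before, after, lineterm=""))
        if diff:
            edits.append(diff)
    return edits

program = "x = 1\ny = 2\nprint(x + y)\n"
for edit in sample_edit_sequence(program, random.Random(0)):
    print(edit, end="\n\n")
```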

# Benchmarks

**Pretrained (Temperature 0)**

| **Benchmark**       | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :------------------ | ------------------: | ------------------: |
| HumanEval, pass@1   |                 6.1 |                 6.7 |
| MBPP(+), pass@1     |                 5.4 |                 6.8 |

**Edit Sequence / Instruction Tuned (Temperature-Tuned)**

| **Benchmark**       | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :------------------ | ------------------: | ------------------: |
| HumanEval, pass@1   |                12.8 |                13.4 |
| HumanEval, pass@10  |                20.6 |                20.9 |
| MBPP(+), pass@1     |                13.6 |                24.4 |
| MBPP(+), pass@10    |                24.4 |                29.9 |
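
The pass@1 and pass@10 scores above are typically computed with the unbiased pass@k estimator of Chen et al. (2021); the sketch below shows that estimator. Whether this exact estimator and sample count were used for the numbers reported here is an assumption.

```python
# Unbiased pass@k estimator from Chen et al. (2021), the usual way HumanEval
# and MBPP scores are computed: given n samples per problem of which c pass
# the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled programs passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which pass.
print(round(pass_at_k(n=10, c=3, k=1), 3))   # 0.3
print(round(pass_at_k(n=10, c=3, k=10), 3))  # 1.0
```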

# Citation

```
@misc{piterbarg2024training,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```