---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- smallcoder
- code-llm
- sft
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---
# SmallCoder (303M)
SmallCoder is a **303 Million parameter** Large Language Model (LLM) trained from scratch, specializing in code generation and algorithmic reasoning.
This checkpoint is the result of a 6 Billion token Supervised Fine-Tuning (SFT) run, which **fixed a critical End-of-Sequence (EOS) token bug** present in previous versions.
This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming several models above 1B parameters and approaching models 23x its size.
**Trained with support from Google's TPU Research Cloud (TRC) program.**
## 🚀 Key Performance (Benchmarks)
The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
| :--- | :---: | :---: | :---: |
| **SmallCoder (S4.1)** | **303M** | **27.4%** | **31.0%** |
| TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
| MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
| Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
| Mistral-7B Base | 7B | 30.5% | 47.5% |
SmallCoder (303M) nearly achieves **parity with Mistral 7B** on HumanEval while being **23x smaller**.
## 🧠 Model Architecture
This model uses a Llama-type architecture (MHA) with 303M parameters.
* **Architecture**: LlamaForCausalLM (MHA)
* **Hidden Size**: 768
* **Layers**: 24
* **Attention Heads**: 8
* **KV Heads**: 8 (Standard MHA)
* **Vocab Size**: 49152 (Tokenizer: `bigcode/starcoder`)
* **Max Context**: 1024 tokens
```python
LlamaConfig(
vocab_size=49152,
hidden_size=768,
num_hidden_layers=24,
intermediate_size=3072,
num_attention_heads=8,
num_key_value_heads=8,
max_position_embeddings=1024,
...
)
```
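As a sanity check, the ~303M figure can be reproduced from the config above. A rough sketch, assuming untied input/output embeddings, bias-free projections, and a LLaMA-style SwiGLU MLP (implementation details not confirmed by this card):

```python
# Rough parameter count for the LlamaConfig above.
# Assumptions (not stated in the card): untied embeddings, no biases,
# SwiGLU MLP (gate/up/down), two RMSNorms per layer plus a final norm.
V, H, L, I = 49152, 768, 24, 3072

embed = V * H                # token embedding table
lm_head = V * H              # output projection (untied)
attn = 4 * H * H             # q, k, v, o projections (MHA, 8 heads = 8 KV heads)
mlp = 3 * H * I              # gate, up, down projections
norms = 2 * H                # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + lm_head + L * per_layer + H  # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")      # prints 302.0M parameters
```

The result (~302M) lands within rounding distance of the advertised 303M, so the config is internally consistent.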
## 🛠️ Training Plan (4 Stages)
This model is the result of a multi-stage training curriculum totaling **29.8 Billion tokens**.
### Stage 1: Linguistic Base (Completed)
* **Tokens**: 6.3B
* **Dataset**: `FineWeb-Edu`
* **Objective**: Learn natural language.
* **Loss**: 10.87 → **2.58**
### Stage 2: Code Specialization (Completed)
* **Tokens**: 7.5B
* **Dataset**: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
* **Objective**: Learn code syntax and reasoning.
* **Loss**: 5.00 → **1.25**
### Stage 3: Math & Knowledge (Completed)
* **Tokens**: 10B
* **Dataset**: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
* **Objective**: Learn mathematical reasoning.
* **Loss**: 2.77 → **1.55**
* **Result**: A solid base model (Wikitext PPL: 35.4).
### Stage 4.1: SFT (EOS-Fixed) (Completed)
* **Tokens**: 6B
* **Starting Checkpoint**: `stage-3/`
* **Dataset**: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
* **Objective**: Align on code instructions and fix the EOS generation bug.
* **Loss**: 1.73 → **~0.70** (low point)
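The stage mixtures above amount to weighted sampling over data sources. A minimal illustration using the Stage 4.1 ratios (the names are labels taken from the list above, not actual dataset-loading code):

```python
import random

# Stage 4.1 mixture weights as listed in this card.
mixture = {
    "Nemotron-SFT-Code": 0.45,
    "OpenCodeInstruct": 0.30,
    "OpenMathInstruct-2": 0.15,
    "Nemotron-SFT-General": 0.10,
}

def sample_source(rng=random):
    """Pick the data source for the next training example per the weights."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]
```

Over the 6B-token run, each source contributes tokens in proportion to its weight in expectation.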
-----
## 📊 Detailed Benchmarks (Stage 4.1)
The SFT (Code) scores are excellent, while the generalist scores (Math, Reasoning) are low, indicating that SFT heavily specialized the model into a "code specialist".
| Task | Benchmark | n-shot | Metric | Score |
| :--- | :--- | :---: | :--- | :---: |
| **Code** | **HumanEval** | 0 | **pass@1** | **27.4%** |
| **Code** | **MBPP** | 3 | **pass@1** | **31.0%** |
| **Math** | **GSM8k** | 0 | exact_match | **4.55%** |
| **General** | **Wikitext** | 0 | word_perplexity | 167.6 |
| **Reasoning** | **ARC Easy** | 0 | acc\_norm | 34.6% |
| **Reasoning** | **ARC Challenge** | 0 | acc\_norm | 22.8% |
| **Commonsense** | **HellaSwag** | 0 | acc\_norm | 28.3% |
*`humaneval`/`mbpp` scores are based on manual analysis (`max_gen_toks=512`), as official `lm-eval` benchmarks fail to evaluate this model due to SFT formatting and truncation issues.*
## ⚠️ Known Limitations
1. **Code Specialist:** Heavily optimized for code (27.4% HEval) at the expense of other skills. Performance on math (`gsm8k` 4.55%) and general knowledge (PPL 167) is low. **This is a code specialist model, not a generalist.**
2. **Limited Context:** This model was trained exclusively on a sequence length of **1024 tokens**. It cannot handle longer prompts.
## ⚡ How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "Beebey/smallcoder-303m"
device = "cuda" # or "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16
).to(device)
# Note the 'User:' and 'Assistant:' formatting
prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generation
# The model was trained to use tokenizer.eos_token_id
# It should stop automatically.
outputs = model.generate(
**inputs,
max_new_tokens=512,
pad_token_id=tokenizer.eos_token_id,
eos_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
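Because the chat format is plain `User:`/`Assistant:` text, the decoded output still contains the echoed prompt. A hypothetical post-processing helper (not part of this card's API) can isolate the reply:

```python
def extract_reply(decoded: str) -> str:
    """Return only the assistant's reply from the decoded generation,
    assuming the plain 'User: ... / Assistant: ...' format shown above."""
    reply = decoded.split("Assistant:", 1)[-1]
    # Guard against the model opening a new "User:" turn before emitting EOS.
    return reply.split("\nUser:", 1)[0].strip()
```

For the example above, `extract_reply(response)` strips the echoed prompt and any spurious follow-up turn, leaving only the generated code.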
## Acknowledgements
### Trained with the Google TRC
This model was trained with support from Google's **TPU Research Cloud (TRC)** program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.