---
license: apache-2.0
language:
- en
- code
library_name: transformers
pipeline_tag: text-generation
tags:
- smallcoder
- code-llm
- code-generation
- sft
- pretraining
- tpu
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---

# 🧠 SmallCoder (303M)

**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.

This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.

Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance among <500M models**, rivaling 1B-class LLMs and approaching much larger models on HumanEval.

> Trained with support from **Google’s TPU Research Cloud (TRC)** program.

---

## 🚀 Key Results

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|:------|:----:|:------------------:|:--------------:|
| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |

> ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**

---

## 🧬 Model Architecture

A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).

```python
LlamaConfig(
  vocab_size=49152,               # StarCoder tokenizer
  hidden_size=768,
  num_hidden_layers=24,
  num_attention_heads=8,
  num_key_value_heads=8,
  intermediate_size=3072,
  max_position_embeddings=1024,
)
```

| Parameter         | Value                          |
| ----------------- | ------------------------------ |
| Total parameters  | ≈ 303 M                        |
| Context length    | 1 024 tokens                   |
| Tokenizer         | `bigcode/starcoder`            |
| Architecture type | LLaMA (MHA, non-GQA)           |
| Precision         | bfloat16                       |
| Optimizer         | AdamW XLA                      | 
| Hardware          | TPU v4-32 (TRC)                 |
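
A quick way to sanity-check the parameter count is to instantiate the configuration above and count weights directly. The snippet below is a minimal sketch; CPU instantiation in default precision is enough for counting, even though training used bfloat16 on TPU:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Same configuration as shown above.
config = LlamaConfig(
    vocab_size=49152,               # StarCoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,          # equal to num_attention_heads, i.e. plain MHA (no GQA)
    intermediate_size=3072,
    max_position_embeddings=1024,
)

model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 302–303 M with untied embeddings
```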

---

## 📚 Training Curriculum (4 Stages, 29.8B tokens)

| Stage                      | Tokens (B) | Dataset                                              | Objective                        |    Loss ↓    |
| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
| **1. Linguistic Base**     |     6.3    | FineWeb-Edu                                          | General English grounding        | 10.87 → 2.58 |
| **2. Code Specialization** |     7.5    | 60 % Nemotron Synthetic Code / 40 % StarCoderData    | Code syntax & reasoning          |  5.00 → 1.25 |
| **3. Math & Knowledge**    |    10.0    | Nemotron CC-Math / FineWiki / OpenWebMath            | Mathematical reasoning           |  2.77 → 1.55 |
| **4.1 SFT (EOS Fixed)**    |     6.0    | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
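
For illustration, a probability-weighted mix such as the 60/40 split in Stage 2 can be assembled with `datasets.interleave_datasets`. The sketch below uses toy in-memory stand-ins for the real corpora, since the actual preprocessing and streaming pipeline is not published:

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the real corpora; the actual Stage 2 sources were
# nvidia/Nemotron-Pretraining-Code-v1 and bigcode/starcoderdata.
synthetic_code = Dataset.from_dict({"text": [f"# synthetic sample {i}" for i in range(1000)]})
starcoder_like = Dataset.from_dict({"text": [f"# starcoder sample {i}" for i in range(1000)]})

stage2_mix = interleave_datasets(
    [synthetic_code, starcoder_like],
    probabilities=[0.6, 0.4],          # 60 % synthetic code / 40 % StarCoderData
    seed=42,
    stopping_strategy="all_exhausted",
)

print(stage2_mix[0]["text"], stage2_mix[1]["text"])
```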

---

## 📊 Detailed Benchmarks (Stage 4.1 SFT)

| Domain          | Benchmark            | Metric       |     Score     |
| :-------------- | :------------------- | :----------- | :-----------: |
| **Code**        | HumanEval (0-shot)   | pass@1       |   **27.4 %**  |
| **Code**        | MBPP (3-shot)        | pass@1       |   **31.0 %**  |
| **Math**        | GSM8k (0-shot)       | exact match  |   **4.55 %**  |
| **Knowledge**   | Wikitext-2           | perplexity ↓ |   **167.6**   |
| **Reasoning**   | ARC (Easy/Challenge) | acc norm     | 34.6 / 22.8 % |
| **Commonsense** | HellaSwag            | acc norm     |     28.3 %    |

> HumanEval/MBPP were scored with a manual evaluation (`max_new_tokens=512`, `temp=0.2`) due to SFT-format truncation issues in `lm-eval`.
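
For reference, the sketch below shows the kind of single-sample (pass@1) generation loop used for such a manual evaluation. The exact prompt wrapper and the sandboxed execution of generated code against each task's unit tests are not published, so those parts are assumptions:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

problems = load_dataset("openai_humaneval", split="test")

def complete(problem_prompt: str) -> str:
    # Sampling settings from the card: max_new_tokens=512, temperature=0.2.
    # The "User:/Assistant:" wrapper is an assumption about the harness.
    text = f"User: Complete the following Python function.\n{problem_prompt}\nAssistant:"
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.2,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# pass@1 = one completion per task; each completion must then be executed
# against the task's unit tests in a sandbox (not shown here).
print(complete(problems[0]["prompt"]))
```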

---

## ⚠️ Known Limitations

1. **Code-Specialized Model**
   Tuned primarily for Python and algorithmic reasoning; performance on general text, math, and commonsense tasks is comparatively weak.

2. **Short Context**
   Trained on **1 024-token** sequences only; performance degrades on longer inputs (see the truncation sketch after this list).

3. **Tokenizer Bias**
   Uses `bigcode/starcoder` BPE vocabulary — optimized for code, not prose.
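
Because the model never saw sequences longer than 1 024 tokens, long prompts should be truncated before generation. A minimal sketch follows; the 512-token reservation for the completion is an illustrative choice, not something specified by this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")

# Deliberately long input to illustrate truncation to the training context.
long_prompt = "# utility functions\n" + "def helper():\n    pass\n\n" * 500

max_context, max_new_tokens = 1024, 512
encoded = tokenizer(
    long_prompt,
    truncation=True,
    max_length=max_context - max_new_tokens,  # leave room for generated tokens
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # at most (1, 512)
```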

---

## 💻 Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

💡 *Trained using the “User:” / “Assistant:” dialogue format.*
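
Since the SFT data follows this two-turn format, a small helper that wraps the user request and strips the prompt tokens from the output can be convenient. The function below is an illustrative sketch built on top of the model and tokenizer loaded above; the helper name and defaults are not part of the card:

```python
def chat(user_message: str, max_new_tokens: int = 512) -> str:
    """Wrap a request in the User:/Assistant: format and return only the reply."""
    prompt = f"User: {user_message}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Skip the prompt tokens and decode only the newly generated part.
    reply_ids = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True).strip()

print(chat("Write a Python function that reverses a string."))
```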

---

## 🧾 Citation

If you use **SmallCoder (303M)** in your research, please cite:

```
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}
```

---

## 🙏 Acknowledgements

This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
Special thanks to the open datasets that enabled this work:
FineWeb, StarCoderData, Nemotron, and OpenWebMath.

---

## 🧩 Summary

| Category            | Description                 |
| ------------------- | --------------------------- |
| **Type**            | Code LLM (LLaMA-style)      |
| **Parameters**      | 303 M                       |
| **Training tokens** | ~29.8 B                     |
| **Specialty**       | Code generation & reasoning |
| **Context window**  | 1 024 tokens                |
| **Tokenizer**       | `bigcode/starcoder`         |
| **License**         | Apache 2.0                  |
| **Hardware**        | TPU v4 (TRC Program)        |

---

> 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval — proving that *efficient, compact, open models* still matter.
