---
license: apache-2.0
language:
- en
- code
library_name: transformers
pipeline_tag: text-generation
tags:
- smallcoder
- code-llm
- code-generation
- sft
- pretraining
- tpu
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---
# 🧠 SmallCoder (303M)
**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.
This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.
Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B–7B parameter LLMs.
> Trained with support from **Google’s TPU Research Cloud (TRC)** program.
---
## 🚀 Key Results
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|:------|:----:|:------------------:|:--------------:|
| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
> ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**
---
## 🧬 Model Architecture
A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).
```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=49152,              # StarCoder BPE vocabulary
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,         # MHA: no grouped-query attention
    intermediate_size=3072,
    max_position_embeddings=1024,
)
```
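Instantiating this configuration (continuing the snippet above) and counting parameters is a quick sanity check of the reported size; the count below assumes untied input/output embeddings, which is the `LlamaConfig` default:
```python
from transformers import LlamaForCausalLM

model = LlamaForCausalLM(config)   # random init; architecture only
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # ~302M, in line with the reported ≈ 303 M
```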
| Parameter | Value |
| ----------------- | ------------------------------ |
| Total parameters | ≈ 303 M |
| Context length | 1 024 tokens |
| Tokenizer | `bigcode/starcoder` |
| Architecture type | LLaMA (MHA, non-GQA) |
| Precision | bfloat16 |
| Optimizer | AdamW XLA |
| Hardware | TPU v4-32 (TRC) |
---
## 📚 Training Curriculum (4 Stages, 29.8B tokens)
| Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
| **1. Linguistic Base** | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
| **2. Code Specialization** | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
| **3. Math & Knowledge** | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
| **4.1 SFT (EOS Fixed)** | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
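The stage mixes above can be sketched with 🤗 `datasets` interleaving. Below is a minimal, hypothetical illustration of the Stage 2 60/40 code mix; the dataset names come from this card, but the config/split arguments and streaming setup are assumptions, not the actual training pipeline:
```python
from itertools import islice
from datasets import load_dataset, interleave_datasets

# Illustrative Stage 2 mix: 60% synthetic code, 40% StarCoderData.
# The data_dir / split arguments below are assumptions for the sketch.
synthetic = load_dataset("nvidia/Nemotron-Pretraining-Code-v1", split="train", streaming=True)
starcoder = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)

mix = interleave_datasets(
    [synthetic, starcoder],
    probabilities=[0.6, 0.4],   # matches the 60/40 split in the table above
    seed=42,
)

for example in islice(mix, 3):
    print(list(example.keys()))
```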
---
## 📊 Detailed Benchmarks (Stage 4.1 SFT)
| Domain | Benchmark | Metric | Score |
| :-------------- | :------------------- | :----------- | :-----------: |
| **Code** | HumanEval (0-shot) | pass@1 | **27.4 %** |
| **Code** | MBPP (3-shot) | pass@1 | **31.0 %** |
| **Math** | GSM8k (0-shot) | exact match | **4.55 %** |
| **Knowledge** | Wikitext-2 | perplexity ↓ | **167.6** |
| **Reasoning** | ARC (Easy/Challenge) | acc norm | 34.6 / 22.8 % |
| **Commonsense** | HellaSwag | acc norm | 28.3 % |
> HumanEval and MBPP scores were computed with a manual evaluation pass (`max_new_tokens=512`, `temperature=0.2`) because the SFT prompt format gets truncated by `lm-eval`.
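For context, pass@k on HumanEval-style benchmarks is usually reported with the unbiased estimator from the original HumanEval paper; a minimal sketch (the sample counts below are illustrative, not the actual evaluation numbers):
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n sampled completions per task, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 200 completions sampled for a task, 55 pass the unit tests
print(round(pass_at_k(200, 55, 1), 3))  # 0.275
```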
---
## ⚠️ Known Limitations
1. **Code-Specialized Model**
Tuned for Python and algorithmic reasoning. Poor performance on general text, math, and commonsense tasks.
2. **Short Context**
   Trained on **1 024-token** sequences only; performance degrades on longer inputs (see the truncation sketch after this list).
3. **Tokenizer Bias**
Uses `bigcode/starcoder` BPE vocabulary — optimized for code, not prose.
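A minimal sketch of keeping prompts inside the 1 024-token window (the over-long prompt below is a made-up example):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")

# Hypothetical over-long prompt: repeat a line until it exceeds the context window.
long_prompt = "User: Explain this code.\n" + "x = x + 1\n" * 2000 + "Assistant:"

# Clamp to the 1 024-token training context (truncation drops tokens from the right by default).
inputs = tokenizer(long_prompt, return_tensors="pt", truncation=True, max_length=1024)
print(inputs["input_ids"].shape)  # (1, 1024)
```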
---
## 💻 Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# SmallCoder was fine-tuned on the "User:" / "Assistant:" dialogue format.
prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,   # avoids the missing-pad-token warning
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
💡 *Trained using the “User:” / “Assistant:” dialogue format.*
---
## 🧾 Citation
If you use **SmallCoder (303M)** in your research, please cite:
```
@misc{smallcoder303m,
title = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
author = {Da Silva, Ilan},
year = {2025},
url = {https://huggingface.co/Beebey/smallcoder-303m},
note = {Trained with Google TPU Research Cloud (TRC) support}
}
```
---
## 🙏 Acknowledgements
This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
Special thanks to the open datasets that enabled this work:
FineWeb, StarCoderData, Nemotron, and OpenWebMath.
---
## 🧩 Summary
| Category | Description |
| ------------------- | --------------------------- |
| **Type** | Code LLM (LLaMA-style) |
| **Parameters** | 303 M |
| **Training tokens** | ~29.8 B |
| **Specialty** | Code generation & reasoning |
| **Context window** | 1 024 tokens |
| **Tokenizer** | `bigcode/starcoder` |
| **License** | Apache 2.0 |
| **Hardware** | TPU v4 (TRC Program) |
---
> 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval — proving that *efficient, compact, open models* still matter.