---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- smallcoder
- code-llm
- sft
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---

# SmallCoder (303M)

SmallCoder is a **303-million-parameter** language model trained from scratch, specializing in code generation and algorithmic reasoning.

This checkpoint is the result of a 6-billion-token Supervised Fine-Tuning (SFT) run, which **fixed a critical End-of-Sequence (EOS) token bug** present in previous versions.

This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming several models above 1B parameters and approaching the scores of models 23x its size.

**Trained with support from Google's TPU Research Cloud (TRC) program.**

## 🚀 Key Performance (Benchmarks)

The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
| :--- | :---: | :---: | :---: |
| **SmallCoder (S4.1)** | **303M** | **27.4%** | **31.0%** |
| TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
| MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
| Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
| Mistral-7B Base | 7B | 30.5% | 47.5% |

SmallCoder (303M) nearly achieves **parity with Mistral 7B** on HumanEval while being **23x smaller**.

## 🧠 Model Architecture

This model uses a Llama-type architecture (MHA) with 303M parameters.

* **Architecture**: LlamaForCausalLM (MHA)
* **Hidden Size**: 768
* **Layers**: 24
* **Attention Heads**: 8
* **KV Heads**: 8 (Standard MHA)
* **Vocab Size**: 49152 (Tokenizer: `bigcode/starcoder`)
* **Max Context**: 1024 tokens

```python
LlamaConfig(
  vocab_size=49152,
  hidden_size=768,
  num_hidden_layers=24,
  intermediate_size=3072,
  num_attention_heads=8,
  num_key_value_heads=8,
  max_position_embeddings=1024,
  ...
)
```
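
As a sanity check, the configuration above implies roughly 302M trainable parameters, consistent with the advertised 303M. This back-of-envelope count assumes untied input/output embeddings and RMSNorm weights as the only non-projection terms; it is an estimate, not an official breakdown.

```python
# Rough parameter count derived from the LlamaConfig above.
# Assumes untied input/output embeddings (an assumption, not confirmed
# by the card) and two RMSNorm weight vectors per layer.
vocab, d, layers, d_ff = 49152, 768, 24, 3072

embed = vocab * d              # input embedding table
lm_head = vocab * d            # output projection (assumed untied)
attn = 4 * d * d               # q, k, v, o projections (full MHA: kv dim == d)
mlp = 3 * d * d_ff             # gate, up, and down projections
norms = 2 * d                  # two RMSNorms per layer
final_norm = d

total = embed + lm_head + layers * (attn + mlp + norms) + final_norm
print(f"{total / 1e6:.1f}M parameters")  # → 302.0M parameters
```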

## 🛠️ Training Plan (4 Stages)

This model is the result of a multi-stage training curriculum totaling **29.8 Billion tokens**.

### Stage 1: Linguistic Base (Completed)

  * **Tokens**: 6.3B
  * **Dataset**: `FineWeb-Edu`
  * **Objective**: Learn natural language.
  * **Loss**: 10.87 → **2.58**

### Stage 2: Code Specialization (Completed)

  * **Tokens**: 7.5B
  * **Dataset**: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
  * **Objective**: Learn code syntax and reasoning.
  * **Loss**: 5.00 → **1.25**

### Stage 3: Math & Knowledge (Completed)

  * **Tokens**: 10B
  * **Dataset**: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
  * **Objective**: Learn mathematical reasoning.
  * **Loss**: 2.77 → **1.55**
  * **Result**: A solid base model (Wikitext PPL: 35.4).

### Stage 4.1: SFT (EOS-Fixed) (Completed)

  * **Tokens**: 6B
  * **Starting Checkpoint**: `stage-3/`
  * **Dataset**: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
  * **Objective**: Align on code instructions and fix the EOS generation bug.
  * **Loss**: 1.73 → **\~0.70** (low point)
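
The Stage 4.1 mixture can be sketched as per-example weighted sampling. The actual data pipeline is not published; the snippet below only illustrates interleaving sources with the probabilities listed above.

```python
import random

# Stage 4.1 sampling weights as listed in the training plan above.
MIXTURE = {
    "Nemotron-SFT-Code": 0.45,
    "OpenCodeInstruct": 0.30,
    "OpenMathInstruct-2": 0.15,
    "Nemotron-SFT-General": 0.10,
}

def sample_sources(n, seed=0):
    """Draw n dataset names in proportion to the mixture weights."""
    rng = random.Random(seed)
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)
```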

-----

## 📊 Detailed Benchmarks (Stage 4.1)

The SFT (Code) scores are excellent. The generalist scores (Math, Reasoning) are low, indicating the SFT has heavily specialized the model (a "code specialist").

| Task | Benchmark | n-shot | Metric | Score |
| :--- | :--- | :---: | :--- | :---: |
| **Code** | **HumanEval** | 0 | **pass@1** | **27.4%** |
| **Code** | **MBPP** | 3 | **pass@1** | **31.0%** |
| **Math** | **GSM8k** | 0 | exact\_match | **4.55%** |
| **General** | **Wikitext** | 0 | word\_perplexity | 167.6 |
| **Reasoning** | **ARC Easy** | 0 | acc\_norm | 34.6% |
| **Reasoning** | **ARC Challenge** | 0 | acc\_norm | 22.8% |
| **Commonsense** | **HellaSwag** | 0 | acc\_norm | 28.3% |

*`humaneval`/`mbpp` scores are based on manual analysis (`max_gen_toks=512`), as official `lm-eval` benchmarks fail to evaluate this model due to SFT formatting and truncation issues.*

## ⚠️ Known Limitations

1.  **Code Specialist:** Heavily optimized for code (27.4% HumanEval) at the expense of other skills. Performance on math (GSM8k 4.55%) and general knowledge (Wikitext PPL 167.6) is low. **This is a code specialist model, not a generalist.**
2.  **Limited Context:** This model was trained exclusively on a sequence length of **1024 tokens**. It cannot handle longer prompts.
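
Because of the hard 1024-token window, long prompts should be trimmed before generation. A minimal sketch (the helper name and budget split are illustrative, not part of the model's API):

```python
def fit_context(input_ids, max_context=1024, max_new_tokens=512):
    """Trim token ids so prompt + generated tokens fit the context window.

    Keeps the most recent tokens, since the instruction usually sits at
    the end of the prompt.
    """
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than max_context")
    return input_ids[-budget:]
```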

## ⚡ How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
).to(device)

# Note the 'User:' and 'Assistant:' formatting
prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generation
# The model was trained to use tokenizer.eos_token_id
# It should stop automatically.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
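
Since `decode` returns the full sequence (prompt included), you may want to strip everything before the reply. A small hypothetical helper based on the `User:`/`Assistant:` format shown above:

```python
def extract_reply(decoded, marker="Assistant:"):
    """Return only the text after the first 'Assistant:' marker.

    Falls back to the full (stripped) string if the marker is absent.
    """
    _, sep, reply = decoded.partition(marker)
    return reply.strip() if sep else decoded.strip()
```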

## Acknowledgements

### Trained with the Google TRC

This model was trained with support from Google's **TPU Research Cloud (TRC)** program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.
