---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- smallcoder
- code-llm
- sft
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---

# SmallCoder (303M)

SmallCoder is a **303 million parameter** language model trained from scratch, specializing in code generation and algorithmic reasoning.

This checkpoint is the result of a 6 billion token Supervised Fine-Tuning (SFT) run, which **fixed a critical End-of-Sequence (EOS) token bug** present in previous versions.

The model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming models larger than 1B parameters and competing with models 23x its size.

**Trained with support from Google's TPU Research Cloud (TRC) program.**

## Key Performance (Benchmarks)

The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
| :--- | :---: | :---: | :---: |
| **SmallCoder (S4.1)** | **303M** | **27.4%** | **31.0%** |
| TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
| MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
| Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
| Mistral-7B Base | 7B | 30.5% | 47.5% |

SmallCoder (303M) nearly achieves **parity with Mistral 7B** on HumanEval while being **23x smaller**.

## Model Architecture

This model uses a Llama-type architecture (MHA) with 303M parameters.

* **Architecture**: LlamaForCausalLM (MHA)
* **Hidden Size**: 768
* **Layers**: 24
* **Attention Heads**: 8
* **KV Heads**: 8 (Standard MHA)
* **Vocab Size**: 49152 (Tokenizer: `bigcode/starcoder`)
* **Max Context**: 1024 tokens

```python
LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    intermediate_size=3072,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=1024,
    ...
)
```
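
As a quick sanity check on the size, here is a back-of-the-envelope parameter count implied by the config above. This is our own arithmetic, assuming an untied LM head and a standard Llama-style SwiGLU MLP; it is not an official breakdown.

```python
# Rough parameter count implied by the config above (ignores biases).
hidden, layers, inter, vocab = 768, 24, 3072, 49152

embeddings = vocab * hidden            # input token embeddings
attention  = 4 * hidden * hidden       # q, k, v, o projections (full MHA)
mlp        = 3 * hidden * inter        # gate, up, down projections (SwiGLU)
norms      = 2 * hidden                # two RMSNorm weight vectors per layer
per_layer  = attention + mlp + norms

# + final norm and a separate (untied) LM head, both assumptions on our part
total = embeddings + layers * per_layer + hidden + vocab * hidden
print(f"~{total / 1e6:.0f}M parameters")   # ~302M, consistent with the stated 303M
```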

## Training Plan (4 Stages)

This model is the result of a multi-stage training curriculum totaling **29.8 billion tokens**.

### Stage 1: Linguistic Base (Completed)

* **Tokens**: 6.3B
* **Dataset**: `FineWeb-Edu`
* **Objective**: Learn natural language.
* **Loss**: 10.87 → **2.58**

### Stage 2: Code Specialization (Completed)

* **Tokens**: 7.5B
* **Dataset**: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
* **Objective**: Learn code syntax and reasoning.
* **Loss**: 5.00 → **1.25**

### Stage 3: Math & Knowledge (Completed)

* **Tokens**: 10B
* **Dataset**: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
* **Objective**: Learn mathematical reasoning.
* **Loss**: 2.77 → **1.55**
* **Result**: A solid base model (Wikitext PPL: 35.4).

### Stage 4.1: SFT (EOS-Fixed) (Completed)

* **Tokens**: 6B
* **Starting Checkpoint**: `stage-3/`
* **Dataset**: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%) (see the mixture sketch below)
* **Objective**: Align on code instructions and fix the EOS generation bug.
* **Loss**: 1.73 → **~0.70** (low point)
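
For illustration, the Stage 4.1 mixture above could be approximated with the `datasets` library as sketched below. This is not the authors' training code: the split names, the `to_chat_text` formatting helper, its field names, and the grouping of the two Nemotron SFT subsets into one source are assumptions.

```python
# Illustrative sketch of the Stage 4.1 SFT data mixture (weights from the list above).
from datasets import load_dataset, interleave_datasets

# (dataset id, sampling probability) -- subset/config choices are assumptions
sources = [
    ("nvidia/Nemotron-Pretraining-SFT-v1", 0.55),  # code (45%) + general (10%) portions
    ("nvidia/OpenCodeInstruct", 0.30),
    ("nvidia/OpenMathInstruct-2", 0.15),
]

def to_chat_text(example):
    # Hypothetical helper: render each example into the "User: ... Assistant: ..." format.
    return {"text": f"User: {example.get('input', '')}\nAssistant: {example.get('output', '')}"}

streams = [
    load_dataset(name, split="train", streaming=True)
        .map(to_chat_text)
        .select_columns(["text"])
    for name, _ in sources
]

mixture = interleave_datasets(
    streams,
    probabilities=[p for _, p in sources],
    seed=42,
    stopping_strategy="all_exhausted",
)
```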

-----

## Detailed Benchmarks (Stage 4.1)

The SFT coding scores are strong, while the generalist scores (math, reasoning) are low, indicating that SFT has heavily specialized the model into a "code specialist".

| Task | Benchmark | n-shot | Metric | Score |
| :--- | :--- | :---: | :--- | :---: |
| **Code** | **HumanEval** | 0 | **pass@1** | **27.4%** |
| **Code** | **MBPP** | 3 | **pass@1** | **31.0%** |
| **Math** | **GSM8k** | 0 | exact_match | **4.55%** |
| **General** | **Wikitext** | 0 | word_perplexity | 167.6 |
| **Reasoning** | **ARC Easy** | 0 | acc_norm | 34.6% |
| **Reasoning** | **ARC Challenge** | 0 | acc_norm | 22.8% |
| **Commonsense** | **HellaSwag** | 0 | acc_norm | 28.3% |

*The `humaneval`/`mbpp` scores are based on manual analysis (`max_gen_toks=512`), as the official `lm-eval` tasks fail to evaluate this model due to SFT formatting and truncation issues; a sketch of such a manual check follows.*
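
The snippet below is an illustrative sketch of this kind of manual pass@1 check, not the authors' exact harness. It assumes `model`, `tokenizer`, and `device` are set up as in the usage example further down, that `task` follows the HumanEval format (`prompt`, `test`, `entry_point` fields), and that the model may wrap its answer in a markdown fence; the instruction wording in the prompt is our own.

```python
import re

def solve(task) -> str:
    # Query the model in its "User:/Assistant:" format and return candidate code.
    prompt = f"User: Complete the following Python function.\n{task['prompt']}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_new_tokens=512,  # mirrors max_gen_toks=512
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # If the reply is wrapped in a markdown code fence, keep only the code inside it.
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", reply, re.DOTALL)
    return match.group(1) if match else reply

def passes(task) -> bool:
    # WARNING: executes model-generated code; use a sandbox in practice.
    env = {}
    try:
        exec(solve(task) + "\n" + task["test"], env)
        env["check"](env[task["entry_point"]])  # HumanEval's test field defines check()
        return True
    except Exception:
        return False
```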

## Known Limitations

1. **Code specialist:** Heavily optimized for code (27.4% HumanEval) at the expense of other skills. Performance on math (`gsm8k` 4.55%) and general knowledge (Wikitext PPL 167.6) is low. **This is a code specialist model, not a generalist.**
2. **Limited context:** The model was trained exclusively with a sequence length of **1024 tokens** and cannot handle longer prompts (see the truncation sketch below).
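
A minimal sketch for staying within that window, assuming `tokenizer` and `device` are set up as in the usage example below (the 512-token headroom is our choice, not a model requirement):

```python
# Truncate the prompt so that prompt + generated tokens fit in the 1024-token context.
inputs = tokenizer(
    long_prompt,              # hypothetical, possibly over-long prompt string
    return_tensors="pt",
    truncation=True,
    max_length=1024 - 512,    # leave room for up to 512 new tokens
).to(device)
```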

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
).to(device)

# Note the 'User:' and 'Assistant:' formatting
prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generation: the model was trained to use tokenizer.eos_token_id,
# so it should stop automatically.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
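
If you prefer to print only the assistant's reply without the echoed prompt, a small variation (our own addition, not part of the official example) is to decode just the newly generated tokens:

```python
# Decode only the tokens produced after the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```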

## Acknowledgements

### Trained with the Google TRC

This model was trained with support from Google's **TPU Research Cloud (TRC)** program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.