|
|
---
language:
- en
license: mit
tags:
- text-generation
- transformer
- conversational
datasets:
- HuggingFaceFW/fineweb-edu
- cais/mmlu
- gsm8k
- HuggingFaceTB/smoltalk
model-index:
- name: nanochat
  results:
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: cais/mmlu
    metrics:
    - type: accuracy
      value: 31.51
  - task:
      type: text-generation
    dataset:
      name: GSM8K
      type: gsm8k
    metrics:
    - type: accuracy
      value: 4.55
  - task:
      type: text-generation
    dataset:
      name: HumanEval
      type: openai_humaneval
    metrics:
    - type: pass@1
      value: 8.54
---
|
|
|
|
|
# nanochat |
|
|
|
|
|
**nanochat** is a 561M-parameter transformer language model trained for conversational AI tasks. It demonstrates that a capable chat model can be trained on a modest hardware budget (roughly $100 of compute on 8x H100 GPUs).
|
|
|
|
|
Read about the process at https://samdobson.uk/posts/training-a-chatgpt-clone-for-cheap/ |
|
|
|
|
|
Chat with the model at https://huggingface.co/spaces/sdobson/nanochat |
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Developed by:** Andrej Karpathy |
|
|
- **Trained by:** Sam Dobson |
|
|
- **Model type:** Transformer-based causal language model |
|
|
- **Language(s):** English |
|
|
- **License:** MIT |
|
|
- **Parameters:** 560,988,160 (~561M) |
|
|
|
|
|
### Architecture |
|
|
|
|
|
- **Layers:** 20 |
|
|
- **Hidden size:** 1280 channels |
|
|
- **Attention heads:** 10 |
|
|
- **Head dimension:** 128 |
|
|
- **Vocabulary size:** 65,536 tokens |
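
The quoted parameter count of 560,988,160 can be reproduced from these figures under a few assumptions not stated in this card (untied input and output embeddings, a 4x MLP expansion, and no bias or learnable norm parameters); a back-of-the-envelope sketch, not an exact module breakdown:

```python
# Rough parameter count from the architecture figures above.
# Assumptions (not from this card): untied embedding/unembedding,
# 4x MLP expansion, no bias or learnable norm parameters.
d_model, n_layers, vocab_size = 1280, 20, 65536

attn_per_layer = 4 * d_model * d_model      # Q, K, V and output projections
mlp_per_layer = 2 * 4 * d_model * d_model   # up- and down-projection, 4x expansion
block_params = n_layers * (attn_per_layer + mlp_per_layer)  # 393,216,000

embedding_params = 2 * vocab_size * d_model  # input embedding + output head

total = block_params + embedding_params
print(f"{total:,}")  # 560,988,160, matching the quoted count
```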
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
nanochat was trained in multiple stages: |
|
|
|
|
|
1. **Pretraining:** a 100B-token subset of FineWeb-EDU (~11.2B tokens processed) |
|
|
2. **Midtraining:** SmolTalk conversations, MMLU multiple choice questions, GSM8K math problems |
|
|
3. **Supervised Fine-tuning (SFT):** Conversational adaptation data |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Tokenization |
|
|
- Custom Rust-based BPE tokenizer |
|
|
- Vocabulary: 65,536 tokens |
|
|
- Compression ratio: 4.8 characters per token |
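
If you want to sanity-check the compression ratio on your own text, a minimal sketch is below. It assumes the released `tokenizer.pkl` unpickles into an object exposing an `encode()` method and that the nanochat package is importable (unpickling a custom class needs its definition on the path); treat it as an illustration rather than a documented API:

```python
import pickle

# Illustration only: assumes tokenizer.pkl deserialises to an object with
# encode() returning a list of token ids, and that the nanochat repo is on
# sys.path so pickle can find the tokenizer class.
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

text = "Language models compress text into a shorter sequence of tokens."
token_ids = tokenizer.encode(text)
print(f"{len(text) / len(token_ids):.2f} characters per token")  # compare with the quoted ~4.8
```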
|
|
|
|
|
#### Training Infrastructure |
|
|
- **Hardware:** 8x H100 GPUs (Lambda GPU Cloud) |
|
|
- **Training time:** ~3 hours for the pretraining stage |
|
|
- **Estimated compute:** ~4e19 FLOPs |
|
|
- **Total cost:** ~$100 |
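
The compute figure is consistent with the standard `6 * N * D` approximation for dense transformer training (N parameters, D training tokens), using the numbers quoted above:

```python
# Standard 6*N*D approximation for training FLOPs (forward + backward).
N = 560_988_160        # model parameters
D = 11_200_000_000     # tokens processed during pretraining

flops = 6 * N * D
print(f"{flops:.2e}")  # ~3.8e+19, in line with the quoted ~4e19 FLOPs
```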
|
|
|
|
|
#### Training Stages |
|
|
The model was trained in three stages: |
|
|
1. **Pretraining** on web text (FineWeb-EDU) |
|
|
2. **Midtraining** on domain-specific datasets (reasoning, conversation, maths) |
|
|
3. **Supervised fine-tuning** for chat optimisation |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
| Benchmark | Score | Description |
|-----------|-------|-------------|
| **MMLU** | 23.99% | Multitask language understanding |
| **GSM8K** | 4.47% | Grade school math problems |
| **HumanEval** | 6.71% | Python code generation |
| **ARC-Easy** | 24.79% | Science questions (easy) |
| **ARC-Challenge** | 24.32% | Science questions (hard) |
| **ChatCORE** | 1.73% | Conversational reasoning |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
nanochat is designed for: |
|
|
- Conversational AI applications |
|
|
- Research on efficient language model training |
|
|
- Educational purposes for understanding LLM training pipelines |
|
|
- Low-resource deployment scenarios |
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
The model can be fine-tuned for specific conversational tasks or used as a base model for further domain adaptation. |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- Production-grade conversational AI (the model is relatively small and has limited capabilities) |
|
|
- Tasks requiring specialised knowledge or high accuracy |
|
|
- Critical applications where errors could cause harm |
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
- **Small scale:** At 561M parameters, this model is significantly less capable than larger models (1B+ parameters) |
|
|
- **Limited training:** Trained on only 11.2B tokens, which is modest by modern standards |
|
|
- **Performance:** Benchmark scores indicate limited reasoning and mathematical capabilities |
|
|
- **Bias:** Inherits biases from training data (FineWeb-EDU, SmolTalk, etc.) |
|
|
- **Language:** English-only |
|
|
|
|
|
## Inference guide |
|
|
|
|
|
Simon Willison created a script that runs the model on CPU on macOS: |
|
|
|
|
|
```bash
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
  --model-dir /tmp/nanochat \
  --prompt "Tell me about dogs."
```
|
|
|
|
|
Otherwise, you can set it up manually (a file-placement sketch in Python follows this list): |
|
|
|
|
|
1. Download all files |
|
|
2. Put `tokenizer.pkl` and `token_bytes.pt` in `~/.cache/nanochat/tokenizer` |
|
|
3. Put `model_000650.pt` and `meta_000650.json` in `~/.cache/nanochat/chatsft_checkpoints/d20` |
|
|
4. Clone https://github.com/karpathy/nanochat |
|
|
5. Run `uv sync` followed by `uv run python -m scripts.chat_web` |
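
For reference, the file placement in steps 2-3 as a small Python sketch; the `downloads/` directory is a hypothetical location for the files from step 1, so adjust it to wherever you saved them. Steps 4-5 (cloning the repo and launching the web UI) are run from the shell as listed.

```python
# Sketch of steps 2-3: copy the downloaded files into the locations nanochat expects.
# "downloads" is a hypothetical directory holding the files from step 1.
import shutil
from pathlib import Path

downloads = Path("downloads")
cache = Path.home() / ".cache" / "nanochat"

tokenizer_dir = cache / "tokenizer"
checkpoint_dir = cache / "chatsft_checkpoints" / "d20"
tokenizer_dir.mkdir(parents=True, exist_ok=True)
checkpoint_dir.mkdir(parents=True, exist_ok=True)

for name in ("tokenizer.pkl", "token_bytes.pt"):
    shutil.copy(downloads / name, tokenizer_dir / name)
for name in ("model_000650.pt", "meta_000650.json"):
    shutil.copy(downloads / name, checkpoint_dir / name)
```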
|
|
|
|
|
## Citation |
|
|
|
|
|
**Repository:** [github.com/karpathy/nanochat](https://github.com/karpathy/nanochat) |
|
|
|
|
|
```bibtex
@software{nanochat2025,
  author = {Karpathy, Andrej},
  title = {nanochat: A 561M parameter conversational language model},
  year = {2025},
  url = {https://github.com/karpathy/nanochat}
}
```
|
|
|
|
|
## Model Card Author |
|
|
|
|
|
Sam Dobson |