# NanoChat D34 SFT
A fine-tuned version of karpathy/nanochat-d34 with mid-training and supervised fine-tuning (SFT) for chat capabilities.
## Model Details

- Base Model: karpathy/nanochat-d34 (2.2B parameters)
- Training Pipeline: Base → Mid-Training → SFT
- Hardware: Lambda Labs H100 80GB
- Training Time: ~6 hours total (~5.5h mid-training + ~30min SFT)
## Training Details

### Mid-Training

- Steps: 813
- Final validation BPB: 0.3282
- Batch size: 4 (reduced from the default 32 to fit a single H100)
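BPB (bits per byte) normalizes cross-entropy loss by the number of UTF-8 bytes each token covers, making loss comparable across tokenizers. A minimal sketch of the conversion, assuming loss is measured in nats per token (the numbers below are illustrative, not the actual training values):

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits per byte (BPB).

    BPB = loss / (ln(2) * bytes_per_token): divide by ln(2) to convert
    nats to bits, then by the average UTF-8 bytes each token covers.
    """
    return loss_nats_per_token / (math.log(2) * bytes_per_token)

# Illustrative numbers only: a loss of 1.0 nat/token with ~4.4 bytes/token
# lands around 0.33 BPB, the same range as the 0.3282 reported above.
print(round(bits_per_byte(1.0, 4.4), 4))
```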
### SFT (Supervised Fine-Tuning)

- Steps: 700
- MMLU Accuracy: 42.6%
- ARC-Easy Accuracy: 72.0%
## Files

### Tokenizer

| File | Description |
|---|---|
| `tokenizer/token_bytes.pt` | Token byte mappings |
| `tokenizer/tokenizer.pkl` | Pickled tokenizer object |
### Checkpoints

| Checkpoint | Step | MMLU | ARC-Easy | Description |
|---|---|---|---|---|
| `model_000719.pt` | 719 | 42.0% | 73.9% | Best ARC-Easy performance |
| `model_000700.pt` | 700 | 42.6% | 72.0% | Best MMLU performance |
| `full_custom_model_000055.pt` | 55 | 42.4% | 70.5% | Full dataset + custom data |
| `custom_only_model_000033.pt` | 33 | 41.8% | 71.2% | Custom data only |
| `mid_model_000813.pt` | 813 | - | - | Mid-training checkpoint (BPB: 0.328) |
**Checkpoint Selection Guide:**

- General use: `model_000700.pt` - best MMLU, solid reasoning
- Domain reasoning: `full_custom_model_000055.pt` - knows domain vocabulary and codebase patterns
- Note: for SQL syntax, use MCP (Context7) rather than relying on fine-tuning (see Recommendations below)
## Training Data

| File | Examples | Description |
|---|---|---|
| `training_data/combined_training_nanochat.jsonl` | 1,092 | Custom domain data (analytics/SQL) |
| `training_data/full_combined_training_nanochat.jsonl` | 1,812 | Full dataset including custom data |
The custom training data contains domain-specific examples for analytics queries, particularly ClickHouse SQL for trading/blockchain data analysis.
## Usage

### Setup

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat
uv sync

# Download this model
huggingface-cli download victoremnm/nanochat-d34-sft --local-dir ~/nanochat-d34-sft

# Set up directories
mkdir -p ~/.cache/nanochat/tokenizer
mkdir -p ~/.cache/nanochat/chatsft_checkpoints/d34
cp ~/nanochat-d34-sft/tokenizer/* ~/.cache/nanochat/tokenizer/
cp ~/nanochat-d34-sft/model_000719.pt ~/.cache/nanochat/chatsft_checkpoints/d34/
cp ~/nanochat-d34-sft/meta_000719.json ~/.cache/nanochat/chatsft_checkpoints/d34/
```
### Run Chat

```bash
# Web interface
uv run python -m scripts.chat_web --source=sft --model-tag=d34 --step=719 --temperature=0.6

# CLI interface
uv run python -m scripts.chat_cli --source=sft --model-tag=d34 --step=719 --temperature=0.6
```
## Training Data Format

The training data uses the standard chat format (JSONL, one conversation per line, with role/content pairs):

```json
[
  {"role": "user", "content": "How do I find the top traders by volume?"},
  {"role": "assistant", "content": "Here's the SQL query:\n\n```sql\nSELECT trader, SUM(amount) FROM trades GROUP BY trader ORDER BY 2 DESC LIMIT 10;\n```"}
]
```
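The round-trip between a conversation and a JSONL line can be sketched with the standard `json` module (a minimal example; the messages here are illustrative, not taken from the dataset):

```python
import json

# One conversation = one list of role/content messages = one JSONL line.
conversation = [
    {"role": "user", "content": "How do I find the top traders by volume?"},
    {"role": "assistant", "content": "SELECT trader, SUM(amount) FROM trades "
                                     "GROUP BY trader ORDER BY 2 DESC LIMIT 10;"},
]

# Serialize to a single JSONL line, then parse it back.
line = json.dumps(conversation)
parsed = json.loads(line)
print(len(parsed))  # 2 messages
```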
## Performance Comparison
| Metric | Without Mid-Training | With Mid-Training (this model) |
|---|---|---|
| MMLU Accuracy | ~25% (random) | 42.6% |
| ARC-Easy Accuracy | ~30% | 72.0% |
| Chat Quality | Gibberish | Coherent conversations |
| Math | Broken | Basic arithmetic works |
| Code | Broken | Working code generation |
## Domain Evaluation Results

We evaluated the models on domain-specific tasks:

| Model | Overall | ClickHouse SQL | Codebase | Reasoning |
|---|---|---|---|---|
| Base (step 700) | 60% | 25% | 100% | 100% |
| Custom (step 55) | 50% | 0% | 100% | 100% |

**Key Finding:** Custom training preserved codebase and reasoning performance but degraded SQL syntax. The model emits SQL Server syntax (`TOP 10`) instead of ClickHouse syntax (`LIMIT 10`) due to pre-training bias, even though the training data used the correct dialect.
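One lightweight mitigation for this dialect bias (a sketch, not part of the released pipeline) is to post-process generated SQL and rewrite the SQL Server `SELECT TOP n` pattern into ClickHouse's `LIMIT n`:

```python
import re

def top_to_limit(sql: str) -> str:
    """Rewrite `SELECT TOP n ...` (SQL Server) to `SELECT ... LIMIT n` (ClickHouse).

    A heuristic for simple single-SELECT queries only; queries with
    subqueries or an existing LIMIT clause need a real SQL parser.
    """
    match = re.match(r"(?is)^\s*SELECT\s+TOP\s+(\d+)\s+(.*?);?\s*$", sql)
    if not match:
        return sql  # nothing to rewrite
    n, rest = match.groups()
    return f"SELECT {rest} LIMIT {n};"

print(top_to_limit("SELECT TOP 10 trader FROM trades ORDER BY 2 DESC"))
# -> SELECT trader FROM trades ORDER BY 2 DESC LIMIT 10;
```

A regex guard like this patches the most common symptom; the Recommendations section below argues for the more robust fix of deferring dialect syntax to MCP.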
## Recommendations: Fine-Tuning + MCP Hybrid
Based on our evaluation, we recommend a hybrid approach:
| Component | Use For | Why |
|---|---|---|
| Fine-tuned model | Domain reasoning, business logic, codebase awareness | Stable knowledge that doesn't change often |
| MCP (Context7) | SQL syntax, table schemas, evolving patterns | Always current, hard to override pre-training |
**What the fine-tuned model provides:**

- Domain vocabulary (BonkBot, swap_events, materializer, etc.)
- Business context (trading analytics, DEX patterns, holder tracking)
- Codebase awareness (repository structures, API patterns)
**What to defer to MCP:**

- Current table schemas (DDLs change frequently)
- SQL dialect syntax (ClickHouse functions, BigQuery patterns)
- External documentation (always up to date via Context7)
Context7 exposes 12,916 ClickHouse code snippets via `/clickhouse/clickhouse-docs`.
## Limitations

- 2.2B-parameter model, smaller than production chat models
- Trained on limited data compared to commercial models
- SQL dialect knowledge limited by pre-training bias (use MCP for syntax)
- Best suited for learning and experimentation, not production
## Training Resources
## License

MIT (same as base nanochat)