# NanoChat D34 SFT
A fine-tuned version of karpathy/nanochat-d34 with mid-training and supervised fine-tuning (SFT) for chat capabilities.
## Model Details

- Base Model: karpathy/nanochat-d34 (2.2B parameters)
- Training Pipeline: Base → Mid-Training → SFT
- Hardware: Lambda Labs H100 80GB
- Training Time: ~6 hours total (~5.5h mid-training + ~30min SFT)
## Training Details

### Mid-Training

- Steps: 813
- Final validation BPB: 0.3282
- Batch size: 4 (reduced from the default 32 to fit a single H100)
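BPB (bits per byte) normalizes cross-entropy loss by the number of UTF-8 bytes each token covers, making loss comparable across tokenizers. A minimal sketch of the conversion, assuming loss is measured in nats per token (the numbers below are illustrative, not the actual training values):

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits per byte (BPB).

    BPB = loss / (ln(2) * bytes_per_token): divide by ln(2) to convert
    nats to bits, then by the average UTF-8 bytes each token covers.
    """
    return loss_nats_per_token / (math.log(2) * bytes_per_token)

# Illustrative numbers only: a loss of 1.0 nat/token with ~4.4 bytes/token
# lands around 0.33 BPB, the same range as the 0.3282 reported above.
print(round(bits_per_byte(1.0, 4.4), 4))
```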
### SFT (Supervised Fine-Tuning)

- Steps: 700
- MMLU Accuracy: 42.6%
- ARC-Easy Accuracy: 72.0%
## Files

### Tokenizer

| File | Description |
|---|---|
| `tokenizer/token_bytes.pt` | Token byte mappings |
| `tokenizer/tokenizer.pkl` | Pickled tokenizer object |
### Checkpoints

| Checkpoint | Step | MMLU | ARC-Easy | Description |
|---|---|---|---|---|
| `model_000719.pt` | 719 | 42.0% | 73.9% | Best ARC-Easy performance |
| `model_000700.pt` | 700 | 42.6% | 72.0% | Best MMLU performance |
| `full_custom_model_000055.pt` | 55 | 42.4% | 70.5% | Full dataset + custom data |
| `custom_only_model_000033.pt` | 33 | 41.8% | 71.2% | Custom data only |
| `mid_model_000813.pt` | 813 | - | - | Mid-training checkpoint (BPB: 0.328) |
**Checkpoint Selection Guide:**

- General use: `model_000700.pt` - best MMLU, solid reasoning
- Domain reasoning: `full_custom_model_000055.pt` - knows domain vocabulary and codebase patterns
- Note: for SQL syntax, use MCP (Context7) rather than relying on fine-tuning (see Recommendations below)
## Training Data

| File | Examples | Description |
|---|---|---|
| `training_data/combined_training_nanochat.jsonl` | 1,092 | Custom domain data (analytics/SQL) |
| `training_data/full_combined_training_nanochat.jsonl` | 1,812 | Full dataset including custom data |
The custom training data contains domain-specific examples for analytics queries, particularly ClickHouse SQL for trading/blockchain data analysis.
## Usage

### Setup

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat
uv sync

# Download this model
huggingface-cli download victoremnm/nanochat-d34-sft --local-dir ~/nanochat-d34-sft

# Set up directories
mkdir -p ~/.cache/nanochat/tokenizer
mkdir -p ~/.cache/nanochat/chatsft_checkpoints/d34
cp ~/nanochat-d34-sft/tokenizer/* ~/.cache/nanochat/tokenizer/
cp ~/nanochat-d34-sft/model_000719.pt ~/.cache/nanochat/chatsft_checkpoints/d34/
cp ~/nanochat-d34-sft/meta_000719.json ~/.cache/nanochat/chatsft_checkpoints/d34/
```
### Run Chat

```bash
# Web interface
uv run python -m scripts.chat_web --source=sft --model-tag=d34 --step=719 --temperature=0.6

# CLI interface
uv run python -m scripts.chat_cli --source=sft --model-tag=d34 --step=719 --temperature=0.6
```
## Training Data Format

The training data uses the standard chat format (JSONL, one conversation per line, with role/content pairs):

```json
[
  {"role": "user", "content": "How do I find the top traders by volume?"},
  {"role": "assistant", "content": "Here's the SQL query:\n\n```sql\nSELECT trader, SUM(amount) FROM trades GROUP BY trader ORDER BY 2 DESC LIMIT 10;\n```"}
]
```
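The round-trip between a conversation and a JSONL line can be sketched with the standard `json` module (a minimal example; the messages here are illustrative, not taken from the dataset):

```python
import json

# One conversation = one list of role/content messages = one JSONL line.
conversation = [
    {"role": "user", "content": "How do I find the top traders by volume?"},
    {"role": "assistant", "content": "SELECT trader, SUM(amount) FROM trades "
                                     "GROUP BY trader ORDER BY 2 DESC LIMIT 10;"},
]

# Serialize to a single JSONL line, then parse it back.
line = json.dumps(conversation)
parsed = json.loads(line)
print(len(parsed))  # 2 messages
```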
## Performance Comparison
| Metric | Without Mid-Training | With Mid-Training (this model) |
|---|---|---|
| MMLU Accuracy | ~25% (random) | 42.6% |
| ARC-Easy Accuracy | ~30% | 72.0% |
| Chat Quality | Gibberish | Coherent conversations |
| Math | Broken | Basic arithmetic works |
| Code | Broken | Working code generation |
## Domain Evaluation Results

We evaluated the models on domain-specific tasks:

| Model | Overall | ClickHouse SQL | Codebase | Reasoning |
|---|---|---|---|---|
| Base (step 700) | 60% | 25% | 100% | 100% |
| Custom (step 55) | 50% | 0% | 100% | 100% |

**Key Finding:** Custom training preserved codebase and reasoning performance but degraded SQL syntax. The model emits SQL Server syntax (`TOP 10`) instead of ClickHouse syntax (`LIMIT 10`) due to pre-training bias, even though the training data used the correct dialect.
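One lightweight mitigation for this dialect bias (a sketch, not part of the released pipeline) is to post-process generated SQL and rewrite the SQL Server `SELECT TOP n` pattern into ClickHouse's `LIMIT n`:

```python
import re

def top_to_limit(sql: str) -> str:
    """Rewrite `SELECT TOP n ...` (SQL Server) to `SELECT ... LIMIT n` (ClickHouse).

    A heuristic for simple single-SELECT queries only; queries with
    subqueries or an existing LIMIT clause need a real SQL parser.
    """
    match = re.match(r"(?is)^\s*SELECT\s+TOP\s+(\d+)\s+(.*?);?\s*$", sql)
    if not match:
        return sql  # nothing to rewrite
    n, rest = match.groups()
    return f"SELECT {rest} LIMIT {n};"

print(top_to_limit("SELECT TOP 10 trader FROM trades ORDER BY 2 DESC"))
# -> SELECT trader FROM trades ORDER BY 2 DESC LIMIT 10;
```

A regex guard like this patches the most common symptom; the Recommendations section below argues for the more robust fix of deferring dialect syntax to MCP.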
## Recommendations: Fine-Tuning + MCP Hybrid
Based on our evaluation, we recommend a hybrid approach:
| Component | Use For | Why |
|---|---|---|
| Fine-tuned model | Domain reasoning, business logic, codebase awareness | Stable knowledge that doesn't change often |
| MCP (Context7) | SQL syntax, table schemas, evolving patterns | Always current, hard to override pre-training |
**What the fine-tuned model provides:**

- Domain vocabulary (BonkBot, swap_events, materializer, etc.)
- Business context (trading analytics, DEX patterns, holder tracking)
- Codebase awareness (repository structures, API patterns)
**What to defer to MCP:**

- Current table schemas (DDLs change frequently)
- SQL dialect syntax (ClickHouse functions, BigQuery patterns)
- External documentation (always up to date via Context7)
Context7 exposes 12,916 ClickHouse code snippets via `/clickhouse/clickhouse-docs`.
## Limitations

- 2.2B-parameter model, smaller than production chat models
- Trained on limited data compared to commercial models
- SQL dialect knowledge limited by pre-training bias (use MCP for syntax)
- Best suited for learning and experimentation, not production
## Training Resources
## License

MIT (same as base nanochat)