NanoChat D34 SFT

A fine-tuned version of karpathy/nanochat-d34 with mid-training and supervised fine-tuning (SFT) for chat capabilities.

Model Details

  • Base Model: karpathy/nanochat-d34 (2.2B parameters)
  • Training Pipeline: Base → Mid-Training → SFT
  • Hardware: Lambda Labs H100 80GB
  • Training Time: 6 hours total (5.5h mid-training + ~30min SFT)

Training Details

Mid-Training

  • Steps: 813
  • Final validation BPB: 0.3282
  • Batch size: 4 (reduced from default 32 to fit single H100)

SFT (Supervised Fine-Tuning)

  • Steps: 700
  • MMLU Accuracy: 42.6%
  • ARC-Easy Accuracy: 72.0%

Files

Tokenizer

| File | Description |
|------|-------------|
| tokenizer/token_bytes.pt | Token byte mappings |
| tokenizer/tokenizer.pkl | Pickled tokenizer object |

Checkpoints

| Checkpoint | Step | MMLU | ARC-Easy | Description |
|------------|------|------|----------|-------------|
| model_000719.pt | 719 | 42.0% | 73.9% | Best ARC-Easy performance |
| model_000700.pt | 700 | 42.6% | 72.0% | Best MMLU performance |
| full_custom_model_000055.pt | 55 | 42.4% | 70.5% | Full dataset + custom data |
| custom_only_model_000033.pt | 33 | 41.8% | 71.2% | Custom data only |
| mid_model_000813.pt | 813 | - | - | Mid-training checkpoint (BPB: 0.328) |

Checkpoint Selection Guide:

  • General use: model_000700.pt - best MMLU, solid reasoning
  • Domain reasoning: full_custom_model_000055.pt - knows domain vocabulary, codebase patterns
  • Note: For SQL syntax, use MCP (Context7) rather than relying on fine-tuning (see Recommendations below)
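The selection guide above can be encoded as a tiny lookup helper. This is purely illustrative and not part of nanochat; `CHECKPOINTS` and `pick_checkpoint` are hypothetical names:

```python
# Hypothetical helper encoding the checkpoint selection guide above.
CHECKPOINTS = {
    "general": "model_000700.pt",             # best MMLU (42.6%)
    "arc": "model_000719.pt",                 # best ARC-Easy (73.9%)
    "domain": "full_custom_model_000055.pt",  # domain vocabulary + codebase patterns
}

def pick_checkpoint(use_case: str) -> str:
    """Map a use case to a checkpoint file, defaulting to general use."""
    return CHECKPOINTS.get(use_case, CHECKPOINTS["general"])
```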

Training Data

| File | Examples | Description |
|------|----------|-------------|
| training_data/combined_training_nanochat.jsonl | 1,092 | Custom domain data (analytics/SQL) |
| training_data/full_combined_training_nanochat.jsonl | 1,812 | Full dataset including custom data |

The custom training data contains domain-specific examples for analytics queries, particularly ClickHouse SQL for trading/blockchain data analysis.

Usage

Setup

```shell
git clone https://github.com/karpathy/nanochat.git
cd nanochat
uv sync

# Download this model
huggingface-cli download victoremnm/nanochat-d34-sft --local-dir ~/nanochat-d34-sft

# Setup directories
mkdir -p ~/.cache/nanochat/tokenizer
mkdir -p ~/.cache/nanochat/chatsft_checkpoints/d34

cp ~/nanochat-d34-sft/tokenizer/* ~/.cache/nanochat/tokenizer/
cp ~/nanochat-d34-sft/model_000719.pt ~/.cache/nanochat/chatsft_checkpoints/d34/
cp ~/nanochat-d34-sft/meta_000719.json ~/.cache/nanochat/chatsft_checkpoints/d34/
```
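A quick way to verify the copies landed where nanochat expects them. This is a minimal sketch assuming the cache layout used in the commands above; `missing_files` is a hypothetical helper, not a nanochat function:

```python
from pathlib import Path

def missing_files(cache_dir: Path) -> list:
    """Return which of the files copied above are absent from the cache."""
    expected = [
        cache_dir / "tokenizer" / "tokenizer.pkl",
        cache_dir / "tokenizer" / "token_bytes.pt",
        cache_dir / "chatsft_checkpoints" / "d34" / "model_000719.pt",
        cache_dir / "chatsft_checkpoints" / "d34" / "meta_000719.json",
    ]
    return [p for p in expected if not p.exists()]

# Usage: missing_files(Path.home() / ".cache" / "nanochat") should return [].
```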

Run Chat

```shell
# Web interface
uv run python -m scripts.chat_web --source=sft --model-tag=d34 --step=719 --temperature=0.6

# CLI interface
uv run python -m scripts.chat_cli --source=sft --model-tag=d34 --step=719 --temperature=0.6
```

Training Data Format

The training data uses the standard chat format: each line of the JSONL file is a JSON array of role/content turns (pretty-printed below for readability):

```json
[
  {"role": "user", "content": "How do I find the top traders by volume?"},
  {"role": "assistant", "content": "Here's the SQL query:\n\n```sql\nSELECT trader, SUM(amount) FROM trades GROUP BY trader ORDER BY 2 DESC LIMIT 10;\n```"}
]
```
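Since every record must round-trip through JSON, a small validator helps catch malformed lines before training. This is a sketch under the format assumptions above; `validate_conversation` is a hypothetical helper, not part of nanochat:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_conversation(line: str) -> list:
    """Parse one JSONL line and check the role/content structure shown above."""
    turns = json.loads(line)
    if not isinstance(turns, list) or not turns:
        raise ValueError("expected a non-empty JSON array of turns")
    for turn in turns:
        if turn.get("role") not in VALID_ROLES:
            raise ValueError(f"unknown role: {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError("content must be a string")
    return turns

example = '[{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]'
turns = validate_conversation(example)
```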

Performance Comparison

| Metric | Without Mid-Training | With Mid-Training (this model) |
|--------|----------------------|--------------------------------|
| MMLU Accuracy | ~25% (random) | 42.6% |
| ARC-Easy Accuracy | ~30% | 72.0% |
| Chat Quality | Gibberish | Coherent conversations |
| Math | Broken | Basic arithmetic works |
| Code | Broken | Working code generation |

Domain Evaluation Results

We evaluated the models on domain-specific tasks:

| Model | Overall | ClickHouse SQL | Codebase | Reasoning |
|-------|---------|----------------|----------|-----------|
| Base (step 700) | 60% | 25% | 100% | 100% |
| Custom (step 55) | 50% | 0% | 100% | 100% |

Key Finding: Custom training preserved codebase and reasoning performance but degraded SQL syntax. The model emits SQL Server syntax (TOP 10) instead of ClickHouse syntax (LIMIT 10) due to pre-training bias, even though the training data used the correct dialect.
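This dialect slip is easy to detect mechanically. As an illustration (this guard is hypothetical, not part of the evaluation harness), T-SQL's `SELECT TOP n` can be flagged with a regex so offending outputs can be retried or corrected:

```python
import re

# Matches SQL Server's "SELECT TOP 10 ..." — the failure mode described above.
TSQL_TOP = re.compile(r"\bSELECT\s+TOP\s+\d+\b", re.IGNORECASE)

def uses_tsql_top(sql: str) -> bool:
    """Return True if the query uses T-SQL TOP instead of ClickHouse LIMIT."""
    return TSQL_TOP.search(sql) is not None

bad = "SELECT TOP 10 trader FROM trades ORDER BY volume DESC"
good = "SELECT trader FROM trades ORDER BY volume DESC LIMIT 10"
```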

Recommendations: Fine-tuning + MCP Hybrid

Based on our evaluation, we recommend a hybrid approach:

| Component | Use For | Why |
|-----------|---------|-----|
| Fine-tuned model | Domain reasoning, business logic, codebase awareness | Stable knowledge that doesn't change often |
| MCP (Context7) | SQL syntax, table schemas, evolving patterns | Always current; pre-training bias is hard to override with fine-tuning |

What the fine-tuned model provides:

  • Domain vocabulary (BonkBot, swap_events, materializer, etc.)
  • Business context (trading analytics, DEX patterns, holder tracking)
  • Codebase awareness (repository structures, API patterns)

What to defer to MCP:

  • Current table schemas (DDLs change frequently)
  • SQL dialect syntax (ClickHouse functions, BigQuery patterns)
  • External documentation (always up-to-date via Context7)

Context7 has 12,916 ClickHouse code snippets available via /clickhouse/clickhouse-docs.
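The hybrid split above can be sketched as a simple router — purely illustrative; the keyword list and function name are assumptions, and the actual MCP/Context7 integration is not shown:

```python
# Hypothetical router for the hybrid approach: syntax/schema questions go to
# MCP (Context7 docs), everything else to the fine-tuned model.
SYNTAX_HINTS = ("sql", "syntax", "schema", "ddl")

def route(query: str) -> str:
    """Return 'mcp' for syntax/schema lookups, 'model' for domain reasoning."""
    q = query.lower()
    return "mcp" if any(hint in q for hint in SYNTAX_HINTS) else "model"
```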

Limitations

  • 2.2B parameter model - smaller than production chat models
  • Trained on limited data compared to commercial models
  • SQL dialect knowledge limited by pre-training bias (use MCP for syntax)
  • Best used for learning/experimentation, not production

License

MIT (same as base nanochat)
