# NanoMind · 152M
A 152M-parameter GPT-2-style language model trained from scratch on GPT-4-quality instruction data, with a hand-written C++ inference engine featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.
## Model Details
| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 152.83M |
| Layers | 16 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,304 (GPT-2 BPE) |
| Training steps | 9,800 |
| Final loss | ~1.73 |
| Effective batch | 96 (12 × 8 grad accum) |
| Optimizer | AdamW (weight decay 0.1, β=0.9/0.95) |
| LR schedule | Warmup 300 steps + cosine decay |
| Peak LR | 5e-4 |
| Hardware | Kaggle T4 GPU (~12 hours) |
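The 152.83M figure in the table can be reproduced from the config. The sketch below assumes standard GPT-2 shapes (4× MLP width, fused QKV projection) and tied input/output embeddings, which is what the stated count implies:

```python
# Sanity-check the 152.83M parameter count from the config above.
# Assumes standard GPT-2 shapes (4x MLP width) and tied input/output
# embeddings (lm_head shares wte), the usual GPT-2 setup.
n_layer, n_embd = 16, 768
block_size, vocab_size = 1024, 50_304

wte = vocab_size * n_embd                       # token embeddings
wpe = block_size * n_embd                       # position embeddings
per_layer = (
    2 * (n_embd + n_embd)                       # ln1 + ln2 (weight + bias each)
    + n_embd * 3 * n_embd + 3 * n_embd          # c_attn (fused QKV projection)
    + n_embd * n_embd + n_embd                  # c_proj
    + n_embd * 4 * n_embd + 4 * n_embd          # mlp_fc
    + 4 * n_embd * n_embd + n_embd              # mlp_proj
)
ln_f = 2 * n_embd                               # final layer norm
total = wte + wpe + n_layer * per_layer + ln_f  # tied lm_head adds nothing
print(f"{total / 1e6:.2f}M")                    # -> 152.83M
```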
## Training Data
~220M tokens from GPT-4-quality sources:
| Dataset | Samples | Quality |
|---|---|---|
| OpenHermes 2.5 | 500k | GPT-4 multi-turn |
| Alpaca GPT-4 | 52k | GPT-4 instruction |
| WizardLM Evol V2 | 143k | GPT-4 evolved |
| Open-Platypus | 25k | STEM reasoning |
All data is formatted as:

```
System: You are a helpful, thoughtful, and articulate AI assistant.
User: <instruction>
Assistant: <response>
```
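The template above can be sketched as a small formatting helper. The role labels and system message come from the card; the exact newline separators and the function name are assumptions:

```python
# System prompt exactly as stated in the card.
SYSTEM = "You are a helpful, thoughtful, and articulate AI assistant."

def format_example(instruction: str, response: str) -> str:
    """Render one training example in the card's template.
    Newline separators are an assumption; only the labels are documented."""
    return (
        f"System: {SYSTEM}\n"
        f"User: {instruction}\n"
        f"Assistant: {response}"
    )
```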
## Inference Engine
This model ships with a custom C++ daemon (`inference.cpp`) — not transformers, not llama.cpp.
### Features
- AVX2 + FMA matrix-vector multiply (8 floats per instruction)
- AVX2 attention dot products and weighted V accumulation
- OpenMP parallelism across attention heads and matmul rows
- Persistent KV-cache per session — no recomputation on follow-up turns
- LRU eviction — up to 20 concurrent sessions, oldest evicted automatically
- Streaming protocol over stdin/stdout — FastAPI wraps as SSE
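The KV-cache and LRU eviction described above can be illustrated with a short sketch. The real engine implements this in C++; the Python below only demonstrates the policy, and all names are hypothetical:

```python
from collections import OrderedDict

MAX_SESSIONS = 20  # matches the engine's stated session cap

class SessionCache:
    """Illustrative LRU policy for per-session KV-caches.
    Names are hypothetical; the real engine does this in C++."""
    def __init__(self):
        self._sessions = OrderedDict()  # insertion order == recency order

    def get(self, session_id: str):
        # Touch on access so active sessions stay resident.
        kv = self._sessions.pop(session_id, None)
        if kv is None:
            kv = {}  # stand-in for a freshly allocated KV-cache
            if len(self._sessions) >= MAX_SESSIONS:
                self._sessions.popitem(last=False)  # evict oldest session
        self._sessions[session_id] = kv
        return kv
```

A follow-up turn with the same `session_id` hits the cache, so earlier tokens are never re-attended from scratch; only when a 21st distinct session arrives does the least-recently-used one get dropped.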
### Performance (HF Space T4)
| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | 40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |
### Compile

```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```
## Files
| File | Size | Description |
|---|---|---|
| `model.bin` | 765 MB | Raw float32 weights (custom binary format) |
| `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format |
### `model.bin` format

```
Header: [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte: [vocab_size × n_embd] float32
wpe: [block_size × n_embd] float32
Per layer (×16):
    ln1_w, ln1_b, c_attn_w, c_attn_b,
    c_proj_w, c_proj_b, ln2_w, ln2_b,
    mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```
## API

```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -d '{"session_id": "abc123"}'
```
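A stdlib-only client sketch for the streaming `/chat` endpoint. The request shape comes from the curl example above; the `data:` SSE framing is an assumption based on the card's mention of a FastAPI SSE wrapper, and both function names are mine:

```python
import json
import urllib.request

def parse_sse_line(line: str):
    """Return the payload of an SSE 'data:' line, or None for other lines."""
    return line[len("data:"):].strip() if line.startswith("data:") else None

def chat_stream(message: str, session_id: str, base="http://localhost:7860"):
    """Yield streamed chunks from POST /chat (endpoint shape per the card)."""
    req = urllib.request.Request(
        f"{base}/chat",
        data=json.dumps({"message": message,
                         "session_id": session_id}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            payload = parse_sse_line(raw.decode("utf-8").strip())
            if payload:
                yield payload
```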
## Known Limitations
- Reasoning: 152M parameters cannot chain multi-step logic. Expect factual recall and pattern matching, not reasoning.
- Hallucination: No RLHF/DPO — model will confidently say wrong things.
- Context: Hard limit of 1024 tokens (~750 words).
- Evaluation: Trained on loss minimization only. No MMLU, HellaSwag, or held-out eval set — a proper eval harness is the next planned addition.
- `serialize()` note: the `shape_hint` param is only used when `t=None` (the `bias=True` config). Would refactor the signature in v2.
- `vocab_size=50304`: GPT-2's actual vocab is 50,257, padded to 50,304 (the nearest multiple of 64) for memory alignment. A standard trick, undocumented in v1.
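The vocab padding in the note above is simple round-up arithmetic; a one-liner reproduces it (function name is mine):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple (here: GPT-2's 50,257 -> 50,304)."""
    return ((n + multiple - 1) // multiple) * multiple

print(pad_to_multiple(50_257))  # -> 50304
```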
## Citation

```bibtex
@misc{nanomind2025,
  author = {NOT-OMEGA},
  title = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```
Trained from scratch · Custom C++ engine · No frameworks at inference time