---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- avx2
- cpp-inference
- kv-cache
pipeline_tag: text-generation
---
| # NanoMind · 152M |
|
|
> A 152M-parameter, GPT-2-style language model trained **from scratch** on GPT-4-quality instruction data, with a hand-written **C++ inference engine** featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Architecture** | GPT-2 (decoder-only transformer) | |
| | **Parameters** | 152.83M | |
| | **Layers** | 16 | |
| | **Attention heads** | 12 | |
| | **Embedding dim** | 768 | |
| | **Context length** | 1024 tokens | |
| | **Vocab size** | 50,304 (GPT-2 BPE) | |
| | **Training steps** | 9,800 | |
| | **Final loss** | ~1.73 | |
| | **Effective batch** | 96 (12 × 8 grad accum) | |
| | **Optimizer** | AdamW (weight decay 0.1, β=0.9/0.95) | |
| | **LR schedule** | Warmup 300 steps + cosine decay | |
| | **Peak LR** | 5e-4 | |
| | **Hardware** | Kaggle T4 GPU (~12 hours) | |
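
The 152.83M total is consistent with these hyperparameters if the token embedding is tied to the output head; untied, the count rises to ~191M, which in float32 is what the 765 MB `model.bin` (which stores `lm_head_w` separately) works out to. A quick sanity check, assuming tied weights:

```cpp
#include <cstdint>

// Parameter count for a GPT-2 style decoder with tied input/output embeddings.
int64_t gpt2_params(int64_t n_layer, int64_t n_embd, int64_t vocab, int64_t block) {
    int64_t emb  = vocab * n_embd + block * n_embd;          // wte + wpe
    int64_t attn = (n_embd * 3 * n_embd + 3 * n_embd)        // c_attn (QKV)
                 + (n_embd * n_embd + n_embd);               // c_proj
    int64_t mlp  = (n_embd * 4 * n_embd + 4 * n_embd)        // mlp_fc
                 + (4 * n_embd * n_embd + n_embd);           // mlp_proj
    int64_t ln   = 2 * 2 * n_embd;                           // ln1 + ln2 (weight + bias)
    return emb + n_layer * (attn + mlp + ln) + 2 * n_embd;   // + final ln_f
}
// gpt2_params(16, 768, 50304, 1024) == 152,827,392 ≈ 152.83M
```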
|
|
| --- |
|
|
| ## Training Data |
|
|
~220M tokens drawn from GPT-4-quality sources:
|
|
| | Dataset | Samples | Quality | |
| |---|---|---| |
| | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 500k | GPT-4 multi-turn | |
| | [Alpaca GPT-4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) | 52k | GPT-4 instruction | |
| | [WizardLM Evol V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143k | GPT-4 evolved | |
| | [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | 25k | STEM reasoning | |
|
|
| All data formatted as: |
| ``` |
| System: You are a helpful, thoughtful, and articulate AI assistant. |
| User: <instruction> |
| Assistant: <response> |
| ``` |
|
|
| --- |
|
|
| ## Inference Engine |
|
|
| This model ships with a **custom C++ daemon** (`inference.cpp`) — not transformers, not llama.cpp. |
|
|
| ### Features |
- **AVX2 + FMA** matrix-vector multiply (8 multiply-adds per FMA instruction)
| - **AVX2** attention dot products and weighted V accumulation |
| - **OpenMP** parallelism across attention heads and matmul rows |
| - **Persistent KV-cache** per session — no recomputation on follow-up turns |
| - **LRU eviction** — up to 20 concurrent sessions, oldest evicted automatically |
| - **Streaming protocol** over stdin/stdout — FastAPI wraps as SSE |
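
The AVX2 + FMA matvec is the performance-critical piece. A minimal sketch of such a kernel, in the same spirit as the engine (illustrative, not the actual `inference.cpp` code):

```cpp
#include <immintrin.h>
#include <cstddef>

// y = W·x for a row-major [rows × cols] matrix: one dot product per row,
// 8 floats per FMA, rows distributed across OpenMP threads.
__attribute__((target("avx2,fma")))
void matvec(const float* W, const float* x, float* y, int rows, int cols) {
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r) {
        const float* w = W + (size_t)r * cols;
        __m256 acc = _mm256_setzero_ps();
        int c = 0;
        for (; c + 8 <= cols; c += 8)                    // vectorized body
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(w + c),
                                  _mm256_loadu_ps(x + c), acc);
        // horizontal sum of the 8 accumulator lanes
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                               _mm256_extractf128_ps(acc, 1));
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        float sum = _mm_cvtss_f32(lo);
        for (; c < cols; ++c) sum += w[c] * x[c];        // scalar tail
        y[r] = sum;
    }
}
```

With the flags in the Compile section below, `-march=native` enables these intrinsics globally, so the explicit `target` attribute is only needed when compiling without it.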
|
|
| ### Performance (HF Space T4) |
| | Mode | Engines | OMP threads | Throughput | |
| |---|---|---|---| |
| Speed (default) | 1 | 2 | 40+ tok/s |
| | Multi-user | 4 | 1 | ~35 tok/s × 4 users | |
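
The per-session KV-cache with LRU eviction described above can be sketched as a keyed pool (hypothetical structure and names, assuming the 20-session cap):

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative per-session state: K/V tensors grown one token at a time,
// so follow-up turns skip recomputing earlier positions.
struct Session {
    std::vector<float> k_cache, v_cache;  // [layer × position × head_dim]
    int n_past = 0;                       // tokens already in the cache
};

class SessionPool {
    size_t cap_;
    std::list<std::string> lru_;          // front = most recently used
    std::unordered_map<std::string,
        std::pair<Session, std::list<std::string>::iterator>> map_;
public:
    explicit SessionPool(size_t cap) : cap_(cap) {}
    Session& get(const std::string& id) {
        auto it = map_.find(id);
        if (it != map_.end()) {           // hit: refresh recency
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return it->second.first;
        }
        if (map_.size() >= cap_) {        // full: evict least-recent session
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(id);
        return map_.emplace(id,
            std::make_pair(Session{}, lru_.begin())).first->second.first;
    }
    size_t size() const { return map_.size(); }
};
```

A real engine would also release the evicted cache memory; here eviction is just the map/list bookkeeping.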
|
|
| ### Compile |
| ```bash |
| g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \ |
| -o inference inference.cpp -lm |
| ``` |
|
|
| --- |
|
|
| ## Files |
|
|
| | File | Size | Description | |
| |---|---|---| |
| | `model.bin` | 765 MB | Raw float32 weights (custom binary format) | |
| | `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format | |
|
|
| ### model.bin format |
| ``` |
| Header: [n_layer, n_head, n_embd, block_size, vocab_size] (5 × int32) |
| wte: [vocab_size × n_embd] float32 |
| wpe: [block_size × n_embd] float32 |
| Per layer (×16): |
| ln1_w, ln1_b, c_attn_w, c_attn_b, |
| c_proj_w, c_proj_b, ln2_w, ln2_b, |
| mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b |
| ln_f_w, ln_f_b, lm_head_w |
| ``` |
|
|
| --- |
|
|
| ## API |
|
|
```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
```
|
|
| --- |
|
|
| ## Known Limitations |
|
|
| - **Reasoning:** 152M parameters cannot chain multi-step logic. |
| Expect factual recall and pattern matching, not reasoning. |
| - **Hallucination:** No RLHF/DPO — model will confidently say wrong things. |
| - **Context:** Hard limit of 1024 tokens (~750 words). |
| - **Evaluation:** Trained on loss minimization only. |
| No MMLU, HellaSwag, or held-out eval set — a proper |
| eval harness is the next planned addition. |
- **`serialize()` note:** the `shape_hint` parameter is only used when
  `t=None` (the bias=True config). The signature is slated for a refactor in v2.
- **vocab_size=50304:** GPT-2's actual vocab is 50,257, padded here to
  50,304 (the nearest multiple of 64) for memory
  alignment — standard trick, undocumented in v1.
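
The round-up-to-a-multiple-of-64 padding mentioned above can be expressed as:

```cpp
// Round v up to the nearest multiple of m: 50,257 → 50,304 when m = 64.
constexpr int round_up(int v, int m) { return (v + m - 1) / m * m; }

static_assert(round_up(50257, 64) == 50304, "padded GPT-2 vocab");
static_assert(50304 % 64 == 0, "aligned for wide SIMD loads");
```
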
---

## Citation

| ```bibtex |
| @misc{nanomind2025, |
| author = {NOT-OMEGA}, |
| title = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference}, |
| year = {2025}, |
| howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}}, |
| } |
| ``` |

---

*Trained from scratch · Custom C++ engine · No frameworks at inference time*