---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- avx2
- cpp-inference
- kv-cache
pipeline_tag: text-generation
---

# NanoMind · 152M

> A 152M-parameter GPT-2-style language model trained **from scratch** on GPT-4-quality instruction data, with a hand-written **C++ inference engine** featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.

---

## Model Details

| Property | Value |
|---|---|
| **Architecture** | GPT-2 (decoder-only transformer) |
| **Parameters** | 152.83M |
| **Layers** | 16 |
| **Attention heads** | 12 |
| **Embedding dim** | 768 |
| **Context length** | 1024 tokens |
| **Vocab size** | 50,304 (GPT-2 BPE) |
| **Training steps** | 9,800 |
| **Final loss** | ~1.73 |
| **Effective batch** | 96 (12 × 8 grad accum) |
| **Optimizer** | AdamW (weight decay 0.1, β₁ = 0.9, β₂ = 0.95) |
| **LR schedule** | 300-step warmup + cosine decay |
| **Peak LR** | 5e-4 |
| **Hardware** | Kaggle T4 GPU (~12 hours) |

---

## Training Data

~220M tokens from GPT-4-quality sources:

| Dataset | Samples | Quality |
|---|---|---|
| [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 500k | GPT-4 multi-turn |
| [Alpaca GPT-4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) | 52k | GPT-4 instruction |
| [WizardLM Evol V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143k | GPT-4 evolved |
| [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | 25k | STEM reasoning |

All data is formatted as:

```
System: You are a helpful, thoughtful, and articulate AI assistant.
User:
Assistant:
```

---

## Inference Engine

This model ships with a **custom C++ daemon** (`inference.cpp`), not transformers and not llama.cpp.
### Features

- **AVX2 + FMA** matrix-vector multiply (8 fp32 lanes per FMA instruction)
- **AVX2** attention dot products and weighted V accumulation
- **OpenMP** parallelism across attention heads and matmul rows
- **Persistent KV-cache** per session: no recomputation on follow-up turns
- **LRU eviction**: up to 20 concurrent sessions; the oldest is evicted automatically
- **Streaming protocol** over stdin/stdout; FastAPI wraps it as SSE

### Performance (HF Space T4)

| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | ~40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |

### Compile

```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```

---

## Files

| File | Size | Description |
|---|---|---|
| `model.bin` | 765 MB | Raw float32 weights (custom binary format) |
| `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format |

### model.bin format

```
Header: [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte: [vocab_size × n_embd] float32
wpe: [block_size × n_embd] float32
Per layer (×16):
  ln1_w, ln1_b, c_attn_w, c_attn_b, c_proj_w, c_proj_b,
  ln2_w, ln2_b, mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```

---

## API

```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -d '{"session_id": "abc123"}'
```

---

## Known Limitations

- **Reasoning:** 152M parameters cannot chain multi-step logic. Expect factual recall and pattern matching, not reasoning.
- **Hallucination:** No RLHF/DPO; the model will confidently state wrong things.
- **Context:** Hard limit of 1024 tokens (~750 words).
- **Evaluation:** Trained on loss minimization only.
No MMLU, HellaSwag, or held-out eval set; a proper eval harness is the next planned addition.
- **serialize() note:** the `shape_hint` parameter is only used when `t=None` (the bias=True config). The signature is slated for a refactor in v2.
- **vocab_size=50304:** GPT-2's actual vocab is 50,257, padded to 50,304 (the nearest multiple of 64) for memory alignment. This is a standard trick, but it went undocumented in v1.

---

## Citation

```bibtex
@misc{nanomind2025,
  author = {NOT-OMEGA},
  title = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```

---

*Trained from scratch · Custom C++ engine · No frameworks at inference time*