---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- avx2
- cpp-inference
- kv-cache
pipeline_tag: text-generation
---

# NanoMind · 152M

> A 152M-parameter GPT-2-style language model trained **from scratch** on GPT-4-quality instruction data, with a hand-written **C++ inference engine** featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.

---

## Model Details

| Property | Value |
|---|---|
| **Architecture** | GPT-2 (decoder-only transformer) |
| **Parameters** | 152.83M |
| **Layers** | 16 |
| **Attention heads** | 12 |
| **Embedding dim** | 768 |
| **Context length** | 1024 tokens |
| **Vocab size** | 50,304 (GPT-2 BPE) |
| **Training steps** | 9,800 |
| **Final loss** | ~1.73 |
| **Effective batch** | 96 (12 × 8 grad accum) |
| **Optimizer** | AdamW (weight decay 0.1, β₁ = 0.9, β₂ = 0.95) |
| **LR schedule** | 300-step warmup + cosine decay |
| **Peak LR** | 5e-4 |
| **Hardware** | Kaggle T4 GPU (~12 hours) |

---

## Training Data

~220M tokens from GPT-4-quality sources:

| Dataset | Samples | Quality |
|---|---|---|
| [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 500k | GPT-4 multi-turn |
| [Alpaca GPT-4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) | 52k | GPT-4 instruction |
| [WizardLM Evol V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143k | GPT-4 evolved |
| [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | 25k | STEM reasoning |

All data is formatted as:

```
System: You are a helpful, thoughtful, and articulate AI assistant.
User:
Assistant:
```

---

## Inference Engine

This model ships with a **custom C++ daemon** (`inference.cpp`), not transformers and not llama.cpp.
### Features

- **AVX2 + FMA** matrix-vector multiply (8 fp32 lanes per FMA instruction)
- **AVX2** attention dot products and weighted V accumulation
- **OpenMP** parallelism across attention heads and matmul rows
- **Persistent KV-cache** per session: no recomputation on follow-up turns
- **LRU eviction**: up to 20 concurrent sessions; the oldest is evicted automatically
- **Streaming protocol** over stdin/stdout; FastAPI wraps it as SSE

### Performance (HF Space T4)

| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | ~40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |

### Compile

```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```

---

## Files

| File | Size | Description |
|---|---|---|
| `model.bin` | 765 MB | Raw float32 weights (custom binary format) |
| `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format |

### model.bin format

```
Header: [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte: [vocab_size × n_embd] float32
wpe: [block_size × n_embd] float32
Per layer (×16):
  ln1_w, ln1_b, c_attn_w, c_attn_b, c_proj_w, c_proj_b,
  ln2_w, ln2_b, mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```

---

## API

```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -d '{"session_id": "abc123"}'
```

---

## Known Limitations

- **Reasoning:** 152M parameters cannot chain multi-step logic. Expect factual recall and pattern matching, not reasoning.
- **Hallucination:** No RLHF/DPO; the model will confidently state wrong things.
- **Context:** Hard limit of 1024 tokens (~750 words).
- **Evaluation:** Trained on loss minimization only.
No MMLU, HellaSwag, or held-out eval set; a proper eval harness is the next planned addition.
- **serialize() note:** the `shape_hint` parameter is only used when `t=None` (the bias=True config). The signature is slated for a refactor in v2.
- **vocab_size=50304:** GPT-2's actual vocab is 50,257, padded to 50,304 (the nearest multiple of 64) for memory alignment. This is a standard trick, but it went undocumented in v1.

---

## Citation

```bibtex
@misc{nanomind2025,
  author = {NOT-OMEGA},
  title = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```

---

*Trained from scratch · Custom C++ engine · No frameworks at inference time*