---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- avx2
- cpp-inference
- kv-cache
pipeline_tag: text-generation
---
# NanoMind · 152M
> A 152M-parameter GPT-2-style language model trained **from scratch** on GPT-4-quality instruction data, with a hand-written **C++ inference engine** featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | GPT-2 (decoder-only transformer) |
| **Parameters** | 152.83M |
| **Layers** | 16 |
| **Attention heads** | 12 |
| **Embedding dim** | 768 |
| **Context length** | 1024 tokens |
| **Vocab size** | 50,304 (GPT-2 BPE) |
| **Training steps** | 9,800 |
| **Final loss** | ~1.73 |
| **Effective batch** | 96 (micro-batch 12 × grad accum 8) |
| **Optimizer** | AdamW (weight decay 0.1, β₁=0.9, β₂=0.95) |
| **LR schedule** | Warmup 300 steps + cosine decay |
| **Peak LR** | 5e-4 |
| **Hardware** | Kaggle T4 GPU (~12 hours) |
---
## Training Data
~220M tokens from GPT-4-quality sources:
| Dataset | Samples | Quality |
|---|---|---|
| [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 500k | GPT-4 multi-turn |
| [Alpaca GPT-4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) | 52k | GPT-4 instruction |
| [WizardLM Evol V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143k | GPT-4 evolved |
| [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | 25k | STEM reasoning |
All data was formatted with a single template:
```
System: You are a helpful, thoughtful, and articulate AI assistant.
User: <instruction>
Assistant: <response>
```
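At inference time the same template has to be reproduced exactly, or the model sees a distribution it was never trained on. A minimal sketch of such a formatter (`format_prompt` is a hypothetical helper, not part of the shipped engine):

```cpp
#include <string>

// Render one user turn in the training template shown above.
// (format_prompt is an illustrative helper, not part of the engine.)
std::string format_prompt(const std::string& instruction) {
    return "System: You are a helpful, thoughtful, and articulate AI assistant.\n"
           "User: " + instruction + "\n"
           "Assistant:";
}
```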
---
## Inference Engine
This model ships with a **custom C++ daemon** (`inference.cpp`) — not transformers, not llama.cpp.
### Features
- **AVX2 + FMA** matrix-vector multiply (8 fused multiply-adds per instruction)
- **AVX2** attention dot products and weighted V accumulation
- **OpenMP** parallelism across attention heads and matmul rows
- **Persistent KV-cache** per session — no recomputation on follow-up turns
- **LRU eviction** — up to 20 concurrent sessions, oldest evicted automatically
- **Streaming protocol** over stdin/stdout — a FastAPI layer wraps it as SSE
### Performance (HF Space T4)
| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | 40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |
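The persistent KV-cache with LRU eviction listed under Features can be sketched roughly like this (illustrative types; the real engine's cache layout may differ):

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Per-session cache: keys/values already computed for earlier turns,
// so a follow-up request only runs its new tokens through the model.
struct Session {
    std::vector<float> k_cache, v_cache;
    int n_cached_tokens = 0;
};

// Most recently used session sits at the front; evict from the back.
class SessionStore {
    static constexpr std::size_t kMaxSessions = 20;
    using Entry = std::pair<std::string, Session>;
    std::list<Entry> lru_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
public:
    Session& get(const std::string& id) {
        auto it = index_.find(id);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // refresh recency
            return it->second->second;
        }
        if (lru_.size() >= kMaxSessions) {  // evict least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(id, Session{});
        index_[id] = lru_.begin();
        return lru_.front().second;
    }
    std::size_t size() const { return lru_.size(); }
};
```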
### Compile
```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```
---
## Files
| File | Size | Description |
|---|---|---|
| `model.bin` | 765 MB | Raw float32 weights (custom binary format) |
| `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format |
### model.bin format
```
Header:  [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte:     [vocab_size × n_embd] float32
wpe:     [block_size × n_embd] float32
Per layer (×16):
    ln1_w, ln1_b, c_attn_w, c_attn_b,
    c_proj_w, c_proj_b, ln2_w, ln2_b,
    mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```
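A loader for the five-int32 header might look like this (a sketch; assumes a little-endian host, matching the x86 machine that wrote the file):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// First 20 bytes of model.bin, in the documented field order.
struct ModelHeader {
    int32_t n_layer, n_head, n_embd, block_size, vocab_size;
};

// Read the header; returns false if the file is missing or truncated.
// Assumes little-endian byte order (as written on x86).
bool read_header(const char* path, ModelHeader* h) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::size_t n = std::fread(h, sizeof(int32_t), 5, f);
    std::fclose(f);
    return n == 5;  // all five fields present
}
```

For this model the header should decode to `(16, 12, 768, 1024, 50304)`.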
---
## API
```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is machine learning?", "session_id": "abc123"}'
# Health
curl http://localhost:7860/health
# Metrics
curl http://localhost:7860/metrics
# Reset session
curl -X POST http://localhost:7860/chat/reset \
-d '{"session_id": "abc123"}'
```
---
## Known Limitations
- **Reasoning:** 152M parameters cannot chain multi-step logic.
Expect factual recall and pattern matching, not reasoning.
- **Hallucination:** No RLHF/DPO — model will confidently say wrong things.
- **Context:** Hard limit of 1024 tokens (~750 words).
- **Evaluation:** Trained on loss minimization only.
No MMLU, HellaSwag, or held-out eval set — a proper
eval harness is the next planned addition.
- **serialize() note:** the `shape_hint` parameter is only read when
  `t=None` (the bias=True config); the signature will be refactored in v2.
- **vocab_size=50304:** GPT-2's actual vocab is 50,257.
  Padded up to 50,304 (the next multiple of 64) for memory
  alignment — a standard trick, undocumented in v1.
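The padding above is a round-up to the next multiple of 64:

```cpp
// Round v up to the next multiple of 64 (keeps weight rows aligned).
int pad_to_64(int v) { return (v + 63) / 64 * 64; }
// pad_to_64(50257) -> 50304
```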
---
## Citation
```bibtex
@misc{nanomind2025,
  author       = {NOT-OMEGA},
  title        = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```
---
*Trained from scratch · Custom C++ engine · No frameworks at inference time*