---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- avx2
- cpp-inference
- kv-cache
pipeline_tag: text-generation
---
| # NanoMind · 152M |
|
|
> A 152M-parameter, GPT-2-style language model trained **from scratch** on GPT-4-quality instruction data, with a hand-written **C++ inference engine** featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Architecture** | GPT-2 (decoder-only transformer) | |
| | **Parameters** | 152.83M | |
| | **Layers** | 16 | |
| | **Attention heads** | 12 | |
| | **Embedding dim** | 768 | |
| | **Context length** | 1024 tokens | |
| | **Vocab size** | 50,304 (GPT-2 BPE) | |
| | **Training steps** | 9,800 | |
| | **Final loss** | ~1.73 | |
| | **Effective batch** | 96 (12 × 8 grad accum) | |
| | **Optimizer** | AdamW (weight decay 0.1, β=0.9/0.95) | |
| | **LR schedule** | Warmup 300 steps + cosine decay | |
| | **Peak LR** | 5e-4 | |
| | **Hardware** | Kaggle T4 GPU (~12 hours) | |
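
The 152.83M total is consistent with these hyperparameters if the token embedding is tied to the output head; untied, the count rises to ~191M, which in float32 is what the 765 MB `model.bin` (which stores `lm_head_w` separately) works out to. A quick sanity check, assuming tied weights:

```cpp
#include <cstdint>

// Parameter count for a GPT-2 style decoder with tied input/output embeddings.
int64_t gpt2_params(int64_t n_layer, int64_t n_embd, int64_t vocab, int64_t block) {
    int64_t emb  = vocab * n_embd + block * n_embd;          // wte + wpe
    int64_t attn = (n_embd * 3 * n_embd + 3 * n_embd)        // c_attn (QKV)
                 + (n_embd * n_embd + n_embd);               // c_proj
    int64_t mlp  = (n_embd * 4 * n_embd + 4 * n_embd)        // mlp_fc
                 + (4 * n_embd * n_embd + n_embd);           // mlp_proj
    int64_t ln   = 2 * 2 * n_embd;                           // ln1 + ln2 (weight + bias)
    return emb + n_layer * (attn + mlp + ln) + 2 * n_embd;   // + final ln_f
}
// gpt2_params(16, 768, 50304, 1024) == 152,827,392 ≈ 152.83M
```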
|
|
| --- |
|
|
| ## Training Data |
|
|
~220M tokens drawn from GPT-4-quality sources:
|
|
| | Dataset | Samples | Quality | |
| |---|---|---| |
| | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 500k | GPT-4 multi-turn | |
| | [Alpaca GPT-4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) | 52k | GPT-4 instruction | |
| | [WizardLM Evol V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143k | GPT-4 evolved | |
| | [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | 25k | STEM reasoning | |
|
|
| All data formatted as: |
| ``` |
| System: You are a helpful, thoughtful, and articulate AI assistant. |
| User: <instruction> |
| Assistant: <response> |
| ``` |
|
|
| --- |
|
|
| ## Inference Engine |
|
|
| This model ships with a **custom C++ daemon** (`inference.cpp`) — not transformers, not llama.cpp. |
|
|
| ### Features |
- **AVX2 + FMA** matrix-vector multiply (8 multiply-adds per FMA instruction)
| - **AVX2** attention dot products and weighted V accumulation |
| - **OpenMP** parallelism across attention heads and matmul rows |
| - **Persistent KV-cache** per session — no recomputation on follow-up turns |
| - **LRU eviction** — up to 20 concurrent sessions, oldest evicted automatically |
| - **Streaming protocol** over stdin/stdout — FastAPI wraps as SSE |
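
The AVX2 + FMA matvec is the performance-critical piece. A minimal sketch of such a kernel, in the same spirit as the engine (illustrative, not the actual `inference.cpp` code):

```cpp
#include <immintrin.h>
#include <cstddef>

// y = W·x for a row-major [rows × cols] matrix: one dot product per row,
// 8 floats per FMA, rows distributed across OpenMP threads.
__attribute__((target("avx2,fma")))
void matvec(const float* W, const float* x, float* y, int rows, int cols) {
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r) {
        const float* w = W + (size_t)r * cols;
        __m256 acc = _mm256_setzero_ps();
        int c = 0;
        for (; c + 8 <= cols; c += 8)                    // vectorized body
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(w + c),
                                  _mm256_loadu_ps(x + c), acc);
        // horizontal sum of the 8 accumulator lanes
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                               _mm256_extractf128_ps(acc, 1));
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        float sum = _mm_cvtss_f32(lo);
        for (; c < cols; ++c) sum += w[c] * x[c];        // scalar tail
        y[r] = sum;
    }
}
```

With the flags in the Compile section below, `-march=native` enables these intrinsics globally, so the explicit `target` attribute is only needed when compiling without it.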
|
|
| ### Performance (HF Space T4) |
| | Mode | Engines | OMP threads | Throughput | |
| |---|---|---|---| |
| Speed (default) | 1 | 2 | 40+ tok/s |
| | Multi-user | 4 | 1 | ~35 tok/s × 4 users | |
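
The per-session KV-cache with LRU eviction described above can be sketched as a keyed pool (hypothetical structure and names, assuming the 20-session cap):

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative per-session state: K/V tensors grown one token at a time,
// so follow-up turns skip recomputing earlier positions.
struct Session {
    std::vector<float> k_cache, v_cache;  // [layer × position × head_dim]
    int n_past = 0;                       // tokens already in the cache
};

class SessionPool {
    size_t cap_;
    std::list<std::string> lru_;          // front = most recently used
    std::unordered_map<std::string,
        std::pair<Session, std::list<std::string>::iterator>> map_;
public:
    explicit SessionPool(size_t cap) : cap_(cap) {}
    Session& get(const std::string& id) {
        auto it = map_.find(id);
        if (it != map_.end()) {           // hit: refresh recency
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return it->second.first;
        }
        if (map_.size() >= cap_) {        // full: evict least-recent session
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(id);
        return map_.emplace(id,
            std::make_pair(Session{}, lru_.begin())).first->second.first;
    }
    size_t size() const { return map_.size(); }
};
```

A real engine would also release the evicted cache memory; here eviction is just the map/list bookkeeping.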
|
|
| ### Compile |
| ```bash |
| g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \ |
| -o inference inference.cpp -lm |
| ``` |
|
|
| --- |
|
|
| ## Files |
|
|
| | File | Size | Description | |
| |---|---|---| |
| | `model.bin` | 765 MB | Raw float32 weights (custom binary format) | |
| | `tokenizer.bin` | 522 KB | GPT-2 BPE vocab in custom binary format | |
|
|
| ### model.bin format |
| ``` |
| Header: [n_layer, n_head, n_embd, block_size, vocab_size] (5 × int32) |
| wte: [vocab_size × n_embd] float32 |
| wpe: [block_size × n_embd] float32 |
| Per layer (×16): |
| ln1_w, ln1_b, c_attn_w, c_attn_b, |
| c_proj_w, c_proj_b, ln2_w, ln2_b, |
| mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b |
| ln_f_w, ln_f_b, lm_head_w |
| ``` |
|
|
| --- |
|
|
| ## API |
|
|
```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
```
|
|
| --- |
|
|
| ## Known Limitations |
|
|
| - **Reasoning:** 152M parameters cannot chain multi-step logic. |
| Expect factual recall and pattern matching, not reasoning. |
| - **Hallucination:** No RLHF/DPO — model will confidently say wrong things. |
| - **Context:** Hard limit of 1024 tokens (~750 words). |
| - **Evaluation:** Trained on loss minimization only. |
| No MMLU, HellaSwag, or held-out eval set — a proper |
| eval harness is the next planned addition. |
- **`serialize()` note:** the `shape_hint` parameter is only used when
  `t=None` (the bias=True config). The signature is slated for a refactor in v2.
- **vocab_size=50304:** GPT-2's actual vocab is 50,257, padded here to
  50,304 (the nearest multiple of 64) for memory
  alignment — standard trick, undocumented in v1.
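
The round-up-to-a-multiple-of-64 padding mentioned above can be expressed as:

```cpp
// Round v up to the nearest multiple of m: 50,257 → 50,304 when m = 64.
constexpr int round_up(int v, int m) { return (v + m - 1) / m * m; }

static_assert(round_up(50257, 64) == 50304, "padded GPT-2 vocab");
static_assert(50304 % 64 == 0, "aligned for wide SIMD loads");
```
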
---

## Citation

| ```bibtex |
| @misc{nanomind2025, |
| author = {NOT-OMEGA}, |
| title = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference}, |
| year = {2025}, |
| howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}}, |
| } |
| ``` |

---

*Trained from scratch · Custom C++ engine · No frameworks at inference time*