---
license: mit
language:
  - en
tags:
  - gpt2
  - causal-lm
  - text-generation
  - from-scratch
  - avx2
  - cpp-inference
  - kv-cache
pipeline_tag: text-generation
---

# NanoMind · 152M

A 152M-parameter GPT-2-style language model trained from scratch on GPT-4-quality instruction data, with a hand-written C++ inference engine featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.


## Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 152.83M |
| Layers | 16 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,304 (GPT-2 BPE) |
| Training steps | 9,800 |
| Final loss | ~1.73 |
| Effective batch | 96 (12 × 8 grad accum) |
| Optimizer | AdamW (weight decay 0.1, β₁=0.9, β₂=0.95) |
| LR schedule | 300-step warmup + cosine decay |
| Peak LR | 5e-4 |
| Hardware | Kaggle T4 GPU (~12 hours) |
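The warmup-plus-cosine schedule from the table can be sketched as follows. The minimum-LR floor is an assumption (a common choice is peak/10; the card only specifies the peak LR, warmup length, and step count), and the actual training code may differ:

```python
import math

def lr_at(step, peak_lr=5e-4, warmup=300, max_steps=9800, min_lr=5e-5):
    """Linear warmup to peak_lr, then cosine decay toward min_lr.

    min_lr is an assumed floor; the model card does not state one.
    """
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + coeff * (peak_lr - min_lr)
```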

## Training Data

~220M tokens from GPT-4-quality sources:

| Dataset | Samples | Quality |
|---|---|---|
| OpenHermes 2.5 | 500k | GPT-4 multi-turn |
| Alpaca GPT-4 | 52k | GPT-4 instruction |
| WizardLM Evol V2 | 143k | GPT-4 evolved |
| Open-Platypus | 25k | STEM reasoning |

All data is formatted as:

```
System: You are a helpful, thoughtful, and articulate AI assistant.
User: <instruction>
Assistant: <response>
```
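A minimal sketch of applying this template to one sample. The exact field separators and any end-of-text token handling are assumptions not specified by this card:

```python
SYSTEM = "You are a helpful, thoughtful, and articulate AI assistant."

def format_sample(instruction: str, response: str) -> str:
    """Render one training example in the System/User/Assistant template.

    Newline separators are assumed; the card only shows the three roles.
    """
    return (
        f"System: {SYSTEM}\n"
        f"User: {instruction}\n"
        f"Assistant: {response}"
    )
```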

## Inference Engine

This model ships with a custom C++ daemon (inference.cpp) — not transformers, not llama.cpp.

### Features

- AVX2 + FMA matrix–vector multiply (8 floats per 256-bit operation)
- AVX2 attention dot products and weighted V accumulation
- OpenMP parallelism across attention heads and matmul rows
- Persistent KV-cache per session — no recomputation on follow-up turns
- LRU eviction — up to 20 concurrent sessions; the oldest is evicted automatically
- Streaming protocol over stdin/stdout — a FastAPI layer exposes it as SSE
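The session-eviction policy above (the policy only, not the C++ implementation) can be sketched in Python; the class and method names here are illustrative:

```python
from collections import OrderedDict

MAX_SESSIONS = 20  # matches the limit stated above

class SessionCache:
    """LRU map from session_id to its KV-cache state.

    When the session count exceeds the limit, the least-recently-used
    session is evicted, as the feature list describes.
    """
    def __init__(self, max_sessions: int = MAX_SESSIONS):
        self.max_sessions = max_sessions
        self.sessions: OrderedDict = OrderedDict()

    def get(self, session_id):
        # Touch on access so active sessions stay resident.
        if session_id in self.sessions:
            self.sessions.move_to_end(session_id)
            return self.sessions[session_id]
        return None

    def put(self, session_id, kv_state):
        self.sessions[session_id] = kv_state
        self.sessions.move_to_end(session_id)
        if len(self.sessions) > self.max_sessions:
            self.sessions.popitem(last=False)  # evict oldest session
```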

### Performance (HF Space T4)

| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | ~40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |

### Compile

```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```

## Files

| File | Size | Description |
|---|---|---|
| model.bin | 765 MB | Raw float32 weights (custom binary format) |
| tokenizer.bin | 522 KB | GPT-2 BPE vocab in custom binary format |

### model.bin format

```
Header: [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte:    [vocab_size × n_embd]  float32
wpe:    [block_size × n_embd]  float32
Per layer (×16):
  ln1_w, ln1_b, c_attn_w, c_attn_b,
  c_proj_w, c_proj_b, ln2_w, ln2_b,
  mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```
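A minimal Python sketch of unpacking the 5-field header; little-endian byte order is an assumption (the engine targets x86, which is little-endian):

```python
import struct

def parse_header(raw: bytes) -> dict:
    """Unpack the 5 little-endian int32 header fields described above."""
    n_layer, n_head, n_embd, block_size, vocab_size = struct.unpack(
        "<5i", raw[:20]
    )
    return {
        "n_layer": n_layer,
        "n_head": n_head,
        "n_embd": n_embd,
        "block_size": block_size,
        "vocab_size": vocab_size,
    }

# Usage: with open("model.bin", "rb") as f: hdr = parse_header(f.read(20))
```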

## API

```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
```
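A client consuming the chat stream would parse standard `data:` lines from the SSE response. The JSON payload shape and the `[DONE]` sentinel below are assumptions, since this card does not specify the event format:

```python
import json

def parse_sse_line(line: str):
    """Extract the payload from one SSE 'data:' line, or None otherwise.

    Assumes the FastAPI wrapper emits standard 'data: <json>' events and
    a '[DONE]' sentinel; both are conventions, not documented behavior.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # comments, blank keep-alives, other fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)
```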

## Known Limitations

- Reasoning: 152M parameters cannot chain multi-step logic. Expect factual recall and pattern matching, not reasoning.
- Hallucination: no RLHF/DPO was applied, so the model will confidently state wrong things.
- Context: hard limit of 1024 tokens (~750 words).
- Evaluation: trained on loss minimization only. No MMLU, HellaSwag, or held-out eval set; a proper eval harness is the next planned addition.
- serialize() note: the shape_hint param is only used when t=None (bias=True config). The signature would be refactored in v2.
- vocab_size=50304: GPT-2's actual vocab is 50,257, padded to 50,304 (nearest multiple of 64) for memory alignment — a standard trick, undocumented in v1.
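The vocab padding described in the last point is plain round-up arithmetic:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (here: pad the vocab for alignment)."""
    return ((n + multiple - 1) // multiple) * multiple
```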

## Citation

```bibtex
@misc{nanomind2025,
  author       = {NOT-OMEGA},
  title        = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```

Trained from scratch · Custom C++ engine · No frameworks at inference time