---
license: mit
language:
  - en
tags:
  - gpt2
  - causal-lm
  - text-generation
  - from-scratch
  - avx2
  - cpp-inference
  - kv-cache
pipeline_tag: text-generation
---

# NanoMind · 152M

A 152M-parameter GPT-2-style language model trained from scratch on GPT-4-quality instruction data, with a hand-written C++ inference engine featuring AVX2 SIMD, OpenMP parallelism, and a persistent KV-cache.


## Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 152.83M |
| Layers | 16 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,304 (GPT-2 BPE) |
| Training steps | 9,800 |
| Final loss | ~1.73 |
| Effective batch | 96 (12 × 8 grad accum) |
| Optimizer | AdamW (weight decay 0.1, β₁=0.9, β₂=0.95) |
| LR schedule | 300-step warmup + cosine decay |
| Peak LR | 5e-4 |
| Hardware | Kaggle T4 GPU (~12 hours) |
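The warmup-plus-cosine schedule from the table can be sketched as follows. The minimum-LR floor is an assumption (a common choice is peak/10; the card only specifies the peak LR, warmup length, and step count), and the actual training code may differ:

```python
import math

def lr_at(step, peak_lr=5e-4, warmup=300, max_steps=9800, min_lr=5e-5):
    """Linear warmup to peak_lr, then cosine decay toward min_lr.

    min_lr is an assumed floor; the model card does not state one.
    """
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + coeff * (peak_lr - min_lr)
```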

## Training Data

~220M tokens from GPT-4-quality sources:

| Dataset | Samples | Quality |
|---|---|---|
| OpenHermes 2.5 | 500k | GPT-4 multi-turn |
| Alpaca GPT-4 | 52k | GPT-4 instruction |
| WizardLM Evol V2 | 143k | GPT-4 evolved |
| Open-Platypus | 25k | STEM reasoning |

All data is formatted as:

```
System: You are a helpful, thoughtful, and articulate AI assistant.
User: <instruction>
Assistant: <response>
```
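A minimal sketch of applying this template to one sample. The exact field separators and any end-of-text token handling are assumptions not specified by this card:

```python
SYSTEM = "You are a helpful, thoughtful, and articulate AI assistant."

def format_sample(instruction: str, response: str) -> str:
    """Render one training example in the System/User/Assistant template.

    Newline separators are assumed; the card only shows the three roles.
    """
    return (
        f"System: {SYSTEM}\n"
        f"User: {instruction}\n"
        f"Assistant: {response}"
    )
```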

## Inference Engine

This model ships with a custom C++ daemon (inference.cpp) — not transformers, not llama.cpp.

### Features

- AVX2 + FMA matrix–vector multiply (8 floats per 256-bit operation)
- AVX2 attention dot products and weighted V accumulation
- OpenMP parallelism across attention heads and matmul rows
- Persistent KV-cache per session — no recomputation on follow-up turns
- LRU eviction — up to 20 concurrent sessions; the oldest is evicted automatically
- Streaming protocol over stdin/stdout — a FastAPI layer exposes it as SSE
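The session-eviction policy above (the policy only, not the C++ implementation) can be sketched in Python; the class and method names here are illustrative:

```python
from collections import OrderedDict

MAX_SESSIONS = 20  # matches the limit stated above

class SessionCache:
    """LRU map from session_id to its KV-cache state.

    When the session count exceeds the limit, the least-recently-used
    session is evicted, as the feature list describes.
    """
    def __init__(self, max_sessions: int = MAX_SESSIONS):
        self.max_sessions = max_sessions
        self.sessions: OrderedDict = OrderedDict()

    def get(self, session_id):
        # Touch on access so active sessions stay resident.
        if session_id in self.sessions:
            self.sessions.move_to_end(session_id)
            return self.sessions[session_id]
        return None

    def put(self, session_id, kv_state):
        self.sessions[session_id] = kv_state
        self.sessions.move_to_end(session_id)
        if len(self.sessions) > self.max_sessions:
            self.sessions.popitem(last=False)  # evict oldest session
```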

### Performance (HF Space T4)

| Mode | Engines | OMP threads | Throughput |
|---|---|---|---|
| Speed (default) | 1 | 2 | ~40+ tok/s |
| Multi-user | 4 | 1 | ~35 tok/s × 4 users |

### Compile

```bash
g++ -O3 -march=native -fopenmp -ffast-math -std=c++17 \
    -o inference inference.cpp -lm
```

## Files

| File | Size | Description |
|---|---|---|
| model.bin | 765 MB | Raw float32 weights (custom binary format) |
| tokenizer.bin | 522 KB | GPT-2 BPE vocab in custom binary format |

### model.bin format

```
Header: [n_layer, n_head, n_embd, block_size, vocab_size]  (5 × int32)
wte:    [vocab_size × n_embd]  float32
wpe:    [block_size × n_embd]  float32
Per layer (×16):
  ln1_w, ln1_b, c_attn_w, c_attn_b,
  c_proj_w, c_proj_b, ln2_w, ln2_b,
  mlp_fc_w, mlp_fc_b, mlp_proj_w, mlp_proj_b
ln_f_w, ln_f_b, lm_head_w
```
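A minimal Python sketch of unpacking the 5-field header; little-endian byte order is an assumption (the engine targets x86, which is little-endian):

```python
import struct

def parse_header(raw: bytes) -> dict:
    """Unpack the 5 little-endian int32 header fields described above."""
    n_layer, n_head, n_embd, block_size, vocab_size = struct.unpack(
        "<5i", raw[:20]
    )
    return {
        "n_layer": n_layer,
        "n_head": n_head,
        "n_embd": n_embd,
        "block_size": block_size,
        "vocab_size": vocab_size,
    }

# Usage: with open("model.bin", "rb") as f: hdr = parse_header(f.read(20))
```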

## API

```bash
# Chat (streaming SSE)
curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "abc123"}'

# Health
curl http://localhost:7860/health

# Metrics
curl http://localhost:7860/metrics

# Reset session
curl -X POST http://localhost:7860/chat/reset \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
```
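A client consuming the chat stream would parse standard `data:` lines from the SSE response. The JSON payload shape and the `[DONE]` sentinel below are assumptions, since this card does not specify the event format:

```python
import json

def parse_sse_line(line: str):
    """Extract the payload from one SSE 'data:' line, or None otherwise.

    Assumes the FastAPI wrapper emits standard 'data: <json>' events and
    a '[DONE]' sentinel; both are conventions, not documented behavior.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # comments, blank keep-alives, other fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)
```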

## Known Limitations

- Reasoning: 152M parameters cannot chain multi-step logic. Expect factual recall and pattern matching, not reasoning.
- Hallucination: no RLHF/DPO was applied, so the model will confidently state wrong things.
- Context: hard limit of 1024 tokens (~750 words).
- Evaluation: trained on loss minimization only. No MMLU, HellaSwag, or held-out eval set; a proper eval harness is the next planned addition.
- serialize() note: the shape_hint param is only used when t=None (bias=True config). The signature would be refactored in v2.
- vocab_size=50304: GPT-2's actual vocab is 50,257, padded to 50,304 (nearest multiple of 64) for memory alignment — a standard trick, undocumented in v1.
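The vocab padding described in the last point is plain round-up arithmetic:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (here: pad the vocab for alignment)."""
    return ((n + multiple - 1) // multiple) * multiple
```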

## Citation

```bibtex
@misc{nanomind2025,
  author       = {NOT-OMEGA},
  title        = {NanoMind: A 152M GPT-2 Model with Custom C++ Inference},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/NOT-OMEGA/NanoMind}},
}
```

Trained from scratch · Custom C++ engine · No frameworks at inference time