grapheneaffiliates committed on
Commit efa7086 · verified · 1 Parent(s): e3012a4

Upload PROJECT_OLYMPUS.md with huggingface_hub

Files changed (1):
  1. PROJECT_OLYMPUS.md +155 -118

PROJECT_OLYMPUS.md CHANGED
@@ -1,27 +1,21 @@
  # Project Olympus: Frontier-Quality AI on CPU
 
  ## Goal
- Build a system that approaches frontier model quality (Claude Opus, GPT-4 class)
- running entirely on CPU hardware, using only legally clean open-source models and data.
- No GPU. No API dependency. No monthly cost. No legal risk.
 
- **This is for the billions of people who can't afford frontier AI subscriptions and GPU compute.** Good-enough answers on free hardware beat perfect answers on expensive hardware for education, small business, developing nations, and anyone who values privacy and independence.
 
  ## The Core Insight
 
- Claude Opus is one giant model that memorizes everything in its weights.
- We build many small specialists that know their domain deeply and retrieve
- everything else from a geometric knowledge index.
 
  The difference:
- - Opus: 200B+ params × 16 bits = ~400GB weights. Needs GPU cluster.
- - Ours: 6 specialists × 1.7B params × 1.58 bits = ~2GB total. Runs on laptop.
  - The gap is filled by E8 lattice retrieval (R@5=100%) from a knowledge index.
 
- A 1.7B model that can look up any fact in 20ms is functionally equivalent
- to a 200B model that memorized those facts — for the user, the answer
- is the same. The 200B model is faster at raw generation. Ours is cheaper,
- private, and never hallucinates on indexed knowledge.
 
  ## What's Already Proven
 
@@ -37,51 +31,82 @@ This project builds on the H4 Polytopic Attention foundation (7 phases, all test
  | CPU training | Proven | 24M ternary params, 8 hours, coherent English |
  | Autoresearch | Proven | 42+ autonomous experiments, finds optimal configs |
 
  ## Legal Foundation
 
- **This is NOT distillation.** We do not use outputs from proprietary models as
- training data. Every component is legally clean:
-
- ### Base Models (published with explicit fine-tuning permission)
- | Model | Params | License | Source |
- |-------|--------|---------|--------|
- | SmolLM2-1.7B | 1.7B | Apache 2.0 | huggingface/SmolLM2-1.7B-Instruct |
- | SmolLM2-360M | 360M | Apache 2.0 | huggingface/SmolLM2-360M-Instruct |
- | OLMo-1B | 1B | Apache 2.0 | allenai/OLMo-1B |
- | OLMo-7B | 7B | Apache 2.0 | allenai/OLMo-7B |
- | TinyLlama-1.1B | 1.1B | Apache 2.0 | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
- | Qwen2.5-0.5B | 0.5B | Apache 2.0 | Qwen/Qwen2.5-0.5B-Instruct |
- | Qwen2.5-1.5B | 1.5B | Apache 2.0 | Qwen/Qwen2.5-1.5B-Instruct |
-
- Apache 2.0 means: use for any purpose, modify, distribute, commercial use.
- These are not distilled — they were trained from scratch on open data by their
- respective organizations and released specifically for the community to use.
-
- ### Training Data (all openly licensed)
- | Dataset | Size | License | HuggingFace ID |
- |---------|------|---------|----------------|
- | SlimPajama | 627B tokens | Apache 2.0 | cerebras/SlimPajama-627B |
- | FineWeb-Edu | 1.3T tokens | ODC-By 1.0 | HuggingFaceFW/fineweb-edu |
- | The Stack v2 | 3.3T tokens | Per-file license | bigcode/the-stack-v2 |
- | OpenWebMath | 14.7B tokens | ODC-By 1.0 | open-web-math/open-web-math |
- | OpenAssistant 2 | 161K messages | Apache 2.0 | OpenAssistant/oasst2 |
- | Dolly 15K | 15K instructions | CC-BY-SA | databricks/databricks-dolly-15k |
- | FLAN Collection | Millions | Apache 2.0 | Muennighoff/flan |
- | Natural Questions | 307K pairs | CC-BY-SA | google-research/natural-questions |
- | SQuAD 2.0 | 150K pairs | CC-BY-SA | rajpurkar/squad_v2 |
- | GSM8K | 8.5K problems | MIT | openai/gsm8k |
- | ARC | 7.7K questions | CC-BY-SA | allenai/ai2_arc |
- | Project Gutenberg | 70K books | Public domain | aleph_alpha/gutenberg |
- | Wikipedia | ~4B tokens | CC-BY-SA | wikimedia/wikipedia |
- | CNN/DailyMail | 300K articles | Apache 2.0 | abisee/cnn_dailymail |
-
- ### Reranking Model
  | Model | Params | License | HuggingFace ID |
  |-------|--------|---------|----------------|
  | ms-marco-MiniLM-L-6-v2 | 22M | Apache 2.0 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
 
- Everything above is published under open licenses that explicitly permit
- the use case we're building. No terms of service violations. No gray areas.
 
  ## Architecture
 
@@ -95,16 +120,16 @@ the use case we're building. No terms of service violations. No gray areas.
  | (16 chambers) | Coxeter chamber classification
  +----------+----------+
  |
- +---------------+---------------+
- | | |
- v v v
- +-----------+ +-----------+ +-----------+
- |Specialist | |Specialist | |Specialist | 6 specialists
- | 1 | | 2 | | ... | Each 360M-1.7B
- | (1.7B) | | (1.7B) | | (360M) | Ternary weights
- +-----+-----+ +-----+-----+ +-----+-----+
- | | |
- +---------------+---------------+
  |
  v
  +---------------------+
@@ -125,81 +150,93 @@ the use case we're building. No terms of service violations. No gray areas.
  Response
  ```
 
- ### Why Multiple Specialists Beat One Giant Model (on CPU)
 
- 1. **Only one specialist loads at a time.** 1.7B ternary = 340MB in RAM.
- You don't need 32GB to hold all 6 simultaneously. The router selects
- the right specialist, loads it (or keeps hot specialists cached),
- generates the response, and moves on.
 
- 2. **Each specialist is focused.** A 1.7B model fine-tuned on code data
- writes better code than a 1.7B model fine-tuned on everything.
- Specialization > generalization at small scale.
 
- 3. **Retrieval replaces memorization.** The E8 lattice holds unlimited
- knowledge (bounded only by disk space). The model doesn't need to
- memorize Wikipedia; it retrieves the relevant passage in 20ms.
 
- 4. **The router is geometric, not learned.** The ChamberTree classifies
- queries by their H4 chamber in <1ms. No routing model needed.
- No additional parameters. Pure geometry.
 
- ## The 6 Specialists
 
- | # | Specialist | Base Model | Fine-tune Data | Tokens | Chambers |
- |---|-----------|-----------|----------------|--------|----------|
- | 1 | General + Instructions | SmolLM2-1.7B | OpenAssistant + Dolly + FLAN | ~50M | 0-2 |
- | 2 | Code | SmolLM2-1.7B | The Stack v2 (Python, JS, Rust, C) | ~100M | 3-5 |
- | 3 | Math/Reasoning | SmolLM2-360M | GSM8K + ARC + MATH | ~20M | 6-7 |
- | 4 | Factual QA | SmolLM2-360M | NQ + SQuAD + TriviaQA | ~30M | 8-9 |
- | 5 | Creative Writing | SmolLM2-360M | Project Gutenberg | ~50M | 10-12 |
- | 6 | Summarization | SmolLM2-360M | CNN/DailyMail + XSum | ~20M | 13-15 |
 
- ## H4 Attention Progressive Swap
 
- The base models use standard softmax attention. We swap in H4 geometric
- attention through a progressive gating mechanism:
 
- 1. **Adapter phase:** H4 as parallel path with learned gate (starts at 0)
- 2. **Hybrid phase:** Both pathways train together, gate opens
- 3. **Full swap:** Gate = 1.0, remove original attention
- 4. **Quantize:** BitLinear ternary compression (13.8x)
 
- ## Training Timeline
 
- ### Sequential (1 CPU): ~45 days
- ### Parallel (6 CPUs): ~3-4 weeks
- ### Cost: 6 cloud VMs × $0.05/hr × 14 days = **$100.80** (or $0 with 6 friends' laptops)
 
  ## Honest Quality Expectations
 
- | Capability | Claude Opus | Olympus | Gap | Why |
- |------------|-------------|---------|-----|-----|
- | Factual QA | 90% | 85-95% | TIE/WIN | E8 retrieval > memorization |
- | Code | 85% | 40-55% | LARGE | 1.7B can't match 200B+ |
- | Math | 95% | 50-65% | LARGE | Complex reasoning needs more params |
- | Conversation | 95% | 75-85% | MODERATE | Good with OpenAssistant data |
- | Creative | 95% | 70-80% | MODERATE | Smaller model, less nuance |
- | Long context | 85% | 80-90% | SMALL | O(log t) advantage is real |
- | Summarization | 90% | 75-85% | MODERATE | CNN/DM fine-tune helps |
- | **Cost/month** | **$$$** | **$0** | **WIN** | **Runs on laptop** |
- | **Privacy** | **Cloud** | **Local** | **WIN** | **Data never leaves machine** |
-
- This is NOT Claude Opus quality across the board. It IS:
- - 85-95% on factual QA (retrieval advantage)
  - 75-85% on instruction following (good enough for most tasks)
- - Free, private, local, and improving
- - A foundation that a community can build on
 
  ## The Vision
 
- A laptop running 6 small specialists, routed by H4 geometry, backed by
- unlimited knowledge retrieval from E8 lattice memory. Not as good as
- Claude Opus at everything. But good enough at most things, free to run,
- private by default, and available to anyone with a computer.
 
- **That's not a replacement for frontier models. It's an alternative
- for the billions of people who can't afford them.**
 
  ---
 
  # Project Olympus: Frontier-Quality AI on CPU
 
  ## Goal
 
+ Build a system that approaches frontier model quality (Claude Opus, GPT-4 class) running entirely on CPU hardware, using only legally clean open-source models and data. No GPU. No API dependency. No monthly cost. No legal risk.
+
+ **This is for the billions of people who can't afford frontier AI subscriptions and GPU compute.** Good-enough answers on free hardware beat perfect answers on expensive hardware --- for education, small business, developing nations, and anyone who values privacy and independence.
 
  ## The Core Insight
 
+ Claude Opus is one giant model that memorizes everything in its weights. We build focused specialists that know their domain deeply and retrieve everything else from a geometric knowledge index.
 
  The difference:
+ - Opus: 200B+ params x 16 bits = ~400GB weights. Needs GPU cluster.
+ - Ours: 4 specialists x 3B params x 1.58 bits = ~2.4GB total. Runs on laptop.
  - The gap is filled by E8 lattice retrieval (R@5=100%) from a knowledge index.
 
+ A 3B model that can look up any fact in 20ms is functionally equivalent to a 200B model that memorized those facts --- for the user, the answer is the same.
 
  ## What's Already Proven
 
  | CPU training | Proven | 24M ternary params, 8 hours, coherent English |
  | Autoresearch | Proven | 42+ autonomous experiments, finds optimal configs |
 
+ ## The Base Model: SmolLM3-3B-Instruct
+
+ **HuggingFace ID:** `HuggingFaceTB/SmolLM3-3B-Instruct`
+
+ SmolLM3-3B (July 2025) is the correct base model. Using anything smaller would leave performance on the table:
+
+ - **11.2T training tokens** (vs 2T for SmolLM2)
+ - **128K context window** (vs 8K for SmolLM2)
+ - **Dual-mode reasoning** (thinking + direct)
+ - **Outperforms** Llama 3.2 3B and Qwen 2.5 3B across standard benchmarks
+ - **Apache 2.0 license** --- full commercial use
+ - **Full training recipe published** (data mixtures, hyperparameters, ablations)
+ - **Tool calling support** built in
+
+ ### Why SmolLM3-3B over other options
+
+ | Model | Params | License | Context | Trained on | Notes |
+ |-------|--------|---------|---------|------------|-------|
+ | **SmolLM3-3B** | **3B** | **Apache 2.0** | **128K** | **11.2T tokens** | **Best in class, fully open** |
+ | Phi-4-mini | 3.8B | MIT | 128K | Proprietary mix | Slightly larger; MIT is fine too |
+ | Qwen2.5-3B | 3B | Apache 2.0 | 32K | Unknown size | Older, lower benchmarks |
+ | Llama 3.2 3B | 3B | Llama License | 128K | ~10T? | Meta license has usage limits |
+ | SmolLM2-1.7B | 1.7B | Apache 2.0 | 8K | 2T tokens | Obsoleted by SmolLM3 |
+
+ ### Ternary size
+
+ - Float32: 3B x 4 bytes = 12 GB
+ - Float16: 3B x 2 bytes = 6 GB
+ - **Ternary (1.58 bit): 3B x ~0.2 bytes = ~600 MB**
+ - With optimizer states for fine-tuning: ~4-8 GB total in RAM
+ - **Fits comfortably in 32 GB RAM for fine-tuning on CPU**
+
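The size bullets above follow directly from bits per weight. A quick arithmetic sanity check (a sketch; the parameter count and the ternary bit-cost log2(3) ≈ 1.58 are the only inputs):

```python
# Sanity-check the ternary-size bullets: model bytes = params * bits / 8.
import math

PARAMS = 3e9  # SmolLM3-3B

def model_bytes(bits_per_weight, params=PARAMS):
    return params * bits_per_weight / 8

print(f"float32: {model_bytes(32) / 1e9:.1f} GB")            # 12.0 GB
print(f"float16: {model_bytes(16) / 1e9:.1f} GB")            # 6.0 GB
print(f"ternary: {model_bytes(math.log2(3)) / 1e6:.0f} MB")  # ~594 MB
```

The ~594 MB figure is where the "~600 MB" and "4 specialists = ~2.4GB" numbers elsewhere in this document come from.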
  ## Legal Foundation
 
+ **This is NOT distillation.** We do not use outputs from proprietary models as training data. Every component is legally clean.
+
+ ### Base Models (all Apache 2.0)
+
  | Model | Params | License | HuggingFace ID |
  |-------|--------|---------|----------------|
+ | SmolLM3-3B-Instruct | 3B | Apache 2.0 | HuggingFaceTB/SmolLM3-3B-Instruct |
  | ms-marco-MiniLM-L-6-v2 | 22M | Apache 2.0 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
 
+ ### Fine-tuning Data (all openly licensed)
+
+ **Code Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | The Stack v2 (filtered) | ~100M tokens | Per-file | bigcode/the-stack-v2 |
+ | CodeAlpaca 20K | 20K instructions | Apache 2.0 | sahil2801/CodeAlpaca-20k |
+ | CodeFeedback | 66K examples | Apache 2.0 | m-a-p/CodeFeedback-Filtered-Instruction |
+ | Evol-Instruct-Code | 110K | Apache 2.0 | nickrosh/Evol-Instruct-Code-80k-v1 |
+
+ **Math/Reasoning Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | MetaMathQA | 395K | MIT | meta-math/MetaMathQA |
+ | OpenMathInstruct v2 | 1.8M | Permissive | nvidia/OpenMathInstruct-2 |
+ | GSM8K | 8.5K | MIT | openai/gsm8k |
+ | MATH | 12.5K | MIT | hendrycks/competition_math |
+ | ARC | 7.7K | CC-BY-SA | allenai/ai2_arc |
+
+ **QA/Retrieval Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | Natural Questions | 307K | CC-BY-SA | google-research-datasets/nq_open |
+ | SQuAD 2.0 | 150K | CC-BY-SA | rajpurkar/squad_v2 |
+ | TriviaQA | 95K | Apache 2.0 | mandarjoshi/trivia_qa |
+ | HotpotQA | 113K | CC-BY-SA | hotpot_qa |
+
+ **Knowledge Index:**
+ | Source | Size | License | Notes |
+ |--------|------|---------|-------|
+ | Wikipedia EN | ~4B tokens | CC-BY-SA | General encyclopedic knowledge |
+ | Stack Overflow | ~10GB | CC-BY-SA | Programming Q&A |
+ | Project Gutenberg | 70K books | Public domain | Literature |
+ | User's own docs | Variable | N/A | Custom knowledge base |
 
  ## Architecture
 
  | (16 chambers) | Coxeter chamber classification
  +----------+----------+
  |
+ +-------+-------+-------+
+ | | | |
+ v v v v
+ +--------+ +------+ +------+ +------+
+ |General | | Code | | Math | | QA | 4 specialists
+ | (3B) | | (3B) | | (3B) | | (3B)| SmolLM3-3B base
+ | as-is | | FT'd | | FT'd | | FT'd| Ternary weights
+ +---+----+ +--+---+ +--+---+ +--+---+
+ | | | |
+ +----------+--------+--------+
  |
  v
  +---------------------+
  Response
  ```
 
+ ### Why 4 Specialists Instead of 6
+
+ With SmolLM3-3B as the base (much stronger than SmolLM2-1.7B), we don't need 6 specialists. The base model is already strong at conversation, creative writing, and summarization. We only specialize where it matters:
+
+ | # | Specialist | Base | Fine-tuning | Why Separate |
+ |---|-----------|------|-------------|--------------|
+ | 0 | General | SmolLM3-3B-Instruct AS-IS | None needed | Already instruction-tuned |
+ | 1 | Code | SmolLM3-3B + code data | ~200M tokens | Code needs 80%+ domain data |
+ | 2 | Math | SmolLM3-3B + math data | ~100M tokens | Weakest area for small models |
+ | 3 | QA | SmolLM3-3B + retrieval QA | ~150M tokens | Learn to answer FROM context |
+
+ **Total active RAM: ~600MB** (one specialist loaded at a time) + 90MB MiniLM reranker + E8 index
+
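The "one specialist loaded at a time" budget implies a dispatch loop: classify the query's chamber, map it to a specialist, and keep at most one set of weights resident. A minimal sketch; the chamber-to-specialist assignment, the `load_fn` loader, and the ChamberTree classifier are all hypothetical placeholders here, not the project's real code:

```python
from collections import OrderedDict

# Hypothetical chamber -> specialist assignment (illustrative; the real
# ChamberTree classifier decides the chamber, which is not shown).
CHAMBER_TO_SPECIALIST = {**{c: "general" for c in range(0, 3)},
                         **{c: "code" for c in range(3, 6)},
                         **{c: "math" for c in range(6, 8)},
                         **{c: "qa" for c in range(8, 16)}}

class SpecialistPool:
    """Keep at most `capacity` specialists resident (~600MB each when ternary)."""
    def __init__(self, load_fn, capacity=1):
        self.load_fn, self.capacity = load_fn, capacity
        self.cache = OrderedDict()

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[name] = self.load_fn(name)
        return self.cache[name]

def route(chamber, pool):
    return pool.get(CHAMBER_TO_SPECIALIST[chamber])

pool = SpecialistPool(load_fn=lambda name: f"<{name} weights>")
print(route(4, pool))   # chamber 4 -> "<code weights>"
print(route(7, pool))   # chamber 7 evicts code, loads math
```

With `capacity=1` the resident footprint stays at one specialist; raising it trades RAM for fewer reloads on mixed workloads.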
+ ## H4 Attention Integration
+
+ SmolLM3 uses GQA with 4 groups, which maps naturally to H4's four Coxeter simple roots.
+
+ **Progressive swap in 4 phases:**
+
+ 1. **Adapter (Days 1-3):** Freeze SmolLM3, add H4 adapter parallel to each GQA layer. Gate starts at 0. Train only H4 params.
+ 2. **Hybrid (Days 3-7):** Unfreeze SmolLM3 attention. Both paths train. Monitor which layers prefer H4.
+ 3. **Selective swap (Days 7-10):** Layers with gate >0.8 keep only H4. Layers with gate <0.3 keep only original. Others stay hybrid.
+ 4. **Ternary (Day 10):** Apply BitLinear to H4 layers. Export final model.
+
+ **What this gives you:** O(log t) attention for long sequences (SmolLM3's 128K context is O(t^2) even via Flash Attention), ternary attention weights (~600MB), and E8 lattice integration for retrieval.
 
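The gate mechanic behind the four phases can be sketched in a few lines. This is a toy illustration, assuming a sigmoid-parameterized gate per layer and the phase-3 thresholds from the list above; the actual H4 attention module and training loop are not shown:

```python
# Toy model of the per-layer progressive-swap gate (hypothetical names).
# Each layer mixes its original GQA output with the H4 adapter output
# through a learned sigmoid gate that starts near 0.
import math

class GatedSwap:
    def __init__(self):
        self.gate_logit = -6.0   # sigmoid(-6) ~ 0.0025: start almost fully original

    @property
    def gate(self):
        return 1.0 / (1.0 + math.exp(-self.gate_logit))

    def forward(self, original_out, h4_out):
        g = self.gate
        return [(1.0 - g) * o + g * h for o, h in zip(original_out, h4_out)]

    def swap_decision(self):
        # Phase 3 rule: gate >0.8 keeps only H4, gate <0.3 keeps only
        # the original attention, anything between stays hybrid.
        if self.gate > 0.8:
            return "h4-only"
        if self.gate < 0.3:
            return "original-only"
        return "hybrid"

layer = GatedSwap()
print(layer.swap_decision())   # gate ~0.0025 -> "original-only"
layer.gate_logit = 3.0         # after hybrid training the gate has opened
print(layer.swap_decision())   # sigmoid(3) ~ 0.95 -> "h4-only"
```

Because the gate starts near 0, the adapter phase cannot degrade the frozen base model; layers only migrate to H4 if training actually pushes their gate open.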
+ ## Fine-Tuning: QLoRA on CPU
 
+ Full fine-tuning of 3B params on CPU is slow. QLoRA is 3-6x faster because only 1-2% of parameters get gradients:
 
+ | Method | Step time | Steps/day | Trainable params |
+ |--------|-----------|-----------|-----------------|
+ | Full fine-tune 3B on CPU | ~3s | ~28K | 3B (100%) |
+ | **QLoRA 3B on CPU** | **~0.5-1s** | **~86-170K** | **~20-50M (1-2%)** |
 
+ ### Per-specialist training budget
+ | Specialist | Tokens | Steps | Time |
+ |------------|--------|-------|------|
+ | Code | 200M | ~50K | 1-2 days |
+ | Math | 100M | ~25K | 0.5-1 day |
+ | QA | 150M | ~37K | 1-1.5 days |
+ | **Total** | **450M** | **~112K** | **3-5 days** |
 
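The budget table's numbers reduce to two ratios: tokens per step and steps per day. A back-of-envelope check, where `TOKENS_PER_STEP` is an assumed effective batch size (not stated in the plan) and step times come from the QLoRA table:

```python
# Back-of-envelope check of the per-specialist budget table.
SECONDS_PER_DAY = 86_400
TOKENS_PER_STEP = 4_096   # assumption: effective batch of 4K tokens per step

def steps(tokens):
    return tokens / TOKENS_PER_STEP

def compute_days(tokens, step_seconds):
    return steps(tokens) * step_seconds / SECONDS_PER_DAY

print(round(steps(200e6) / 1000))          # code specialist: ~49K steps
print(round(compute_days(200e6, 1.0), 1))  # ~0.6 days of pure step time
```

The gap between ~0.6 days of raw step time and the table's "1-2 days" leaves headroom for data loading, evaluation, and checkpointing.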
+ ## The 14-Day Plan
 
+ | Day | Task | Validation |
+ |-----|------|------------|
+ | 1 | Download SmolLM3, verify, setup QLoRA | Generates text OK |
+ | 2 | Fine-tune code specialist | Writes Python functions |
+ | 3 | Fine-tune math specialist | Solves GSM8K problems |
+ | 4 | Fine-tune QA specialist | Answers from context |
+ | 5-6 | H4 progressive swap Phase 1 | Perplexity within 5% |
+ | 7-8 | H4 progressive swap Phase 2 | Gate values meaningful |
+ | 9-10 | H4 selective swap + ternary | Chamber preservation >80% |
+ | 11 | ChamberTree router | Routes correctly |
+ | 12 | E8 knowledge index (Wikipedia) | Retrieval finds facts |
+ | 13 | Integration + demo | End-to-end works |
+ | 14 | Benchmarks + upload to HF | Numbers documented |
 
+ **Cost:** 3-5 days specialist training + 6-9 days H4 swap = ~9-14 days total. On cloud: ~$50-100. On laptops: $0.
 
 
  ## Honest Quality Expectations
 
+ | Task | SmolLM3-3B base | + Specialist FT | + E8 Retrieval | Opus |
+ |------|----------------|-----------------|----------------|------|
+ | MMLU | ~60% | ~62% | ~70-75% | ~88% |
+ | HumanEval | ~45% | ~55-65% | N/A | ~85% |
+ | GSM8K | ~55% | ~65-75% | N/A | ~95% |
+ | TriviaQA | ~50% | ~55% | **~85-90%** | ~90% |
+ | Instruction | ~80% | ~82% | N/A | ~95% |
+ | Long context | Good to 128K | Same | Better | 200K |
+ | **Cost** | **$0** | **$0** | **$0** | **$$$** |
+ | **Privacy** | **Local** | **Local** | **Local** | **Cloud** |
+
+ The retrieval-augmented factual QA (85-90%) is where we compete directly with frontier models. Everything else is 60-85% of Opus.
+
+ **This is NOT Claude Opus quality across the board.** It IS:
+ - 85-90% on factual QA (retrieval advantage --- the model looks up facts instead of hallucinating)
  - 75-85% on instruction following (good enough for most tasks)
+ - 55-75% on code and math (honest gap --- complex reasoning needs more params)
+ - Free, private, local, legally clean, and improvable by the community
 
  ## The Vision
 
+ A laptop running 4 focused specialists, routed by H4 geometry in <1ms, backed by unlimited knowledge retrieval from E8 lattice memory in 20ms, reranked to 98.5% accuracy. Not as good as Claude Opus at everything. But good enough at most things, free to run, private by default, and available to anyone with a computer.
 
+ **That's not a replacement for frontier models. It's an alternative for the billions of people who can't afford them.**
 
  ---