grapheneaffiliates committed on
Commit efa7086 · verified · 1 Parent(s): e3012a4

Upload PROJECT_OLYMPUS.md with huggingface_hub

Files changed (1):
  1. PROJECT_OLYMPUS.md +155 -118

PROJECT_OLYMPUS.md CHANGED
@@ -1,27 +1,21 @@
  # Project Olympus: Frontier-Quality AI on CPU
 
  ## Goal
- Build a system that approaches frontier model quality (Claude Opus, GPT-4 class)
- running entirely on CPU hardware, using only legally clean open-source models and data.
- No GPU. No API dependency. No monthly cost. No legal risk.
 
- **This is for the billions of people who can't afford frontier AI subscriptions and GPU compute.** Good-enough answers on free hardware beat perfect answers on expensive hardware for education, small business, developing nations, and anyone who values privacy and independence.
 
  ## The Core Insight
 
- Claude Opus is one giant model that memorizes everything in its weights.
- We build many small specialists that know their domain deeply and retrieve
- everything else from a geometric knowledge index.
 
  The difference:
- - Opus: 200B+ params × 16 bits = ~400GB weights. Needs GPU cluster.
- - Ours: 6 specialists × 1.7B params × 1.58 bits = ~2GB total. Runs on laptop.
  - The gap is filled by E8 lattice retrieval (R@5=100%) from a knowledge index.
 
- A 1.7B model that can look up any fact in 20ms is functionally equivalent
- to a 200B model that memorized those facts — for the user, the answer
- is the same. The 200B model is faster at raw generation. Ours is cheaper,
- private, and never hallucinates on indexed knowledge.
 
  ## What's Already Proven
 
@@ -37,51 +31,82 @@ This project builds on the H4 Polytopic Attention foundation (7 phases, all test
  | CPU training | Proven | 24M ternary params, 8 hours, coherent English |
  | Autoresearch | Proven | 42+ autonomous experiments, finds optimal configs |
 
  ## Legal Foundation
 
- **This is NOT distillation.** We do not use outputs from proprietary models as
- training data. Every component is legally clean:
-
- ### Base Models (published with explicit fine-tuning permission)
- | Model | Params | License | Source |
- |-------|--------|---------|--------|
- | SmolLM2-1.7B | 1.7B | Apache 2.0 | huggingface/SmolLM2-1.7B-Instruct |
- | SmolLM2-360M | 360M | Apache 2.0 | huggingface/SmolLM2-360M-Instruct |
- | OLMo-1B | 1B | Apache 2.0 | allenai/OLMo-1B |
- | OLMo-7B | 7B | Apache 2.0 | allenai/OLMo-7B |
- | TinyLlama-1.1B | 1.1B | Apache 2.0 | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
- | Qwen2.5-0.5B | 0.5B | Apache 2.0 | Qwen/Qwen2.5-0.5B-Instruct |
- | Qwen2.5-1.5B | 1.5B | Apache 2.0 | Qwen/Qwen2.5-1.5B-Instruct |
-
- Apache 2.0 means: use for any purpose, modify, distribute, commercial use.
- These are not distilled — they were trained from scratch on open data by their
- respective organizations and released specifically for the community to use.
-
- ### Training Data (all openly licensed)
- | Dataset | Size | License | HuggingFace ID |
- |---------|------|---------|----------------|
- | SlimPajama | 627B tokens | Apache 2.0 | cerebras/SlimPajama-627B |
- | FineWeb-Edu | 1.3T tokens | ODC-By 1.0 | HuggingFaceFW/fineweb-edu |
- | The Stack v2 | 3.3T tokens | Per-file license | bigcode/the-stack-v2 |
- | OpenWebMath | 14.7B tokens | ODC-By 1.0 | open-web-math/open-web-math |
- | OpenAssistant 2 | 161K messages | Apache 2.0 | OpenAssistant/oasst2 |
- | Dolly 15K | 15K instructions | CC-BY-SA | databricks/databricks-dolly-15k |
- | FLAN Collection | Millions | Apache 2.0 | Muennighoff/flan |
- | Natural Questions | 307K pairs | CC-BY-SA | google-research/natural-questions |
- | SQuAD 2.0 | 150K pairs | CC-BY-SA | rajpurkar/squad_v2 |
- | GSM8K | 8.5K problems | MIT | openai/gsm8k |
- | ARC | 7.7K questions | CC-BY-SA | allenai/ai2_arc |
- | Project Gutenberg | 70K books | Public domain | aleph_alpha/gutenberg |
- | Wikipedia | ~4B tokens | CC-BY-SA | wikimedia/wikipedia |
- | CNN/DailyMail | 300K articles | Apache 2.0 | abisee/cnn_dailymail |
-
- ### Reranking Model
  | Model | Params | License | HuggingFace ID |
  |-------|--------|---------|----------------|
  | ms-marco-MiniLM-L-6-v2 | 22M | Apache 2.0 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
 
- Everything above is published under open licenses that explicitly permit
- the use case we're building. No terms of service violations. No gray areas.
 
  ## Architecture
 
@@ -95,16 +120,16 @@ the use case we're building. No terms of service violations. No gray areas.
  | (16 chambers) | Coxeter chamber classification
  +----------+----------+
  |
- +---------------+---------------+
- | | |
- v v v
- +-----------+ +-----------+ +-----------+
- |Specialist | |Specialist | |Specialist | 6 specialists
- | 1 | | 2 | | ... | Each 360M-1.7B
- | (1.7B) | | (1.7B) | | (360M) | Ternary weights
- +-----+-----+ +-----+-----+ +-----+-----+
- | | |
- +---------------+---------------+
  |
  v
  +---------------------+
@@ -125,81 +150,93 @@ the use case we're building. No terms of service violations. No gray areas.
  Response
  ```
 
- ### Why Multiple Specialists Beat One Giant Model (on CPU)
 
- 1. **Only one specialist loads at a time.** 1.7B ternary = 340MB in RAM.
- You don't need 32GB to hold all 6 simultaneously. The router selects
- the right specialist, loads it (or keeps hot specialists cached),
- generates the response, and moves on.
 
- 2. **Each specialist is focused.** A 1.7B model fine-tuned on code data
- writes better code than a 1.7B model fine-tuned on everything.
- Specialization > generalization at small scale.
 
- 3. **Retrieval replaces memorization.** The E8 lattice holds unlimited
- knowledge (bounded only by disk space). The model doesn't need to
- memorize Wikipedia; it retrieves the relevant passage in 20ms.
 
- 4. **The router is geometric, not learned.** The ChamberTree classifies
- queries by their H4 chamber in <1ms. No routing model needed.
- No additional parameters. Pure geometry.
 
- ## The 6 Specialists
 
- | # | Specialist | Base Model | Fine-tune Data | Tokens | Chambers |
- |---|-----------|-----------|----------------|--------|----------|
- | 1 | General + Instructions | SmolLM2-1.7B | OpenAssistant + Dolly + FLAN | ~50M | 0-2 |
- | 2 | Code | SmolLM2-1.7B | The Stack v2 (Python, JS, Rust, C) | ~100M | 3-5 |
- | 3 | Math/Reasoning | SmolLM2-360M | GSM8K + ARC + MATH | ~20M | 6-7 |
- | 4 | Factual QA | SmolLM2-360M | NQ + SQuAD + TriviaQA | ~30M | 8-9 |
- | 5 | Creative Writing | SmolLM2-360M | Project Gutenberg | ~50M | 10-12 |
- | 6 | Summarization | SmolLM2-360M | CNN/DailyMail + XSum | ~20M | 13-15 |
 
- ## H4 Attention Progressive Swap
 
- The base models use standard softmax attention. We swap in H4 geometric
- attention through a progressive gating mechanism:
 
- 1. **Adapter phase:** H4 as parallel path with learned gate (starts at 0)
- 2. **Hybrid phase:** Both pathways train together, gate opens
- 3. **Full swap:** Gate = 1.0, remove original attention
- 4. **Quantize:** BitLinear ternary compression (13.8x)
 
- ## Training Timeline
 
- ### Sequential (1 CPU): ~45 days
- ### Parallel (6 CPUs): ~3-4 weeks
- ### Cost: 6 cloud VMs × $0.05/hr × 14 days = **$100.80** (or $0 with 6 friends' laptops)
 
  ## Honest Quality Expectations
 
- | Capability | Claude Opus | Olympus | Gap | Why |
- |------------|-------------|---------|-----|-----|
- | Factual QA | 90% | 85-95% | TIE/WIN | E8 retrieval > memorization |
- | Code | 85% | 40-55% | LARGE | 1.7B can't match 200B+ |
- | Math | 95% | 50-65% | LARGE | Complex reasoning needs more params |
- | Conversation | 95% | 75-85% | MODERATE | Good with OpenAssistant data |
- | Creative | 95% | 70-80% | MODERATE | Smaller model, less nuance |
- | Long context | 85% | 80-90% | SMALL | O(log t) advantage is real |
- | Summarization | 90% | 75-85% | MODERATE | CNN/DM fine-tune helps |
- | **Cost/month** | **$$$** | **$0** | **WIN** | **Runs on laptop** |
- | **Privacy** | **Cloud** | **Local** | **WIN** | **Data never leaves machine** |
-
- This is NOT Claude Opus quality across the board. It IS:
- - 85-95% on factual QA (retrieval advantage)
  - 75-85% on instruction following (good enough for most tasks)
- - Free, private, local, and improving
- - A foundation that a community can build on
 
  ## The Vision
 
- A laptop running 6 small specialists, routed by H4 geometry, backed by
- unlimited knowledge retrieval from E8 lattice memory. Not as good as
- Claude Opus at everything. But good enough at most things, free to run,
- private by default, and available to anyone with a computer.
 
- **That's not a replacement for frontier models. It's an alternative
- for the billions of people who can't afford them.**
 
  ---
 
  # Project Olympus: Frontier-Quality AI on CPU
 
  ## Goal
 
+ Build a system that approaches frontier model quality (Claude Opus, GPT-4 class) running entirely on CPU hardware, using only legally clean open-source models and data. No GPU. No API dependency. No monthly cost. No legal risk.
+
+ **This is for the billions of people who can't afford frontier AI subscriptions and GPU compute.** Good-enough answers on free hardware beat perfect answers on expensive hardware --- for education, small business, developing nations, and anyone who values privacy and independence.
 
  ## The Core Insight
 
+ Claude Opus is one giant model that memorizes everything in its weights. We build focused specialists that know their domain deeply and retrieve everything else from a geometric knowledge index.
 
  The difference:
+ - Opus: 200B+ params x 16 bits = ~400GB weights. Needs GPU cluster.
+ - Ours: 4 specialists x 3B params x 1.58 bits = ~2.4GB total. Runs on laptop.
  - The gap is filled by E8 lattice retrieval (R@5=100%) from a knowledge index.
 
+ A 3B model that can look up any fact in 20ms is functionally equivalent to a 200B model that memorized those facts --- for the user, the answer is the same.
 
  ## What's Already Proven
 
  | CPU training | Proven | 24M ternary params, 8 hours, coherent English |
  | Autoresearch | Proven | 42+ autonomous experiments, finds optimal configs |
 
+ ## The Base Model: SmolLM3-3B-Instruct
+
+ **HuggingFace ID:** `HuggingFaceTB/SmolLM3-3B-Instruct`
+
+ SmolLM3-3B (July 2025) is the correct base model. Using anything smaller would leave performance on the table:
+
+ - **11.2T training tokens** (vs 2T for SmolLM2)
+ - **128K context window** (vs 8K for SmolLM2)
+ - **Dual-mode reasoning** (thinking + direct)
+ - **Outperforms** Llama 3.2 3B and Qwen 2.5 3B across standard benchmarks
+ - **Apache 2.0 license** --- full commercial use
+ - **Full training recipe published** (data mixtures, hyperparameters, ablations)
+ - **Tool calling support** built in
+
+ ### Why SmolLM3-3B over other options
+
+ | Model | Params | License | Context | Trained on | Notes |
+ |-------|--------|---------|---------|------------|-------|
+ | **SmolLM3-3B** | **3B** | **Apache 2.0** | **128K** | **11.2T tokens** | **Best in class, fully open** |
+ | Phi-4-mini | 3.8B | MIT | 128K | Proprietary mix | Slightly larger; MIT is fine too |
+ | Qwen2.5-3B | 3B | Apache 2.0 | 32K | Unknown size | Older, lower benchmarks |
+ | Llama 3.2 3B | 3B | Llama License | 128K | ~10T? | Meta license has usage limits |
+ | SmolLM2-1.7B | 1.7B | Apache 2.0 | 8K | 2T tokens | Obsoleted by SmolLM3 |
+
+ ### Ternary size
+
+ - Float32: 3B x 4 bytes = 12 GB
+ - Float16: 3B x 2 bytes = 6 GB
+ - **Ternary (1.58 bit): 3B x ~0.2 bytes = ~600 MB**
+ - With optimizer states for fine-tuning: ~4-8 GB total in RAM
+ - **Fits comfortably in 32 GB RAM for fine-tuning on CPU**
+
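The size bullets above follow directly from bits per weight. A quick arithmetic sanity check (a sketch; the parameter count and the ternary bit-cost log2(3) ≈ 1.58 are the only inputs):

```python
# Sanity-check the ternary-size bullets: model bytes = params * bits / 8.
import math

PARAMS = 3e9  # SmolLM3-3B

def model_bytes(bits_per_weight, params=PARAMS):
    return params * bits_per_weight / 8

print(f"float32: {model_bytes(32) / 1e9:.1f} GB")            # 12.0 GB
print(f"float16: {model_bytes(16) / 1e9:.1f} GB")            # 6.0 GB
print(f"ternary: {model_bytes(math.log2(3)) / 1e6:.0f} MB")  # ~594 MB
```

The ~594 MB figure is where the "~600 MB" and "4 specialists = ~2.4GB" numbers elsewhere in this document come from.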
  ## Legal Foundation
 
+ **This is NOT distillation.** We do not use outputs from proprietary models as training data. Every component is legally clean.
+
+ ### Base Models (all Apache 2.0)
+
  | Model | Params | License | HuggingFace ID |
  |-------|--------|---------|----------------|
+ | SmolLM3-3B-Instruct | 3B | Apache 2.0 | HuggingFaceTB/SmolLM3-3B-Instruct |
  | ms-marco-MiniLM-L-6-v2 | 22M | Apache 2.0 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
 
+ ### Fine-tuning Data (all openly licensed)
+
+ **Code Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | The Stack v2 (filtered) | ~100M tokens | Per-file | bigcode/the-stack-v2 |
+ | CodeAlpaca 20K | 20K instructions | Apache 2.0 | sahil2801/CodeAlpaca-20k |
+ | CodeFeedback | 66K examples | Apache 2.0 | m-a-p/CodeFeedback-Filtered-Instruction |
+ | Evol-Instruct-Code | 110K | Apache 2.0 | nickrosh/Evol-Instruct-Code-80k-v1 |
+
+ **Math/Reasoning Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | MetaMathQA | 395K | MIT | meta-math/MetaMathQA |
+ | OpenMathInstruct v2 | 1.8M | Permissive | nvidia/OpenMathInstruct-2 |
+ | GSM8K | 8.5K | MIT | openai/gsm8k |
+ | MATH | 12.5K | MIT | hendrycks/competition_math |
+ | ARC | 7.7K | CC-BY-SA | allenai/ai2_arc |
+
+ **QA/Retrieval Specialist:**
+ | Dataset | Size | License | HuggingFace ID |
+ |---------|------|---------|----------------|
+ | Natural Questions | 307K | CC-BY-SA | google-research-datasets/nq_open |
+ | SQuAD 2.0 | 150K | CC-BY-SA | rajpurkar/squad_v2 |
+ | TriviaQA | 95K | Apache 2.0 | mandarjoshi/trivia_qa |
+ | HotpotQA | 113K | CC-BY-SA | hotpot_qa |
+
+ **Knowledge Index:**
+ | Source | Size | License | Notes |
+ |--------|------|---------|-------|
+ | Wikipedia EN | ~4B tokens | CC-BY-SA | General encyclopedic knowledge |
+ | Stack Overflow | ~10GB | CC-BY-SA | Programming Q&A |
+ | Project Gutenberg | 70K books | Public domain | Literature |
+ | User's own docs | Variable | N/A | Custom knowledge base |
 
  ## Architecture
 
  | (16 chambers) | Coxeter chamber classification
  +----------+----------+
  |
+ +-------+-------+-------+
+ | | | |
+ v v v v
+ +--------+ +------+ +------+ +------+
+ |General | | Code | | Math | | QA | 4 specialists
+ | (3B) | | (3B) | | (3B) | | (3B)| SmolLM3-3B base
+ | as-is | | FT'd | | FT'd | | FT'd| Ternary weights
+ +---+----+ +--+---+ +--+---+ +--+---+
+ | | | |
+ +----------+--------+--------+
  |
  v
  +---------------------+
  Response
  ```
 
+ ### Why 4 Specialists Instead of 6
+
+ With SmolLM3-3B as the base (much stronger than SmolLM2-1.7B), we don't need 6 specialists. The base model is already strong at conversation, creative writing, and summarization. We only specialize where it matters:
+
+ | # | Specialist | Base | Fine-tuning | Why Separate |
+ |---|-----------|------|-------------|--------------|
+ | 0 | General | SmolLM3-3B-Instruct AS-IS | None needed | Already instruction-tuned |
+ | 1 | Code | SmolLM3-3B + code data | ~200M tokens | Code needs 80%+ domain data |
+ | 2 | Math | SmolLM3-3B + math data | ~100M tokens | Weakest area for small models |
+ | 3 | QA | SmolLM3-3B + retrieval QA | ~150M tokens | Learn to answer FROM context |
+
+ **Total active RAM: ~600MB** (one specialist loaded at a time) + 90MB MiniLM reranker + E8 index
+
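The "one specialist loaded at a time" budget implies a dispatch loop: classify the query's chamber, map it to a specialist, and keep at most one set of weights resident. A minimal sketch; the chamber-to-specialist assignment, the `load_fn` loader, and the ChamberTree classifier are all hypothetical placeholders here, not the project's real code:

```python
from collections import OrderedDict

# Hypothetical chamber -> specialist assignment (illustrative; the real
# ChamberTree classifier decides the chamber, which is not shown).
CHAMBER_TO_SPECIALIST = {**{c: "general" for c in range(0, 3)},
                         **{c: "code" for c in range(3, 6)},
                         **{c: "math" for c in range(6, 8)},
                         **{c: "qa" for c in range(8, 16)}}

class SpecialistPool:
    """Keep at most `capacity` specialists resident (~600MB each when ternary)."""
    def __init__(self, load_fn, capacity=1):
        self.load_fn, self.capacity = load_fn, capacity
        self.cache = OrderedDict()

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[name] = self.load_fn(name)
        return self.cache[name]

def route(chamber, pool):
    return pool.get(CHAMBER_TO_SPECIALIST[chamber])

pool = SpecialistPool(load_fn=lambda name: f"<{name} weights>")
print(route(4, pool))   # chamber 4 -> "<code weights>"
print(route(7, pool))   # chamber 7 evicts code, loads math
```

With `capacity=1` the resident footprint stays at one specialist; raising it trades RAM for fewer reloads on mixed workloads.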
+ ## H4 Attention Integration
+
+ SmolLM3 uses GQA with 4 groups, which maps naturally to H4's four Coxeter simple roots.
+
+ **Progressive swap in 4 phases:**
+
+ 1. **Adapter (Days 1-3):** Freeze SmolLM3, add H4 adapter parallel to each GQA layer. Gate starts at 0. Train only H4 params.
+ 2. **Hybrid (Days 3-7):** Unfreeze SmolLM3 attention. Both paths train. Monitor which layers prefer H4.
+ 3. **Selective swap (Days 7-10):** Layers with gate >0.8 keep only H4. Layers with gate <0.3 keep only original. Others stay hybrid.
+ 4. **Ternary (Day 10):** Apply BitLinear to H4 layers. Export final model.
+
+ **What this gives you:** O(log t) attention for long sequences (SmolLM3's 128K context is O(t^2) even via Flash Attention), ternary attention weights (~600MB), and E8 lattice integration for retrieval.
 
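The gate mechanic behind the four phases can be sketched in a few lines. This is a toy illustration, assuming a sigmoid-parameterized gate per layer and the phase-3 thresholds from the list above; the actual H4 attention module and training loop are not shown:

```python
# Toy model of the per-layer progressive-swap gate (hypothetical names).
# Each layer mixes its original GQA output with the H4 adapter output
# through a learned sigmoid gate that starts near 0.
import math

class GatedSwap:
    def __init__(self):
        self.gate_logit = -6.0   # sigmoid(-6) ~ 0.0025: start almost fully original

    @property
    def gate(self):
        return 1.0 / (1.0 + math.exp(-self.gate_logit))

    def forward(self, original_out, h4_out):
        g = self.gate
        return [(1.0 - g) * o + g * h for o, h in zip(original_out, h4_out)]

    def swap_decision(self):
        # Phase 3 rule: gate >0.8 keeps only H4, gate <0.3 keeps only
        # the original attention, anything between stays hybrid.
        if self.gate > 0.8:
            return "h4-only"
        if self.gate < 0.3:
            return "original-only"
        return "hybrid"

layer = GatedSwap()
print(layer.swap_decision())   # gate ~0.0025 -> "original-only"
layer.gate_logit = 3.0         # after hybrid training the gate has opened
print(layer.swap_decision())   # sigmoid(3) ~ 0.95 -> "h4-only"
```

Because the gate starts near 0, the adapter phase cannot degrade the frozen base model; layers only migrate to H4 if training actually pushes their gate open.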
+ ## Fine-Tuning: QLoRA on CPU
 
+ Full fine-tuning of 3B params on CPU is slow. QLoRA is 3-6x faster because only 1-2% of parameters get gradients:
 
+ | Method | Step time | Steps/day | Trainable params |
+ |--------|-----------|-----------|-----------------|
+ | Full fine-tune 3B on CPU | ~3s | ~28K | 3B (100%) |
+ | **QLoRA 3B on CPU** | **~0.5-1s** | **~86-170K** | **~20-50M (1-2%)** |
 
+ ### Per-specialist training budget
+ | Specialist | Tokens | Steps | Time |
+ |------------|--------|-------|------|
+ | Code | 200M | ~50K | 1-2 days |
+ | Math | 100M | ~25K | 0.5-1 day |
+ | QA | 150M | ~37K | 1-1.5 days |
+ | **Total** | **450M** | **~112K** | **3-5 days** |
 
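The budget table's numbers reduce to two ratios: tokens per step and steps per day. A back-of-envelope check, where `TOKENS_PER_STEP` is an assumed effective batch size (not stated in the plan) and step times come from the QLoRA table:

```python
# Back-of-envelope check of the per-specialist budget table.
SECONDS_PER_DAY = 86_400
TOKENS_PER_STEP = 4_096   # assumption: effective batch of 4K tokens per step

def steps(tokens):
    return tokens / TOKENS_PER_STEP

def compute_days(tokens, step_seconds):
    return steps(tokens) * step_seconds / SECONDS_PER_DAY

print(round(steps(200e6) / 1000))          # code specialist: ~49K steps
print(round(compute_days(200e6, 1.0), 1))  # ~0.6 days of pure step time
```

The gap between ~0.6 days of raw step time and the table's "1-2 days" leaves headroom for data loading, evaluation, and checkpointing.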
+ ## The 14-Day Plan
 
+ | Day | Task | Validation |
+ |-----|------|------------|
+ | 1 | Download SmolLM3, verify, setup QLoRA | Generates text OK |
+ | 2 | Fine-tune code specialist | Writes Python functions |
+ | 3 | Fine-tune math specialist | Solves GSM8K problems |
+ | 4 | Fine-tune QA specialist | Answers from context |
+ | 5-6 | H4 progressive swap Phase 1 | Perplexity within 5% |
+ | 7-8 | H4 progressive swap Phase 2 | Gate values meaningful |
+ | 9-10 | H4 selective swap + ternary | Chamber preservation >80% |
+ | 11 | ChamberTree router | Routes correctly |
+ | 12 | E8 knowledge index (Wikipedia) | Retrieval finds facts |
+ | 13 | Integration + demo | End-to-end works |
+ | 14 | Benchmarks + upload to HF | Numbers documented |
 
+ **Cost:** 3-5 days specialist training + 6-9 days H4 swap = ~9-14 days total. On cloud: ~$50-100. On laptops: $0.
 
 
  ## Honest Quality Expectations
 
+ | Task | SmolLM3-3B base | + Specialist FT | + E8 Retrieval | Opus |
+ |------|----------------|-----------------|----------------|------|
+ | MMLU | ~60% | ~62% | ~70-75% | ~88% |
+ | HumanEval | ~45% | ~55-65% | N/A | ~85% |
+ | GSM8K | ~55% | ~65-75% | N/A | ~95% |
+ | TriviaQA | ~50% | ~55% | **~85-90%** | ~90% |
+ | Instruction | ~80% | ~82% | N/A | ~95% |
+ | Long context | Good to 128K | Same | Better | 200K |
+ | **Cost** | **$0** | **$0** | **$0** | **$$$** |
+ | **Privacy** | **Local** | **Local** | **Local** | **Cloud** |
+
+ The retrieval-augmented factual QA (85-90%) is where we compete directly with frontier models. Everything else is 60-85% of Opus.
+
+ **This is NOT Claude Opus quality across the board.** It IS:
+ - 85-90% on factual QA (retrieval advantage --- the model looks up facts instead of hallucinating)
  - 75-85% on instruction following (good enough for most tasks)
+ - 55-75% on code and math (honest gap --- complex reasoning needs more params)
+ - Free, private, local, legally clean, and improvable by the community
 
  ## The Vision
 
+ A laptop running 4 focused specialists, routed by H4 geometry in <1ms, backed by unlimited knowledge retrieval from E8 lattice memory in 20ms, reranked to 98.5% accuracy. Not as good as Claude Opus at everything. But good enough at most things, free to run, private by default, and available to anyone with a computer.
 
+ **That's not a replacement for frontier models. It's an alternative for the billions of people who can't afford them.**
 
  ---