Project Olympus: Frontier-Quality AI on CPU

Goal

Build a system that approaches frontier model quality (Claude Opus, GPT-4 class) running entirely on CPU hardware, using only legally clean open-source models and data. No GPU. No API dependency. No monthly cost. No legal risk.

This is for the billions of people who can't afford frontier AI subscriptions and GPU compute. Good-enough answers on free hardware beat perfect answers on expensive hardware --- for education, small business, developing nations, and anyone who values privacy and independence.

The Core Insight

Claude Opus is one giant model that memorizes everything in its weights. We build focused specialists that know their domain deeply and retrieve everything else from a geometric knowledge index.

The difference:

Opus (size undisclosed; assume 200B+ params): 200B x 16 bits = ~400GB weights. Needs GPU cluster.
Ours: 4 specialists x 3B params x 1.58 bits = ~2.4GB total. Runs on laptop.
The gap is filled by E8 lattice retrieval (R@5=100%) from a knowledge index.

For factual questions, a 3B model that can look up a fact in 20ms gives the user the same answer as a 200B model that memorized it.
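The size comparison above is plain arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch follows; the function name is illustrative, and the 200B figure is this document's assumption about a frontier model, not a published number.

```python
def weight_footprint_gb(n_params: float, bits_per_param: float, n_models: int = 1) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = n_params * bits_per_param * n_models
    return total_bits / 8 / 1e9

# Figures taken from the comparison above (200B is an assumed size)
dense = weight_footprint_gb(200e9, 16)
ours = weight_footprint_gb(3e9, 1.58, n_models=4)

print(f"dense fp16 model:      {dense:.0f} GB")   # 400 GB
print(f"4 ternary specialists: {ours:.2f} GB")    # 2.37 GB
```

This counts weights only. Activations, the KV cache, and the retrieval index add to the real memory footprint.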

What's Already Proven

This project builds on the H4 Polytopic Attention foundation (7 phases, all tested):

Component	Status	Result
H4 geometric attention	Proven	O(log t), 10.6x speedup at 65K keys
Ternary quantization	Proven	0.003 bpb gap, ~17x compression
E8 lattice retrieval	Proven	R@5=100%, 20ms, 240-neighbor Voronoi search
MiniLM reranking	Proven	R@1=98.5% on bi-encoder candidates
Language generation	Proven	PPL 10.0 on TinyStories (beats 33M baseline)
CPU training	Proven	24M ternary params, 8 hours, coherent English
Autoresearch	Proven	42+ autonomous experiments, finds optimal configs
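For readers unfamiliar with the "240-neighbor" figure in the table: every E8 lattice point has exactly 240 nearest neighbors, and snapping a vector to its closest lattice point has a well-known closed-form procedure. The sketch below uses the standard construction of E8 as D8 together with D8 shifted by (1/2, ..., 1/2). It is not this project's index code, and how text embeddings get projected into 8 dimensions is outside its scope.

```python
from itertools import combinations, product

def e8_roots():
    """Return the 240 minimal vectors of E8 (squared norm 2)."""
    roots = []
    # 112 vectors: exactly two nonzero entries, each +1 or -1
    for i, j in combinations(range(8), 2):
        for si, sj in product((1.0, -1.0), repeat=2):
            v = [0.0] * 8
            v[i], v[j] = si, sj
            roots.append(tuple(v))
    # 128 vectors: every entry +-1/2, with an even number of minus signs
    for signs in product((0.5, -0.5), repeat=8):
        if sum(s < 0 for s in signs) % 2 == 0:
            roots.append(signs)
    return roots

def _nearest_d8(x):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    r = [round(c) for c in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest error the other way
        k = max(range(8), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return [float(c) for c in r]

def nearest_e8(x):
    """Nearest E8 point: the better of D8 and the coset D8 + (1/2, ..., 1/2)."""
    def dist2(p):
        return sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    a = _nearest_d8(x)
    b = [c + 0.5 for c in _nearest_d8([c - 0.5 for c in x])]
    return tuple(a if dist2(a) <= dist2(b) else b)

roots = e8_roots()
print(len(roots))               # 240
print(nearest_e8([0.45] * 8))   # the all-(1/2) lattice point
```

A Voronoi-cell lookup in this picture means quantizing the query with `nearest_e8` and then probing the 240 cells obtained by adding each root to that lattice point.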

The Base Model: SmolLM3-3B-Instruct

HuggingFace ID: HuggingFaceTB/SmolLM3-3B-Instruct

SmolLM3-3B (July 2025) is the correct base model. Using anything smaller would leave performance on the table:

11.2T training tokens (comparable to the ~11T used for SmolLM2-1.7B)
128K context window (vs 8K for SmolLM2)
Dual-mode reasoning (thinking + direct)
Outperforms Llama 3.2 3B and Qwen 2.5 3B on the benchmarks reported at its release
Apache 2.0 license --- full commercial use
Full training recipe published (data mixtures, hyperparameters, ablations)
Tool calling support built in

Why SmolLM3-3B over other options

Model	Params	License	Context	Trained on	Notes
SmolLM3-3B	3B	Apache 2.0	128K	11.2T tokens	Best in class, fully open
Phi-4-mini	3.8B	MIT	128K	Proprietary mix	Slightly larger, MIT is fine too
Qwen2.5-3B	3B	Apache 2.0	32K	Unknown size	Older, lower benchmarks
Llama 3.2 3B	3B	Llama License	128K	Up to 9T tokens	Meta license has usage limits
SmolLM2-1.7B	1.7B	Apache 2.0	8K	~11T tokens	Obsoleted by SmolLM3

Ternary size

Float32: 3B x 4 bytes = 12 GB
Float16: 3B x 2 bytes = 6 GB
Ternary (1.58 bit): 3B x 0.2 bytes = ~600 MB
With optimizer states for fine-tuning: ~4-8 GB total in RAM
Fits comfortably in 32 GB RAM for fine-tuning on CPU
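The "0.2 bytes" per weight above is achievable in practice: since 3^5 = 243 fits in one byte, five ternary weights can share a byte, which works out to 1.6 bits each. The sketch below shows that packing. It is one common scheme, assumed here for illustration; this document does not specify the on-disk format the project uses.

```python
def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) five per byte as base-3 digits."""
    out = bytearray()
    for start in range(0, len(weights), 5):
        value = 0
        # Reverse so the first weight of the chunk becomes the lowest digit
        for w in reversed(weights[start:start + 5]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)   # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)                  # at most 3**5 - 1 = 242, fits in a byte
    return bytes(out)

def unpack_trits(packed, n):
    """Inverse of pack_trits; n is the original number of weights."""
    weights = []
    for byte in packed:
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            weights.append(digit - 1)
    return weights[:n]                     # drop padding from the last byte

w = [-1, 0, 1, 1, 0, -1, -1]
packed = pack_trits(w)
print(len(packed))                         # 2 bytes for 7 weights
print(unpack_trits(packed, len(w)) == w)   # True
```

At five weights per byte, 3B weights take 3e9 / 5 = 600 MB, matching the figure above.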

Legal Foundation

This is NOT distillation. We do not use outputs from proprietary models as training data. Every component is legally clean.

Base Models (all Apache 2.0)

Model	Params	License	HuggingFace ID
SmolLM3-3B-Instruct	3B	Apache 2.0	HuggingFaceTB/SmolLM3-3B-Instruct
ms-marco-MiniLM-L-6-v2	22M	Apache 2.0	cross-encoder/ms-marco-MiniLM-L-6-v2
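The second entry is a cross-encoder: it reads a query and a passage together and returns one relevance score per pair. A minimal reranking sketch follows, assuming the `sentence-transformers` library and its `CrossEncoder.predict` method; the project may load the model differently. `rerank`, `load_cross_encoder_scorer`, and `overlap_scorer` are illustrative names, and the word-overlap scorer exists only so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]

def rerank(query: str, passages: Sequence[str], score_fn: ScoreFn,
           top_k: int = 1) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_k passages."""
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda ps: ps[1], reverse=True)
    return [(p, float(s)) for p, s in ranked[:top_k]]

def load_cross_encoder_scorer(
        model_id: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> ScoreFn:
    """Real scorer; imported lazily because it needs sentence-transformers."""
    from sentence_transformers import CrossEncoder
    return CrossEncoder(model_id).predict

def overlap_scorer(pairs):
    """Offline stand-in: number of lowercase words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split())))
            for q, p in pairs]

passages = [
    "Paris is the capital of France",
    "Berlin is in Germany",
    "France borders Spain",
]
top = rerank("capital of France", passages, overlap_scorer, top_k=2)
print(top[0][0])  # Paris is the capital of France
```

Swapping `overlap_scorer` for `load_cross_encoder_scorer()` gives the real model's scores with no other changes.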

Fine-tuning Data (all openly licensed)

Code Specialist:

Dataset	Size	License	HuggingFace ID
The Stack v2 (filtered)	~100M tokens	Per-file	bigcode/the-stack-v2
CodeAlpaca 20K	20K instructions	Apache 2.0	sahil2801/CodeAlpaca-20k
CodeFeedback	66K examples	Apache 2.0	m-a-p/CodeFeedback-Filtered-Instruction
Evol-Instruct-Code	~80K	Apache 2.0	nickrosh/Evol-Instruct-Code-80k-v1

Math/Reasoning Specialist:

Dataset	Size	License	HuggingFace ID
MetaMathQA	395K	MIT	meta-math/MetaMathQA
OpenMathInstruct v2	14M	Permissive	nvidia/OpenMathInstruct-2
GSM8K	8.5K	MIT	openai/gsm8k
MATH	12.5K	MIT	hendrycks/competition_math
ARC	7.7K	CC-BY-SA	allenai/ai2_arc

QA/Retrieval Specialist:

Dataset	Size	License	HuggingFace ID
Natural Questions	307K	CC-BY-SA	google-research-datasets/nq_open
SQuAD 2.0	150K	CC-BY-SA	rajpurkar/squad_v2
TriviaQA	95K	Apache 2.0	mandarjoshi/trivia_qa
HotpotQA	113K	CC-BY-SA	hotpot_qa

Knowledge Index:

Source	Size	License	Notes
Wikipedia EN	~4B tokens	CC-BY-SA	General encyclopedic knowledge
Stack Overflow	~10GB	CC-BY-SA	Programming Q&A
Project Gutenberg	70K books	Public domain	Literature
User's own docs	Variable	N/A	Custom knowledge base

Architecture

                        User Query
                            |
                            v
                   +---------------------+
                   |  ChamberTree        |  H4 geometric routing (<1ms)
                   |  Router             |  Maps query to specialist via
                   |  (16 chambers)      |  Coxeter chamber classification
                   +----------+----------+
                              |
              +-------+-------+-------+
              |       |       |       |
              v       v       v       v
        +--------+ +------+ +------+ +------+
        |General | | Code | | Math | |  QA  |  4 specialists
        | (3B)   | | (3B) | | (3B) | | (3B) |  SmolLM3-3B base
        | as-is  | | FT'd | | FT'd | | FT'd |  Ternary weights
        +---+----+ +--+---+ +--+---+ +--+---+
            |          |        |        |
            +----------+--------+--------+
                              |
                              v
                   +---------------------+
                   |  E8 Lattice         |  Knowledge retrieval
                   |  Memory             |  R@5=100%, 20ms
                   |  (Wikipedia,        |  240 kissing neighbors
                   |   docs, code)       |  Voronoi cell addressing
                   +----------+----------+
                              |
                              v
                   +---------------------+
                   |  MiniLM             |  Reranking
                   |  Cross-encoder      |  R@1=98.5%
                   |  (22M, float)       |  Picks best passage
                   +----------+----------+
                              |
                              v
                          Response
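The router stage at the top of the diagram can be pictured as a sign test against four hyperplanes: four yes/no answers give 2^4 = 16 chambers, and each chamber is assigned to a specialist. The sketch below assumes that reading of "16 chambers". The normals are placeholder axis directions rather than the actual H4 simple roots, and the chamber-to-specialist table is arbitrary; in the real system both would come from the trained router.

```python
from typing import Sequence

SPECIALISTS = ("general", "code", "math", "qa")

# Placeholder hyperplane normals (NOT the H4 simple roots)
PLACEHOLDER_NORMALS = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
]

# Placeholder assignment of the 16 chambers to the 4 specialists
PLACEHOLDER_TABLE = [SPECIALISTS[c % 4] for c in range(16)]

def chamber_id(x: Sequence[float], normals: Sequence[Sequence[float]]) -> int:
    """Encode which side of each hyperplane x falls on as a bit pattern."""
    cid = 0
    for bit, n in enumerate(normals):
        dot = sum(xi * ni for xi, ni in zip(x, n))
        if dot >= 0:
            cid |= 1 << bit
    return cid

def route(x, normals=PLACEHOLDER_NORMALS, table=PLACEHOLDER_TABLE) -> str:
    """Map a 4-D query projection to a specialist name."""
    return table[chamber_id(x, normals)]

print(chamber_id((1, -1, 1, -1), PLACEHOLDER_NORMALS))  # 5
print(route((1, -1, 1, -1)))                            # code
```

Four dot products and a table lookup is why routing can stay well under a millisecond.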

Why 4 Specialists Instead of 6

With SmolLM3-3B as the base (much stronger than SmolLM2-1.7B), we don't need 6 specialists. The base model is already strong at conversation, creative writing, and summarization. We only specialize where it matters:

#	Specialist	Base	Fine-tuning	Why Separate
0	General	SmolLM3-3B-Instruct AS-IS	None needed	Already instruction-tuned
1	Code	SmolLM3-3B + code data	~200M tokens	Code needs 80%+ domain data
2	Math	SmolLM3-3B + math data	~100M tokens	Weakest area for small models
3	QA	SmolLM3-3B + retrieval QA	~150M tokens	Learn to answer FROM context

Total active RAM: ~600MB (one specialist loaded at a time) + 90MB MiniLM + E8 index
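Keeping only one specialist resident amounts to a single-slot cache: load on demand, and release the previous model before loading the next. A minimal sketch of that policy follows. `SpecialistSlot` and `fake_loader` are hypothetical names, and the loader here returns a string instead of reading ~600MB of weights.

```python
from typing import Any, Callable, Optional

class SpecialistSlot:
    """Holds at most one loaded specialist; switching evicts the previous one."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader               # maps a specialist name to a loaded model
        self._name: Optional[str] = None
        self._model: Any = None

    def get(self, name: str) -> Any:
        if name != self._name:
            self._model = None              # drop the old reference before loading
            self._model = self._loader(name)
            self._name = name
        return self._model

# Stand-in loader that records calls instead of touching disk
load_log = []

def fake_loader(name: str) -> str:
    load_log.append(name)
    return f"<{name} weights>"

slot = SpecialistSlot(fake_loader)
slot.get("code")
slot.get("code")    # already resident, no reload
slot.get("math")    # evicts "code", loads "math"
print(load_log)     # ['code', 'math']
```

The trade-off is a reload delay whenever consecutive queries hit different specialists. A machine with more RAM could widen the slot to hold two or more.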

H4 Attention Integration

SmolLM3 uses grouped-query attention (GQA) with 4 groups, which maps naturally onto the 4 simple roots of the H4 Coxeter group.

Progressive swap in 4 phases:

Adapter (Days 1-3): Freeze SmolLM3, add H4 adapter parallel to each GQA layer. Gate starts at 0. Train only H4 params.
Hybrid (Days 3-7): Unfreeze SmolLM3 attention. Both paths train. Monitor which layers prefer H4.
Selective swap (Days 7-10): Layers with gate >0.8 keep only H4. Layers with gate <0.3 keep only original. Others stay hybrid.
Ternary (Day 10): Apply BitLinear to H4 layers. Export final model.

What this gives you: O(log t) attention for long sequences (SmolLM3's stock attention over its 128K context stays O(t^2) in compute, even with Flash Attention), ternary weights (~600MB for the whole model), and E8 lattice integration for retrieval.
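The gate logic in the four phases can be sketched in a few lines. The blend formula below (a convex mix of the two attention outputs) is an assumption; the plan only states that the gate starts at 0 and that layers are kept or swapped at the 0.8 and 0.3 thresholds. Function names are illustrative.

```python
def gated_mix(original_out, h4_out, gate):
    """Blend two attention outputs; gate=0 reproduces the original layer exactly."""
    return [(1.0 - gate) * o + gate * h for o, h in zip(original_out, h4_out)]

def swap_decision(gate, keep_h4_above=0.8, keep_original_below=0.3):
    """Phase 3 rule: decide what a layer keeps based on its learned gate."""
    if gate > keep_h4_above:
        return "h4"
    if gate < keep_original_below:
        return "original"
    return "hybrid"

# Phase 1 starts every layer at gate=0, so the model's output is unchanged
print(gated_mix([1.0, 2.0], [5.0, 6.0], 0.0))   # [1.0, 2.0]

# Hypothetical gate values after the hybrid phase
for layer, g in enumerate([0.05, 0.55, 0.92]):
    print(layer, swap_decision(g))
# 0 original
# 1 hybrid
# 2 h4
```

Starting the gate at 0 means the adapter phase cannot degrade the pretrained model: any perplexity change comes only from gates that training has opened.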

Fine-Tuning: QLoRA on CPU

Full fine-tuning of 3B params on CPU is slow. QLoRA is 3-6x faster because only 1-2% of parameters get gradients:

Method	Step time	Steps/day	Trainable params
Full fine-tune 3B on CPU	~3s	~28K	3B (100%)
QLoRA 3B on CPU	~0.5-1s	~86-170K	~20-50M (1-2%)

Per-specialist training budget

Specialist	Tokens	Steps	Time
Code	200M	~50K	1-2 days
Math	100M	~25K	0.5-1 day
QA	150M	~37K	1-1.5 days
Total	450M	~112K	3-5 days

The 14-Day Plan

Day	Task	Validation
1	Download SmolLM3, verify, setup QLoRA	Generates text OK
2	Fine-tune code specialist	Writes Python functions
3	Fine-tune math specialist	Solves GSM8K problems
4	Fine-tune QA specialist	Answers from context
5-6	H4 progressive swap Phase 1	Perplexity within 5%
7-8	H4 progressive swap Phase 2	Gate values meaningful
9-10	H4 selective swap + ternary	Chamber preservation >80%
11	ChamberTree router	Routes correctly
12	E8 knowledge index (Wikipedia)	Retrieval finds facts
13	Integration + demo	End-to-end works
14	Benchmarks + upload to HF	Numbers documented

Cost: 3-5 days specialist training + 6-9 days H4 swap = ~9-14 days total. On cloud: ~$50-100. On laptops: $0.

Honest Quality Expectations

Task	SmolLM3-3B base	+ Specialist FT	+ E8 Retrieval	Opus
MMLU	~60%	~62%	~70-75%	~88%
HumanEval	~45%	~55-65%	N/A	~85%
GSM8K	~55%	~65-75%	N/A	~95%
TriviaQA	~50%	~55%	~85-90%	~90%
Instruction	~80%	~82%	N/A	~95%
Long context	Good to 128K	Same	Better	200K
Cost	$0	$0	$0	$$$
Privacy	Local	Local	Local	Cloud

Retrieval-augmented factual QA (85-90%) is where we compete directly with frontier models. On everything else, the projected scores land at roughly 65-85% of the Opus column.

This is NOT Claude Opus quality across the board. It IS:

85-90% projected accuracy on factual QA (the retrieval advantage: the model looks up facts instead of hallucinating)
75-85% on instruction following (good enough for most tasks)
55-75% on code and math (an honest gap: complex reasoning needs more parameters)
Free, private, local, legally clean, and improvable by the community

The Vision

A laptop running 4 focused specialists, routed by H4 geometry in <1ms, backed by unlimited knowledge retrieval from E8 lattice memory in 20ms, reranked to 98.5% accuracy. Not as good as Claude Opus at everything. But good enough at most things, free to run, private by default, and available to anyone with a computer.

That's not a replacement for frontier models. It's an alternative for the billions of people who can't afford them.

Project Olympus is built on the H4 Polytopic Attention foundation. See README.md for the full technical documentation, RESULTS.md for all experiment results, and docs/PAPER.md for the arXiv paper draft.