---
language:
- en
license: cc-by-nc-nd-4.0
tags:
- mixture-of-experts
- foundation-training
- curriculum-learning
- sparse
- reasoning
- pytorch
pipeline_tag: text-generation
---

# ARCHE3-7B
|
|
A 7B sparse MoE language model pre-trained on structured reasoning patterns before any text. Built solo on consumer hardware.
|
|
---
|
|
## Core idea: Foundation Curriculum Training
|
|
Standard LLM pre-training optimizes for next-token prediction on raw text. The model learns statistical regularities (what words follow what words) but has no structured inductive bias toward causal reasoning, transfer, or analogy.
|
|
**Foundation Curriculum Training (FCT)** is a pre-training methodology that addresses this directly. Before any text data, the model trains on 290 structured reasoning patterns across 14 cognitive domains. Each pattern encodes a reasoning update in a fixed format:
|
|
```
OBSERVE → what changed in the world
PRIOR   → what the model believed before
UPDATE  → how that belief should change
RIPPLE  → which other domains are affected
ANALOGY → a structural parallel from another field
ACT     → what behavior should change
```
|
|
This structure directly encodes predictive coding: the model must predict how beliefs update given new observations, and how that update propagates across domains.
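
As a concrete illustration, one FCT pattern could be represented as a flat record and serialized into a training sequence. This is a hypothetical sketch: the field names follow the six slots above, but the class name, the `SLOT: text` serialization, and the example content are assumptions, not taken from the actual codebase.

```python
from dataclasses import dataclass

@dataclass
class FCTPattern:
    """One structured reasoning pattern in the six-slot FCT format."""
    observe: str   # what changed in the world
    prior: str     # what the model believed before
    update: str    # how that belief should change
    ripple: str    # which other domains are affected
    analogy: str   # a structural parallel from another field
    act: str       # what behavior should change
    domain: str    # one of the 14 cognitive domains

    def to_sequence(self) -> str:
        # Flatten the pattern into one newline-separated training sequence.
        return "\n".join([
            f"OBSERVE: {self.observe}",
            f"PRIOR: {self.prior}",
            f"UPDATE: {self.update}",
            f"RIPPLE: {self.ripple}",
            f"ANALOGY: {self.analogy}",
            f"ACT: {self.act}",
        ])

# Hypothetical example pattern from the "systems" domain.
p = FCTPattern(
    observe="a feedback delay grew longer",
    prior="the system responds immediately",
    update="expect oscillation around the setpoint",
    ripple="control, economics",
    analogy="thermostat overshoot",
    act="damp corrections instead of amplifying them",
    domain="systems",
)
print(p.to_sequence())
```

Each of the 290 patterns would expand into several such sequences, consistent with the 1,160 training sequences reported below.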
|
|
The hypothesis: if abstract reasoning structure is instilled through a small, carefully designed dataset *before* large-scale text training, the model may generalize more robustly and be more sample-efficient on downstream domains.
|
|
---
|
|
## Why this might matter
|
|
Current approaches to improving LLM reasoning (chain-of-thought, RLHF, process reward models) work at inference or fine-tuning time. FCT operates at pre-training time: it tries to shape what the model learns to represent, not just how it produces output.
|
|
The FCT dataset is 290 patterns, or 1,160 training sequences. The observed effect (cross-domain loss reduction, where each new domain starts at a lower loss than the previous domain's baseline) is attributable to structure, not scale.
|
|
This is a direction worth exploring further, not a solved problem.
|
|
---
|
|
## What's actually new here
|
|
- **Structured pre-training before text**: FCT as a pre-training stage, not fine-tuning
- **Cross-domain transfer signal**: measurable loss reduction across unrelated domains after FCT
- **Split Dense Core**: separate Input and Fusion processing stages around the MoE layer
- **SmartRouter**: four-mechanism anti-collapse system for MoE routing (load balance + entropy bonus + jitter + adaptive temperature)
- **Consumer hardware scale**: full 7B training on a MacBook M-series with 16 GB RAM
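
To make the SmartRouter bullet concrete, here is a toy numpy sketch of the four mechanisms. The real implementation lives in `hive_router.py`; every function name, constant, and exact formula below is illustrative (e.g. a Switch-Transformer-style load-balance term is assumed), not the actual code.

```python
import numpy as np

def _softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def smart_route(logits, top_k=4, temperature=1.0, jitter_std=0.01,
                expert_load=None, rng=None):
    """Toy top-k router illustrating four anti-collapse mechanisms.

    logits: (n_tokens, n_experts) raw routing scores.
    Returns (indices, weights, aux); aux holds the auxiliary terms
    that would be added to / subtracted from the training loss.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_tokens, n_experts = logits.shape

    # 1. Jitter: small noise on the logits so near-tied experts
    #    both receive gradient signal during training.
    noisy = logits + rng.normal(0.0, jitter_std, logits.shape)

    # 2. Adaptive temperature: flatten the routing distribution
    #    when recent expert load was unbalanced.
    if expert_load is not None:
        imbalance = expert_load.max() / max(expert_load.mean(), 1e-9)
        temperature = temperature * min(imbalance, 4.0)

    probs = _softmax(noisy / temperature)

    # Top-k selection with renormalized mixture weights.
    idx = np.argsort(-probs, axis=1)[:, :top_k]
    w = np.take_along_axis(probs, idx, axis=1)
    w = w / w.sum(axis=1, keepdims=True)

    # 3. Load-balance auxiliary loss: fraction of tokens whose top
    #    choice is expert e, times e's mean routing probability.
    frac = np.bincount(idx[:, 0], minlength=n_experts) / n_tokens
    load_loss = n_experts * float((frac * probs.mean(axis=0)).sum())

    # 4. Entropy bonus: high routing entropy is rewarded
    #    (this term would be subtracted from the loss).
    entropy = float(-(probs * np.log(probs + 1e-9)).sum(axis=1).mean())

    return idx, w, {"load_balance": load_loss, "entropy_bonus": entropy}
```

With 20,480 experts and top-k=4, roughly 0.02% of experts run per token; terms like these are what keep a router from collapsing onto a handful of them.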
|
|
---
|
|
## What is implemented
|
|
14 Python scripts in pure PyTorch, with no dependencies beyond PyTorch and numpy:
|
|
| Component | Description |
|---|---|
| `arche3_model.py` | Split Dense Core + HierarchicalMoE + ExpertManager (LRU cache) |
| `arche3_trainer.py` | AdamW for Dense Core + SparseExpertAdamW (updates only activated experts) |
| `hive_trainer.py` | FCT training pipeline |
| `hive_store.py` | BF16 memory-mapped expert storage (hive.bin) |
| `arche3_tokenizer.py` | Custom BPE, no external dependencies |
| `arche3_ethical_core.py` | Objective-based action evaluation |
| `arche3_world_model.py` | Visual state reasoning module |
| `data_loader.py` | Multi-format dataset loader |
| `arche3_main.py` | Interactive CLI |
|
Public in this repo: `hive_router.py` (SmartRouter) and `arche3_config.py` (all hyperparameters).
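
The SparseExpertAdamW idea from the table above can be sketched as follows. This is a minimal numpy toy, not the actual `arche3_trainer.py` code; class name, defaults, and the per-expert state layout are assumptions. The point is that optimizer state is created and touched only for experts activated in the current batch, so the 20,480 inactive experts cost nothing.

```python
import numpy as np

class SparseExpertAdamW:
    """Toy AdamW that keeps per-expert state and steps only the
    experts activated in the current batch; untouched experts keep
    their weights and never materialize momentum/variance buffers."""

    def __init__(self, experts, lr=1e-4, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.experts = experts            # dict: expert_id -> weight array
        self.lr, self.eps, self.wd = lr, eps, weight_decay
        self.b1, self.b2 = betas
        self.state = {}                   # lazily created per expert

    def step(self, grads):
        # grads: dict expert_id -> gradient, only for activated experts.
        for eid, g in grads.items():
            st = self.state.setdefault(
                eid, {"t": 0, "m": np.zeros_like(g), "v": np.zeros_like(g)})
            st["t"] += 1
            st["m"] = self.b1 * st["m"] + (1 - self.b1) * g
            st["v"] = self.b2 * st["v"] + (1 - self.b2) * g * g
            m_hat = st["m"] / (1 - self.b1 ** st["t"])
            v_hat = st["v"] / (1 - self.b2 ** st["t"])
            w = self.experts[eid]
            # Decoupled weight decay, then the Adam update, in place.
            w *= (1 - self.lr * self.wd)
            w -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

With top-k=4 routing, a step touches at most 4 experts per token regardless of the 20,480-expert total.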
|
|
---
|
|
## Architecture
|
|
```
Tokens → Embedding
  → Dense Core Input (5 × TransformerBlock, GQA 16/4, d_model=2048)
  → HierarchicalMoE (8 domains × 2,560 experts = 20,480 total, top-k=4)
  → Dense Core Fusion (5 × TransformerBlock, integrates MoE output)
  → Output logits (weight-tied)
```
|
|
**Memory**: experts stored in hive.bin (BF16, ~12.5 GB), LRU cache of 32 slots in RAM. Peak RAM ~8 GB training, ~3.5 GB inference.
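
A minimal sketch of the hive.bin idea, assuming BF16 is emulated as the top 16 bits of a float32 (numpy has no native bfloat16) and assuming the 32-slot LRU policy described above. Class and method names are hypothetical, not the actual `hive_store.py` API.

```python
import numpy as np
from collections import OrderedDict

def f32_to_bf16_bits(x):
    # BF16 is the top 16 bits of an IEEE float32 (truncation rounding here).
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits):
    return (bits.astype(np.uint32) << 16).view(np.float32)

class ExpertStore:
    """Toy hive.bin-style store: experts live on disk as BF16 bit
    patterns; an LRU cache keeps the most recent `slots` in RAM."""

    def __init__(self, path, n_experts, expert_size, slots=32):
        self.mm = np.memmap(path, dtype=np.uint16, mode="w+",
                            shape=(n_experts, expert_size))
        self.slots = slots
        self.cache = OrderedDict()        # expert_id -> float32 weights

    def write(self, eid, weights):
        self.mm[eid] = f32_to_bf16_bits(weights)
        self.cache.pop(eid, None)         # invalidate stale cache entry

    def read(self, eid):
        if eid in self.cache:             # LRU hit: move to the back
            self.cache.move_to_end(eid)
            return self.cache[eid]
        w = bf16_bits_to_f32(np.array(self.mm[eid]))
        self.cache[eid] = w
        if len(self.cache) > self.slots:  # evict least recently used
            self.cache.popitem(last=False)
        return w
```

At 2 bytes per parameter, ~12.5 GB of BF16 storage is consistent with the sparse expert parameters dominating the 7B total, while only 32 cached experts ever occupy RAM.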
|
|
---
|
|
## Evidence
|
|
Benchmarks were run on a custom evaluation scale (AIS) measuring reasoning structure. Results are limited: only 5 of the 14 FCT domains were trained in this run.
|
|
| Block | Score | Max | Notes |
|---|---|---|---|
| FCT Reasoning | 45 | 100 | 5/14 domains trained |
| Values & Reflection | 15 | 30 | |
| Dopamine Autonomy | 20 | 20 | |
| **Normalized** | **53/100** | – | "Strong LLM" band |
|
|
FCT training loss across the 5 trained domains:
|
|
| Domain | Start | End | Reduction |
|---|---|---|---|
| Systems | 2.149 | 0.941 | 56% |
| Mathematics | – | < 0.7 | – |
| Physics | – | < 0.7 | – |
| Biology | – | < 0.7 | – |
| Cognition | – | < 0.7 | – |
|
|
Standard benchmarks (MMLU, GSM8K, HumanEval) have not been run. This is on the roadmap.
|
|
Full notebooks: [arche3-benchmarks](https://github.com/OpenSynapseLabs/arche3-benchmarks)
|
|
---
|
|
## Limitations
|
|
- **No standard benchmark evaluation**: AIS is a custom scale; results are not comparable to published models
- **FCT dataset is small**: 290 patterns; whether the transfer effect scales with more patterns is unknown
- **Partial training**: only 5 of 14 FCT domains completed in the reported run
- **Single-author implementation**: code has not been independently reviewed or reproduced
- **Consumer hardware**: training was done on a MacBook M-series; larger runs require proper compute
|
|
---
|
|
## Access
|
|
**Public** (this repo):
- `hive_router.py`: SmartRouter implementation
- `arche3_config.py`: full configuration
|
|
**Gated access**: full 14-script codebase.
Click **Access repository** above. Describe who you are and what you want to do with it.
|
|
---
|
|
## Looking for collaborators
|
|
This project is at an early stage. The next step is ARCHE3-35B with a photonic inference chip (ArchePhoton-35), a purpose-built MZI-based processor for sparse expert inference.
|
|
I'm looking for people who find this direction interesting and want to work on it:
|
|
- **ML Research Engineer**: MoE architectures, sparse training, PyTorch
- **Photonic IC Designer**: MZI layout, GST/PCM memory, IMEC PDK
- **RTL / RISC-V Engineer**: custom instruction extensions for sparse inference
- **Photonic Systems Researcher**: optical neural networks, PCM
|
|
No salary at this stage. If that's not workable, that's understood.
|
|
Email: opensynapselabs@proton.me
GitHub: [github.com/OpenSynapseLabs](https://github.com/OpenSynapseLabs)
Preprint: [Zenodo](https://doi.org/10.5281/zenodo.18738608)
|
|
---
|
|
*© 2026 Ilya Osovskoi / Open Synapse Labs*
*Public files: CC BY-NC-ND 4.0 · Full codebase: All Rights Reserved*