---
language:
- en
license: cc-by-nc-nd-4.0
tags:
- mixture-of-experts
- foundation-training
- curriculum-learning
- sparse
- reasoning
- pytorch
pipeline_tag: text-generation
---
# ARCHE3-7B
A 7B sparse MoE language model trained with structured reasoning patterns before text. Built solo on consumer hardware.
---
## Core idea: Foundation Curriculum Training
Standard LLM pre-training optimizes for next-token prediction on raw text. The model learns statistical regularities (what words follow what words) but has no structured inductive bias toward causal reasoning, transfer, or analogy.
**Foundation Curriculum Training (FCT)** is a pre-training methodology that addresses this directly. Before seeing any text data, the model trains on 290 structured reasoning patterns across 14 cognitive domains. Each pattern encodes a reasoning update in a fixed format:
```
OBSERVE → what changed in the world
PRIOR → what the model believed before
UPDATE → how that belief should change
RIPPLE → which other domains are affected
ANALOGY → a structural parallel from another field
ACT → what behavior should change
```
This structure directly encodes predictive coding: the model must predict how beliefs update given new observations, and how that update propagates across domains.
The hypothesis: if abstract reasoning structure is instilled through a small, carefully designed dataset *before* large-scale text training, the model may generalize more robustly and be more sample-efficient on downstream domains.
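As a concrete illustration, one FCT pattern could be serialized into a single training sequence in this format. The sketch below is hypothetical: the field names follow the template above, but the pattern content and the `render_pattern` helper are invented for illustration and are not the repo's pipeline.

```python
# Hypothetical sketch: rendering one FCT pattern into the fixed
# six-field training format. The example content is invented;
# the real 290-pattern dataset is not public.
FIELDS = ["OBSERVE", "PRIOR", "UPDATE", "RIPPLE", "ANALOGY", "ACT"]

def render_pattern(pattern: dict) -> str:
    """Serialize a pattern dict into one training sequence."""
    return "\n".join(f"{field} → {pattern[field]}" for field in FIELDS)

example = {
    "OBSERVE": "a population's birth rate fell for five straight years",
    "PRIOR": "growth was assumed to continue at the historical rate",
    "UPDATE": "shift the growth forecast downward",
    "RIPPLE": "pension funding, school planning, housing demand",
    "ANALOGY": "a chemical reaction slowing as a reactant depletes",
    "ACT": "re-run downstream projections with the revised rate",
}
print(render_pattern(example))
```

Each sequence is short, but every one forces the model to connect an observation to a belief update and its cross-domain consequences.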
---
## Why this might matter
Current approaches to improving LLM reasoning (chain-of-thought, RLHF, process reward models) work at inference or fine-tuning time. FCT operates at pre-training time: it tries to shape what the model learns to represent, not just how it produces outputs.
The FCT dataset is 290 patterns (1,160 training sequences). The observed effect, a cross-domain loss reduction in which each new domain starts at a lower loss than the previous domain's baseline, is attributable to structure rather than scale.
This is a direction worth exploring further, not a solved problem.
---
## What's actually new here
- **Structured pre-training before text**: FCT as a pre-training stage, not fine-tuning
- **Cross-domain transfer signal**: measurable loss reduction across unrelated domains after FCT
- **Split Dense Core**: separate Input and Fusion processing stages around the MoE layer
- **SmartRouter**: four-mechanism anti-collapse system for MoE routing (load balance + entropy bonus + jitter + adaptive temperature)
- **Consumer hardware scale**: full 7B training on MacBook M-series, 16GB RAM
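To make the SmartRouter bullet concrete, here is a minimal sketch of what a four-mechanism anti-collapse router could look like. The actual implementation is public in `hive_router.py` and may differ; every name, constant, and formula below (including the Switch-style balance term) is an assumption.

```python
import torch
import torch.nn.functional as F

def smart_route(logits, k=4, temperature=1.0, jitter_std=0.01,
                entropy_coef=0.01, balance_coef=0.01, training=True):
    """Hypothetical anti-collapse top-k routing sketch (not the repo's code).

    logits: [tokens, experts] raw router scores.
    Returns top-k expert indices, normalized weights, and an auxiliary
    loss combining a load-balance penalty with an entropy bonus.
    """
    if training and jitter_std > 0:
        # Jitter: small noise so near-ties break differently each step.
        logits = logits + torch.randn_like(logits) * jitter_std

    # Temperature: a fixed scalar placeholder here; an adaptive version
    # might anneal it based on observed routing entropy.
    probs = F.softmax(logits / temperature, dim=-1)
    weights, indices = probs.topk(k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)

    # Load-balance loss (Switch-Transformer style): penalize uneven
    # expert usage by coupling mean probability and mean assignment.
    num_experts = logits.size(-1)
    assign = torch.zeros_like(probs).scatter_(-1, indices, 1.0)
    balance = num_experts * (probs.mean(0) * assign.mean(0)).sum()

    # Entropy bonus: reward high routing entropy to resist collapse.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

    aux_loss = balance_coef * balance - entropy_coef * entropy
    return indices, weights, aux_loss
```

The four mechanisms attack collapse from different angles: jitter and temperature perturb the routing decision itself, while the balance and entropy terms add gradient pressure against degenerate expert usage.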
---
## What is implemented
14 Python scripts built on PyTorch, with no external dependencies beyond NumPy:
| Component | Description |
|---|---|
| `arche3_model.py` | Split Dense Core + HierarchicalMoE + ExpertManager (LRU cache) |
| `arche3_trainer.py` | AdamW for Dense Core + SparseExpertAdamW (updates only activated experts) |
| `hive_trainer.py` | FCT training pipeline |
| `hive_store.py` | BF16 memory-mapped expert storage (hive.bin) |
| `arche3_tokenizer.py` | Custom BPE, no external dependencies |
| `arche3_ethical_core.py` | Objective-based action evaluation |
| `arche3_world_model.py` | Visual state reasoning module |
| `data_loader.py` | Multi-format dataset loader |
| `arche3_main.py` | Interactive CLI |
Public in this repo: `hive_router.py` (SmartRouter) and `arche3_config.py` (all hyperparameters).
---
## Architecture
```
Tokens → Embedding
→ Dense Core Input (5 × TransformerBlock, GQA 16/4, d_model=2048)
→ HierarchicalMoE (8 domains × 2,560 experts = 20,480 total, top-k=4)
→ Dense Core Fusion (5 × TransformerBlock, integrates MoE output)
→ Output logits (weight-tied)
```
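The pipeline above can be sketched as a forward pass. Block counts, `d_model`, and the weight tying follow the diagram; everything else is a simplifying assumption, including the stand-in linear layer where `HierarchicalMoE` would sit and the stock encoder block, which uses plain multi-head attention rather than GQA.

```python
import torch
import torch.nn as nn

class SplitDenseCoreSketch(nn.Module):
    """Illustrative Input → MoE → Fusion layout (not the repo's model)."""

    def __init__(self, vocab, d_model=2048, n_input=5, n_fusion=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        block = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=16, batch_first=True)
        self.input_core = nn.ModuleList(block() for _ in range(n_input))
        self.moe = nn.Linear(d_model, d_model)   # placeholder for HierarchicalMoE
        self.fusion_core = nn.ModuleList(block() for _ in range(n_fusion))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens):
        x = self.embed(tokens)
        for blk in self.input_core:              # Dense Core Input
            x = blk(x)
        x = x + self.moe(x)                      # sparse expert layer (stub)
        for blk in self.fusion_core:             # Dense Core Fusion
            x = blk(x)
        # Weight tying: the output projection reuses the embedding matrix.
        return self.norm(x) @ self.embed.weight.t()
```

Splitting the dense core gives the Fusion stage dedicated capacity to integrate expert outputs before the tied output projection, rather than interleaving MoE layers throughout the stack.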
**Memory**: experts stored in hive.bin (BF16, ~12.5 GB), LRU cache of 32 slots in RAM. Peak RAM ~8 GB training, ~3.5 GB inference.
---
## Evidence
Benchmarks were run on a custom evaluation scale (AIS) measuring reasoning structure. Results are limited: only 5 of 14 FCT domains were trained in this run.
| Block | Score | Max | Notes |
|---|---|---|---|
| FCT Reasoning | 45 | 100 | 5/14 domains trained |
| Values & Reflection | 15 | 30 | |
| Dopamine Autonomy | 20 | 20 | |
| **Normalized** | **53/100** | - | "Strong LLM" band |
FCT training loss across 5 domains:
| Domain | Start | End | Reduction |
|---|---|---|---|
| Systems | 2.149 | 0.941 | 56% |
| Mathematics | - | < 0.7 | - |
| Physics | - | < 0.7 | - |
| Biology | - | < 0.7 | - |
| Cognition | - | < 0.7 | - |
Standard benchmarks (MMLU, GSM8K, HumanEval) have not been run. This is on the roadmap.
Full notebooks: [arche3-benchmarks](https://github.com/OpenSynapseLabs/arche3-benchmarks)
---
## Limitations
- **No standard benchmark evaluation**: AIS is a custom scale; results are not comparable to published models
- **FCT dataset is small**: 290 patterns; whether the transfer effect scales with more patterns is unknown
- **Partial training**: only 5 of 14 FCT domains completed in the reported run
- **Single-author implementation**: the code has not been independently reviewed or reproduced
- **Consumer hardware**: training was done on a MacBook M-series; larger runs require proper compute
---
## Access
**Public** (this repo):
- `hive_router.py`: SmartRouter implementation
- `arche3_config.py`: full configuration
**Gated access**: the full 14-script codebase.
Click **Access repository** above. Describe who you are and what you want to do with it.
---
## Looking for collaborators
This project is at an early stage. The next step is ARCHE3-35B with a photonic inference chip (ArchePhoton-35), a purpose-built MZI-based processor for sparse expert inference.
I'm looking for people who find this direction interesting and want to work on it:
- **ML Research Engineer**: MoE architectures, sparse training, PyTorch
- **Photonic IC Designer**: MZI layout, GST/PCM memory, IMEC PDK
- **RTL / RISC-V Engineer**: custom instruction extensions for sparse inference
- **Photonic Systems Researcher**: optical neural networks, PCM
No salary at this stage. If that's not workable, that's understood.
📬 opensynapselabs@proton.me
[github.com/OpenSynapseLabs](https://github.com/OpenSynapseLabs)
📄 [Preprint (Zenodo)](https://doi.org/10.5281/zenodo.18738608)
---
*© 2026 Ilya Osovskoi / Open Synapse Labs*
*Public files: CC BY-NC-ND 4.0 · Full codebase: All Rights Reserved*