---
language:
- en
license: cc-by-nc-nd-4.0
tags:
- mixture-of-experts
- foundation-training
- curriculum-learning
- sparse
- reasoning
- pytorch
pipeline_tag: text-generation
---

# ARCHE3-7B
|
|
A 7B sparse MoE language model pre-trained on structured reasoning patterns before any text. Built solo on consumer hardware.
|
|
---
|
|
## Core idea: Foundation Curriculum Training
|
|
Standard LLM pre-training optimizes for next-token prediction on raw text. The model learns statistical regularities (what words follow what words) but has no structured inductive bias toward causal reasoning, transfer, or analogy.
|
|
**Foundation Curriculum Training (FCT)** is a pre-training methodology that addresses this directly. Before any text data, the model trains on 290 structured reasoning patterns across 14 cognitive domains. Each pattern encodes a reasoning update in a fixed format:
|
|
```
OBSERVE → what changed in the world
PRIOR   → what the model believed before
UPDATE  → how that belief should change
RIPPLE  → which other domains are affected
ANALOGY → a structural parallel from another field
ACT     → what behavior should change
```
|
|
This structure directly encodes predictive coding: the model must predict how beliefs update given new observations, and how that update propagates across domains.
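
As a concrete illustration, one FCT pattern could be represented as a flat record and serialized into a training sequence. This is a hypothetical sketch: the field names follow the six slots above, but the class name, the `SLOT: text` serialization, and the example content are assumptions, not taken from the actual codebase.

```python
from dataclasses import dataclass

@dataclass
class FCTPattern:
    """One structured reasoning pattern in the six-slot FCT format."""
    observe: str   # what changed in the world
    prior: str     # what the model believed before
    update: str    # how that belief should change
    ripple: str    # which other domains are affected
    analogy: str   # a structural parallel from another field
    act: str       # what behavior should change
    domain: str    # one of the 14 cognitive domains

    def to_sequence(self) -> str:
        # Flatten the pattern into one newline-separated training sequence.
        return "\n".join([
            f"OBSERVE: {self.observe}",
            f"PRIOR: {self.prior}",
            f"UPDATE: {self.update}",
            f"RIPPLE: {self.ripple}",
            f"ANALOGY: {self.analogy}",
            f"ACT: {self.act}",
        ])

# Hypothetical example pattern from the "systems" domain.
p = FCTPattern(
    observe="a feedback delay grew longer",
    prior="the system responds immediately",
    update="expect oscillation around the setpoint",
    ripple="control, economics",
    analogy="thermostat overshoot",
    act="damp corrections instead of amplifying them",
    domain="systems",
)
print(p.to_sequence())
```

Each of the 290 patterns would expand into several such sequences, consistent with the 1,160 training sequences reported below.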
|
|
The hypothesis: if abstract reasoning structure is instilled through a small, carefully designed dataset *before* large-scale text training, the model may generalize more robustly and be more sample-efficient on downstream domains.
|
|
---
|
|
## Why this might matter
|
|
Current approaches to improving LLM reasoning (chain-of-thought, RLHF, process reward models) work at inference or fine-tuning time. FCT operates at pre-training time: it tries to shape what the model learns to represent, not just how it produces output.
|
|
The FCT dataset is 290 patterns, or 1,160 training sequences. The observed effect (cross-domain loss reduction, where each new domain starts at a lower loss than the previous domain's baseline) is attributable to structure, not scale.
|
|
This is a direction worth exploring further, not a solved problem.
|
|
---
|
|
## What's actually new here
|
|
- **Structured pre-training before text**: FCT as a pre-training stage, not fine-tuning
- **Cross-domain transfer signal**: measurable loss reduction across unrelated domains after FCT
- **Split Dense Core**: separate Input and Fusion processing stages around the MoE layer
- **SmartRouter**: four-mechanism anti-collapse system for MoE routing (load balance + entropy bonus + jitter + adaptive temperature)
- **Consumer hardware scale**: full 7B training on a MacBook M-series with 16 GB RAM
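
To make the SmartRouter bullet concrete, here is a toy numpy sketch of the four mechanisms. The real implementation lives in `hive_router.py`; every function name, constant, and exact formula below is illustrative (e.g. a Switch-Transformer-style load-balance term is assumed), not the actual code.

```python
import numpy as np

def _softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def smart_route(logits, top_k=4, temperature=1.0, jitter_std=0.01,
                expert_load=None, rng=None):
    """Toy top-k router illustrating four anti-collapse mechanisms.

    logits: (n_tokens, n_experts) raw routing scores.
    Returns (indices, weights, aux); aux holds the auxiliary terms
    that would be added to / subtracted from the training loss.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_tokens, n_experts = logits.shape

    # 1. Jitter: small noise on the logits so near-tied experts
    #    both receive gradient signal during training.
    noisy = logits + rng.normal(0.0, jitter_std, logits.shape)

    # 2. Adaptive temperature: flatten the routing distribution
    #    when recent expert load was unbalanced.
    if expert_load is not None:
        imbalance = expert_load.max() / max(expert_load.mean(), 1e-9)
        temperature = temperature * min(imbalance, 4.0)

    probs = _softmax(noisy / temperature)

    # Top-k selection with renormalized mixture weights.
    idx = np.argsort(-probs, axis=1)[:, :top_k]
    w = np.take_along_axis(probs, idx, axis=1)
    w = w / w.sum(axis=1, keepdims=True)

    # 3. Load-balance auxiliary loss: fraction of tokens whose top
    #    choice is expert e, times e's mean routing probability.
    frac = np.bincount(idx[:, 0], minlength=n_experts) / n_tokens
    load_loss = n_experts * float((frac * probs.mean(axis=0)).sum())

    # 4. Entropy bonus: high routing entropy is rewarded
    #    (this term would be subtracted from the loss).
    entropy = float(-(probs * np.log(probs + 1e-9)).sum(axis=1).mean())

    return idx, w, {"load_balance": load_loss, "entropy_bonus": entropy}
```

With 20,480 experts and top-k=4, roughly 0.02% of experts run per token; terms like these are what keep a router from collapsing onto a handful of them.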
|
|
---
|
|
## What is implemented
|
|
14 Python scripts in pure PyTorch, with no dependencies beyond PyTorch and numpy:
|
|
| Component | Description |
|---|---|
| `arche3_model.py` | Split Dense Core + HierarchicalMoE + ExpertManager (LRU cache) |
| `arche3_trainer.py` | AdamW for Dense Core + SparseExpertAdamW (updates only activated experts) |
| `hive_trainer.py` | FCT training pipeline |
| `hive_store.py` | BF16 memory-mapped expert storage (hive.bin) |
| `arche3_tokenizer.py` | Custom BPE, no external dependencies |
| `arche3_ethical_core.py` | Objective-based action evaluation |
| `arche3_world_model.py` | Visual state reasoning module |
| `data_loader.py` | Multi-format dataset loader |
| `arche3_main.py` | Interactive CLI |
|
Public in this repo: `hive_router.py` (SmartRouter) and `arche3_config.py` (all hyperparameters).
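
The SparseExpertAdamW idea from the table above can be sketched as follows. This is a minimal numpy toy, not the actual `arche3_trainer.py` code; class name, defaults, and the per-expert state layout are assumptions. The point is that optimizer state is created and touched only for experts activated in the current batch, so the 20,480 inactive experts cost nothing.

```python
import numpy as np

class SparseExpertAdamW:
    """Toy AdamW that keeps per-expert state and steps only the
    experts activated in the current batch; untouched experts keep
    their weights and never materialize momentum/variance buffers."""

    def __init__(self, experts, lr=1e-4, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.experts = experts            # dict: expert_id -> weight array
        self.lr, self.eps, self.wd = lr, eps, weight_decay
        self.b1, self.b2 = betas
        self.state = {}                   # lazily created per expert

    def step(self, grads):
        # grads: dict expert_id -> gradient, only for activated experts.
        for eid, g in grads.items():
            st = self.state.setdefault(
                eid, {"t": 0, "m": np.zeros_like(g), "v": np.zeros_like(g)})
            st["t"] += 1
            st["m"] = self.b1 * st["m"] + (1 - self.b1) * g
            st["v"] = self.b2 * st["v"] + (1 - self.b2) * g * g
            m_hat = st["m"] / (1 - self.b1 ** st["t"])
            v_hat = st["v"] / (1 - self.b2 ** st["t"])
            w = self.experts[eid]
            # Decoupled weight decay, then the Adam update, in place.
            w *= (1 - self.lr * self.wd)
            w -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

With top-k=4 routing, a step touches at most 4 experts per token regardless of the 20,480-expert total.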
|
|
---
|
|
## Architecture
|
|
```
Tokens → Embedding
  → Dense Core Input (5 × TransformerBlock, GQA 16/4, d_model=2048)
  → HierarchicalMoE (8 domains × 2,560 experts = 20,480 total, top-k=4)
  → Dense Core Fusion (5 × TransformerBlock, integrates MoE output)
  → Output logits (weight-tied)
```
|
|
**Memory**: experts stored in hive.bin (BF16, ~12.5 GB), LRU cache of 32 slots in RAM. Peak RAM ~8 GB training, ~3.5 GB inference.
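
A minimal sketch of the hive.bin idea, assuming BF16 is emulated as the top 16 bits of a float32 (numpy has no native bfloat16) and assuming the 32-slot LRU policy described above. Class and method names are hypothetical, not the actual `hive_store.py` API.

```python
import numpy as np
from collections import OrderedDict

def f32_to_bf16_bits(x):
    # BF16 is the top 16 bits of an IEEE float32 (truncation rounding here).
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits):
    return (bits.astype(np.uint32) << 16).view(np.float32)

class ExpertStore:
    """Toy hive.bin-style store: experts live on disk as BF16 bit
    patterns; an LRU cache keeps the most recent `slots` in RAM."""

    def __init__(self, path, n_experts, expert_size, slots=32):
        self.mm = np.memmap(path, dtype=np.uint16, mode="w+",
                            shape=(n_experts, expert_size))
        self.slots = slots
        self.cache = OrderedDict()        # expert_id -> float32 weights

    def write(self, eid, weights):
        self.mm[eid] = f32_to_bf16_bits(weights)
        self.cache.pop(eid, None)         # invalidate stale cache entry

    def read(self, eid):
        if eid in self.cache:             # LRU hit: move to the back
            self.cache.move_to_end(eid)
            return self.cache[eid]
        w = bf16_bits_to_f32(np.array(self.mm[eid]))
        self.cache[eid] = w
        if len(self.cache) > self.slots:  # evict least recently used
            self.cache.popitem(last=False)
        return w
```

At 2 bytes per parameter, ~12.5 GB of BF16 storage is consistent with the sparse expert parameters dominating the 7B total, while only 32 cached experts ever occupy RAM.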
|
|
---
|
|
## Evidence
|
|
Benchmarks were run on a custom evaluation scale (AIS) measuring reasoning structure. Results are limited: only 5 of the 14 FCT domains were trained in this run.
|
|
| Block | Score | Max | Notes |
|---|---|---|---|
| FCT Reasoning | 45 | 100 | 5/14 domains trained |
| Values & Reflection | 15 | 30 | |
| Dopamine Autonomy | 20 | 20 | |
| **Normalized** | **53/100** | – | "Strong LLM" band |
|
|
FCT training loss across the 5 trained domains:
|
|
| Domain | Start | End | Reduction |
|---|---|---|---|
| Systems | 2.149 | 0.941 | 56% |
| Mathematics | – | < 0.7 | – |
| Physics | – | < 0.7 | – |
| Biology | – | < 0.7 | – |
| Cognition | – | < 0.7 | – |
|
|
Standard benchmarks (MMLU, GSM8K, HumanEval) have not been run. This is on the roadmap.
|
|
Full notebooks: [arche3-benchmarks](https://github.com/OpenSynapseLabs/arche3-benchmarks)
|
|
---
|
|
## Limitations
|
|
- **No standard benchmark evaluation**: AIS is a custom scale; results are not comparable to published models
- **FCT dataset is small**: 290 patterns; whether the transfer effect scales with more patterns is unknown
- **Partial training**: only 5 of 14 FCT domains completed in the reported run
- **Single-author implementation**: code has not been independently reviewed or reproduced
- **Consumer hardware**: training was done on a MacBook M-series; larger runs require proper compute
|
|
---
|
|
## Access
|
|
**Public** (this repo):
- `hive_router.py`: SmartRouter implementation
- `arche3_config.py`: full configuration
|
|
**Gated access**: full 14-script codebase.
Click **Access repository** above. Describe who you are and what you want to do with it.
|
|
---
|
|
## Looking for collaborators
|
|
This project is at an early stage. The next step is ARCHE3-35B with a photonic inference chip (ArchePhoton-35), a purpose-built MZI-based processor for sparse expert inference.
|
|
I'm looking for people who find this direction interesting and want to work on it:
|
|
- **ML Research Engineer**: MoE architectures, sparse training, PyTorch
- **Photonic IC Designer**: MZI layout, GST/PCM memory, IMEC PDK
- **RTL / RISC-V Engineer**: custom instruction extensions for sparse inference
- **Photonic Systems Researcher**: optical neural networks, PCM
|
|
No salary at this stage. If that's not workable, that's understood.
|
|
Email: opensynapselabs@proton.me
GitHub: [github.com/OpenSynapseLabs](https://github.com/OpenSynapseLabs)
Preprint: [Zenodo](https://doi.org/10.5281/zenodo.18738608)
|
|
---
|
|
*© 2026 Ilya Osovskoi / Open Synapse Labs*
*Public files: CC BY-NC-ND 4.0 · Full codebase: All Rights Reserved*