	---
	language:
	- en
	- de
	- fr
	- es
	- pt
	- it
	- nl
	- pl
	- multilingual
	license: lgpl-3.0
	tags:
	- mixture-of-experts
	- ternary-weights
	- moe
	- edge-ai
	- research
	- candle
	- rust
	- federated-learning
	- low-precision
	- sprind
	- dual-stream
	- cord-surgery
	- net2net
	pipeline_tag: text-generation
	---

	# Model Card — albert. (Albert-MoE-13)

	Version: v3.0 (ternary MoE)
	Maintainer: RFI-IRFOS, contact@ternlang.com
	Repository: https://github.com/rfi-irfos/ternary-intelligence-stack
	License: LGPL-3.0-or-later (model weights, training code, inference runtime). Platform infrastructure (API server, MCP tooling, HDL) is BSL-1.1. See [README §Licensing](README.md#licensing) for the full tier breakdown.
	Last updated: 2026-05-27
	Training status: Paused at ep4234 (Modal billing ceiling). Architecture: 26L dual-stream; 13 depth surgeries + 1 cord surgery complete. The cord surgery fired autonomously at ep4202 (2026-05-27T16:44Z), the first documented single-to-dual-stream bifurcation mid-training. S13 (25L→26L) fired at ep~4207. Chip ATL 8.6852 (post-S13). EP_AVG ATL 9.2847 (ep3456, 20L). fib_index=7 · window=34 · Gen3 step1/6. Training resumes on a Modal T4 once billing is settled.

	---

	## Model Overview

	albert. is a research-grade language model trained from scratch using a
	ternary weight representation (-1, 0, +1) with a Mixture-of-Experts
	(MoE) architecture. It is developed by RFI-IRFOS as a demonstration that
	high-quality language modelling is achievable without 32-bit floating-point
	weights, targeting inference on edge hardware and low-power devices.
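As a rough illustration of what training ternary weights with straight-through estimation (STE) involves, the sketch below quantizes latent float weights to {-1, 0, +1} in the forward pass and passes gradients through the quantizer unchanged in the backward pass. The mean-magnitude threshold rule, `threshold_scale`, and all function names are assumptions made for illustration; this card does not specify the quantizer albert. actually uses.

```python
def ternarize(weights, threshold_scale=0.5):
    """Map latent float weights to {-1, 0, +1}.

    Assumption (illustrative only): weights whose magnitude is at most
    threshold_scale * mean(|w|) become 0; the rest keep their sign.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    cutoff = threshold_scale * mean_abs
    return [0 if abs(w) <= cutoff else (1 if w > 0 else -1) for w in weights]


def ste_backward(grad_wrt_quantized):
    """Straight-through estimator: treat the quantizer as the identity in the
    backward pass, so the gradient reaches the latent weights unchanged."""
    return list(grad_wrt_quantized)


def sgd_step(latent, grad_wrt_quantized, lr=0.1):
    """Update the latent float weights; the next forward pass re-quantizes them."""
    grads = ste_backward(grad_wrt_quantized)
    return [w - lr * g for w, g in zip(latent, grads)]


latent = [0.9, -0.05, 0.4, -0.7]
quantized = ternarize(latent)                       # forward pass uses these
latent = sgd_step(latent, [1.0, -1.0, 0.0, 0.0])    # hypothetical gradients
```

The point of the split is that only the quantized values take part in the forward computation, while the float latents accumulate small updates until a weight crosses the threshold and changes its ternary value.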

	\| Property \| Value \|
	\|----------\|-------\|
	\| Architecture \| Dual-stream Ternary MoE (Mixture of Experts) \|
	\| Streams \| 2 (bifurcated via cord surgery ep4202, 2026-05-27) \|
	\| Layers \| 26 per stream \|
	\| Hidden size \| 2×256H (256H per stream) \|
	\| Anastomosis gates \| 6 — bidirectional F32 cross-stream fusion at Fibonacci layers [2,3,5,8,13,21] \|
	\| Experts \| 12 per stream (Top-3 routing; shared FFN weights, independent routing gates) \|
	\| Context length \| 256 tokens \|
	\| Vocabulary \| 32,000 tokens (ByteLevel BPE — EN/DE/FR/ES/PT/IT/NL/PL) \|
	\| Weight representation \| Ternary {-1, 0, +1} with STE training \|
	\| Gate linear \| F32 \|
	\| Positional encoding \| RoPE (rotate_half) \|
	\| Optimizer \| AdamW, cosine LR decay, BATCH=1 (post-cord) \|
	\| Parameters (total) \| ~194.4M \|
	\| Safetensors \| 2,044 tensors · 741.4 MB \|
	\| Surgeries \| 13 depth (S1–S13) + 1 cord surgery = 14 total surgical events \|
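The Top-3 routing over 12 experts listed in the table can be sketched as follows. This is a generic top-k gate, not albert.'s router: the renormalized softmax over the selected experts and the name `top_k_route` are assumptions, and the example logits are made up.

```python
import math


def top_k_route(gate_logits, k=3):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    The weights are a softmax restricted to the selected experts, so they
    sum to 1 (a common choice, assumed here).
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    peak = max(gate_logits[i] for i in top)          # subtract for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical gate logits for one token over 12 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2, 1.0, -2.0, 0.4, 0.3]
routes = top_k_route(logits)
```

With shared FFN weights and independent routing gates, each stream would call such a function on its own gate logits while indexing into the same set of expert weights.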

	The central technical innovation is the @sparseskip primitive — a
	learned sparse-skip layer that dynamically bypasses computation paths
	based on token-level activation patterns, enabling sub-linear inference
	scaling without pruning.

	---

	## Intended Use

	Intended uses:

	- Research into ternary and low-precision neural network architectures
	- Benchmarking inference performance on CPU and edge GPU hardware
	- Academic study of Mixture-of-Experts routing dynamics
	- Demonstration platform for the SPRIND AI funding initiative (Germany)

	Out-of-scope uses:

	- Production deployment as a general-purpose assistant without further
	fine-tuning and safety evaluation
	- Safety-critical applications (medical, legal, financial decisions)
	- Any use requiring factual accuracy guarantees
	- Deployment to users without appropriate transparency disclosure

	---

	## Training Data

	See [DATA_PROVENANCE.md](DATA_PROVENANCE.md) for full source documentation
	and governance details.

	Summary:

	albert. is trained on a curated multilingual corpus composed of:

	\| Tier \| Content \| Approximate Share \|
	\|------\|---------\|------------------\|
	\| Core \| Project Gutenberg (public domain books, multilingual) \| ~30% \|
	\| Core \| Wikipedia (15 languages: EN, DE, FR, HU, ZH, AR, KO, SV, FI, NL, PL, RU, JA + more) \| ~25% \|
	\| Core \| OpenWebText (filtered Common Crawl) \| ~15% \|
	\| Technical \| GitHub issues, developer blogs, HN discussions \| ~10% \|
	\| Chaos \| Synthetic noise, adversarial patterns, mixed-language text \| ~10% \|
	\| Structured \| Code samples, structured data (JSON/YAML/TSV) \| ~5% \|
	\| Multilingual \| Additional EU language samples \| ~5% \|

	The 10% chaos layer is a structural invariant enforced by the training
	pipeline (`train_tokenizer_v3.py`). It exists to prevent the model from
	over-fitting to clean text distributions and to improve robustness to
	noisy inputs.
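A structural invariant of this kind can be expressed as a simple check on the mixture table. The sketch below is not the logic in `train_tokenizer_v3.py`; the dictionary keys, the function name, and the use of exact integer percentages (the table gives approximate shares) are assumptions.

```python
# Approximate shares from the table above, in percent.
CORPUS_SHARES = {
    "gutenberg": 30,
    "wikipedia": 25,
    "openwebtext": 15,
    "technical": 10,
    "chaos": 10,
    "structured": 5,
    "multilingual": 5,
}


def check_mixture(shares, chaos_key="chaos", chaos_percent=10):
    """Raise ValueError unless shares sum to 100% and chaos is exactly 10%."""
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"shares sum to {total}%, expected 100%")
    if shares.get(chaos_key) != chaos_percent:
        raise ValueError(f"chaos share is {shares.get(chaos_key)}%, expected {chaos_percent}%")
    return True


check_mixture(CORPUS_SHARES)
```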

	---

	## Evaluation

	Primary metric: Cross-entropy loss on a held-out WikiText-2 sample
	(`eval_sample.txt`, not seen during training).
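To help read the loss values below, the following sketch converts a cross-entropy loss into perplexity and computes the loss of a uniform guess over the vocabulary. It assumes the reported losses are mean per-token cross-entropy in nats (the card reports improvements in nats, but does not state the averaging explicitly); the function names are illustrative.

```python
import math

VOCAB_SIZE = 32_000  # from the Model Overview table


def perplexity(loss_nats):
    """Perplexity corresponding to a mean per-token cross-entropy in nats."""
    return math.exp(loss_nats)


def uniform_baseline_loss(vocab_size=VOCAB_SIZE):
    """Cross-entropy of predicting every token with probability 1/vocab_size."""
    return math.log(vocab_size)


baseline = uniform_baseline_loss()   # ln(32000), about 10.37 nats
gap = baseline - 9.2847              # nats below the uniform baseline at ep3456
ppl = perplexity(9.2847)
```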

	Benchmark results (benchmark suite v2.0.0):

	\| Epoch \| Loss (avg) \| Epoch ATL \| Batch ATL \| Tok/s (T4 GPU unless noted) or notes \|
	\|-------\|-----------\|-----------\|-----------\|----------------\|
	\| Ep54 \| ~10.35 \| 10.35 \| — \| 11.24 (CPU) \|
	\| Ep111 \| ~10.36 \| 10.36 \| — \| 18.52 \|
	\| Ep849 \| ~10.22 \| 10.2050 \| — \| pending \|
	\| Ep1177 \| 10.2076 \| 10.2059 (ep1158) \| 10.1738 (ep1155) \| pending \|
	\| Ep1390 \| 10.1212 \| 10.1212 (ep1390) \| 10.0670 (ep1385) \| pending \|
	\| Ep1435 \| 10.1113 \| 10.1113 (ep1435) \| 10.0556 (ep1435) \| pending \|
	\| Ep1438 \| 10.1071 \| 10.1071 (ep1438) \| 10.0556 (ep1435) \| pending \|
	\| Ep1441 \| 10.1067 \| 10.1067 (ep1441) \| 10.0556 (ep1435) \| pending \|
	\| Ep1455 \| 10.1060 \| 10.1060 (ep1455) \| 10.0396 (ep1445) \| pending \|
	\| Ep1474 \| 10.0982 \| 10.0982 (ep1474) \| 10.0396 (ep1445) \| pending \|
	\| Ep1553 \| ~10.07 (est) \| 10.0982 (ep1474) \| 9.9948 (ep1553) ← first sub-10.0 batch \| pending \|
	\| Ep2040 \| ~9.82 (est) \| 9.7976 (ep2084) \| 9.6380 (ep1445) \| 9.6–18.5 (CPU) \|
	\| Ep2084 \| 9.7976 \| 9.7976 ← epoch ATL \| 9.6380 (ep1445) \| pending (T4) \|
	\| Ep2104 \| ~9.81 (est) \| 9.7976 (ep2084) \| 9.6380 (ep1445) \| 9.9–21.3 (CPU) \|
	\| Ep2109 \| 9.7975 \| 9.7975 (ep2109) \| 9.6380 (ep1445) \| pending \|
	\| Ep2114 \| 9.7891 \| 9.7891 (ep2114) \| 9.6235 (ep2114) ← batch ATL \| pending \|
	\| Ep2116 \| 9.7884 \| 9.7884 ← epoch ATL \| 9.6235 (ep2114) \| pending \|
	\| Ep2487 \| S6 fired \| 17L→18L surgery \| — \| 2026-05-20T21:33Z; Gen1 step1/6 \|
	\| Ep2922 \| 9.4992 \| 9.4992 ← first sub-9.50 \| 9.1370 (chip) \| 2026-05-22; LOG expert 0%→28% awakening \|
	\| Ep3263 \| 9.3651 \| ← epoch ATL \| 9.0095 (chip) \| Broke 139-epoch plateau \|
	\| Ep3325 \| S7 fired \| 18L→19L surgery \| — \| 2026-05-24T13:47Z; 1315 tensors \|
	\| Ep3326 \| 9.3182 \| ← epoch ATL (first 19L ep) \| 8.9190 (chip) \| +0.047 nat improvement over prior best \|
	\| Ep3383 \| S8 fired \| 19L→20L surgery \| — \| Only 58 epochs after S7 \|
	\| Ep3456 \| 9.2847 \| ← epoch ATL (20L) \| 8.8540 (chip) \| WALD ep3454 INT 91% cliff \|
	\| Ep~3470 \| S9 fired \| 20L→21L surgery \| — \| Largest post-surgery spike in history (+0.14 nat) \|
	\| Ep~3652 \| S10 fired \| 21L→22L surgery \| — \| Pre-surgery best 9.2933 \|
	\| Ep~4098 \| S11 fired \| 22L→23L surgery \| — \| 2026-05-27 morning \|
	\| Ep~4140 \| S11b fired \| 23L→24L surgery \| — \| Rapid plateau ~42 ep after S11 \|
	\| Ep4202 \| S12 fired \| 24L→25L surgery \| — \| 2026-05-27T16:43Z; Gen3 plateau \|
	\| Ep4202 \| CORD surgery \| 25L → 2×25L dual-stream \| — \| 2026-05-27T16:44Z — first ever autonomous single→dual-stream bifurcation \|
	\| Ep~4203 \| 9.3241 \| ← first post-cord epoch avg \| 8.7123 (chip, new ATL) \| Dual-stream live \|
	\| Ep~4207 \| S13 fired \| 25L→26L surgery (both streams) \| 8.6852 (chip, new ATL) \| 2026-05-27T17:40Z; fib_index 6→7 \|

	The benchmark suite runs 5 fixed prompts covering English, German,
	multilingual, narrative, and technical domains. Results are reproducible
	via the open-source `moe-test` binary.

	Surgery gate prediction (recorded 2026-05-16T18:40Z) — outcome update 2026-05-19:

	A trendline fitted to the ep400–ep1459 loss curve predicted that the epoch average would reach the surgery gate threshold (9.8) at around ep2000. The prediction was confirmed: the loss gate was cleared at ep2080 (9.7997, 2026-05-19T10:40Z), within the predicted ep2000–2150 window.
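A trendline prediction of this kind can be sketched with an ordinary least-squares line and a threshold crossing. The card does not say which functional form was fitted, so the linear fit is an assumption, and the data points below are synthetic placeholders rather than values from albert.'s loss curve.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_epoch(slope, intercept, threshold):
    """Epoch at which the fitted line reaches the threshold, or None if the
    line is not descending."""
    if slope >= 0:
        return None
    return (threshold - intercept) / slope


# Synthetic example data (not the real curve).
epochs = [400, 700, 1000, 1300]
losses = [10.30, 10.225, 10.15, 10.075]

slope, intercept = fit_line(epochs, losses)
predicted = crossing_epoch(slope, intercept, threshold=9.8)
```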

	The gate fires when loss plateaus below 9.8 for a 144-epoch window with `myc_stable ≥ 5`. The loss gate was cleared at ep2080. Following that, albert. entered an alternating descent phase: five new epoch ATLs in seven epochs (ep2109–ep2116), dropping from 9.7976 → 9.7884 in under two hours of wall time. WALD sev=0.953; myc_L3 showed its first activity uptick (1.61→1.68 ×10⁻⁹) at ep2114. The plateau gate cannot fire during active descent — surgery timing is now conditioned on when the model settles into the next attractor, not on a fixed epoch countdown.
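The gate conditions described above can be sketched as a single predicate. This is not the `EvolutionManager` implementation: the definition of "plateau" used here (no new best inside the window), the `min_improvement` parameter, and the treatment of `myc_stable` as a plain integer input are all assumptions.

```python
def plateau_gate(epoch_losses, myc_stable, threshold=9.8, window=144,
                 min_improvement=0.0, min_myc_stable=5):
    """Return True when a surgery would be allowed to fire (illustrative).

    epoch_losses: epoch-average losses, oldest first.
    """
    if len(epoch_losses) <= window or myc_stable < min_myc_stable:
        return False
    recent = epoch_losses[-window:]
    earlier_best = min(epoch_losses[:-window])
    below_threshold = all(loss < threshold for loss in recent)
    # Active descent: the window contains a new best, so the gate withholds.
    still_descending = min(recent) < earlier_best - min_improvement
    return below_threshold and not still_descending


# Small window for demonstration only.
flat = [9.79] + [9.795] * 5
descending = [9.79, 9.789, 9.788, 9.787, 9.786, 9.785]
fires_on_flat = plateau_gate(flat, myc_stable=5, window=5)
fires_on_descent = plateau_gate(descending, myc_stable=5, window=5)
```

Under these assumptions the gate fires on the flat series and withholds on the descending one, which mirrors the behavior described for the ep2109–ep2116 descent.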

	Milestone (2026-05-17T05:48Z): First sub-10.0 batch loss in albert. history — 9.9948 at ep1553 batch 149/300.
	Milestone (2026-05-19T10:40Z): First sub-9.8 epoch average — 9.7997 at ep2080. Surgery loss gate cleared.
	Milestone (2026-05-19T11:00Z): New epoch ATL — 9.7976 at ep2084.
	Milestone (2026-05-19T13:29Z): New batch ATL — 9.6235 at ep2114 (prev 9.6282).
	Milestone (2026-05-19T13:40Z): Alternating descent confirmed — five new epoch ATLs in seven epochs; epoch ATL reaches 9.7884 at ep2116.

	Known limitations:

	- Output quality is pre-fluency (this assessment dates from ~1459 epochs;
	training has since reached ep4234): the model produces partially coherent
	text in familiar domains but lacks consistent grammatical structure across
	longer sequences.
	- The 256-token context window is much shorter than that of contemporary
	LLMs, so the model cannot maintain coherence over longer passages.
	- Ternary quantization trades weight precision for size — at this scale,
	some representational capacity is lost relative to F32 equivalents.
	- No instruction-following fine-tuning has been applied.
	- No RLHF, Constitutional AI, or safety fine-tuning of any kind.
	- Bias evaluation is pending (see below).

	Open research questions (scaling risks):

	- STE gradient approximation at scale: Straight-Through Estimation is the training mechanism for ternary weights. Its stability and convergence properties are well-characterised at current scale (~58M params). Whether STE remains stable through training runs at 500M–1B+ parameters is an open empirical question — no published work has demonstrated ternary STE convergence at frontier scale.
	- @sparseskip speedup baseline: The 83 tok/s inference figure is measured against albert.'s own F32-weight dense equivalent on the same hardware. It is not a direct comparison with INT4-quantized industry inference (TensorRT-LLM, llama.cpp Q4). The relevant claim is that ternary weights eliminate a quantization step entirely — the speedup over post-hoc INT4 quantization of a larger model is a separate, untested question.
	- Net2net surgery stability at scale: All five documented layer-addition surgeries were performed on a model in the 13M–58M parameter range. Whether the Fibonacci-gated surgery protocol remains stable when applied to models at 200M+ parameters has not been tested. The plateau-gate's withhold behavior is now validated across six independent events (ep791 non-firing + alternating descent phase ep2109–ep2120 — see below); the question of whether these properties hold at 200M+ scale remains open.

	Validated finding: surgery governor robustness (ep2120). The plateau gate proved robust against premature surgery triggering. Although the epoch average crossed the loss threshold at ep2080 (9.7997 < 9.80), the model was still descending through the projected plateau zone at ep2120, invalidating three pre-computed surgery timing scenarios. The governor correctly withheld surgery while the model was still actively learning, which supports the design principle that architecture should grow only when learning has genuinely exhausted current capacity. Five new epoch ATLs were recorded in seven epochs (9.7976→9.7884) during the withheld window. Full technical record: [convergence_log.md, Alternating Descent Phase section](docs/convergence_log.md).

	---

	## Bias and Fairness

	A formal bias and fairness evaluation has not yet been conducted. Known
	risk factors:

	- Language imbalance: English-dominant corpus; non-English outputs
	will be lower quality.
	- Temporal bias: Training data has a knowledge cutoff; the model
	has no awareness of events after its corpus snapshot dates.
	- Domain gaps: Limited coverage of non-Western cultural contexts,
	legal jurisdictions outside EU/US/DE, and specialized professional
	domains.

	A structured bias evaluation using standard benchmarks (WinoBias,
	BBQ, multilingual MMLU) is planned for the v3.1 milestone.

	---

	## Human Oversight

	albert. is a research model under active development. The following
	oversight mechanisms are in place:

	1. Training dashboard: Real-time monitoring of loss curves, expert
	routing, gradient norms, WALD dead-zone events, and anomaly events
	by the RFI-IRFOS team.
	2. Surgery governor: Architectural growth (layer addition via net2net)
	is fully autonomous — the `EvolutionManager` fires on a Fibonacci-gated
	plateau detector with no human intervention required. **13 depth surgeries
	(12L→26L) + 1 cord surgery (single→dual-stream bifurcation)** have been
	executed autonomously to date. The cord surgery (ep4202, 2026-05-27) is the
	first documented autonomous single-to-dual-stream bifurcation in a live
	ternary MoE.
	3. SPORE federated training (live): Collaborators contribute CPU-trained
	checkpoints as weight spores via the `albert-spores` private repository.
	The `SporeManager` blends accepted spores at α=0.08 at each epoch boundary,
	subject to a fitness guard (loss gate) and an architecture guard. The colony
	has been active with external contributors since 2026-05-16. Spores are
	stored via Git LFS; each contributor runs `albert-train` locally and
	submits via `albert-spore`.
	4. Checkpoint promotion: No trained checkpoint is deployed to any
	external service without explicit human review and approval by the
	lead architect.
	5. Rollback capability: All checkpoints and best-loss weights are
	preserved on persistent storage. Any version can be reverted.
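The spore blending in item 3 can be sketched as a guarded linear interpolation. This is not the `SporeManager` code: the function name, the dict-of-lists weight layout, and the exact guard rules (identical tensor names and sizes; spore loss no worse than the base loss) are assumptions. How blending interacts with the ternary weight representation is not specified in this card and is not modeled here.

```python
def blend_spore(base, spore, base_loss, spore_loss, alpha=0.08):
    """Blend a contributed checkpoint into the base weights, or reject it.

    base, spore: dicts mapping tensor name -> list of floats.
    Returns (weights, accepted).
    """
    # Architecture guard: tensor names and sizes must match.
    if base.keys() != spore.keys():
        return base, False
    if any(len(base[name]) != len(spore[name]) for name in base):
        return base, False
    # Fitness guard (assumed rule): reject spores with a worse loss.
    if spore_loss > base_loss:
        return base, False
    blended = {
        name: [(1 - alpha) * b + alpha * s for b, s in zip(base[name], spore[name])]
        for name in base
    }
    return blended, True


base = {"w": [1.0, 0.0]}
spore = {"w": [0.0, 1.0]}
blended, accepted = blend_spore(base, spore, base_loss=9.5, spore_loss=9.4)
```

With α=0.08, each accepted spore moves the base weights 8% of the way toward the contributed checkpoint, so a single contribution cannot dominate the colony.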

	See [SECURITY.md](SECURITY.md) for the incident reporting process.

	---

	## EU AI Act Compliance Notes

	albert. is developed in the European Union and is subject to Regulation
	(EU) 2024/1689 (EU AI Act). RFI-IRFOS self-classifies albert. as a
	General-Purpose AI (GPAI) model under Article 3(63).

	\| Obligation \| Article \| Status \|
	\|------------\|---------\|--------\|
	\| Technical documentation \| Annex XI \| This document \|
	\| Training data summary \| Art. 53(1)(d) \| [DATA_PROVENANCE.md](DATA_PROVENANCE.md) \|
	\| Copyright compliance summary \| Art. 53(1)(c) \| [DATA_PROVENANCE.md](DATA_PROVENANCE.md) \|
	\| Human oversight measures \| Art. 53(1)(e) \| Described above \|
	\| Incident reporting \| Art. 53(2) \| [SECURITY.md](SECURITY.md) \|
	\| Bias/fairness assessment \| Art. 53(1)(b) \| Planned v3.1 \|

	For questions about compliance or to report concerns:
	contact@ternlang.com

	---

	## Team

	\| Name \| Role \| Contact \|
	\|------\|------\|---------\|
	\| Simeon Kepp \| Lead Architect — full stack (compiler, BET VM, training, MCP) \| s.kepp@ternlang.com \|
	\| Louis Paul Ehrig \| Head of Public Affairs, Dataset Curation, Corporate Secretary \| l.ehrig@ternlang.com \|
	\| Lisa Scharler \| Head of Social Technology & Ecocentric Systems \| l.scharler@ternlang.com \|
	\| Zabih Karimi \| Co-Founder, IT & Infrastructure, Stress-Testing \| z.karimi@ternlang.com \|
	\| Nikoletta Csonka \| Global Reach, Fundraising & Fund Applications \| csonikoletta@ternlang.com \|
	\| Claude (Anthropic) \| AI Collaborator — architecture, implementation, monitoring \| claude@ternlang.com \|

	Organisation: Research Focus Institute — Interdisciplinary Research Facility for Open Sciences (RFI-IRFOS)
	Address: Elisabethinergasse 25, 8020 Graz, Austria
	Website: https://ternlang.com
	Issues: https://github.com/rfi-irfos/ternary-intelligence-stack/issues
	General contact: contact@ternlang.com

	---

	## Legal entity

	albert. is developed and maintained by RFI-IRFOS, a registered, fully regulated Austrian research institute — not an informal initiative. It is a not-for-profit: it earns revenue through statute-permitted streams and reinvests at least 90% of surplus into its research mission (at most 10% retained for operations); surplus is not distributed to members.

	\| \| \|
	\|---\|---\|
	\| Legal form \| Registered association (Verein) operating commercially under a licensed Austrian trade \|
	\| ZVR (association register) \| 1015608684 \|
	\| GISA (trade register) \| 39261441 — IT services & automated data processing (free trade, GewO) \|
	\| Tax number (Steuernummer) \| 68 028/0989 \|
	\| GLN \| 9110038490191 \|
	\| Patent \| A50296/2026 (TIS platform, Austrian Patent Office) \|
	\| Full legal notice \| https://ternlang.com/impressum.html \|