T³ 124M v3.6 (run-3 release)
Inference-ready checkpoint for T³, a Clifford-algebra-augmented transformer architecture. 124M parameters, GPT-2 Small substrate, 5B training tokens.
This is the canonical reference checkpoint for the v3.6 lineage. Companion artifacts:
- Code: `mirrorethic/t3-reference` (Apache-2.0)
- Trace library + benchmarks: https://t3atlas.dev
- Sibling checkpoint: `mirrorethic/t3-124m-v36-pcloss`, the same architecture trained with the inter-stage predictive-coding loss un-detached. Slightly worse PPL (28.53 vs 27.76), neutral on reasoning. Use the pair for the controlled inter-stage-PC ablation.
Quick start
```bash
pip install t3-reference
```
```python
import torch
from huggingface_hub import hf_hub_download
from t3 import T3Model

ckpt = hf_hub_download("mirrorethic/t3-124m-v36", "pytorch_model.bin")
model = T3Model.from_checkpoint(ckpt)
model.eval()

input_ids = torch.randint(0, 50257, (1, 16))
with torch.no_grad():
    logits, *_ = model(input_ids)
```
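The snippet above only returns next-token logits. For quick qualitative checks you can drive a simple greedy decode on top of it; the sketch below is not part of the `t3` API and assumes the standard GPT-2 tokenizer from `transformers` plus the fact (shown above) that the first output of `model(...)` is the logits tensor.

```python
# Minimal greedy-decoding sketch (assumptions: GPT-2 tokenizer from `transformers`;
# logits are the first output of model(...), shape [batch, seq, vocab]).
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tok("The capital of France is", return_tensors="pt").input_ids

for _ in range(16):  # generate 16 tokens greedily
    with torch.no_grad():
        logits, *_ = model(ids)
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```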
To generate a schema-v1 ecology trace:
```python
from t3.tracing import generate_trace

generate_trace(
    model,
    "The capital of France is",
    prompt_id="factual",
    n_tokens=32,
    out_path="trace.jsonl",
)
```
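The trace is plain JSONL (one JSON record per line), so it can be inspected without the viewer. The sketch below makes no assumptions about the schema-v1 field names; it simply loads the file written above and lists the keys of the first record.

```python
import json

# Load the trace written by generate_trace above: one JSON record per line.
with open("trace.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records")
print(sorted(records[0].keys()))  # schema-v1 field names
```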
To re-run the published lm-eval-harness benchmarks:
```python
from t3.benchmarks import run_benchmark_suite

results = run_benchmark_suite("path/to/pytorch_model.bin")
```
Architecture
T³ extends standard multi-head attention with a per-head ecology of
six conjugate primitives (E, I, F, V, C, K) coupled through bivector
composition in Cl(3,3) geometric algebra. Heads interact through a
learned blockade-and-cosurvival graph and ponder adaptively per stage via
output-entropy halt. Full technical specification:
docs/ARCHITECTURE.md.
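As an illustration of the adaptive-ponder mechanism only (not the reference implementation), the sketch below shows the general shape of an output-entropy halt with the per-stage 4-step cap listed in the configuration table below; `run_stage`, `lm_head`, and `entropy_threshold` are hypothetical names.

```python
import torch.nn.functional as F

def ponder_stage(run_stage, lm_head, hidden, max_steps=4, entropy_threshold=2.0):
    """Repeat one stage until the output-distribution entropy drops below a
    threshold or the per-stage step cap is reached (illustrative only)."""
    steps = 0
    for steps in range(1, max_steps + 1):
        hidden = run_stage(hidden)                           # one ponder iteration
        probs = F.softmax(lm_head(hidden[:, -1, :]), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        if entropy < entropy_threshold:                      # confident enough: halt
            break
    return hidden, steps
```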
| Field | Value |
|---|---|
| Parameters | 124,500,000 |
| Stages | 3, with layers_per_stage = [4, 3, 5] (12 transformer blocks total) |
| `d_model` | 768 |
| `n_heads` | 12 |
| `d_ff` | 3072 |
| `vocab_size` | 50257 (GPT-2 tokenizer) |
| `max_seq_len` | 1024 |
| Substrate | GPT-2 Small initialization |
| Training data | 5B tokens (FineWeb-Edu 40%, DCLM 20%, StackEdu 10%, FineMath 10%, Cosmopedia 10%, Wikipedia 10%) |
| Cumulative training step | 138,000 (135.5K substrate + 2,500 v3.6 increment) |
| Hamiltonian coupling ω | 0.02 |
| Trivectors | off (the trivectors-on variant is a planned v3.7 follow-up release) |
| Inter-stage predictive coding | on (weight = 0.05) |
| Scratchpad heads | on (scratchpad_inject_entropy = (0.0, 0.0, 0.03) — S2-only) |
| ACT | output-entropy halt + per-stage 4-step cap |
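As a sanity check on the parameter count, the GPT-2 Small substrate dimensions in the table already account for roughly 124M parameters. The short calculation below is a back-of-the-envelope estimate assuming a tied LM head; it ignores the T³-specific ecology and coupling parameters, which the 124.5M total also includes.

```python
# Rough parameter count for the GPT-2 Small substrate (tied LM head assumed).
d_model, d_ff, n_layers, vocab, ctx = 768, 3072, 12, 50257, 1024

per_block = (
    4 * d_model * d_model + 4 * d_model    # attention: QKV + output projection (+ biases)
    + 2 * d_model * d_ff + d_ff + d_model  # MLP: up/down projections (+ biases)
    + 4 * d_model                          # two LayerNorms (weight + bias each)
)
total = vocab * d_model + ctx * d_model + n_layers * per_block + 2 * d_model
print(f"{total / 1e6:.1f}M parameters")    # ≈ 124.4M
```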
Evaluation
All numbers are full lm-eval-harness 0.4.x runs (no subset). Reproduce with `examples/run_benchmarks.py` from the reference repo.
| Task | Metric | Value | stderr |
|---|---|---|---|
| WikiText-103 (val) | perplexity | 27.76 | — |
| BoolQ | acc | 0.6046 | 0.0086 |
| ARC-Easy | acc | 0.4331 | 0.0102 |
| ARC-Challenge | acc | 0.2176 | 0.0121 |
| PIQA | acc | 0.6050 | 0.0114 |
| HellaSwag | acc | 0.3040 | 0.0046 |
| WinoGrande | acc | 0.5043 | 0.0141 |
| COPA | acc | 0.6000 | 0.0492 |
| RTE | acc | 0.5235 | 0.0301 |
For comparison panels (parameter-efficiency vs vanilla GPT-2 same-data, compute frontier), see https://t3atlas.dev/benchmarks/.
vs vanilla GPT-2 (same 5B-token training data)
T³-124M-v36 is compared against the `gpt2/124M_vanilla_5b_same_data` baseline. The most notable delta is on multi-step reasoning: T³ pondering allocates extra forward compute per token (S1 averages 2.7–3.7 ponder loops on these tasks).
Intended use
- Research and interpretability. The checkpoint is designed to be inspected via the trace library, not deployed for production text generation. The model is small (124M), English-only, and not instruction-tuned.
- Architectural comparison. A reference point for novel-attention work (Mamba, RWKV, xLSTM, etc.) that's matched to vanilla GPT-2 on training data and parameter count.
- Ecology / dynamics analysis. The trace JSONL records per-head, per-stage ecology state across forward passes — useful for studying how Clifford-algebra-coupled state evolves during inference.
Limitations
- 124M parameters: too small to be a useful generative chat model.
- English only.
- No instruction tuning, no RLHF, no safety tuning.
- Trained without grade-3 trivector terms (the static bivector Ω is the full Cl(3,3) rotation in this checkpoint). The state-dependent variant is a planned v3.7 Medium-scale follow-up.
- The `v36-pcloss` sibling is the same architecture trained with the inter-stage predictive-coding loss un-detached. Slightly worse PPL (28.53), neutral on reasoning: the K-predictor learning a real cross-stage map (r=0.59) doesn't translate into downstream gains at this scale.
Capabilities probe
The checkpoint declares the following dynamics capabilities in its config and
state dict (consumed by the t3atlas viewer for trace rendering):
```json
{
  "has_coupling": true,
  "has_trivectors": false,
  "has_dyn_omega": false,
  "has_inter_stage_pc": true,
  "has_scratchpad": true,
  "n_primitives": 6,
  "null_cone_strength": 0.02,
  "hamiltonian_coupling": 0.02,
  "sigma_hidden": 16,
  "scratchpad_inject_entropy": [0.0, 0.0, 0.03]
}
```
Citation
```bibtex
@misc{sutherland2026t3,
  author    = {Sutherland, Garret},
  title     = {T³: A Clifford-Algebra-Augmented Transformer Architecture},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mirrorethic/t3-124m-v36}
}
```
License
Apache-2.0, for both the code (`mirrorethic/t3-reference`) and the weights (this repository).
Contact
Garret Sutherland (MirrorEthic LLC) — gsutherland@mirrorethic.com.
Released 2026-05-03. The `pytorch_model.bin` here is a stripped, inference-ready copy (498 MB) of the canonical `best.pt` from the v3.6 training campaign (run-3, step 2500, val PPL 27.76 on WikiText-103). The optimizer state and data-loader state were dropped; everything `T3Model` needs at inference is preserved (model_state, ecology_state, config, and provenance metadata).
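If you want to verify what the stripped checkpoint contains, it can be opened directly. A minimal sketch, assuming only the top-level keys named above (model_state, ecology_state, config, provenance metadata) and nothing about their internal layout:

```python
import torch

# Inspect the stripped checkpoint's top-level structure.
# weights_only=False is needed on recent PyTorch because the file also holds
# non-tensor objects (config, provenance metadata).
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
print(sorted(ckpt.keys()))   # expect model_state, ecology_state, config, provenance metadata
print(ckpt["config"])        # architecture/capability fields used by the t3atlas viewer
```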