T³ 124M v3.6 (run-3 release)

Inference-ready checkpoint for T³, a Clifford-algebra-augmented transformer architecture. 124M parameters, GPT-2 Small substrate, 5B training tokens.

This is the canonical reference checkpoint for the v3.6 lineage. Companion artifacts:

Quick start

pip install t3-reference

import torch
from huggingface_hub import hf_hub_download
from t3 import T3Model

# Download the inference checkpoint and load it
ckpt = hf_hub_download("mirrorethic/t3-124m-v36", "pytorch_model.bin")
model = T3Model.from_checkpoint(ckpt)
model.eval()

# Forward pass on a random batch (GPT-2 vocabulary, 50257 tokens)
input_ids = torch.randint(0, 50257, (1, 16))
with torch.no_grad():
    logits, *_ = model(input_ids)

To generate a schema-v1 ecology trace:

from t3.tracing import generate_trace
generate_trace(model, "The capital of France is",
               prompt_id="factual", n_tokens=32,
               out_path="trace.jsonl")

To re-run the published lm-eval-harness benchmarks:

from t3.benchmarks import run_benchmark_suite
results = run_benchmark_suite("path/to/pytorch_model.bin")
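
The return structure of run_benchmark_suite is defined in t3.benchmarks; assuming it maps task names to metric dicts (an assumption here, not a documented contract), a loop like this reproduces the table below:

for task, metrics in results.items():
    print(task, metrics)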

Architecture

T³ extends standard multi-head attention with a per-head ecology of six conjugate primitives (E, I, F, V, C, K) coupled through bivector composition in Cl(3,3) geometric algebra. Heads interact through a learned blockade-and-cosurvival graph and ponder adaptively per stage via output-entropy halt. Full technical specification: docs/ARCHITECTURE.md.
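
The per-stage pondering is the easiest piece to sketch in isolation. A hedged illustration of an output-entropy halt with the 4-step cap from the table below; the threshold value and the names ponder_stage / stage_fn are illustrative, not the reference implementation:

import torch
import torch.nn.functional as F

def ponder_stage(stage_fn, x, halt_entropy=2.0, max_steps=4):
    # Repeat a stage's forward pass until the output distribution is
    # confident enough (low entropy) or the per-stage cap is hit.
    for step in range(max_steps):
        x, logits = stage_fn(x)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-9)).sum(-1).mean()
        if entropy < halt_entropy:
            break  # output-entropy halt
    return x, step + 1  # hidden state and ponder-loop count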

Field Value
Parameters 124,500,000
Stages 3, with layers_per_stage = [4, 3, 5] (12 transformer blocks total)
d_model 768
n_heads 12
d_ff 3072
vocab_size 50257 (GPT-2 tokenizer)
max_seq_len 1024
Substrate GPT-2 Small initialization
Training data 5B tokens (FineWeb-Edu 40%, DCLM 20%, StackEdu 10%, FineMath 10%, Cosmopedia 10%, Wikipedia 10%)
Cumulative training step 138,000 (135.5K substrate + 2,500 v3.6 increment)
Hamiltonian coupling ω 0.02
Trivectors off (the trivectors-on variant is a planned v3.7 follow-up release)
Inter-stage predictive coding on (weight = 0.05)
Scratchpad heads on (scratchpad_inject_entropy = (0.0, 0.0, 0.03) — S2-only)
ACT output-entropy halt + per-stage 4-step cap
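
For orientation, the table's shape numbers check out against a plain GPT-2 back-of-the-envelope count (ignoring biases, LayerNorms, and the ecology-specific parameters, so this is a sanity check rather than the exact 124,500,000):

d_model, d_ff, vocab, seq = 768, 3072, 50257, 1024
blocks = sum([4, 3, 5])               # 12 blocks across 3 stages
attn = 4 * d_model * d_model          # q, k, v, out projections
mlp = 2 * d_model * d_ff              # up + down projections
embed = vocab * d_model + seq * d_model
print(f"~{(blocks * (attn + mlp) + embed) / 1e6:.0f}M parameters")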

Evaluation

All numbers are full lm-eval-harness 0.4.x runs (no subset). Reproduce with examples/run_benchmarks.py from the reference repo.

Task Metric Value stderr
WikiText-103 (val) perplexity 27.76
BoolQ acc 0.6046 0.0086
ARC-Easy acc 0.4331 0.0102
ARC-Challenge acc 0.2176 0.0121
PIQA acc 0.6050 0.0114
HellaSwag acc 0.3040 0.0046
WinoGrande acc 0.5043 0.0141
COPA acc 0.6000 0.0492
RTE acc 0.5235 0.0301

For comparison panels (parameter-efficiency vs vanilla GPT-2 same-data, compute frontier), see https://t3atlas.dev/benchmarks/.

vs vanilla GPT-2 (same 5B-token training data)

T³-124M-v36 against the gpt2/124M_vanilla_5b_same_data baseline. The interesting delta is on multi-step reasoning: T³ pondering allocates extra forward compute per token (S1 averages 2.7–3.7 ponder loops on these tasks).
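
Those loop counts can be recovered from an ecology trace. A hedged sketch, assuming each trace record carries a per-stage list of loop counts under a field like "ponder_steps" (an illustrative name; the real key is fixed by the schema-v1 spec):

import json
from statistics import mean

with open("trace.jsonl") as f:
    s1_loops = [json.loads(line)["ponder_steps"][0]  # stage S1
                for line in f]
print(f"S1 mean ponder loops: {mean(s1_loops):.2f}")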

Intended use

  • Research and interpretability. The checkpoint is designed to be inspected via the trace library, not deployed for production text generation. The model is small (124M), English-only, and not instruction-tuned.
  • Architectural comparison. A reference point for novel sequence-architecture work (Mamba, RWKV, xLSTM, etc.) that is matched to vanilla GPT-2 on training data and parameter count.
  • Ecology / dynamics analysis. The trace JSONL records per-head, per-stage ecology state across forward passes — useful for studying how Clifford-algebra-coupled state evolves during inference.

Limitations

  • 124M parameters: too small to be a useful generative chat model.
  • English only.
  • No instruction tuning, no RLHF, no safety tuning.
  • Trained without grade-3 trivector terms (in this checkpoint the static bivector Ω supplies the full Cl(3,3) rotation). The state-dependent variant is a planned v3.7 Medium-scale follow-up.
  • The v36-pcloss sibling is the same architecture trained with the inter-stage predictive-coding loss un-detached. Slightly worse PPL (28.53), neutral on reasoning — the K-predictor learning a real cross-stage map (r=0.59) doesn't translate into downstream gains at this scale.

Capabilities probe

The checkpoint declares the following dynamics in its config and state dict (consumed by the t3atlas viewer for trace rendering):

{
  "has_coupling":       true,
  "has_trivectors":     false,
  "has_dyn_omega":      false,
  "has_inter_stage_pc": true,
  "has_scratchpad":     true,
  "n_primitives":       6,
  "null_cone_strength": 0.02,
  "hamiltonian_coupling": 0.02,
  "sigma_hidden":       16,
  "scratchpad_inject_entropy": [0.0, 0.0, 0.03]
}
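
A consumer can branch on these flags before attempting to render a trace. A minimal sketch, assuming the block above is saved as capabilities.json (how t3atlas actually ingests the flags is not specified here):

import json

with open("capabilities.json") as f:
    caps = json.load(f)

# List only the dynamics this checkpoint actually declares; for
# v3.6 run-3 that excludes trivectors and dynamic omega.
active = [name for flag, name in [
    (caps["has_coupling"], "blockade/cosurvival coupling"),
    (caps["has_trivectors"], "trivector terms"),
    (caps["has_dyn_omega"], "dynamic omega"),
    (caps["has_inter_stage_pc"], "inter-stage predictive coding"),
    (caps["has_scratchpad"], "scratchpad heads"),
] if flag]
print("active dynamics:", ", ".join(active))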

Citation

@misc{sutherland2026t3,
  author = {Sutherland, Garret},
  title  = {T³: A Clifford-Algebra-Augmented Transformer Architecture},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/mirrorethic/t3-124m-v36}
}

License

Apache-2.0, covering both the code (mirrorethic/t3-reference) and the weights (this repository).

Contact

Garret Sutherland (MirrorEthic LLC) — gsutherland@mirrorethic.com.


Released 2026-05-03. The pytorch_model.bin here is a stripped inference-ready copy (498 MB) of the canonical best.pt from the v3.6 training campaign (run-3, step 2500, val PPL 27.76 on WikiText-103). The optimizer state and data-loader state were dropped; everything T3Model needs at inference is preserved (model_state, ecology_state, config, and provenance metadata).
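
A quick way to verify what survived the strip, assuming the top-level keys follow the names in the description above:

import torch

# Full pickle load (the file carries config and provenance objects,
# not just tensors); expect model_state, ecology_state, config, and
# provenance metadata, with optimizer and data-loader state gone.
ckpt = torch.load("pytorch_model.bin", map_location="cpu",
                  weights_only=False)
print(sorted(ckpt.keys()))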
