Theo Ultimate - Y.AI
Hymba × Mamba-3 × Recurrent Depth Transformer × DeepSeek MoE
A brain-inspired hybrid language model with three distinct memory systems.
Model Summary
| Property | Value |
|---|---|
| Name | Theo Ultimate |
| Made by | Y.AI |
| Parameters | ~800M |
| Architecture | Hymba + Mamba-3 + RDT + MoE |
| dtype | bfloat16 |
| Context length | 512 tokens |
| License | Apache 2.0 |
Architecture
Theo Ultimate is a fully custom hybrid architecture that combines four cutting-edge ideas into one unified model:
    Input
      │
      ▼
    [Token Embedding] + [Meta Tokens × 16]
      │
      ▼
    [Prelude HymbaBlock]
      │
      ▼
    [Recurrent Hymba: Full Attention]  ← loops 2-6×
      │
      ▼
    [Recurrent Hymba: Sliding Window]  ← loops 2-6×
      │
      ▼
    [Coda HymbaBlock]
      │
      ▼
    [LM Head]
1. Hymba Hybrid Heads (NVIDIA)
Each block runs an SSM branch and an Attention branch in parallel, then fuses their outputs. This mirrors the NVIDIA Hymba design and gives the model both fast recurrent scanning and precise token retrieval in every single layer.
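The parallel-branch idea can be illustrated with a framework-free toy. This is only a sketch under stated assumptions: the tokens are scalars, the decay constant `0.9` is arbitrary, and the plain averaging used to fuse the branches is a stand-in, since the fusion rule Theo actually uses is not described here. The function names are hypothetical.

```python
import math

def ssm_branch(xs, decay=0.9):
    """Fading memory: scalar recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out

def attention_branch(xs):
    """Causal softmax attention over scalar tokens (query = key = value = x)."""
    out = []
    for t, q in enumerate(xs):
        past = xs[: t + 1]
        scores = [q * k for k in past]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, past)) / z)
    return out

def hybrid_block(xs):
    """Run both branches on the same input, then fuse by averaging (assumed rule)."""
    return [0.5 * (s + a) for s, a in zip(ssm_branch(xs), attention_branch(xs))]

tokens = [1.0, 0.0, 0.0, 2.0]
fused = hybrid_block(tokens)
```

The recurrent branch summarises everything seen so far with a decaying state, while the attention branch can put most of its weight on a single earlier token, which is the complementarity the hybrid design relies on.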
2. Mamba-3 SSM
The SSM branch uses a Mamba-3 kernel with:
- Exponential trapezoidal discretisation (`exp_trap`)
- Data-dependent `Δ`, `B`, `C` projections
- Data-dependent RoPE applied to `B` and `C`
- LayerNorm stabilisation on both `B` and `C`
3. Recurrent Depth Transformer (Claude Mythos RDT)
Two RecurrentHymbaBlock modules loop the same weights 2 to 6 times, each pass conditioned on a stable linear time-invariant (LTI) injection:
h_t = A · h_{t-1} + B · e
A is sigmoid-bounded, so ρ(A) < 1 holds by construction and the injected recurrence cannot diverge for a bounded input e, regardless of what the optimiser does.
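The stability argument can be checked numerically with a small sketch. It assumes A is diagonal, so the recurrence acts elementwise and each entry of A is `sigmoid(raw_a)` in (0, 1); the source does not state the shape of A, and the values of `raw_a`, `b`, and `e` below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lti_loop(raw_a, b, e, n_loops):
    """Iterate h_t = A * h_{t-1} + B * e elementwise, with A = sigmoid(raw_a)."""
    a = [sigmoid(r) for r in raw_a]
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = [ai * hi + bi * ei for ai, hi, bi, ei in zip(a, h, b, e)]
    return a, h

# Hypothetical values; raw_a is unconstrained, as an optimiser would leave it.
raw_a = [-3.0, 0.0, 8.0]
b = [1.0, 1.0, 1.0]
e = [2.0, -1.0, 0.5]

a, h = lti_loop(raw_a, b, e, n_loops=6)

# Geometric-series bound: |h_t| < |B * e| / (1 - A) for every t when 0 < A < 1.
bounds = [abs(bi * ei) / (1.0 - ai) for ai, bi, ei in zip(a, b, e)]
```

Because every entry of `a` lies strictly between 0 and 1, each component of `h` is a partial geometric sum and stays below its bound no matter how many loop iterations are run.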
4. DeepSeek-style Mixture of Experts
Every block contains a sparse MoE FFN:
- 24 routed experts + 1 shared expert
- Top-2 routing with load-balancing auxiliary loss
- Loop-index embedding conditions the router on iteration depth
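Top-2 routing with a load-balancing term can be sketched without any framework. The auxiliary loss below follows the common Switch-Transformer-style form (number of experts times the sum over experts of routed fraction times mean router probability); that form is an assumption, as is the toy size of 4 experts instead of 24. The shared expert and the loop-index embedding are not modelled.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [v / z for v in exps]

def route_top_k(logits, k=2):
    """Return router probabilities and [(expert, renormalised weight)] for one token."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return probs, [(i, probs[i] / mass) for i in top]

def load_balance_loss(batch_logits, k=2):
    """Assumed auxiliary loss: n_experts * sum_i(routed_fraction_i * mean_prob_i)."""
    n_experts = len(batch_logits[0])
    counts = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        probs, picks = route_top_k(logits, k)
        for i, _ in picks:
            counts[i] += 1.0
        for i, p in enumerate(probs):
            mean_prob[i] += p
    n_tok = len(batch_logits)
    frac = [c / (n_tok * k) for c in counts]
    mean_prob = [p / n_tok for p in mean_prob]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Two tokens that prefer different expert pairs: load is spread evenly.
balanced = load_balance_loss([[5.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 5.0]])
# Two tokens that both prefer experts 0 and 1: load collapses onto two experts.
collapsed = load_balance_loss([[5.0, 5.0, 0.0, 0.0], [5.0, 5.0, 0.0, 0.0]])
```

The loss is smallest when tokens are spread evenly across experts and grows as routing collapses, which is what pushes the router toward balanced use of all experts.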
Memory System
Theo has three distinct memory types inspired by neuroscience:
| Memory | Mechanism | Brain Analog |
|---|---|---|
| Meta Memory | 16 persistent meta tokens prepended to every sequence | Prefrontal cortex |
| Fading Memory | Mamba-3 SSM hidden state, decays over distance | Hippocampal short-term |
| Snapshot Memory | GQA attention with optional sliding window, exact retrieval | Episodic long-term |
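The difference between full and sliding-window attention comes down to which earlier positions a query may see. A minimal sketch of the two masks follows; it assumes the window counts the current token, and it leaves out the meta tokens, whose handling inside the window is not described here. `causal_mask` is a hypothetical helper name.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future positions
            if window is not None:
                visible = visible and (q - k) < window  # only the last `window` tokens
            row.append(visible)
        mask.append(row)
    return mask

full = causal_mask(5)             # every earlier token is retrievable
local = causal_mask(5, window=2)  # only the current and previous token
```

With `window=2`, the last query position sees only positions 3 and 4, whereas the full mask lets it retrieve any of the five positions exactly.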
Configuration
{
"vocab_size": 32000,
"n_meta_tokens": 16,
"sliding_window": 128,
"d_model": 1536,
"d_state": 128,
"d_latent": 1536,
"d_ff": 4096,
"n_heads": 12,
"n_kv_heads": 2,
"n_prelude": 1,
"n_coda": 1,
"max_loop_iters": 6,
"max_seq_len": 512,
"n_experts": 24,
"n_shared": 1,
"n_experts_tok": 2,
"expert_dim": 2560,
"loop_min": 2,
"loop_max": 6,
"rho_target": 0.37
}
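A few derived quantities follow directly from these values. The sketch below copies a subset of the config and assumes the usual conventions that the per-head width is `d_model / n_heads` and that the shared expert is always active alongside the routed ones; neither is confirmed by the source.

```python
import json

# Subset of the configuration above, inlined so the example is self-contained.
config_text = """
{
  "d_model": 1536,
  "n_heads": 12,
  "n_kv_heads": 2,
  "n_experts": 24,
  "n_shared": 1,
  "n_experts_tok": 2
}
"""
cfg = json.loads(config_text)

head_dim = cfg["d_model"] // cfg["n_heads"]              # width of one attention head
queries_per_kv = cfg["n_heads"] // cfg["n_kv_heads"]     # GQA: query heads sharing one KV head
active_experts = cfg["n_experts_tok"] + cfg["n_shared"]  # experts evaluated per token
total_experts = cfg["n_experts"] + cfg["n_shared"]       # experts stored per MoE layer
```

Under these assumptions each head is 128 wide, six query heads share each key/value head, and 3 of the 25 experts in a layer run for any given token.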
Files in This Repo
text
theo-ultimate/
├── config.json        # Model hyperparameters
├── tokenizer.json     # BPE tokenizer (vocab 32 000)
├── theo_best.pt       # Best checkpoint (lowest val loss)
├── theo_final.pt      # Final epoch checkpoint
├── corpus.txt         # Training corpus
└── checkpoints/
    ├── theo_epoch01.pt
    ├── theo_epoch02.pt
    └── ...            # Per-epoch snapshots
Quick Start
Load the tokenizer
Python
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
enc = lambda t: tok.encode(t).ids
dec = lambda i: tok.decode(i)
Reconstruct the model
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
import math
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
# paste the full TheoUltimate class definition here
# (or import from your local copy of the training script)
import json
with open("config.json") as f:
    cfg_dict = json.load(f)
cfg = TheoUltimateConfig(**cfg_dict)
theo = TheoUltimate(cfg).to(device).to(dtype)
theo.load_state_dict(
    torch.load("theo_best.pt", map_location=device)
)
theo.eval()  # inference mode for layers that behave differently in training
print("✅ Theo loaded!")
Chat
Python
PAD_ID = tok.token_to_id("<pad>")
EOS_ID = tok.token_to_id("<eos>")
END_ID = tok.token_to_id("</theo>")

@torch.no_grad()
def chat(user_input, n_loops=6, max_new=60, temp=0.75, top_p=0.9):
    ids = enc(f"<bos> <user> {user_input.strip()} <theo>")
    inp = torch.tensor([ids], dtype=torch.long, device=device)
    gen = []
    for _ in range(max_new):
        ctx = inp[:, -cfg.max_seq_len:]  # keep only the most recent context window
        # Autocast is enabled only on CUDA, so the same code also runs on CPU.
        with torch.autocast(device_type=device.type, dtype=dtype,
                            enabled=(device.type == "cuda")):
            logits, _ = theo(ctx, n_loops=n_loops)
        probs = F.softmax(logits[0, -1, :].float() / max(temp, 1e-6), dim=-1)
        # Nucleus (top-p) filtering: drop tokens once the mass before them exceeds top_p.
        sp, si = torch.sort(probs, descending=True)
        cum = torch.cumsum(sp, dim=0)
        sp[cum - sp > top_p] = 0.0
        sp /= sp.sum().clamp(min=1e-9)
        nxt = si[torch.multinomial(sp, 1)].item()
        if nxt in (EOS_ID, END_ID):
            break
        gen.append(nxt)
        inp = torch.cat(
            [inp, torch.tensor([[nxt]], device=device)], dim=1
        )
    return dec(gen).strip() or "I'm here to help!"

print(chat("hi"))
# → Hello! I'm Theo, made by Y.AI. How can I help you today?
print(chat("who are you"))
# → I'm Theo, an AI assistant created by Y.AI.
print(chat("how does your memory work"))
# → I have three memories: meta, fading SSM, and snapshot attention!
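The nucleus (top-p) step inside `chat` can be followed more easily on a tiny distribution. The sketch below mirrors the same rule in plain Python: a token is dropped once the probability mass before it already exceeds `top_p`. The function name `top_p_filter` and the example probabilities are illustrative, not part of the repository.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until the mass before a token exceeds top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = {}, 0.0
    for i in order:
        if cum > top_p:  # mass of the tokens before this one already exceeds top_p
            break
        kept[i] = probs[i]
        cum += probs[i]
    total = sum(kept.values())
    # Renormalise so the surviving probabilities sum to 1 before sampling.
    return {i: p / total for i, p in kept.items()}

filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
```

Here the three most likely tokens survive: the mass before the third token is 0.8, which does not exceed 0.9, while the mass before the fourth is about 0.95, so it is cut.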
Training Details
| Setting | Value |
|---|---|
| Epochs | 10 |
| Batch size | 8 |
| Optimiser | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| Learning rate | 3e-4 (cosine decay to 1.5e-5) |
| Warmup steps | 100 |
| Grad clip | 1.0 |
| Aux loss weight | 0.01 |
| Loop iters | random 2-6 per step |
| Hardware | RTX Blackwell 6000 96 GB |
| Precision | bfloat16 (autocast) |
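The learning-rate settings above can be read as a warmup followed by cosine decay. The sketch below is one plausible reading, not the training script: linear warmup is assumed (only the step count is given), and `total_steps=1000` is a placeholder because the real number of optimiser steps is not stated.

```python
import math

def lr_at(step, total_steps, peak=3e-4, floor=1.5e-5, warmup=100):
    """Linear warmup to `peak` (assumed), then cosine decay down to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(1.0, progress)  # hold at the floor if training runs longer
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical; depends on corpus size and batch size
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
```

The rate climbs to 3e-4 over the first 100 steps, then follows half a cosine period and ends at 1.5e-5, which is 5% of the peak.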
Stability Guarantees
| Property | How |
|---|---|
| ρ(A) < 1 always | A = sigmoid(raw_A), mathematically bounded |
| No RoPE dtype mismatch | Cache rebuilt in model dtype on demand |
| Inference safe | Full torch.autocast wrapping |
| Load balancing | Auxiliary MoE loss at every forward pass |
Citation
If you use Theo in your research or product, please cite:
bibtex
@misc{theo_ultimate_2025,
title = {Theo Ultimate: A Brain-Inspired Hybrid Language Model},
author = {Y.AI},
year = {2025},
url = {https://huggingface.co/YOUR_USERNAME/theo-ultimate}
}
About Y.AI
Y.AI is a company focused on building helpful, efficient, and
interpretable AI systems. Theo is our flagship conversational model,
combining the best ideas from state space models, transformers, recurrent
depth, and mixture of experts into a single coherent architecture.
Made with ❤️ by Y.AI