Theo Ultimate — Y.AI

Hymba × Mamba-3 × Recurrent Depth Transformer × DeepSeek MoE
A brain-inspired hybrid language model with three distinct memory systems.

Model Summary

Property	Value
Name	Theo Ultimate
Made by	Y.AI
Parameters	~800M
Architecture	Hymba + Mamba-3 + RDT + MoE
dtype	bfloat16
Context length	512 tokens
License	Apache 2.0

Architecture

Theo Ultimate is a fully custom hybrid architecture that combines four cutting-edge ideas into one unified model:

Input
  │
  ▼
[Token Embedding] + [Meta Tokens × 16]
  │
  ▼
[Prelude HymbaBlock]
  │
  ▼
[Recurrent Hymba — Full Attention]   ← loops 2–6×
  │
  ▼
[Recurrent Hymba — Sliding Window]   ← loops 2–6×
  │
  ▼
[Coda HymbaBlock]
  │
  ▼
[LM Head]

1 — Hymba Hybrid Heads (NVIDIA)

Each block runs an SSM branch and an Attention branch in parallel, then fuses their outputs. This mirrors the NVIDIA Hymba design and gives the model both fast recurrent scanning and precise token retrieval in every single layer.
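As a rough illustration of the parallel-then-fuse pattern described above, the sketch below normalises each branch's output and averages the two. The function names, the RMS normalisation, and the scalar scales `beta_attn` and `beta_ssm` are assumptions made for illustration; the exact fusion rule used in Theo's HymbaBlock is not specified here.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square magnitude."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def fuse_branches(attn_out, ssm_out, beta_attn=1.0, beta_ssm=1.0):
    """Average the normalised outputs of the two parallel branches.

    Normalising first keeps one branch from dominating just because
    its activations happen to be larger in magnitude.
    """
    a = rms_norm(attn_out)
    s = rms_norm(ssm_out)
    return [0.5 * (beta_attn * x + beta_ssm * y) for x, y in zip(a, s)]

# Hypothetical branch outputs for one token (width 4 for brevity).
attn_out = [0.2, -0.1, 0.4, 0.0]
ssm_out = [20.0, 10.0, -30.0, 5.0]   # much larger scale than attn_out
fused = fuse_branches(attn_out, ssm_out)
```

Because both inputs are rescaled before averaging, the hundredfold difference in magnitude between the two hypothetical branches does not carry through to `fused`.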

2 — Mamba-3 SSM

The SSM branch uses a Mamba-3 kernel with:

Exponential trapezoidal discretisation (exp_trap)
Data-dependent Δ, B, C projections
Data-dependent RoPE applied to B and C
LayerNorm stabilisation on both B and C
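To give a feel for what a trapezoidal discretisation buys, here is a minimal scalar sketch. It integrates h' = a·h + b·x with an exact exponential decay of the state and a trapezoidal estimate of the input integral, then compares it against a simpler one-point rule. The scalar setting, the fixed Δ, and the function names are assumptions for illustration; the model's kernel uses data-dependent Δ, B, and C, and whether it matches this form term for term is not confirmed here.

```python
import math

def exp_trap_step(h_prev, x_prev, x_t, a, b, delta):
    """One step of h' = a*h + b*x with exact state decay and a
    trapezoidal estimate of the input integral over the step."""
    alpha = math.exp(delta * a)      # exact decay of the old state
    beta = 0.5 * delta * alpha       # weight on the previous input
    gamma = 0.5 * delta              # weight on the current input
    return alpha * h_prev + beta * b * x_prev + gamma * b * x_t

def exp_euler_step(h_prev, x_t, a, b, delta):
    """Baseline: same decay, but only the current input is used."""
    return math.exp(delta * a) * h_prev + delta * b * x_t

a, b, delta, steps = -1.0, 1.0, 0.1, 20
h_trap = h_euler = 0.0
for _ in range(steps):
    h_trap = exp_trap_step(h_trap, 1.0, 1.0, a, b, delta)   # constant input x = 1
    h_euler = exp_euler_step(h_euler, 1.0, a, b, delta)

# Closed-form solution for constant input x = 1 and h(0) = 0.
h_exact = 1.0 - math.exp(a * delta * steps)
```

On this toy problem `h_trap` lands closer to `h_exact` than `h_euler` does, which is the usual motivation for using both endpoints of the step.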

3 — Recurrent Depth Transformer (Claude Mythos RDT)

Two RecurrentHymbaBlock modules loop the same weights 2–6 times, each pass conditioned on a stable LTI injection:

h_t = A · h_{t-1} + B · e

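A is sigmoid-bounded, so every entry lies strictly between 0 and 1. Assuming A is diagonal, ρ(A) < 1 then holds by construction, and the injected state h_t stays bounded for any bounded e, regardless of what the optimiser does to the raw parameters.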
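The boundedness argument can be checked numerically. The sketch below treats A as diagonal (an assumption) and iterates the injection per channel; `raw_a`, `b`, and `e` are hypothetical values, not parameters taken from a checkpoint.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_injection(raw_a, b, e, n_loops):
    """Iterate h <- a*h + b*e per channel, with a = sigmoid(raw_a)."""
    a = [sigmoid(r) for r in raw_a]
    h = [0.0] * len(raw_a)
    for _ in range(n_loops):
        h = [a_i * h_i + b_i * e_i for a_i, h_i, b_i, e_i in zip(a, h, b, e)]
    return a, h

raw_a = [-3.0, 0.0, 4.0]   # unconstrained parameters the optimiser may move freely
b = [1.0, 0.5, 2.0]
e = [1.0, -2.0, 0.3]       # fixed injected embedding
a, h = run_injection(raw_a, b, e, n_loops=200)

# Geometric-series limit b*e / (1 - a): the state cannot grow beyond this.
limits = [b_i * e_i / (1.0 - a_i) for a_i, b_i, e_i in zip(a, b, e)]
```

Note that the limit b·e / (1 − a) grows as a approaches 1, so a bounded state is not necessarily a small one.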

4 — DeepSeek-style Mixture of Experts

Every block contains a sparse MoE FFN:

24 routed experts + 1 shared expert
Top-2 routing with load-balancing auxiliary loss
Loop-index embedding conditions the router on iteration depth
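The routing scheme listed above can be sketched in plain Python. The function names and the specific auxiliary loss (a Switch-Transformer-style product of dispatch fraction and mean router probability) are assumptions; the loss Theo actually uses is not spelled out here. The shared expert is not routed, and the loop-index embedding is omitted for brevity.

```python
import math

def top2_route(logits):
    """Return ([(expert_index, weight), (expert_index, weight)], probs)
    for the two highest-scoring experts of one token."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]       # numerically stable softmax
    total = sum(exps)
    probs = [x / total for x in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)              # renormalise over the chosen pair
    return [(i, probs[i] / norm) for i in top], probs

def load_balance_loss(all_probs, all_choices, n_experts):
    """Assumed auxiliary loss: n_experts * sum_i f_i * p_i, where f_i is the
    fraction of routing slots sent to expert i and p_i is the mean router
    probability assigned to expert i."""
    n_tokens = len(all_probs)
    slots = sum(len(c) for c in all_choices)
    f = [sum(c.count(i) for c in all_choices) / slots for i in range(n_experts)]
    p = [sum(pr[i] for pr in all_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

n_experts = 24
# Hypothetical router logits for two tokens.
token_logits = [
    [0.1 * i for i in range(n_experts)],
    [0.1 * (n_experts - i) for i in range(n_experts)],
]
routes, probs = zip(*(top2_route(l) for l in token_logits))
choices = [[i for i, _ in r] for r in routes]
aux = load_balance_loss(list(probs), choices, n_experts)
```

Each token's output would then be the weighted sum of its two routed experts plus the shared expert's output.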

Memory System

Theo has three distinct memory types inspired by neuroscience:

Memory	Mechanism	Brain Analog
🧠 Meta Memory	16 persistent meta tokens prepended to every sequence	Prefrontal cortex
🌊 Fading Memory	Mamba-3 SSM hidden state — decays over distance	Hippocampal short-term
📸 Snapshot Memory	GQA attention with optional sliding window — exact retrieval	Episodic long-term
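To make the interaction between meta memory and snapshot memory concrete, the sketch below builds a causal attention mask with an optional sliding window. Keeping the meta tokens visible from outside the window is an assumption borrowed from the Hymba design, and `build_mask` is a hypothetical helper, not a function from Theo's code.

```python
def build_mask(seq_len, n_meta, window=None):
    """Return mask[q][k] = True where query position q may attend to key k.

    Positions 0..n_meta-1 are meta tokens; the rest are ordinary tokens.
    Attention is causal. With a window, ordinary keys `window` or more
    positions back are hidden, but meta tokens stay visible.
    """
    total = n_meta + seq_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):                      # causal: no looking ahead
            in_window = window is None or q - k < window
            mask[q][k] = in_window or k < n_meta    # meta tokens always visible
    return mask

# Tiny example: 2 meta tokens, 6 ordinary tokens, window of 3.
mask = build_mask(seq_len=6, n_meta=2, window=3)
last = mask[-1]   # visibility pattern for the final token
```

In this toy setting the final token sees both meta tokens and its three most recent positions (itself included), while the older ordinary tokens are reachable only through the SSM's fading state.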

Configuration

{
  "vocab_size": 32000,
  "n_meta_tokens": 16,
  "sliding_window": 128,
  "d_model": 1536,
  "d_state": 128,
  "d_latent": 1536,
  "d_ff": 4096,
  "n_heads": 12,
  "n_kv_heads": 2,
  "n_prelude": 1,
  "n_coda": 1,
  "max_loop_iters": 6,
  "max_seq_len": 512,
  "n_experts": 24,
  "n_shared": 1,
  "n_experts_tok": 2,
  "expert_dim": 2560,
  "loop_min": 2,
  "loop_max": 6,
  "rho_target": 0.37
}
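A few derived quantities follow directly from these values. The sketch below computes them from a subset of the config; the interpretation of each key (for example, that the head width is `d_model // n_heads`) is an assumption based on common naming conventions.

```python
import json

# Subset of the configuration above, inlined so the snippet is self-contained.
config_text = """
{
  "d_model": 1536,
  "n_heads": 12,
  "n_kv_heads": 2,
  "n_experts": 24,
  "n_shared": 1,
  "n_experts_tok": 2
}
"""
cfg_dict = json.loads(config_text)

head_dim = cfg_dict["d_model"] // cfg_dict["n_heads"]              # width of one attention head
queries_per_kv = cfg_dict["n_heads"] // cfg_dict["n_kv_heads"]     # GQA group size
active_experts = cfg_dict["n_experts_tok"] + cfg_dict["n_shared"]  # experts run per token
routed_fraction = cfg_dict["n_experts_tok"] / cfg_dict["n_experts"]

print(head_dim, queries_per_kv, active_experts, round(routed_fraction, 3))
# → 128 6 3 0.083
```

Under these assumptions, each token activates only 2 of the 24 routed experts plus the shared one, which is why the active parameter count per token is well below the total.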

Files in This Repo

theo-ultimate/
├── config.json              # Model hyperparameters
├── tokenizer.json           # BPE tokenizer (vocab 32 000)
├── theo_best.pt             # Best checkpoint (lowest val loss)
├── theo_final.pt            # Final epoch checkpoint
├── corpus.txt               # Training corpus
└── checkpoints/
    ├── theo_epoch01.pt
    ├── theo_epoch02.pt
    └── ...                  # Per-epoch snapshots

Quick Start

Load the tokenizer

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
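def enc(text):
    """Encode a string into a list of token ids."""
    return tok.encode(text).ids

def dec(ids):
    """Decode a list of token ids back into a string."""
    return tok.decode(ids)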

Reconstruct the model

import json
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

# Use the GPU with bfloat16 when available; fall back to float32 on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype  = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# paste the full TheoUltimate class definition here
# (or import from your local copy of the training script)

# Rebuild the config object from the hyperparameters shipped with the repo.
with open("config.json") as f:
    cfg_dict = json.load(f)

cfg  = TheoUltimateConfig(**cfg_dict)
theo = TheoUltimate(cfg).to(device=device, dtype=dtype)

# Load the best checkpoint's weights and switch to inference mode.
theo.load_state_dict(
    torch.load("theo_best.pt", map_location=device)
)
theo.eval()
print("✅ Theo loaded!")

Chat

PAD_ID = tok.token_to_id("<pad>")
EOS_ID = tok.token_to_id("<eos>")
END_ID = tok.token_to_id("</theo>")

@torch.no_grad()
def chat(user_input, n_loops=6, max_new=60, temp=0.75, top_p=0.9):
    ids = enc(f"<bos> <user> {user_input.strip()} <theo>")
    inp = torch.tensor([ids], dtype=torch.long, device=device)
    gen = []

    for _ in range(max_new):
        ctx = inp[:, -cfg.max_seq_len:]
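        # Autocast only on GPU; on CPU the model already runs in float32.
        with torch.autocast(device_type=device.type, dtype=dtype, enabled=device.type == "cuda"):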
            logits, _ = theo(ctx, n_loops=n_loops)

        probs = F.softmax(logits[0, -1, :].float() / max(temp, 1e-6), dim=-1)
        sp, si = torch.sort(probs, descending=True)
        cum = torch.cumsum(sp, dim=0)
        sp[cum - sp > top_p] = 0.0
        sp /= sp.sum().clamp(min=1e-9)
        nxt = si[torch.multinomial(sp, 1)].item()

        if nxt in (EOS_ID, END_ID):
            break
        gen.append(nxt)
        inp = torch.cat(
            [inp, torch.tensor([[nxt]], device=device)], dim=1
        )

    return dec(gen).strip() or "I'm here to help!"

# Sampling is stochastic, so replies vary from run to run; the outputs below are examples.
print(chat("hi"))
# → Hello! I'm Theo, made by Y.AI. How can I help you today?

print(chat("who are you"))
# → I'm Theo, an AI assistant created by Y.AI.

print(chat("how does your memory work"))
# → I have three memories: meta, fading SSM, and snapshot attention!

Training Details

Setting	Value
Epochs	10
Batch size	8
Optimiser	AdamW (β₁=0.9, β₂=0.95, wd=0.1)
Learning rate	3e-4 (cosine decay → 1.5e-5)
Warmup steps	100
Grad clip	1.0
Aux loss weight	0.01
Loop iters	random 2–6 per step
Hardware	RTX PRO 6000 Blackwell (96 GB)
Precision	bfloat16 (autocast)
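The learning-rate settings above can be sketched as a schedule function. The linear shape of the warmup and the value of `total_steps` are assumptions (the table gives only the warmup length, the peak, and the final rate); `lr_at` is a hypothetical helper, not a function from the training script.

```python
import math

PEAK_LR, FINAL_LR, WARMUP_STEPS = 3e-4, 1.5e-5, 100

def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)                    # hold FINAL_LR past the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

total_steps = 5000   # hypothetical; the real step count depends on corpus size
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
```

The final rate of 1.5e-5 is 5% of the peak, so the decay never drives the learning rate all the way to zero.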

Stability Guarantees

Property	How
ρ(A) < 1 always	A = sigmoid(raw_A) — mathematically bounded
No RoPE dtype mismatch	Cache rebuilt in model dtype on demand
Inference safe	Full torch.autocast wrapping
Load balancing	Auxiliary MoE loss at every forward pass

Citation

If you use Theo in your research or product, please cite:

@misc{theo_ultimate_2025,
  title   = {Theo Ultimate: A Brain-Inspired Hybrid Language Model},
  author  = {Y.AI},
  year    = {2025},
  url     = {https://huggingface.co/YOUR_USERNAME/theo-ultimate}
}

About Y.AI

Y.AI is a company focused on building helpful, efficient, and
interpretable AI systems. Theo is our flagship conversational model,
combining the best ideas from state space models, transformers, recurrent
depth, and mixture of experts into a single coherent architecture.

Made with ❤️ by Y.AI
