TCF-1: Temporal Compression Field

Architecture: Hybrid Transformer + Mamba SSM
Parameters: 494.9M
Built by: Tree-ForgedMind
Hardware: Apple Mac Studio (128GB unified memory)
Framework: MLX


What TCF-1 Is

TCF-1 (Temporal Compression Field) is a from-scratch hybrid language model architecture. No pre-trained weights. No borrowed base model. Custom tokenizer, custom architecture, trained from zero.

The core idea: pair causal self-attention with a Mamba state space model (SSM) in every layer. Two memory mechanisms, attention's contextual lookup and the SSM's compressed temporal signal, run together through 12 hybrid layers.


Architecture

Parameters: 494.9M
Vocabulary: 32,000 (SentencePiece BPE)
Embedding dim: 1,024
Attention heads: 16
Hybrid pairs: 12
Block size: 2,048 (trained at 512)
SSM state dim: 16
SSM expansion: 2

Each of the 12 HybridPair layers contains:

  1. AttentionBlock: causal self-attention
  2. MambaBlock: selective state space model with persistent hidden state
  3. FeedForward: GELU-activated projection

The Mamba hidden state, of shape (B, d_inner, d_state), carries temporal information along the sequence independently of attention. The model has two paths through every layer: one that looks across the sequence (attention) and one that compresses time (SSM).
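The fixed-size state described above can be illustrated with a toy recurrence. This is a minimal sketch of a diagonal linear state space scan for a single channel, not the TCF-1 MambaBlock: a real Mamba layer makes its step parameters depend on the input (the "selective" part) and runs over d_inner channels. The function name `ssm_scan` and all constants except the state size are illustrative assumptions.

```python
from typing import List, Tuple


def ssm_scan(
    xs: List[float], a: List[float], b: List[float], c: List[float]
) -> Tuple[List[float], List[float]]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a 1-D sequence.

    a, b and c each have length d_state. The state h keeps that length
    no matter how long xs is.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # Element-wise state update: decay the old state, mix in the new input.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Read out one output value from the compressed state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


D_STATE = 16  # the "SSM state dim" listed above; the values below are made up
a = [0.5 + 0.03 * i for i in range(D_STATE)]  # decay rates in (0, 1)
b = [1.0] * D_STATE
c = [1.0 / D_STATE] * D_STATE

ys_short, h_short = ssm_scan([1.0] * 8, a, b, c)
ys_long, h_long = ssm_scan([1.0] * 512, a, b, c)
print(len(h_short), len(h_long))  # prints: 16 16
```

The state has 16 entries after 8 steps and after 512 steps, which is the sense in which the SSM path compresses the sequence rather than storing it.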


Training: Three Stages

Stage A: Knowledge Substrate

Build world knowledge from zero. Pure language modeling: no instruction format, no loss masking. Text completion only.

Tokens: 5.79B
Steps: 50,000
Best val loss: 4.1684
Final val loss: 4.2214
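For readers who prefer perplexity, the validation losses above can be converted with a one-line calculation. This sketch assumes the reported loss is mean cross-entropy per token in nats, which the card does not state; under that assumption the Stage A values correspond to perplexities of roughly 65 and 68.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll_nats)


stage_a_best = perplexity(4.1684)   # best val loss from the figures above
stage_a_final = perplexity(4.2214)  # final val loss from the figures above
print(round(stage_a_best, 1), round(stage_a_final, 1))
```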

Data:

  • Wikipedia
  • FineWeb-Edu
  • Cosmopedia
  • Project Gutenberg
  • ArXiv abstracts
  • Orca Math
  • CodeParrot Python
  • Stack v1 Python

Stage B: Instruction Fine-Tuning

Teach the model to respond rather than continue. Loss is masked to assistant tokens only, so the model learns to answer, not to repeat the question.

Tokens: 300M
Steps: 20,000
Learning rate: 5e-5
Best val loss: 3.6757
Loss: masked, assistant tokens only
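The masking described above can be sketched in a few lines. This is an illustration of the general technique, not the TCF-1 training code: the function name `masked_nll`, the role labels, and the toy probabilities are all assumptions made for the example.

```python
import math
from typing import Sequence


def masked_nll(token_logprobs: Sequence[float], roles: Sequence[str]) -> float:
    """Mean negative log-likelihood computed over assistant tokens only.

    token_logprobs[i] is the log-probability the model gave the correct
    token at position i; roles[i] says who produced that token.
    """
    mask = [1.0 if role == "assistant" else 0.0 for role in roles]
    n_kept = sum(mask)
    if n_kept == 0:
        return 0.0  # nothing to learn from in this sequence
    return -sum(lp * m for lp, m in zip(token_logprobs, mask)) / n_kept


# Toy sequence: 3 prompt tokens the model predicts badly, 2 answer tokens.
roles = ["human"] * 3 + ["assistant"] * 2
logprobs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2

loss = masked_nll(logprobs, roles)          # only the 2 assistant tokens count
unmasked = -sum(logprobs) / len(logprobs)   # what Stage A style training would use
print(round(loss, 4), round(unmasked, 4))
```

Because prompt tokens contribute nothing to the masked loss, a masked Stage B value is not directly comparable to an unmasked Stage A value.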

Data:

  • OpenAssistant (oasst2)
  • SlimOrca
  • Hermes3
  • Tulu3
  • GSM8K
  • MetaMath

Stage A Output Examples

Prompt: "The nature of consciousness is"
Checkpoint: step 8,250. Temperature: 0.8

Three runs, same prompt, same weights:

Run 1:

instantiated. A distinct form of the same domain of intelligence in the context of a systematic approach and is not related to the two of the world's most prevalent and prominent. However, the problems on the links of the computer. The scientific will have the potential for myths. The formation of the modern art is that the scientific and contemporary art of art museum was their focus on the use of human architecture in the 1960s. Östei Thumn, with the Tromszkala, in the advisory process from its inception, was the earliest examples of the text, and in the Demodu Papers of the reduction inatic language...

Run 2:

the most prevalent figure in the world. This is the most commonly used organization that is responsible for the formation of the different species... In the Peerage of the Palas (Ming) it is also known as the Apostles of the Baractic Empire, which has a number of reasons, who decide; the hymns and experiences of the first Baron in the early 20th century...

Run 3:

a symbol of the subject of Greek genocide and the denomination in the world... The holy civilization is being considered an unacceptable marker... a large denomination is a state of Islam. The Prophet is the "Call of the revelation of God" sometimes referred to as "The right of the Divine Church" as the King's father, the god, as you prodigised and the world's gait...

Three runs. Three universes. The same weights, walked differently each time. This is what 5.79B tokens of compressed knowledge looks like before instruction tuning: not a lookup table, but a geometry.

Stage B Output Examples

Settings: temp=0.75, top-p=0.95, block-size=512
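For context on these settings, here is a minimal sketch of temperature plus top-p (nucleus) sampling over a single step's logits. It shows the general technique with the values above as defaults; it is not taken from chat_stage_b.py, and the function name `sample_top_p` and the toy logits are assumptions.

```python
import math
import random
from typing import List, Optional


def sample_top_p(
    logits: List[float],
    temperature: float = 0.75,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token id using temperature scaling and nucleus filtering."""
    rng = rng or random.Random()

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens in proportion to their probability.
    threshold = rng.random() * mass
    running = 0.0
    for i in kept:
        running += probs[i]
        if threshold < running:
            return i
    return kept[-1]  # guard against floating-point rounding


token_id = sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0))
print(token_id)
```

Lower temperature sharpens the distribution and lower top-p trims more of the tail, so both settings trade variety for predictability.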

Stage B knows it's in a conversation. It holds the Assistant role, responds in first person, stays in format. The semantic layer is loose, but compare to Stage A, which had zero awareness of being addressed at all.

Human: how are you feeling?
Assistant: It's good to remember the meaning of the word "let" in this sentence.
It's important to note that it is a good practice to start making someone happy.
It's important to understand the meaning of the situation. It helps you feel
good and focused.

H: good
Assistant: I'm afraid I can't believe it. It's important to remember that
you're doing well. I'm thinking about being kind in good terms and getting
to know. It's okay to take care of yourself.

Human: thats good advice
Assistant: I'm thinking about something that doesn't mean you're keeping up
with everything you've done. It's not just good, but it's important to help
others to keep your mind safe.

Structure without substance. Stage A had substance without structure. Stage C is where both converge.


Why Hybrid

Standard Transformers hold context through attention: cost quadratic in sequence length, finite window. SSMs compress sequences into a fixed-size hidden state: cost linear in sequence length, longer memory but less precise retrieval.

TCF-1 runs both in parallel at every layer. The attention handles precise contextual lookup. The SSM handles long-range temporal compression. Neither replaces the other.
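The scaling difference can be made concrete with the numbers from the Architecture section. The sketch below counts per-layer entries for a batch of one; it assumes d_inner = expansion × embedding dim (the usual Mamba convention, not confirmed by this card) and ignores causal masking and fused kernels that avoid materializing the full score matrix. Both function names are illustrative.

```python
def attention_score_entries(seq_len: int, n_heads: int) -> int:
    """Entries in one layer's attention score matrices (one T x T per head)."""
    return n_heads * seq_len * seq_len


def ssm_state_entries(d_model: int, expansion: int, d_state: int) -> int:
    """Entries in one layer's SSM hidden state, assuming d_inner = expansion * d_model."""
    return d_model * expansion * d_state


for seq_len in (512, 2048):
    attn = attention_score_entries(seq_len, n_heads=16)
    ssm = ssm_state_entries(d_model=1024, expansion=2, d_state=16)
    print(seq_len, attn, ssm)
```

Going from 512 to 2,048 tokens multiplies the attention count by 16 while the SSM state stays at 32,768 entries.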

At 494.9M parameters trained from scratch on a single Apple Silicon machine, TCF-1 shows that a hybrid architecture can be built and trained outside of large-scale infrastructure.


Files

  • stage_b_step15250_best.npz: weights (1.98 GB)
  • leeknet_500m.py: architecture
  • chat_stage_b.py: chat interface
  • tokenizer/leek_bpe_32k.model: SentencePiece model
  • tokenizer/leek_bpe_32k.vocab: vocabulary
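A 1.98 GB file is consistent with 494.9M parameters stored at 4 bytes each, though the card does not state the dtype. Assuming the weights file is a standard .npz archive (a zip of .npy arrays), its contents can be listed without loading anything into memory. The helper name `list_npz_arrays` and the demo array name are hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, List, Tuple, Union


def list_npz_arrays(source: Union[str, BinaryIO]) -> List[Tuple[str, int]]:
    """Return (array name, uncompressed bytes) for each entry in an .npz archive."""
    entries = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            name = info.filename
            if name.endswith(".npy"):
                name = name[:-4]  # drop the .npy suffix added when the archive was saved
            entries.append((name, info.file_size))
    return entries


# Demo on a tiny in-memory archive; the entry name and bytes are placeholders.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("embed.weight.npy", b"\x00" * 64)
demo.seek(0)
demo_entries = list_npz_arrays(demo)
print(demo_entries)  # prints: [('embed.weight', 64)]
```

Calling `list_npz_arrays("stage_b_step15250_best.npz")` on the downloaded file should give one entry per weight array. Each size includes a small .npy header, so the total slightly exceeds the raw parameter bytes.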

Built from scratch. No borrowed weights.
Tree, 2026
