---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
tags:
- mamba
- ssm
- hybrid
- transformer
- from-scratch
- custom-architecture
- mlx
- apple-silicon
---

# TCF-1: Temporal Compression Field

**Architecture:** Hybrid Transformer + Mamba SSM  
**Parameters:** 494.9M  
**Built by:** Tree-ForgedMind  
**Hardware:** Apple Mac Studio (128GB unified memory)  
**Framework:** MLX  

---

## What TCF-1 Is

TCF-1 (Temporal Compression Field) is a from-scratch hybrid language model architecture. No pre-trained weights. No borrowed base model. Custom tokenizer, custom architecture, trained from zero.

The core idea: pair causal self-attention with a Mamba State Space Model in every layer. Two simultaneous memory mechanisms, attention's contextual lookup and the SSM's compressed temporal signal, running together through 12 hybrid layers.

---

## Architecture

| Component | Value |
|-----------|-------|
| Parameters | 494.9M |
| Vocabulary | 32,000 (SentencePiece BPE) |
| Embedding dim | 1,024 |
| Attention heads | 16 |
| Hybrid pairs | 12 |
| Block size | 2,048 (trained at 512) |
| SSM state dim | 16 |
| SSM expansion | 2 |

Each of the 12 **HybridPair** layers contains:

1. **AttentionBlock**: causal self-attention
2. **MambaBlock**: selective state space model with persistent hidden state
3. **FeedForward**: GELU-activated projection

The Mamba hidden state `(B, d_inner, d_state)` carries temporal information across the sequence independently of attention. The model has two paths through every layer: one that looks across the sequence (attention) and one that compresses time (SSM).
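
As a rough illustration of what a persistent hidden state means here, the sketch below runs a diagonal selective-SSM recurrence in plain Python for a single sequence, so the batch dimension `B` is dropped. It is not taken from `leeknet_500m.py`: the function name `ssm_scan`, the discretisation `h = exp(delta * A) * h + delta * B * x`, and all toy values are assumptions based on the commonly described Mamba formulation.

```python
import math

def ssm_scan(x, A, B, C, delta, h=None):
    """Run a diagonal state space recurrence over one sequence (hypothetical helper).

    x:     list of T inputs, each a list of d_inner floats
    A:     d_inner x d_state matrix of negative decay rates
    B, C:  lists of T vectors of d_state floats (input-dependent in a selective SSM)
    delta: list of T lists of d_inner positive step sizes
    h:     optional carried-over state, d_inner x d_state
    """
    d_inner, d_state = len(A), len(A[0])
    # Copy the incoming state so the caller's list is not mutated.
    h = [row[:] for row in h] if h is not None else [[0.0] * d_state for _ in range(d_inner)]
    ys = []
    for t in range(len(x)):
        y_t = []
        for i in range(d_inner):
            for n in range(d_state):
                decay = math.exp(delta[t][i] * A[i][n])  # in (0, 1) because A < 0
                h[i][n] = decay * h[i][n] + delta[t][i] * B[t][n] * x[t][i]
            y_t.append(sum(C[t][n] * h[i][n] for n in range(d_state)))
        ys.append(y_t)
    return ys, h

# Toy sizes only; the table above lists d_state = 16 for the real model.
T, d_inner, d_state = 6, 2, 4
x = [[0.1 * (t + 1), -0.2 * (t + 1)] for t in range(T)]
A = [[-(n + 1.0) for n in range(d_state)] for _ in range(d_inner)]
B = [[0.5] * d_state for _ in range(T)]
C = [[1.0] * d_state for _ in range(T)]
delta = [[0.1] * d_inner for _ in range(T)]

ys_full, h_full = ssm_scan(x, A, B, C, delta)

# Process the same sequence in two chunks, carrying the state across the split.
ys_a, h_mid = ssm_scan(x[:3], A, B[:3], C[:3], delta[:3])
ys_b, h_end = ssm_scan(x[3:], A, B[3:], C[3:], delta[3:], h=h_mid)
```

The state keeps the same `d_inner x d_state` shape however long the sequence is, and carrying it across the split reproduces the single-pass outputs.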

---

## Training: Three Stages

### Stage A: Knowledge Substrate

Build world knowledge from zero. Pure language modeling: no instruction format, no masking. Text completion only.

| Metric | Value |
|--------|-------|
| Tokens | 5.79B |
| Steps | 50,000 |
| Best val loss | 4.1684 |
| Final val loss | 4.2214 |

**Data:**
- Wikipedia
- FineWeb-Edu
- Cosmopedia
- Project Gutenberg
- ArXiv abstracts
- Orca Math
- CodeParrot Python
- Stack v1 Python

---

### Stage B: Instruction Fine-Tuning

Teach the model to respond rather than continue. Loss is masked to assistant tokens only, so the model learns to answer, not to repeat the question.
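
A minimal sketch of loss masking in plain Python may help make this concrete. It assumes a per-token 0/1 mask marking assistant positions; the helper name `masked_cross_entropy` and the toy logits are illustrative and are not taken from the actual training script.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean next-token loss over positions where mask is 1 (hypothetical helper).

    logits:  list of T lists of vocab-size floats
    targets: list of T target token ids
    mask:    list of T ints, 1 for assistant tokens, 0 for prompt tokens
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, mask):
        if not keep:
            continue  # prompt tokens contribute nothing to the loss
        m = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0

logits = [
    [0.0, 0.0, 0.0],  # prompt position, masked out
    [2.0, 0.0, 0.0],  # assistant position
    [0.0, 0.0, 0.0],  # assistant position
]
targets = [0, 1, 2]
mask = [0, 1, 1]
loss = masked_cross_entropy(logits, targets, mask)
print(round(loss, 4))
```

Because the average is taken over assistant positions only, changing the logits at a masked prompt position leaves the loss unchanged.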

| Metric | Value |
|--------|-------|
| Tokens | 300M |
| Steps | 20,000 |
| LR | 5e-5 |
| Best val loss | 3.6757 |
| Loss | Masked (assistant tokens only) |

**Data:**
- OpenAssistant (oasst2)
- SlimOrca
- Hermes3
- Tulu3
- GSM8K
- MetaMath

---

## Stage A Output Examples

**Prompt:** `"The nature of consciousness is"`  
**Checkpoint:** step 8,250; temperature: 0.8

Three runs, same prompt, same weights:

**Run 1:**
> instantiated. A distinct form of the same domain of intelligence in the context of a systematic approach and is not related to the two of the world's most prevalent and prominent. However, the problems on the links of the computer. The scientific will have the potential for myths. The formation of the modern art is that the scientific and contemporary art of art museum was their focus on the use of human architecture in the 1960s. Östei Thumn, with the Tromszkala, in the advisory process from its inception, was the earliest examples of the text, and in the Demodu Papers of the reduction inatic language...

**Run 2:**
> the most prevalent figure in the world. This is the most commonly used organization that is responsible for the formation of the different species... In the Peerage of the Palas (Ming) it is also known as the Apostles of the Baractic Empire, which has a number of reasons, who decide; the hymns and experiences of the first Baron in the early 20th century...

**Run 3:**
> a symbol of the subject of Greek genocide and the denomination in the world... The holy civilization is being considered an unacceptable marker... a large denomination is a state of Islam. The Prophet is the "Call of the revelation of God" sometimes referred to as "The right of the Divine Church" as the King's father, the god, as you prodigised and the world's gait...

Three runs. Three universes. The same weights, walked differently each time. This is what 5.79B tokens of compressed knowledge looks like before instruction tuning: not a lookup table, but a geometry.

## Stage B Output Examples

**Settings:** temp=0.75, top-p=0.95, block-size=512
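
For readers unfamiliar with these settings, here is a rough sketch of temperature scaling followed by nucleus (top-p) sampling in plain Python. It is an assumption about how such sampling is typically done, not the code in `chat_stage_b.py`; the function name `sample_top_p` and the toy logits are illustrative.

```python
import math
import random

def sample_top_p(logits, temperature=0.75, top_p=0.95, rng=random):
    """Sample one token id with temperature scaling and top-p filtering (hypothetical helper)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, renormalised by their total mass.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.5, -3.0]  # toy scores for a 4-token vocabulary
for seed in (1, 2, 3):
    rng = random.Random(seed)
    print(seed, [sample_top_p(logits, rng=rng) for _ in range(8)])
```

Different seeds walk the same distribution differently, which is why repeated runs on identical weights diverge. With these toy logits the last token falls outside the 0.95 nucleus and is never drawn.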

Stage B knows it's in a conversation. It holds the Assistant role, responds in first person, stays in format. The semantic layer is loose, but compare it to Stage A, which had zero awareness of being addressed at all.

```
Human: how are you feeling?
Assistant: It's good to remember the meaning of the word "let" in this sentence.
It's important to note that it is a good practice to start making someone happy.
It's important to understand the meaning of the situation. It helps you feel
good and focused.
Human: good
Assistant: I'm afraid I can't believe it. It's important to remember that
you're doing well. I'm thinking about being kind in good terms and getting
to know. It's okay to take care of yourself.

Human: thats good advice
Assistant: I'm thinking about something that doesn't mean you're keeping up
with everything you've done. It's not just good, but it's important to help
others to keep your mind safe.
```

Structure without substance. Stage A had substance without structure. Stage C is where both converge.

---

## Why Hybrid

Standard Transformers hold context through attention: cost quadratic in sequence length, finite window. SSMs compress sequences into a fixed hidden state: cost linear in sequence length, longer memory but less precise retrieval.
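
The scaling difference can be sketched with a back-of-the-envelope count using the values from the architecture table (16 heads, embedding dim 1,024, SSM expansion 2, state dim 16). The relation `d_inner = expansion * embedding dim` is an assumption borrowed from the usual Mamba convention, and both function names are illustrative.

```python
def attention_score_entries(seq_len, n_heads=16):
    """Query-key score entries per layer for one sequence: grows with seq_len squared."""
    return n_heads * seq_len * seq_len

def ssm_state_entries(d_model=1024, expand=2, d_state=16):
    """SSM hidden state entries per layer for one sequence: independent of seq_len."""
    d_inner = expand * d_model  # assumption: standard Mamba inner width
    return d_inner * d_state

attn_512 = attention_score_entries(512)
attn_2048 = attention_score_entries(2048)
state = ssm_state_entries()

print(attn_512)   # 4194304
print(attn_2048)  # 67108864
print(state)      # 32768
```

Going from the 512-token training length to the 2,048-token block size multiplies the attention score count by 16, while the SSM state stays the same size.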

TCF-1 runs both in parallel at every layer. The attention handles precise contextual lookup. The SSM handles long-range temporal compression. Neither replaces the other.

At 494.9M parameters trained from scratch on a single Apple Silicon machine, TCF-1 is proof that the hybrid architecture is accessible outside of large-scale infrastructure.

---

## Files

| File | Description |
|------|-------------|
| `stage_b_step15250_best.npz` | 1.98 GB weights |
| `leeknet_500m.py` | Architecture |
| `chat_stage_b.py` | Chat interface |
| `tokenizer/leek_bpe_32k.model` | SentencePiece model |
| `tokenizer/leek_bpe_32k.vocab` | Vocabulary |

---

*Built from scratch. No borrowed weights.*  
*Tree, 2026*