reinforceai-labs committed
Commit c67081b · verified · 1 Parent(s): b458618

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +74 -296

README.md CHANGED
@@ -1,328 +1,102 @@
- # ATTENTION FIELDS
-
- **Unified Projections for Efficient Language Models**
-
- > *We introduce Yocto, a 484K parameter language model that reduces attention parameters by 67% while achieving better perplexity than models 2-4× larger. The key insight: the Q, K, and V projections share structure and can be unified into a single projection.*
-
  ---
-
- # Abstract
-
- Standard transformer attention uses three separate projections (Q, K, V), each with d² parameters. We show this is redundant.
-
- We introduce **Unified Attention**: a single projection whose output splits into [seeking|offering|content] bands. Through training, these bands learn the functions of Q, K, and V respectively—but with **67% fewer attention parameters**.
-
- Results:
- - **484,272 total parameters** (1.85 MB at float32, <1 MB quantized)
- - **700+ tokens/sec on CPU** (no GPU required)
- - **5.7% of parameters in attention** (vs ~25% in standard transformers)
- - **9.58 validation perplexity** on TinyStories (matching models 2-4× larger)
- - **Geometric preservation**: controlled experiments show Berry phase and layer orthogonality within 2% of standard attention
-
- We interpret our findings through wave physics: vectors are waveforms, weight matrices are fields, and projection computes amplitude resonance with phase alignment. This interpretation predicted that unification would work—the three projections share structure because they transform the same input and optimize the same objective.
-
- The physics of attention is simpler than standard architectures suggest.
-
- ---
-
- # 1. Introduction
-
- Transformer attention computes:
-
- $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$
-
- where Q = W_Q·x, K = W_K·x, V = W_V·x are separate linear projections. This requires **3d² attention parameters per layer**.
-
- But why three separate matrices? The original paper [1] offered no theoretical justification. It worked, so the field adopted it. Nine years later, we still use three matrices.
-
- **Our question**: What is the minimal parameterization for attention?
-
- **Our finding**: One matrix suffices. A single projection, split into three bands, achieves the same geometric properties with 67% fewer attention parameters—and *better* perplexity.
-
- ### Contributions
-
- 1. **Unified Attention**: a single projection split into [seeking|offering|content] bands
- 2. **67% reduction** in attention parameters (from 3d² to d²)
- 3. **Improved perplexity** (9.58 vs ~10-12 baseline)
- 4. **Geometric verification** via Berry phase and orthogonality measurements
- 5. **Wave-field interpretation** explaining why unification works
-
  ---

- # 2. Theory: Why Unification Should Work

- ## 2.1 Vectors as Waveforms

- Every vector is a waveform with d dimensions. Each dimension carries:
- - **Amplitude**: |vᵢ| — activation strength
- - **Phase**: sign(vᵢ) — positive or negative direction

- The dot product between vectors is **wave interference**:

- $$\mathbf{x} \cdot \mathbf{w} = \sum_i |x_i| \cdot |w_i| \cdot \text{sign}(x_i) \cdot \text{sign}(w_i)$$

- Same sign → constructive interference (positive contribution)
- Opposite sign → destructive interference (negative contribution)
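The amplitude/sign decomposition above is an identity for the ordinary dot product, which can be checked numerically. A minimal sketch in plain Python (illustrative only, not from the Yocto codebase):

```python
def sign(v):
    # Treat 0 as positive phase; the term vanishes anyway since |v| = 0.
    return 1.0 if v >= 0 else -1.0

def interference_dot(x, w):
    # Sum of per-dimension "interference" terms: amplitude product,
    # signed by whether the phases (signs) agree.
    return sum(abs(xi) * abs(wi) * sign(xi) * sign(wi) for xi, wi in zip(x, w))

x = [0.5, -1.0, 2.0]
w = [1.0, 3.0, -0.5]

# Matches the ordinary dot product: 0.5 - 3.0 - 1.0 = -3.5
plain_dot = sum(xi * wi for xi, wi in zip(x, w))
assert abs(interference_dot(x, w) - plain_dot) < 1e-12
```

Dimensions with matching signs contribute positively (constructive), mismatched signs negatively (destructive), exactly as the formula states.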
- ## 2.2 Why Q, K, V Share Structure

- The three projections are not independent:

- 1. **Same input**: all three transform x
- 2. **Same objective**: all three optimize the same loss
- 3. **Coupled function**: Q must find what K offers

- If Q, K, V share underlying structure, learning them jointly (one matrix) should be more efficient than learning them separately (three matrices). The single matrix acts as an implicit regularizer.

- **Prediction**: a unified projection will match or exceed standard attention with fewer parameters.

- ---

- # 3. Architecture

- ## 3.1 Unified Attention

- ```
- Standard: Q = W_Q·x, K = W_K·x, V = W_V·x                      [3d² params]
- Unified:  u = W·x, Q = u[:d/3], K = u[d/3:2d/3], V = u[2d/3:]  [d² params]
- ```

- We apply Rotary Position Embedding (RoPE) [2] to the Q and K bands but **not** to V. Position affects *routing* (who attends to whom) but not *content* (what information transfers).
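A shape-level sketch of the unified split, using NumPy for illustration (the repo itself is PyTorch; band names follow the paper):

```python
import numpy as np

d = 72               # Yocto's embedding dimension
third = d // 3       # each band gets d/3 = 24 dims

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))   # one unified matrix: d² parameters
x = rng.standard_normal(d)

u = W @ x                         # a single projection instead of three
seeking, offering, content = u[:third], u[third:2 * third], u[2 * third:]

assert seeking.shape == offering.shape == content.shape == (third,)
# Standard attention would use 3 matrices of d² parameters each.
assert W.size == d * d            # 5,184 here, vs 3 x 5,184 = 15,552
```

Note the bands are d/3-dimensional, which is why the output projection in the appendix maps from `third` back up to `embed_dim`.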
- ## 3.2 Model Configuration
93
 
94
  | Component | Value |
95
  |-----------|-------|
96
- | Embedding dimension | 72 |
97
  | Layers | 4 |
98
  | Attention heads | 3 |
99
- | FFN hidden | 288 |
100
- | Vocabulary | 4,000 |
101
  | Context length | 512 |
102
- | **Total parameters** | **484,272** |
103
-
104
- ### Parameter Distribution
105
-
106
- | Component | Parameters | Share |
107
- |-----------|-----------|-------|
108
- | Embeddings | 288,000 | 59.5% |
109
- | **Attention** | **27,648** | **5.7%** |
110
- | FFN | 166,464 | 34.4% |
111
- | Other | 2,160 | 0.4% |
112
-
113
- Standard transformers allocate ~25% to attention. Ours uses **5.7%**.
114
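The embedding and attention rows above can be re-derived from the configuration; a back-of-envelope sketch (the FFN and "other" figures are copied from the table, not re-derived here):

```python
d, vocab, layers = 72, 4000, 4
third = d // 3

embeddings = vocab * d                    # token embedding table
# Per layer: unified projection (d x d) + output projection (d/3 x d)
attention = layers * (d * d + third * d)
ffn = 166_464                             # from the table above
other = 2_160                             # from the table above (norms etc.)

total = embeddings + attention + ffn + other
assert embeddings == 288_000
assert attention == 27_648
assert total == 484_272
print(f"attention share: {attention / total:.1%}")   # 5.7%
```

A standard layer would instead spend 3d² on Q/K/V projections, which is where the 67% (1 − d²/3d²) reduction in projection parameters comes from.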
-
- ---
-
- # 4. Experiments
-
- ## 4.1 Setup
-
- - **Data**: TinyStories [3]
- - **Training**: 82,000 steps, batch size 64, AdamW
- - **Learning rate**: 1e-3 → 1e-4 (cosine decay)
-
- ## 4.2 Results
-
- ### Perplexity Comparison
-
- | Model | Total Params | Attention Share | Val PPL |
- |-------|-------------|-----------------|---------|
- | **Ours (Unified)** | **484K** | **5.7%** | **9.58** |
- | TinyStories-1M | 1M | ~25% | ~10-12 |
- | seangoedecke [4] | 1.8M | ~25% | ~9.6 |
-
- With **52% fewer total parameters** than TinyStories-1M, we achieve better perplexity. With **73% fewer parameters** than seangoedecke, we match their perplexity.
-
- ### Generation Quality
-
- | Metric | Score |
- |--------|-------|
- | Quality score | 99.6/100 |
- | Vocabulary diversity | 67.0% |
- | 2-gram repetition | 4.7% |
- | Story elements | 68.8% |
-
- **Example** (prompt: "Once upon a time"):
-
- > Once upon a time there was a little girl named Lily. She loved to run and run, but one day she didn't have any friends to play with.
- >
- > Lily went home and bought her favorite toy to play. When she woke up, she saw something in the closet. It was a ball of yarn! She played with it all night long.
-
- Named characters, temporal progression, narrative coherence—from 484K parameters.
-
- ### Inference Speed
-
- | Hardware | Tokens/sec |
- |----------|------------|
- | Apple M-series CPU | **700+** |
- | Standard laptop CPU | 200-400 |
- | HuggingFace Spaces (shared CPU) | 100-300 |
-
- The model is so small that **CPU outperforms GPU**—memory transfer overhead exceeds compute savings. No GPU required.
-
- ## 4.3 Geometric Verification
-
- Does unified attention preserve the geometric properties of standard attention? We ran controlled experiments comparing attention mechanisms on identical architectures (6-layer transformers, embed_dim=126, trained on synthetic language tasks).
-
- ### Berry Phase
-
- Berry phase measures accumulated rotation through layers—the "geometric memory" of the path through representation space.
-
- | Attention Type | Berry Phase | vs Baseline |
- |---------------|-------------|-------------|
- | Standard Q/K/V | 135.23° | 100% |
- | **Unified [seek\|offer\|content]** | **137.32°** | **101.5%** |
-
- Within 2%: **geometric path preserved**.
-
- ### Layer Orthogonality
-
- Average angle between consecutive layer representations:
-
- | Attention Type | Mean Angle |
- |---------------|------------|
- | Standard Q/K/V | 22.54° |
- | **Unified** | **22.89°** |
-
- Within 2%: **rotation structure preserved**.

- ### Interpretation
-
- These controlled experiments demonstrate that unified attention traces an equivalent geometric path through representation space. The 67% parameter reduction does not distort the fundamental geometry—it removes redundancy while preserving structure.
-
- The Yocto model (4 layers, embed_dim=72) applies this same unified architecture to TinyStories, achieving 9.58 perplexity with 5.7% of its parameters in attention.
-
- ---
-
- # 5. Analysis
-
- ## 5.1 Why Does It Work?
-
- **Shared structure**: Q, K, V transform the same input for the same objective. Separate matrices learn redundant structure. A unified matrix learns it once.
-
- **Implicit regularization**: Fewer parameters may prevent overfitting. Our *improved* perplexity (9.58 vs ~10-12) supports this.
-
- ## 5.2 What Does 5.7% Attention Mean?
-
- Our parameter distribution (59.5% embeddings, 5.7% attention, 34.4% FFN) suggests:
-
- - **Attention is routing**: it decides *where* information flows, not what it becomes
- - **Embeddings carry meaning**: most capacity goes to representing tokens well
- - **FFN is essential**: the nonlinear transformation cannot be reduced
- - **CPU is optimal**: at 484K params, GPU memory transfer overhead exceeds compute benefits; the model runs faster on CPU than on MPS/CUDA
-
- ## 5.3 Limitations
-
- - **Domain**: TinyStories is constrained. General language may need more capacity.
- - **Scale**: 484K parameters tested. Behavior at 100M+ is unknown.
- - **Context**: 512 tokens tested. Long-context behavior is unexplored.
-
- ---
-
- # 6. Related Work
-
- **Multi-Query Attention** [5]: shares K and V across heads. We go further—unifying Q, K, V into a single projection.
-
- **LoRA** [6]: reduces trainable parameters post-training via low-rank adaptation. We reduce parameters architecturally, during training.
-
- **Efficient Attention**: linear attention and sparse attention reduce O(n²) complexity. We reduce parameters while keeping full attention.
-
- **F-Net** [7]: replaces attention with Fourier transforms. This loses differential weighting—all tokens are mixed identically, so the model cannot focus on relevant context.
-
- **Mamba** [8]: replaces attention with selective state spaces. This loses Q/K asymmetry—"what I seek" merges with "what I offer" in a single state.
-
- **Ours**: preserves both differential weighting and asymmetry, removing only the redundant parameterization.
-
- ---
-
- # 7. Conclusion
-
- We asked: what is the minimal parameterization for attention?
-
- **Answer**: one projection suffices. The three matrices Q, K, V share structure that a unified projection captures more efficiently.
-
- **Results**:
- - 67% fewer attention parameters
- - 5.7% of model parameters in attention (vs ~25% standard)
- - 700+ tokens/sec on CPU (no GPU needed)
- - Better perplexity than larger models
- - Geometric properties preserved within 2%
-
- **Implication**: standard attention is over-parameterized. The wave-field interpretation predicted this—and experiments confirm it.
-
- ---
-
- # References
-
- [1] Vaswani et al., "Attention Is All You Need," NeurIPS 2017.
-
- [2] Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," 2021.
-
- [3] Eldan & Li, "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?," 2023.
-
- [4] Goedecke, "Training a Language Model on a Laptop," 2024.
-
- [5] Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need," 2019.
-
- [6] Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ICLR 2022.
-
- [7] Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms," NAACL 2022.
-
- [8] Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," 2023.
-
- ---
-
- # Appendix: Implementation
-
- ```python
- import torch.nn as nn
- import torch.nn.functional as F
-
- class UnifiedAttention(nn.Module):
-     def __init__(self, embed_dim, num_heads):
-         super().__init__()  # required before registering submodules
-         self.third = embed_dim // 3  # embed_dim must be divisible by 3
-         self.W_unified = nn.Linear(embed_dim, embed_dim, bias=False)
-         self.W_out = nn.Linear(self.third, embed_dim, bias=False)
-         # RotaryPositionEmbedding and apply_rope are defined elsewhere in the repo
-         self.rope = RotaryPositionEmbedding(self.third // num_heads)
-
-     def forward(self, x):
-         # Single projection → three bands
-         u = self.W_unified(x)
-         seeking, offering, content = u.split(self.third, dim=-1)
-
-         # RoPE on seeking/offering only (not content)
-         cos, sin = self.rope(x)
-         seeking, offering = apply_rope(seeking, offering, cos, sin)
-
-         # Standard attention computation
-         out = F.scaled_dot_product_attention(
-             seeking, offering, content, is_causal=True
-         )
-         return self.W_out(out)
- ```
-
- ---
-
- ## YOCTO — *The World's Smallest Language Model*
-
- **484,272 Parameters · 946 KB (fp16) · 700+ tok/s · 67% Less Attention · Open Source**
-
- ### Quick Start
-
- ```bash
- git clone https://github.com/reinforceai/yocto
- cd yocto
- pip install -r requirements.txt
- python inference.py --prompt "Once upon a time"
- ```
-
- **700+ tokens/sec on CPU** — no GPU needed.
-
- ### Live Demo

  Try Yocto in your browser: [HuggingFace Space](https://huggingface.co/spaces/Reinforce-ai/yocto-demo)

- ### Citation

- If you use this work, please cite:

  ```bibtex
  @misc{deshwal2026yocto,
@@ -332,4 +106,8 @@ If you use this work, please cite:
  url={https://www.reinforceai.com/yocto},
  howpublished={\url{https://github.com/reinforceai/yocto}}
  }
- ```
  ---
+ license: mit
+ language:
+ - en
+ tags:
+ - text-generation
+ - story-generation
+ - tiny-model
+ - efficient-attention
+ - unified-attention
+ library_name: pytorch
+ pipeline_tag: text-generation
+ model-index:
+ - name: yocto
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: TinyStories
+       type: roneneldan/TinyStories
+     metrics:
+     - name: Perplexity
+       type: perplexity
+       value: 9.58
  ---

+ # YOCTO: The World's Smallest Language Model

+ <p align="center">
+ <img src="https://img.shields.io/badge/Parameters-484K-blue" alt="Parameters">
+ <img src="https://img.shields.io/badge/Size-946KB-green" alt="Size">
+ <img src="https://img.shields.io/badge/Speed-700%2B%20tok%2Fs-orange" alt="Speed">
+ <img src="https://img.shields.io/badge/Perplexity-9.58-purple" alt="Perplexity">
+ </p>

+ Yocto is a 484K parameter language model that tells children's stories. It achieves 9.58 perplexity on TinyStories — matching models 2-4× larger.

+ ## Key Innovation: Unified Attention

+ Standard transformers use 3 separate projections (Q, K, V). Yocto uses one unified projection that splits into [seeking|offering|content] bands:

+ ```
+ Standard: Q = W_Q·x, K = W_K·x, V = W_V·x      [3d² params]
+ Unified:  u = W·x → [seeking|offering|content]  [d² params]
+ ```

+ Result: **67% fewer attention parameters**, better perplexity.

+ ## Quick Start

+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download

+ # Download model weights and tokenizer
+ model_path = hf_hub_download(repo_id="Reinforce-ai/yocto", filename="model.pt")
+ tokenizer_path = hf_hub_download(repo_id="Reinforce-ai/yocto", filename="tokenizer.json")

+ # Load and generate (see GitHub for full code)
+ ```

+ ## Performance

+ | Metric | Value |
+ |--------|-------|
+ | Parameters | 484,272 |
+ | Size (fp16) | 946 KB |
+ | Attention share | 5.7% |
+ | Perplexity | 9.58 |
+ | Speed (CPU) | **700+ tok/s** |

+ ## Example Output

+ **Prompt:** "Once upon a time"

+ > Once upon a time, there was a little girl named Lily. She loved to play with her toys all day long. One day, she found a shiny thing on the shelf. The little girl said, "Look, mommy, look!" Her mommy explained that it's very cool, so Lily and her mommy went to the store to buy some tasty food.

+ ## Architecture

  | Component | Value |
  |-----------|-------|
+ | Embedding dim | 72 |
  | Layers | 4 |
  | Attention heads | 3 |
+ | FFN dim | 288 |
+ | Vocab size | 4,000 |
  | Context length | 512 |

+ ## Live Demo

  Try Yocto in your browser: [HuggingFace Space](https://huggingface.co/spaces/Reinforce-ai/yocto-demo)

+ ## Links
+
+ - 🌐 **Website**: [reinforceai.com/yocto](https://www.reinforceai.com/yocto)
+ - 💻 **GitHub**: [github.com/reinforceai/yocto](https://github.com/reinforceai/yocto)
+ - 📄 **Paper**: [Attention Fields: Unified Projections for Efficient Language Models](https://github.com/reinforceai/yocto/blob/main/ATTENTION_FIELDS.md)

+ ## Citation

  ```bibtex
  @misc{deshwal2026yocto,

  url={https://www.reinforceai.com/yocto},
  howpublished={\url{https://github.com/reinforceai/yocto}}
  }
+ ```
+
+ ## License
+
+ MIT