LoganResearch commited on
Commit
4b937cf
·
verified ·
1 Parent(s): aef02f9

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +327 -133
README.md CHANGED
@@ -1,180 +1,374 @@
1
- # Lie-Holonomy Transformer (LHT)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2
 
3
- A PyTorch implementation of the gauge-theoretic reasoning architecture from "Beyond Holonomy: Lie-Algebraic Symbol Emergence and the Homotopy Type Structure of Neural Reasoning."
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
 
5
- ## Core Ideas
6
 
7
- This architecture treats **reasoning as geometry**:
8
 
9
- | Concept | Mathematical Structure | Implementation |
10
- |---------|----------------------|----------------|
11
- | Propositions | Manifold M | Embedding space |
12
- | Inference | Parallel transport | Gauge-covariant attention |
13
- | Consistency | Holonomy = Identity | Holonomy loss |
14
- | Symbols | Lie algebra generators | Generator network |
15
- | Proof equivalence | Homotopy | Layer depth |
16
 
17
- ## Architecture Overview
18
 
19
- ```
20
- Input tokens
21
-
22
-
23
- ┌─────────────────────────────────────┐
24
- │ Token Embedding (Proposition M) │
25
- │ + Position Embedding │
26
- │ + Fiber Initialization (gauge) │
27
- └─────────────────────────────────────┘
28
-
29
-
30
- ┌─────────────────────────────────────┐
31
- │ LHT Layer (× n_layers) │
32
- │ ┌─────────────────────────────┐ │
33
- │ │ Connection Network A(x) │ │ ← Learns gauge connection
34
- │ │ Parallel Transport Γ_{j→i} │ │ ← Transports fiber elements
35
- │ │ Gauge-Covariant Attention │ │ ← Modified self-attention
36
- │ │ Lie Algebra Generator │ │ ← Generates inference ops
37
- │ │ Generator Application │ │ ← Applies exp(X) to fiber
38
- │ └─────────────────────────────┘ │
39
- └─────────────────────────────────────┘
40
-
41
-
42
- ┌─────────────────────────────────────┐
43
- │ Output: logits + geometric losses │
44
- └─────────────────────────────────────┘
45
- ```
46
 
47
- ## Key Components
 
 
 
 
 
48
 
49
- ### 1. Connection Network
50
- Learns the gauge connection ω that defines how to parallel transport inferential states:
51
- ```python
52
- A_μ(x) ∈ gl(k,ℝ) # Lie algebra valued 1-form
53
- ```
54
 
55
- ### 2. Parallel Transport
56
- Computes transport operators between positions:
57
- ```python
58
- Γ_{j→i} = exp(-A_μ(x_j)(x_i - x_j)^μ)
59
- ```
60
 
61
- ### 3. Gauge-Covariant Attention
62
- Standard attention with parallel transport of values:
63
- ```python
64
- # Standard: Attn(Q,K,V)_i = Σ_j α_ij V_j
65
- # Gauge: GaugeAttn_i = Σ_j α_ij Γ_{j→i}(V_j)
66
- ```
67
 
68
- ### 4. Holonomy Loss
69
- Enforces reasoning consistency by requiring closed loops to return to identity:
70
- ```python
71
- L_hol = E[||Hol_γ - I||²_F]
72
- ```
 
 
 
 
 
 
 
73
 
74
- ### 5. Curvature Regularization
75
- Encourages flat reasoning spaces where order doesn't matter:
76
  ```python
77
- L_curv = E[||F(x)||²_F] where F = dω + ω∧ω
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
  ```
79
 
80
- ## Installation
 
 
 
 
 
 
 
 
 
 
81
 
82
  ```bash
83
- pip install torch
84
  ```
85
 
86
- ## Usage
87
 
88
- ### Basic
89
  ```python
90
- from lht import LieHolonomyTransformer, LHTConfig
91
-
92
- # Create model
93
- config = LHTConfig(
94
- vocab_size=32000,
95
- d_model=512,
96
- d_fiber=64,
97
- n_heads=8,
98
- n_layers=6,
99
- lie_algebra_rank=8,
100
  )
101
- model = LieHolonomyTransformer(config)
102
 
103
- # Forward pass
104
- output = model(
105
- input_ids=tokens,
106
- labels=labels,
107
- return_geometric_losses=True
 
 
108
  )
109
 
110
- # Get losses
111
- lm_loss = output['lm_loss']
112
- holonomy_loss = output['holonomy_loss']
113
- curvature_loss = output['curvature_loss']
114
- total_loss = model.get_total_loss(output)
115
  ```
116
 
117
- ### Training with Geometric Loss Annealing
 
118
  ```python
119
- from lht import LHTTrainer
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
120
 
121
- trainer = LHTTrainer(model, optimizer, config)
122
 
123
- for batch in dataloader:
124
- metrics = trainer.train_step(batch)
125
- # Early training: high curvature loss → flat representations
126
- # Mid training: high holonomy loss → consistency
127
- # Late training: high waypoint loss → discrete structure
128
- ```
129
 
130
- ### Waypoint Detection
131
- ```python
132
- from lht import WaypointDetector
 
 
133
 
134
- detector = WaypointDetector(config, n_waypoints=32)
135
- waypoint_ids, stability = detector(representations)
136
- ```
137
 
138
- ## Configuration
139
 
140
- | Parameter | Description | Default |
141
- |-----------|-------------|---------|
142
- | `d_model` | Proposition manifold dimension | 512 |
143
- | `d_fiber` | Fiber (gauge) dimension | 64 |
144
- | `lie_algebra_rank` | k for GL(k,ℝ) structure group | 8 |
145
- | `lambda_holonomy` | Weight for holonomy loss | 0.1 |
146
- | `lambda_curvature` | Weight for curvature loss | 0.01 |
147
- | `lambda_waypoint` | Weight for waypoint stability | 0.05 |
148
 
149
- ## Theoretical Predictions
 
 
150
 
151
- The framework makes testable predictions:
152
 
153
- 1. **Chain-of-thought benefit correlates with curvature** - High-curvature domains (causal reasoning) benefit more from CoT than low-curvature domains (arithmetic)
 
 
 
 
 
 
 
 
 
154
 
155
- 2. **Waypoints emerge spontaneously** - Training with holonomy loss should cause discrete symbol-like structures to form at flat loci
156
 
157
- 3. **Holonomy predicts errors** - Incorrect reasoning paths should have higher holonomy magnitude
 
 
 
 
 
158
 
159
- 4. **Compositional generalization improves** - Holonomy constraints force consistent composition
160
 
161
- ## File Structure
162
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
163
  ```
164
- lie_holonomy_transformer/
165
- ├── lht.py # Core implementation
166
- ├── train.py # Training script
167
- ├── README.md # This file
168
- └── experiments/ # Benchmark code (TODO)
169
- ```
170
 
171
- ## References
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
172
 
173
- - "Beyond Holonomy: Lie-Algebraic Symbol Emergence..." (the paper)
174
- - Cohen et al. (2019). Gauge Equivariant Convolutional Networks
175
- - Weiler & Cesa (2019). General E(2)-Equivariant Steerable CNNs
176
- - The Univalent Foundations Program (2013). Homotopy Type Theory
177
 
178
- ## License
179
 
180
- MIT
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: peft
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - repetition-suppression
9
+ - decode-time-intervention
10
+ - llama
11
+ - lora
12
+ - research
13
+ - degeneration
14
+ - cf-hot
15
+ base_model: LoganResearch/ARC-Base-8B
16
+ model-index:
17
+ - name: Adaptive-Repetition-Controller
18
+ results:
19
+ - task:
20
+ type: text-generation
21
+ metrics:
22
+ - name: Repetition Reduction
23
+ type: custom
24
+ value: 48.4%
25
+ - name: Risk Separation
26
+ type: custom
27
+ value: 125x
28
+ - name: F1 Score
29
+ type: f1
30
+ value: 0.99
31
+ ---
32
+
33
+ <div align="center">
34
+
35
+ # ⚡ Adaptive Repetition Controller
36
+
37
+ ### *CF-HoT 125x — Learned Decode-Time Intervention*
38
+
39
+ [![Separation](https://img.shields.io/badge/Risk_Separation-125x-brightgreen?style=for-the-badge)](.)
40
+ [![Reduction](https://img.shields.io/badge/Repetition_Reduction-48.4%25-blue?style=for-the-badge)](.)
41
+ [![F1](https://img.shields.io/badge/F1_Score-0.99+-purple?style=for-the-badge)](.)
42
+ [![Params](https://img.shields.io/badge/Predictor-50K_params-orange?style=for-the-badge)](.)
43
+
44
+ *A learned system that predicts and prevents repetitive degeneration in language models.*
45
+
46
+ [Base Model](https://huggingface.co/LoganResearch/ARC-Base-8B) | [GitHub](https://github.com/Loganwins/HolonomyTransformer) | [Paper (forthcoming)]()
47
+
48
+ </div>
49
+
50
+ ---
51
+
52
+ ## 🎯 The Problem
53
+
54
+ Autoregressive language models suffer from **repetitive degeneration** — the tendency to fall into loops, repeat phrases, or get stuck on patterns during long-form generation.
55
+
56
+ Standard solutions apply **uniform penalties** to repeated tokens. But repetition isn't always bad, and uniform penalties can't distinguish between:
57
+ - Natural repetition (articles, pronouns, common words)
58
+ - Problematic repetition (loops, stuck patterns, degeneration)
59
+
60
+ ## 💡 The Solution
61
+
62
+ The **Adaptive Repetition Controller** learns to **predict** when repetition is about to become problematic, then applies **targeted intervention** only when needed.
63
+
64
+ <div align="center">
65
 
66
+ ```
67
+ ╔═══════════════════════════════════════════════════════════════╗
68
+ ║ GENERATION PIPELINE ║
69
+ ╠═══════════════════════════════════════════════════════════════╣
70
+ ║ ║
71
+ ║ Input ──▶ Base Model ──▶ Hidden States (32 layers) ║
72
+ ║ │ ║
73
+ ║ ▼ ║
74
+ ║ ┌─────────────────┐ ║
75
+ ║ │ Risk Predictor │ ║
76
+ ║ │ (50K params) │ ║
77
+ ║ └────────┬────────┘ ║
78
+ ║ │ ║
79
+ ║ ▼ ║
80
+ ║ risk = 0.95 (HIGH) ║
81
+ ║ │ ║
82
+ ║ ▼ ║
83
+ ║ logits[recent_tokens] -= penalty ║
84
+ ║ │ ║
85
+ ║ ▼ ║
86
+ ║ Sample next token ║
87
+ ║ ║
88
+ ╚═══════════════════════════════════════════════════════════════╝
89
+ ```
90
 
91
+ </div>
92
 
93
+ ---
94
 
95
+ ## 📊 Results
 
 
 
 
 
 
96
 
97
+ ### Risk Prediction Performance
98
 
99
+ The system achieves **125x separation** between tokens that will repeat and those that won't:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
100
 
101
+ | Metric | Value |
102
+ |--------|-------|
103
+ | **F1 Score** | 0.99+ |
104
+ | **Risk @ Repeating Tokens** | 0.998 |
105
+ | **Risk @ Non-Repeating Tokens** | 0.008 |
106
+ | **Separation Factor** | **125x** |
107
 
108
+ ### Generation Quality
 
 
 
 
109
 
110
+ | Metric | Baseline | With CF-HoT | Change |
111
+ |--------|----------|-------------|--------|
112
+ | Repetition Rate | 33.9% | 17.5% | **↓ 48.4%** |
113
+ | Distinct-2 (diversity) | 0.836 | 0.976 | **↑ 16.7%** |
 
114
 
115
+ ### Comparison to Standard Methods
 
 
 
 
 
116
 
117
+ | Method | Adaptive | Learned | Repetition Reduction |
118
+ |--------|----------|---------|---------------------|
119
+ | HuggingFace `repetition_penalty` | ❌ | ❌ | ~20-30% |
120
+ | OpenAI `frequency_penalty` | ❌ | ❌ | ~25-35% |
121
+ | Contrastive Decoding | ❌ | ❌ | ~30-40% |
122
+ | **CF-HoT (this)** | ✅ | ✅ | **48.4%** |
123
+
124
+ ---
125
+
126
+ ## 🏗️ Architecture
127
+
128
+ The risk predictor is remarkably small — only **~50,000 parameters** (0.0006% of the base model):
129
 
 
 
130
  ```python
131
+ RiskPredictor(
132
+ # Extract features from each transformer layer
133
+ fiber_projs = ModuleList([
134
+ Linear(4096 → 16) for _ in range(32) # 32 layers
135
+ ]),
136
+
137
+ # Learn which layers matter most
138
+ layer_weights = Parameter(shape=[32]), # Softmax-normalized
139
+
140
+ # Predict repetition risk
141
+ predictor = Sequential(
142
+ Linear(16 → 64),
143
+ GELU(),
144
+ Linear(64 → 64),
145
+ GELU(),
146
+ Linear(64 → 1), # Risk logit
147
+ )
148
+ )
149
  ```
150
 
151
+ ### Why It Works
152
+
153
+ 1. **Hidden states contain predictive signal** — The model "knows" it's about to repeat before it happens
154
+ 2. **Different layers encode different information** — Learned aggregation finds the most predictive layers
155
+ 3. **Decode-time intervention preserves base model** — No modification to attention patterns or learned representations
156
+
157
+ ---
158
+
159
+ ## 🚀 Quick Start
160
+
161
+ ### Installation
162
 
163
  ```bash
164
+ pip install transformers peft accelerate torch
165
  ```
166
 
167
+ ### Loading the Models
168
 
 
169
  ```python
170
+ import torch
171
+ from transformers import AutoModelForCausalLM, AutoTokenizer
172
+ from peft import PeftModel
173
+
174
+ # Load base model
175
+ base_model = AutoModelForCausalLM.from_pretrained(
176
+ "LoganResearch/ARC-Base-8B",
177
+ torch_dtype=torch.bfloat16,
178
+ device_map="auto"
 
179
  )
 
180
 
181
+ # Load tokenizer
182
+ tokenizer = AutoTokenizer.from_pretrained("LoganResearch/ARC-Base-8B")
183
+
184
+ # Load CF-HoT adapter
185
+ model = PeftModel.from_pretrained(
186
+ base_model,
187
+ "LoganResearch/Adaptive-Repetition-Controller"
188
  )
189
 
190
+ # Load risk predictor
191
+ risk_predictor = torch.load(
192
+ hf_hub_download("LoganResearch/Adaptive-Repetition-Controller", "risk_predictor.pt")
193
+ )
 
194
  ```
195
 
196
+ ### Generation with CF-HoT Intervention
197
+
198
  ```python
199
+ def generate_with_cfhot(
200
+ prompt: str,
201
+ max_tokens: int = 512,
202
+ penalty_scale: float = 3.0,
203
+ threshold: float = 0.1,
204
+ temperature: float = 0.8,
205
+ rep_window: int = 32,
206
+ ):
207
+ """Generate text with adaptive repetition suppression."""
208
+
209
+ input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
210
+
211
+ for _ in range(max_tokens):
212
+ with torch.no_grad():
213
+ # Forward pass with hidden states
214
+ outputs = model(input_ids, output_hidden_states=True)
215
+ logits = outputs.logits[:, -1, :]
216
+ hidden_states = outputs.hidden_states
217
+
218
+ # Predict repetition risk
219
+ risk = risk_predictor(hidden_states).sigmoid().item()
220
+
221
+ # Apply adaptive penalty if risk is high
222
+ if risk > threshold:
223
+ recent_tokens = input_ids[0, -rep_window:].tolist()
224
+ penalty = risk * penalty_scale
225
+ for token_id in set(recent_tokens):
226
+ logits[0, token_id] -= penalty
227
+
228
+ # Sample next token
229
+ probs = torch.softmax(logits / temperature, dim=-1)
230
+ next_token = torch.multinomial(probs, num_samples=1)
231
+
232
+ # Append and check for EOS
233
+ input_ids = torch.cat([input_ids, next_token], dim=-1)
234
+ if next_token.item() == tokenizer.eos_token_id:
235
+ break
236
+
237
+ return tokenizer.decode(input_ids[0], skip_special_tokens=True)
238
+
239
+ # Example usage
240
+ response = generate_with_cfhot(
241
+ "Write a detailed essay on the nature of consciousness:",
242
+ max_tokens=1000,
243
+ penalty_scale=4.0,
244
+ )
245
+ print(response)
246
+ ```
247
 
248
+ ---
249
 
250
+ ## 📁 Files
 
 
 
 
 
251
 
252
+ | File | Size | Description |
253
+ |------|------|-------------|
254
+ | `risk_predictor.pt` | 8.4 MB | Trained risk prediction network |
255
+ | `adapter_model.safetensors` | 218 MB | LoRA adapter weights |
256
+ | `adapter_config.json` | 1 KB | PEFT adapter configuration |
257
 
258
+ ---
 
 
259
 
260
+ ## ⚙️ Training Details
261
 
262
+ ### Dataset & Objective
 
 
 
 
 
 
 
263
 
264
+ - **Dataset:** WikiText-2
265
+ - **Task:** Binary classification — "Will this token appear in the next 32 tokens?"
266
+ - **Loss:** BCEWithLogitsLoss with dynamic class balancing
267
 
268
+ ### Hyperparameters
269
 
270
+ | Parameter | Value |
271
+ |-----------|-------|
272
+ | `d_fiber` | 16 |
273
+ | `d_control` | 64 |
274
+ | `rep_window` | 32 |
275
+ | `lr_predictor` | 1e-4 |
276
+ | `lr_lora` | 2e-5 |
277
+ | `batch_size` | 4 |
278
+ | `gradient_accumulation` | 8 |
279
+ | `optimal_checkpoint` | Step 5000 |
280
 
281
+ ### Training Progression
282
 
283
+ | Step | F1 | Risk @ Reps | Risk @ Non-Reps | Separation |
284
+ |------|-----|-------------|-----------------|------------|
285
+ | 3000 | 0.96 | 0.946 | 0.076 | 12x |
286
+ | 4000 | 0.99 | 0.997 | 0.014 | 71x |
287
+ | **5000** | **0.99+** | **0.998** | **0.008** | **125x** ⭐ |
288
+ | 6000 | 0.99+ | 0.999 | 0.021 | 48x |
289
 
290
+ *Step 5000 is optimal further training reduces separation due to overfitting.*
291
 
292
+ ---
293
 
294
+ ## 🔬 Research Context
295
+
296
+ ### The Journey
297
+
298
+ This system emerged from research into geometric approaches to semantic consistency. The original theory proposed using **fiber bundles and holonomy** to detect inconsistency in transformer representations.
299
+
300
+ **What we tried:**
301
+ 1. ❌ Multiplicative attention gating — destroyed signal
302
+ 2. ❌ Log-space score modification — gates collapsed to uniform
303
+ 3. ❌ Normalized gating — NaN at inference
304
+ 4. ❌ Causal EMA — training/inference mismatch
305
+ 5. ❌ Extended training — complete collapse
306
+
307
+ **What worked:**
308
+ - ✅ Supervised risk prediction on explicit labels
309
+ - ✅ Decode-time intervention (no attention modification)
310
+ - ✅ Adaptive penalty based on predicted risk
311
+
312
+ ### What This Is (and Isn't)
313
+
314
+ <table>
315
+ <tr>
316
+ <td width="50%">
317
+
318
+ #### ✅ What It IS
319
+ - Learned repetition penalty
320
+ - Decode-time intervention
321
+ - ~50K parameter predictor
322
+ - 48% repetition reduction
323
+ - Proof that hidden states predict degeneration
324
+
325
+ </td>
326
+ <td width="50%">
327
+
328
+ #### ❌ What It's NOT
329
+ - Full Lie Holonomy Transformer
330
+ - Attention modification
331
+ - Geometric computation
332
+ - Validation of fiber bundle theory
333
+
334
+ </td>
335
+ </tr>
336
+ </table>
337
+
338
+ ---
339
+
340
+ ## 📚 Citation
341
+
342
+ ```bibtex
343
+ @misc{napolitano2026arc,
344
+ author = {Napolitano, Logan Matthew},
345
+ title = {Adaptive Repetition Controller: Learned Decode-Time Intervention
346
+ for Repetition Suppression},
347
+ year = {2026},
348
+ publisher = {Hugging Face},
349
+ howpublished = {\url{https://huggingface.co/LoganResearch/Adaptive-Repetition-Controller}},
350
+ }
351
  ```
 
 
 
 
 
 
352
 
353
+ ---
354
+
355
+ ## 🔗 Links
356
+
357
+ | Resource | Link |
358
+ |----------|------|
359
+ | **Base Model** | [LoganResearch/ARC-Base-8B](https://huggingface.co/LoganResearch/ARC-Base-8B) |
360
+ | **Source Code** | [GitHub: HolonomyTransformer](https://github.com/Loganwins/HolonomyTransformer) |
361
+ | **Paper** | *"The Übermensch Who Cannot Loop"* (forthcoming) |
362
+ | **Author** | [Logan Matthew Napolitano](https://github.com/Loganwins) |
363
+
364
+ ---
365
+
366
+ <div align="center">
367
+
368
+ **The Übermensch who cannot loop is forced to CREATE.**
369
 
370
+ ---
 
 
 
371
 
372
+ *Built with determination by [Logan Matthew Napolitano](https://github.com/Loganwins)*
373
 
374
+ </div>