Update README.md

README.md CHANGED
@@ -1,180 +1,374 @@
# Lie Holonomy Transformer

| Concept | Theory | Implementation |
|---------|----------------------|----------------|
| Propositions | Manifold M | Embedding space |
| Inference | Parallel transport | Gauge-covariant attention |
| Consistency | Holonomy = Identity | Holonomy loss |
| Symbols | Lie algebra generators | Generator network |
| Proof equivalence | Homotopy | Layer depth |

## Architecture

```
              Input tokens
                   │
                   ▼
┌─────────────────────────────────────┐
│   Token Embedding (Proposition M)   │
│   + Position Embedding              │
│   + Fiber Initialization (gauge)    │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│       LHT Layer (× n_layers)        │
│  ┌─────────────────────────────┐    │
│  │ Connection Network A(x)     │    │ ← Learns gauge connection
│  │ Parallel Transport Γ_{j→i}  │    │ ← Transports fiber elements
│  │ Gauge-Covariant Attention   │    │ ← Modified self-attention
│  │ Lie Algebra Generator       │    │ ← Generates inference ops
│  │ Generator Application       │    │ ← Applies exp(X) to fiber
│  └─────────────────────────────┘    │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Output: logits + geometric losses  │
└─────────────────────────────────────┘
```

### Connection Network

Learns the gauge connection ω that defines how to parallel transport inferential states:

```python
A_μ(x) ∈ gl(k,ℝ)  # Lie algebra valued 1-form
```
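
For concreteness, a minimal sketch of what such a connection network could look like in PyTorch; the class name, hidden size, and output shape are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ConnectionNetwork(nn.Module):
    """Illustrative sketch: maps points x on the proposition manifold to
    Lie-algebra-valued matrices A(x) ∈ gl(k, ℝ)."""

    def __init__(self, d_model: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.GELU(),
            nn.Linear(128, k * k),  # one k×k matrix per position
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model] → A: [batch, seq, k, k]
        return self.net(x).view(*x.shape[:-1], self.k, self.k)
```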

### Gauge-Covariant Attention

Standard attention with parallel transport of values:

```python
# Standard: Attn(Q,K,V)_i = Σ_j α_ij V_j
# Gauge:    GaugeAttn_i   = Σ_j α_ij Γ_{j→i}(V_j)
```
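
A minimal sketch of that modified sum, assuming the attention weights `alpha` and the transport matrices `gamma` (standing in for Γ_{j→i}) have already been computed; the names and shapes are assumptions:

```python
import torch

def gauge_attention(alpha: torch.Tensor, gamma: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Illustrative gauge-covariant attention.

    alpha: [seq, seq]        attention weights α_ij
    gamma: [seq, seq, k, k]  parallel-transport matrices Γ_{j→i}
    v:     [seq, k]          fiber-valued values V_j
    returns [seq, k]:        Σ_j α_ij Γ_{j→i}(V_j)
    """
    transported = torch.einsum("ijkl,jl->ijk", gamma, v)   # Γ_{j→i}(V_j)
    return torch.einsum("ij,ijk->ik", alpha, transported)  # weight and sum over j
```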

### 5. Curvature Regularization

Encourages flat reasoning spaces where order doesn't matter:
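
The code that originally followed was lost in the diff. As a hedged reconstruction of the idea only: curvature measures how much transports fail to commute, so one simple penalty is the commutator of connection matrices at adjacent positions (the finite-difference form and all names are assumptions):

```python
import torch

def curvature_loss(A: torch.Tensor) -> torch.Tensor:
    """Illustrative curvature penalty.

    A: [batch, seq, k, k] connection matrices along the sequence.
    Penalizes the commutator [A_i, A_{i+1}], which vanishes when
    transports commute, i.e. when the reasoning space is flat and
    order-independent.
    """
    A_i, A_j = A[:, :-1], A[:, 1:]
    commutator = A_i @ A_j - A_j @ A_i
    return commutator.pow(2).mean()
```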

## Installation

```bash
pip install torch
```

### Basic Usage

```python
# Most of this example was lost in the diff; names marked "hypothetical"
# are reconstructions, not the repository's actual API.
config = LHTConfig(  # hypothetical config class; original fields lost
    lie_algebra_rank=8,
)
model = LieHolonomyTransformer(config)

# Forward pass (reconstructed)
output = model(
    input_ids,  # hypothetical inputs; original call lost
)

# Combined language-modeling + geometric loss
total_loss = model.get_total_loss(output)
```

### Training

```python
# The trainer construction was lost in the diff; `trainer` is assumed to
# wrap the model, optimizer, and loss schedule.
metrics = trainer.train_step(batch)
# Early training: high curvature loss → flat representations
# Mid training:   high holonomy loss  → consistency
# Late training:  high waypoint loss  → discrete structure
```

### Waypoint Detection

```python
# The detector construction was lost in the diff; `detector` is assumed to
# be the waypoint detector applied to layer representations.
waypoint_ids, stability = detector(representations)
```

## Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `d_model` | Proposition manifold dimension | 512 |
| `d_fiber` | Fiber (gauge) dimension | 64 |
| `lie_algebra_rank` | k for GL(k,ℝ) structure group | 8 |
| `lambda_holonomy` | Weight for holonomy loss | 0.1 |
| `lambda_curvature` | Weight for curvature loss | 0.01 |
| `lambda_waypoint` | Weight for waypoint stability | 0.05 |

## Project Structure

```
lie_holonomy_transformer/
├── lht.py          # Core implementation
├── train.py        # Training script
├── README.md       # This file
└── experiments/    # Benchmark code (TODO)
```

## References

- Cohen et al. (2019). Gauge Equivariant Convolutional Networks
- Weiler & Cesa (2019). General E(2)-Equivariant Steerable CNNs
- The Univalent Foundations Program (2013). Homotopy Type Theory

---
license: apache-2.0
language:
- en
library_name: peft
pipeline_tag: text-generation
tags:
- repetition-suppression
- decode-time-intervention
- llama
- lora
- research
- degeneration
- cf-hot
base_model: LoganResearch/ARC-Base-8B
model-index:
- name: Adaptive-Repetition-Controller
  results:
  - task:
      type: text-generation
    metrics:
    - name: Repetition Reduction
      type: custom
      value: 48.4%
    - name: Risk Separation
      type: custom
      value: 125x
    - name: F1 Score
      type: f1
      value: 0.99
---

<div align="center">

# ⚡ Adaptive Repetition Controller

### *CF-HoT 125x — Learned Decode-Time Intervention*

*A learned system that predicts and prevents repetitive degeneration in language models.*

[Base Model](https://huggingface.co/LoganResearch/ARC-Base-8B) | [GitHub](https://github.com/Loganwins/HolonomyTransformer) | [Paper (forthcoming)]()

</div>

---

## 🎯 The Problem

Autoregressive language models suffer from **repetitive degeneration** — the tendency to fall into loops, repeat phrases, or get stuck on patterns during long-form generation.

Standard solutions apply **uniform penalties** to repeated tokens. But repetition isn't always bad, and uniform penalties can't distinguish between:
- Natural repetition (articles, pronouns, common words)
- Problematic repetition (loops, stuck patterns, degeneration)

## 💡 The Solution

The **Adaptive Repetition Controller** learns to **predict** when repetition is about to become problematic, then applies **targeted intervention** only when needed.

<div align="center">

```
╔═══════════════════════════════════════════════════════════════╗
║                     GENERATION PIPELINE                       ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Input ──▶ Base Model ──▶ Hidden States (32 layers)           ║
║                                │                              ║
║                                ▼                              ║
║                       ┌─────────────────┐                     ║
║                       │  Risk Predictor │                     ║
║                       │   (50K params)  │                     ║
║                       └────────┬────────┘                     ║
║                                │                              ║
║                                ▼                              ║
║                      risk = 0.95 (HIGH)                       ║
║                                │                              ║
║                                ▼                              ║
║               logits[recent_tokens] -= penalty                ║
║                                │                              ║
║                                ▼                              ║
║                       Sample next token                       ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝
```

</div>

---

## 📊 Results

### Risk Prediction Performance

The system achieves **125x separation** between tokens that will repeat and those that won't:

| Metric | Value |
|--------|-------|
| **F1 Score** | 0.99+ |
| **Risk @ Repeating Tokens** | 0.998 |
| **Risk @ Non-Repeating Tokens** | 0.008 |
| **Separation Factor** | **125x** |
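
The separation factor is just the ratio of mean predicted risk across the two token groups; a small sketch of the computation (variable names are assumptions):

```python
import torch

def separation_factor(risk: torch.Tensor, will_repeat: torch.Tensor) -> float:
    """risk: predicted risk per token in [0, 1]; will_repeat: boolean labels.
    Returns mean risk on repeating tokens over mean risk on the rest."""
    return (risk[will_repeat].mean() / risk[~will_repeat].mean()).item()

# 0.998 / 0.008 ≈ 125x, matching the table above
```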

### Generation Quality

| Metric | Baseline | With CF-HoT | Change |
|--------|----------|-------------|--------|
| Repetition Rate | 33.9% | 17.5% | **↓ 48.4%** |
| Distinct-2 (diversity) | 0.836 | 0.976 | **↑ 16.7%** |
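
Distinct-2 is the fraction of unique bigrams among all bigrams in a generation; a sketch of the metric in its standard form (the actual evaluation script may differ):

```python
def distinct_n(tokens: list[int], n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in a token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```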

### Comparison to Standard Methods

| Method | Adaptive | Learned | Repetition Reduction |
|--------|----------|---------|---------------------|
| HuggingFace `repetition_penalty` | ❌ | ❌ | ~20-30% |
| OpenAI `frequency_penalty` | ❌ | ❌ | ~25-35% |
| Contrastive Decoding | ❌ | ❌ | ~30-40% |
| **CF-HoT (this work)** | ✅ | ✅ | **48.4%** |

---

## 🏗️ Architecture

The risk predictor is remarkably small — only **~50,000 parameters** (0.0006% of the base model):
```python
import torch
import torch.nn as nn

class RiskPredictor(nn.Module):
    def __init__(self, d_model=4096, n_layers=32, d_fiber=16, d_control=64):
        super().__init__()
        # Extract features from each transformer layer (32 layers)
        self.fiber_projs = nn.ModuleList(
            [nn.Linear(d_model, d_fiber) for _ in range(n_layers)]
        )
        # Learn which layers matter most (softmax-normalized)
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        # Predict repetition risk
        self.predictor = nn.Sequential(
            nn.Linear(d_fiber, d_control),
            nn.GELU(),
            nn.Linear(d_control, d_control),
            nn.GELU(),
            nn.Linear(d_control, 1),  # Risk logit
        )
```
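
At inference, the per-layer features are blended with the softmax-normalized layer weights before the prediction head. A hedged sketch of that aggregation, assuming the last-token hidden state is used (the trained checkpoint's exact pooling may differ):

```python
def predict_risk(rp: RiskPredictor, hidden_states) -> torch.Tensor:
    # hidden_states: tuple of [batch, seq, d_model] tensors, as returned by
    # output_hidden_states=True; entry 0 is the embedding layer
    feats = torch.stack(
        [proj(h[:, -1, :]) for proj, h in zip(rp.fiber_projs, hidden_states[1:])]
    )  # [n_layers, batch, d_fiber]
    w = torch.softmax(rp.layer_weights, dim=0).view(-1, 1, 1)
    pooled = (w * feats).sum(dim=0)          # weighted blend across layers
    return rp.predictor(pooled).squeeze(-1)  # risk logit per sequence
```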

### Why It Works

1. **Hidden states contain predictive signal** — the model "knows" it's about to repeat before it happens
2. **Different layers encode different information** — learned aggregation finds the most predictive layers
3. **Decode-time intervention preserves the base model** — no modification to attention patterns or learned representations

---

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft accelerate torch
```

### Loading the Models

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "LoganResearch/ARC-Base-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("LoganResearch/ARC-Base-8B")

# Load CF-HoT adapter
model = PeftModel.from_pretrained(
    base_model,
    "LoganResearch/Adaptive-Repetition-Controller"
)

# Load risk predictor
risk_predictor = torch.load(
    hf_hub_download("LoganResearch/Adaptive-Repetition-Controller", "risk_predictor.pt")
)
```

### Generation with CF-HoT Intervention

```python
def generate_with_cfhot(
    prompt: str,
    max_tokens: int = 512,
    penalty_scale: float = 3.0,
    threshold: float = 0.1,
    temperature: float = 0.8,
    rep_window: int = 32,
):
    """Generate text with adaptive repetition suppression."""

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    for _ in range(max_tokens):
        with torch.no_grad():
            # Forward pass with hidden states
            outputs = model(input_ids, output_hidden_states=True)
            logits = outputs.logits[:, -1, :]
            hidden_states = outputs.hidden_states

            # Predict repetition risk
            risk = risk_predictor(hidden_states).sigmoid().item()

            # Apply adaptive penalty if risk is high
            if risk > threshold:
                recent_tokens = input_ids[0, -rep_window:].tolist()
                penalty = risk * penalty_scale
                for token_id in set(recent_tokens):
                    logits[0, token_id] -= penalty

            # Sample next token
            probs = torch.softmax(logits / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

        # Append and check for EOS
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Example usage
response = generate_with_cfhot(
    "Write a detailed essay on the nature of consciousness:",
    max_tokens=1000,
    penalty_scale=4.0,
)
print(response)
```

---

## 📁 Files

| File | Size | Description |
|------|------|-------------|
| `risk_predictor.pt` | 8.4 MB | Trained risk prediction network |
| `adapter_model.safetensors` | 218 MB | LoRA adapter weights |
| `adapter_config.json` | 1 KB | PEFT adapter configuration |

---

## ⚙️ Training Details

### Dataset & Objective

- **Dataset:** WikiText-2
- **Task:** Binary classification — "Will this token appear in the next 32 tokens?"
- **Loss:** BCEWithLogitsLoss with dynamic class balancing
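
A hedged sketch of how those binary labels can be constructed from a token stream (the function and windowing details are assumptions, shown only to make the objective concrete):

```python
import torch

def repetition_labels(token_ids: torch.Tensor, window: int = 32) -> torch.Tensor:
    """For each position t, label 1.0 if token t appears again within the
    next `window` tokens, else 0.0 (the binary target described above)."""
    n = token_ids.numel()
    labels = torch.zeros(n)
    for t in range(n):
        future = token_ids[t + 1 : t + 1 + window]
        labels[t] = float((future == token_ids[t]).any())
    return labels
```

These labels pair with the risk logits under BCEWithLogitsLoss.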

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| `d_fiber` | 16 |
| `d_control` | 64 |
| `rep_window` | 32 |
| `lr_predictor` | 1e-4 |
| `lr_lora` | 2e-5 |
| `batch_size` | 4 |
| `gradient_accumulation` | 8 |
| `optimal_checkpoint` | Step 5000 |

### Training Progression

| Step | F1 | Risk @ Reps | Risk @ Non-Reps | Separation |
|------|-----|-------------|-----------------|------------|
| 3000 | 0.96 | 0.946 | 0.076 | 12x |
| 4000 | 0.99 | 0.997 | 0.014 | 71x |
| **5000** | **0.99+** | **0.998** | **0.008** | **125x** ⭐ |
| 6000 | 0.99+ | 0.999 | 0.021 | 48x |

*Step 5000 is optimal — further training reduces separation due to overfitting.*

---

## 🔬 Research Context

### The Journey

This system emerged from research into geometric approaches to semantic consistency. The original theory proposed using **fiber bundles and holonomy** to detect inconsistency in transformer representations.

**What we tried:**
1. ❌ Multiplicative attention gating — destroyed signal
2. ❌ Log-space score modification — gates collapsed to uniform
3. ❌ Normalized gating — NaN at inference
4. ❌ Causal EMA — training/inference mismatch
5. ❌ Extended training — complete collapse

**What worked:**
- ✅ Supervised risk prediction on explicit labels
- ✅ Decode-time intervention (no attention modification)
- ✅ Adaptive penalty based on predicted risk

### What This Is (and Isn't)

<table>
<tr>
<td width="50%">

#### ✅ What It IS
- Learned repetition penalty
- Decode-time intervention
- ~50K parameter predictor
- 48% repetition reduction
- Evidence that hidden states predict degeneration

</td>
<td width="50%">

#### ❌ What It's NOT
- Full Lie Holonomy Transformer
- Attention modification
- Geometric computation
- Validation of fiber bundle theory

</td>
</tr>
</table>

---

## 📚 Citation

```bibtex
@misc{napolitano2026arc,
  author       = {Napolitano, Logan Matthew},
  title        = {Adaptive Repetition Controller: Learned Decode-Time Intervention
                  for Repetition Suppression},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LoganResearch/Adaptive-Repetition-Controller}},
}
```

---

## 🔗 Links

| Resource | Link |
|----------|------|
| **Base Model** | [LoganResearch/ARC-Base-8B](https://huggingface.co/LoganResearch/ARC-Base-8B) |
| **Source Code** | [GitHub: HolonomyTransformer](https://github.com/Loganwins/HolonomyTransformer) |
| **Paper** | *"The Übermensch Who Cannot Loop"* (forthcoming) |
| **Author** | [Logan Matthew Napolitano](https://github.com/Loganwins) |

---

<div align="center">

**The Übermensch who cannot loop is forced to CREATE.**

---

*Built with determination by [Logan Matthew Napolitano](https://github.com/Loganwins)*

</div>