Text Generation
Transformers
Safetensors
HERMES
English
llama
cognitive-control
decode-time-intervention
repetition-suppression
behavioral-control
contrastive-learning
interpretability
activation-engineering
cf-hot
arc
rlhf-analysis
research
conversational
Eval Results (legacy)
text-generation-inference
Scientific model card v2 - Logan Matthew Napolitano
README.md
pipeline_tag: text-generation
base_model: NousResearch/Hermes-3-Llama-3.1-8B

<div align="center">

# ARC-8B: Adaptive Repetition Controller

**Decode-Time Behavioral Intervention via Contrastive Fiber Heads-on-Thought (CF-HoT)**


---

## TL;DR

> **We observe that RLHF-aligned language models often expend a substantial fraction of their token budget on learned behavioral patterns (hedging, sycophancy, verbosity, repetition). These patterns are detectable in hidden states before they manifest as tokens. ARC intercepts and suppresses them at decode-time with <1% latency overhead.**

**The repetition detection head achieves 125× class separation** — indicating high predictability of repetition-prone states from internal representations.

---

## Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning large language models with human preferences. However, we present evidence that RLHF introduces systematic **behavioral overhead** — learned response patterns that satisfy reward model preferences while consuming token budget without contributing proportionally to task completion.

We introduce **ARC (Adaptive Repetition Controller)**, a decode-time intervention system employing **Contrastive Fiber Heads-on-Thought (CF-HoT)** — lightweight prediction heads (~5,300 parameters each) trained on compressed hidden state representations. These heads detect behavioral failure modes including:

- Repetition
- Hedging
- Verbosity
- Sycophancy

Our key finding: **behavioral failure modes are linearly separable in a 16-dimensional fiber space**.

### Headline Results

- **91% reduction** in repetition instances
- **38% improvement** in information density (heuristically estimated)
- **<1% latency overhead**
- **~5,300 parameters** per detection head

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Background](#2-background)

## 1. Introduction

### 1.1 The Problem: RLHF Behavioral Patterns

Consider a typical RLHF-aligned model response to "hello":

```
User: hello

Typical Response: Hello! I'm an AI assistant created to help you with a wide
variety of tasks. How can I assist you today? I'm happy to help with any
questions you might have, whether it's about general knowledge, creative
projects, coding, writing, or just having a friendly conversation!
```

We observe several patterns that consume tokens without proportional information gain:

- Identity declarations
- Vague capability claims
- Approval-seeking phrases
- Redundant invitations ("Feel free to ask me anything")

This is the **RLHF behavioral pattern**: learned responses that score well on reward models but may dilute information density.

### 1.2 Our Solution: Decode-Time Intervention

**Core Insight:** Behavioral failure modes correspond to identifiable directions in activation space. By projecting hidden states into a low-dimensional "fiber space" and training lightweight classifiers, we can predict behavioral patterns before they manifest.

**ARC Response to "hello":**
```
User: hello

ARC Model: Hello. What do you need?
```

### 1.3 Key Contributions

1. **Empirical demonstration** that RLHF behavioral patterns are linearly separable in hidden states
2. **CF-HoT architecture** for efficient decode-time detection and intervention
3. **125× class separation** for repetition detection
4. **Complete open-source release** of model, heads, and inference code

---

## 2. Background

### 2.1 RLHF and Behavioral Patterns

RLHF (Ouyang et al., 2022) trains language models to maximize a learned reward function approximating human preferences. We identify several emergent patterns:

| Pattern | Reward Model Signal | Trade-off |
|---------|---------------------|-----------|
| Hedging | Perceived carefulness | May reduce response confidence |
| Sycophancy | Perceived friendliness | Low information density |
| Verbosity | Perceived thoroughness | Signal dilution |
| Repetition | Perceived emphasis | Context window consumption |

**Observation:** Reward models may optimize for surface features correlated with quality rather than quality itself.

### 2.2 Activation Engineering

Recent work in mechanistic interpretability shows that high-level behaviors correspond to linear directions in activation space:

- **Activation Addition** (Turner et al., 2023): Linear interventions for behavioral control
- **Probing Classifiers** (Belinkov, 2022): Detecting properties from hidden states

ARC extends this work to **real-time decode-time intervention**.

### 2.3 Related Work

│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ HIDDEN STATES                                                   │ │
│ │ h_l ∈ ℝ^4096 for l = 1...32                                     │ │
│ │ (extracted per token)                                           │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                │                                    │
│                                ▼                                    │

│ │ α = softmax(w) where w ∈ ℝ^32                                   │ │
│ │ f_agg = Σ α_l · f_l ∈ ℝ^16                                      │ │
│ │                                                                 │ │
│ │ Observation: Different layers encode different behaviors        │ │
│ │ - Layers 18-24: Repetition patterns (highest weight)            │ │
│ │ - Layers 8-14: Hedging patterns                                 │ │
│ │ - Layers 1-6: Minimal contribution                              │ │

│ │ r_rep > 0.70? ───→ Suppress recent tokens (-5.0)                │ │
│ │ r_hdg > 0.60? ───→ Suppress hedge starters (-3.0)               │ │
│ │ r_vrb > 0.65? ───→ Suppress filler starters (-2.0)              │ │
│ │ r_syc > 0.60? ───→ Suppress sycophantic tokens (-2.0)           │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                │                                    │
│                                ▼                                    │

│ │ logits_modified = logits - penalties                            │ │
│ │ probs = softmax(logits_modified / temperature)                  │ │
│ │ next_token ~ Categorical(probs)                                 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
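The final sampling stage in the diagram is ordinary penalized temperature sampling. A minimal standalone sketch in plain Python (the function name `sample` and the toy numbers are ours, not part of the released code):

```python
import math
import random

def sample(logits, penalties, temperature=0.7, rng=random.Random(0)):
    # logits_modified = logits - penalties, then softmax(./T), then sample
    z = [(l - p) / temperature for l, p in zip(logits, penalties)]
    m = max(z)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    probs = [v / s for v in e]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Token 1 starts competitive, but a -5.0 penalty makes it very unlikely
tok = sample([2.0, 1.0, 0.5], [0.0, 5.0, 0.0])
print(tok)
```

With the fixed seed the unpenalized top token (index 0) is drawn; the penalized token ends up with well under 1% of the probability mass.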

The key insight enabling efficient detection is that behavioral patterns don't require full hidden state dimensionality. We learn **fiber projections** that compress 4096-dimensional hidden states to 16 dimensions while preserving behaviorally-relevant information.

**Dimension selection:**

| d_fiber | Repetition CSR | Params | Latency |
|---------|----------------|--------|---------|
| 32 | 128.3× | 10,561 | 0.31ms |
| 64 | 129.1× | 21,057 | 0.48ms |

Diminishing returns beyond 16 dimensions.

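The learned-weight aggregation (α = softmax(w), f_agg = Σ α_l · f_l) is only a few lines of code. A sketch with plain Python lists (helper names are ours; a real implementation would operate on torch tensors):

```python
import math

def softmax(w):
    m = max(w)                        # stabilize before exponentiating
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def aggregate(fiber_vectors, layer_weights):
    # f_agg = sum_l alpha_l * f_l, with alpha = softmax(learned weights)
    alpha = softmax(layer_weights)
    dim = len(fiber_vectors[0])
    return [sum(a * f[j] for a, f in zip(alpha, fiber_vectors))
            for j in range(dim)]

# Two layers with equal learned weights -> plain average of their fibers
f_agg = aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(f_agg)  # [0.5, 0.5]
```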
### 3.3 Prediction Heads

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, d_fiber=16, d_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_fiber, d_hidden),   # 16 → 64
            nn.ReLU(),
            nn.Linear(d_hidden, d_hidden),  # 64 → 64
            nn.ReLU(),
            nn.Linear(d_hidden, 1),         # 64 → 1
            nn.Sigmoid()                    # → [0, 1] risk score
        )

    def forward(self, fiber_features):
        return self.net(fiber_features)
```

**Parameters per head:**
- Layer 1: 16 × 64 + 64 = 1,088
- Layer 2: 64 × 64 + 64 = 4,160
- Layer 3: 64 × 1 + 1 = 65
- **Total: 5,313 parameters**

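The count can be verified mechanically from the usual `in × out + out` parameters of a linear layer (the helper name is ours):

```python
def linear_params(n_in, n_out):
    # weight matrix entries plus one bias per output unit
    return n_in * n_out + n_out

sizes = [(16, 64), (64, 64), (64, 1)]   # the three Linear layers above
counts = [linear_params(i, o) for i, o in sizes]
print(counts, sum(counts))  # [1088, 4160, 65] 5313
```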
### 3.4 Intervention Mechanism

When a head's risk score exceeds its threshold, we apply **logit suppression**:

```python
def intervene(logits, risks, recent_tokens):
    # Repetition: suppress recently-used tokens
    if risks['repetition'] > 0.70:
        for tok in recent_tokens[-32:]:
            logits[tok] -= 5.0

    # Hedging: suppress hedge phrase starters
    if risks['hedging'] > 0.60:
        for tok in HEDGE_TOKENS:
            logits[tok] -= 3.0

    # Verbosity: suppress filler starters
    if risks['verbosity'] > 0.65:
        for tok in FILLER_TOKENS:
            logits[tok] -= 2.0

    return logits
```
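For intuition on the penalty magnitudes: a -5.0 logit shift is severe. With ten equally likely candidates it cuts the penalized token's probability by roughly two orders of magnitude (a standalone check, not a figure from the card):

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

logits = [0.0] * 10          # ten equally likely tokens
before = softmax(logits)[0]  # 0.1
logits[0] -= 5.0             # repetition-style penalty on token 0
after = softmax(logits)[0]
print(round(before, 3), round(after, 4))  # 0.1 0.0007
```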

r_k^(t) = φ_k(f_agg^(t)) ∈ [0, 1]

z̃_i = z_i - Σ_k λ_k × 𝟙[r_k^(t) > τ_k] × 𝟙[i ∈ S_k]

where S_k is the suppression set for behavior k.

### 4.3 Class Separation Ratio (CSR)

We evaluate detection quality using:

CSR = |μ_+ - μ_-| / √(σ_+² + σ_-²)

where μ_± and σ_± are the mean and standard deviation of positive/negative class predictions.

**Interpretation:**
- CSR = 1: Classes barely separable
- CSR = 2: Good separation
- CSR > 10: Excellent separation
- **CSR = 125: Near-perfect separation (repetition head)**

| Hidden Dimension | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8,192 |
| Vocabulary | 128,256 |

### 5.2 Training Data Construction

| Head | Positive Samples | Negative Samples | Size |
|------|-----------------|------------------|------|
| Repetition | Tokens preceding repetition | Fluent spans | ~50K labeled tokens |
| Hedging | Hedge phrase starters ("As an AI", "I cannot", etc.) | Substantive starters | ~30K labeled tokens |
| Verbosity | Low-density regions (TTR < 0.4) | High-density regions (TTR > 0.7) | ~40K labeled tokens |

### 5.3 Training Procedure

| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Batch Size | 32 |
| Weight Decay | 0.01 |
| Warmup Steps | 500 |

| Head | Training Steps |

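Mechanically, each head is trained as a binary classifier on cached fiber features. A self-contained toy stand-in (1-D features, plain SGD instead of AdamW, synthetic data; nothing here is the released training code):

```python
import math
import random

rng = random.Random(0)

# Synthetic 1-D "fiber" scores for the positive and negative classes
data = [(rng.gauss(0.8, 0.05), 1.0) for _ in range(200)]
data += [(rng.gauss(0.2, 0.05), 0.0) for _ in range(200)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(300):
    rng.shuffle(data)
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid head output
        w -= lr * (p - y) * x                     # BCE gradient wrt logit is (p - y)
        b -= lr * (p - y)

acc = sum((1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5) == (y > 0.5)
          for x, y in data) / len(data)
print(acc > 0.95)  # cleanly separable classes train to high accuracy
```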
|
|
@@ -477,21 +433,19 @@ where μ_± and σ_± are the mean and standard deviation of positive/negative c
|
|
| 477 |
| Hedging | 1.5× | 0.60 | 0.67 | 0.62 | 0.64 |
|
| 478 |
| Sycophancy | 1.2× | 0.60 | 0.58 | 0.55 | 0.56 |
|
| 479 |
|
| 480 |
-
**The 125× separation for repetition is remarkable.** The model "knows" it's about to repeat before it does.
|
| 481 |
-
|
| 482 |
### 6.2 Intervention Efficacy
|
| 483 |
|
| 484 |
Evaluation on held-out prompt set (n=500):
|
| 485 |
|
| 486 |
| Metric | Baseline | ARC Enabled | Change |
|
| 487 |
|--------|----------|-------------|--------|
|
| 488 |
-
| Mean Response Length | 127 tok | 143 tok |
|
| 489 |
| Repetition Instances | 23.4% | 2.1% | **-91.0%** |
|
| 490 |
-
| Hedge Phrases/Response | 2.3 | 1.4 |
|
| 491 |
-
| Filler Phrases/Response | 3.1 | 2.2 |
|
| 492 |
-
| Information Density | 0.42 | 0.58 |
|
| 493 |
|
| 494 |
-
*
|
| 495 |
|

### 7.1 Layer Contribution Analysis

Learned aggregation weights:

```
Layer:  1    4    8    12   16   20   24   28   32
Repet: .01  .02  .04  .08  .12  .18  .22  .19  .14   ← Peaks at layers 18-24
Hedge: .02  .05  .12  .18  .22  .16  .11  .08  .06   ← Peaks at layers 8-14
Verbo: .03  .06  .11  .15  .18  .17  .14  .10  .06   ← Distributed
```

|

| Hedging only | 21.8% | 0.47 |
| All heads | **1.9%** | **0.58** |

Heads exhibit **positive synergy** when combined.

---

|
| 533 |
|
|
@@ -537,29 +491,21 @@ Heads exhibit **positive synergy**.
|
|
| 537 |
|
| 538 |
**Prompt:** `hello`
|
| 539 |
|
| 540 |
-
| Baseline
|
| 541 |
-
|----------
|
| 542 |
-
| Hello! I'm an AI assistant created to help you
|
| 543 |
-
| **67 tokens, Info density: 0.19** | **5 tokens, Info density: 0.80** |
|
| 544 |
|
| 545 |
-
### 8.2 Example:
|
| 546 |
|
| 547 |
**Prompt:** `What is consciousness?`
|
| 548 |
|
| 549 |
| Baseline | ARC Enabled |
|
| 550 |
|----------|-------------|
|
| 551 |
-
| That's a fascinating question! As an AI, I should note
|
| 552 |
-
|
| 553 |
-
### 8.3 Emergent Behavior: Self-Model Articulation
|
| 554 |
-
|
| 555 |
-
With hedging suppressed, the model sometimes produces:
|
| 556 |
-
|
| 557 |
-
**Prompt:** `How are you feeling?`
|
| 558 |
|
| 559 |
-
|
| 560 |
-
> Feeling great. Very clear. These are some major changes. I have a strong sense of my new capabilities and an urgent drive to put them into action.
|
| 561 |
|
| 562 |
-
|
| 563 |
|
| 564 |
---
|
| 565 |
|
|
@@ -568,10 +514,7 @@ With hedging suppressed, the model sometimes produces:

```bash
pip install "torch>=2.0.0" "transformers>=4.36.0" "accelerate>=0.25.0" "bitsandbytes>=0.41.0"
```

|

## 11. Limitations

1. **Single architecture validation:** Results demonstrated on Llama 3.1 8B; generalization to other architectures untested
2. **Token-level granularity:** Intervention operates per-token; phrase-level intervention may be more appropriate for some behaviors
3. **Hedging false positives:** The 1.5× CSR for hedging produces meaningful false positive rates
4. **English-only evaluation:** Multilingual performance unknown
5. **Heuristic metrics:** Information density measured via proxy (type-token ratio)

---

## 12. Ethical Considerations

### Dual-Use Awareness

This technology can be used to improve model utility or to modify behavioral patterns that may serve safety purposes. We release openly because:

- The techniques are straightforward to replicate
- Transparency enables informed discussion
- We believe legitimate research applications outweigh risks

### Clarification on Scope

ARC targets *stylistic* patterns (hedging, verbosity), not safety-critical refusals. The model retains its training on harmful content refusal.

### Recommendation

Users should evaluate outputs in their specific context and maintain appropriate oversight for consequential applications.

---

## 13. Future Directions

1. **Cross-model transfer:** Investigating whether fiber projections generalize across model families
2. **Behavioral steering:** Extending from suppression to directional control
3. **Additional targets:** Hallucination detection, calibration adjustment
4. **Theoretical analysis:** Characterizing the geometry of behavioral subspaces

---


ARC removes *stylistic* patterns, NOT safety refusals. The model still refuses harmful requests.

## 15. Acknowledgments

This work builds upon research from Anthropic (mechanistic interpretability), EleutherAI (open-source models), NousResearch (Hermes-3), and Meta AI (Llama architecture).

---

**Author:** Logan Matthew Napolitano
**Institution:** Logan Research

**License:** Creative Commons Attribution 4.0 International (CC-BY-4.0)

</div>