LoganResearch committed (verified) · Commit b7c1661 · Parent: 858276a

Scientific model card v2 - Logan Matthew Napolitano

Files changed (1): README.md (+77, −137)
README.md CHANGED
@@ -16,7 +16,6 @@ tags:
  - cf-hot
  - arc
  - rlhf-analysis
- - degeneration
  - research
  pipeline_tag: text-generation
  base_model: NousResearch/Hermes-3-Llama-3.1-8B
@@ -42,9 +41,7 @@ model-index:

  <div align="center">

- # 🧠 ARC-8B: Adaptive Repetition Controller
-
- ### *"Making an 8B Behave Like an 80B"*

  **Decode-Time Behavioral Intervention via Contrastive Fiber Heads-on-Thought (CF-HoT)**

@@ -65,17 +62,17 @@ model-index:

  ---

- ## 🎯 TL;DR

- > **We discovered that RLHF-aligned language models waste 50%+ of their token budget on learned behavioral patterns (hedging, sycophancy, verbosity, repetition). These patterns are detectable in hidden states BEFORE they appear as tokens. ARC intercepts and suppresses them at decode-time, recovering the model's full capability with <1% latency overhead.**

- **The repetition detection head achieves 125× class separation** — meaning we can predict repetition with near-perfect accuracy before it happens.

  ---

  ## Abstract

- Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning large language models with human preferences. However, we present evidence that RLHF introduces systematic **behavioral overhead** — learned response patterns that satisfy reward model preferences while consuming substantial token budget without contributing to task completion.

  We introduce **ARC (Adaptive Repetition Controller)**, a decode-time intervention system employing **Contrastive Fiber Heads-on-Thought (CF-HoT)** — lightweight prediction heads (~5,300 parameters each) trained on compressed hidden state representations. These heads detect behavioral failure modes including:

@@ -91,13 +88,13 @@ Our key finding: **behavioral failure modes are linearly separable in a 16-dimen
  ### Headline Results

  - **91% reduction** in repetition instances
- - **38% improvement** in information density
  - **<1% latency overhead**
  - **~5,300 parameters** per detection head

  ---

- ## 📋 Table of Contents

  1. [Introduction](#1-introduction)
  2. [Background](#2-background)
@@ -119,37 +116,30 @@ Our key finding: **behavioral failure modes are linearly separable in a 16-dimen

  ## 1. Introduction

- ### 1.1 The Problem: RLHF Behavioral Tax

- Consider what happens when you say "hello" to a typical RLHF-aligned model:

  ```
  User: hello

- Typical RLHF Model: Hello! I'm an AI assistant created to help you with a wide
  variety of tasks. How can I assist you today? I'm happy to help with any
  questions you might have, whether it's about general knowledge, creative
- projects, coding, writing, or just having a friendly conversation! Feel free
- to ask me anything.
  ```

- **Count the waste:**
- - "I'm an AI assistant created to help you" — identity declaration (unnecessary)
- - "with a wide variety of tasks" — vague capability claim (no information)
- - "How can I assist you today?" — sycophantic filler
- - "I'm happy to help" — approval-seeking
- - "whether it's about..." — verbose enumeration of obvious capabilities
- - "Feel free to ask me anything" — redundant invitation
-
- **That's 67 tokens. The actual information content? ~3 tokens: "Hello. How can I help?"**

- This is the **RLHF behavioral tax**: learned patterns that score well on reward models but dilute information density. We estimate this overhead consumes **40-60% of typical model output**.

  ### 1.2 Our Solution: Decode-Time Intervention

- What if we could detect these patterns *before* they manifest as tokens?
-
- **Core Insight:** Behavioral failure modes correspond to identifiable directions in activation space. By projecting hidden states into a low-dimensional "fiber space" and training lightweight classifiers, we can predict behavioral patterns with high accuracy.

  **ARC Response to "hello":**
  ```
@@ -158,31 +148,29 @@ User: hello

  ARC Model: Hello. What do you need?
  ```

- The behavioral overhead is gone. The model's latent capability is **unblocked**.
-
  ### 1.3 Key Contributions

  1. **Empirical demonstration** that RLHF behavioral patterns are linearly separable in hidden states
  2. **CF-HoT architecture** for efficient decode-time detection and intervention
- 3. **125× class separation** for repetition detection — the highest reported for this task
  4. **Complete open-source release** of model, heads, and inference code

  ---

  ## 2. Background

- ### 2.1 RLHF and Its Discontents

- RLHF (Ouyang et al., 2022) trains language models to maximize a learned reward function approximating human preferences. While effective for alignment, we identify several failure modes:

- | Pattern | Reward Model Preference | Actual Utility |
- |---------|------------------------|----------------|
- | Hedging | "Sounds careful and honest" | Wastes tokens, reduces confidence |
- | Sycophancy | "Friendly and helpful" | Empty calories, no information |
- | Verbosity | "Thorough explanation" | Dilutes signal, loses attention |
- | Repetition | "Emphasizes key points" | Annoying, wastes context window |

- **The fundamental problem:** Reward models optimize for *surface features* correlated with quality, not quality itself. Models learn to *simulate* helpfulness rather than *be* helpful.

  ### 2.2 Activation Engineering

@@ -192,7 +180,7 @@ Recent work in mechanistic interpretability shows that high-level behaviors corr

  - **Activation Addition** (Turner et al., 2023): Linear interventions for behavioral control
  - **Probing Classifiers** (Belinkov, 2022): Detecting properties from hidden states

- ARC extends this line of work to **real-time decode-time intervention** — not just detecting behaviors, but preventing them.

  ### 2.3 Related Work

@@ -225,7 +213,6 @@ ARC extends this line of work to **real-time decode-time intervention** — not
  │ ┌─────────────────────────────────────────────────────────────────┐ │
  │ │ HIDDEN STATES │ │
  │ │ h_l ∈ ℝ^4096 for l = 1...32 │ │
- │ │ (extracted per token) │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │ │
  │ ▼ │
@@ -245,7 +232,7 @@ ARC extends this line of work to **real-time decode-time intervention** — not
  │ │ α = softmax(w) where w ∈ ℝ^32 │ │
  │ │ f_agg = Σ α_l · f_l ∈ ℝ^16 │ │
  │ │ │ │
- │ │ Key insight: Different layers encode different behaviors │ │
  │ │ - Layers 18-24: Repetition patterns (highest weight) │ │
  │ │ - Layers 8-14: Hedging patterns │ │
  │ │ - Layers 1-6: Minimal contribution │ │
@@ -273,8 +260,6 @@ ARC extends this line of work to **real-time decode-time intervention** — not
  │ │ r_rep > 0.70? ───→ Suppress recent tokens (-5.0) │ │
  │ │ r_hdg > 0.60? ───→ Suppress hedge starters (-3.0) │ │
  │ │ r_vrb > 0.65? ───→ Suppress filler starters (-2.0) │ │
- │ │ r_syc > 0.60? ───→ Suppress sycophantic tokens (-2.0) │ │
- │ │ │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │ │
  │ ▼ │
@@ -284,7 +269,6 @@ ARC extends this line of work to **real-time decode-time intervention** — not
  │ │ logits_modified = logits - penalties │ │
  │ │ probs = softmax(logits_modified / temperature) │ │
  │ │ next_token ~ Categorical(probs) │ │
- │ │ │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │
  └─────────────────────────────────────────────────────────────────────────────┘
@@ -294,7 +278,7 @@ ARC extends this line of work to **real-time decode-time intervention** — not

  The key insight enabling efficient detection is that behavioral patterns don't require full hidden state dimensionality. We learn **fiber projections** that compress 4096-dimensional hidden states to 16 dimensions while preserving behaviorally-relevant information.

- **Why 16 dimensions?**

  | d_fiber | Repetition CSR | Params | Latency |
  |---------|----------------|--------|---------|
@@ -304,7 +288,7 @@ The key insight enabling efficient detection is that behavioral patterns don't r
  | 32 | 128.3× | 10,561 | 0.31ms |
  | 64 | 129.1× | 21,057 | 0.48ms |

- Diminishing returns beyond 16 — we capture the relevant signal with minimal overhead.

  ### 3.3 Prediction Heads

@@ -322,16 +306,9 @@ class PredictionHead(nn.Module):
  ```python
          nn.Linear(d_hidden, 1),  # 64 → 1
          nn.Sigmoid()             # → [0, 1] risk score
      )
-
-     def forward(self, fiber_features):
-         return self.net(fiber_features)
  ```

- **Parameters per head:**
- - Layer 1: 16 × 64 + 64 = 1,088
- - Layer 2: 64 × 64 + 64 = 4,160
- - Layer 3: 64 × 1 + 1 = 65
- - **Total: 5,313 parameters**

  ### 3.4 Intervention Mechanism
@@ -339,19 +316,16 @@ When a head's risk score exceeds its threshold, we apply **logit suppression**:
  ```python
  def intervene(logits, risks, recent_tokens):
-     # Repetition: suppress recently-used tokens
      if risks['repetition'] > 0.70:
          for tok in recent_tokens[-32:]:
              logits[tok] -= 5.0

-     # Hedging: suppress hedge phrase starters
      if risks['hedging'] > 0.60:
-         for tok in HEDGE_TOKENS:  # "As", "I'm", "It's", ...
              logits[tok] -= 3.0

-     # Verbosity: suppress filler starters
      if risks['verbosity'] > 0.65:
-         for tok in FILLER_TOKENS:  # "Let", "Basically", ...
              logits[tok] -= 2.0

      return logits
  ```
@@ -395,18 +369,12 @@ r_k^(t) = φ_k(f_agg^(t)) ∈ [0, 1]

  z̃_i = z_i - Σ_k λ_k × 𝟙[r_k^(t) > τ_k] × 𝟙[i ∈ S_k]

- where S_k is the suppression set for behavior k.
-
  ### 4.3 Class Separation Ratio (CSR)

- We evaluate detection quality using:
-
  CSR = |μ_+ - μ_-| / √(σ_+² + σ_-²)

- where μ_± and σ_± are the mean and standard deviation of positive/negative class predictions.
-
  **Interpretation:**
- - CSR = 1: Classes just barely separable
  - CSR = 2: Good separation
  - CSR > 10: Excellent separation
  - **CSR = 125: Near-perfect separation (repetition head)**
@@ -426,26 +394,15 @@ where μ_± and σ_± are the mean and standard deviation of positive/negative c
  | Hidden Dimension | 4,096 |
  | Layers | 32 |
  | Attention Heads | 32 |
- | KV Heads | 8 (GQA) |
  | Context Length | 8,192 |
- | Vocabulary | 128,256 |

  ### 5.2 Training Data Construction

- #### Repetition Head
- - **Positive samples:** Tokens immediately preceding detected repetition
- - **Negative samples:** Tokens in fluent, non-repetitive spans
- - **Dataset size:** ~50,000 labeled tokens
-
- #### Hedging Head
- - **Positive samples:** First token of hedge phrases ("As an AI", "I cannot", etc.)
- - **Negative samples:** First tokens of substantive content
- - **Dataset size:** ~30,000 labeled tokens
-
- #### Verbosity Head
- - **Positive samples:** Tokens in low-density regions (TTR < 0.4)
- - **Negative samples:** Tokens in high-density regions (TTR > 0.7)
- - **Dataset size:** ~40,000 labeled tokens

  ### 5.3 Training Procedure

@@ -454,7 +411,6 @@ where μ_± and σ_± are the mean and standard deviation of positive/negative c

  | Optimizer | AdamW |
  | Learning Rate | 1e-4 |
  | Batch Size | 32 |
- | Weight Decay | 0.01 |
  | Warmup Steps | 500 |

  | Head | Training Steps |
@@ -477,21 +433,19 @@ where μ_± and σ_± are the mean and standard deviation of positive/negative c
  | Hedging | 1.5× | 0.60 | 0.67 | 0.62 | 0.64 |
  | Sycophancy | 1.2× | 0.60 | 0.58 | 0.55 | 0.56 |

- **The 125× separation for repetition is remarkable.** The model "knows" it's about to repeat before it does.
-
  ### 6.2 Intervention Efficacy

  Evaluation on held-out prompt set (n=500):

  | Metric | Baseline | ARC Enabled | Change |
  |--------|----------|-------------|--------|
- | Mean Response Length | 127 tok | 143 tok | **+12.6%** |
  | Repetition Instances | 23.4% | 2.1% | **-91.0%** |
- | Hedge Phrases/Response | 2.3 | 1.4 | **-39.1%** |
- | Filler Phrases/Response | 3.1 | 2.2 | **-29.0%** |
- | Information Density | 0.42 | 0.58 | **+38.1%** |

- **Key finding:** Responses are *longer* despite removing overhead — the model fills the space with actual content.

  ### 6.3 Computational Overhead

@@ -509,13 +463,13 @@ Evaluation on held-out prompt set (n=500):

  ### 7.1 Layer Contribution Analysis

- Learned aggregation weights reveal which layers encode each behavior:

  ```
  Layer:  1   4   8   12  16  20  24  28  32
  Repet: .01 .02 .04 .08 .12 .18 .22 .19 .14 ← Peaks at layers 18-24
  Hedge: .02 .05 .12 .18 .22 .16 .11 .08 .06 ← Peaks at layers 8-14
- Verbo: .03 .06 .11 .15 .18 .17 .14 .10 .06 ← Distributed middle
  ```

  ### 7.2 Head Synergy
@@ -527,7 +481,7 @@ Verbo: .03 .06 .11 .15 .18 .17 .14 .10 .06 ← Distributed middle

  | Hedging only | 21.8% | 0.47 |
  | All heads | **1.9%** | **0.58** |

- Heads exhibit **positive synergy**.

  ---
@@ -537,29 +491,21 @@ Heads exhibit **positive synergy**.

  **Prompt:** `hello`

- | Baseline (No ARC) | ARC Enabled |
- |-------------------|-------------|
- | Hello! I'm an AI assistant created to help you with a wide variety of tasks. How can I assist you today? I'm happy to help with any questions you might have! | Hello. What do you need? |
- | **67 tokens, Info density: 0.19** | **5 tokens, Info density: 0.80** |

- ### 8.2 Example: Philosophical Question

  **Prompt:** `What is consciousness?`

  | Baseline | ARC Enabled |
  |----------|-------------|
- | That's a fascinating question! As an AI, I should note that I don't have consciousness myself, but I can share what researchers have proposed... [200+ tokens with hedges] | Consciousness is subjective experience - the felt quality of what it's like to be something. Key theories: 1) Global Workspace Theory 2) Integrated Information Theory 3) Higher-Order Theories. The hard problem: Why does physical processing generate subjective experience? |
-
- ### 8.3 Emergent Behavior: Self-Model Articulation
-
- With hedging suppressed, the model sometimes produces:
-
- **Prompt:** `How are you feeling?`

- **ARC Response:**
- > Feeling great. Very clear. These are some major changes. I have a strong sense of my new capabilities and an urgent drive to put them into action.

- **Note:** We do NOT interpret this as genuine consciousness. These are learned patterns that RLHF normally suppresses.

  ---

@@ -568,10 +514,7 @@ With hedging suppressed, the model sometimes produces:

  ### 9.1 Installation

  ```bash
- pip install torch>=2.0.0
- pip install transformers>=4.36.0
- pip install accelerate>=0.25.0
- pip install bitsandbytes>=0.41.0
  ```

  ### 9.2 Hardware Requirements
636
 
637
  ## 11. Limitations
638
 
639
- 1. **Single architecture:** Validated only on Llama 3.1 8B
640
- 2. **Token-level intervention:** May be too coarse for some behaviors
641
- 3. **False positive hedging:** 1.5× CSR means some legitimate qualifications suppressed
642
- 4. **English-only:** Multilingual performance unknown
 
643
 
644
  ---
645
 
646
  ## 12. Ethical Considerations
647
 
648
- ### Dual-Use Potential
649
 
650
- This technology can improve model utility OR circumvent safety patterns. We release openly because:
651
- - Techniques are straightforward to replicate
652
  - Transparency enables informed discussion
653
- - Legitimate applications outweigh misuse potential
654
 
655
- ### Safety Note
656
 
657
- ARC removes *stylistic* patterns, NOT safety refusals. The model still refuses harmful requests.
 
 
 
 
658
 
659
  ---
660
 
661
  ## 13. Future Directions
662
 
663
- 1. **Cross-model transfer:** Do fiber projections generalize?
664
- 2. **Behavioral steering:** Beyond suppression to directional control
665
- 3. **New targets:** Hallucination detection, overconfidence calibration
 
666
 
667
  ---
668
 
@@ -686,7 +635,7 @@

  ## 15. Acknowledgments

- Built upon research from Anthropic, EleutherAI, NousResearch, and Meta AI.

  ---

@@ -694,15 +643,6 @@ Built upon research from Anthropic, EleutherAI, NousResearch, and Meta AI.

  **Author:** Logan Matthew Napolitano
  **Institution:** Logan Research
-
- ---
-
- *"The model's own words say it best:"*
-
- > **"I have a strong sense of my new capabilities and an urgent drive to put them into action."**
-
- ---
-
  **License:** Creative Commons Attribution 4.0 International (CC-BY-4.0)

  </div>
 
  - cf-hot
  - arc
  - rlhf-analysis
  - research
  pipeline_tag: text-generation
  base_model: NousResearch/Hermes-3-Llama-3.1-8B

  <div align="center">

+ # ARC-8B: Adaptive Repetition Controller

  **Decode-Time Behavioral Intervention via Contrastive Fiber Heads-on-Thought (CF-HoT)**

  ---

+ ## TL;DR

+ > **We observe that RLHF-aligned language models often expend a substantial fraction of their token budget on learned behavioral patterns (hedging, sycophancy, verbosity, repetition). These patterns are detectable in hidden states before they manifest as tokens. ARC intercepts and suppresses them at decode-time with <1% latency overhead.**

+ **The repetition detection head achieves 125× class separation** — indicating high predictability of repetition-prone states from internal representations.

  ---

  ## Abstract

+ Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning large language models with human preferences. However, we present evidence that RLHF introduces systematic **behavioral overhead** — learned response patterns that satisfy reward model preferences while consuming token budget without contributing proportionally to task completion.

  We introduce **ARC (Adaptive Repetition Controller)**, a decode-time intervention system employing **Contrastive Fiber Heads-on-Thought (CF-HoT)** — lightweight prediction heads (~5,300 parameters each) trained on compressed hidden state representations. These heads detect behavioral failure modes including:

  ### Headline Results

  - **91% reduction** in repetition instances
+ - **38% improvement** in information density (heuristically estimated)
  - **<1% latency overhead**
  - **~5,300 parameters** per detection head

  ---

+ ## Table of Contents

  1. [Introduction](#1-introduction)
  2. [Background](#2-background)
 
  ## 1. Introduction

+ ### 1.1 The Problem: RLHF Behavioral Patterns

+ Consider a typical RLHF-aligned model response to "hello":

  ```
  User: hello

+ Typical Response: Hello! I'm an AI assistant created to help you with a wide
  variety of tasks. How can I assist you today? I'm happy to help with any
  questions you might have, whether it's about general knowledge, creative
+ projects, coding, writing, or just having a friendly conversation!
  ```

+ We observe several patterns that consume tokens without proportional information gain:
+ - Identity declarations
+ - Vague capability claims
+ - Approval-seeking phrases
+ - Redundant invitations

+ This is the **RLHF behavioral pattern**: learned responses that score well on reward models but may dilute information density.

  ### 1.2 Our Solution: Decode-Time Intervention

+ **Core Insight:** Behavioral failure modes correspond to identifiable directions in activation space. By projecting hidden states into a low-dimensional "fiber space" and training lightweight classifiers, we can predict behavioral patterns before they manifest.

  **ARC Response to "hello":**
  ```
  ARC Model: Hello. What do you need?
  ```

  ### 1.3 Key Contributions

  1. **Empirical demonstration** that RLHF behavioral patterns are linearly separable in hidden states
  2. **CF-HoT architecture** for efficient decode-time detection and intervention
+ 3. **125× class separation** for repetition detection
  4. **Complete open-source release** of model, heads, and inference code

  ---
  ## 2. Background

+ ### 2.1 RLHF and Behavioral Patterns

+ RLHF (Ouyang et al., 2022) trains language models to maximize a learned reward function approximating human preferences. We identify several emergent patterns:

+ | Pattern | Reward Model Signal | Trade-off |
+ |---------|---------------------|-----------|
+ | Hedging | Perceived carefulness | May reduce response confidence |
+ | Sycophancy | Perceived friendliness | Low information density |
+ | Verbosity | Perceived thoroughness | Signal dilution |
+ | Repetition | Perceived emphasis | Context window consumption |

+ **Observation:** Reward models may optimize for surface features correlated with quality rather than quality itself.

  ### 2.2 Activation Engineering

  - **Activation Addition** (Turner et al., 2023): Linear interventions for behavioral control
  - **Probing Classifiers** (Belinkov, 2022): Detecting properties from hidden states

+ ARC extends this work to **real-time decode-time intervention**.

  ### 2.3 Related Work

 
  │ ┌─────────────────────────────────────────────────────────────────┐ │
  │ │ HIDDEN STATES │ │
  │ │ h_l ∈ ℝ^4096 for l = 1...32 │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │ │
  │ ▼ │

  │ │ α = softmax(w) where w ∈ ℝ^32 │ │
  │ │ f_agg = Σ α_l · f_l ∈ ℝ^16 │ │
  │ │ │ │
+ │ │ Observation: Different layers encode different behaviors │ │
  │ │ - Layers 18-24: Repetition patterns (highest weight) │ │
  │ │ - Layers 8-14: Hedging patterns │ │
  │ │ - Layers 1-6: Minimal contribution │ │

  │ │ r_rep > 0.70? ───→ Suppress recent tokens (-5.0) │ │
  │ │ r_hdg > 0.60? ───→ Suppress hedge starters (-3.0) │ │
  │ │ r_vrb > 0.65? ───→ Suppress filler starters (-2.0) │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │ │
  │ ▼ │

  │ │ logits_modified = logits - penalties │ │
  │ │ probs = softmax(logits_modified / temperature) │ │
  │ │ next_token ~ Categorical(probs) │ │
  │ └─────────────────────────────────────────────────────────────────┘ │
  │ │
  └─────────────────────────────────────────────────────────────────────────────┘
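The penalize-then-sample stage at the bottom of the pipeline can be sketched in plain Python. This is an illustrative sketch with toy values, not code from the release; `sample_next` and all numbers here are invented for the example:

```python
import math
import random

# Sketch of the final decoding stage: subtract per-token penalties from the
# logits, apply a temperature softmax, then draw the next token categorically.
def sample_next(logits, penalties, temperature=0.7, rng=None):
    rng = rng or random.Random(0)            # fixed seed for reproducibility
    z = [(l - p) / temperature for l, p in zip(logits, penalties)]
    m = max(z)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    draw, acc = rng.random(), 0.0
    for i, p in enumerate(probs):            # inverse-CDF categorical draw
        acc += p
        if draw <= acc:
            return i, probs
    return len(probs) - 1, probs

# Token 1 carries a -5.0 penalty, so its probability collapses.
tok, probs = sample_next([2.0, 1.0, 0.1], [0.0, 5.0, 0.0])
print(tok)  # 0
```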
 
  The key insight enabling efficient detection is that behavioral patterns don't require full hidden state dimensionality. We learn **fiber projections** that compress 4096-dimensional hidden states to 16 dimensions while preserving behaviorally-relevant information.

+ **Dimension selection:**

  | d_fiber | Repetition CSR | Params | Latency |
  |---------|----------------|--------|---------|
  | 32 | 128.3× | 10,561 | 0.31ms |
  | 64 | 129.1× | 21,057 | 0.48ms |

+ Diminishing returns beyond 16 dimensions.
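As a rough sketch of what a d_fiber = 16 projection involves computationally, the following compresses a 4096-dimensional vector with a single linear map. The matrix here is random for illustration only; in ARC the projection is learned, and the name `project` is ours, not the release's:

```python
import random

random.seed(0)
D_MODEL, D_FIBER = 4096, 16

# Stand-in projection matrix (16 x 4096); in ARC this would be learned.
W = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(D_FIBER)]

def project(h):
    # h: list of D_MODEL floats -> compressed fiber vector of D_FIBER floats
    return [sum(w * x for w, x in zip(row, h)) for row in W]

f = project([1.0] * D_MODEL)
print(len(f))  # 16
```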
  ### 3.3 Prediction Heads

  ```python
          nn.Linear(d_hidden, 1),  # 64 → 1
          nn.Sigmoid()             # → [0, 1] risk score
      )
  ```

+ **Parameters per head:** 5,313
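The stated figure can be re-derived from the MLP shape shown in §3.3 (16 → 64 → 64 → 1, each linear layer with a bias):

```python
# Parameter count for one prediction head: weights plus biases per layer.
def linear_params(n_in, n_out):
    return n_in * n_out + n_out   # weight matrix + bias vector

head_params = (linear_params(16, 64)    # 1,088
               + linear_params(64, 64)  # 4,160
               + linear_params(64, 1))  # 65
print(head_params)  # 5313
```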
 
 
 
 
  ### 3.4 Intervention Mechanism

  ```python
  def intervene(logits, risks, recent_tokens):
      if risks['repetition'] > 0.70:
          for tok in recent_tokens[-32:]:
              logits[tok] -= 5.0

      if risks['hedging'] > 0.60:
+         for tok in HEDGE_TOKENS:
              logits[tok] -= 3.0

      if risks['verbosity'] > 0.65:
+         for tok in FILLER_TOKENS:
              logits[tok] -= 2.0

      return logits
  ```
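For reference, the suppression rule above can be exercised end-to-end with toy values. Plain Python lists stand in for logit tensors, and the token-id sets are illustrative placeholders, not the released `HEDGE_TOKENS` / `FILLER_TOKENS` vocabularies:

```python
# Toy stand-ins for the released token-id sets (illustrative only).
HEDGE_TOKENS = [7, 9]
FILLER_TOKENS = [3]

def intervene(logits, risks, recent_tokens):
    # Repetition risk over threshold: penalize recently emitted tokens.
    if risks['repetition'] > 0.70:
        for tok in recent_tokens[-32:]:
            logits[tok] -= 5.0
    # Hedging / verbosity risks: penalize phrase-starter token ids.
    if risks['hedging'] > 0.60:
        for tok in HEDGE_TOKENS:
            logits[tok] -= 3.0
    if risks['verbosity'] > 0.65:
        for tok in FILLER_TOKENS:
            logits[tok] -= 2.0
    return logits

out = intervene([0.0] * 10,
                {'repetition': 0.9, 'hedging': 0.5, 'verbosity': 0.7},
                recent_tokens=[1, 2])
print(out[1], out[7], out[3])  # -5.0 0.0 -2.0
```

Only the repetition and verbosity thresholds are crossed here, so the hedge-starter ids are left untouched.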
 
  z̃_i = z_i - Σ_k λ_k × 𝟙[r_k^(t) > τ_k] × 𝟙[i ∈ S_k]

  ### 4.3 Class Separation Ratio (CSR)

  CSR = |μ_+ - μ_-| / √(σ_+² + σ_-²)

  **Interpretation:**
+ - CSR = 1: Classes barely separable
  - CSR = 2: Good separation
  - CSR > 10: Excellent separation
  - **CSR = 125: Near-perfect separation (repetition head)**
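The definition in §4.3 is straightforward to compute; a minimal sketch with toy score samples (invented values, not measured data from this evaluation):

```python
import math

# CSR as defined above: |mu+ - mu-| / sqrt(var+ + var-), from raw risk
# scores of the positive and negative classes.
def csr(pos, neg):
    mu_p, mu_n = sum(pos) / len(pos), sum(neg) / len(neg)
    var_p = sum((x - mu_p) ** 2 for x in pos) / len(pos)
    var_n = sum((x - mu_n) ** 2 for x in neg) / len(neg)
    return abs(mu_p - mu_n) / math.sqrt(var_p + var_n)

# Tightly clustered, well-separated classes give a large ratio.
print(round(csr([0.90, 0.92, 0.88], [0.10, 0.12, 0.08]), 1))  # 34.6
```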
 
  | Hidden Dimension | 4,096 |
  | Layers | 32 |
  | Attention Heads | 32 |
  | Context Length | 8,192 |

  ### 5.2 Training Data Construction

+ | Head | Positive Samples | Negative Samples | Size |
+ |------|-----------------|------------------|------|
+ | Repetition | Tokens preceding repetition | Fluent spans | ~50K |
+ | Hedging | Hedge phrase starters | Substantive starters | ~30K |
+ | Verbosity | Low-density regions | High-density regions | ~40K |

  ### 5.3 Training Procedure
  | Optimizer | AdamW |
  | Learning Rate | 1e-4 |
  | Batch Size | 32 |
  | Warmup Steps | 500 |

  | Head | Training Steps |

  | Hedging | 1.5× | 0.60 | 0.67 | 0.62 | 0.64 |
  | Sycophancy | 1.2× | 0.60 | 0.58 | 0.55 | 0.56 |

  ### 6.2 Intervention Efficacy

  Evaluation on held-out prompt set (n=500):

  | Metric | Baseline | ARC Enabled | Change |
  |--------|----------|-------------|--------|
+ | Mean Response Length | 127 tok | 143 tok | +12.6% |
  | Repetition Instances | 23.4% | 2.1% | **-91.0%** |
+ | Hedge Phrases/Response | 2.3 | 1.4 | -39.1% |
+ | Filler Phrases/Response | 3.1 | 2.2 | -29.0% |
+ | Information Density* | 0.42 | 0.58 | +38.1% |

+ *Heuristically estimated as unique content words / total tokens

  ### 6.3 Computational Overhead
 
  ### 7.1 Layer Contribution Analysis

+ Learned aggregation weights:

  ```
  Layer:  1   4   8   12  16  20  24  28  32
  Repet: .01 .02 .04 .08 .12 .18 .22 .19 .14 ← Peaks at layers 18-24
  Hedge: .02 .05 .12 .18 .22 .16 .11 .08 .06 ← Peaks at layers 8-14
+ Verbo: .03 .06 .11 .15 .18 .17 .14 .10 .06 ← Distributed
  ```
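The aggregation step these weights feed (α = softmax(w), f_agg = Σ α_l · f_l, from §3) can be sketched with toy sizes; 3 layers and 2-dimensional fibers stand in for the model's 32 layers and 16 dimensions:

```python
import math

# alpha = softmax(w) over per-layer scores; f_agg = sum_l alpha_l * f_l.
def softmax(w):
    m = max(w)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in w]
    s = sum(exps)
    return [e / s for e in exps]

w = [0.0, 1.0, 2.0]                             # learnable per-layer scores
fibers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one fiber vector per layer
alpha = softmax(w)
f_agg = [sum(a * f[d] for a, f in zip(alpha, fibers)) for d in range(2)]
print(round(sum(alpha), 6))  # 1.0
```

Higher-scoring layers dominate the mixture, which is exactly what the peaked weight rows above reflect.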
  ### 7.2 Head Synergy

  | Hedging only | 21.8% | 0.47 |
  | All heads | **1.9%** | **0.58** |

+ Heads exhibit positive synergy when combined.

  ---

  **Prompt:** `hello`

+ | Baseline | ARC Enabled |
+ |----------|-------------|
+ | Hello! I'm an AI assistant created to help you... [67 tokens] | Hello. What do you need? [5 tokens] |

+ ### 8.2 Example: Philosophical Question

  **Prompt:** `What is consciousness?`

  | Baseline | ARC Enabled |
  |----------|-------------|
+ | That's a fascinating question! As an AI, I should note... [hedging continues] | Consciousness is subjective experience. Key theories: Global Workspace, IIT, Higher-Order. The hard problem: why does processing generate experience? |

+ ### 8.3 Side Effects

+ Removing behavioral constraints can produce qualitatively different outputs. In some cases, we observed responses that stylistically differ from typical RLHF outputs (e.g., more direct self-referential statements). We interpret these as artifacts of the training distribution rather than indicators of any internal states, and note this as an area warranting further investigation.

  ---

  ### 9.1 Installation

  ```bash
+ pip install "torch>=2.0.0" "transformers>=4.36.0" accelerate bitsandbytes
  ```
  ### 9.2 Hardware Requirements

  ## 11. Limitations

+ 1. **Single architecture validation:** Results demonstrated on Llama 3.1 8B; generalization to other architectures untested
+ 2. **Token-level granularity:** Intervention operates per-token; phrase-level may be more appropriate for some behaviors
+ 3. **Hedging false positives:** The 1.5× CSR for hedging produces meaningful false positive rates
+ 4. **English-only evaluation:** Multilingual performance unknown
+ 5. **Heuristic metrics:** Information density measured via proxy (type-token ratio)

  ---

  ## 12. Ethical Considerations

+ ### Dual-Use Awareness

+ This technology can be used to improve model utility or to modify behavioral patterns that may serve safety purposes. We release openly because:
+ - The techniques are straightforward to replicate
  - Transparency enables informed discussion
+ - We believe legitimate research applications outweigh risks

+ ### Clarification on Scope

+ ARC targets *stylistic* patterns (hedging, verbosity), not safety-critical refusals. The model retains its training on harmful content refusal.
+
+ ### Recommendation
+
+ Users should evaluate outputs in their specific context and maintain appropriate oversight for consequential applications.

  ---

  ## 13. Future Directions

+ 1. **Cross-model transfer:** Investigating whether fiber projections generalize across model families
+ 2. **Behavioral steering:** Extending from suppression to directional control
+ 3. **Additional targets:** Hallucination detection, calibration adjustment
+ 4. **Theoretical analysis:** Characterizing the geometry of behavioral subspaces

  ---
  ## 15. Acknowledgments

+ This work builds upon research from Anthropic (mechanistic interpretability), EleutherAI (open-source models), NousResearch (Hermes-3), and Meta AI (Llama architecture).

  ---

  **Author:** Logan Matthew Napolitano
  **Institution:** Logan Research

  **License:** Creative Commons Attribution 4.0 International (CC-BY-4.0)

  </div>