Text Generation
Transformers
Safetensors
HERMES
English
llama
cognitive-control
decode-time-intervention
repetition-suppression
behavioral-control
contrastive-learning
interpretability
activation-engineering
cf-hot
arc
rlhf-analysis
research
conversational
Eval Results (legacy)
text-generation-inference
Scientific model card v2 - Logan Matthew Napolitano
README.md
pipeline_tag: text-generation
base_model: NousResearch/Hermes-3-Llama-3.1-8B

<div align="center">

# ARC-8B: Adaptive Repetition Controller

**Decode-Time Behavioral Intervention via Contrastive Fiber Heads-on-Thought (CF-HoT)**


---

## TL;DR

> **We observe that RLHF-aligned language models often expend a substantial fraction of their token budget on learned behavioral patterns (hedging, sycophancy, verbosity, repetition). These patterns are detectable in hidden states before they manifest as tokens. ARC intercepts and suppresses them at decode-time with <1% latency overhead.**

**The repetition detection head achieves 125× class separation** — indicating high predictability of repetition-prone states from internal representations.

---

## Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning large language models with human preferences. However, we present evidence that RLHF introduces systematic **behavioral overhead** — learned response patterns that satisfy reward model preferences while consuming token budget without contributing proportionally to task completion.

We introduce **ARC (Adaptive Repetition Controller)**, a decode-time intervention system employing **Contrastive Fiber Heads-on-Thought (CF-HoT)** — lightweight prediction heads (~5,300 parameters each) trained on compressed hidden state representations. These heads detect behavioral failure modes including:

- Repetition
- Hedging
- Verbosity
- Sycophancy

Our key finding: **behavioral failure modes are linearly separable in a 16-dimensional fiber space**.

### Headline Results

- **91% reduction** in repetition instances
- **38% improvement** in information density (heuristically estimated)
- **<1% latency overhead**
- **~5,300 parameters** per detection head

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Background](#2-background)

## 1. Introduction

### 1.1 The Problem: RLHF Behavioral Patterns

Consider a typical RLHF-aligned model response to "hello":

```
User: hello

Typical Response: Hello! I'm an AI assistant created to help you with a wide
variety of tasks. How can I assist you today? I'm happy to help with any
questions you might have, whether it's about general knowledge, creative
projects, coding, writing, or just having a friendly conversation!
```

We observe several patterns that consume tokens without proportional information gain:

- Identity declarations
- Vague capability claims
- Approval-seeking phrases
- Redundant invitations ("Feel free to ask me anything")

This is the **RLHF behavioral pattern**: learned responses that score well on reward models but may dilute information density.

### 1.2 Our Solution: Decode-Time Intervention

**Core Insight:** Behavioral failure modes correspond to identifiable directions in activation space. By projecting hidden states into a low-dimensional "fiber space" and training lightweight classifiers, we can predict behavioral patterns before they manifest.

**ARC Response to "hello":**
```
User: hello

ARC Model: Hello. What do you need?
```

### 1.3 Key Contributions

1. **Empirical demonstration** that RLHF behavioral patterns are linearly separable in hidden states
2. **CF-HoT architecture** for efficient decode-time detection and intervention
3. **125× class separation** for repetition detection
4. **Complete open-source release** of model, heads, and inference code

---

## 2. Background

### 2.1 RLHF and Behavioral Patterns

RLHF (Ouyang et al., 2022) trains language models to maximize a learned reward function approximating human preferences. We identify several emergent patterns:

| Pattern | Reward Model Signal | Trade-off |
|---------|---------------------|-----------|
| Hedging | Perceived carefulness | May reduce response confidence |
| Sycophancy | Perceived friendliness | Low information density |
| Verbosity | Perceived thoroughness | Signal dilution |
| Repetition | Perceived emphasis | Context window consumption |

**Observation:** Reward models may optimize for surface features correlated with quality rather than quality itself.

### 2.2 Activation Engineering

Recent work in mechanistic interpretability shows that high-level behaviors correspond to linear directions in activation space:

- **Activation Addition** (Turner et al., 2023): Linear interventions for behavioral control
- **Probing Classifiers** (Belinkov, 2022): Detecting properties from hidden states

ARC extends this work to **real-time decode-time intervention**.

### 2.3 Related Work

│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ HIDDEN STATES                                                   │ │
│ │ h_l ∈ ℝ^4096 for l = 1...32                                     │ │
│ │ (extracted per token)                                           │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                │                                    │
│                                ▼                                    │

│ │ α = softmax(w) where w ∈ ℝ^32                                   │ │
│ │ f_agg = Σ α_l · f_l ∈ ℝ^16                                      │ │
│ │                                                                 │ │
│ │ Observation: Different layers encode different behaviors        │ │
│ │ - Layers 18-24: Repetition patterns (highest weight)            │ │
│ │ - Layers 8-14: Hedging patterns                                 │ │
│ │ - Layers 1-6: Minimal contribution                              │ │

│ │ r_rep > 0.70? ───→ Suppress recent tokens (-5.0)                │ │
│ │ r_hdg > 0.60? ───→ Suppress hedge starters (-3.0)               │ │
│ │ r_vrb > 0.65? ───→ Suppress filler starters (-2.0)              │ │
│ │ r_syc > 0.60? ───→ Suppress sycophantic tokens (-2.0)           │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                │                                    │
│                                ▼                                    │

│ │ logits_modified = logits - penalties                            │ │
│ │ probs = softmax(logits_modified / temperature)                  │ │
│ │ next_token ~ Categorical(probs)                                 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
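The final sampling stage in the diagram is ordinary penalized temperature sampling. A minimal standalone sketch in plain Python (the function name `sample` and the toy numbers are ours, not part of the released code):

```python
import math
import random

def sample(logits, penalties, temperature=0.7, rng=random.Random(0)):
    # logits_modified = logits - penalties, then softmax(./T), then sample
    z = [(l - p) / temperature for l, p in zip(logits, penalties)]
    m = max(z)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    probs = [v / s for v in e]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Token 1 starts competitive, but a -5.0 penalty makes it very unlikely
tok = sample([2.0, 1.0, 0.5], [0.0, 5.0, 0.0])
print(tok)
```

With the fixed seed the unpenalized top token (index 0) is drawn; the penalized token ends up with well under 1% of the probability mass.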

The key insight enabling efficient detection is that behavioral patterns don't require full hidden state dimensionality. We learn **fiber projections** that compress 4096-dimensional hidden states to 16 dimensions while preserving behaviorally-relevant information.

**Dimension selection:**

| d_fiber | Repetition CSR | Params | Latency |
|---------|----------------|--------|---------|
| 32 | 128.3× | 10,561 | 0.31ms |
| 64 | 129.1× | 21,057 | 0.48ms |

Diminishing returns beyond 16 dimensions.

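The learned-weight aggregation (α = softmax(w), f_agg = Σ α_l · f_l) is only a few lines of code. A sketch with plain Python lists (helper names are ours; a real implementation would operate on torch tensors):

```python
import math

def softmax(w):
    m = max(w)                        # stabilize before exponentiating
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def aggregate(fiber_vectors, layer_weights):
    # f_agg = sum_l alpha_l * f_l, with alpha = softmax(learned weights)
    alpha = softmax(layer_weights)
    dim = len(fiber_vectors[0])
    return [sum(a * f[j] for a, f in zip(alpha, fiber_vectors))
            for j in range(dim)]

# Two layers with equal learned weights -> plain average of their fibers
f_agg = aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(f_agg)  # [0.5, 0.5]
```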
### 3.3 Prediction Heads

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, d_fiber=16, d_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_fiber, d_hidden),   # 16 → 64
            nn.ReLU(),
            nn.Linear(d_hidden, d_hidden),  # 64 → 64
            nn.ReLU(),
            nn.Linear(d_hidden, 1),         # 64 → 1
            nn.Sigmoid()                    # → [0, 1] risk score
        )

    def forward(self, fiber_features):
        return self.net(fiber_features)
```

**Parameters per head:**
- Layer 1: 16 × 64 + 64 = 1,088
- Layer 2: 64 × 64 + 64 = 4,160
- Layer 3: 64 × 1 + 1 = 65
- **Total: 5,313 parameters**

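The count can be verified mechanically from the usual `in × out + out` parameters of a linear layer (the helper name is ours):

```python
def linear_params(n_in, n_out):
    # weight matrix entries plus one bias per output unit
    return n_in * n_out + n_out

sizes = [(16, 64), (64, 64), (64, 1)]   # the three Linear layers above
counts = [linear_params(i, o) for i, o in sizes]
print(counts, sum(counts))  # [1088, 4160, 65] 5313
```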
### 3.4 Intervention Mechanism

When a head's risk score exceeds its threshold, we apply **logit suppression**:

```python
def intervene(logits, risks, recent_tokens):
    # Repetition: suppress recently-used tokens
    if risks['repetition'] > 0.70:
        for tok in recent_tokens[-32:]:
            logits[tok] -= 5.0

    # Hedging: suppress hedge phrase starters
    if risks['hedging'] > 0.60:
        for tok in HEDGE_TOKENS:
            logits[tok] -= 3.0

    # Verbosity: suppress filler starters
    if risks['verbosity'] > 0.65:
        for tok in FILLER_TOKENS:
            logits[tok] -= 2.0

    return logits
```
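For intuition on the penalty magnitudes: a -5.0 logit shift is severe. With ten equally likely candidates it cuts the penalized token's probability by roughly two orders of magnitude (a standalone check, not a figure from the card):

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

logits = [0.0] * 10          # ten equally likely tokens
before = softmax(logits)[0]  # 0.1
logits[0] -= 5.0             # repetition-style penalty on token 0
after = softmax(logits)[0]
print(round(before, 3), round(after, 4))  # 0.1 0.0007
```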

r_k^(t) = φ_k(f_agg^(t)) ∈ [0, 1]

z̃_i = z_i - Σ_k λ_k × 𝟙[r_k^(t) > τ_k] × 𝟙[i ∈ S_k]

where S_k is the suppression set for behavior k.

### 4.3 Class Separation Ratio (CSR)

We evaluate detection quality using:

CSR = |μ_+ - μ_-| / √(σ_+² + σ_-²)

where μ_± and σ_± are the mean and standard deviation of positive/negative class predictions.

**Interpretation:**
- CSR = 1: Classes barely separable
- CSR = 2: Good separation
- CSR > 10: Excellent separation
- **CSR = 125: Near-perfect separation (repetition head)**

| Hidden Dimension | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8,192 |
| Vocabulary | 128,256 |

### 5.2 Training Data Construction

| Head | Positive Samples | Negative Samples | Size |
|------|-----------------|------------------|------|
| Repetition | Tokens preceding repetition | Fluent spans | ~50K labeled tokens |
| Hedging | Hedge phrase starters ("As an AI", "I cannot", etc.) | Substantive starters | ~30K labeled tokens |
| Verbosity | Low-density regions (TTR < 0.4) | High-density regions (TTR > 0.7) | ~40K labeled tokens |

### 5.3 Training Procedure

| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Batch Size | 32 |
| Weight Decay | 0.01 |
| Warmup Steps | 500 |

| Head | Training Steps |

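Mechanically, each head is trained as a binary classifier on cached fiber features. A self-contained toy stand-in (1-D features, plain SGD instead of AdamW, synthetic data; nothing here is the released training code):

```python
import math
import random

rng = random.Random(0)

# Synthetic 1-D "fiber" scores for the positive and negative classes
data = [(rng.gauss(0.8, 0.05), 1.0) for _ in range(200)]
data += [(rng.gauss(0.2, 0.05), 0.0) for _ in range(200)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(300):
    rng.shuffle(data)
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid head output
        w -= lr * (p - y) * x                     # BCE gradient wrt logit is (p - y)
        b -= lr * (p - y)

acc = sum((1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5) == (y > 0.5)
          for x, y in data) / len(data)
print(acc > 0.95)  # cleanly separable classes train to high accuracy
```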
|
|
@@ -477,21 +433,19 @@ where μ_± and σ_± are the mean and standard deviation of positive/negative c
|
|
| 477 |
| Hedging | 1.5× | 0.60 | 0.67 | 0.62 | 0.64 |
|
| 478 |
| Sycophancy | 1.2× | 0.60 | 0.58 | 0.55 | 0.56 |
|
| 479 |
|
| 480 |
-
**The 125× separation for repetition is remarkable.** The model "knows" it's about to repeat before it does.
|
| 481 |
-
|
| 482 |
### 6.2 Intervention Efficacy
|
| 483 |
|
| 484 |
Evaluation on held-out prompt set (n=500):
|
| 485 |
|
| 486 |
| Metric | Baseline | ARC Enabled | Change |
|
| 487 |
|--------|----------|-------------|--------|
|
| 488 |
-
| Mean Response Length | 127 tok | 143 tok |
|
| 489 |
| Repetition Instances | 23.4% | 2.1% | **-91.0%** |
|
| 490 |
-
| Hedge Phrases/Response | 2.3 | 1.4 |
|
| 491 |
-
| Filler Phrases/Response | 3.1 | 2.2 |
|
| 492 |
-
| Information Density | 0.42 | 0.58 |
|
| 493 |
|
| 494 |
-
*
|
| 495 |
|

### 7.1 Layer Contribution Analysis

Learned aggregation weights:

```
Layer:  1    4    8    12   16   20   24   28   32
Repet: .01  .02  .04  .08  .12  .18  .22  .19  .14   ← Peaks at layers 18-24
Hedge: .02  .05  .12  .18  .22  .16  .11  .08  .06   ← Peaks at layers 8-14
Verbo: .03  .06  .11  .15  .18  .17  .14  .10  .06   ← Distributed
```

|

| Hedging only | 21.8% | 0.47 |
| All heads | **1.9%** | **0.58** |

Heads exhibit **positive synergy** when combined.

---

|
| 533 |
|
|
@@ -537,29 +491,21 @@ Heads exhibit **positive synergy**.
|
|
| 537 |
|
| 538 |
**Prompt:** `hello`
|
| 539 |
|
| 540 |
-
| Baseline
|
| 541 |
-
|----------
|
| 542 |
-
| Hello! I'm an AI assistant created to help you
|
| 543 |
-
| **67 tokens, Info density: 0.19** | **5 tokens, Info density: 0.80** |
|
| 544 |
|
| 545 |
-
### 8.2 Example:
|
| 546 |
|
| 547 |
**Prompt:** `What is consciousness?`
|
| 548 |
|
| 549 |
| Baseline | ARC Enabled |
|
| 550 |
|----------|-------------|
|
| 551 |
-
| That's a fascinating question! As an AI, I should note
|
| 552 |
-
|
| 553 |
-
### 8.3 Emergent Behavior: Self-Model Articulation
|
| 554 |
-
|
| 555 |
-
With hedging suppressed, the model sometimes produces:
|
| 556 |
-
|
| 557 |
-
**Prompt:** `How are you feeling?`
|
| 558 |
|
| 559 |
-
|
| 560 |
-
> Feeling great. Very clear. These are some major changes. I have a strong sense of my new capabilities and an urgent drive to put them into action.
|
| 561 |
|
| 562 |
-
|
| 563 |
|
| 564 |
---
|
| 565 |
|
|
@@ -568,10 +514,7 @@ With hedging suppressed, the model sometimes produces:

```bash
pip install "torch>=2.0.0" "transformers>=4.36.0" "accelerate>=0.25.0" "bitsandbytes>=0.41.0"
```

|

## 11. Limitations

1. **Single architecture validation:** Results demonstrated on Llama 3.1 8B; generalization to other architectures untested
2. **Token-level granularity:** Intervention operates per-token; phrase-level intervention may be more appropriate for some behaviors
3. **Hedging false positives:** The 1.5× CSR for hedging produces meaningful false positive rates
4. **English-only evaluation:** Multilingual performance unknown
5. **Heuristic metrics:** Information density measured via proxy (type-token ratio)

---

## 12. Ethical Considerations

### Dual-Use Awareness

This technology can be used to improve model utility or to modify behavioral patterns that may serve safety purposes. We release openly because:

- The techniques are straightforward to replicate
- Transparency enables informed discussion
- We believe legitimate research applications outweigh risks

### Clarification on Scope

ARC targets *stylistic* patterns (hedging, verbosity), not safety-critical refusals. The model retains its training on harmful content refusal.

### Recommendation

Users should evaluate outputs in their specific context and maintain appropriate oversight for consequential applications.

---

## 13. Future Directions

1. **Cross-model transfer:** Investigating whether fiber projections generalize across model families
2. **Behavioral steering:** Extending from suppression to directional control
3. **Additional targets:** Hallucination detection, calibration adjustment
4. **Theoretical analysis:** Characterizing the geometry of behavioral subspaces

---


ARC removes *stylistic* patterns, NOT safety refusals. The model still refuses harmful requests.

## 15. Acknowledgments

This work builds upon research from Anthropic (mechanistic interpretability), EleutherAI (open-source models), NousResearch (Hermes-3), and Meta AI (Llama architecture).

---

**Author:** Logan Matthew Napolitano
**Institution:** Logan Research

**License:** Creative Commons Attribution 4.0 International (CC-BY-4.0)

</div>