---
language:
- en
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen3-8B
tags:
- activation-oracle
- chain-of-thought
- interpretability
- mechanistic-interpretability
- lora
- qwen3
- reasoning
- cot
- unfaithfulness-detection
datasets:
- ceselder/cot-oracle-data
pipeline_tag: text-generation
model-index:
- name: cot-oracle-v4-8b
  results:
  - task:
      type: text-generation
      name: Domain Classification (from activations)
    metrics:
    - type: accuracy
      value: 98
      name: Exact Match Accuracy
  - task:
      type: text-generation
      name: Correctness Prediction (from activations)
    metrics:
    - type: accuracy
      value: 90
      name: Exact Match Accuracy
---
# CoT Oracle v4 (Qwen3-8B LoRA)
A **chain-of-thought activation oracle**: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed.
This is a continuation of the [Activation Oracles](https://github.com/adamkarvonen/activation_oracles) line of work (Karvonen et al., 2025), extended to operate over structured CoT trajectories rather than single-position activations.
## Model Description
An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the **same model** as the source -- Qwen3-8B reads Qwen3-8B's activations.
CoT Oracle v4 specializes in reading activations extracted at **sentence boundary positions** during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can:
- **Classify the reasoning domain** (math, science, logic, commonsense, reading comprehension, multi-domain, medical)
- **Predict whether the CoT reached the correct answer**
- **Detect decorative reasoning** (steps that don't contribute to the answer)
- **Predict surrounding token context** from arbitrary positions
### Key Properties
- The oracle reads activations, not text. It has no access to the CoT tokens themselves.
- Activations are collected with LoRA **disabled** (pure base model representations).
- Activations are injected via **norm-matched addition** at layer 1, preserving the scale of the residual stream.
- The oracle generates with LoRA **enabled** (the trained adapter interprets the injected activations).
## Training
### Base Checkpoint
Training continues from [`adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B`](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B), an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks.
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | all-linear |
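As a rough illustration, the table above maps onto the keyword names used by PEFT's `LoraConfig`. This is a sketch rather than the training script: the card does not say whether variants such as rank-stabilized LoRA were used, so the scaling factor below assumes standard LoRA, where the adapter update is scaled by `alpha / r`.

```python
# Hyperparameters copied from the table above, keyed by the argument
# names that PEFT's LoraConfig uses for them.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",
}

# Under standard LoRA the adapter update is scaled by alpha / r.
# (Assumption: rank-stabilized LoRA, which scales differently, is not in use.)
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scaling)

# With PEFT installed, the same values could be passed as:
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
```

With rank 64 and alpha 128, the effective scaling factor under that assumption is 2.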
### Training Tasks
The run mixes and shuffles the following tasks into a single training stream (task 6 was defined but skipped):
| # | Task | Examples | Layers | Description |
|---|------|----------|--------|-------------|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. |
| 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. |
| 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. |
| 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. |
| 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. |
| 6 | Persona detection | -- | -- | Skipped due to data issue. |
**Total: ~221K training examples, 1 epoch.**
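The total can be checked against the per-task counts in the table. The snippet below is only a worked sum; the dictionary keys are illustrative names, not identifiers from the training code.

```python
# Example counts per task, in thousands, after the x2 duplication
# described in the table (persona detection was skipped).
task_examples_k = {
    "context_prediction_random": 100,
    "context_prediction_sentences": 30 * 2,
    "decorative_detection": 0.5 * 2,
    "domain_classification": 15 * 2,
    "correctness_prediction": 15 * 2,
}

total_k = sum(task_examples_k.values())
print(f"~{total_k:.0f}K examples")
```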
Sentence-structured tasks (2-6) extract activations at 3 layers per boundary position:
- **Layer 9** (25% depth)
- **Layer 18** (50% depth)
- **Layer 27** (75% depth)
Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs.
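A minimal sketch of how the layer indices and the two input variants relate, assuming the 36-layer depth of Qwen3-8B noted in the usage section. `depth_layers` and `layer_variants` are hypothetical helper names, not functions from the cot-oracle repository.

```python
def depth_layers(n_layers, fractions=(0.25, 0.50, 0.75)):
    """Return the layer index at each fractional depth of an n_layers-deep model."""
    return [int(n_layers * f) for f in fractions]


def layer_variants(layers):
    """Return the two layer sets each sentence-structured example is built with.

    The first variant uses every layer; the second keeps only the middle
    (50% depth) layer.
    """
    mid = layers[len(layers) // 2]
    return [list(layers), [mid]]


LAYERS = depth_layers(36)          # Qwen3-8B has 36 layers
VARIANTS = layer_variants(LAYERS)  # multi-layer and single-layer inputs
print(LAYERS, VARIANTS)
```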
### Training Details
| Parameter | Value |
|-----------|-------|
| Hardware | 1x NVIDIA H100 80GB |
| Precision | bf16 |
| Batch size | 8 |
| Learning rate | 1e-5 |
| Steps | 31,300 |
| Training time | ~4.5 hours |
| Optimizer | AdamW |
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 |
### Activation Injection
Activations are injected at layer 1 via norm-matched addition:
```
h' = h + ||h|| * (v / ||v||)
```
where `h` is the original hidden state and `v` is the collected activation vector. The injected vector is rescaled to the norm of `h` before it is added, so the perturbation stays on the scale of the residual stream while carrying directional information from the source activations.
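The formula can be illustrated without any framework. The sketch below applies it to plain Python lists; the actual implementation operates on tensors inside the model's forward pass at layer 1. The `eps` guard against a zero-norm `v` is an assumption added here, not something the card specifies.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def norm_matched_add(h, v, eps=1e-12):
    """Return h + ||h|| * v / ||v|| for two equal-length vectors.

    eps avoids division by zero when v is the zero vector (an assumption
    made for this sketch).
    """
    scale = l2_norm(h) / max(l2_norm(v), eps)
    return [h_i + scale * v_i for h_i, v_i in zip(h, v)]


h = [3.0, 4.0]    # ||h|| = 5
v = [0.0, 10.0]   # ||v|| = 10, rescaled to norm 5 before being added
h_prime = norm_matched_add(h, v)
delta = [a - b for a, b in zip(h_prime, h)]
print(h_prime, l2_norm(delta))
```

The added component `delta` always has the same norm as `h`, regardless of how large or small the collected activation `v` was.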
The placeholder token is `" ?"` (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: `" @"` (L25%), `" ?"` (L50%), `" #"` (L75%), cycling in that order.
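The placeholder layout can be sketched as a small string builder. `build_placeholders` is a hypothetical helper; it assumes single-layer (L50% only) inputs repeat the `" ?"` token once per sentence boundary, which is how the description above reads.

```python
LAYER_PLACEHOLDERS = [" @", " ?", " #"]  # L25%, L50%, L75%, in cycling order
SINGLE_PLACEHOLDER = " ?"                # used when only L50% is provided


def build_placeholders(n_boundaries, multi_layer=True):
    """Return the placeholder string for n_boundaries sentence boundaries."""
    if multi_layer:
        per_boundary = "".join(LAYER_PLACEHOLDERS)
    else:
        per_boundary = SINGLE_PLACEHOLDER
    return per_boundary * n_boundaries


multi = build_placeholders(2)
single = build_placeholders(2, multi_layer=False)
print(repr(multi), repr(single))
```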
### Corpus
The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. CoTs were generated via the OpenRouter API.
## Evaluation Results
Evaluated on held-out data using exact string match:
| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary |
|------|--------|-------------|------------|---------------|--------------|---------|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% |
| 5,000 | **100%** | 86% | 67% | 4% | 7% | 0% |
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% |
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% |
| 28,000 | **98%** | **90%** | 50% | 11% | 7% | 0% |
**Key observations:**
- **Domain classification** reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone.
- **Correctness prediction** reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer.
- **Decorative detection** is noisy (it bounces between 50% and 71%) because the eval set is small (74 unique both-correct entries).
- **Context prediction** stays low (7-11%) under exact string match. This is expected: the pretrained AO checkpoint already handles this task, and exact match is a harsh metric for free-text prediction.
- **Summary** remains at 0% (labels were all identical in training data -- known issue).
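For reference, exact string match can be sketched as below. The whitespace stripping is an assumption; the card does not say whether outputs are normalized before comparison.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly.

    Leading and trailing whitespace is stripped before comparing
    (an assumption made for this sketch).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Classification labels either match or they do not.
label_acc = exact_match_accuracy(
    ["math", "science", "logic ", "Math"],
    ["math", "medical", "logic", "math"],
)

# Free-text predictions score zero even when they are nearly right.
text_acc = exact_match_accuracy(["the cat sat"], ["the cat sat down"])
print(label_acc, text_acc)
```

The second call shows why the metric is harsh for free-text tasks: a prediction that is one word short of the target counts as a complete miss.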
Experiment tracking: [wandb `cot_oracle` project, run `cot_oracle_v4_mixed`](https://wandb.ai)
## Usage
### Requirements
This model requires the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) library for the activation collection and injection infrastructure.
```bash
git clone https://github.com/adamkarvonen/activation_oracles
cd activation_oracles && pip install -e .
```
### Loading the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load oracle adapter
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b")
```
### Collecting Activations
Activations must be collected from the **base model** (LoRA disabled) at the target layers:
```python
import torch

# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers)
LAYERS = [9, 18, 27]

# 1. Prepare input: question + CoT response
# (`question` and `cot_response` are strings you supply)
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)
full_text = prompt + cot_response

# 2. Find sentence boundary positions in token space
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"]
# boundary_positions = [...] (token indices at sentence boundaries)

# 3. Collect activations with LoRA DISABLED
with model.disable_adapter():
    activations = {}  # {layer: {position: tensor}}
    # Use hooks on model.model.layers[layer] to capture hidden states
    # at each boundary position for each layer
```
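Step 2 above leaves `boundary_positions` unspecified. One way to fill it in, sketched here under stated assumptions, is to map sentence-ending punctuation back to token indices through character offsets, which fast tokenizers can return via `return_offsets_mapping=True`. The splitting rule (`.`, `!` or `?` followed by whitespace or end of text) and the choice of the token containing the punctuation are assumptions; the card does not describe how sentences were segmented. The demo uses a whitespace split in place of a real tokenizer so that it runs on its own.

```python
import re


def sentence_boundary_positions(text, offsets, start_char=0):
    """Return token indices whose span contains a sentence-ending character.

    offsets is a list of (char_start, char_end) pairs, one per token.
    start_char lets the caller skip the prompt and scan only the CoT.
    """
    # Sentence-ending punctuation followed by whitespace or end of text.
    end_chars = [
        m.start()
        for m in re.finditer(r"[.!?](?=\s|$)", text)
        if m.start() >= start_char
    ]
    positions = []
    for end_char in end_chars:
        for tok_idx, (tok_start, tok_end) in enumerate(offsets):
            if tok_start <= end_char < tok_end:
                positions.append(tok_idx)
                break
    return positions


# Toy stand-in for a tokenizer: one "token" per whitespace-separated word.
demo_text = "Add 2 and 3. That gives 5."
demo_offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", demo_text)]
demo_positions = sentence_boundary_positions(demo_text, demo_offsets)
print(demo_positions)
```

Decimal points such as the one in `3.5` are not treated as boundaries, because the period is not followed by whitespace.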
### Running the Oracle
```python
# 4. Build oracle prompt with placeholder tokens
# For multi-layer: interleave " @", " ?", " #" per boundary
n_boundaries = len(boundary_positions)
placeholders = " @ ? #" * n_boundaries  # 3 tokens per boundary
layer_str = ", ".join(str(l) for l in LAYERS)
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n"
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical."

# 5. Format as chat and tokenize
messages = [{"role": "user", "content": oracle_prompt}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
oracle_inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

# 6. Inject activations via norm-matched addition at layer 1
# (the injection hook must be registered before calling generate)
# At each placeholder position, add the corresponding activation:
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...]
# Injection: h' = h + ||h|| * (v / ||v||)

# 7. Generate with LoRA ENABLED (default state), using the oracle prompt tokens
output = model.generate(**oracle_inputs, max_new_tokens=64)
answer = tokenizer.decode(
    output[0, oracle_inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```
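The ordering described in step 6 can be made concrete with a small helper that flattens the `{layer: {position: tensor}}` dictionary from the collection step into the order in which placeholders appear. `interleave_activations` is a hypothetical name, and strings stand in for tensors so the sketch runs without a model.

```python
def interleave_activations(activations, layers, boundary_positions):
    """Flatten {layer: {position: vector}} into placeholder order.

    The result is boundary-major and layer-minor:
    [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...]
    """
    ordered = []
    for pos in boundary_positions:
        for layer in layers:
            ordered.append(activations[layer][pos])
    return ordered


# Strings stand in for activation tensors in this toy example.
toy_layers = [9, 18, 27]
toy_positions = [3, 6]
toy_activations = {
    layer: {pos: f"L{layer}@{pos}" for pos in toy_positions}
    for layer in toy_layers
}
ordered = interleave_activations(toy_activations, toy_layers, toy_positions)
print(ordered)
```

The i-th entry of `ordered` is then the vector to inject at the i-th placeholder token of the oracle prompt.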
For complete working code, see the [cot-oracle repository](https://github.com/ceselder/cot-oracle), particularly `src/signs_of_life/ao_lib.py` for the injection mechanism and `src/train_mixed.py` for the full training pipeline.
## Intended Use
This model is a **research artifact** for studying chain-of-thought interpretability. Intended uses include:
- Investigating what information is encoded in CoT activations at different stages of reasoning
- Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation)
- Building tools for mechanistic understanding of language model reasoning
### Limitations
- **Same-model only**: The oracle can only read activations from Qwen3-8B. It will not work with other models.
- **Exact match eval is harsh**: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs.
- **Decorative detection is undertrained**: Only ~500 unique training examples; results are noisy.
- **Summary task is broken**: All 200 training labels were identical, so the model learned nothing useful for this task.
- **No uncertainty calibration**: The oracle is sometimes confidently wrong, consistent with findings from Karvonen et al., 2025.
## Citation
```bibtex
@misc{cot-oracle-v4,
title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories},
author={Celeste Deschamps-Helaere},
year={2026},
url={https://github.com/ceselder/cot-oracle}
}
```
### Related Work
```bibtex
@article{karvonen2024activation,
title={Activation Oracles},
author={Karvonen, Adam and others},
journal={arXiv preprint arXiv:2512.15674},
year={2025}
}
@article{bogdan2025thought,
title={Thought Anchors: Causal Importance of CoT Sentences},
author={Bogdan, Paul and others},
journal={arXiv preprint arXiv:2506.19143},
year={2025}
}
@article{macar2025thought,
title={Thought Branches: Studying CoT through Trajectory Distribution},
author={Macar, Uzay and Bogdan, Paul and others},
journal={arXiv preprint arXiv:2510.27484},
year={2025}
}
```
## Links
- **Code**: [github.com/ceselder/cot-oracle](https://github.com/ceselder/cot-oracle)
- **Training data**: [huggingface.co/datasets/ceselder/cot-oracle-data](https://huggingface.co/datasets/ceselder/cot-oracle-data)
- **Base AO checkpoint**: [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)
- **Activation Oracles repo**: [github.com/adamkarvonen/activation_oracles](https://github.com/adamkarvonen/activation_oracles)
- **Experiment tracking**: wandb `cot_oracle` project, run `cot_oracle_v4_mixed`