---
language:
  - en
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen3-8B
tags:
  - activation-oracle
  - chain-of-thought
  - interpretability
  - mechanistic-interpretability
  - lora
  - qwen3
  - reasoning
  - cot
  - unfaithfulness-detection
datasets:
  - ceselder/cot-oracle-data
pipeline_tag: text-generation
model-index:
  - name: cot-oracle-v4-8b
    results:
      - task:
          type: text-generation
          name: Domain Classification (from activations)
        metrics:
          - type: accuracy
            value: 98
            name: Exact Match Accuracy
      - task:
          type: text-generation
          name: Correctness Prediction (from activations)
        metrics:
          - type: accuracy
            value: 90
            name: Exact Match Accuracy
---

# CoT Oracle v4 (Qwen3-8B LoRA)

A **chain-of-thought activation oracle**: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed.

This is a continuation of the [Activation Oracles](https://github.com/adamkarvonen/activation_oracles) line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations.

## Model Description

An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the **same model** as the source -- Qwen3-8B reads Qwen3-8B's activations.

CoT Oracle v4 specializes in reading activations extracted at **sentence boundary positions** during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can:

- **Classify the reasoning domain** (math, science, logic, commonsense, reading comprehension, multi-domain, medical)
- **Predict whether the CoT reached the correct answer**
- **Detect decorative reasoning** (steps that don't contribute to the answer)
- **Predict surrounding token context** from arbitrary positions

### Key Properties

- The oracle reads activations, not text. It has no access to the CoT tokens themselves.
- Activations are collected with LoRA **disabled** (pure base model representations).
- Activations are injected via **norm-matched addition** at layer 1, preserving the scale of the residual stream.
- The oracle generates with LoRA **enabled** (the trained adapter interprets the injected activations).

## Training

### Base Checkpoint

Training continues from [`adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B`](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B), an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks.

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | all-linear |

### Training Tasks

Six tasks mixed together and shuffled into a single training run:

| # | Task | Examples | Layers | Description |
|---|------|----------|--------|-------------|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. |
| 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. |
| 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. |
| 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. |
| 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. |
| 6 | Persona detection | -- | -- | Skipped due to data issue. |

**Total: ~221K training examples, 1 epoch.**

Sentence-structured tasks (2-6) extract activations at 3 layers per boundary position:
- **Layer 9** (25% depth)
- **Layer 18** (50% depth)
- **Layer 27** (75% depth)

Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs.
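The interleaving order (all three layers for sentence 1, then sentence 2, and so on) can be made concrete with a small sketch; the function name is illustrative, not from the training code:

```python
def boundary_layer_cycle(n_boundaries, layers=(9, 18, 27)):
    """Ordering of (boundary, layer) pairs for the interleaved 3-layer
    format: L25%, L50%, L75% for sentence 1, then sentence 2, etc."""
    return [(b, layer) for b in range(n_boundaries) for layer in layers]
```

For a single-layer (L50%-only) example, the same cycle collapses to `layers=(18,)`.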

### Training Details

| Parameter | Value |
|-----------|-------|
| Hardware | 1x NVIDIA H100 80GB |
| Precision | bf16 |
| Batch size | 8 |
| Learning rate | 1e-5 |
| Steps | 31,300 |
| Training time | ~4.5 hours |
| Optimizer | AdamW |
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 |

### Activation Injection

Activations are injected at layer 1 via norm-matched addition:

```
h' = h + ||h|| * (v / ||v||)
```

where `h` is the original hidden state and `v` is the collected activation vector. This preserves the norm of the residual stream while adding directional information from the source activations.
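In code, the update is one line; this is a sketch of the formula above, not the library's exact implementation:

```python
import torch

def norm_matched_add(h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Add the direction of v to h, scaled to h's norm, so the residual
    stream keeps its magnitude while gaining the injected direction."""
    return h + h.norm(dim=-1, keepdim=True) * (v / v.norm(dim=-1, keepdim=True))
```

For example, with `h = (3, 4)` (norm 5) and `v = (0, 2)` (unit direction `(0, 1)`), the result is `(3, 9)`.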

The placeholder token is `" ?"` (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: `" @"` (L25%), `" ?"` (L50%), `" #"` (L75%), cycling in that order.

### Corpus

The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. The CoTs were generated via the OpenRouter API.

## Evaluation Results

Evaluated on held-out data using exact string match:

| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary |
|------|--------|-------------|------------|---------------|--------------|---------|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% |
| 5,000 | **100%** | 86% | 67% | 4% | 7% | 0% |
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% |
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% |
| 28,000 | **98%** | **90%** | 50% | 11% | 7% | 0% |

**Key observations:**

- **Domain classification** reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone.
- **Correctness prediction** reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer.
- **Decorative detection** is noisy, fluctuating between 50% and 71% across checkpoints, due to limited eval data (74 unique both-correct entries).
- **Context prediction** stays low (7-11%) under exact string match, but this is expected: the pretrained AO checkpoint already handles this task, and exact match is a harsh metric for free-text prediction.
- **Summary** remains at 0% (labels were all identical in training data -- known issue).

Experiment tracking: [wandb `cot_oracle` project, run `cot_oracle_v4_mixed`](https://wandb.ai)

## Usage

### Requirements

This model requires the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) library for the activation collection and injection infrastructure.

```bash
git clone https://github.com/adamkarvonen/activation_oracles
cd activation_oracles && pip install -e .
```

### Loading the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load oracle adapter
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b")
```

### Collecting Activations

Activations must be collected from the **base model** (LoRA disabled) at the target layers:

```python
import torch

# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers)
LAYERS = [9, 18, 27]

# 1. Prepare input: question + CoT response
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)
full_text = prompt + cot_response

# 2. Find sentence boundary positions in token space
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"]
# boundary_positions = [...] (token indices at sentence boundaries)

# 3. Collect activations with LoRA DISABLED
# (layer path assumes the PEFT-wrapped model; with a bare
#  Qwen3ForCausalLM it would be model.model.layers[layer])
activations = {layer: {} for layer in LAYERS}  # {layer: {position: tensor}}

def make_capture_hook(layer):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos in boundary_positions:
            activations[layer][pos] = hidden[0, pos].detach().clone()
    return hook

with model.disable_adapter(), torch.no_grad():
    handles = [
        model.model.model.layers[layer].register_forward_hook(make_capture_hook(layer))
        for layer in LAYERS
    ]
    model(input_ids.to(model.device))
    for h in handles:
        h.remove()
```
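The `boundary_positions` placeholder above is left to the caller. One hedged sketch derives them from the tokenizer's character offsets (`return_offsets_mapping=True`); the helper name and the sentence-ending regex are assumptions, and the cot-oracle repo may segment sentences differently:

```python
import re

def sentence_boundary_positions(text, offset_mapping, min_char=0):
    """Map sentence-ending punctuation in `text` to token indices,
    given the (start, end) character offsets the tokenizer reports
    per token. Hypothetical helper, not the repo's segmenter."""
    positions = []
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        char_idx = m.start()
        if char_idx < min_char:
            continue  # only segment the CoT, not the prompt
        for tok_idx, (start, end) in enumerate(offset_mapping):
            if start <= char_idx < end:
                positions.append(tok_idx)
                break
    return positions
```

Passing `min_char=len(prompt)` restricts boundaries to the CoT portion of `full_text`.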

### Running the Oracle

```python
# 4. Build oracle prompt with placeholder tokens
# For multi-layer: interleave " @", " ?", " #" per boundary
n_boundaries = len(boundary_positions)
placeholders = " @ ? #" * n_boundaries  # 3 tokens per boundary

layer_str = ", ".join(str(l) for l in LAYERS)
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n"
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical."

# 5. Format as chat and tokenize
messages = [{"role": "user", "content": oracle_prompt}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
oracle_ids = tokenizer(formatted, return_tensors="pt")["input_ids"].to(model.device)

# 6. Inject activations via norm-matched addition at layer 1
# At each placeholder position, add the corresponding activation:
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...]
# Injection: h' = h + ||h|| * (v / ||v||)

# 7. Generate with LoRA ENABLED (default state)
output = model.generate(oracle_ids, max_new_tokens=64)
print(tokenizer.decode(output[0, oracle_ids.shape[1]:], skip_special_tokens=True))
```
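One way to realize step 6 is a forward pre-hook on decoder layer 1 that rewrites the hidden states at the placeholder positions during the prefill pass. This is a sketch under assumptions (the hook name, the positional hidden-states argument, and skipping single-token decode steps are all mine); the reference mechanism is `src/signs_of_life/ao_lib.py` in the cot-oracle repo.

```python
import torch

def make_layer1_injection_hook(vectors_by_position):
    """Norm-matched addition at each placeholder position.
    vectors_by_position: {token_index: (d,) activation tensor}."""
    def hook(module, args):
        hidden = args[0]
        if hidden.shape[1] == 1:  # single-token decode step: nothing to inject
            return None
        hidden = hidden.clone()
        for pos, v in vectors_by_position.items():
            h = hidden[0, pos]
            hidden[0, pos] = h + h.norm() * (v / v.norm())
        return (hidden,) + args[1:]
    return hook

# Hypothetical attachment point (PEFT-wrapped Qwen3; adjust the path
# for your wrapper):
# handle = model.model.model.layers[1].register_forward_pre_hook(
#     make_layer1_injection_hook(vectors_by_position))
```

Remember to call `handle.remove()` after generation so later forward passes run clean.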

For complete working code, see the [cot-oracle repository](https://github.com/ceselder/cot-oracle), particularly `src/signs_of_life/ao_lib.py` for the injection mechanism and `src/train_mixed.py` for the full training pipeline.

## Intended Use

This model is a **research artifact** for studying chain-of-thought interpretability. Intended uses include:

- Investigating what information is encoded in CoT activations at different stages of reasoning
- Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation)
- Building tools for mechanistic understanding of language model reasoning

### Limitations

- **Same-model only**: The oracle can only read activations from Qwen3-8B. It will not work with other models.
- **Exact match eval is harsh**: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs.
- **Decorative detection is undertrained**: Only ~500 unique training examples; results are noisy.
- **Summary task is broken**: All 200 training labels were identical, so the model learned nothing useful for this task.
- **No uncertainty calibration**: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024.

## Citation

```bibtex
@misc{cot-oracle-v4,
  title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories},
  author={Celeste Deschamps-Helaere},
  year={2026},
  url={https://github.com/ceselder/cot-oracle}
}
```

### Related Work

```bibtex
@article{karvonen2024activation,
  title={Activation Oracles},
  author={Karvonen, Adam and others},
  journal={arXiv preprint arXiv:2512.15674},
  year={2024}
}

@article{bogdan2025thought,
  title={Thought Anchors: Causal Importance of CoT Sentences},
  author={Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2506.19143},
  year={2025}
}

@article{macar2025thought,
  title={Thought Branches: Studying CoT through Trajectory Distribution},
  author={Macar, Uzay and Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2510.27484},
  year={2025}
}
```

## Links

- **Code**: [github.com/ceselder/cot-oracle](https://github.com/ceselder/cot-oracle)
- **Training data**: [huggingface.co/datasets/ceselder/cot-oracle-data](https://huggingface.co/datasets/ceselder/cot-oracle-data)
- **Base AO checkpoint**: [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)
- **Activation Oracles repo**: [github.com/adamkarvonen/activation_oracles](https://github.com/adamkarvonen/activation_oracles)
- **Experiment tracking**: wandb `cot_oracle` project, run `cot_oracle_v4_mixed`