|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: peft |
|
|
base_model: Qwen/Qwen3-8B |
|
|
tags: |
|
|
- activation-oracle |
|
|
- chain-of-thought |
|
|
- interpretability |
|
|
- mechanistic-interpretability |
|
|
- lora |
|
|
- qwen3 |
|
|
- reasoning |
|
|
- cot |
|
|
- unfaithfulness-detection |
|
|
datasets: |
|
|
- ceselder/cot-oracle-data |
|
|
pipeline_tag: text-generation |
|
|
model-index: |
|
|
- name: cot-oracle-v4-8b |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Domain Classification (from activations) |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 98 |
|
|
name: Exact Match Accuracy |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Correctness Prediction (from activations) |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 90 |
|
|
name: Exact Match Accuracy |
|
|
--- |
|
|
|
|
|
# CoT Oracle v4 (Qwen3-8B LoRA) |
|
|
|
|
|
A **chain-of-thought activation oracle**: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed. |
|
|
|
|
|
This is a continuation of the [Activation Oracles](https://github.com/adamkarvonen/activation_oracles) line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the **same model** as the source -- Qwen3-8B reads Qwen3-8B's activations. |
|
|
|
|
|
CoT Oracle v4 specializes in reading activations extracted at **sentence boundary positions** during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can: |
|
|
|
|
|
- **Classify the reasoning domain** (math, science, logic, commonsense, reading comprehension, multi-domain, medical) |
|
|
- **Predict whether the CoT reached the correct answer** |
|
|
- **Detect decorative reasoning** (steps that don't contribute to the answer) |
|
|
- **Predict surrounding token context** from arbitrary positions |
|
|
|
|
|
### Key Properties |
|
|
|
|
|
- The oracle reads activations, not text. It has no access to the CoT tokens themselves. |
|
|
- Activations are collected with LoRA **disabled** (pure base model representations). |
|
|
- Activations are injected via **norm-matched addition** at layer 1, preserving the scale of the residual stream. |
|
|
- The oracle generates with LoRA **enabled** (the trained adapter interprets the injected activations). |
|
|
|
|
|
## Training |
|
|
|
|
|
### Base Checkpoint |
|
|
|
|
|
Training continues from [`adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B`](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B), an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks. |
|
|
|
|
|
### LoRA Configuration |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Rank | 64 | |
|
|
| Alpha | 128 | |
|
|
| Dropout | 0.05 | |
|
|
| Target modules | all-linear | |
|
|
|
|
|
### Training Tasks |
|
|
|
|
|
Six tasks mixed together and shuffled into a single training run: |
|
|
|
|
|
| # | Task | Examples | Layers | Description | |
|
|
|---|------|----------|--------|-------------| |
|
|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. | |
|
|
| 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. | |
|
|
| 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. | |
|
|
| 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. | |
|
|
| 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. | |
|
|
| 6 | Persona detection | -- | -- | Skipped due to a data issue. |
|
|
|
|
|
**Total: ~221K training examples, 1 epoch.** |
|
|
|
|
|
Sentence-structured tasks (2-5) extract activations at 3 layers per boundary position:
|
|
- **Layer 9** (25% depth) |
|
|
- **Layer 18** (50% depth) |
|
|
- **Layer 27** (75% depth) |
|
|
|
|
|
Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs. |
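
The duplication scheme can be sketched as follows (illustrative only; `duplicate_for_layers` and the field names are hypothetical, not the dataset's actual schema):

```python
LAYERS = [9, 18, 27]  # 25%, 50%, 75% depth of Qwen3-8B

def duplicate_for_layers(example):
    # Emit each sentence-structured example twice:
    # once with all 3 layers, once with only the L50% layer.
    multi = {**example, "layers": LAYERS}
    single = {**example, "layers": [LAYERS[1]]}
    return [multi, single]
```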
|
|
|
|
|
### Training Details |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Hardware | 1x NVIDIA H100 80GB | |
|
|
| Precision | bf16 | |
|
|
| Batch size | 8 | |
|
|
| Learning rate | 1e-5 | |
|
|
| Steps | 31,300 | |
|
|
| Training time | ~4.5 hours | |
|
|
| Optimizer | AdamW | |
|
|
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 | |
|
|
|
|
|
### Activation Injection |
|
|
|
|
|
Activations are injected at layer 1 via norm-matched addition: |
|
|
|
|
|
``` |
|
|
h' = h + ||h|| * (v / ||v||) |
|
|
``` |
|
|
|
|
|
where `h` is the original hidden state and `v` is the collected activation vector. This preserves the norm of the residual stream while adding directional information from the source activations. |
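
In plain Python over lists, the operation looks like this (an illustrative sketch; the actual implementation in `src/signs_of_life/ao_lib.py` operates on torch tensors):

```python
import math

def norm_matched_add(h, v):
    # h' = h + ||h|| * (v / ||v||): inject the direction of v
    # at the scale of the residual stream h.
    h_norm = math.sqrt(sum(x * x for x in h))
    v_norm = math.sqrt(sum(x * x for x in v))
    return [h_i + h_norm * (v_i / v_norm) for h_i, v_i in zip(h, v)]

# norm_matched_add([3.0, 4.0], [0.0, 2.0]) -> [3.0, 9.0]
# (||h|| = 5, v / ||v|| = [0, 1], so h' = [3 + 0, 4 + 5])
```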
|
|
|
|
|
The placeholder token is `" ?"` (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: `" @"` (L25%), `" ?"` (L50%), `" #"` (L75%), cycling in that order. |
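
The cycling order can be sketched as follows (`placeholder_schedule` is a hypothetical helper, not part of the library):

```python
LAYERS = [9, 18, 27]
PLACEHOLDERS = [" @", " ?", " #"]  # L25%, L50%, L75%, cycling in that order

def placeholder_schedule(n_boundaries):
    # Injection order: for each sentence boundary, one placeholder per layer.
    return [(b, layer, tok)
            for b in range(n_boundaries)
            for layer, tok in zip(LAYERS, PLACEHOLDERS)]
```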
|
|
|
|
|
### Corpus |
|
|
|
|
|
The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. CoTs were generated via the OpenRouter API.
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluated on held-out data using exact string match: |
|
|
|
|
|
| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary | |
|
|
|------|--------|-------------|------------|---------------|--------------|---------| |
|
|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% | |
|
|
| 5,000 | **100%** | 86% | 67% | 4% | 7% | 0% | |
|
|
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% | |
|
|
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% | |
|
|
| 28,000 | **98%** | **90%** | 50% | 11% | 7% | 0% | |
|
|
|
|
|
**Key observations:** |
|
|
|
|
|
- **Domain classification** reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone. |
|
|
- **Correctness prediction** reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer. |
|
|
- **Decorative detection** is noisy (bouncing between 50% and 71% across checkpoints) due to limited eval data (74 unique both-correct entries).
|
|
- **Context prediction** stays low (7-11%) under exact string match, but this is expected: the pretrained AO checkpoint already handles this task, and exact match is a harsh metric for free-text prediction.
|
|
- **Summary** remains at 0% (labels were all identical in training data -- known issue). |
|
|
|
|
|
Experiment tracking: [wandb `cot_oracle` project, run `cot_oracle_v4_mixed`](https://wandb.ai) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
This model requires the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) library for the activation collection and injection infrastructure. |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/adamkarvonen/activation_oracles |
|
|
cd activation_oracles && pip install -e . |
|
|
``` |
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python |
|
|
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
from peft import PeftModel |
|
|
|
|
|
# Load base model |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"Qwen/Qwen3-8B", |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B") |
|
|
|
|
|
# Load oracle adapter |
|
|
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b") |
|
|
``` |
|
|
|
|
|
### Collecting Activations |
|
|
|
|
|
Activations must be collected from the **base model** (LoRA disabled) at the target layers: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers) |
|
|
LAYERS = [9, 18, 27] |
|
|
|
|
|
# 1. Prepare input: question + CoT response |
|
|
messages = [{"role": "user", "content": question}] |
|
|
prompt = tokenizer.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True, |
|
|
enable_thinking=True, |
|
|
) |
|
|
full_text = prompt + cot_response |
|
|
|
|
|
# 2. Find sentence boundary positions in token space |
|
|
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"] |
|
|
# boundary_positions = [...] (token indices at sentence boundaries) |
|
|
|
|
|
# 3. Collect activations with LoRA DISABLED |
|
|
with model.disable_adapter(): |
|
|
activations = {} # {layer: {position: tensor}} |
|
|
# Use hooks on model.model.layers[layer] to capture hidden states |
|
|
# at each boundary position for each layer |
|
|
``` |
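
One way to fill in the `boundary_positions` step is to locate sentence-final punctuation in character space and map it to token indices via the tokenizer's offset mapping. A naive sketch (the repo's pipeline may segment sentences differently; both helper functions here are hypothetical):

```python
import re

def sentence_boundary_chars(text):
    # Character index just past each sentence-final punctuation mark.
    # Naive: also fires on decimals and abbreviations.
    return [m.end() for m in re.finditer(r"[.!?]", text)]

def chars_to_token_positions(char_positions, offsets):
    # offsets: (start, end) character span per token, e.g. from
    # tokenizer(..., return_offsets_mapping=True).
    positions = []
    for cp in char_positions:
        for tok_idx, (start, end) in enumerate(offsets):
            if start < cp <= end:
                positions.append(tok_idx)
                break
    return positions
```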
|
|
|
|
|
### Running the Oracle |
|
|
|
|
|
```python |
|
|
# 4. Build oracle prompt with placeholder tokens |
|
|
# For multi-layer: interleave " @", " ?", " #" per boundary |
|
|
n_boundaries = len(boundary_positions) |
|
|
placeholders = " @ ? #" * n_boundaries # 3 tokens per boundary |
|
|
|
|
|
layer_str = ", ".join(str(l) for l in LAYERS) |
|
|
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n" |
|
|
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical." |
|
|
|
|
|
# 5. Format as chat and tokenize |
|
|
messages = [{"role": "user", "content": oracle_prompt}] |
|
|
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
input_ids = tokenizer(formatted, return_tensors="pt")["input_ids"].to(model.device)
|
|
|
|
|
# 6. Inject activations via norm-matched addition at layer 1 |
|
|
# At each placeholder position, add the corresponding activation: |
|
|
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...] |
|
|
# Injection: h' = h + ||h|| * (v / ||v||) |
|
|
|
|
|
# 7. Generate with LoRA ENABLED (default state) |
|
|
output = model.generate(input_ids, max_new_tokens=64) |
|
|
``` |
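
Before injection, the placeholder positions in the tokenized oracle prompt must be located so that the i-th placeholder receives activation (boundary `i // 3`, layer `LAYERS[i % 3]`). A minimal sketch (`find_placeholder_positions` is a hypothetical helper, not the repo's API; per the card, `" ?"` is token ID 937, while the other IDs below are made up for illustration):

```python
def find_placeholder_positions(input_ids, placeholder_ids):
    # Token positions, in order, where any per-layer placeholder id occurs.
    targets = set(placeholder_ids)
    return [i for i, tid in enumerate(input_ids) if tid in targets]

# With " ?" = 937 and illustrative ids 555 (" @") and 777 (" #"):
# positions cycle [L25_s1, L50_s1, L75_s1, L25_s2, ...] per boundary.
```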
|
|
|
|
|
For complete working code, see the [cot-oracle repository](https://github.com/ceselder/cot-oracle), particularly `src/signs_of_life/ao_lib.py` for the injection mechanism and `src/train_mixed.py` for the full training pipeline. |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is a **research artifact** for studying chain-of-thought interpretability. Intended uses include: |
|
|
|
|
|
- Investigating what information is encoded in CoT activations at different stages of reasoning |
|
|
- Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation) |
|
|
- Building tools for mechanistic understanding of language model reasoning |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- **Same-model only**: The oracle can only read activations from Qwen3-8B. It will not work with other models. |
|
|
- **Exact match eval is harsh**: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs. |
|
|
- **Decorative detection is undertrained**: Only ~500 unique training examples; results are noisy. |
|
|
- **Summary task is broken**: All 200 training labels were identical, so the model learned nothing useful for this task. |
|
|
- **No uncertainty calibration**: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{cot-oracle-v4, |
|
|
title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories}, |
|
|
author={Celeste Deschamps-Helaere}, |
|
|
year={2026}, |
|
|
url={https://github.com/ceselder/cot-oracle} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Related Work |
|
|
|
|
|
```bibtex |
|
|
@article{karvonen2024activation, |
|
|
title={Activation Oracles}, |
|
|
author={Karvonen, Adam and others}, |
|
|
journal={arXiv preprint arXiv:2512.15674}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{bogdan2025thought, |
|
|
title={Thought Anchors: Causal Importance of CoT Sentences}, |
|
|
author={Bogdan, Paul and others}, |
|
|
journal={arXiv preprint arXiv:2506.19143}, |
|
|
year={2025} |
|
|
} |
|
|
|
|
|
@article{macar2025thought, |
|
|
title={Thought Branches: Studying CoT through Trajectory Distribution}, |
|
|
author={Macar, Uzay and Bogdan, Paul and others}, |
|
|
journal={arXiv preprint arXiv:2510.27484}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Code**: [github.com/ceselder/cot-oracle](https://github.com/ceselder/cot-oracle) |
|
|
- **Training data**: [huggingface.co/datasets/ceselder/cot-oracle-data](https://huggingface.co/datasets/ceselder/cot-oracle-data) |
|
|
- **Base AO checkpoint**: [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) |
|
|
- **Activation Oracles repo**: [github.com/adamkarvonen/activation_oracles](https://github.com/adamkarvonen/activation_oracles) |
|
|
- **Experiment tracking**: wandb `cot_oracle` project, run `cot_oracle_v4_mixed` |
|
|
|