CoT Oracle v4 (Qwen3-8B LoRA)

A chain-of-thought activation oracle: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed.

This is a continuation of the Activation Oracles line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations.

Model Description

An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the same model as the source -- Qwen3-8B reads Qwen3-8B's activations.

CoT Oracle v4 specializes in reading activations extracted at sentence boundary positions during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can:

  • Classify the reasoning domain (math, science, logic, commonsense, reading comprehension, multi-domain, medical)
  • Predict whether the CoT reached the correct answer
  • Detect decorative reasoning (steps that don't contribute to the answer)
  • Predict surrounding token context from arbitrary positions

Key Properties

  • The oracle reads activations, not text. It has no access to the CoT tokens themselves.
  • Activations are collected with LoRA disabled (pure base model representations).
  • Activations are injected via norm-matched addition at layer 1, preserving the scale of the residual stream.
  • The oracle generates with LoRA enabled (the trained adapter interprets the injected activations).

Training

Base Checkpoint

Training continues from adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B, an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks.

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | all-linear |
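In PEFT terms, this corresponds to roughly the following configuration (a sketch; `task_type` is an assumption, not stated above):

```python
from peft import LoraConfig

# Sketch of the adapter config from the table above.
# `task_type` is assumed, not documented in this card.
lora_config = LoraConfig(
    r=64,                        # Rank
    lora_alpha=128,              # Alpha
    lora_dropout=0.05,           # Dropout
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```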

Training Tasks

Six tasks mixed together and shuffled into a single training run:

| # | Task | Examples | Layers | Description |
|---|------|----------|--------|-------------|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. |
| 2 | Context prediction (sentences) | 30K (×2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. |
| 3 | Decorative detection | 500 (×2 = 1K) | 3 layers at boundaries | Binary classification: load_bearing or decorative. |
| 4 | Domain classification | 15K (×2 = 30K) | 3 layers at boundaries | 7-class: math, science, logic, commonsense, reading, multi_domain, medical. |
| 5 | Correctness prediction | 15K (×2 = 30K) | 3 layers at boundaries | Binary: correct or incorrect. |
| 6 | Persona detection | -- | -- | Skipped due to a data issue. |

Total: ~221K training examples, 1 epoch.

Sentence-structured tasks (2-5) extract activations at 3 layers per boundary position:

  • Layer 9 (25% depth)
  • Layer 18 (50% depth)
  • Layer 27 (75% depth)

Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs.
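The two placeholder variants can be sketched as follows (a hypothetical helper; the actual data pipeline lives in the training repo):

```python
def placeholder_string(n_boundaries: int, multi_layer: bool) -> str:
    """Build the placeholder sequence for one example.

    Multi-layer examples cycle " @ ? #" (L25%, L50%, L75%) per boundary;
    single-layer examples use only " ?" (L50%).
    """
    unit = " @ ? #" if multi_layer else " ?"
    return (unit * n_boundaries).strip()
```

For example, a two-boundary multi-layer example uses `"@ ? # @ ? #"`, while its single-layer duplicate uses `"? ?"`.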

Training Details

| Parameter | Value |
|---|---|
| Hardware | 1x NVIDIA H100 80GB |
| Precision | bf16 |
| Batch size | 8 |
| Learning rate | 1e-5 |
| Steps | 31,300 |
| Training time | ~4.5 hours |
| Optimizer | AdamW |
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 |

Activation Injection

Activations are injected at layer 1 via norm-matched addition:

h' = h + ||h|| * (v / ||v||)

where h is the original hidden state and v is the collected activation vector. This preserves the norm of the residual stream while adding directional information from the source activations.

The placeholder token is " ?" (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: " @" (L25%), " ?" (L50%), " #" (L75%), cycling in that order.
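A minimal sketch of the injection step as a pure function (the actual hook lives in the activation_oracles library):

```python
import torch

def norm_matched_add(h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """h' = h + ||h|| * (v / ||v||): add the direction of v at the scale of h."""
    return h + h.norm() * (v / v.norm())
```

For h = (3, 0) and v = (0, 5) this yields (3, 3): the unit direction of v, scaled by ||h|| = 3, is added to h.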

Corpus

The training corpus consists of CoT traces generated by Qwen3-8B (via the OpenRouter API) across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA.

Evaluation Results

Evaluated on held-out data using exact string match:

| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary |
|------|--------|-------------|------------|---------------|--------------|---------|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% |
| 5,000 | 100% | 86% | 67% | 4% | 7% | 0% |
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% |
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% |
| 28,000 | 98% | 90% | 50% | 11% | 7% | 0% |

Key observations:

  • Domain classification reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone.
  • Correctness prediction reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer.
  • Decorative detection is noisy (bounces between 50-71%) due to limited eval data (74 unique both-correct entries).
  • Context prediction stays low (7-11%) under exact string match but this is expected -- the pretrained AO checkpoint already handles this task and exact match is a harsh metric for free-text prediction.
  • Summary remains at 0% (labels were all identical in training data -- known issue).

Experiment tracking: wandb project `cot_oracle`, run `cot_oracle_v4_mixed`.

Usage

Requirements

This model requires the activation_oracles library for the activation collection and injection infrastructure.

```bash
git clone https://github.com/adamkarvonen/activation_oracles
cd activation_oracles && pip install -e .
```

Loading the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load oracle adapter
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b")
```

Collecting Activations

Activations must be collected from the base model (LoRA disabled) at the target layers:

```python
import torch

# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers)
LAYERS = [9, 18, 27]

# 1. Prepare input: question + CoT response
# (`question` and `cot_response` are the original prompt and generated CoT text)
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)
full_text = prompt + cot_response

# 2. Find sentence boundary positions in token space
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"]
# boundary_positions = [...] (token indices at sentence boundaries)

# 3. Collect activations with LoRA DISABLED
with model.disable_adapter():
    activations = {}  # {layer: {position: tensor}}
    # Use hooks on model.model.layers[layer] to capture hidden states
    # at each boundary position for each layer
```
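Step 2 is left abstract above. One hedged sketch, using a naive punctuation heuristic plus the tokenizer's `return_offsets_mapping` output (the repository's own boundary logic may differ):

```python
import re

def sentence_char_ends(text: str) -> list[int]:
    """Character offsets just past sentence-ending punctuation (naive heuristic)."""
    return [m.end(1) for m in re.finditer(r"([.!?])(\s|$)", text)]

def boundary_token_positions(offset_mapping, char_ends):
    """Map each sentence end to the last token whose span ends at or before it.

    offset_mapping: list of (char_start, char_end) per token, e.g. from
    tokenizer(full_text, return_offsets_mapping=True)["offset_mapping"].
    """
    positions = []
    for ce in char_ends:
        candidates = [i for i, (_, e) in enumerate(offset_mapping) if 0 < e <= ce]
        if candidates:
            positions.append(max(candidates))
    return positions
```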

Running the Oracle

```python
# 4. Build oracle prompt with placeholder tokens
# For multi-layer: interleave " @", " ?", " #" per boundary
n_boundaries = len(boundary_positions)
placeholders = " @ ? #" * n_boundaries  # 3 placeholder tokens per boundary

layer_str = ", ".join(str(l) for l in LAYERS)
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n"
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical."

# 5. Format as chat and tokenize
messages = [{"role": "user", "content": oracle_prompt}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
oracle_inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

# 6. Inject activations via norm-matched addition at layer 1
# At each placeholder position, add the corresponding activation:
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...]
# Injection: h' = h + ||h|| * (v / ||v||)

# 7. Generate with LoRA ENABLED (default state)
output = model.generate(**oracle_inputs, max_new_tokens=64)
```
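Step 6 is the non-standard part. A hedged sketch of the injection as a forward pre-hook (function names and the module path are assumptions; the activation_oracles library provides the real implementation):

```python
import torch

def make_injection_pre_hook(placeholder_positions, vectors):
    """Pre-hook for a decoder layer: norm-matched addition at placeholder positions.

    placeholder_positions: token indices of the " @"/" ?"/" #" placeholders
    vectors: one collected activation tensor per placeholder, in cycle order
    """
    def hook(module, args):
        hidden = args[0].clone()  # (batch, seq, d_model)
        for pos, v in zip(placeholder_positions, vectors):
            h = hidden[0, pos]
            hidden[0, pos] = h + h.norm() * (v / v.norm())
        return (hidden,) + args[1:]
    return hook

# Hypothetical usage (exact module path depends on the PEFT wrapper):
# layer1 = model.base_model.model.model.layers[1]
# handle = layer1.register_forward_pre_hook(make_injection_pre_hook(pos_list, vecs))
# output = model.generate(**oracle_inputs, max_new_tokens=64)
# handle.remove()
```

Note that during cached generation the hook should only modify the initial prefill pass; the sketch above ignores that detail.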

For complete working code, see the cot-oracle repository, particularly src/signs_of_life/ao_lib.py for the injection mechanism and src/train_mixed.py for the full training pipeline.

Intended Use

This model is a research artifact for studying chain-of-thought interpretability. Intended uses include:

  • Investigating what information is encoded in CoT activations at different stages of reasoning
  • Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation)
  • Building tools for mechanistic understanding of language model reasoning

Limitations

  • Same-model only: The oracle can only read activations from Qwen3-8B. It will not work with other models.
  • Exact match eval is harsh: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs.
  • Decorative detection is undertrained: Only ~500 unique training examples; results are noisy.
  • Summary task is broken: All 200 training labels were identical, so the model learned nothing useful for this task.
  • No uncertainty calibration: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024.

Citation

```bibtex
@misc{cot-oracle-v4,
  title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories},
  author={Celeste Deschamps-Helaere},
  year={2026},
  url={https://github.com/ceselder/cot-oracle}
}
```

Related Work

```bibtex
@article{karvonen2024activation,
  title={Activation Oracles},
  author={Karvonen, Adam and others},
  journal={arXiv preprint arXiv:2512.15674},
  year={2024}
}

@article{bogdan2025thought,
  title={Thought Anchors: Causal Importance of CoT Sentences},
  author={Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2506.19143},
  year={2025}
}

@article{macar2025thought,
  title={Thought Branches: Studying CoT through Trajectory Distribution},
  author={Macar, Uzay and Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2510.27484},
  year={2025}
}
```
