cot-oracle-v4-checkpoints / README.md

ceselder

metadata

language:
  - en
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen3-8B
tags:
  - activation-oracle
  - chain-of-thought
  - interpretability
  - mechanistic-interpretability
  - lora
  - qwen3
  - reasoning
  - cot
  - unfaithfulness-detection
datasets:
  - ceselder/cot-oracle-data
pipeline_tag: text-generation
model-index:
  - name: cot-oracle-v4-8b
    results:
      - task:
          type: text-generation
          name: Domain Classification (from activations)
        metrics:
          - type: accuracy
            value: 98
            name: Exact Match Accuracy
      - task:
          type: text-generation
          name: Correctness Prediction (from activations)
        metrics:
          - type: accuracy
            value: 90
            name: Exact Match Accuracy

CoT Oracle v4 (Qwen3-8B LoRA)

A chain-of-thought activation oracle: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed.

This is a continuation of the Activation Oracles line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations.

Model Description

An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the same model as the source -- Qwen3-8B reads Qwen3-8B's activations.

CoT Oracle v4 specializes in reading activations extracted at sentence boundary positions during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can:

Classify the reasoning domain (math, science, logic, commonsense, reading comprehension, multi-domain, medical)
Predict whether the CoT reached the correct answer
Detect decorative reasoning (steps that don't contribute to the answer)
Predict surrounding token context from arbitrary positions

Key Properties

The oracle reads activations, not text. It has no access to the CoT tokens themselves.
Activations are collected with LoRA disabled (pure base model representations).
Activations are injected via norm-matched addition at layer 1, preserving the scale of the residual stream.
The oracle generates with LoRA enabled (the trained adapter interprets the injected activations).

Training

Base Checkpoint

Training continues from adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B, an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks.

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | all-linear |

Training Tasks

Six tasks were planned; five are mixed and shuffled into a single training run (task 6 was skipped):

| # | Task | Examples | Layers | Description |
|---|---|---|---|---|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. |
| 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. |
| 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. |
| 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. |
| 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. |
| 6 | Persona detection | -- | -- | Skipped due to data issue. |

Total: ~221K training examples, 1 epoch.
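As a quick sanity check on the total, the per-task counts from the table can be summed directly. This is only arithmetic on the numbers listed above; the dictionary keys are illustrative names, not identifiers from the training code.

```python
# Effective example counts per task, in thousands, as listed in the table.
# Sentence-structured tasks are counted after doubling (x2).
# Key names are hypothetical labels chosen for this sketch.
EFFECTIVE_EXAMPLES_K = {
    "context_prediction_random": 100,
    "context_prediction_sentences": 2 * 30,
    "decorative_detection": 2 * 0.5,
    "domain_classification": 2 * 15,
    "correctness_prediction": 2 * 15,
    "persona_detection": 0,  # skipped due to data issue
}

total_k = sum(EFFECTIVE_EXAMPLES_K.values())
print(f"{total_k:.0f}K examples")
```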

Sentence-structured tasks (2-5; task 6 was skipped) extract activations at 3 layers per boundary position:

Layer 9 (25% depth)
Layer 18 (50% depth)
Layer 27 (75% depth)

Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs.
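The layer choice and the duplication scheme can be sketched as below. This is a minimal illustration, assuming the layer index is simply the depth fraction times the layer count (which gives 9, 18, 27 for a 36-layer model) and using hypothetical field names (`layers`, `positions`); the actual training code may structure its examples differently.

```python
def layers_at_depths(n_layers, fractions=(0.25, 0.5, 0.75)):
    """Map depth fractions to layer indices (assumed rule: floor of n_layers * f)."""
    return [int(n_layers * f) for f in fractions]


def expand_example(boundary_positions, layers):
    """Return the two variants of one sentence-structured example.

    Variant 1 uses all layers; variant 2 uses only the middle (L50%) layer.
    The dict layout is a hypothetical stand-in for the real example format.
    """
    middle = layers[len(layers) // 2]
    return [
        {"layers": list(layers), "positions": list(boundary_positions)},
        {"layers": [middle], "positions": list(boundary_positions)},
    ]


layers = layers_at_depths(36)
variants = expand_example([5, 12, 20], layers)
print(layers)
print([v["layers"] for v in variants])
```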

Training Details

| Parameter | Value |
|---|---|
| Hardware | 1x NVIDIA H100 80GB |
| Precision | bf16 |
| Batch size | 8 |
| Learning rate | 1e-5 |
| Steps | 31,300 |
| Training time | ~4.5 hours |
| Optimizer | AdamW |
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 |

Activation Injection

Activations are injected at layer 1 via norm-matched addition:

h' = h + ||h|| * (v / ||v||)

where h is the original hidden state and v is the collected activation vector. The vector v is rescaled to the norm of h before it is added, so the injected term matches the scale of the residual stream while contributing the direction of the source activation.
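A short derivation of what the formula does to norms may help here. It assumes h and v are both nonzero, and writes θ for the angle between them; nothing below goes beyond the formula as stated.

```latex
\begin{aligned}
u &= \frac{v}{\lVert v \rVert}, \qquad \lVert u \rVert = 1, \\
h' &= h + \lVert h \rVert\, u, \\
\lVert h' - h \rVert &= \lVert h \rVert, \\
\lVert h' \rVert^2 &= \lVert h \rVert^2 + 2 \lVert h \rVert\,(h \cdot u) + \lVert h \rVert^2
  = 2 \lVert h \rVert^2 \,(1 + \cos\theta), \\
0 \le \lVert h' \rVert &\le 2 \lVert h \rVert .
\end{aligned}
```

So the added term always has exactly the magnitude of h, and the result stays within a factor of two of the original norm; when v is orthogonal to h, the norm grows by a factor of √2.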

The placeholder token is " ?" (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: " @" (L25%), " ?" (L50%), " #" (L75%), cycling in that order.
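The placeholder layout can be sketched as follows. The mapping from layer to placeholder string follows the text above; the function name and the `(sentence_index, layer)` slot representation are illustrative choices for this sketch, not names from the repository.

```python
LAYERS = [9, 18, 27]
# Placeholder string per layer, as described above: " @" (L25%), " ?" (L50%), " #" (L75%).
PLACEHOLDERS = {9: " @", 18: " ?", 27: " #"}


def build_placeholder_slots(n_boundaries, layers=LAYERS):
    """Return the placeholder text and the (sentence_index, layer) each slot stands for.

    Slots cycle through the layers for each sentence boundary in turn.
    """
    slots = [(s, layer) for s in range(n_boundaries) for layer in layers]
    text = "".join(PLACEHOLDERS[layer] for _, layer in slots)
    return text, slots


multi_text, multi_slots = build_placeholder_slots(2)
single_text, single_slots = build_placeholder_slots(2, layers=[18])
print(repr(multi_text), multi_slots)
print(repr(single_text), single_slots)
```

The slot list makes the ordering explicit, so each placeholder position can be paired with the activation it should receive.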

Corpus

The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. CoTs were generated via the OpenRouter API.

Evaluation Results

Evaluated on held-out data using exact string match:

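| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary |
|---|---|---|---|---|---|---|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% |
| 5,000 | 100% | 86% | 67% | 4% | 7% | 0% |
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% |
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% |
| 28,000 | 98% | 90% | 50% | 11% | 7% | 0% |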

Key observations:

Domain classification reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone.
Correctness prediction reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer.
Decorative detection is noisy (bounces between 50-71%) due to limited eval data (74 unique both-correct entries).
Sentence and context prediction stay low (at most 11% and 9% respectively at the checkpoints shown) under exact string match, but this is expected -- the pretrained AO checkpoint already handles this task and exact match is a harsh metric for free-text prediction.
Summary remains at 0% (labels were all identical in training data -- known issue).
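To make the metric concrete, here is a minimal sketch of exact-string-match accuracy. Stripping surrounding whitespace before comparison is an assumption of this sketch; the actual evaluation script may normalize differently or not at all.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly.

    Assumption: leading/trailing whitespace is ignored; everything else must match.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Short label tasks score well when the label is right...
label_acc = exact_match_accuracy(["math", "medical"], ["math", "medical"])
# ...but a free-text prediction that is off by one word scores zero.
free_text_acc = exact_match_accuracy(
    ["the cat sat on a mat"], ["the cat sat on the mat"]
)
print(label_acc, free_text_acc)
```

This illustrates why classification tasks and free-text tasks are not directly comparable under this metric.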

Experiment tracking: wandb cot_oracle project, run cot_oracle_v4_mixed

Usage

Requirements

This model requires the activation_oracles library for the activation collection and injection infrastructure.

git clone https://github.com/adamkarvonen/activation_oracles
cd activation_oracles && pip install -e .

Loading the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load oracle adapter
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b")

Collecting Activations

Activations must be collected from the base model (LoRA disabled) at the target layers:

import torch

# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers)
LAYERS = [9, 18, 27]

# 1. Prepare input: question + CoT response
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)
full_text = prompt + cot_response

# 2. Find sentence boundary positions in token space
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"]
# boundary_positions = [...] (token indices at sentence boundaries)

# 3. Collect activations with LoRA DISABLED
with model.disable_adapter():
    activations = {}  # {layer: {position: tensor}}
    # Use hooks on model.model.layers[layer] to capture hidden states
    # at each boundary position for each layer
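Step 2 above leaves `boundary_positions` unspecified. One way to fill it in is sketched below, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace or end of text, and that the boundary token is the one containing that punctuation mark. Both rules are assumptions of this sketch; the repository's own sentence splitter may differ. The function takes per-token character offsets in the form a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`.

```python
import re

# Assumed sentence-end rule: ., ! or ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def sentence_boundary_tokens(text, offsets, start_char=0):
    """Return token indices of sentence-final tokens.

    offsets: list of (char_start, char_end) pairs, one per token.
    start_char: ignore sentence ends before this character (e.g. to skip the prompt).
    """
    end_chars = [
        m.end() for m in _SENTENCE_END.finditer(text) if m.start() >= start_char
    ]
    positions = []
    tok = 0
    for end_char in end_chars:
        # Advance to the first token whose span reaches the sentence end.
        while tok < len(offsets) and offsets[tok][1] < end_char:
            tok += 1
        if tok == len(offsets):
            break
        if not positions or positions[-1] != tok:
            positions.append(tok)
    return positions


# Hand-built offsets standing in for a tokenizer's offset mapping.
demo_text = "Add 2 and 3. That gives 5."
demo_offsets = [(0, 3), (3, 5), (5, 9), (9, 11), (11, 12),
                (12, 17), (17, 23), (23, 25), (25, 26)]
demo_positions = sentence_boundary_tokens(demo_text, demo_offsets)
print(demo_positions)
```

With a real tokenizer, the offsets would come from `tokenizer(full_text, return_offsets_mapping=True)["offset_mapping"]`, and `start_char=len(prompt)` would restrict boundaries to the CoT response.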

Running the Oracle

# 4. Build oracle prompt with placeholder tokens
# For multi-layer: interleave " @", " ?", " #" per boundary
n_boundaries = len(boundary_positions)
placeholders = " @ ? #" * n_boundaries  # 3 tokens per boundary

layer_str = ", ".join(str(l) for l in LAYERS)
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n"
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical."

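# 5. Format as chat and tokenize
messages = [{"role": "user", "content": oracle_prompt}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
oracle_inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

# 6. Inject activations via norm-matched addition at layer 1
# At each placeholder position, add the corresponding activation:
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...]
# Injection: h' = h + ||h|| * (v / ||v||)

# 7. Generate with LoRA ENABLED (default state)
# Pass the tokenized oracle prompt here, not the CoT input_ids from step 2
output = model.generate(**oracle_inputs, max_new_tokens=64)
prompt_len = oracle_inputs["input_ids"].shape[1]
answer = tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)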
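Step 6 is only described in comments above. The sketch below shows one way such an injection could be wired up with a PyTorch forward hook, demonstrated on a toy stack of linear layers rather than Qwen3-8B so that it runs on its own. The helper names (`norm_matched_add`, `make_injection_hook`) are hypothetical, and treating "layer 1" as the module at index 1 is an assumption; the repository's `ao_lib.py` is the reference for the real mechanism.

```python
import torch
from torch import nn


def norm_matched_add(h, v, eps=1e-8):
    """h' = h + ||h|| * (v / ||v||), computed along the last dimension."""
    h_norm = h.norm(dim=-1, keepdim=True)
    v_unit = v / v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return h + h_norm * v_unit


def make_injection_hook(positions, vectors):
    """Forward hook that rewrites the layer output at `positions` (batch item 0)."""
    def hook(module, inputs, output):
        # Decoder layers may return a tensor or a tuple whose first item is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[0, positions] = norm_matched_add(
            hidden[0, positions], vectors.to(hidden.dtype)
        )
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook


# Toy stand-in for the transformer: three linear "layers", hidden size 16.
torch.manual_seed(0)
toy_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])
x = torch.randn(1, 6, 16)            # (batch, sequence, hidden)
positions = torch.tensor([2, 4])     # placeholder token positions
vectors = torch.randn(2, 16)         # one collected activation per placeholder


def layer1_output(inp):
    return toy_layers[1](toy_layers[0](inp))


with torch.no_grad():
    clean = layer1_output(x)
    handle = toy_layers[1].register_forward_hook(
        make_injection_hook(positions, vectors)
    )
    patched = layer1_output(x)
    handle.remove()  # always remove hooks once generation is done

delta = patched - clean
print(delta[0, positions].norm(dim=-1), clean[0, positions].norm(dim=-1))
```

In the toy run, the change at each placeholder position has the same norm as the original hidden state there, and all other positions are left untouched.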

For complete working code, see the cot-oracle repository, particularly src/signs_of_life/ao_lib.py for the injection mechanism and src/train_mixed.py for the full training pipeline.

Intended Use

This model is a research artifact for studying chain-of-thought interpretability. Intended uses include:

Investigating what information is encoded in CoT activations at different stages of reasoning
Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation)
Building tools for mechanistic understanding of language model reasoning

Limitations

Same-model only: The oracle can only read activations from Qwen3-8B. It will not work with other models.
Exact match eval is harsh: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs.
Decorative detection is undertrained: Only ~500 unique training examples; results are noisy.
Summary task is broken: All 200 training labels were identical, so the model learned nothing useful for this task.
No uncertainty calibration: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024.

Citation

@misc{cot-oracle-v4,
  title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories},
  author={Celeste Deschamps-Helaere},
  year={2026},
  url={https://github.com/ceselder/cot-oracle}
}

Related Work

@article{karvonen2024activation,
  title={Activation Oracles},
  author={Karvonen, Adam and others},
  journal={arXiv preprint arXiv:2512.15674},
  year={2024}
}

@article{bogdan2025thought,
  title={Thought Anchors: Causal Importance of CoT Sentences},
  author={Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2506.19143},
  year={2025}
}

@article{macar2025thought,
  title={Thought Branches: Studying CoT through Trajectory Distribution},
  author={Macar, Uzay and Bogdan, Paul and others},
  journal={arXiv preprint arXiv:2510.27484},
  year={2025}
}
