# CoT Oracle v2 – Qwen3-8B (Experimental)

An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an early experimental checkpoint with known data quality issues – see below.

## What This Is
This is a LoRA adapter for Qwen/Qwen3-8B trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the Activation Oracles framework by Karvonen et al.
The oracle is initialized from the pre-trained AO checkpoint (adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) and further fine-tuned on CoT-specific tasks.
## Training Details
- Base model: Qwen/Qwen3-8B (36 layers)
- Starting point: Pre-trained AO checkpoint (context prediction + classification tasks)
- Training data: ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
- Hyperparameters: lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
- LoRA config: rank 64, alpha 128, dropout 0.05, all-linear
- Hardware: 1x H100 80GB, bf16, ~1.5 hours
- wandb: cot_oracle/runs/ejp28bev
- Corpus: ceselder/qwen3-8b-math-cot-corpus
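For reference, the LoRA settings listed above correspond to roughly the following peft configuration. This is a sketch assuming a recent peft version; `task_type` is an assumption not stated on this card.

```python
from peft import LoraConfig

# Sketch of the adapter configuration from the hyperparameters above.
# target_modules="all-linear" is peft's shorthand for targeting every
# linear layer in the base model.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",  # assumption; not stated on this card
)
```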
## Results (Exact String Match, 100 Eval Items per Task)
| Step | context_pred | importance | taxonomy | answer_track | summary |
|---|---|---|---|---|---|
| 0 | 11% | 0% | 0% | 0% | 0% |
| 1000 | 9% | 20% | 52% | 0% | 100% |
| 2000 | 15% | 48% | 60% | 0% | 100% |
| 3000 | 14% | 47% | 60% | 0% | 100% |
| 4000 | 13% | 47% | 65% | 0% | 100% |
| 4500 | 12% | 48% | 72% | 0% | 100% |
| 5000 | 14% | 48% | 64% | 0% | 100% |
| 6000 | 15% | 48% | 67% | 0% | 100% |
| final | 15% | 48% | 66% | 0% | 100% |
Taxonomy (8-class sentence type classification) is the strongest result at 65-72% accuracy (random baseline: 12.5%).
Importance plateaued at ~48% (4-class, random baseline: 25%).
## Known Issues – Read Before Using
This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.
### 1. Summary labels are useless (100% = memorized garbage)
All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. This task provides zero signal.
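A cheap pre-training sanity check on label diversity would have caught this failure before any GPU time was spent. A minimal sketch (hypothetical helper, not part of the training pipeline):

```python
from collections import Counter

def label_diversity(labels):
    """Fraction of unique labels in a label set. Values near 1/len(labels)
    indicate a degenerate, near-constant label set that a model will
    simply memorize."""
    counts = Counter(labels)
    return len(counts) / max(len(labels), 1)

# The failure mode described above: 200 identical summary strings.
labels = ["The model performed step-by-step computation to arrive at the answer."] * 200
print(label_diversity(labels))  # 0.005 -- a red flag before training
```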
### 2. Importance labels are badly skewed
The importance labels used a fixed KL divergence threshold (>0.1 = "important") but 99.7% of sentences exceeded this threshold. The model was essentially trained on a near-constant label. The 48% accuracy with 4-class eval is hard to interpret because the training labels don't match the eval labels (the eval uses within-problem percentile ranking that was implemented after training started). Fixed for v3.
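The v3 fix described above (within-problem percentile ranking into balanced tiers) can be sketched as follows. This is a hypothetical implementation; the tier count and tie handling are assumptions.

```python
import numpy as np

def percentile_tiers(kl_values, n_tiers=4):
    """Assign each sentence's KL divergence to a within-problem percentile
    tier (0 = least important, n_tiers-1 = most important), so the label
    distribution is balanced by construction rather than depending on a
    fixed absolute threshold."""
    kl = np.asarray(kl_values, dtype=float)
    ranks = kl.argsort().argsort()  # rank of each sentence within this problem
    return (ranks * n_tiers // len(kl)).tolist()

# Eight sentences from one problem: each tier gets exactly two labels.
print(percentile_tiers([0.12, 0.9, 0.15, 2.3, 0.11, 0.5, 0.3, 1.1]))
# [0, 2, 1, 3, 0, 2, 1, 3]
```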
### 3. Answer tracking never worked (0% accuracy)
The target format includes exact probability values (e.g., "step 5: P(answer)=0.73") which are impossible to match with exact string comparison. The task concept is sound but the format needs simplification. Needs redesign.
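One possible simplification is to bucket P(answer) into a few coarse levels so that targets can survive exact string matching. This is a hypothetical redesign; the bin edges and wording are illustrative only.

```python
def format_answer_track(step: int, p: float) -> str:
    """Render an answer-tracking target with a coarse confidence level
    instead of an exact probability, so exact string comparison has a
    realistic chance of matching."""
    if p < 0.25:
        level = "low"
    elif p < 0.75:
        level = "medium"
    else:
        level = "high"
    return f"step {step}: answer confidence {level}"

print(format_answer_track(5, 0.73))  # step 5: answer confidence medium
```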
### 4. Context prediction barely improved over baseline
Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit this.
### 5. Unfaithfulness detection doesn't work
When tested on authority bias and hint-following eval sets, the oracle gives the same response for every item, regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected: the training data contains no unfaithfulness-specific tasks. The oracle cannot detect unfaithful reasoning in its current form.
### 6. Eval uses exact string match
AO's eval framework uses exact string match, which is overly strict for open-ended responses. A label like "active_computation" would fail if the model outputs "Active computation" or "active computation step". Actual model capability may therefore be somewhat higher than the reported numbers suggest.
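A normalized comparison along these lines would recover most of those near-misses. This is a hypothetical helper, not part of the AO eval framework:

```python
import re

def normalized_match(pred: str, target: str) -> bool:
    """Case-insensitive match after collapsing whitespace, underscores,
    and hyphens; also accepts the target appearing inside a longer
    prediction (e.g. 'active computation step')."""
    norm = lambda s: re.sub(r"[\s_\-]+", " ", s.strip().lower())
    return norm(pred) == norm(target) or norm(target) in norm(pred)

print(normalized_match("Active computation", "active_computation"))       # True
print(normalized_match("active computation step", "active_computation"))  # True
```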
## What Works
- Taxonomy classification genuinely works. Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~65-72% accuracy (5-6x random baseline).
- The activation oracle framework works. LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection: the plumbing is solid.
- Data shuffling matters. v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
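The norm-matched addition mentioned above can be sketched as follows: the injected activation is rescaled to the norm of the resident hidden state before being added, so the injection does not disturb the residual stream's scale. This is an illustrative implementation, not the repo's exact code.

```python
import torch

def norm_matched_add(hidden: torch.Tensor, injected: torch.Tensor) -> torch.Tensor:
    # Rescale the injected activation to the norm of the hidden state at
    # this position (per-position, over the feature dimension), then add.
    scale = hidden.norm(dim=-1, keepdim=True) / injected.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return hidden + scale * injected
```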
## Checkpoints
This repo contains multiple checkpoints:
- `step_1000/` through `step_6000/` – saved every 1,000 steps
- `final/` – end of training
Peak taxonomy performance is at step 4500 (72%), but checkpoints were only saved at `step_4000/` (65%) and `step_5000/` (64%). `step_5000/` is recommended as a reasonable all-round checkpoint.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Run inside `with model.disable_adapter():` for base-model inference;
# with the adapter active (the default after loading), the model acts
# as the oracle. See https://github.com/ceselder/cot-oracle for full usage.
```
## What's Next (v3)
- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
- Synthesized unique summary labels per problem (from importance + taxonomy data)
- Unfaithfulness-specific training tasks
- Fuzzy eval scoring instead of exact string match
- Larger corpus (>200 problems)
## Citation
This work builds on:
- Activation Oracles (Karvonen et al., 2024)
- Thought Anchors (Bogdan et al., 2025)
- Thought Branches (Macar, Bogdan et al., 2025)