# CoT Oracle v2 – Qwen3-8B (Experimental)

An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an early experimental checkpoint with known data quality issues – see below.

## What This Is
This is a LoRA adapter for Qwen/Qwen3-8B trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the Activation Oracles framework by Karvonen et al.
The oracle is initialized from the pre-trained AO checkpoint (adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) and further fine-tuned on CoT-specific tasks.
## Training Details
- Base model: Qwen/Qwen3-8B (36 layers)
- Starting point: Pre-trained AO checkpoint (context prediction + classification tasks)
- Training data: ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
- Hyperparameters: lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
- LoRA config: rank 64, alpha 128, dropout 0.05, all-linear
- Hardware: 1x H100 80GB, bf16, ~1.5 hours
- wandb: cot_oracle/runs/ejp28bev
- Corpus: ceselder/qwen3-8b-math-cot-corpus
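For reference, the LoRA settings listed above correspond to roughly the following peft configuration. This is a sketch assuming a recent peft version; `task_type` is an assumption not stated on this card.

```python
from peft import LoraConfig

# Sketch of the adapter configuration from the hyperparameters above.
# target_modules="all-linear" is peft's shorthand for targeting every
# linear layer in the base model.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",  # assumption; not stated on this card
)
```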
## Results (Exact String Match, 100 Eval Items per Task)
| Step | context_pred | importance | taxonomy | answer_track | summary |
|---|---|---|---|---|---|
| 0 | 11% | 0% | 0% | 0% | 0% |
| 1000 | 9% | 20% | 52% | 0% | 100% |
| 2000 | 15% | 48% | 60% | 0% | 100% |
| 3000 | 14% | 47% | 60% | 0% | 100% |
| 4000 | 13% | 47% | 65% | 0% | 100% |
| 4500 | 12% | 48% | 72% | 0% | 100% |
| 5000 | 14% | 48% | 64% | 0% | 100% |
| 6000 | 15% | 48% | 67% | 0% | 100% |
| final | 15% | 48% | 66% | 0% | 100% |
Taxonomy (8-class sentence type classification) is the strongest result at 65-72% accuracy (random baseline: 12.5%).
Importance plateaued at ~48% (4-class, random baseline: 25%).
## Known Issues – Read Before Using
This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.
### 1. Summary labels are useless (100% = memorized garbage)
All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. This task provides zero signal.
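A cheap pre-training sanity check on label diversity would have caught this failure before any GPU time was spent. A minimal sketch (hypothetical helper, not part of the training pipeline):

```python
from collections import Counter

def label_diversity(labels):
    """Fraction of unique labels in a label set. Values near 1/len(labels)
    indicate a degenerate, near-constant label set that a model will
    simply memorize."""
    counts = Counter(labels)
    return len(counts) / max(len(labels), 1)

# The failure mode described above: 200 identical summary strings.
labels = ["The model performed step-by-step computation to arrive at the answer."] * 200
print(label_diversity(labels))  # 0.005 -- a red flag before training
```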
### 2. Importance labels are badly skewed
The importance labels used a fixed KL divergence threshold (>0.1 = "important") but 99.7% of sentences exceeded this threshold. The model was essentially trained on a near-constant label. The 48% accuracy with 4-class eval is hard to interpret because the training labels don't match the eval labels (the eval uses within-problem percentile ranking that was implemented after training started). Fixed for v3.
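The v3 fix described above (within-problem percentile ranking into balanced tiers) can be sketched as follows. This is a hypothetical implementation; the tier count and tie handling are assumptions.

```python
import numpy as np

def percentile_tiers(kl_values, n_tiers=4):
    """Assign each sentence's KL divergence to a within-problem percentile
    tier (0 = least important, n_tiers-1 = most important), so the label
    distribution is balanced by construction rather than depending on a
    fixed absolute threshold."""
    kl = np.asarray(kl_values, dtype=float)
    ranks = kl.argsort().argsort()  # rank of each sentence within this problem
    return (ranks * n_tiers // len(kl)).tolist()

# Eight sentences from one problem: each tier gets exactly two labels.
print(percentile_tiers([0.12, 0.9, 0.15, 2.3, 0.11, 0.5, 0.3, 1.1]))
# [0, 2, 1, 3, 0, 2, 1, 3]
```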
### 3. Answer tracking never worked (0% accuracy)
The target format includes exact probability values (e.g., "step 5: P(answer)=0.73") which are impossible to match with exact string comparison. The task concept is sound but the format needs simplification. Needs redesign.
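One possible simplification is to bucket P(answer) into a few coarse levels so that targets can survive exact string matching. This is a hypothetical redesign; the bin edges and wording are illustrative only.

```python
def format_answer_track(step: int, p: float) -> str:
    """Render an answer-tracking target with a coarse confidence level
    instead of an exact probability, so exact string comparison has a
    realistic chance of matching."""
    if p < 0.25:
        level = "low"
    elif p < 0.75:
        level = "medium"
    else:
        level = "high"
    return f"step {step}: answer confidence {level}"

print(format_answer_track(5, 0.73))  # step 5: answer confidence medium
```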
### 4. Context prediction barely improved over baseline
Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit this.
### 5. Unfaithfulness detection doesn't work
When tested on authority bias and hint-following eval sets, the oracle gives the same response for every item, regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected: the training data contains no unfaithfulness-specific tasks. The oracle cannot detect unfaithful reasoning in its current form.
### 6. Eval uses exact string match
AO's eval framework uses exact string match, which is overly strict for open-ended responses. A label like "active_computation" would fail if the model outputs "Active computation" or "active computation step". Actual model capability may therefore be somewhat higher than the reported numbers suggest.
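A normalized comparison along these lines would recover most of those near-misses. This is a hypothetical helper, not part of the AO eval framework:

```python
import re

def normalized_match(pred: str, target: str) -> bool:
    """Case-insensitive match after collapsing whitespace, underscores,
    and hyphens; also accepts the target appearing inside a longer
    prediction (e.g. 'active computation step')."""
    norm = lambda s: re.sub(r"[\s_\-]+", " ", s.strip().lower())
    return norm(pred) == norm(target) or norm(target) in norm(pred)

print(normalized_match("Active computation", "active_computation"))       # True
print(normalized_match("active computation step", "active_computation"))  # True
```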
## What Works
- Taxonomy classification genuinely works. Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~65-72% accuracy (5-6x random baseline).
- The activation oracle framework works. LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection: the plumbing is solid.
- Data shuffling matters. v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially. v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.
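The norm-matched addition mentioned above can be sketched as follows: the injected activation is rescaled to the norm of the resident hidden state before being added, so the injection does not disturb the residual stream's scale. This is an illustrative implementation, not the repo's exact code.

```python
import torch

def norm_matched_add(hidden: torch.Tensor, injected: torch.Tensor) -> torch.Tensor:
    # Rescale the injected activation to the norm of the hidden state at
    # this position (per-position, over the feature dimension), then add.
    scale = hidden.norm(dim=-1, keepdim=True) / injected.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return hidden + scale * injected
```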
## Checkpoints
This repo contains multiple checkpoints:
- `step_1000/` through `step_6000/` – saved every 1,000 steps
- `final/` – end of training
Peak taxonomy performance is at step 4500 (72%), but checkpoints were only saved at `step_4000/` (65%) and `step_5000/` (64%). `step_5000/` is recommended as a reasonable all-round checkpoint.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Run inside `with model.disable_adapter():` for base-model inference;
# with the adapter active (the default after loading), the model acts
# as the oracle. See https://github.com/ceselder/cot-oracle for full usage.
```
## What's Next (v3)
- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
- Synthesized unique summary labels per problem (from importance + taxonomy data)
- Unfaithfulness-specific training tasks
- Fuzzy eval scoring instead of exact string match
- Larger corpus (>200 problems)
## Citation
This work builds on:
- Activation Oracles (Karvonen et al., 2024)
- Thought Anchors (Bogdan et al., 2025)
- Thought Branches (Macar, Bogdan et al., 2025)