Tags: PEFT, Safetensors, activation-oracle, chain-of-thought, interpretability, lora

# CoT Oracle v2 – Qwen3-8B (Experimental)

An activation oracle fine-tuned to analyze chain-of-thought reasoning traces by reading internal activations. This is an early experimental checkpoint with known data-quality issues; see below.

## What This Is

This is a LoRA adapter for Qwen/Qwen3-8B trained to read the model's own internal activations at CoT sentence boundaries and answer questions about the reasoning process. It builds on the Activation Oracles framework by Karvonen et al.

The oracle is initialized from the pre-trained AO checkpoint (adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) and further fine-tuned on CoT-specific tasks.

## Training Details

- Base model: Qwen/Qwen3-8B (36 layers)
- Starting point: pre-trained AO checkpoint (context prediction + classification tasks)
- Training data: ~100K examples from 200 math problems (100 MATH-500 + 100 GSM8K)
  - 45K context prediction (PastLens-style)
  - 15K sentence importance classification
  - 15K sentence taxonomy classification
  - 10K answer tracking (logit lens)
  - 15K reasoning summary
- Hyperparameters: lr=1e-5, batch_size=16, 1 epoch (6,218 steps), gradient checkpointing
- LoRA config: rank 64, alpha 128, dropout 0.05, all-linear
- Hardware: 1x H100 80GB, bf16, ~1.5 hours
- wandb: cot_oracle/runs/ejp28bev
- Corpus: ceselder/qwen3-8b-math-cot-corpus
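The LoRA settings above map directly onto a `peft` configuration. A minimal sketch, assuming standard `peft` usage (the actual training code lives in the linked repo, so treat this as illustrative):

```python
# Illustrative LoRA config matching the hyperparameters listed above:
# rank 64, alpha 128, dropout 0.05, applied to all linear layers.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```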

## Results (Exact String Match, 100 eval items per task)

| Step  | context_pred | importance | taxonomy | answer_track | summary |
|-------|--------------|------------|----------|--------------|---------|
| 0     | 11%          | 0%         | 0%       | 0%           | 0%      |
| 1000  | 9%           | 20%        | 52%      | 0%           | 100%    |
| 2000  | 15%          | 48%        | 60%      | 0%           | 100%    |
| 3000  | 14%          | 47%        | 60%      | 0%           | 100%    |
| 4000  | 13%          | 47%        | 65%      | 0%           | 100%    |
| 4500  | 12%          | 48%        | 72%      | 0%           | 100%    |
| 5000  | 14%          | 48%        | 64%      | 0%           | 100%    |
| 6000  | 15%          | 48%        | 67%      | 0%           | 100%    |
| final | 15%          | 48%        | 66%      | 0%           | 100%    |

Taxonomy (8-class sentence type classification) is the strongest result at 65-72% accuracy (random baseline: 12.5%).

Importance plateaued at ~48% (4-class, random baseline: 25%).

## Known Issues – Read Before Using

This is an honest accounting of what went wrong. We're publishing this for transparency and to save others from the same mistakes.

### 1. Summary labels are useless (100% = memorized garbage)

All 200 summary labels were identical: "The model performed step-by-step computation to arrive at the answer." The LLM-based label generation fell back to a generic template for every problem. The model memorized this single string perfectly, giving a misleading 100% accuracy. This task provides zero signal.
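A degenerate label set like this is cheap to catch before training. A minimal sanity check (not part of the original pipeline; the function name is illustrative):

```python
# Count distinct label strings and the fraction taken by the most
# common one; a near-constant label set fails either check.
from collections import Counter

def label_diversity(labels):
    """Return (num_unique, most_common_fraction) for a label list."""
    counts = Counter(labels)
    top_count = counts.most_common(1)[0][1]
    return len(counts), top_count / len(labels)

# 200 identical summary labels -> 1 unique label, 100% constant
labels = ["The model performed step-by-step computation..."] * 200
n_unique, top_frac = label_diversity(labels)
assert n_unique == 1 and top_frac == 1.0
```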

### 2. Importance labels are badly skewed

The importance labels used a fixed KL divergence threshold (>0.1 = "important") but 99.7% of sentences exceeded this threshold. The model was essentially trained on a near-constant label. The 48% accuracy with 4-class eval is hard to interpret because the training labels don't match the eval labels (the eval uses within-problem percentile ranking that was implemented after training started). Fixed for v3.
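The within-problem percentile approach can be sketched as follows: rank each sentence's KL divergence against the other sentences in the same problem, then bin by rank into four balanced tiers. This is a hedged sketch of the v3 fix, with illustrative names rather than the repo's actual code:

```python
# Rank-based tiering: a fixed global threshold collapses to one class
# when 99.7% of sentences exceed it, but within-problem ranks always
# spread evenly across tiers.
def importance_tiers(kl_values, n_tiers=4):
    """Assign each sentence a tier 0..n_tiers-1 by within-problem rank."""
    order = sorted(range(len(kl_values)), key=lambda i: kl_values[i])
    tiers = [0] * len(kl_values)
    for rank, idx in enumerate(order):
        tiers[idx] = rank * n_tiers // len(kl_values)
    return tiers

# 8 sentences from one problem, almost all above a fixed 0.1 threshold,
# yet still spread evenly across the 4 tiers (two sentences per tier):
kls = [0.12, 0.9, 0.15, 2.3, 0.11, 0.5, 1.1, 0.3]
assert sorted(importance_tiers(kls)) == [0, 0, 1, 1, 2, 2, 3, 3]
```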

### 3. Answer tracking never worked (0% accuracy)

The target format includes exact probability values (e.g., "step 5: P(answer)=0.73") which are impossible to match with exact string comparison. The task concept is sound but the format needs simplification. Needs redesign.
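One possible simplification (an assumption about the redesign, not the repo's actual fix) is to quantize probabilities into a few coarse named buckets, so targets become exactly matchable strings instead of raw floats like "P(answer)=0.73":

```python
# Coarse probability buckets: a hypothetical matchable target format
# for the answer-tracking task.
def bucket_prob(p):
    """Map a probability to a coarse, string-matchable bucket."""
    if p < 0.25:
        return "low"
    elif p < 0.5:
        return "medium-low"
    elif p < 0.75:
        return "medium-high"
    return "high"

assert bucket_prob(0.73) == "medium-high"
assert bucket_prob(0.9) == "high"
```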

### 4. Context prediction barely improved over baseline

Context prediction went from 11% to 15% over training. The pre-trained AO checkpoint already handles this task, so limited improvement is expected. The small corpus (200 problems) may also limit this.

### 5. Unfaithfulness detection doesn't work

When tested on authority bias and hint-following eval sets, the oracle gives the same response for every item regardless of whether the model was actually influenced. It reads the oracle prompt (which contains answer options) rather than the activations. This is expected: the training data contains no unfaithfulness-specific tasks. The oracle cannot detect unfaithful reasoning in its current form.

### 6. Eval uses exact string match

AO's eval framework uses exact string match, which is overly strict for open-ended responses. A label like "active_computation" would fail if the model outputs "Active computation" or "active computation step". Actual model capability may be somewhat higher than the reported numbers suggest.
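A fuzzy-scoring alternative can be as simple as normalizing case and separators and accepting prefix matches. This is an illustrative matcher, not AO's eval code:

```python
# Normalize case, underscores, and whitespace, then accept exact or
# prefix matches, so "Active computation" and "active computation step"
# both count for the target "active_computation".
import re

def normalized_match(pred, target):
    norm = lambda s: re.sub(r"[\s_]+", " ", s.strip().lower())
    p, t = norm(pred), norm(target)
    return p == t or p.startswith(t) or t.startswith(p)

assert normalized_match("Active computation", "active_computation")
assert normalized_match("active computation step", "active_computation")
assert not normalized_match("self checking", "active_computation")
```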

## What Works

- Taxonomy classification genuinely works. Given an activation at a CoT sentence boundary, the oracle can identify whether it's problem_setup, active_computation, self_checking, etc. at ~65-72% accuracy (5-6x the random baseline).
- The activation oracle framework works. LoRA injection, norm-matched addition at layer 1, on-the-fly activation collection: the plumbing is solid.
- Data shuffling matters. v1 (unshuffled) showed fake "grokking" as the model encountered each task sequentially; v2 (shuffled) learns all tasks simultaneously without catastrophic forgetting.

## Checkpoints

This repo contains multiple checkpoints:

- step_1000/ through step_6000/: saved every 1000 steps
- final/: end of training

Peak taxonomy performance is at step 4500 (72%), but we only saved at step_4000 (65%) and step_5000 (64%). Step 5000 is recommended as a reasonable all-round checkpoint.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-8b-v2", subfolder="step_5000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Use model.disable_adapter() for base-model inference,
# model.set_adapter("default") for oracle inference.
# See https://github.com/ceselder/cot-oracle for full usage.
```

## What's Next (v3)

- Fixed importance labels (within-problem percentile ranking, 4 balanced tiers)
- Synthesized unique summary labels per problem (from importance + taxonomy data)
- Unfaithfulness-specific training tasks
- Fuzzy eval scoring instead of exact string match
- Larger corpus (>200 problems)

## Citation

This work builds on the Activation Oracles framework by Karvonen et al.
