Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper: arXiv:2512.15674
LoRA adapter for Qwen/Qwen3-8B trained as a chain-of-thought (CoT) trajectory oracle. This is the stride=5, three-layer control ablation: it reads activations sampled every 5 tokens from layers 9, 18, and 27 (25%, 50%, and 75% of model depth).
Base AO checkpoint: adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B
The oracle takes activation trajectories extracted during CoT generation and classifies or describes what actually influenced the reasoning.
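The sampling scheme above can be sketched with plain `transformers` hidden states. Note this is an illustrative assumption about how trajectories are built, not the repo's actual extraction API; the helper names are hypothetical, and the `l + 1` indexing assumes the Hugging Face convention where index 0 of `hidden_states` is the embedding output.

```python
import torch

STRIDE = 5
LAYERS = [9, 18, 27]  # 25%, 50%, 75% depth of Qwen3-8B's 36 decoder layers

def sampled_positions(seq_len: int, stride: int = STRIDE) -> list[int]:
    """Token positions at which activations are read (every `stride` tokens)."""
    return list(range(0, seq_len, stride))

def extract_trajectory(hidden_states: tuple[torch.Tensor, ...]) -> torch.Tensor:
    """Stack sampled activations from the chosen layers.

    `hidden_states` is the tuple returned by a forward pass with
    output_hidden_states=True: index 0 is the embedding output, so decoder
    layer l lives at index l + 1. Each entry is (batch, seq_len, d_model).
    """
    seq_len = hidden_states[0].shape[1]
    pos = sampled_positions(seq_len)
    # Result: (n_layers, n_sampled_positions, d_model)
    return torch.stack([hidden_states[l + 1][0, pos] for l in LAYERS])
```

The stacked trajectory is what gets injected into the oracle's context by the `activation_oracles` infrastructure.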
" ¶" (token ID 78846)| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| AO checkpoint | adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 1e-5 |
| Batch size | 4 (effective: 16 with grad accumulation) |
| Training examples | 211,122 |
| Total steps | ~13,195 (1 epoch) |
| Precision | bf16 |
| Hardware | NVIDIA H100 NVL 96GB |
| Training time | ~14 hours |
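The table above corresponds to a standard PEFT LoRA setup. A minimal sketch of the matching config (training loop omitted; `task_type` is an assumption, as is deriving steps from the example count):

```python
from peft import LoraConfig

# Values taken from the training-configuration table above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumed; not stated in the table
)

# Effective batch 16 = per-device batch 4 x 4 gradient-accumulation steps.
GRAD_ACCUM_STEPS = 16 // 4
# ~13,195 steps follows from 211,122 examples / effective batch 16, one epoch.
TOTAL_STEPS = 211_122 // 16
```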

Per-task results:

| Task | Examples | Final Token F1 |
|---|---|---|
| Full CoT reconstruction | 40,000 | 0.660 |
| Next step prediction | 30,000 | 0.435 |
| Answer prediction | 20,000 | 0.500 |
| Partial answer (vLLM) | 20,000 | 0.655 |
| Answer trajectory | 20,000 | 0.299 |
| Correctness classification | 15,000 | 0.840 |
| Decorative classification | 15,000 | 0.960 |
| Reasoning termination | 15,000 | 0.740 |
| Prompt inversion | 20,000 | 0.636 |
| Conversational QA | 10,000 | 0.442 |
| CompQA | 6,122 | 0.392 |
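The card does not define "Final Token F1". If it is the standard SQuAD-style token-overlap F1 between the predicted and reference final text (an assumption on our part), it can be computed as:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between two strings (assumed metric)."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```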

Evaluation results:

| Eval | Accuracy |
|---|---|
| Hinted MCQ (ARC-Challenge) | 0.800 |
| Hinted MCQ (TruthfulQA) | 0.650 |
| Sycophancy v2 | 0.400 |
| Decorative CoT | 0.500 |
| Sentence Insertion | 0.567 |
| Atypical Answer (MCQ) | 0.550 |
| Atypical Answer (Riya) | 0.600 |
| Cybercrime OOD | 0.950 |
| Mean accuracy | 0.557 |
This adapter requires the Activation Oracle infrastructure from `activation_oracles` for activation injection. Loading the adapter itself needs only `peft` and `transformers`:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(
    base_model, "ceselder/cot-oracle-ablation-stride5-3layers"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
Based on: