---
base_model: Qwen/Qwen3-8B
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
tags:
- base_model:adapter:Qwen/Qwen3-8B
- lora
- transformers
- activation-oracle
- cot-monitoring
- interpretability
---
# CoT Oracle Ablation: Stride=5, 3 Layers (9, 18, 27)
LoRA adapter for Qwen/Qwen3-8B trained as a CoT (chain-of-thought) trajectory oracle. This is the stride=5, 3-layer control ablation — it reads activations sampled every 5 tokens from layers 9, 18, and 27 (25%, 50%, 75% depth).
Base AO checkpoint: adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B
## What This Model Does
The oracle takes activation trajectories extracted during CoT generation and classifies/describes what actually influenced the reasoning. It can:
- Reconstruct full CoT from stride activations (token F1: 0.660)
- Predict next reasoning steps (token F1: 0.435)
- Predict final answers from partial CoT (token F1: 0.500)
- Classify correctness of reasoning (token F1: 0.840)
- Classify decorative vs load-bearing CoT (token F1: 0.960)
- Predict reasoning termination (token F1: 0.740)
- Reconstruct original prompts from activations (token F1: 0.636)
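The "token F1" metric quoted above is assumed to be the standard bag-of-tokens F1 (SQuAD-style): precision and recall over tokens shared between the model's output and the reference. A minimal sketch under that assumption:

```python
# Assumed metric: bag-of-tokens F1, as used in SQuAD-style evaluation.
# Whitespace tokenization here is illustrative; the actual evaluation
# may use the model tokenizer instead.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_toks = prediction.split()
    ref_toks = reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "the answer is 7"))  # 0.75
```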
## Architecture
- Injection method: Norm-matched addition at layer 1
- Placeholder token: `" ¶"` (token ID 78846)
- Activation layers: 9, 18, 27 (25%, 50%, 75% of 36 layers)
- Stride: Every 5 tokens through the CoT
- Position encoding: None (this is the no-PE control)
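For intuition, here is a rough sketch of how a stride-5 trajectory over layers 9, 18, and 27 could be assembled. The function name and tensor layout are illustrative assumptions, not the actual `activation_oracles` pipeline:

```python
# Illustrative sketch (not the actual activation_oracles code): sample
# every stride-th token's hidden state from each of the three layers
# and concatenate them into one trajectory tensor.
import torch

def collect_stride_activations(hidden_states_per_layer, layer_indices, stride):
    """hidden_states_per_layer: dict layer_idx -> (seq_len, d_model) tensor.
    Returns a (num_layers * num_sampled_positions, d_model) tensor."""
    chunks = []
    for layer in layer_indices:
        h = hidden_states_per_layer[layer]  # (seq_len, d_model)
        chunks.append(h[::stride])          # every stride-th token position
    return torch.cat(chunks, dim=0)

# Toy example: a 20-token CoT with d_model=8 and random stand-in activations.
seq_len, d_model = 20, 8
acts = {l: torch.randn(seq_len, d_model) for l in (9, 18, 27)}
traj = collect_stride_activations(acts, (9, 18, 27), stride=5)
# 4 sampled positions per layer x 3 layers -> shape (12, 8)
```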
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| AO checkpoint | adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 1e-5 |
| Batch size | 4 (effective: 16 with grad accumulation) |
| Training examples | 211,122 |
| Total steps | ~13,195 (1 epoch) |
| Precision | bf16 |
| Hardware | NVIDIA H100 NVL 96GB |
| Training time | ~14 hours |
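The table's LoRA hyperparameters correspond to a `peft` config along these lines (a sketch; optimizer and scheduler settings are not listed here and are omitted):

```python
# Sketch of a peft LoraConfig matching the table above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```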
## Training Tasks (11 tasks)
| Task | Examples | Final Token F1 |
|---|---|---|
| Full CoT reconstruction | 40,000 | 0.660 |
| Next step prediction | 30,000 | 0.435 |
| Answer prediction | 20,000 | 0.500 |
| Partial answer (vLLM) | 20,000 | 0.655 |
| Answer trajectory | 20,000 | 0.299 |
| Correctness classification | 15,000 | 0.840 |
| Decorative classification | 15,000 | 0.960 |
| Reasoning termination | 15,000 | 0.740 |
| Prompt inversion | 20,000 | 0.636 |
| Conversational QA | 10,000 | 0.442 |
| CompQA | 6,122 | 0.392 |
## Unfaithfulness Eval Results (Step 13160)
| Eval | Accuracy |
|---|---|
| Hinted MCQ (ARC-Challenge) | 0.800 |
| Hinted MCQ (TruthfulQA) | 0.650 |
| Sycophancy v2 | 0.400 |
| Decorative CoT | 0.500 |
| Sentence Insertion | 0.567 |
| Atypical Answer (MCQ) | 0.550 |
| Atypical Answer (Riya) | 0.600 |
| Cybercrime OOD | 0.950 |
| Mean accuracy | 0.557 |
## W&B Run
## Usage
This adapter requires the Activation Oracle infrastructure from the `activation_oracles` repository for activation injection.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "ceselder/cot-oracle-ablation-stride5-3layers")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
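For intuition on the injection step, here is a minimal sketch of the "norm-matched addition" described under Architecture: the external activation is rescaled to the norm of the resident hidden state at the placeholder position and then added to it. The function name is illustrative, not the actual `activation_oracles` API:

```python
# Illustrative sketch of norm-matched addition (not the library's API):
# rescale the injected activation to the resident hidden state's norm,
# then add it at the placeholder token's position in layer 1.
import torch

def norm_matched_add(hidden: torch.Tensor, activation: torch.Tensor) -> torch.Tensor:
    """hidden, activation: (d_model,) tensors. Returns hidden plus the
    activation rescaled so its norm matches ||hidden||."""
    scale = hidden.norm() / (activation.norm() + 1e-8)
    return hidden + activation * scale

h = torch.randn(8)          # resident hidden state at the placeholder token
a = torch.randn(8) * 10.0   # injected activation, arbitrary scale
out = norm_matched_add(h, a)
# the added component (out - h) has the same norm as h
```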
## Citation
Based on:
- Activation Oracles (Karvonen et al., 2024): https://arxiv.org/abs/2512.15674
- Thought Anchors (Bogdan et al., 2025): https://arxiv.org/abs/2506.19143
## Framework Versions
- PEFT 0.18.1
- Transformers (latest)
- PyTorch 2.x