--- language: - en license: apache-2.0 library_name: peft base_model: Qwen/Qwen3-8B tags: - activation-oracle - chain-of-thought - interpretability - mechanistic-interpretability - lora - qwen3 - reasoning - cot - unfaithfulness-detection datasets: - ceselder/cot-oracle-data pipeline_tag: text-generation model-index: - name: cot-oracle-v4-8b results: - task: type: text-generation name: Domain Classification (from activations) metrics: - type: accuracy value: 98 name: Exact Match Accuracy - task: type: text-generation name: Correctness Prediction (from activations) metrics: - type: accuracy value: 90 name: Exact Match Accuracy --- # CoT Oracle v4 (Qwen3-8B LoRA) A **chain-of-thought activation oracle**: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed. This is a continuation of the [Activation Oracles](https://github.com/adamkarvonen/activation_oracles) line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations. ## Model Description An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the **same model** as the source -- Qwen3-8B reads Qwen3-8B's activations. CoT Oracle v4 specializes in reading activations extracted at **sentence boundary positions** during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can: - **Classify the reasoning domain** (math, science, logic, commonsense, reading comprehension, multi-domain, medical) - **Predict whether the CoT reached the correct answer** - **Detect decorative reasoning** (steps that don't contribute to the answer) - **Predict surrounding token context** from arbitrary positions ### Key Properties - The oracle reads activations, not text. It has no access to the CoT tokens themselves. - Activations are collected with LoRA **disabled** (pure base model representations). - Activations are injected via **norm-matched addition** at layer 1, preserving the scale of the residual stream. - The oracle generates with LoRA **enabled** (the trained adapter interprets the injected activations). ## Training ### Base Checkpoint Training continues from [`adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B`](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B), an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks. ### LoRA Configuration | Parameter | Value | |-----------|-------| | Rank | 64 | | Alpha | 128 | | Dropout | 0.05 | | Target modules | all-linear | ### Training Tasks Six tasks mixed together and shuffled into a single training run: | # | Task | Examples | Layers | Description | |---|------|----------|--------|-------------| | 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. | | 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. | | 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. | | 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. | | 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. | | 6 | Persona detection | -- | -- | Skipped due to data issue. | **Total: ~221K training examples, 1 epoch.** Sentence-structured tasks (2-6) extract activations at 3 layers per boundary position: - **Layer 9** (25% depth) - **Layer 18** (50% depth) - **Layer 27** (75% depth) Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs. ### Training Details | Parameter | Value | |-----------|-------| | Hardware | 1x NVIDIA H100 80GB | | Precision | bf16 | | Batch size | 8 | | Learning rate | 1e-5 | | Steps | 31,300 | | Training time | ~4.5 hours | | Optimizer | AdamW | | Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 | ### Activation Injection Activations are injected at layer 1 via norm-matched addition: ``` h' = h + ||h|| * (v / ||v||) ``` where `h` is the original hidden state and `v` is the collected activation vector. This preserves the norm of the residual stream while adding directional information from the source activations. The placeholder token is `" ?"` (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: `" @"` (L25%), `" ?"` (L50%), `" #"` (L75%), cycling in that order. ### Corpus The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. CoTs were generated via OpenRouter API. ## Evaluation Results Evaluated on held-out data using exact string match: | Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary | |------|--------|-------------|------------|---------------|--------------|---------| | 500 | 66% | 53% | 50% | 0% | 4% | 0% | | 5,000 | **100%** | 86% | 67% | 4% | 7% | 0% | | 10,000 | 97% | 85% | 50% | 7% | 9% | 0% | | 20,000 | 98% | 82% | 62% | 10% | 9% | 0% | | 28,000 | **98%** | **90%** | 50% | 11% | 7% | 0% | **Key observations:** - **Domain classification** reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone. - **Correctness prediction** reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer. - **Decorative detection** is noisy (bounces between 50-71%) due to limited eval data (74 unique both-correct entries). - **Context prediction** stays low (7-11%) under exact string match but this is expected -- the pretrained AO checkpoint already handles this task and exact match is a harsh metric for free-text prediction. - **Summary** remains at 0% (labels were all identical in training data -- known issue). Experiment tracking: [wandb `cot_oracle` project, run `cot_oracle_v4_mixed`](https://wandb.ai) ## Usage ### Requirements This model requires the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) library for the activation collection and injection infrastructure. ```bash git clone https://github.com/adamkarvonen/activation_oracles cd activation_oracles && pip install -e . ``` ### Loading the Model ```python from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel # Load base model model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B") # Load oracle adapter model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b") ``` ### Collecting Activations Activations must be collected from the **base model** (LoRA disabled) at the target layers: ```python import torch # Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers) LAYERS = [9, 18, 27] # 1. Prepare input: question + CoT response messages = [{"role": "user", "content": question}] prompt = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, enable_thinking=True, ) full_text = prompt + cot_response # 2. Find sentence boundary positions in token space input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"] # boundary_positions = [...] (token indices at sentence boundaries) # 3. Collect activations with LoRA DISABLED with model.disable_adapter(): activations = {} # {layer: {position: tensor}} # Use hooks on model.model.layers[layer] to capture hidden states # at each boundary position for each layer ``` ### Running the Oracle ```python # 4. Build oracle prompt with placeholder tokens # For multi-layer: interleave " @", " ?", " #" per boundary n_boundaries = len(boundary_positions) placeholders = " @ ? #" * n_boundaries # 3 tokens per boundary layer_str = ", ".join(str(l) for l in LAYERS) oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n" oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical." # 5. Format as chat and tokenize messages = [{"role": "user", "content": oracle_prompt}] formatted = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, enable_thinking=False, ) # 6. Inject activations via norm-matched addition at layer 1 # At each placeholder position, add the corresponding activation: # positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...] # Injection: h' = h + ||h|| * (v / ||v||) # 7. Generate with LoRA ENABLED (default state) output = model.generate(input_ids, max_new_tokens=64) ``` For complete working code, see the [cot-oracle repository](https://github.com/ceselder/cot-oracle), particularly `src/signs_of_life/ao_lib.py` for the injection mechanism and `src/train_mixed.py` for the full training pipeline. ## Intended Use This model is a **research artifact** for studying chain-of-thought interpretability. Intended uses include: - Investigating what information is encoded in CoT activations at different stages of reasoning - Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation) - Building tools for mechanistic understanding of language model reasoning ### Limitations - **Same-model only**: The oracle can only read activations from Qwen3-8B. It will not work with other models. - **Exact match eval is harsh**: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs. - **Decorative detection is undertrained**: Only ~500 unique training examples; results are noisy. - **Summary task is broken**: All 200 training labels were identical, so the model learned nothing useful for this task. - **No uncertainty calibration**: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024. ## Citation ```bibtex @misc{cot-oracle-v4, title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories}, author={Celeste Deschamps-Helaere}, year={2026}, url={https://github.com/ceselder/cot-oracle} } ``` ### Related Work ```bibtex @article{karvonen2024activation, title={Activation Oracles}, author={Karvonen, Adam and others}, journal={arXiv preprint arXiv:2512.15674}, year={2024} } @article{bogdan2025thought, title={Thought Anchors: Causal Importance of CoT Sentences}, author={Bogdan, Paul and others}, journal={arXiv preprint arXiv:2506.19143}, year={2025} } @article{macar2025thought, title={Thought Branches: Studying CoT through Trajectory Distribution}, author={Macar, Uzay and Bogdan, Paul and others}, journal={arXiv preprint arXiv:2510.27484}, year={2025} } ``` ## Links - **Code**: [github.com/ceselder/cot-oracle](https://github.com/ceselder/cot-oracle) - **Training data**: [huggingface.co/datasets/ceselder/cot-oracle-data](https://huggingface.co/datasets/ceselder/cot-oracle-data) - **Base AO checkpoint**: [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) - **Activation Oracles repo**: [github.com/adamkarvonen/activation_oracles](https://github.com/adamkarvonen/activation_oracles) - **Experiment tracking**: wandb `cot_oracle` project, run `cot_oracle_v4_mixed`