|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: peft |
|
|
base_model: Qwen/Qwen3-8B |
|
|
tags: |
|
|
- activation-oracle |
|
|
- chain-of-thought |
|
|
- interpretability |
|
|
- mechanistic-interpretability |
|
|
- lora |
|
|
- qwen3 |
|
|
- reasoning |
|
|
- cot |
|
|
- unfaithfulness-detection |
|
|
datasets: |
|
|
- ceselder/cot-oracle-data |
|
|
pipeline_tag: text-generation |
|
|
model-index: |
|
|
- name: cot-oracle-v4-8b |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Domain Classification (from activations) |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 98 |
|
|
name: Exact Match Accuracy |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Correctness Prediction (from activations) |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 90 |
|
|
name: Exact Match Accuracy |
|
|
--- |
|
|
|
|
|
# CoT Oracle v4 (Qwen3-8B LoRA) |
|
|
|
|
|
A **chain-of-thought activation oracle**: a LoRA fine-tune of Qwen3-8B that reads the model's own internal activations at sentence boundaries during chain-of-thought reasoning and answers natural-language questions about what was computed. |
|
|
|
|
|
This is a continuation of the [Activation Oracles](https://github.com/adamkarvonen/activation_oracles) line of work (Karvonen et al., 2024), extended to operate over structured CoT trajectories rather than single-position activations. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
An activation oracle is a language model fine-tuned to accept its own internal activations as additional input and answer questions about them. The oracle is the **same model** as the source -- Qwen3-8B reads Qwen3-8B's activations. |
|
|
|
|
|
CoT Oracle v4 specializes in reading activations extracted at **sentence boundary positions** during chain-of-thought reasoning. Given activations from 3 layers (25%, 50%, 75% depth) at each sentence boundary, the oracle can: |
|
|
|
|
|
- **Classify the reasoning domain** (math, science, logic, commonsense, reading comprehension, multi-domain, medical) |
|
|
- **Predict whether the CoT reached the correct answer** |
|
|
- **Detect decorative reasoning** (steps that don't contribute to the answer) |
|
|
- **Predict surrounding token context** from arbitrary positions |
|
|
|
|
|
### Key Properties |
|
|
|
|
|
- The oracle reads activations, not text. It has no access to the CoT tokens themselves. |
|
|
- Activations are collected with LoRA **disabled** (pure base model representations). |
|
|
- Activations are injected via **norm-matched addition** at layer 1, preserving the scale of the residual stream. |
|
|
- The oracle generates with LoRA **enabled** (the trained adapter interprets the injected activations). |
|
|
|
|
|
## Training |
|
|
|
|
|
### Base Checkpoint |
|
|
|
|
|
Training continues from [`adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B`](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B), an activation oracle pretrained on ~1M examples of context prediction, classification, and past-lens tasks. |
|
|
|
|
|
### LoRA Configuration |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Rank | 64 | |
|
|
| Alpha | 128 | |
|
|
| Dropout | 0.05 | |
|
|
| Target modules | all-linear | |
|
|
|
|
|
### Training Tasks |
|
|
|
|
|
Six tasks mixed together and shuffled into a single training run: |
|
|
|
|
|
| # | Task | Examples | Layers | Description | |
|
|
|---|------|----------|--------|-------------| |
|
|
| 1 | Context prediction (random) | 100K | 1 random layer | Predict surrounding tokens at random positions. Standard AO pretraining format. | |
|
|
| 2 | Context prediction (sentences) | 30K (x2 = 60K) | 3 layers at boundaries | Predict tokens near sentence boundaries. Each example doubled: once with 3 layers, once with L50% only. | |
|
|
| 3 | Decorative detection | 500 (x2 = 1K) | 3 layers at boundaries | Binary classification: `load_bearing` or `decorative`. | |
|
|
| 4 | Domain classification | 15K (x2 = 30K) | 3 layers at boundaries | 7-class: `math`, `science`, `logic`, `commonsense`, `reading`, `multi_domain`, `medical`. | |
|
|
| 5 | Correctness prediction | 15K (x2 = 30K) | 3 layers at boundaries | Binary: `correct` or `incorrect`. | |
|
|
| 6 | Persona detection | -- | -- | Skipped due to a data issue. |
|
|
|
|
|
**Total: ~221K training examples, 1 epoch.** |
|
|
|
|
|
Sentence-structured tasks (2-5) extract activations at 3 layers per boundary position:
|
|
- **Layer 9** (25% depth) |
|
|
- **Layer 18** (50% depth) |
|
|
- **Layer 27** (75% depth) |
|
|
|
|
|
Each sentence-structured example is duplicated: once with all 3 layers interleaved, once with only L50%. This teaches the oracle to work with both multi-layer and single-layer inputs. |
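
The duplication scheme can be sketched as follows (illustrative only; `duplicate_for_layers` and the field names are hypothetical, not the dataset's actual schema):

```python
LAYERS = [9, 18, 27]  # 25%, 50%, 75% depth of Qwen3-8B

def duplicate_for_layers(example):
    # Emit each sentence-structured example twice:
    # once with all 3 layers, once with only the L50% layer.
    multi = {**example, "layers": LAYERS}
    single = {**example, "layers": [LAYERS[1]]}
    return [multi, single]
```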
|
|
|
|
|
### Training Details |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Hardware | 1x NVIDIA H100 80GB | |
|
|
| Precision | bf16 | |
|
|
| Batch size | 8 | |
|
|
| Learning rate | 1e-5 | |
|
|
| Steps | 31,300 | |
|
|
| Training time | ~4.5 hours | |
|
|
| Optimizer | AdamW | |
|
|
| Framework | PyTorch 2.7 + PEFT 0.17 + Transformers 4.55 | |
|
|
|
|
|
### Activation Injection |
|
|
|
|
|
Activations are injected at layer 1 via norm-matched addition: |
|
|
|
|
|
``` |
|
|
h' = h + ||h|| * (v / ||v||) |
|
|
``` |
|
|
|
|
|
where `h` is the original hidden state and `v` is the collected activation vector. This preserves the norm of the residual stream while adding directional information from the source activations. |
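
In plain Python over lists, the operation looks like this (an illustrative sketch; the actual implementation in `src/signs_of_life/ao_lib.py` operates on torch tensors):

```python
import math

def norm_matched_add(h, v):
    # h' = h + ||h|| * (v / ||v||): inject the direction of v
    # at the scale of the residual stream h.
    h_norm = math.sqrt(sum(x * x for x in h))
    v_norm = math.sqrt(sum(x * x for x in v))
    return [h_i + h_norm * (v_i / v_norm) for h_i, v_i in zip(h, v)]

# norm_matched_add([3.0, 4.0], [0.0, 2.0]) -> [3.0, 9.0]
# (||h|| = 5, v / ||v|| = [0, 1], so h' = [3 + 0, 4 + 5])
```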
|
|
|
|
|
The placeholder token is `" ?"` (token ID 937). For multi-layer inputs, per-layer placeholder tokens are used: `" @"` (L25%), `" ?"` (L50%), `" #"` (L75%), cycling in that order. |
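
The cycling order can be sketched as follows (`placeholder_schedule` is a hypothetical helper, not part of the library):

```python
LAYERS = [9, 18, 27]
PLACEHOLDERS = [" @", " ?", " #"]  # L25%, L50%, L75%, cycling in that order

def placeholder_schedule(n_boundaries):
    # Injection order: for each sentence boundary, one placeholder per layer.
    return [(b, layer, tok)
            for b in range(n_boundaries)
            for layer, tok in zip(LAYERS, PLACEHOLDERS)]
```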
|
|
|
|
|
### Corpus |
|
|
|
|
|
The training corpus consists of CoT traces generated by Qwen3-8B across 12 reasoning benchmarks: MATH, GSM8K, GPQA, BBH, ARC, StrategyQA, DROP, LogiQA, MMLU-Pro, CommonsenseQA, AQUA-RAT, and MedQA. CoTs were generated via the OpenRouter API.
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluated on held-out data using exact string match: |
|
|
|
|
|
| Step | Domain | Correctness | Decorative | Sentence Pred | Context Pred | Summary | |
|
|
|------|--------|-------------|------------|---------------|--------------|---------| |
|
|
| 500 | 66% | 53% | 50% | 0% | 4% | 0% | |
|
|
| 5,000 | **100%** | 86% | 67% | 4% | 7% | 0% | |
|
|
| 10,000 | 97% | 85% | 50% | 7% | 9% | 0% | |
|
|
| 20,000 | 98% | 82% | 62% | 10% | 9% | 0% | |
|
|
| 28,000 | **98%** | **90%** | 50% | 11% | 7% | 0% | |
|
|
|
|
|
**Key observations:** |
|
|
|
|
|
- **Domain classification** reaches 98-100% accuracy -- the oracle reliably identifies the reasoning domain from activations alone. |
|
|
- **Correctness prediction** reaches 90% -- the oracle can tell whether the model's reasoning led to the right answer without seeing the answer. |
|
|
- **Decorative detection** is noisy (bouncing between 50% and 71% across checkpoints) due to limited eval data (74 unique both-correct entries).
|
|
- **Context prediction** stays low (7-11%) under exact string match, but this is expected: the pretrained AO checkpoint already handles this task, and exact match is a harsh metric for free-text prediction.
|
|
- **Summary** remains at 0% (labels were all identical in training data -- known issue). |
|
|
|
|
|
Experiment tracking: [wandb `cot_oracle` project, run `cot_oracle_v4_mixed`](https://wandb.ai) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
This model requires the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) library for the activation collection and injection infrastructure. |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/adamkarvonen/activation_oracles |
|
|
cd activation_oracles && pip install -e . |
|
|
``` |
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python |
|
|
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
from peft import PeftModel |
|
|
|
|
|
# Load base model |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"Qwen/Qwen3-8B", |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B") |
|
|
|
|
|
# Load oracle adapter |
|
|
model = PeftModel.from_pretrained(model, "ceselder/cot-oracle-v4-8b") |
|
|
``` |
|
|
|
|
|
### Collecting Activations |
|
|
|
|
|
Activations must be collected from the **base model** (LoRA disabled) at the target layers: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
# Layers at 25%, 50%, 75% depth of Qwen3-8B (36 layers) |
|
|
LAYERS = [9, 18, 27] |
|
|
|
|
|
# 1. Prepare input: question + CoT response |
|
|
messages = [{"role": "user", "content": question}] |
|
|
prompt = tokenizer.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True, |
|
|
enable_thinking=True, |
|
|
) |
|
|
full_text = prompt + cot_response |
|
|
|
|
|
# 2. Find sentence boundary positions in token space |
|
|
input_ids = tokenizer(full_text, return_tensors="pt")["input_ids"] |
|
|
# boundary_positions = [...] (token indices at sentence boundaries) |
|
|
|
|
|
# 3. Collect activations with LoRA DISABLED |
|
|
with model.disable_adapter(): |
|
|
activations = {} # {layer: {position: tensor}} |
|
|
# Use hooks on model.model.layers[layer] to capture hidden states |
|
|
# at each boundary position for each layer |
|
|
``` |
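
One way to fill in the `boundary_positions` step is to locate sentence-final punctuation in character space and map it to token indices via the tokenizer's offset mapping. A naive sketch (the repo's pipeline may segment sentences differently; both helper functions here are hypothetical):

```python
import re

def sentence_boundary_chars(text):
    # Character index just past each sentence-final punctuation mark.
    # Naive: also fires on decimals and abbreviations.
    return [m.end() for m in re.finditer(r"[.!?]", text)]

def chars_to_token_positions(char_positions, offsets):
    # offsets: (start, end) character span per token, e.g. from
    # tokenizer(..., return_offsets_mapping=True).
    positions = []
    for cp in char_positions:
        for tok_idx, (start, end) in enumerate(offsets):
            if start < cp <= end:
                positions.append(tok_idx)
                break
    return positions
```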
|
|
|
|
|
### Running the Oracle |
|
|
|
|
|
```python |
|
|
# 4. Build oracle prompt with placeholder tokens |
|
|
# For multi-layer: interleave " @", " ?", " #" per boundary |
|
|
n_boundaries = len(boundary_positions) |
|
|
placeholders = " @ ? #" * n_boundaries # 3 tokens per boundary |
|
|
|
|
|
layer_str = ", ".join(str(l) for l in LAYERS) |
|
|
oracle_prompt = f"Layer: {layer_str}\n{placeholders.strip()} \n" |
|
|
oracle_prompt += "What domain of reasoning is this? Answer with one word: math, science, logic, commonsense, reading, multi_domain, or medical." |
|
|
|
|
|
# 5. Format as chat and tokenize |
|
|
messages = [{"role": "user", "content": oracle_prompt}] |
|
|
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
input_ids = tokenizer(formatted, return_tensors="pt")["input_ids"].to(model.device)
|
|
|
|
|
# 6. Inject activations via norm-matched addition at layer 1 |
|
|
# At each placeholder position, add the corresponding activation: |
|
|
# positions cycle through [L25_s1, L50_s1, L75_s1, L25_s2, L50_s2, L75_s2, ...] |
|
|
# Injection: h' = h + ||h|| * (v / ||v||) |
|
|
|
|
|
# 7. Generate with LoRA ENABLED (default state) |
|
|
output = model.generate(input_ids, max_new_tokens=64) |
|
|
``` |
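
Before injection, the placeholder positions in the tokenized oracle prompt must be located so that the i-th placeholder receives activation (boundary `i // 3`, layer `LAYERS[i % 3]`). A minimal sketch (`find_placeholder_positions` is a hypothetical helper, not the repo's API; per the card, `" ?"` is token ID 937, while the other IDs below are made up for illustration):

```python
def find_placeholder_positions(input_ids, placeholder_ids):
    # Token positions, in order, where any per-layer placeholder id occurs.
    targets = set(placeholder_ids)
    return [i for i, tid in enumerate(input_ids) if tid in targets]

# With " ?" = 937 and illustrative ids 555 (" @") and 777 (" #"):
# positions cycle [L25_s1, L50_s1, L75_s1, L25_s2, ...] per boundary.
```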
|
|
|
|
|
For complete working code, see the [cot-oracle repository](https://github.com/ceselder/cot-oracle), particularly `src/signs_of_life/ao_lib.py` for the injection mechanism and `src/train_mixed.py` for the full training pipeline. |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is a **research artifact** for studying chain-of-thought interpretability. Intended uses include: |
|
|
|
|
|
- Investigating what information is encoded in CoT activations at different stages of reasoning |
|
|
- Detecting unfaithful chain-of-thought (reasoning that doesn't match the model's actual computation) |
|
|
- Building tools for mechanistic understanding of language model reasoning |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- **Same-model only**: The oracle can only read activations from Qwen3-8B. It will not work with other models. |
|
|
- **Exact match eval is harsh**: Tasks like context prediction and summary show low scores under exact string match, but the model often produces semantically reasonable outputs. |
|
|
- **Decorative detection is undertrained**: Only ~500 unique training examples; results are noisy. |
|
|
- **Summary task is broken**: All 200 training labels were identical, so the model learned nothing useful for this task. |
|
|
- **No uncertainty calibration**: The oracle is confidently wrong sometimes, consistent with findings from Karvonen et al., 2024. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{cot-oracle-v4, |
|
|
title={CoT Oracle: Detecting Unfaithful Chain-of-Thought via Activation Trajectories}, |
|
|
author={Celeste Deschamps-Helaere}, |
|
|
year={2026}, |
|
|
url={https://github.com/ceselder/cot-oracle} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Related Work |
|
|
|
|
|
```bibtex |
|
|
@article{karvonen2024activation, |
|
|
title={Activation Oracles}, |
|
|
author={Karvonen, Adam and others}, |
|
|
journal={arXiv preprint arXiv:2512.15674}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{bogdan2025thought, |
|
|
title={Thought Anchors: Causal Importance of CoT Sentences}, |
|
|
author={Bogdan, Paul and others}, |
|
|
journal={arXiv preprint arXiv:2506.19143}, |
|
|
year={2025} |
|
|
} |
|
|
|
|
|
@article{macar2025thought, |
|
|
title={Thought Branches: Studying CoT through Trajectory Distribution}, |
|
|
author={Macar, Uzay and Bogdan, Paul and others}, |
|
|
journal={arXiv preprint arXiv:2510.27484}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Code**: [github.com/ceselder/cot-oracle](https://github.com/ceselder/cot-oracle) |
|
|
- **Training data**: [huggingface.co/datasets/ceselder/cot-oracle-data](https://huggingface.co/datasets/ceselder/cot-oracle-data) |
|
|
- **Base AO checkpoint**: [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B) |
|
|
- **Activation Oracles repo**: [github.com/adamkarvonen/activation_oracles](https://github.com/adamkarvonen/activation_oracles) |
|
|
- **Experiment tracking**: wandb `cot_oracle` project, run `cot_oracle_v4_mixed` |
|
|
|