Paper: Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (arXiv:2512.15674)
Checkpoints from the Cocoracle experiment, which interprets what a model "thinks" during latent reasoning.
Cocoracle combines Coconut (Chain of Continuous Thought) with Activation Oracles to train models that answer natural-language questions about the hidden states of their own latent chain of thought.
- `stage3_alllatent.pt`: GPT-2-large (774M) fine-tuned with the Coconut curriculum to perform multi-digit addition using entirely latent reasoning. The vocabulary is extended with four special tokens (`<bot>`, `<sep>`, `<eot>`, `<act>`).
- `self_oracle_alllatent.pt`: the Coconut model further fine-tuned to interpret its own latent reasoning activations via norm-matched injection at layer 17.
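```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Start from the stock GPT-2 tokenizer and register the four special tokens
# the checkpoints were trained with.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<bot>", "<sep>", "<eot>", "<act>"]
})

# Build the base architecture and grow the embedding matrix so its shape
# matches the checkpoint before the weights are loaded.
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.resize_token_embeddings(len(tokenizer))

# Load the fine-tuned weights on CPU.
state = torch.load("stage3_alllatent.pt", map_location="cpu")
model.load_state_dict(state)
```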
See the GitHub repo for full code and an interactive demo (scripts/interactive.py).
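The norm-matched injection used by `self_oracle_alllatent.pt` is not spelled out in this card. The sketch below shows one plausible reading in plain Python: the latent activation is rescaled to the L2 norm of the hidden state at the `<act>` position and then added to it. The function names, the additive rule, and the `eps` constant are assumptions for illustration. The repository code is authoritative and would operate on tensors at layer 17 rather than on Python lists.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def norm_match(activation, reference, eps=1e-8):
    """Rescale `activation` so its L2 norm matches that of `reference`."""
    scale = l2_norm(reference) / (l2_norm(activation) + eps)
    return [scale * x for x in activation]


def inject(hidden, activation):
    """Hypothetical injection rule (assumption): add the norm-matched
    activation to the hidden state at the <act> position."""
    matched = norm_match(activation, hidden)
    return [h + m for h, m in zip(hidden, matched)]


hidden = [3.0, 4.0]        # hidden state at the <act> position, norm 5
activation = [0.0, 10.0]   # latent activation to interpret, norm 10
injected = inject(hidden, activation)
```

Matching norms keeps the injected vector on the same scale as the surrounding residual stream, so it neither swamps the hidden state nor vanishes against it.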
| Configuration | CoT Exact Match | CoT Token F1 | AO Val Loss |
|---|---|---|---|
| Separate AO (GPT-2-small + LoRA) | 0% | 26.4% | 2.92 |
| Self-oracle, GPT-2-small | 0% | 32.5% | 1.98 |
| Self-oracle, GPT-2-large, stage 1 | 0% | 25.6% | 1.10 |
| Self-oracle, GPT-2-large, all-latent | 6.9% | 34.2% | 0.55 |
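The card does not define how the two CoT metrics in the table are computed. A minimal sketch, assuming whitespace tokenization and the common bag-of-tokens F1 used in question-answering evaluation, could look like this. The function names and the tokenization are assumptions, and the repository's own evaluation code may differ.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True when the two strings are identical after trimming whitespace."""
    return prediction.strip() == reference.strip()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction that gets most tokens right but misses a single digit scores zero on exact match while still earning partial F1, which is consistent with rows that show 0% exact match alongside a nonzero token F1.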
Base model: openai-community/gpt2-large