Confidence Cartography — using Pythia's token probabilities as a false-belief sensor
Hi — I recently published a preprint and pip-installable toolkit that uses teacher-forced confidence extraction on causal LMs, with Pythia as the primary model family.
The finding: the ratio of Pythia's token-level confidence on widely believed falsehoods to its confidence on the corrected versions correlates with human false-belief prevalence from a YouGov survey (Spearman ρ = 0.652, p = 0.016 at 6.9B). The effect scales monotonically from 160M through 12B, and the same signal detects medical misinformation with 88% accuracy at the 6.9B scale.
The full Pythia scaling curve (160M → 12B) is in the paper — every model size in the suite was tested.
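For readers who want to see what teacher-forced confidence scoring looks like under the hood, here is a minimal sketch. The helper names (`sequence_logprob`, `confidence_ratio`) are mine for illustration, not the toolkit's API; the sketch assumes a Hugging Face causal LM whose forward pass returns `logits`.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, input_ids):
    # Teacher forcing: the logits at position t are the model's prediction
    # for token t+1, so shift by one and sum the log-probabilities the
    # model assigned to the tokens actually observed.
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    target = input_ids[:, 1:]
    token_logp = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)

def confidence_ratio(model, tokenizer, false_text, true_text):
    # Hypothetical helper: ratio of the model's confidence on the
    # widely believed falsehood vs. the corrected statement.
    scores = []
    for text in (false_text, true_text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        scores.append(sequence_logprob(logits, ids).item())
    # A log-prob difference corresponds to a ratio in probability space.
    return torch.exp(torch.tensor(scores[0] - scores[1])).item()
```

A ratio above 1 would mean the model is more confident in the falsehood than in the correct version; the paper's claim is that this quantity tracks how widespread the false belief is among humans.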
Reproduce in 3 lines:
```python
import confidence_cartography as cc

results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(results)
# MandelaEvaluation(rho=0.652, p=0.016, n=9)
```
Links:
- Toolkit: confidence-cartography-toolkit
- Paper + full experiments: confidence-cartography
- Preprint DOI: 10.5281/zenodo.18703506
- HuggingFace model card: bsanch52/confidence-cartography
Pythia was the ideal model family for this work because of its consistent architecture across scales and its deduplicated training data. Thanks to the EleutherAI team for making these models available.