Confidence Cartography — using Pythia's token probabilities as a false-belief sensor
Hi — I recently published a preprint and pip-installable toolkit that uses teacher-forced confidence extraction on causal LMs, with Pythia as the primary model family.
The finding: the ratio of Pythia's token-level confidence on widely believed falsehoods to its confidence on the corrected versions correlates with human false-belief prevalence from a YouGov survey (Spearman ρ = 0.652, p = 0.016 at 6.9B). The effect scales monotonically from 160M through 12B, and the same signal detects medical misinformation with 88% accuracy at the 6.9B scale.
The full Pythia scaling curve (160M → 12B) is in the paper — every model size in the suite was tested.
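For readers who want to see what teacher-forced confidence scoring looks like under the hood, here is a minimal sketch. The helper names (`sequence_logprob`, `confidence_ratio`) are mine for illustration, not the toolkit's API; the sketch assumes a Hugging Face causal LM whose forward pass returns `logits`.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, input_ids):
    # Teacher forcing: the logits at position t are the model's prediction
    # for token t+1, so shift by one and sum the log-probabilities the
    # model assigned to the tokens actually observed.
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    target = input_ids[:, 1:]
    token_logp = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)

def confidence_ratio(model, tokenizer, false_text, true_text):
    # Hypothetical helper: ratio of the model's confidence on the
    # widely believed falsehood vs. the corrected statement.
    scores = []
    for text in (false_text, true_text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        scores.append(sequence_logprob(logits, ids).item())
    # A log-prob difference corresponds to a ratio in probability space.
    return torch.exp(torch.tensor(scores[0] - scores[1])).item()
```

A ratio above 1 would mean the model is more confident in the falsehood than in the correct version; the paper's claim is that this quantity tracks how widespread the false belief is among humans.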
Reproduce in 3 lines:
```python
import confidence_cartography as cc

results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(results)
# MandelaEvaluation(rho=0.652, p=0.016, n=9)
```
Links:
- Toolkit: confidence-cartography-toolkit
- Paper + full experiments: confidence-cartography
- Preprint DOI: 10.5281/zenodo.18703506
- HuggingFace model card: bsanch52/confidence-cartography
Pythia was the ideal model family for this work because of its consistent architecture across scales and its deduplicated training data. Thanks to the EleutherAI team for making these models available.