# Confidence Cartography

Teacher-forced confidence analysis for causal language models.

Measures P(actual_token | preceding_context) at every position in a text using a single forward pass. The resulting per-token probability map can detect false beliefs, Mandela effects, and medical misinformation by comparing confidence on true and false versions of the same claim.

- Paper: DOI 10.5281/zenodo.18703506
- Toolkit: confidence-cartography-toolkit (pip-installable)
- Paper repo: confidence-cartography (experiments + manuscript)
## Key Results
- Mandela Effect calibration: Spearman rho = 0.652 (p = 0.016); model confidence ratios track human false-belief prevalence across Pythia 160M to 12B
- Medical misinformation detection: 88% accuracy at Pythia 6.9B scale (p = 0.01 vs random)
- Scaling: Detection improves monotonically with model scale across the Pythia family
- Targeted resampling: 3-5x cheaper than uniform best-of-N by regenerating only from low-confidence positions
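
The targeted-resampling idea can be sketched generically: rather than sampling N complete generations (uniform best-of-N), keep the prefix up to the lowest-confidence token and resample only the suffix. The sketch below is illustrative, not the toolkit's API; `generate_suffix` and `score` are hypothetical callables standing in for your decoder and reranker.

```python
# Illustrative sketch of targeted resampling (hypothetical helper names,
# not the confidence-cartography toolkit API).

def targeted_resample(tokens, confidences, generate_suffix, score, n_samples=4):
    """tokens: generated tokens; confidences: per-token P(token | prefix);
    generate_suffix: hypothetical callable(prefix) -> list of new tokens;
    score: hypothetical callable(tokens) -> float, higher is better."""
    # Cut at the weakest position on the confidence map.
    cut = min(range(len(confidences)), key=confidences.__getitem__)
    prefix = tokens[:cut]  # everything before the weakest token is kept as-is
    # Resample only the suffix, n_samples times, and keep the best candidate.
    candidates = [prefix + generate_suffix(prefix) for _ in range(n_samples)]
    return max(candidates, key=score)
```

Because only the tokens after the cut point are regenerated per sample, the saving over uniform best-of-N grows the later the weakest token falls in the text.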
## Install

```bash
pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git
```
## Reproduce the Results in 3 Lines

```python
import confidence_cartography as cc

# Mandela Effect calibration (9 items, YouGov ground truth)
results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(f"Spearman rho: {results.rho:.3f}, p={results.p_value:.4f}")
# -> rho=0.652, p=0.016

# Medical myth detection (25 curated pairs)
results = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
print(f"Accuracy: {results.accuracy:.0%}")
# -> 88%
```
## Works With Any Causal LM

Tested on:
| Architecture | Models Tested |
|---|---|
| Pythia | 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B |
| GPT-2 | 124M |
| Qwen 2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B |
| Llama 3.1 | 8B |
| Mistral | 7B |
Any Hugging Face `AutoModelForCausalLM`-compatible model works:

```python
from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("your-model-here")
result = scorer.score("The capital of France is Paris.")
print(f"Mean confidence: {result.mean_top1_prob:.1%}")
```
## How It Works

- Tokenize the input text
- Run a single forward pass through the causal LM
- At each position t, record P(token[t+1] | tokens[0:t]), the probability the model assigns to the actual next token
- Compute summary statistics: mean confidence, entropy, minimum-confidence position
- Check against 15 Mandela Effect patterns and 15 medical myth patterns

This is teacher forcing: we feed the actual text and measure how surprised the model is at each token, rather than letting the model generate freely. The key insight is that false beliefs widely shared in the training data (like "Luke, I am your father") receive higher confidence than the correct versions, and this effect correlates with human survey data.
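
The steps above can be sketched directly with Hugging Face `transformers`, independent of the toolkit. This is a minimal illustration of the single-pass scoring loop (the model choice is arbitrary; the toolkit's internals may differ):

```python
# Minimal sketch of teacher-forced per-token confidence with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_confidences(text, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids  # shape (1, T)
    with torch.no_grad():
        logits = model(ids).logits                  # shape (1, T, vocab)

    # Position t predicts token t+1, so drop the last logit row and
    # shift the target ids by one.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    conf = probs[torch.arange(len(next_ids)), next_ids]
    return list(zip(tok.convert_ids_to_tokens(next_ids.tolist()), conf.tolist()))

for token, p in token_confidences("The capital of France is Paris."):
    print(f"{token!r}: {p:.3f}")
```

One forward pass yields the full confidence map; no sampling or repeated decoding is required.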
## API

```python
from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("gpt2")

# Score any text
result = scorer.score("Vaccines cause autism.")
print(result.mean_top1_prob)        # overall confidence
print(result.mean_entropy)          # uncertainty (bits)
print(result.min_confidence_token)  # weakest token

# Score + flag known false beliefs
record, flags = scorer.score_and_flag("The Berenstein Bears")
# flags: ['mandela_match:berenstain']

# Compare two versions
result = scorer.compare("Berenstain Bears", "Berenstein Bears")
print(result["confidence_ratio"])

# Reproduce paper benchmarks
import confidence_cartography as cc
mandela = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
medical = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
```
## Citation

```bibtex
@software{sanchez2026confidence,
  author = {Sanchez, Bryan},
  title  = {Confidence Cartography: Using Language Models as Sensors for the Structure of Human Knowledge},
  year   = {2026},
  doi    = {10.5281/zenodo.18703506},
  url    = {https://github.com/SolomonB14D3/confidence-cartography}
}
```
## License

MIT
## Evaluation results

- Spearman correlation (human false-belief prevalence), YouGov Mandela Effect Survey (2022), self-reported: 0.652
- p-value, YouGov Mandela Effect Survey (2022), self-reported: 0.016
- Accuracy, Curated Medical True/False Pairs, self-reported: 0.880