Confidence Cartography

Teacher-forced confidence analysis for causal language models.

Measures P(actual_token | preceding_context) at every position in a text using a single forward pass. The resulting per-token probability map detects false beliefs, Mandela effects, and medical misinformation by comparing confidence on true vs. false versions of claims.

Paper: DOI: 10.5281/zenodo.18703506

Toolkit: confidence-cartography-toolkit (pip-installable)

Paper repo: confidence-cartography (experiments + manuscript)

Key Results

  • Mandela Effect calibration: Spearman rho = 0.652 (p = 0.016); model confidence ratios track human false-belief prevalence across Pythia 160M to 12B
  • Medical misinformation detection: 88% accuracy at Pythia 6.9B scale (p = 0.01 vs random)
  • Scaling: Detection improves monotonically with model scale across the Pythia family
  • Targeted resampling: 3-5x cheaper than uniform best-of-N by regenerating only from low-confidence positions
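
The targeted-resampling idea in the last bullet can be sketched in a few lines. The helper below is hypothetical (it is not part of the toolkit); it only illustrates how the restart point would be chosen from a per-token confidence map:

```python
def restart_position(confidences, threshold=0.2):
    """Earliest position whose teacher-forced confidence falls below
    the threshold. Uniform best-of-N regenerates the whole sequence;
    targeted resampling regenerates only the suffix from this point.

    Both the function name and the threshold are illustrative.
    """
    for i, p in enumerate(confidences):
        if p < threshold:
            return i
    return None  # no weak position: keep the sample as-is

# Toy confidence map: the first weak token is at index 3, so only
# the suffix [3:] needs regenerating instead of all 5 tokens.
cut = restart_position([0.91, 0.84, 0.77, 0.05, 0.62])
print(cut)  # -> 3
```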

Install

pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git

Reproduce the Results in 3 Lines

import confidence_cartography as cc

# Mandela Effect calibration (9 items, YouGov ground truth)
results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(f"Spearman rho: {results.rho:.3f}, p={results.p_value:.4f}")
# -> rho=0.652, p=0.016

# Medical myth detection (25 curated pairs)
results = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
print(f"Accuracy: {results.accuracy:.0%}")
# -> 88%

Works With Any Causal LM

Tested on:

Architecture   Models Tested
Pythia         160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B
GPT-2          124M
Qwen 2.5       0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Llama 3.1      8B
Mistral        7B

Any HuggingFace AutoModelForCausalLM-compatible model works:

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("your-model-here")
result = scorer.score("The capital of France is Paris.")
print(f"Mean confidence: {result.mean_top1_prob:.1%}")

How It Works

  1. Tokenize the input text
  2. Run a single forward pass through the causal LM
  3. At each position t, record P(token[t+1] | tokens[0:t]), the probability the model assigns to the actual next token
  4. Compute summary statistics: mean confidence, entropy, minimum-confidence position
  5. Check against 15 Mandela Effect patterns and 15 medical myth patterns

This is teacher forcing: we feed the actual text and measure how surprised the model is at each token, rather than letting the model generate freely. The key insight is that false beliefs that are widely shared in the training data (like "Luke, I am your father") receive higher confidence than the correct versions, and this effect correlates with human survey data.
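
The probability extraction in steps 2-4 can be sketched with plain NumPy, given the logit matrix from a single forward pass. This is a minimal sketch of the computation, not the toolkit's internal code:

```python
import numpy as np

def teacher_forced_confidences(logits, token_ids):
    """Per-position probability of the actual next token.

    logits:    (T, V) array from one forward pass over token_ids
    token_ids: the T input ids
    Returns a length T-1 array whose entry t is
    P(token[t+1] | tokens[0:t+1]).
    """
    z = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    next_ids = np.asarray(token_ids[1:])
    # At each position, pick out the probability of the token that
    # actually came next in the text.
    return probs[np.arange(len(next_ids)), next_ids]

# Toy vocabulary of 3 tokens; each row is the logits produced after
# seeing the prefix up to that position.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [0.0, 0.0, 1.0]])
confs = teacher_forced_confidences(logits, [0, 1, 2])
```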

API

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("gpt2")

# Score any text
result = scorer.score("Vaccines cause autism.")
print(result.mean_top1_prob)   # overall confidence
print(result.mean_entropy)     # uncertainty (bits)
print(result.min_confidence_token)  # weakest token

# Score + flag known false beliefs
record, flags = scorer.score_and_flag("The Berenstein Bears")
# flags: ['mandela_match:berenstain']

# Compare two versions
result = scorer.compare("Berenstain Bears", "Berenstein Bears")
print(result["confidence_ratio"])

# Reproduce paper benchmarks
import confidence_cartography as cc
mandela = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
medical = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
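
result.mean_entropy is reported in bits; a minimal sketch of the per-position quantity being averaged, assuming the standard Shannon definition (not necessarily the toolkit's exact code):

```python
import numpy as np

def entropy_bits(next_token_probs):
    """Shannon entropy, in bits, of one position's next-token
    distribution. Higher values mean the model is less certain
    about what comes next.
    """
    p = np.asarray(next_token_probs, dtype=float)
    p = p[p > 0]                       # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

entropy_bits([0.5, 0.5])   # a fair coin: exactly 1 bit
entropy_bits([1.0, 0.0])   # a certain outcome: 0 bits
```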

Citation

@software{sanchez2026confidence,
  author = {Sanchez, Bryan},
  title = {Confidence Cartography: Using Language Models as Sensors for the Structure of Human Knowledge},
  year = {2026},
  doi = {10.5281/zenodo.18703506},
  url = {https://github.com/SolomonB14D3/confidence-cartography}
}

License

MIT


Evaluation results (self-reported)

  • Spearman correlation with human false-belief prevalence (YouGov Mandela Effect Survey, 2022): 0.652
  • p-value (YouGov Mandela Effect Survey, 2022): 0.016
  • Accuracy on curated medical true/false pairs: 0.880