---
license: apache-2.0
datasets:
- obalcells/longfact-augmented-annotations
- obalcells/longfact-annotations
- obalcells/longfact-augmented-prompts
---
# Hallucination Detection Probes
This repository contains hallucination detection probes for several large language models. The probes are trained to flag factual inaccuracies in model outputs.
## Probe Types
We provide three types of probes for each model; a code sketch of each training objective follows the descriptions below.
### 1. Linear Probes (`*_linear`)

Simple linear classifiers trained on model hidden states to detect hallucinations.
### 2. LoRA Probes with KL Regularization (`*_lora_lambda_kl_0_05`)

LoRA adapters trained with KL-divergence regularization (λ = 0.05) to maintain proximity to the base model while learning to detect hallucinations.
### 3. LoRA Probes with LM Regularization (`*_lora_lambda_lm_0_01`)

LoRA adapters trained with cross-entropy language-modeling regularization (λ = 0.01) to preserve the model's language-modeling capabilities while detecting hallucinations.
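To make the three objectives concrete, here is a minimal PyTorch sketch. It reflects our reading of the descriptions above, not this repository's training code: the tensor shapes, the helper names (`probe_scores`, `linear_probe_loss`, and so on), and the exact form and direction of the regularizers are all assumptions; see the paper for the actual setup.

```python
import torch
import torch.nn.functional as F

def probe_scores(hidden_states: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Per-token hallucination logits from a linear probe on hidden states.

    hidden_states: (batch, seq, d_model); w: (d_model,); b: scalar.
    """
    return hidden_states @ w + b  # (batch, seq)

def linear_probe_loss(hidden_states, w, b, labels):
    """Objective 1: plain BCE between probe logits and per-token labels."""
    return F.binary_cross_entropy_with_logits(probe_scores(hidden_states, w, b), labels.float())

def lora_kl_loss(probe_logits, labels, lora_logits, base_logits, lam=0.05):
    """Objective 2: probe BCE plus lam * KL between the LoRA-adapted and base
    next-token distributions (the direction of the KL is our assumption)."""
    bce = F.binary_cross_entropy_with_logits(probe_logits, labels.float())
    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),  # "input" log-probs
        F.log_softmax(lora_logits, dim=-1),  # "target" log-probs
        log_target=True,
        reduction="batchmean",
    )  # = KL(p_lora || p_base)
    return bce + lam * kl

def lora_lm_loss(probe_logits, labels, lora_logits, next_tokens, lam=0.01):
    """Objective 3: probe BCE plus lam * (standard next-token cross-entropy)."""
    bce = F.binary_cross_entropy_with_logits(probe_logits, labels.float())
    lm = F.cross_entropy(lora_logits.flatten(0, 1), next_tokens.flatten())
    return bce + lam * lm
```

The λ values encoded in the probe names (`lambda_kl_0_05`, `lambda_lm_0_01`) correspond to the `lam` defaults above.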
## Supported Models
- Llama 3.3 70B
- Llama 3.1 8B
- Gemma 2 9B
- Mistral Small 24B
- Qwen 2.5 7B
## Usage
For loading and using these probes, see the reference implementation in `probe_loader.py`.
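As a rough illustration of how a linear probe might be applied (the probe file name, weight format, and probed layer below are our assumptions; `probe_loader.py` is the authoritative reference):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: a linear probe stored as {"w": (d_model,), "b": scalar},
# applied to a mid-layer residual stream. Check probe_loader.py for the real
# file format and layer choice.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

probe = torch.load("llama_3_1_8b_linear.pt")  # hypothetical file name
w, b = probe["w"].float(), probe["b"].float()
layer = 16  # hypothetical probed layer

inputs = tokenizer("The Eiffel Tower was completed in 1889.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states[layer][0].float()  # (seq_len, d_model)
scores = torch.sigmoid(hidden @ w + b)        # per-token hallucination scores
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>15}  {s.item():.3f}")
```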
## Citation
If you find this useful in your research, please consider citing:
```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531},
}
```