---
license: apache-2.0
datasets:
- obalcells/longfact-augmented-annotations
- obalcells/longfact-annotations
- obalcells/longfact-augmented-prompts
---

# Hallucination Detection Probes

This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs.

## Probe Types

We provide three types of probes for each model:

### 1. **Linear Probes** (`*_linear`)

Simple linear classifiers trained on model hidden states to detect hallucinations.

### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`)

LoRA adapters trained with KL-divergence regularization (λ=0.05) to keep the adapted model close to the base model while learning to detect hallucinations.

### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`)

LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language-modeling capabilities while detecting hallucinations.

## Supported Models

- Llama 3.3 70B
- Llama 3.1 8B
- Gemma 2 9B
- Mistral Small 24B
- Qwen 2.5 7B

## Usage

To load and use these probes, see the reference implementation: [probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py)

## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531},
}
```
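
## Example: Applying a Linear Probe

As a minimal sketch of the linear-probe idea described above: a trained linear probe reduces to a weight vector and bias applied to per-token hidden states, producing a per-token hallucination score. The function name, shapes, and threshold below are illustrative assumptions, not the repository's API; see `probe_loader.py` for the actual loading and inference code.

```python
# Hypothetical sketch of linear-probe inference on per-token hidden states.
# Shapes and names are illustrative; the real probes are loaded via
# probe_loader.py in the hallucination_probes repository.
import numpy as np

def probe_scores(hidden_states: np.ndarray, weight: np.ndarray, bias: float) -> np.ndarray:
    """Return per-token hallucination probabilities.

    hidden_states: (seq_len, hidden_dim) activations from one layer.
    weight:        (hidden_dim,) linear probe weights.
    bias:          scalar probe bias.
    """
    logits = hidden_states @ weight + bias      # (seq_len,) raw scores
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid -> probabilities

# Toy example with random activations (stand-ins for real model states)
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))                    # 5 tokens, 16-dim states
w = rng.normal(size=16)
scores = probe_scores(h, w, bias=0.0)
flagged = scores > 0.5                          # tokens flagged as hallucinated
```

In practice the hidden dimension matches the model (e.g. 4096 for Llama 3.1 8B), and the probe is applied to activations from a specific layer chosen during training.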