---
license: apache-2.0
datasets:
- obalcells/longfact-augmented-annotations
- obalcells/longfact-annotations
- obalcells/longfact-augmented-prompts
---

# Hallucination Detection Probes

This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs.

## Probe Types

We provide three types of probes for each model:

### 1. **Linear Probes** (`*_linear`)

Simple linear classifiers trained on model hidden states to detect hallucinations.

### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`)

LoRA adapters trained with KL-divergence regularization (λ=0.05) to keep the adapted model close to the base model while learning to detect hallucinations.

### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`)

LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language-modeling capabilities while detecting hallucinations.

## Supported Models

- Llama 3.3 70B
- Llama 3.1 8B
- Gemma 2 9B
- Mistral Small 24B
- Qwen 2.5 7B

## Usage

To load and use these probes, see the reference implementation: [probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py)

## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531},
}
```
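
## Example: Applying a Linear Probe

As a minimal sketch of the linear-probe idea described above: a trained linear probe reduces to a weight vector and bias applied to per-token hidden states, producing a per-token hallucination score. The function name, shapes, and threshold below are illustrative assumptions, not the repository's API; see `probe_loader.py` for the actual loading and inference code.

```python
# Hypothetical sketch of linear-probe inference on per-token hidden states.
# Shapes and names are illustrative; the real probes are loaded via
# probe_loader.py in the hallucination_probes repository.
import numpy as np

def probe_scores(hidden_states: np.ndarray, weight: np.ndarray, bias: float) -> np.ndarray:
    """Return per-token hallucination probabilities.

    hidden_states: (seq_len, hidden_dim) activations from one layer.
    weight:        (hidden_dim,) linear probe weights.
    bias:          scalar probe bias.
    """
    logits = hidden_states @ weight + bias      # (seq_len,) raw scores
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid -> probabilities

# Toy example with random activations (stand-ins for real model states)
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))                    # 5 tokens, 16-dim states
w = rng.normal(size=16)
scores = probe_scores(h, w, bias=0.0)
flagged = scores > 0.5                          # tokens flagged as hallucinated
```

In practice the hidden dimension matches the model (e.g. 4096 for Llama 3.1 8B), and the probe is applied to activations from a specific layer chosen during training.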