|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- obalcells/longfact-augmented-annotations |
|
|
- obalcells/longfact-annotations |
|
|
- obalcells/longfact-augmented-prompts |
|
|
--- |
|
|
# Hallucination Detection Probes |
|
|
|
|
|
This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs. |
|
|
|
|
|
## Probe Types |
|
|
|
|
|
We provide three types of probes for each model: |
|
|
|
|
|
### 1. **Linear Probes** (`*_linear`) |
|
|
Simple linear classifiers trained on model hidden states to detect hallucinations. |
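Conceptually, a linear probe assigns each token a hallucination score by applying a learned weight vector and bias to that token's hidden state. The sketch below (numpy, with illustrative names and a toy hidden size; not the repository's actual code) shows the scoring step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_probe_scores(hidden_states, w, b):
    # One score per token: sigmoid of a linear map of its hidden state.
    # Higher scores indicate a higher predicted chance of hallucination.
    return sigmoid(hidden_states @ w + b)

# Toy example: 4 tokens with hidden size 8 (real models use thousands of dims)
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((4, 8))
w = rng.standard_normal(8)  # learned probe direction (random here for illustration)
b = 0.0                     # learned bias
scores = linear_probe_scores(hidden_states, w, b)
```

In practice `w` and `b` are fit with a standard binary classification objective on annotated token spans.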
|
|
|
|
|
### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`) |
|
|
LoRA adapters trained with KL divergence regularization (λ=0.05) to maintain proximity to the base model while learning to detect hallucinations.
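Schematically, this objective adds a KL penalty between the base model's and the adapted model's next-token distributions to the probe's classification loss. The numpy sketch below illustrates how the terms combine (function names and shapes are illustrative, not the repository's training code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over sequence positions
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def kl_regularized_loss(probe_loss, base_logits, adapted_logits, lambda_kl=0.05):
    # Total loss = probe classification loss + lambda_kl * KL(base || adapted)
    return probe_loss + lambda_kl * kl_divergence(
        softmax(base_logits), softmax(adapted_logits)
    )

# If the adapted model matches the base model exactly, the KL term vanishes
logits = np.zeros((3, 5))  # 3 positions, vocab size 5
total = kl_regularized_loss(0.7, logits, logits)
```

The KL term discourages the LoRA adapter from drifting away from the base model's output distribution while the probe head learns its detection task.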
|
|
|
|
|
### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`) |
|
|
LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language modeling capabilities while detecting hallucinations.
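Here the regularizer is the adapted model's own next-token cross-entropy rather than a KL term. A minimal numpy sketch of the composition (illustrative names and toy shapes, not the repository's training code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_regularized_loss(probe_loss, adapted_logits, target_ids, lambda_lm=0.01):
    # Total loss = probe classification loss
    #            + lambda_lm * next-token cross-entropy of the adapted model
    probs = softmax(adapted_logits)
    nll = -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-12))
    return probe_loss + lambda_lm * nll

# Toy example: 3 positions, vocab size 5, uniform predictions
logits = np.zeros((3, 5))
targets = np.array([0, 2, 4])
total = lm_regularized_loss(0.7, logits, targets)
```

Keeping the language-modeling loss low ensures the adapter still generates fluent text instead of collapsing into a pure classifier.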
|
|
|
|
|
## Supported Models |
|
|
|
|
|
- Llama 3.3 70B |
|
|
- Llama 3.1 8B |
|
|
- Gemma 2 9B |
|
|
- Mistral Small 24B |
|
|
- Qwen 2.5 7B |
|
|
|
|
|
## Usage |
|
|
|
|
|
For loading and using these probes, see the reference implementation: |
|
|
[probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find this useful in your research, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@misc{obeso2025realtimedetectionhallucinatedentities, |
|
|
title={Real-Time Detection of Hallucinated Entities in Long-Form Generation}, |
|
|
author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda}, |
|
|
year={2025}, |
|
|
eprint={2509.03531}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2509.03531}, |
|
|
} |
|
|
``` |