---
license: apache-2.0
datasets:
- obalcells/longfact-augmented-annotations
- obalcells/longfact-annotations
- obalcells/longfact-augmented-prompts
---
# Hallucination Detection Probes
This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs.
## Probe Types
We provide three types of probes for each model:
### 1. **Linear Probes** (`*_linear`)
Simple linear classifiers trained on model hidden states to detect hallucinations.
### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`)
LoRA adapters trained with KL divergence regularization (λ=0.05) to maintain proximity to the base model while learning to detect hallucinations.
### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`)
LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language modeling capabilities while detecting hallucinations.
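As a rough illustration of how such regularized objectives are typically structured (a sketch under assumptions, not the repository's actual training code; the KL direction and all shapes here are illustrative), the LoRA variants combine a token-level detection loss with a λ-weighted penalty tying the adapted model to the base model:

```python
import torch
import torch.nn.functional as F

def regularized_probe_loss(probe_logits, labels, adapted_logits,
                           base_logits, lam=0.05):
    """Token-level detection loss plus a lambda-weighted KL term that keeps
    the LoRA-adapted model's next-token distribution near the base model's.

    probe_logits, labels: (num_tokens,) -- logits vs. 0/1 hallucination labels.
    adapted_logits, base_logits: (num_tokens, vocab_size).
    """
    detection = F.binary_cross_entropy_with_logits(probe_logits, labels)
    # KL(adapted || base); the direction is an assumption in this sketch.
    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),   # input: log-probs of base model
        F.softmax(adapted_logits, dim=-1),    # target: probs of adapted model
        reduction="batchmean",
    )
    return detection + lam * kl

# Toy example: 8 tokens, vocabulary of 32.
probe_logits = torch.randn(8)
labels = torch.randint(0, 2, (8,)).float()
adapted_logits = torch.randn(8, 32)
base_logits = torch.randn(8, 32)
loss = regularized_probe_loss(probe_logits, labels, adapted_logits, base_logits)
```

The LM-regularized variant replaces the KL term with the adapted model's own next-token cross-entropy loss, weighted by λ=0.01.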
## Supported Models
- Llama 3.3 70B
- Llama 3.1 8B
- Gemma 2 9B
- Mistral Small 24B
- Qwen 2.5 7B
## Usage
For loading and using these probes, see the reference implementation:
[probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py)
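The linked loader handles checkpoint layout and per-model details. As a minimal sketch of what applying a linear probe looks like once its weight and bias are in hand (the dimension, variable names, and random stand-in weights below are illustrative, not the repository's actual API):

```python
import torch

hidden_dim = 4096  # e.g. Llama 3.1 8B's hidden size

# In practice, probe weights would be loaded from this repository (e.g. via
# huggingface_hub + safetensors); random values stand in for them here.
probe_w = torch.randn(hidden_dim) / hidden_dim**0.5
probe_b = torch.zeros(())

def probe_scores(hidden_states: torch.Tensor) -> torch.Tensor:
    """Per-token hallucination probabilities for hidden states of shape
    (seq_len, hidden_dim); scores near 1 flag likely hallucinations."""
    return torch.sigmoid(hidden_states @ probe_w + probe_b)

h = torch.randn(10, hidden_dim)  # hidden states for a 10-token generation
scores = probe_scores(h)
print(scores.shape)  # torch.Size([10])
```

Because the probe is a single linear map over hidden states, scoring adds negligible overhead on top of generation, which is what makes real-time detection practical.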
## Citation
If you find this useful in your research, please consider citing:
```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
  title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
  author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
  year={2025},
  eprint={2509.03531},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.03531},
}
```