---
license: apache-2.0
datasets:
- obalcells/longfact-augmented-annotations
- obalcells/longfact-annotations
- obalcells/longfact-augmented-prompts
---
# Hallucination Detection Probes
This repository contains hallucination detection probes for several large language models. The probes are trained to flag factual inaccuracies in model outputs.
## Probe Types
We provide three types of probes for each model; a code sketch of each training objective follows the descriptions below.
### 1. Linear Probes (`*_linear`)

Simple linear classifiers trained on model hidden states to detect hallucinations.
### 2. LoRA Probes with KL Regularization (`*_lora_lambda_kl_0_05`)

LoRA adapters trained with KL-divergence regularization (λ = 0.05) to maintain proximity to the base model while learning to detect hallucinations.
### 3. LoRA Probes with LM Regularization (`*_lora_lambda_lm_0_01`)

LoRA adapters trained with cross-entropy language-modeling regularization (λ = 0.01) to preserve the model's language-modeling capabilities while detecting hallucinations.
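To make the three objectives concrete, here is a minimal PyTorch sketch. It reflects our reading of the descriptions above, not this repository's training code: the tensor shapes, the helper names (`probe_scores`, `linear_probe_loss`, and so on), and the exact form and direction of the regularizers are all assumptions; see the paper for the actual setup.

```python
import torch
import torch.nn.functional as F

def probe_scores(hidden_states: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Per-token hallucination logits from a linear probe on hidden states.

    hidden_states: (batch, seq, d_model); w: (d_model,); b: scalar.
    """
    return hidden_states @ w + b  # (batch, seq)

def linear_probe_loss(hidden_states, w, b, labels):
    """Objective 1: plain BCE between probe logits and per-token labels."""
    return F.binary_cross_entropy_with_logits(probe_scores(hidden_states, w, b), labels.float())

def lora_kl_loss(probe_logits, labels, lora_logits, base_logits, lam=0.05):
    """Objective 2: probe BCE plus lam * KL between the LoRA-adapted and base
    next-token distributions (the direction of the KL is our assumption)."""
    bce = F.binary_cross_entropy_with_logits(probe_logits, labels.float())
    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),  # "input" log-probs
        F.log_softmax(lora_logits, dim=-1),  # "target" log-probs
        log_target=True,
        reduction="batchmean",
    )  # = KL(p_lora || p_base)
    return bce + lam * kl

def lora_lm_loss(probe_logits, labels, lora_logits, next_tokens, lam=0.01):
    """Objective 3: probe BCE plus lam * (standard next-token cross-entropy)."""
    bce = F.binary_cross_entropy_with_logits(probe_logits, labels.float())
    lm = F.cross_entropy(lora_logits.flatten(0, 1), next_tokens.flatten())
    return bce + lam * lm
```

The λ values encoded in the probe names (`lambda_kl_0_05`, `lambda_lm_0_01`) correspond to the `lam` defaults above.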
## Supported Models
- Llama 3.3 70B
- Llama 3.1 8B
- Gemma 2 9B
- Mistral Small 24B
- Qwen 2.5 7B
## Usage
For loading and using these probes, see the reference implementation in `probe_loader.py`.
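As a rough illustration of how a linear probe might be applied (the probe file name, weight format, and probed layer below are our assumptions; `probe_loader.py` is the authoritative reference):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: a linear probe stored as {"w": (d_model,), "b": scalar},
# applied to a mid-layer residual stream. Check probe_loader.py for the real
# file format and layer choice.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

probe = torch.load("llama_3_1_8b_linear.pt")  # hypothetical file name
w, b = probe["w"].float(), probe["b"].float()
layer = 16  # hypothetical probed layer

inputs = tokenizer("The Eiffel Tower was completed in 1889.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states[layer][0].float()  # (seq_len, d_model)
scores = torch.sigmoid(hidden @ w + b)        # per-token hallucination scores
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>15}  {s.item():.3f}")
```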
## Citation
If you find this useful in your research, please consider citing:
```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531},
}
```