|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- obalcells/longfact-augmented-annotations |
|
|
- obalcells/longfact-annotations |
|
|
- obalcells/longfact-augmented-prompts |
|
|
--- |
|
|
# Hallucination Detection Probes |
|
|
|
|
|
This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs. |
|
|
|
|
|
## Probe Types |
|
|
|
|
|
We provide three types of probes for each model: |
|
|
|
|
|
### 1. **Linear Probes** (`*_linear`) |
|
|
Simple linear classifiers trained on model hidden states to detect hallucinations. |
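Conceptually, a linear probe assigns each token a hallucination score by applying a learned weight vector and bias to that token's hidden state. The sketch below (numpy, with illustrative names and a toy hidden size; not the repository's actual code) shows the scoring step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_probe_scores(hidden_states, w, b):
    # One score per token: sigmoid of a linear map of its hidden state.
    # Higher scores indicate a higher predicted chance of hallucination.
    return sigmoid(hidden_states @ w + b)

# Toy example: 4 tokens with hidden size 8 (real models use thousands of dims)
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((4, 8))
w = rng.standard_normal(8)  # learned probe direction (random here for illustration)
b = 0.0                     # learned bias
scores = linear_probe_scores(hidden_states, w, b)
```

In practice `w` and `b` are fit with a standard binary classification objective on annotated token spans.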
|
|
|
|
|
### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`) |
|
|
LoRA adapters trained with KL divergence regularization (λ=0.05) to maintain proximity to the base model while learning to detect hallucinations.
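Schematically, this objective adds a KL penalty between the base model's and the adapted model's next-token distributions to the probe's classification loss. The numpy sketch below illustrates how the terms combine (function names and shapes are illustrative, not the repository's training code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over sequence positions
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def kl_regularized_loss(probe_loss, base_logits, adapted_logits, lambda_kl=0.05):
    # Total loss = probe classification loss + lambda_kl * KL(base || adapted)
    return probe_loss + lambda_kl * kl_divergence(
        softmax(base_logits), softmax(adapted_logits)
    )

# If the adapted model matches the base model exactly, the KL term vanishes
logits = np.zeros((3, 5))  # 3 positions, vocab size 5
total = kl_regularized_loss(0.7, logits, logits)
```

The KL term discourages the LoRA adapter from drifting away from the base model's output distribution while the probe head learns its detection task.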
|
|
|
|
|
### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`) |
|
|
LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language modeling capabilities while detecting hallucinations.
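Here the regularizer is the adapted model's own next-token cross-entropy rather than a KL term. A minimal numpy sketch of the composition (illustrative names and toy shapes, not the repository's training code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_regularized_loss(probe_loss, adapted_logits, target_ids, lambda_lm=0.01):
    # Total loss = probe classification loss
    #            + lambda_lm * next-token cross-entropy of the adapted model
    probs = softmax(adapted_logits)
    nll = -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-12))
    return probe_loss + lambda_lm * nll

# Toy example: 3 positions, vocab size 5, uniform predictions
logits = np.zeros((3, 5))
targets = np.array([0, 2, 4])
total = lm_regularized_loss(0.7, logits, targets)
```

Keeping the language-modeling loss low ensures the adapter still generates fluent text instead of collapsing into a pure classifier.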
|
|
|
|
|
## Supported Models |
|
|
|
|
|
- Llama 3.3 70B |
|
|
- Llama 3.1 8B |
|
|
- Gemma 2 9B |
|
|
- Mistral Small 24B |
|
|
- Qwen 2.5 7B |
|
|
|
|
|
## Usage |
|
|
|
|
|
For loading and using these probes, see the reference implementation: |
|
|
[probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find this useful in your research, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@misc{obeso2025realtimedetectionhallucinatedentities, |
|
|
title={Real-Time Detection of Hallucinated Entities in Long-Form Generation}, |
|
|
author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda}, |
|
|
year={2025}, |
|
|
eprint={2509.03531}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2509.03531}, |
|
|
} |
|
|
``` |