obalcells/hallucination-probes

obalcells committed on Oct 15, 2025

Commit db581ce · verified · 1 Parent(s): 19842a2

Updated README

Files changed (1)

README.md +52 -0

	@@ -0,0 +1,52 @@

+---
+license: apache-2.0
+datasets:
+- obalcells/longfact-augmented-annotations
+- obalcells/longfact-annotations
+- obalcells/longfact-augmented-prompts
+---
+# Hallucination Detection Probes
+This repository contains hallucination detection probes for several large language models. The probes are trained to detect factual inaccuracies in model outputs.
+## Probe Types
+We provide three types of probes for each model:
+### 1. **Linear Probes** (`*_linear`)
+Simple linear classifiers trained on model hidden states to detect hallucinations.
+### 2. **LoRA Probes with KL Regularization** (`*_lora_lambda_kl_0_05`)
+LoRA adapters trained with KL divergence regularization (λ=0.05) to maintain proximity to the base model while learning to detect hallucinations.
+### 3. **LoRA Probes with LM Regularization** (`*_lora_lambda_lm_0_01`)
+LoRA adapters trained with language-modeling (cross-entropy) loss regularization (λ=0.01) to preserve language modeling capabilities while detecting hallucinations.
+## Supported Models
+- Llama 3.3 70B
+- Llama 3.1 8B
+- Gemma 2 9B
+- Mistral Small 24B
+- Qwen 2.5 7B
+## Usage
+For loading and using these probes, see the reference implementation:
+[probe_loader.py](https://github.com/obalcells/hallucination_probes/blob/main/utils/probe_loader.py)
+## Citation
+If you find this useful in your research, please consider citing:
+```bibtex
+@misc{obeso2025realtimedetectionhallucinatedentities,
+      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
+      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
+      year={2025},
+      eprint={2509.03531},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2509.03531},
+}
+```
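
To make the probe types described in the README concrete, here is a minimal pure-Python sketch of how a linear probe could score hidden states, and how a KL-regularized objective with λ=0.05 could be combined. It is not the repository's implementation (see `probe_loader.py` for that). The function names, the sigmoid output, the per-token labels, and the direction of the KL term are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def linear_probe_scores(hidden_states, weight, bias):
    """Score each hidden-state vector with a linear classifier.

    Assumption: the probe is a single weight vector plus bias, and the
    sigmoid of the logit is read as a hallucination probability.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weight, state)) + bias)
        for state in hidden_states
    ]


def binary_cross_entropy(prob: float, label: int) -> float:
    """Binary cross-entropy for one prediction (label 1 = hallucinated)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_loss(probe_probs, labels, base_dists, adapted_dists, lambda_kl=0.05):
    """Probe loss plus a KL penalty that keeps the adapted model near the base.

    Assumption: the penalty is the mean of KL(base || adapted) over positions;
    the README only states that KL regularization with lambda = 0.05 is used.
    """
    probe_term = sum(
        binary_cross_entropy(p, y) for p, y in zip(probe_probs, labels)
    ) / len(labels)
    kl_term = sum(
        kl_divergence(b, a) for b, a in zip(base_dists, adapted_dists)
    ) / len(base_dists)
    return probe_term + lambda_kl * kl_term
```

When the adapted model's next-token distribution matches the base model's, the KL term vanishes and only the probe loss remains. The LM-regularized variant (λ=0.01) would presumably swap the KL term for a next-token cross-entropy term in the same weighted sum.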