	---
	language:
	- en
	license: mit
	base_model:
	- KRLabsOrg/lettucedect-base-modernbert-en-v1
	- Qwen/Qwen2.5-0.5B
	datasets:
	- jameVee/ToolACE-Hallucination
	tags:
	- hallucination-detection
	- tool-calling
	- text-classification
	- span-detection
	- sklearn
	- ensemble
	metrics:
	- f1
	- roc_auc
	pipeline_tag: text-classification
	pretty_name: ToolACE Hallucination Detector
	---

	# ToolACE Hallucination Detector

	A lightweight ensemble classifier for detecting hallucinations in LLM responses that follow tool calls. It targets three hallucination types (missing tool reference, unsupported overgeneration, and tool-output contradiction) and returns both a sample-level binary label and character-level span predictions. The output does not say which of the three types occurred.

	---

	## Background

	When an LLM answers a user query using tool outputs, it can hallucinate in distinct ways:

	| Type | Description |
	|---|---|
	| `missing_tool` | The assistant suggests using a tool that was not available |
	| `overgeneration` | The answer contains plausible but unsupported extra facts |
	| `tool_output_contradiction` | The answer contradicts specific facts returned by the tool |

	This model was trained on [jameVee/ToolACE-Hallucination](https://huggingface.co/datasets/jameVee/ToolACE-Hallucination), a benchmark of 4 136 rows (1 034 base examples × 4 variants) derived from [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE). The `missing_tool` and `overgeneration` sets corrupt about 50 % of the clean ToolACE examples, while the `tool_output_contradiction` set attempts a rewrite of every entry (see the dataset card for details).

	---

	## Architecture

	The detector is a three-branch soft-voting ensemble. Each branch independently produces a hallucination probability, and the three scores are averaged to give the final prediction.

	```
	                        ┌──────────────────────────────────┐
	query + context ──────► │ Branch A: Lexical Verifier       │ ──► P_A
	+ tool output           │ (token overlap span detector     │
	                        │  → StandardScaler + LogReg)      │
	                        └──────────────────────────────────┘
	                        ┌──────────────────────────────────┐
	                   ───► │ Branch B: LettuceDetect          │ ──► P_B
	                        │ (ModernBERT span model           │
	                        │  → StandardScaler + LogReg)      │
	                        └──────────────────────────────────┘
	                        ┌──────────────────────────────────┐
	                   ───► │ Branch C: LookBack-style         │ ──► P_C
	                        │ (Qwen2.5-0.5B attention ratios   │
	                        │  → StandardScaler + LogReg)      │
	                        └──────────────────────────────────┘
	                                         │
	                              avg(P_A, P_B, P_C) ≥ 0.5
	                                         │
	                                   hallucinated?
	```
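
The averaging step in the diagram can be sketched in a few lines. This is a minimal illustration, assuming each branch has already produced a probability in [0, 1]; the function name `soft_vote` and the example scores are hypothetical and not part of this repository.

```python
from statistics import fmean


def soft_vote(branch_probs: list[float], threshold: float = 0.5) -> dict:
    """Average per-branch hallucination probabilities and threshold the mean."""
    if not branch_probs:
        raise ValueError("need at least one branch probability")
    score = fmean(branch_probs)
    return {"score": score, "hallucinated": score >= threshold}


# Hypothetical branch outputs P_A, P_B, P_C for a single response.
result = soft_vote([0.40, 0.80, 0.45])
print(result)
```

Because the vote is an unweighted mean, one confident branch can outvote two uncertain ones, as in the example above.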

	### Branch A — Lexical Span Verifier

	Checks whether tokens in the assistant's answer are grounded in the (normalized) tool output using lexical overlap. Unsupported token sequences become candidate hallucination spans. Span-level features (count, coverage, max score) are fed into a logistic regression.
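
The idea behind Branch A can be illustrated with a small, self-contained sketch. It is not the repository's `lexical_hallucination_spans`; the tokenization regex, the stopword list, and the name `unsupported_spans` are assumptions made for illustration, and the real helper may normalize and score spans differently.

```python
import re

TOKEN = re.compile(r"\w+(?:\.\w+)*")  # keeps numbers like 189.50 in one token
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "it"}  # illustrative only


def unsupported_spans(answer: str, tool_output: str) -> list[dict]:
    """Return character spans of adjacent answer tokens absent from the tool output."""
    vocab = {t.lower() for t in TOKEN.findall(tool_output)}
    spans: list[dict] = []
    current = None
    for m in TOKEN.finditer(answer):
        tok = m.group().lower()
        if tok in STOPWORDS:
            continue  # stopwords neither support nor break a span
        if tok in vocab:
            current = None  # a grounded token closes the running span
            continue
        if current is None:
            current = {"start": m.start(), "end": m.end()}
            spans.append(current)
        else:
            current["end"] = m.end()  # extend over adjacent unsupported tokens
    for s in spans:
        s["text"] = answer[s["start"]:s["end"]]
    return spans


spans = unsupported_spans(
    "AAPL price is 189.50 and it hit a record high",
    '{"ticker": "AAPL", "price": 189.50}',
)
print(spans)
```

Pure lexical matching flags paraphrases as unsupported and misses contradictions that reuse the tool's vocabulary, which may partly explain the low Span-F1 reported for this branch.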

	### Branch B — LettuceDetect (supervised span model)

	Uses [KRLabsOrg/lettucedect-base-modernbert-en-v1](https://huggingface.co/KRLabsOrg/lettucedect-base-modernbert-en-v1), a ModernBERT-based transformer fine-tuned for grounded hallucination detection. It produces character-level spans scored by confidence. A logistic regression trained on span features converts those into a sample-level probability. This is the best single branch on the test set (AUROC 0.721, Span-F1 0.208).
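
The span features mentioned for Branches A and B (count, coverage, max score) could be computed roughly as follows. This is an illustrative stand-in, not the repository's `aggregate_span_features`; the feature names here are hypothetical, and the actual column names are recorded in `model_meta.json`.

```python
def span_features(spans: list[dict], answer_length: int) -> dict:
    """Summarize predicted spans as count, character coverage, and max score."""
    covered: set[int] = set()
    for s in spans:
        # Clip to the answer length and let overlapping spans count once.
        covered.update(range(s["start"], min(s["end"], answer_length)))
    return {
        "span_count": len(spans),
        "span_coverage": len(covered) / answer_length if answer_length else 0.0,
        "max_score": max((s.get("score", 0.0) for s in spans), default=0.0),
    }


feats = span_features(
    [{"start": 0, "end": 10, "score": 0.9}, {"start": 5, "end": 20, "score": 0.6}],
    answer_length=40,
)
print(feats)
```

Reducing a variable-length span list to a fixed-size vector is what lets a plain logistic regression sit on top of each span detector.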

	### Branch C — LookBack-style Detector

	Passes the concatenation of context and answer through [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) and computes, for each answer token, the ratio of attention it pays to the context versus to previously generated tokens. Low-grounding tokens (ratio < 0.22) are merged into hallucination spans. Span and ratio features are fed to logistic regression.
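
A toy version of the ratio and thresholding steps is sketched below. It assumes the ratio is defined as context attention divided by total (context plus previously generated) attention, and that attention has already been averaged over heads and layers; the card does not spell out either detail. The function names and the attention matrix are hypothetical.

```python
def lookback_ratios(attn: list[list[float]], n_context: int) -> list[float]:
    """attn[i][j] is the attention from answer token i to position j (context first)."""
    ratios = []
    for row in attn:
        ctx = sum(row[:n_context])   # mass on context positions
        new = sum(row[n_context:])   # mass on previously generated positions
        total = ctx + new
        ratios.append(ctx / total if total > 0 else 0.0)
    return ratios


def low_grounding_spans(ratios: list[float], threshold: float = 0.22) -> list[tuple[int, int]]:
    """Merge consecutive tokens with ratio < threshold into [start, end) token spans."""
    spans, start = [], None
    for i, r in enumerate(ratios):
        if r < threshold and start is None:
            start = i
        elif r >= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(ratios)))
    return spans


# Toy attention: 3 context positions followed by 4 answer positions.
attn = [
    [0.50, 0.30, 0.20, 0.00, 0.00, 0.00, 0.00],
    [0.05, 0.05, 0.05, 0.85, 0.00, 0.00, 0.00],
    [0.10, 0.05, 0.05, 0.40, 0.40, 0.00, 0.00],
    [0.30, 0.20, 0.10, 0.20, 0.10, 0.10, 0.00],
]
ratios = lookback_ratios(attn, n_context=3)
token_spans = low_grounding_spans(ratios)
print(ratios, token_spans)
```

The spans here are in token indices; mapping them to character offsets (for example through tokenizer offset mappings) is a separate step not shown.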

	---

	## Files in this repository

	| File | Description |
	|---|---|
	| `lexical_clf.joblib` | Branch A sklearn Pipeline (StandardScaler + LogisticRegression) |
	| `lettuce_clf.joblib` | Branch B sklearn Pipeline (StandardScaler + LogisticRegression) |
	| `lookback_clf.joblib` | Branch C sklearn Pipeline (StandardScaler + LogisticRegression) |
	| `model_meta.json` | Feature column names, backbone IDs, train/test split sizes, test metrics |
	| `evaluation_baselines_span_utils.py` | Span extraction helpers required at inference time |
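
Before loading anything, it can be useful to confirm that a local clone contains every file listed above. The helper below is a hypothetical convenience, not something shipped in the repository.

```python
from pathlib import Path

EXPECTED_FILES = [
    "lexical_clf.joblib",
    "lettuce_clf.joblib",
    "lookback_clf.joblib",
    "model_meta.json",
    "evaluation_baselines_span_utils.py",
]


def missing_files(repo_dir: str) -> list[str]:
    """Return the expected file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Prints an empty list when the clone is complete.
print(missing_files("hallucination_detector"))
```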

	---

	## Performance (test set, 20 % held-out, grouped split by example_id)

	| Method | Accuracy | F1 | AUROC | Span-F1 |
	|---|---|---|---|---|
	| Lexical span verifier | 0.566 | 0.523 | 0.603 | 0.053 |
	| LettuceDetect (Branch B) | 0.678 | 0.632 | 0.721 | 0.208 |
	| LookBack (Branch C) | 0.489 | 0.561 | 0.511 | 0.000 |
	| Soft-vote ensemble | 0.676 | 0.659 | 0.721 | 0.179 |

	Per-type F1 (ensemble):

	| Corruption type | F1 |
	|---|---|
	| `missing_tool` | 0.617 |
	| `overgeneration` | 0.693 |
	| `tool_output_contradiction` | 0.779 |
	| `clean` | n/a |
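
The card does not define Span-F1. One common reading is a character-level overlap F1 between predicted and gold spans, sketched below under that assumption; the function names and the empty-case convention are illustrative choices, not the evaluation code used for the tables above.

```python
def char_set(spans: list[tuple[int, int]]) -> set[int]:
    """Expand [start, end) spans into the set of covered character offsets."""
    return {i for start, end in spans for i in range(start, end)}


def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]]) -> float:
    """Character-level F1 between predicted and gold spans."""
    p, g = char_set(pred), char_set(gold)
    if not p and not g:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Predicted span covers characters 0..9, gold covers 5..14: half of each overlaps.
print(span_f1([(0, 10)], [(5, 15)]))
```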

	---

	## How to use

	### 1. Install dependencies

	```bash
	pip install joblib scikit-learn transformers lettucedetect torch pandas numpy
	```

	### 2. Load the classifiers

	```python
	import json
	from pathlib import Path

	import joblib

	repo = Path("hallucination_detector")  # or your local clone path

	# One fitted sklearn Pipeline (StandardScaler + LogisticRegression) per branch
	lex_clf = joblib.load(repo / "lexical_clf.joblib")
	lettuce_clf = joblib.load(repo / "lettuce_clf.joblib")
	lbl_clf = joblib.load(repo / "lookback_clf.joblib")

	# Feature column names, backbone IDs, split sizes, and test metrics
	with open(repo / "model_meta.json") as f:
	    meta = json.load(f)
	```

	### 3. Load the backbone models

	```python
	import torch
	from lettucedetect.models.inference import HallucinationDetector
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Branch B backbone: supervised span detector
	lettuce_detector = HallucinationDetector(
	    method="transformer",
	    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
	)

	# Branch C backbone: eager attention so attention weights can be returned
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
	lbl_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", use_fast=True)
	lbl_model = AutoModelForCausalLM.from_pretrained(
	    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16, attn_implementation="eager"
	).to(DEVICE).eval()
	```

	### 4. Run inference

	```python
	import sys

	import numpy as np
	import pandas as pd

	sys.path.insert(0, str(repo))  # make span utils importable
	from evaluation_baselines_span_utils import (
	    add_normalized_context_columns,
	    aggregate_lookback_features,
	    aggregate_span_features,
	    lexical_hallucination_spans,
	    spans_from_lookback_ratios,
	    merge_spans,
	)

	def predict(query: str, context: str, output: str) -> dict:
	    row = pd.Series({"query": query, "context": context, "output": output})
	    # normalize context (converts raw tool JSON to readable text)
	    row_df = add_normalized_context_columns(pd.DataFrame([row]))
	    row = row_df.iloc[0]

	    # Branch A: lexical
	    lex_spans = lexical_hallucination_spans(row)
	    lex_feats = aggregate_span_features(lex_spans, len(output))
	    p_lex = lex_clf.predict_proba(pd.DataFrame([lex_feats]))[0, 1]

	    # Branch B: LettuceDetect
	    raw_spans = lettuce_detector.predict(
	        context=[row["normalized_context"]],
	        question=query, answer=output, output_format="spans",
	    )
	    # Keep only non-empty spans and convert them to the shared span format
	    lettuce_spans = merge_spans([
	        {"start": int(s["start"]), "end": int(s["end"]),
	         "text": output[int(s["start"]):int(s["end"])],
	         "type": "hallucination", "score": float(s.get("score", 0.0))}
	        for s in raw_spans if int(s.get("end", 0)) > int(s.get("start", 0))
	    ])
	    lettuce_feats = aggregate_span_features(lettuce_spans, len(output))
	    p_lettuce = lettuce_clf.predict_proba(pd.DataFrame([lettuce_feats]))[0, 1]

	    # Branch C: LookBack
	    # (reuse compute_lookback_ratios from the training notebook)
	    # p_lbl = lbl_clf.predict_proba(pd.DataFrame([lbl_feats]))[0, 1]
	    # For a self-contained example we skip Branch C and average A+B only.
	    # Note: this two-branch average is not the three-branch ensemble
	    # reported in the Performance table.
	    p_ensemble = np.mean([p_lex, p_lettuce])

	    return {
	        "hallucinated": bool(p_ensemble >= 0.5),
	        "score": float(p_ensemble),
	        "lex_score": float(p_lex),
	        "lettuce_score": float(p_lettuce),
	        "lettuce_spans": lettuce_spans,
	    }

	result = predict(
	    query="What is the current price of AAPL?",
	    context='Stock API: {"ticker": "AAPL", "price": 189.50, "change": "+1.2%"}',
	    output="The current price of AAPL is $189.50, up 1.2%. It also hit an all-time high last Tuesday.",
	)
	print(result)
	```

	---

	## Training details

	- Source dataset: [jameVee/ToolACE-Hallucination](https://huggingface.co/datasets/jameVee/ToolACE-Hallucination) (1 034 base examples × 4 variants = 4 136 rows total)
	- Split: 80 / 20 grouped by `example_id` (no leakage between clean and corrupted variants of the same query)
	- Classifiers: scikit-learn `LogisticRegression(max_iter=1000)` wrapped in a `StandardScaler` pipeline
	- Random seed: 1241
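
The grouped split described above can be sketched with the standard library alone. This is an illustration of the idea, not the training code: the function name is hypothetical, and a library utility such as scikit-learn's `GroupShuffleSplit` would serve the same purpose.

```python
import random


def grouped_split(example_ids: list[str], test_fraction: float = 0.2, seed: int = 1241):
    """Return (train_idx, test_idx) so that no example_id appears on both sides."""
    groups = sorted(set(example_ids))
    random.Random(seed).shuffle(groups)  # shuffle groups, not rows
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(example_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(example_ids) if g in test_groups]
    return train_idx, test_idx


# Hypothetical data: 10 base examples, each with 4 variants (1 clean + 3 corrupted).
ids = [f"ex{i}" for i in range(10) for _ in range(4)]
train_idx, test_idx = grouped_split(ids)
print(len(train_idx), len(test_idx))
```

Splitting on groups rather than rows keeps the clean and corrupted variants of one query together, so the test set never contains a near-duplicate of a training row.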

	### About the training dataset

	`jameVee/ToolACE-Hallucination` contains three JSONL files, each derived from [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE):

	- `missing_tool_dataset.jsonl` — generated with `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` (50 % corruption rate): a sentence is appended that refers to a non-existent tool.
	- `overgeneration_dataset.jsonl` — same generator (50 % corruption rate): a plausible but unsupported sentence is appended.
	- `tool_output_contradiction_dataset.jsonl` — generated with `openai/gpt-4o-mini` via OpenRouter (all entries attempted, strength 0.9): the answer is rewritten to contradict grounded facts from the tool output.

	Each entry carries character-level `hallucination_labels` marking the corrupted span(s).

	---

	## Limitations

	- The classifiers are trained on a relatively small dataset (~3 300 training rows). Performance may degrade on domains or tool schemas not represented in ToolACE.
	- Branch C (LookBack) is close to chance on the test set (AUROC 0.511) and reaches a Span-F1 of 0.000 at the default ratio threshold (0.22); tuning this threshold on a validation split is recommended.
	- The ensemble does not produce type-specific labels — it only predicts binary hallucinated / clean at the sample level.
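
Tuning the LookBack ratio threshold on a validation split could look like the sweep below. It is a minimal sketch under stated assumptions: token-level F1 is used as the selection metric, and the ratios, gold labels, and candidate thresholds are made-up illustration values, not measurements from this model.

```python
def token_f1(ratios: list[float], gold: list[int], threshold: float) -> float:
    """F1 of flagging tokens with ratio < threshold against 0/1 gold labels."""
    pred = [r < threshold for r in ratios]
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_threshold(ratios: list[float], gold: list[int], candidates: list[float]) -> float:
    """Pick the candidate with the highest validation F1 (first one wins ties)."""
    return max(candidates, key=lambda t: token_f1(ratios, gold, t))


# Hypothetical validation tokens: lookback ratio and whether the token is hallucinated.
val_ratios = [0.9, 0.5, 0.3, 0.25, 0.6, 0.1]
val_gold = [0, 0, 1, 1, 0, 1]
best = tune_threshold(val_ratios, val_gold, [0.10, 0.22, 0.35, 0.55])
print(best)
```

The threshold should be selected on data held out from both training and the final test set, otherwise the reported test metrics become optimistic.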

	---

	## Citation

	If you use this model or the associated datasets, please cite:

	```bibtex
	@misc{toolace_hallucination_detector,
	author = {jameVee},
	title = {ToolACE Hallucination Detector},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/jameVee/ToolACE-Hallucination-Detector}
	}

	@dataset{toolace_hallucination,
	author = {jameVee},
	title = {ToolACE-Hallucination},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/datasets/jameVee/ToolACE-Hallucination}
	}

	@dataset{toolace,
	author = {Team-ACE},
	title = {ToolACE},
	year = {2024},
	publisher = {Hugging Face},
	url = {https://huggingface.co/datasets/Team-ACE/ToolACE}
	}
	```