language:
- en
license: mit
base_model:
- KRLabsOrg/lettucedect-base-modernbert-en-v1
- Qwen/Qwen2.5-0.5B
datasets:
- jameVee/ToolACE-Hallucination
tags:
- hallucination-detection
- tool-calling
- text-classification
- span-detection
- sklearn
- ensemble
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
pretty_name: ToolACE Hallucination Detector
ToolACE Hallucination Detector
A lightweight ensemble classifier for detecting hallucinations in LLM responses that follow tool calls. It targets three hallucination types (missing tool reference, unsupported overgeneration, and tool-output contradiction) and returns both a sample-level binary label and character-level span predictions.
Background
When an LLM answers a user query using tool outputs, it can hallucinate in distinct ways:
| Type | Description |
|---|---|
| `missing_tool` | The assistant suggests using a tool that was not available |
| `overgeneration` | The answer contains plausible but unsupported extra facts |
| `tool_output_contradiction` | The answer contradicts specific facts returned by the tool |
This model was trained on jameVee/ToolACE-Hallucination, a benchmark of ~4 100 examples derived from Team-ACE/ToolACE. Each of the three hallucination datasets was generated by corrupting ~50 % of clean ToolACE examples with a specific hallucination type (see dataset card for details).
Architecture
The detector is a three-branch soft-voting ensemble. Each branch independently produces a hallucination probability, and the three scores are averaged to give the final prediction.
                        ┌──────────────────────────────────┐
query + context ──────► │ Branch A: Lexical Verifier       │ ──► P_A
+ tool output           │ (token overlap span detector     │
                        │  → StandardScaler + LogReg)      │
                        └──────────────────────────────────┘
                        ┌──────────────────────────────────┐
                   ───► │ Branch B: LettuceDetect          │ ──► P_B
                        │ (ModernBERT span model           │
                        │  → StandardScaler + LogReg)      │
                        └──────────────────────────────────┘
                        ┌──────────────────────────────────┐
                   ───► │ Branch C: LookBack-style         │ ──► P_C
                        │ (Qwen2.5-0.5B attention ratios   │
                        │  → StandardScaler + LogReg)      │
                        └──────────────────────────────────┘
                                         │
                             avg(P_A, P_B, P_C) ≥ 0.5
                                         │
                                   hallucinated?
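The averaging step at the bottom of the diagram can be sketched in a few lines. This is a minimal illustration, not code from the repository: the function name `soft_vote`, the branch keys, and the example probabilities are all assumptions made for the sketch.

```python
from statistics import fmean


def soft_vote(branch_probs, threshold=0.5):
    """Average per-branch hallucination probabilities and apply a cutoff.

    `branch_probs` maps a branch name to its P(hallucinated). The names and
    values used below are made up for illustration.
    """
    if not branch_probs:
        raise ValueError("need at least one branch probability")
    score = fmean(branch_probs.values())
    return {"score": score, "hallucinated": score >= threshold}


# Hypothetical branch outputs for a single sample.
example = soft_vote({"lexical": 0.40, "lettuce": 0.80, "lookback": 0.45})
print(example)
```

Because the vote is a plain mean, one confident branch can outvote two mildly skeptical ones, as in the example values above.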
Branch A: Lexical Span Verifier
Checks whether tokens in the assistant's answer are grounded in the (normalized) tool output using lexical overlap. Unsupported token sequences become candidate hallucination spans. Span-level features (count, coverage, max score) are fed into a logistic regression.
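As a rough illustration of the idea, the sketch below flags answer tokens that never occur in the context and merges consecutive ones into character spans. It is a toy stand-in, not the repository's `lexical_hallucination_spans`: the tokenizer regex, the `min_len` shortcut for short function words, and the feature names `span_count` and `coverage` are assumptions.

```python
import re

TOKEN = re.compile(r"\w+(?:\.\w+)*")  # words and dotted numbers such as 189.50


def unsupported_spans(answer, context, min_len=3):
    """Return character spans of answer tokens that do not occur in the context.

    Tokens shorter than `min_len` are treated as supported so that short
    function words do not trigger spans. Consecutive unsupported tokens are
    merged into a single span.
    """
    context_vocab = {m.group().lower() for m in TOKEN.finditer(context)}
    spans, current = [], None
    for m in TOKEN.finditer(answer):
        token = m.group()
        supported = len(token) < min_len or token.lower() in context_vocab
        if supported:
            if current:
                spans.append(current)
            current = None
        elif current:
            current["end"] = m.end()  # extend the open span
        else:
            current = {"start": m.start(), "end": m.end()}
    if current:
        spans.append(current)
    return spans


def span_features(spans, answer_len):
    """Summarize spans as a small feature dict (hypothetical feature names)."""
    covered = sum(s["end"] - s["start"] for s in spans)
    return {"span_count": len(spans), "coverage": covered / max(answer_len, 1)}


context = "price 189.50 AAPL"
answer = "AAPL price is 189.50 today"
spans = unsupported_spans(answer, context)
feats = span_features(spans, len(answer))
print(spans, feats)
```

Here only the word "today" lacks support in the context, so it becomes the single candidate span.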
Branch B: LettuceDetect (supervised span model)
Uses KRLabsOrg/lettucedect-base-modernbert-en-v1, a ModernBERT-based transformer fine-tuned for grounded hallucination detection. It produces character-level spans scored by confidence. A logistic regression trained on span features converts those into a sample-level probability. This is the best single branch (AUROC 0.721, Span-F1 0.208).
Branch C: LookBack-style Detector
Passes the concatenation of context and answer through Qwen/Qwen2.5-0.5B and computes, for each answer token, the ratio of attention it pays to the context versus to previously generated tokens. Low-grounding tokens (ratio < 0.22) are merged into hallucination spans. Span and ratio features are fed to logistic regression.
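The ratio and the thresholding step can be illustrated on a toy attention matrix. This sketch assumes the attention weights have already been extracted and reduced to one row per answer token, and it compares summed attention mass; the repository may aggregate differently (for example per head or per layer, or with per-token averages). The function names and the toy weights are hypothetical.

```python
def lookback_ratios(attn_rows, n_context):
    """Compute context-vs-generated attention ratios for each answer token.

    attn_rows[i] holds the attention weights of answer token i over all
    earlier positions: the first `n_context` entries are context tokens,
    the rest are previously generated answer tokens.
    """
    ratios = []
    for row in attn_rows:
        ctx = sum(row[:n_context])
        gen = sum(row[n_context:])
        total = ctx + gen
        ratios.append(ctx / total if total > 0 else 0.0)
    return ratios


def low_grounding_runs(ratios, threshold=0.22):
    """Merge consecutive tokens with ratio < threshold into (start, end) runs.

    Indices are answer-token positions and `end` is exclusive.
    """
    runs, start = [], None
    for i, ratio in enumerate(ratios):
        if ratio < threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(ratios)))
    return runs


# Toy weights: 3 context tokens followed by 4 answer tokens.
attn_rows = [
    [0.5, 0.3, 0.2],
    [0.05, 0.05, 0.05, 0.85],
    [0.05, 0.05, 0.1, 0.4, 0.4],
    [0.3, 0.2, 0.1, 0.1, 0.1, 0.2],
]
ratios = lookback_ratios(attn_rows, n_context=3)
runs = low_grounding_runs(ratios)
print(ratios, runs)
```

The runs are token-index ranges; turning them into character spans would additionally require the tokenizer's offset mapping.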
Files in this repository
| File | Description |
|---|---|
| `lexical_clf.joblib` | Branch A sklearn Pipeline (StandardScaler + LogisticRegression) |
| `lettuce_clf.joblib` | Branch B sklearn Pipeline (StandardScaler + LogisticRegression) |
| `lookback_clf.joblib` | Branch C sklearn Pipeline (StandardScaler + LogisticRegression) |
| `model_meta.json` | Feature column names, backbone IDs, train/test split sizes, test metrics |
| `evaluation_baselines_span_utils.py` | Span extraction helpers required at inference time |
Performance (test set, 20 % held-out, grouped split by example_id)
| Method | Accuracy | F1 | AUROC | Span-F1 |
|---|---|---|---|---|
| Lexical span verifier (Branch A) | 0.566 | 0.523 | 0.603 | 0.053 |
| LettuceDetect (Branch B) | 0.678 | 0.632 | 0.721 | 0.208 |
| LookBack (Branch C) | 0.489 | 0.561 | 0.511 | 0.000 |
| Soft-vote ensemble | 0.676 | 0.659 | 0.721 | 0.179 |
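This card does not spell out how Span-F1 is computed. One common definition, shown here purely as an assumption, is the F1 over the sets of characters covered by predicted and gold spans. The function names and the convention of returning 1.0 when both sides are empty are choices made for this sketch.

```python
def covered_chars(spans):
    """Expand a list of {"start", "end"} spans into a set of character indices."""
    chars = set()
    for span in spans:
        chars.update(range(span["start"], span["end"]))
    return chars


def span_f1(pred_spans, gold_spans):
    """Character-overlap F1 between predicted and gold spans for one sample."""
    pred, gold = covered_chars(pred_spans), covered_chars(gold_spans)
    if not pred and not gold:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [{"start": 50, "end": 90}]
pred = [{"start": 60, "end": 100}]
score = span_f1(pred, gold)
print(score)
```

Under this definition, a branch that predicts no spans on hallucinated samples scores 0 on them, which is one way a Span-F1 near zero can coexist with a non-trivial sample-level F1.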
Per-type F1 (ensemble):
| Corruption type | F1 |
|---|---|
| `missing_tool` | 0.617 |
| `overgeneration` | 0.693 |
| `tool_output_contradiction` | 0.779 |
| `clean` | n/a |
How to use
1. Install dependencies
pip install joblib scikit-learn pandas numpy transformers lettucedetect torch
2. Load the classifiers
import json
from pathlib import Path

import joblib

repo = Path("hallucination_detector")  # or your local clone path

# One sklearn Pipeline (StandardScaler + LogisticRegression) per branch.
lex_clf = joblib.load(repo / "lexical_clf.joblib")
lettuce_clf = joblib.load(repo / "lettuce_clf.joblib")
lbl_clf = joblib.load(repo / "lookback_clf.joblib")

with open(repo / "model_meta.json") as f:
    meta = json.load(f)
3. Load the backbone models
import torch
from lettucedetect.models.inference import HallucinationDetector
from transformers import AutoTokenizer, AutoModelForCausalLM

# Branch B backbone: ModernBERT span detector.
lettuce_detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

# Branch C backbone: eager attention is needed to read attention weights.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
lbl_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", use_fast=True)
lbl_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16, attn_implementation="eager"
).to(DEVICE).eval()
4. Run inference
import sys

sys.path.insert(0, str(repo))  # make span utils importable

from evaluation_baselines_span_utils import (
    add_normalized_context_columns,
    aggregate_lookback_features,
    aggregate_span_features,
    lexical_hallucination_spans,
    spans_from_lookback_ratios,
    merge_spans,
)
import numpy as np
import pandas as pd
def predict(query: str, context: str, output: str) -> dict:
    """Score one (query, tool context, answer) triple with Branches A and B."""
    row = pd.Series({"query": query, "context": context, "output": output})
    # Normalize the context (converts raw tool JSON to readable text).
    row_df = add_normalized_context_columns(pd.DataFrame([row]))
    row = row_df.iloc[0]

    # Branch A: lexical
    lex_spans = lexical_hallucination_spans(row)
    lex_feats = aggregate_span_features(lex_spans, len(output))
    p_lex = lex_clf.predict_proba(pd.DataFrame([lex_feats]))[0, 1]

    # Branch B: LettuceDetect
    raw_spans = lettuce_detector.predict(
        context=[row["normalized_context"]],
        question=query, answer=output, output_format="spans",
    )
    # Keep only non-empty spans and convert them to the shared span format.
    lettuce_spans = merge_spans([
        {"start": int(s["start"]), "end": int(s["end"]),
         "text": output[int(s["start"]):int(s["end"])],
         "type": "hallucination", "score": float(s.get("score", 0.0))}
        for s in raw_spans if int(s.get("end", 0)) > int(s.get("start", 0))
    ])
    lettuce_feats = aggregate_span_features(lettuce_spans, len(output))
    p_lettuce = lettuce_clf.predict_proba(pd.DataFrame([lettuce_feats]))[0, 1]

    # Branch C: LookBack
    # Not wired in here: it needs compute_lookback_ratios from the training notebook.
    # With its features in lbl_feats, the branch score would be:
    # p_lbl = lbl_clf.predict_proba(pd.DataFrame([lbl_feats]))[0, 1]
    # For a self-contained example we skip Branch C and average A+B only:
    p_ensemble = np.mean([p_lex, p_lettuce])

    return {
        "hallucinated": bool(p_ensemble >= 0.5),
        "score": float(p_ensemble),
        "lex_score": float(p_lex),
        "lettuce_score": float(p_lettuce),
        "lettuce_spans": lettuce_spans,
    }
result = predict(
    query="What is the current price of AAPL?",
    context='Stock API: {"ticker": "AAPL", "price": 189.50, "change": "+1.2%"}',
    output="The current price of AAPL is $189.50, up 1.2%. It also hit an all-time high last Tuesday.",
)
print(result)
Training details
- Source dataset: jameVee/ToolACE-Hallucination (1 034 base examples × 4 variants = 4 136 rows total)
- Split: 80 / 20 grouped by `example_id` (no leakage between clean and corrupted variants of the same query)
- Classifiers: scikit-learn `LogisticRegression(max_iter=1000)` wrapped in a `StandardScaler` pipeline
- Random seed: 1241
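The point of the grouped split is that every variant of a base example lands on the same side. The sketch below shows that property with the standard library only; it is not the training code, and the card does not say which splitter was used (scikit-learn's `GroupShuffleSplit` would serve the same purpose). The helper name `grouped_split` and the synthetic rows are assumptions.

```python
import random
from collections import defaultdict


def grouped_split(rows, group_key="example_id", test_fraction=0.2, seed=1241):
    """Split rows so that all rows sharing `group_key` stay together."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)  # shuffle group ids, not individual rows
    n_test = max(1, round(len(ids) * test_fraction))
    test = [r for gid in ids[:n_test] for r in groups[gid]]
    train = [r for gid in ids[n_test:] for r in groups[gid]]
    return train, test


# Synthetic rows: 10 base examples, each with a clean and three corrupted variants.
variants = ["clean", "missing_tool", "overgeneration", "tool_output_contradiction"]
rows = [{"example_id": i, "variant": v} for i in range(10) for v in variants]
train, test = grouped_split(rows)

train_ids = {r["example_id"] for r in train}
test_ids = {r["example_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting on rows instead of groups would let a clean answer appear in training while its corrupted twin sits in the test set, which inflates the reported metrics.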
About the training dataset
jameVee/ToolACE-Hallucination contains three JSONL files, each derived from Team-ACE/ToolACE:
- `missing_tool_dataset.jsonl`: generated with `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` (50 % corruption rate); a sentence is appended that refers to a non-existent tool.
- `overgeneration_dataset.jsonl`: same generator (50 % corruption rate); a plausible but unsupported sentence is appended.
- `tool_output_contradiction_dataset.jsonl`: generated with `openai/gpt-4o-mini` via OpenRouter (all entries attempted, strength 0.9); the answer is rewritten to contradict grounded facts from the tool output.
Each entry carries character-level hallucination_labels marking the corrupted span(s).
Limitations
- The classifiers are trained on a relatively small dataset (~3 300 training rows). Performance may degrade on domains or tool schemas not represented in ToolACE.
- Branch C (LookBack) shows weak span-level performance at the default threshold (0.22); tuning this on a validation split is recommended.
- The ensemble does not produce type-specific labels; it only predicts binary hallucinated / clean at the sample level.
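Tuning the Branch C threshold, as suggested in the limitations above, amounts to a small sweep on a validation split. The sketch below is one possible way to do it and optimizes sample-level F1 for brevity; the same loop would work with a span-level metric. The helper names, the idea of summarizing each sample by its lowest lookback ratio, and the toy numbers are all assumptions.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from two equal-length sequences of truthy/falsy labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(min_ratios, labels, candidates):
    """Return (best_f1, threshold) over the candidate cutoffs.

    A sample is flagged as hallucinated when its lowest lookback ratio falls
    below the cutoff. Ties keep the first candidate that reached the best F1.
    """
    scored = [
        (f1_score(labels, [ratio < cutoff for ratio in min_ratios]), cutoff)
        for cutoff in candidates
    ]
    return max(scored, key=lambda pair: pair[0])


# Toy validation data: per-sample minimum ratio and gold label (1 = hallucinated).
val_min_ratios = [0.10, 0.18, 0.30, 0.35, 0.50, 0.60]
val_labels = [1, 1, 1, 0, 0, 0]
best_f1, best_cutoff = best_threshold(val_min_ratios, val_labels, [0.15, 0.22, 0.32, 0.40])
print(best_f1, best_cutoff)
```

On this toy data the default 0.22 misses the third hallucinated sample, while 0.32 separates the two classes cleanly.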
Citation
If you use this model or the associated datasets, please cite:
@misc{toolace_hallucination_detector,
author = {jameVee},
title = {ToolACE Hallucination Detector},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/jameVee/ToolACE-Hallucination-Detector}
}
@dataset{toolace_hallucination,
author = {jameVee},
title = {ToolACE-Hallucination},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/jameVee/ToolACE-Hallucination}
}
@dataset{toolace,
author = {Team-ACE},
title = {ToolACE},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Team-ACE/ToolACE}
}