ToolACE Hallucination Detector

A lightweight ensemble classifier for detecting hallucinations in LLM responses that follow tool calls. It targets three hallucination types (missing tool reference, unsupported overgeneration, and tool-output contradiction) and returns both a sample-level binary label and character-level span predictions.


Background

When an LLM answers a user query using tool outputs, it can hallucinate in distinct ways:

Type                        Description
missing_tool                The assistant suggests using a tool that was not available
overgeneration              The answer contains plausible but unsupported extra facts
tool_output_contradiction   The answer contradicts specific facts returned by the tool

This model was trained on jameVee/ToolACE-Hallucination, a benchmark of ~4 100 examples derived from Team-ACE/ToolACE. Each of the three hallucination datasets corrupts clean ToolACE examples with one specific hallucination type: the missing-tool and overgeneration sets use a ~50 % corruption rate, while for the tool-output contradiction set a corruption was attempted on every entry (see the dataset card for details).


Architecture

The detector is a three-branch soft-voting ensemble. Each branch independently produces a hallucination probability, and the three scores are averaged to give the final prediction.

                         ┌───────────────────────────────────┐
 query + context ──────► │  Branch A: Lexical Verifier       │ ──► P_A
 + tool output           │  (token overlap span detector     │
                         │   → StandardScaler + LogReg)      │
                         └───────────────────────────────────┘
                         ┌───────────────────────────────────┐
                    ───► │  Branch B: LettuceDetect          │ ──► P_B
                         │  (ModernBERT span model           │
                         │   → StandardScaler + LogReg)      │
                         └───────────────────────────────────┘
                         ┌───────────────────────────────────┐
                    ───► │  Branch C: LookBack-style         │ ──► P_C
                         │  (Qwen2.5-0.5B attention ratios   │
                         │   → StandardScaler + LogReg)      │
                         └───────────────────────────────────┘
                                           │
                               avg(P_A, P_B, P_C) ≥ 0.5
                                           │
                                     hallucinated?
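The decision rule in the diagram is an unweighted mean of the branch probabilities compared against 0.5. A minimal sketch of that soft vote; the function name `soft_vote` is hypothetical and not part of this repository:

```python
from statistics import fmean


def soft_vote(probs: list[float], threshold: float = 0.5) -> tuple[bool, float]:
    """Average per-branch hallucination probabilities and apply the threshold.

    `probs` holds one probability per branch (e.g. P_A, P_B, P_C).
    """
    if not probs:
        raise ValueError("need at least one branch probability")
    score = fmean(probs)  # unweighted soft vote
    return score >= threshold, score


# Two branches lean "hallucinated", one leans "clean".
label, score = soft_vote([0.62, 0.71, 0.40])
print(label, round(score, 3))  # True 0.577
```

Because every branch has equal weight, a near-chance branch pulls the average toward 0.5 as much as a strong branch pushes it away.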

Branch A: Lexical Span Verifier

Checks whether tokens in the assistant's answer are grounded in the (normalized) tool output using lexical overlap. Unsupported token sequences become candidate hallucination spans. Span-level features (count, coverage, max score) are fed into a logistic regression.
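The shipped `lexical_hallucination_spans` (in `evaluation_baselines_span_utils.py`) defines the real tokenization and scoring, which this card does not document. The sketch below is an illustrative stand-in under simple assumptions: answer word/number tokens that never occur in the tool output are unsupported, a tiny stopword list counts as supported, and unsupported tokens separated by at most `max_gap` supported tokens merge into one character span. `lexical_spans`, `STOPWORDS` and `max_gap` are hypothetical names.

```python
import re

# Number-or-word tokens; decimals such as 189.50 stay in one token.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+")
# Illustrative stopword list (an assumption); these tokens count as supported.
STOPWORDS = {"a", "an", "the", "of", "is", "are", "was", "it", "its",
             "and", "to", "in", "on", "at"}


def lexical_spans(answer: str, context: str, max_gap: int = 1) -> list[dict]:
    """Character spans of answer tokens that never appear in the context."""
    supported = {t.lower() for t in TOKEN_RE.findall(context)} | STOPWORDS
    spans: list[dict] = []
    gap = 0  # supported tokens seen since the last unsupported token
    for m in TOKEN_RE.finditer(answer):
        if m.group().lower() in supported:
            gap += 1
            continue
        if spans and gap <= max_gap:
            spans[-1]["end"] = m.end()  # extend the open span
        else:
            spans.append({"start": m.start(), "end": m.end()})
        gap = 0
    for s in spans:
        s["text"] = answer[s["start"]:s["end"]]
    return spans


context = 'Stock API: {"ticker": "AAPL", "price": 189.50, "change": "+1.2%"}'
answer = ("The current price of AAPL is $189.50, up 1.2%. "
          "It also hit an all-time high last Tuesday.")
print([s["text"] for s in lexical_spans(answer, context)])
# ['current', 'up', 'also hit an all-time high last Tuesday']
```

The unsupported sentence is caught, but so are harmless words like "current" and "up", which is the kind of noise that keeps purely lexical span scores low.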

Branch B: LettuceDetect (supervised span model)

Uses KRLabsOrg/lettucedect-base-modernbert-en-v1, a ModernBERT-based transformer fine-tuned for grounded hallucination detection. It produces character-level spans, each with a confidence score, and a logistic regression trained on span features converts them into a sample-level probability. This is the best single branch (AUROC 0.721, Span-F1 0.208).
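Both span-based branches reduce a list of spans to a small feature vector before the logistic regression. The real columns come from `aggregate_span_features` and are listed in `model_meta.json`; this sketch assumes just three (span count, fraction of answer characters covered, maximum span score) and reproduces by hand what a fitted binary `StandardScaler` + `LogisticRegression` pipeline computes in `predict_proba`: standardize with the scaler's `mean_` and `scale_`, then apply the sigmoid to `coef_ · z + intercept_`. The helper names and every numeric parameter are hypothetical.

```python
import math


def span_features(spans: list[dict], answer_len: int) -> list[float]:
    """[span count, fraction of answer characters covered, max span score]."""
    covered: set[int] = set()
    for s in spans:
        covered.update(range(s["start"], s["end"]))
    coverage = len(covered) / answer_len if answer_len else 0.0
    max_score = max((s.get("score", 0.0) for s in spans), default=0.0)
    return [float(len(spans)), coverage, max_score]


def pipeline_proba(x, mean, scale, coef, intercept) -> float:
    """P(hallucinated) from a binary StandardScaler + LogisticRegression."""
    z = [(xi - m) / s for xi, m, s in zip(x, mean, scale)]  # standardize
    logit = sum(w * zi for w, zi in zip(coef, z)) + intercept
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Made-up fitted parameters (NOT the values stored in lettuce_clf.joblib).
MEAN, SCALE = [0.8, 0.10, 0.35], [1.0, 0.15, 0.40]
COEF, INTERCEPT = [0.4, 1.2, 0.9], -0.2

spans = [{"start": 47, "end": 88, "score": 0.91}]
feats = span_features(spans, answer_len=88)
print(round(pipeline_proba(feats, MEAN, SCALE, COEF, INTERCEPT), 3))
```

In practice you would call `lettuce_clf.predict_proba(...)` directly; the manual version only shows why the scaler and the feature order must match training exactly.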

Branch C: LookBack-style Detector

Passes the concatenation of context and answer through Qwen/Qwen2.5-0.5B and computes, for each answer token, the ratio of the attention it pays to the context versus to the preceding answer tokens. Tokens with low grounding (ratio < 0.22) are merged into hallucination spans, and span and ratio features are fed into a logistic regression.
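The per-token score can be sketched with the lookback-ratio form from Lookback Lens, assuming this branch follows it: ratio = mean attention on context positions / (mean attention on context positions + mean attention on earlier answer positions). How the branch aggregates over heads and layers is not stated here, so the sketch takes one attention row per answer token (e.g. already averaged over heads). `lookback_ratio` and `low_grounding_spans` are hypothetical; only the 0.22 threshold comes from this card.

```python
from statistics import fmean


def lookback_ratio(attn_row: list[float], n_context: int) -> float:
    """Lookback ratio for one answer token.

    attn_row[j] is the attention paid to position j: context positions
    first, then earlier answer tokens.
    """
    if n_context <= 0:
        raise ValueError("context must contain at least one token")
    ctx = fmean(attn_row[:n_context])
    new = attn_row[n_context:]
    new_mean = fmean(new) if new else 0.0  # first answer token has no history
    denom = ctx + new_mean
    return ctx / denom if denom > 0 else 0.0


def low_grounding_spans(ratios: list[float],
                        threshold: float = 0.22) -> list[tuple[int, int]]:
    """Merge consecutive tokens with ratio < threshold into [start, end) spans."""
    spans, start = [], None
    for i, r in enumerate(ratios):
        if r < threshold and start is None:
            start = i
        elif r >= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(ratios)))
    return spans


# Toy attention rows: 4 context positions, then a growing answer prefix.
rows = [
    [0.30, 0.30, 0.20, 0.20],                    # answer token 0
    [0.25, 0.25, 0.20, 0.20, 0.10],              # token 1: mostly context
    [0.02, 0.02, 0.02, 0.02, 0.46, 0.46],        # token 2: looks at answer
    [0.01, 0.01, 0.01, 0.01, 0.30, 0.30, 0.36],  # token 3: looks at answer
]
ratios = [lookback_ratio(r, n_context=4) for r in rows]
print(low_grounding_spans(ratios))  # [(2, 4)]
```

These are token-index spans; mapping them to character offsets could use a fast tokenizer's `return_offsets_mapping=True` output.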


Files in this repository

File                                 Description
lexical_clf.joblib                   Branch A sklearn Pipeline (StandardScaler + LogisticRegression)
lettuce_clf.joblib                   Branch B sklearn Pipeline (StandardScaler + LogisticRegression)
lookback_clf.joblib                  Branch C sklearn Pipeline (StandardScaler + LogisticRegression)
model_meta.json                      Feature column names, backbone IDs, train/test split sizes, test metrics
evaluation_baselines_span_utils.py   Span extraction helpers required at inference time

Performance (test set, 20 % held-out, grouped split by example_id)

Method                             Accuracy   F1      AUROC   Span-F1
Lexical span verifier (Branch A)   0.566      0.523   0.603   0.053
LettuceDetect (Branch B)           0.678      0.632   0.721   0.208
LookBack (Branch C)                0.489      0.561   0.511   0.000
Soft-vote ensemble                 0.676      0.659   0.721   0.179

Per-type F1 (ensemble):

Corruption type             F1
missing_tool                0.617
overgeneration              0.693
tool_output_contradiction   0.779
clean                       n/a

How to use

1. Install dependencies

pip install joblib scikit-learn transformers lettucedetect torch pandas numpy

2. Load the classifiers

import joblib, json
from pathlib import Path

repo = Path("hallucination_detector")   # or your local clone path

lex_clf     = joblib.load(repo / "lexical_clf.joblib")
lettuce_clf = joblib.load(repo / "lettuce_clf.joblib")
lbl_clf     = joblib.load(repo / "lookback_clf.joblib")

with open(repo / "model_meta.json") as f:
    meta = json.load(f)

3. Load the backbone models

import torch
from lettucedetect.models.inference import HallucinationDetector
from transformers import AutoTokenizer, AutoModelForCausalLM

lettuce_detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
lbl_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", use_fast=True)
lbl_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16, attn_implementation="eager"
).to(DEVICE).eval()

4. Run inference

import sys
sys.path.insert(0, str(repo))    # make span utils importable
from evaluation_baselines_span_utils import (
    add_normalized_context_columns,
    aggregate_lookback_features,
    aggregate_span_features,
    lexical_hallucination_spans,
    spans_from_lookback_ratios,
    merge_spans,
)
import pandas as pd, numpy as np

def predict(query: str, context: str, output: str) -> dict:
    row = pd.Series({"query": query, "context": context, "output": output})
    # normalize context (converts raw tool JSON to readable text)
    row_df = add_normalized_context_columns(pd.DataFrame([row]))
    row = row_df.iloc[0]

    # Branch A: lexical
    lex_spans = lexical_hallucination_spans(row)
    lex_feats = aggregate_span_features(lex_spans, len(output))
    p_lex = lex_clf.predict_proba(pd.DataFrame([lex_feats]))[0, 1]

    # Branch B: LettuceDetect
    raw_spans = lettuce_detector.predict(
        context=[row["normalized_context"]],
        question=query, answer=output, output_format="spans",
    )
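    # LettuceDetect reports span confidence under "confidence"; fall back to
    # "score" for span dicts that use that key instead.
    lettuce_spans = merge_spans([
        {"start": int(s["start"]), "end": int(s["end"]),
         "text": output[int(s["start"]):int(s["end"])],
         "type": "hallucination",
         "score": float(s.get("confidence", s.get("score", 0.0)))}
        for s in raw_spans if int(s.get("end", 0)) > int(s.get("start", 0))
    ])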
    lettuce_feats = aggregate_span_features(lettuce_spans, len(output))
    p_lettuce = lettuce_clf.predict_proba(pd.DataFrame([lettuce_feats]))[0, 1]

    # Branch C: LookBack (uses lbl_model / lbl_tokenizer from step 3)
    # Building lbl_feats requires compute_lookback_ratios from the training
    # notebook; once available:
    # p_lbl = lbl_clf.predict_proba(pd.DataFrame([lbl_feats]))[0, 1]
    # p_ensemble = np.mean([p_lex, p_lettuce, p_lbl])
    # To keep this example self-contained, Branch C is skipped and only A + B
    # are averaged, so scores can differ from the documented 3-branch ensemble.
    p_ensemble = np.mean([p_lex, p_lettuce])

    return {
        "hallucinated": bool(p_ensemble >= 0.5),
        "score": float(p_ensemble),
        "lex_score": float(p_lex),
        "lettuce_score": float(p_lettuce),
        "lettuce_spans": lettuce_spans,
    }

result = predict(
    query="What is the current price of AAPL?",
    context='Stock API: {"ticker": "AAPL", "price": 189.50, "change": "+1.2%"}',
    output="The current price of AAPL is $189.50, up 1.2%. It also hit an all-time high last Tuesday.",
)
print(result)

Training details

  • Source dataset: jameVee/ToolACE-Hallucination (1 034 base examples × 4 variants = 4 136 rows total)
  • Split: 80 / 20 grouped by example_id (no leakage between clean and corrupted variants of the same query)
  • Classifiers: scikit-learn LogisticRegression(max_iter=1000) wrapped in a StandardScaler pipeline
  • Random seed: 1241
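A grouped split can be sketched with the standard library: shuffle the unique example_id values with the stated seed and put whole groups on the test side until it holds about 20 % of them. This only illustrates the idea; the card does not say which utility produced the actual split (scikit-learn's GroupShuffleSplit is a common choice), so the same seed will not necessarily reproduce it. `grouped_split` is a hypothetical helper.

```python
import random


def grouped_split(rows: list[dict], test_frac: float = 0.2, seed: int = 1241):
    """Split rows so that every row sharing an example_id lands on one side."""
    groups = sorted({r["example_id"] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = round(len(groups) * test_frac)
    test_ids = set(groups[:n_test])
    train = [r for r in rows if r["example_id"] not in test_ids]
    test = [r for r in rows if r["example_id"] in test_ids]
    return train, test


# Toy data: 10 base examples with 4 variants each, mirroring the
# 4 variants per base example described above.
rows = [{"example_id": i, "variant": v} for i in range(10) for v in range(4)]
train, test = grouped_split(rows)
print(len(train), len(test))  # 32 8
```

Splitting on groups rather than rows is what prevents the clean and corrupted variants of one query from straddling train and test.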

About the training dataset

jameVee/ToolACE-Hallucination contains three JSONL files, each derived from Team-ACE/ToolACE:

  • missing_tool_dataset.jsonl: generated with unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit (50 % corruption rate). A sentence that refers to a non-existent tool is appended.
  • overgeneration_dataset.jsonl: same generator (50 % corruption rate). A plausible but unsupported sentence is appended.
  • tool_output_contradiction_dataset.jsonl: generated with openai/gpt-4o-mini via OpenRouter (all entries attempted, strength 0.9). The answer is rewritten to contradict grounded facts from the tool output.

Each entry carries character-level hallucination_labels marking the corrupted span(s).
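The card reports Span-F1 without defining it. One common character-level definition, assumed here for illustration, treats predicted and gold spans as sets of character offsets and computes F1 on their overlap. The sketch also assumes each label is a dict with integer `start` and `end` keys (end exclusive); check the dataset card for the actual hallucination_labels schema. `char_span_f1` is a hypothetical helper.

```python
def char_offsets(spans: list[dict]) -> set[int]:
    """Character positions covered by [start, end) spans."""
    return {i for s in spans for i in range(s["start"], s["end"])}


def char_span_f1(pred: list[dict], gold: list[dict]) -> float:
    """Character-overlap F1 between predicted and gold spans for one answer."""
    p, g = char_offsets(pred), char_offsets(gold)
    if not p and not g:
        return 1.0  # nothing predicted and nothing to find
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


gold = [{"start": 47, "end": 88}]
pred = [{"start": 52, "end": 88}, {"start": 4, "end": 11}]
print(round(char_span_f1(pred, gold), 3))  # 0.857
```

Whether clean examples (empty gold and empty prediction) score 1.0, as here, or are left out of the average changes the aggregate number, so that convention is also an assumption.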


Limitations

  • The classifiers are trained on a relatively small dataset (~3 300 training rows). Performance may degrade on domains or tool schemas not represented in ToolACE.
  • Branch C (LookBack) is near chance at the sample level (AUROC 0.511) and reaches a Span-F1 of 0.000 at the default ratio threshold (0.22); tuning this threshold on a validation split is recommended.
  • The ensemble does not produce type-specific labels β€” it only predicts binary hallucinated / clean at the sample level.

Citation

If you use this model or the associated datasets, please cite:

@misc{toolace_hallucination_detector,
  author    = {jameVee},
  title     = {ToolACE Hallucination Detector},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jameVee/ToolACE-Hallucination-Detector}
}

@dataset{toolace_hallucination,
  author    = {jameVee},
  title     = {ToolACE-Hallucination},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/jameVee/ToolACE-Hallucination}
}

@dataset{toolace,
  author    = {Team-ACE},
  title     = {ToolACE},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Team-ACE/ToolACE}
}