---
language:
- en
license: mit
base_model:
- KRLabsOrg/lettucedect-base-modernbert-en-v1
- Qwen/Qwen2.5-0.5B
datasets:
- jameVee/ToolACE-Hallucination
tags:
- hallucination-detection
- tool-calling
- text-classification
- span-detection
- sklearn
- ensemble
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
pretty_name: ToolACE Hallucination Detector
---

# ToolACE Hallucination Detector

A lightweight ensemble classifier for detecting hallucinations in LLM responses that follow tool calls. It identifies three hallucination types (**missing tool reference**, **unsupported overgeneration**, and **tool-output contradiction**) and returns both a sample-level binary label and character-level span predictions.

---

## Background

When an LLM answers a user query using tool outputs, it can hallucinate in distinct ways:

| Type | Description |
|---|---|
| `missing_tool` | The assistant suggests using a tool that was not available |
| `overgeneration` | The answer contains plausible but unsupported extra facts |
| `tool_output_contradiction` | The answer contradicts specific facts returned by the tool |

This model was trained on **[jameVee/ToolACE-Hallucination](https://huggingface.co/datasets/jameVee/ToolACE-Hallucination)**, a benchmark of ~4 100 examples derived from [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE). Each of the three hallucination datasets was generated by corrupting ~50 % of clean ToolACE examples with a specific hallucination type (see the dataset card for details).

---

## Architecture

The detector is a **three-branch soft-voting ensemble**. Each branch independently produces a hallucination probability, and the three scores are averaged to give the final prediction.
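The averaging step can be sketched in a few lines. This is a minimal illustration rather than the repository's code: `soft_vote` and the branch names are hypothetical, the input probabilities are made up, and the 0.5 threshold is the one shown in the diagram in this section.

```python
from statistics import fmean


def soft_vote(branch_probs: dict[str, float], threshold: float = 0.5) -> dict:
    """Average per-branch hallucination probabilities and threshold the mean."""
    if not branch_probs:
        raise ValueError("at least one branch probability is required")
    score = fmean(branch_probs.values())
    return {"score": score, "hallucinated": score >= threshold}


# Hypothetical branch outputs for a single response (not real model scores).
result = soft_vote({"lexical": 0.62, "lettuce": 0.71, "lookback": 0.48})
print(result["hallucinated"], round(result["score"], 3))
```

Because the vote is an unweighted mean, one confident branch can be outvoted by two uncertain ones; the same function works unchanged when only a subset of branches is available.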
```
                         ┌──────────────────────────────────┐
query + context ──────►  │  Branch A: Lexical Verifier      │ ──► P_A
+ tool output            │  (token overlap span detector    │
                         │   → StandardScaler + LogReg)     │
                         └──────────────────────────────────┘
                         ┌──────────────────────────────────┐
                   ───►  │  Branch B: LettuceDetect         │ ──► P_B
                         │  (ModernBERT span model          │
                         │   → StandardScaler + LogReg)     │
                         └──────────────────────────────────┘
                         ┌──────────────────────────────────┐
                   ───►  │  Branch C: LookBack-style        │ ──► P_C
                         │  (Qwen2.5-0.5B attention ratios  │
                         │   → StandardScaler + LogReg)     │
                         └──────────────────────────────────┘
                                          │
                              avg(P_A, P_B, P_C) ≥ 0.5
                                          │
                                    hallucinated?
```

### Branch A: Lexical Span Verifier

Checks whether tokens in the assistant's answer are grounded in the (normalized) tool output using lexical overlap. Unsupported token sequences become candidate hallucination spans. Span-level features (count, coverage, max score) are fed into a logistic regression.

### Branch B: LettuceDetect (supervised span model)

Uses [KRLabsOrg/lettucedect-base-modernbert-en-v1](https://huggingface.co/KRLabsOrg/lettucedect-base-modernbert-en-v1), a ModernBERT-based transformer fine-tuned for grounded hallucination detection. It produces character-level spans scored by confidence. A logistic regression trained on span features converts those into a sample-level probability. It is the **best single branch** (AUROC 0.721, Span-F1 0.208).

### Branch C: LookBack-style Detector

Passes the concatenation of context and answer through [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) and computes, for each answer token, the ratio of attention it pays to the context versus to previously generated tokens. Low-grounding tokens (ratio < 0.22) are merged into hallucination spans. Span and ratio features are fed to a logistic regression.
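The lookback ratio used by Branch C can be made concrete with a small, model-free sketch. It rests on several assumptions: a single causal attention matrix that has already been averaged over layers and heads, a token's attention to itself counted as "generated" attention (so the first answer token has a defined ratio), and the ratio defined as mean context attention divided by the sum of mean context and mean generated attention. The helpers `lookback_ratios` and `low_grounding_spans` are hypothetical, and the repository's own `compute_lookback_ratios` may differ in any of these choices. The 0.22 threshold is the default quoted above.

```python
def lookback_ratios(attn: list[list[float]], n_context: int) -> list[float]:
    """Per-answer-token lookback ratio from one causal attention matrix.

    attn[t][j] is the attention that token t pays to token j. The first
    n_context positions are the context; the rest are answer tokens.
    """
    if not 0 < n_context < len(attn):
        raise ValueError("n_context must leave at least one answer token")
    ratios = []
    for t in range(n_context, len(attn)):
        # Mean attention per context token.
        ctx = sum(attn[t][:n_context]) / n_context
        # Mean attention per generated token, including token t itself (assumption).
        new = sum(attn[t][n_context:t + 1]) / (t - n_context + 1)
        total = ctx + new
        ratios.append(ctx / total if total > 0 else 0.0)
    return ratios


def low_grounding_spans(ratios, offsets, threshold=0.22):
    """Merge runs of consecutive low-ratio tokens into character spans."""
    spans, current = [], None
    for ratio, (start, end) in zip(ratios, offsets):
        if ratio < threshold:
            if current is None:
                current = {"start": start, "end": end, "min_ratio": ratio}
            else:
                current["end"] = end
                current["min_ratio"] = min(current["min_ratio"], ratio)
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans


# Toy input: 2 context tokens followed by 3 answer tokens (made-up weights).
attn = [
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00, 0.00],
    [0.40, 0.40, 0.20, 0.00, 0.00],  # answer token 0: mostly looks at context
    [0.05, 0.05, 0.45, 0.45, 0.00],  # answer token 1: mostly looks at the answer
    [0.05, 0.05, 0.30, 0.30, 0.30],  # answer token 2: mostly looks at the answer
]
answer = "$189.50 last Tuesday"
offsets = [(0, 7), (8, 12), (13, 20)]  # character offsets of the answer tokens

ratios = lookback_ratios(attn, n_context=2)
spans = low_grounding_spans(ratios, offsets)
for span in spans:
    print(answer[span["start"]:span["end"]], round(span["min_ratio"], 3))
```

With the backbone loaded in eager attention mode, as in the usage section of this card, per-layer attention maps are available from a forward pass with `output_attentions=True`; how they are pooled into a single matrix is a design choice this sketch does not settle.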
---

## Files in this repository

| File | Description |
|---|---|
| `lexical_clf.joblib` | Branch A sklearn Pipeline (StandardScaler + LogisticRegression) |
| `lettuce_clf.joblib` | Branch B sklearn Pipeline (StandardScaler + LogisticRegression) |
| `lookback_clf.joblib` | Branch C sklearn Pipeline (StandardScaler + LogisticRegression) |
| `model_meta.json` | Feature column names, backbone IDs, train/test split sizes, test metrics |
| `evaluation_baselines_span_utils.py` | Span extraction helpers required at inference time |

---

## Performance (test set, 20 % held-out, grouped split by example_id)

| Method | Accuracy | F1 | AUROC | Span-F1 |
|---|---|---|---|---|
| Lexical span verifier (Branch A) | 0.566 | 0.523 | 0.603 | 0.053 |
| LettuceDetect (Branch B) | **0.678** | 0.632 | **0.721** | **0.208** |
| LookBack (Branch C) | 0.489 | 0.561 | 0.511 | 0.000 |
| **Soft-vote ensemble** | 0.676 | **0.659** | **0.721** | 0.179 |

Per-type F1 (ensemble):

| Corruption type | F1 |
|---|---|
| `missing_tool` | 0.617 |
| `overgeneration` | 0.693 |
| `tool_output_contradiction` | 0.779 |
| `clean` | n/a |

---

## How to use

### 1. Install dependencies

```bash
pip install joblib scikit-learn pandas numpy transformers lettucedetect torch
```

### 2. Load the classifiers

```python
import json
from pathlib import Path

import joblib

repo = Path("hallucination_detector")  # or your local clone path

lex_clf = joblib.load(repo / "lexical_clf.joblib")
lettuce_clf = joblib.load(repo / "lettuce_clf.joblib")
lbl_clf = joblib.load(repo / "lookback_clf.joblib")

with open(repo / "model_meta.json") as f:
    meta = json.load(f)
```

### 3. Load the backbone models

```python
import torch
from lettucedetect.models.inference import HallucinationDetector
from transformers import AutoTokenizer, AutoModelForCausalLM

lettuce_detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
lbl_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", use_fast=True)
# Eager attention is needed so that attention weights can be read out for Branch C.
lbl_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
).to(DEVICE).eval()
```

### 4. Run inference

```python
import sys

import numpy as np
import pandas as pd

sys.path.insert(0, str(repo))  # make span utils importable
from evaluation_baselines_span_utils import (
    add_normalized_context_columns,
    aggregate_lookback_features,
    aggregate_span_features,
    lexical_hallucination_spans,
    spans_from_lookback_ratios,
    merge_spans,
)


def predict(query: str, context: str, output: str) -> dict:
    row = pd.Series({"query": query, "context": context, "output": output})
    # Normalize context (converts raw tool JSON to readable text)
    row_df = add_normalized_context_columns(pd.DataFrame([row]))
    row = row_df.iloc[0]

    # Branch A: lexical
    lex_spans = lexical_hallucination_spans(row)
    lex_feats = aggregate_span_features(lex_spans, len(output))
    p_lex = lex_clf.predict_proba(pd.DataFrame([lex_feats]))[0, 1]

    # Branch B: LettuceDetect
    raw_spans = lettuce_detector.predict(
        context=[row["normalized_context"]],
        question=query,
        answer=output,
        output_format="spans",
    )
    # Keep only non-empty spans and convert them to the shared span format
    lettuce_spans = merge_spans([
        {
            "start": int(s["start"]),
            "end": int(s["end"]),
            "text": output[int(s["start"]):int(s["end"])],
            "type": "hallucination",
            "score": float(s.get("score", 0.0)),
        }
        for s in raw_spans
        if int(s.get("end", 0)) > int(s.get("start", 0))
    ])
    lettuce_feats = aggregate_span_features(lettuce_spans, len(output))
    p_lettuce = lettuce_clf.predict_proba(pd.DataFrame([lettuce_feats]))[0, 1]

    # Branch C: LookBack
    # (reuse compute_lookback_ratios from the training notebook)
    # p_lbl = lbl_clf.predict_proba(pd.DataFrame([lbl_feats]))[0, 1]

    # For a self-contained example we skip Branch C and average A+B only:
    p_ensemble = np.mean([p_lex, p_lettuce])

    return {
        "hallucinated": bool(p_ensemble >= 0.5),
        "score": float(p_ensemble),
        "lex_score": float(p_lex),
        "lettuce_score": float(p_lettuce),
        "lettuce_spans": lettuce_spans,
    }


result = predict(
    query="What is the current price of AAPL?",
    context='Stock API: {"ticker": "AAPL", "price": 189.50, "change": "+1.2%"}',
    output="The current price of AAPL is $189.50, up 1.2%. It also hit an all-time high last Tuesday.",
)
print(result)
```

---

## Training details

- **Source dataset**: [jameVee/ToolACE-Hallucination](https://huggingface.co/datasets/jameVee/ToolACE-Hallucination) (1 034 base examples × 4 variants = 4 136 rows total)
- **Split**: 80 / 20 grouped by `example_id` (no leakage between clean and corrupted variants of the same query)
- **Classifiers**: scikit-learn `LogisticRegression(max_iter=1000)` preceded by a `StandardScaler` in a `Pipeline`
- **Random seed**: 1241

### About the training dataset

`jameVee/ToolACE-Hallucination` contains three JSONL files, each derived from [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE):

- `missing_tool_dataset.jsonl`: generated with `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` (50 % corruption rate); a sentence is appended that refers to a non-existent tool.
- `overgeneration_dataset.jsonl`: same generator (50 % corruption rate); a plausible but unsupported sentence is appended.
- `tool_output_contradiction_dataset.jsonl`: generated with `openai/gpt-4o-mini` via OpenRouter (all entries attempted, strength 0.9); the answer is rewritten to contradict grounded facts from the tool output.

Each entry carries character-level `hallucination_labels` marking the corrupted span(s).
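The grouped split listed under Training details matters because every base example appears in several variants; splitting by row would let a clean answer land in training and its corrupted twin in test. The sketch below shows the idea in plain Python. It is an illustration only: `grouped_split` is a hypothetical helper, the toy rows and their `variant` field are assumptions, and the card does not say which splitting utility was actually used. The seed and the 20 % test fraction are the values quoted in this card.

```python
import random


def grouped_split(rows, group_key="example_id", test_frac=0.2, seed=1241):
    """Split rows so that all rows sharing a group id land on the same side."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Toy data: 10 base examples, each with four hypothetical variants.
variants = ["clean", "missing_tool", "overgeneration", "tool_output_contradiction"]
rows = [{"example_id": i, "variant": v} for i in range(10) for v in variants]

train, test = grouped_split(rows)
train_ids = {r["example_id"] for r in train}
test_ids = {r["example_id"] for r in test}

# No base example contributes rows to both sides of the split.
assert train_ids.isdisjoint(test_ids)
print(len(train), len(test))
```

In a scikit-learn workflow, `GroupShuffleSplit` from `sklearn.model_selection` gives the same guarantee when `example_id` is passed as the `groups` argument.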
---

## Limitations

- The classifiers are trained on a relatively small dataset (~3 300 training rows). Performance may degrade on domains or tool schemas not represented in ToolACE.
- Branch C (LookBack) shows weak span-level performance at the default threshold of 0.22 (Span-F1 0.000 on the test set); tuning this threshold on a validation split is recommended.
- The ensemble does not produce type-specific labels; it only predicts binary hallucinated / clean at the sample level.

---

## Citation

If you use this model or the associated datasets, please cite:

```bibtex
@misc{toolace_hallucination_detector,
  author    = {jameVee},
  title     = {ToolACE Hallucination Detector},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jameVee/ToolACE-Hallucination-Detector}
}

@dataset{toolace_hallucination,
  author    = {jameVee},
  title     = {ToolACE-Hallucination},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/jameVee/ToolACE-Hallucination}
}

@dataset{toolace,
  author    = {Team-ACE},
  title     = {ToolACE},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Team-ACE/ToolACE}
}
```