toolace-halu-modernbert-large

ModernBERT-large token classifier fine-tuned for span-level hallucination detection in tool-calling dialogues. It targets the "Hallucination Detection in Tool Calling" task built on top of ToolACE.

Base model: KRLabsOrg/lettucedect-large-modernbert-en-v1 (RAGTruth-pretrained ModernBERT, used as a warm start)
Training data: Ivan1008/toolace-hallucination-spans, config combined, split train (1955 records)
Head: 7-class BIO token classification — O, B-/I-contradiction, B-/I-missing_tool, B-/I-overgeneration
Context: 2048 tokens, bf16

Input template

[CTX] <tool output JSON> [TOOLS] <comma-separated tool names> [Q] <user query> [A] <assistant answer>

Loss is computed only on tokens of the [A] <assistant answer> segment.
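A minimal sketch of how the template and the answer-only loss could fit together. The helper names `build_input` and `mask_non_answer_labels` are hypothetical, and masking with -100 (the default `ignore_index` of PyTorch's cross-entropy loss) is an assumption about the training code, not something this card confirms.

```python
IGNORE_INDEX = -100  # assumed mask value; PyTorch's CrossEntropyLoss ignores it by default


def build_input(tool_context_json, tool_names, user_query, assistant_answer):
    """Return the templated text and the character offset where the answer begins."""
    prefix = (
        "[CTX] " + tool_context_json
        + " [TOOLS] " + ", ".join(tool_names)
        + " [Q] " + user_query
        + " [A] "
    )
    return prefix + assistant_answer, len(prefix)


def mask_non_answer_labels(label_ids, offsets, answer_start):
    """Keep labels only for tokens that reach into the answer segment.

    `offsets` holds per-token (start, end) character offsets into the text.
    Tokens ending at or before `answer_start` are masked; this also covers
    special tokens, which fast tokenizers report with the empty offset (0, 0).
    """
    return [
        label if end > answer_start else IGNORE_INDEX
        for label, (start, end) in zip(label_ids, offsets)
    ]
```

With this layout, `text[answer_start:]` is exactly the assistant answer, so the same offset can be reused later to map predicted spans back into the answer.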

Training

5 epochs, lr 2e-5, weight-decay 0.01, batch 8, warmup 6%, bf16
Best checkpoint selected by token-level macro F1 on combined/validation
Single H200, ~15 minutes
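For a sense of scale, the schedule implied by these numbers can be worked out directly. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, a kept final partial batch, and warmup rounded up to a whole step; none of these details are stated on this card.

```python
import math

num_records = 1955       # combined/train size from this card
batch_size = 8
num_epochs = 5
warmup_fraction = 0.06

# Assumption: the last, smaller batch of each epoch is kept rather than dropped.
steps_per_epoch = math.ceil(num_records / batch_size)
total_steps = steps_per_epoch * num_epochs
# Assumption: the warmup length is rounded up to a whole optimizer step.
warmup_steps = math.ceil(total_steps * warmup_fraction)

print(steps_per_epoch, total_steps, warmup_steps)  # 245 1225 74
```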

epoch	val loss	val token F1	val precision	val recall
1	0.194	0.708	0.879	0.593
2	0.184	0.722	0.926	0.591
3	0.178	0.763	0.851	0.692
4	0.215	0.776	0.890	0.688
5	0.233	0.764	0.820	0.714
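The table can be read programmatically to see which epoch the F1-based selection rule picks. The sketch below assumes selection is a plain argmax over the reported validation F1; it also checks that each reported F1 agrees with the harmonic mean of the reported precision and recall to about three decimals.

```python
# (epoch, val_loss, val_f1, val_precision, val_recall), copied from the table above
rows = [
    (1, 0.194, 0.708, 0.879, 0.593),
    (2, 0.184, 0.722, 0.926, 0.591),
    (3, 0.178, 0.763, 0.851, 0.692),
    (4, 0.215, 0.776, 0.890, 0.688),
    (5, 0.233, 0.764, 0.820, 0.714),
]


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


best_f1_epoch = max(rows, key=lambda row: row[2])[0]      # selection by validation F1
lowest_loss_epoch = min(rows, key=lambda row: row[1])[0]  # what loss-based selection would pick
max_f1_gap = max(abs(harmonic_f1(p, r) - f1) for _, _, f1, p, r in rows)

print(best_f1_epoch, lowest_loss_epoch)  # 4 3
```

Validation loss bottoms out at epoch 3 while F1 peaks at epoch 4, so selecting by F1 rather than by loss changes which checkpoint is kept.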

Test-set results (sentence-level F1 — the leaderboard metric)

Config	Lexical floor	LettuceDetect-large (zero-shot)	LookBackLens (in-domain)	This model	+ ensemble
combined	0.302	0.361	0.489	0.798	0.871
contradiction	0.231	0.315	0.377	0.763	0.877
missing_tool	0.218	0.330	0.406	0.966	0.993
overgeneration	0.319	0.335	0.508	0.697	0.824

For sentence-level F1 across all 9 evaluated methods, along with the full code, plots, and the ensemble, see the project repo and notebooks/improve_baselines.ipynb (this notebook is the only training source).
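This card does not spell out how sentence-level F1 is computed. One common reading, sketched below purely as an assumption, is to flag a sentence as hallucinated when any span overlaps it and then score the predicted flags against the gold flags. The function names are hypothetical, and sentence segmentation is left to the caller.

```python
def sentence_flags(sentence_ranges, spans):
    """Flag each sentence (start, end) that overlaps at least one (start, end) span."""
    return [
        any(s < sent_end and e > sent_start for s, e in spans)
        for sent_start, sent_end in sentence_ranges
    ]


def binary_f1(gold, pred):
    """F1 of the positive (hallucinated) class over paired boolean flags."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: three sentences, one gold span, two predicted spans.
sentences = [(0, 10), (10, 25), (25, 40)]
gold = sentence_flags(sentences, [(12, 18)])
pred = sentence_flags(sentences, [(14, 20), (30, 35)])
score = binary_f1(gold, pred)  # one true positive, one false positive
```

Under this reading a predicted span only needs to touch the right sentence to count, so sentence-level scores are more forgiving of boundary errors than token-level scores.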

Usage

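import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "ArsenyIvanov/toolace-halu-modernbert-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Illustrative placeholder inputs; replace with your own data.
tool_context_json = '{"temperature_c": 21}'
tool_names = ["get_weather"]
user_query = "What is the temperature in Paris?"
assistant_answer = "It is 25 degrees in Paris."

# Assemble the input in the template the model was trained on.
text = (
    "[CTX] " + tool_context_json
    + " [TOOLS] " + ", ".join(tool_names)
    + " [Q] " + user_query
    + " [A] " + assistant_answer
)
enc = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=2048, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()
labels = [model.config.id2label[i] for i in pred_ids]
offsets = enc["offset_mapping"][0].tolist()  # per-token (start, end) character offsets into `text`
# pair labels with offsets → spans inside `assistant_answer`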

Decoding: collapse contiguous runs of the same non-O label, map back to the character offsets of assistant_answer. See improve_baselines.ipynb §3 for the reference predict_to_spans() function.
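The decoding step can be sketched as follows. This is not the reference predict_to_spans() from the notebook; it is one reading of the description, assuming that adjacent tokens of the same hallucination type are merged regardless of their B-/I- prefix and that `answer_start` is the character offset of the answer inside the templated input. The name `labels_to_spans` is hypothetical.

```python
def labels_to_spans(labels, offsets, answer_start):
    """Collapse runs of same-type labels into (type, start, end) spans.

    `labels` are BIO strings such as "O" or "B-contradiction"; `offsets` are
    per-token (start, end) character offsets into the full input text. Returned
    offsets are relative to the start of the assistant answer.
    """
    spans = []
    current = None  # [type, start, end] of the span being built
    for label, (start, end) in zip(labels, offsets):
        outside_answer = end <= answer_start  # context tokens and (0, 0) special tokens
        if label == "O" or outside_answer:
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        kind = label.split("-", 1)[1]  # drop the B-/I- prefix
        rel_start = max(start - answer_start, 0)
        rel_end = end - answer_start
        if current is not None and current[0] == kind:
            current[2] = rel_end  # extend the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [kind, rel_start, rel_end]
    if current is not None:
        spans.append(tuple(current))
    return spans
```

Each returned tuple can be sliced straight out of the answer, for example `assistant_answer[start:end]`, to show the flagged text next to its hallucination type.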

Hallucination types

Label	What it captures
`contradiction`	grounded value replaced by a plausible-but-wrong alternative
`missing_tool`	offers an action that requires a tool not in the available list
`overgeneration`	inserted sentence with claims not supported by the tool output

Limitations

Synthetic corruptions only: the dataset was constructed by injecting hallucinations into clean ToolACE responses, so it contains no naturally occurring cascading errors.
The context is JSON-serialized tool output, not natural language, so predictions on prose-style RAG contexts may degrade.
Each record carries a single corruption type (RAGTruth strict schema), so records with multiple co-occurring hallucinations are partially under-annotated.

License

Apache 2.0 (matches the base model and the training dataset).

Model size: 0.4B params (Safetensors, tensor type F32)

Paper

ToolACE: Winning the Points of LLM Function Calling (arXiv:2409.00920, published Sep 2, 2024)