Token Classification
Safetensors
modernbert
hallucination-detection
rag
span-detection
code

LettuceCode mascot

LettuceDetect goes agentic — span-level hallucination detection across RAG, code, and tool output.

lettucedect-v2-mmbert-base: Encoder Hallucination Span Detection

TL;DR — results

A fast encoder (binary token classifier) for span-level hallucination detection across code, tool output, and prose, single forward pass:

  • Unified test set (10,698): span-F1 0.642, example-F1 0.869, IoU 0.671.
  • Code-agent answers: span-F1 0.508 — close to the generative 2B (0.602) at a fraction of the size, and far above the off-the-shelf detectors and LLM judges we tested (HHEM / Lynx / Granite / MiniCheck / Nemotron-550B / gpt-oss-120b), which sit near chance.
  • Prose: RAGTruth example-F1 74.3 (above GPT-4 63.4 and Luna 65.4); multilingual PsiloQA across 14 languages.
  • Emits binary spans; for typed spans (category + subcategory) use lettucedect-v2-qwen-2b or the taxonomy-head cascade.

The small, fast option when throughput/cost matter and binary spans suffice.

Overview

lettucedect-v2-mmbert-base is a lightweight encoder hallucination detector for Retrieval-Augmented Generation (RAG) and coding-agent settings. It is a token-level classifier built on mmBERT-base: given the context, question, and answer, it labels each answer token as supported (0) or unsupported (1), yielding the spans of the answer not grounded in the context — across prose and code, in many languages.

It is the encoder counterpart to the generative lettucedect-v2-qwen-2b: much smaller and faster, single forward pass, no decoding. It returns binary spans (no category/subtype); for typed spans (category + subcategory) use the generative model, or pair this detector with the taxonomy head as a typing cascade.

Model Details

  • Base model: jhu-clsp/mmBERT-base (ModernBERT-family multilingual encoder)
  • Task: token classification → unsupported-answer spans
  • Input: [CLS] context [SEP] question [SEP] answer [SEP]; context/question tokens masked in the loss, answer tokens labeled 0/1
  • Training: binary token classification on the unified code + prose benchmark
  • Context: long-context (handles full request + retrieved context + answer)
  • Languages: English plus more (multilingual via the PsiloQA portion)

Usage

from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(method="transformer", model_path="KRLabsOrg/lettucedect-v2-mmbert-base")
spans = detector.predict(context=[context], question=question, answer=answer, output_format="spans")
# [{"start": ..., "end": ..., "text": "...", "confidence": ...}]

Plain transformers (no lettucedetect)

It's a standard token classifier — tokenize (context, answer) as a pair and read the labels over the answer segment (1 = unsupported):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "KRLabsOrg/lettucedect-v2-mmbert-base"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id).eval()

context = "France is in Western Europe. Its capital Paris had about 2.1 million people in 2019."
answer = "The capital of France is Paris, with a population of about 4.5 million people."

enc = tok(context, answer, truncation="only_first", max_length=4096,
          return_offsets_mapping=True, return_tensors="pt")
with torch.no_grad():
    preds = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask).logits.argmax(-1)[0]

seq_ids, offsets, spans, cur = enc.sequence_ids(0), enc["offset_mapping"][0].tolist(), [], None
for i, (sid, (a, b)) in enumerate(zip(seq_ids, offsets)):
    if sid != 1 or a == b:                  # keep only answer-segment, non-special tokens
        continue
    if preds[i].item() == 1:                # unsupported
        cur = [a, b] if cur is None else [cur[0], b]
    elif cur:
        spans.append(cur); cur = None
if cur:
    spans.append(cur)
print([{"text": answer[s:e], "start": s, "end": e} for s, e in spans])
# -> [{'text': '4.5 million', 'start': 59, 'end': 70}]

For category/subcategory typing, use the lettucedetect cascade (taxonomy_head=...) or the generative lettucedect-v2-qwen-2b.

Performance

Char-level span metrics on the unified test set (10,698 samples), per source. These are the detector's span numbers (typing is not produced by this model).

dataset n span-F1 span-P span-R example-F1 IoU
ALL 10698 0.642 0.684 0.605 0.869 0.671
acl 440 0.579 0.705 0.492 0.873 0.637
code-agent 2015 0.508 0.619 0.430 0.770 0.581
readme 641 0.751 0.789 0.716 0.900 0.768
tool-output 617 0.588 0.736 0.490 0.763 0.645
wikipedia 1388 0.708 0.741 0.678 0.917 0.768
psiloqa (multi-lang) 2897 0.714 0.696 0.733 0.943 0.627
ragtruth 2700 0.528 0.668 0.437 0.743 0.724

The generative lettucedect-v2-qwen-2b scores higher on span-F1 (ALL 0.689 vs 0.642) and adds typing, but this encoder is far smaller and faster — a strong choice when throughput or cost matters and binary spans suffice.

Code-agent: vs other detectors

detector span-F1 example-F1
lettucedect-v2-qwen-2b (generative 2B) 0.602 0.835
lettucedect-v2-lfm-8b (generative 8B) 0.507 0.811
lettucedect-v2-mmbert-base (this) 0.508 0.770
Nemotron-3-Ultra-550B (LLM judge, task-aware) 0.216 0.700
gpt-oss-120b (LLM judge, task-aware) 0.212 0.691
HHEM-2.1 / Lynx-8B / Granite-Guardian / MiniCheck ≈ chance (BAcc ~0.50)

This base encoder matches the 8B generative model on code-agent span-F1 and crushes the off-the-shelf detectors and large LLM judges, which over-flag generated code.

Prose benchmarks

RAGTruth (official test, example-level F1): this multilingual+code base encoder scores 74.3 — below the RAGTruth-specialized LettuceDetect-large v1 (79.2) and the generative lettucedect-v2-qwen-2b (81.8), but above prompt-based GPT-4 (63.4) and Luna (65.4). It trades a little RAGTruth-specific accuracy for unified code+tool+multilingual coverage at base size; lettucedect-v2-mmbert-large is stronger.

PsiloQA (14 languages): span-F1 0.714, IoU 0.627 — competitive multilingual span detection from a base encoder (the PsiloQA paper's strongest LLM judge reaches IoU ~0.40 on English; their fine-tuned encoder, trained on PsiloQA only, ~0.71).

Typed spans (optional cascade)

For category + subcategory typing with an encoder pipeline, run this detector to find spans, then type each span with the label-conditioned taxonomy head (see scripts/evaluate_taxonomy_cascade.py). The cascade trails the generative model on typing (typed-F1 0.461 vs 0.585), so prefer lettucedect-v2-qwen-2b when typed output is the goal.

Citing

@article{Kovacs2025LettuceDetect,
  title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
  author={Kovács, Ádám and Recski, Gábor},
  journal={arXiv preprint arXiv:2502.17125},
  year={2025}
}

A dedicated paper for the unified code+prose benchmark and the v2 models is in preparation; this card will be updated with its citation on release.

Downloads last month
17
Safetensors
Model size
0.3B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for KRLabsOrg/lettucedect-v2-mmbert-base

Finetuned
(108)
this model

Datasets used to train KRLabsOrg/lettucedect-v2-mmbert-base

Paper for KRLabsOrg/lettucedect-v2-mmbert-base