# ACL-Verbatim ModernBERT Highlighter

A query-conditioned token classifier that highlights supporting evidence spans
in scientific paper chunks. Fine-tuned from
Alibaba-NLP/gte-reranker-modernbert-base
on silver spans from
KRLabsOrg/acl-verbatim-spans.
Input: (question, context). Output: character spans in the context that
support the answer, with confidence scores.

The model uses the full 8192-token ModernBERT context, so long paper chunks
are handled without aggressive truncation. At 150M parameters, it matches the
word-level F1 of 120B-parameter LLMs on this benchmark while running in
~40 ms per (question, chunk) on GPU.
## Quick Start

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "KRLabsOrg/acl-verbatim-modernbert",
    trust_remote_code=True,
)

question = "What is ModernBERT?"
context = (
    "ModernBERT is a long-context encoder for NLP. "
    "It supports sequences up to 8192 tokens. "
    "Unlike earlier BERT variants, it uses rotary position embeddings."
)

result = model.process(
    question=question,
    context=context,
    threshold=0.2,
    return_sentence_metrics=True,
)

for span in result["spans"]:
    print(f"[{span['score']:.2f}] {span['text']}")
```
Example output:

```text
[0.93] ModernBERT is a long-context encoder for NLP.
[0.87] It supports sequences up to 8192 tokens.
```
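
Because `return_sentence_metrics=True` was passed, `result["sentences"]` also
holds a per-sentence mean evidence score (see the return shape below):

```python
# Per-sentence mean evidence scores (only present when
# return_sentence_metrics=True was passed to .process()).
for sent in result["sentences"]:
    print(f"[{sent['score']:.2f}] {sent['text']}")
```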
## Parameters

| arg | default | notes |
|---|---|---|
| `question` | — | Query string |
| `context` | — | Passage to search for supporting spans |
| `threshold` | 0.2 | Probability cutoff for marking a token as evidence. Use 0.2 for balanced F1, 0.5 for high precision |
| `max_length` | 8192 | Max tokens per window (ModernBERT supports 8192) |
| `doc_stride` | 256 | Overlap between windows for long contexts |
| `min_span_chars` | 10 | Drop predicted spans shorter than this many characters |
| `merge_gap_chars` | 20 | Merge adjacent predicted spans separated by ≤ this many characters |
| `return_sentence_metrics` | False | Also return per-sentence mean evidence score |
`min_span_chars` and `merge_gap_chars` together clean up token-level
fragmentation: without them, binary token labels often produce a "shotgun" of
3–10-character pseudo-spans that hurt span-level metrics. The defaults are what
we use in our evaluation; a sketch of the merge-and-filter step is shown below.
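
A minimal sketch of that merge-and-filter step on `(start, end, score)` tuples
(illustrative only; the packaged `.process()` helper applies the same idea
internally, though its exact score aggregation may differ):

```python
def merge_and_filter(spans, min_span_chars=10, merge_gap_chars=20):
    """Merge near-adjacent (start, end, score) spans, then drop tiny ones."""
    merged = []
    for start, end, score in sorted(spans):
        if merged and start - merged[-1][1] <= merge_gap_chars:
            prev_start, prev_end, prev_score = merged[-1]
            # Extend the previous span; keep the higher of the two scores.
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    # "Drop predicted spans shorter than min_span_chars characters."
    return [s for s in merged if s[1] - s[0] >= min_span_chars]
```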
## Return shape

```python
{
    "spans": [
        {"start": int, "end": int, "text": str, "score": float},
        ...
    ],
    "sentences": [  # only when return_sentence_metrics=True
        {"start": int, "end": int, "text": str, "score": float},
        ...
    ],
}
```
Spans are character offsets into the input `context`. They are merged across
sliding windows, so callers do not need to deduplicate.
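
Because spans index directly into the original string and are already merged,
rendering highlights is a slicing exercise. A small hypothetical helper (not
part of the package) that wraps each span in `**` markers:

```python
def render_highlights(context, spans, marker="**"):
    """Wrap each predicted evidence span in markers, e.g. for markdown output."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        out.append(context[cursor:span["start"]])
        out.append(f"{marker}{context[span['start']:span['end']]}{marker}")
        cursor = span["end"]
    out.append(context[cursor:])
    return "".join(out)

print(render_highlights(context, result["spans"]))
```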
## Raw Inference

If you prefer to skip the `.process()` helper and do the post-processing
yourself:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("KRLabsOrg/acl-verbatim-modernbert")
model = AutoModelForTokenClassification.from_pretrained(
    "KRLabsOrg/acl-verbatim-modernbert"
)

# question and context as defined in the Quick Start above
enc = tokenizer(
    question,
    context,
    return_offsets_mapping=True,
    max_length=8192,
    truncation="only_second",
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(
        input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]
    ).logits
labels = logits.argmax(dim=-1)
```
Label 0 is "outside", label 1 is "evidence" (binary scheme).
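
To turn token labels back into character spans, walk the offset mapping over
the context segment (`sequence_ids == 1`) and group consecutive evidence
tokens. A minimal sketch using token probabilities, so the same cutoff as
`.process(threshold=0.2)` can be applied; the packaged helper additionally
handles sliding windows and the merge/min-length post-processing described
above:

```python
probs = logits.softmax(dim=-1)[0, :, 1]      # P(evidence) per token
offsets = enc["offset_mapping"][0].tolist()  # (char_start, char_end) per token
seq_ids = enc.sequence_ids(0)                # 0 = question, 1 = context, None = special

spans, current = [], None
for prob, (start, end), sid in zip(probs.tolist(), offsets, seq_ids):
    if sid == 1 and prob >= 0.2:
        if current is not None and start <= current[1] + 1:
            current[1] = max(current[1], end)  # extend across adjacent tokens
        else:
            current = [start, end]
            spans.append(current)
    else:
        current = None                         # close the open span

for start, end in spans:
    print(context[start:end])
```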
## Training

| item | value |
|---|---|
| base model | Alibaba-NLP/gte-reranker-modernbert-base |
| dataset | KRLabsOrg/acl-verbatim-spans (encoder config) |
| label scheme | binary (0 outside, 1 evidence) |
| max_length | 8192 |
| doc_stride | 256 |
| batch size | 8 |
| learning rate | 2e-5 |
| epochs | 5 |
| best checkpoint | silver-dev F1 = 0.642 at epoch 3.21 |
We started from Alibaba-NLP/gte-reranker-modernbert-base rather than
vanilla answerdotai/ModernBERT-base because the reranker backbone has
already been post-trained on query/passage relevance; that semantic prior
gives a large head start on query-conditioned span extraction. In our
ablations, this backbone swap alone improved gold word-F1 from 0.407 to
0.513 at argmax (and from 0.449 to 0.563 with threshold tuning and
post-processing).
Reproduce with:

```bash
python acl_verbatim/span_training/train_token_cls.py \
  --hf-dataset KRLabsOrg/acl-verbatim-spans \
  --hf-config encoder \
  --train-split train \
  --eval-split validation \
  --model Alibaba-NLP/gte-reranker-modernbert-base \
  --output-dir runs/models/acl-verbatim-modernbert \
  --batch-size 8 \
  --lr 2e-5 \
  --epochs 5 \
  --label-scheme binary
```
## Evaluation

Scored on the canonical/test split of KRLabsOrg/acl-verbatim-spans
(20 queries × 5 retrieved chunks, 47 relevant rows, 78 gold spans) with the
shared span metrics in `acl_verbatim.eval.span_metrics`.
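
Metric names follow the repo's `acl_verbatim.eval.span_metrics`; as we read
it, word-F1 (micro) pools word-level true/false positives over all rows,
counting a word as predicted (or gold) if any predicted (or gold) span
overlaps it. A minimal sketch under that reading (not the repo's exact
implementation):

```python
import re

def word_indices(text, spans):
    """Indices of whitespace-delimited words overlapping any (start, end) span."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    covered = set()
    for s, e in spans:
        covered |= {i for i, (ws, we) in enumerate(words) if ws < e and we > s}
    return covered

def micro_word_f1(rows):
    """rows: iterable of (text, pred_spans, gold_spans); returns micro P/R/F1."""
    tp = fp = fn = 0
    for text, pred, gold in rows:
        p, g = word_indices(text, pred), word_indices(text, gold)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```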
### Headline numbers (balanced config: threshold=0.2 + merge)

| metric | value |
|---|---|
| word-F1 (micro) | 0.563 |
| word precision | 0.738 |
| word recall | 0.454 |
| span F1 @ IoU 0.3 | 0.473 |
| span F1 @ IoU 0.5 | 0.427 |
| containment F1 @ 0.5 | 0.527 |
| containment F1 @ 0.8 | 0.343 |
| containment F1 @ 1.0 | 0.297 |
| gold-coverage recall @ 0.5 | 0.423 |
| gold-coverage recall @ 0.8 | 0.372 |
| recall @ any-overlap | 0.500 |
| over-prediction ratio | 0.679 |
| mean latency (GPU) | ~40 ms |
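
A rough way to check the latency figure on your own hardware (illustrative
only; exact numbers depend on the device, and CUDA timings without explicit
synchronization are approximate):

```python
import time

model.process(question=question, context=context)  # warm-up / lazy init

n_runs = 20
t0 = time.perf_counter()
for _ in range(n_runs):
    model.process(question=question, context=context)
elapsed_ms = (time.perf_counter() - t0) / n_runs * 1000
print(f"~{elapsed_ms:.1f} ms per (question, chunk)")
```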
### How this compares

On the same benchmark and harness:
| system | word-F1 | IoU F1 @ 0.5 | any-overlap R |
|---|---|---|---|
| acl-verbatim-modernbert (this model) | 0.563 | 0.427 | 0.500 |
| nemotron-120b-a12b | 0.561 | 0.437 | 0.654 |
| nemotron-120b-paragraph | 0.552 | 0.459 | 0.577 |
| qwen-3.6-paragraph (silver teacher) | 0.544 | 0.486 | 0.692 |
| mistral-small-2603 | 0.519 | 0.201 | 0.692 |
| glm-5 | 0.495 | 0.250 | 0.744 |
| qwen-3.6-default | 0.494 | 0.242 | 0.705 |
| provence-reranker-pruner | 0.480 | 0.276 | 0.718 |
| zilliz semantic-highlight | 0.217 | 0.088 | 0.321 |
At 150M parameters, this model matches the word-F1 of the 120B nemotron
(0.563 vs 0.561) and exceeds its silver teacher (0.544) by 1.9 points. LLMs
retain an advantage on any-overlap recall: they find relevant passages in
more chunks. Where the student does fire, however, it is competitive on
token coverage.
### Threshold / post-processing ablation

| config | word-F1 | IoU F1 @ 0.5 | over-prediction |
|---|---|---|---|
| argmax (no merge) | 0.513 | 0.250 | 1.564 |
| argmax + merge | 0.512 | 0.357 | 0.654 |
| threshold 0.3 (no merge) | 0.539 | 0.250 | 1.462 |
| threshold 0.2 + merge | 0.563 | 0.427 | 0.679 |
Lower thresholds boost recall; span merging + min-length filtering cleans up
fragmentation without hurting F1. The `threshold=0.2` + merge configuration
is the default in `model.process()`.
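
For a high-precision setup (per the parameter table, `threshold=0.5` trades
recall for precision), keep the default merge settings and raise the cutoff:

```python
# High-precision configuration: fewer, more confident spans.
result = model.process(
    question=question,
    context=context,
    threshold=0.5,
)
```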
See the acl-verbatim repo for
the full benchmark harness, LLM extractor scripts, and qualitative analysis.
## Intended Use

- Query-conditioned evidence highlighting over scientific text
- Re-ranking or filtering of retrieval outputs for extractive QA (see the sketch after this list)
- Dataset annotation assistance
- Fast local alternative to LLM extractors for evidence selection
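
A sketch of the retrieval-filtering use case, with hypothetical chunk strings
standing in for real retriever output:

```python
# Hypothetical retriever output: candidate chunks for one query.
retrieved_chunks = [
    "ModernBERT is a long-context encoder supporting 8192-token sequences.",
    "We thank the anonymous reviewers for their helpful comments.",
]

kept = []
for chunk in retrieved_chunks:
    out = model.process(question=question, context=chunk, threshold=0.2)
    if out["spans"]:  # evidence found -> keep, scored by the strongest span
        kept.append((max(s["score"] for s in out["spans"]), chunk))

for score, chunk in sorted(kept, reverse=True):
    print(f"[{score:.2f}] {chunk[:60]}")
```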
## Limitations

- Trained on ACL Anthology markdown; transfer to other scientific domains (biomedical, legal, patents) is not evaluated.
- Silver supervision inherits noise from the LLM teacher and the retriever. Recall in particular reflects teacher behaviour: the model rarely extracts a passage the teacher would have skipped.
- The gold benchmark is small (20 queries, 47 relevant chunks, 78 gold spans) and single-annotator; confidence intervals on the headline numbers are wide.
- Tables and figures are represented through their caption text; the model has no structural awareness of tabular data.
- Any-overlap recall (0.500) lags frontier LLMs (0.65–0.74), meaning the
model sometimes predicts nothing on chunks that contain relevant evidence.
For high-recall applications, lower `threshold` further or combine with an
LLM fallback.
## Citation

TODO