ModernBERT-base Span-Head — Funding Statement Extraction

A custom span-extraction head on top of answerdotai/ModernBERT-base. Given a chunk of an academic paper (up to 8,192 tokens), it predicts the start and end token positions of a funding statement, plus a "no-answer" probability for documents with no funding statement.

This is the rough-extraction stage of a two-stage cascade:

  1. Stage 1 (this model): ModernBERT-base + span head — finds the rough span (loose-span F1 ≈ 0.96 at the 0.85 fuzzy-match threshold on the test set).
  2. Stage 2 (separate): cometadata/funding-cleaning-qwen3-4b-lora — cleans the rough span into the canonical, normalized funding statement (strips LaTeX markers, joins paragraph breaks, etc.).

Use this model alone if you only need approximate localization; chain with the cleanup LoRA if you need the cleaned canonical text.

Architecture

A custom SpanHead module (included in modeling.py) adds start, end, and no-answer heads on top of the ModernBERT encoder:

import torch
import torch.nn as nn
from transformers import AutoModel


class SpanHead(nn.Module):
    """ModernBERT encoder + start/end/no-answer heads."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        h = self.encoder.config.hidden_size  # 768
        self.start_head = nn.Linear(h, 1)
        self.end_head = nn.Linear(h, 1)
        self.no_answer_head = nn.Linear(h, 1)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(out.last_hidden_state)
        start_logits = self.start_head(hidden).squeeze(-1)
        end_logits = self.end_head(hidden).squeeze(-1)
        # Mean-pool for no-answer
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        no_answer = self.no_answer_head(pooled).squeeze(-1)
        return start_logits, end_logits, no_answer

Use

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from modeling import SpanHead  # bundled in this repo

REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = SpanHead("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the
# acknowledgments-containing region). For long papers, run the model on
# sliding 8192-tok windows (stride 4096) and pick the chunk with the lowest
# no-answer probability.

enc = tokenizer(chunk_text, return_offsets_mapping=True,
                 add_special_tokens=False, truncation=True, max_length=8192)
ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
attn = torch.ones_like(ids)

with torch.no_grad():
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        start_logits, end_logits, no_answer = model(ids, attn)

start_logits = start_logits.squeeze(0).float().cpu()
end_logits = end_logits.squeeze(0).float().cpu()
no_answer_prob = torch.sigmoid(no_answer).item()

if no_answer_prob >= 0.5:
    pred_span = ""  # this chunk has no funding statement
else:
    start = int(start_logits.argmax())
    # Constrain end to be after start and within ~300 tokens
    end_window = end_logits[start:start + 300]
    end = start + int(end_window.argmax())
    offsets = enc["offset_mapping"]
    char_s = offsets[start][0]
    char_e = offsets[end][1]
    pred_span = chunk_text[char_s:char_e].strip()
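
A minimal sketch of the sliding-window loop described in the comment above, reusing tokenizer, model, and device from the snippet (extract_from_long_doc and its return shape are illustrative, not part of this repo):

def extract_from_long_doc(doc_text, max_tok=8192, stride=4096, threshold=0.5):
    # Tokenize once with char offsets; slide token windows over the document.
    enc = tokenizer(doc_text, return_offsets_mapping=True, add_special_tokens=False)
    ids_all, offsets = enc["input_ids"], enc["offset_mapping"]
    best_prob, best_span = 1.0, ""
    for w0 in range(0, max(len(ids_all) - max_tok, 0) + 1, stride):
        ids = torch.tensor(ids_all[w0:w0 + max_tok]).unsqueeze(0).to(device)
        attn = torch.ones_like(ids)
        with torch.no_grad():
            start_logits, end_logits, no_answer = model(ids, attn)
        prob = torch.sigmoid(no_answer).item()
        if prob < best_prob:  # keep the window most confident an answer exists
            s_log = start_logits.squeeze(0).float().cpu()
            e_log = end_logits.squeeze(0).float().cpu()
            start = int(s_log.argmax())
            end = start + int(e_log[start:start + 300].argmax())
            char_s, char_e = offsets[w0 + start][0], offsets[w0 + end][1]
            best_prob, best_span = prob, doc_text[char_s:char_e].strip()
    return ("" if best_prob >= threshold else best_span), best_prob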

Training data

Built from the 2,384 training rows of cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test.

For each positive doc (1,416 rows):

  • Tokenize vlm_markdown with the ModernBERT tokenizer.
  • Locate the gold funding statement in vlm_markdown via verbatim substring match, or via rapidfuzz.partial_ratio_alignment when it is not verbatim; convert the char-span to a token-span (see the sketch after this list).
  • Pick the 8,192-token sliding window (stride 4,096) that contains the gold span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
  • Training labels: start_tok and end_tok indices within the chunk; no_answer = 0.
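
A hedged sketch of the alignment step (gold_token_span is a hypothetical helper; min_score=70 mirrors the 0.7 alignment threshold noted below, on rapidfuzz's 0-100 scale):

from rapidfuzz import fuzz

def gold_token_span(markdown, gold, tokenizer, min_score=70):
    # Verbatim substring match first; fuzzy alignment as the fallback.
    char_s = markdown.find(gold)
    if char_s >= 0:
        char_e = char_s + len(gold)
    else:
        aln = fuzz.partial_ratio_alignment(gold, markdown)
        if aln.score < min_score:
            return None  # dropped, as for ~5% of positives
        char_s, char_e = aln.dest_start, aln.dest_end
    offsets = tokenizer(markdown, return_offsets_mapping=True,
                        add_special_tokens=False)["offset_mapping"]
    start_tok = next(i for i, (s, e) in enumerate(offsets) if e > char_s)
    end_tok = max(i for i, (s, e) in enumerate(offsets) if s < char_e)
    return start_tok, end_tok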

For each negative doc (968 rows):

  • Use the last 8,192-token chunk of the doc (since funding statements, when they exist, are typically near the end).
  • Training labels: start_tok = end_tok = 0; no_answer = 1.

Roughly 5% of positive rows, where no fuzzy alignment ≥ 0.7 could be found, are dropped. Final training set: ~3,300 chunks.

Loss

loss = CE(start_logits[no_answer==0], gold_start)
     + CE(end_logits[no_answer==0], gold_end)
     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)

The start/end CE is masked out on negative chunks; the no-answer BCE is computed on all chunks. Padded positions in start_logits/end_logits are masked to -1e4 so they can't be argmax'd.
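
A sketch of this loss in PyTorch (span_loss is illustrative; start/end logits are [batch, seq], labels are [batch]):

import torch.nn.functional as F

def span_loss(start_logits, end_logits, no_answer_logit,
              gold_start, gold_end, no_answer_label):
    # Start/end logits are assumed already masked to -1e4 at padded positions.
    bce = F.binary_cross_entropy_with_logits(no_answer_logit,
                                             no_answer_label.float())
    pos = no_answer_label == 0  # span CE only on positive chunks
    if pos.any():
        ce = (F.cross_entropy(start_logits[pos], gold_start[pos])
              + F.cross_entropy(end_logits[pos], gold_end[pos]))
    else:
        ce = start_logits.new_zeros(())
    return ce + 1.0 * bce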

Hyperparameters

  • Base: answerdotai/ModernBERT-base (149M, 8,192-token context)
  • Optimizer: AdamW, lr 5e-5, weight decay 0.01
  • Schedule: linear warmup (30 steps) + cosine decay (setup sketched after this list)
  • Epochs: 4
  • Batch: 4 per device × 4 grad accum = 16 effective
  • Mixed precision: bfloat16
  • Max sequence: 8,192 tokens
  • Trained on 1× H100 80GB
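
A minimal setup matching these hyperparameters (a sketch; train_chunks is a hypothetical list of training examples):

from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
steps_per_epoch = len(train_chunks) // 16   # effective batch size 16
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=30,
    num_training_steps=steps_per_epoch * 4)  # 4 epochs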

Evaluation

On the 597-row test split of cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test. At inference, we ran this model on the top-2 chunks selected by a separate ModernBERT-base chunk classifier (binary funding-present, with a mean-pooled classification head) and kept the chunk with the lower no-answer probability.

Metric                                   Precision   Recall    F1       F0.5
Binary detection                         0.9887      0.9510    0.9694   0.9809
Strict span (token_sort_ratio ≥ 0.95)    0.7365      0.7084    0.7222   0.7307
Loose span (max-of-4 fuzz ≥ 0.85)        0.9745      0.9373    0.9556   0.9668
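
For reference, a sketch of the two span-match criteria (which four scorers make up "max-of-4 fuzz" is an assumption here, not something the dataset documents):

from rapidfuzz import fuzz

def strict_match(pred, gold):
    return fuzz.token_sort_ratio(pred, gold) >= 95

def loose_match(pred, gold):
    # "max-of-4" assumed to be these four rapidfuzz scorers.
    return max(fuzz.ratio(pred, gold),
               fuzz.partial_ratio(pred, gold),
               fuzz.token_sort_ratio(pred, gold),
               fuzz.token_set_ratio(pred, gold)) >= 85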

Hard ceiling note: ~28% of test gold statements are not verbatim substrings of any source representation in the dataset (the dataset's labels were normalized by frontier models — whitespace, LaTeX markers, paragraph joins). The 0.95 strict threshold is unforgiving of those normalizations even on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any single-stage extractive model. The loose-span F1 of 0.96 is closer to the practical extractive ceiling.

For higher strict F1, chain with cometadata/funding-cleaning-qwen3-4b-lora, which cleans the rough span into the canonical text.

Cascade pipeline

For long papers (> 8,192 tokens), use a chunk-classifier first to pick the chunk most likely to contain the funding statement:

# Pseudocode for the full cascade
chunks = sliding_windows(doc, max_tok=8192, stride=4096)
chunk_probs = [chunk_classifier(c) for c in chunks]
top_chunk = chunks[argmax(chunk_probs)]
rough_span = spanhead_model(top_chunk)        # this model
clean_span = cleanup_lora(rough_span, top_chunk)  # other model
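
A sketch of the one purely local helper in the pseudocode (token-level windows decoded back to text; the exact chunking in the original pipeline may differ):

def sliding_windows(doc_text, tokenizer, max_tok=8192, stride=4096):
    ids = tokenizer(doc_text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[s:s + max_tok])
            for s in range(0, max(len(ids) - max_tok, 0) + 1, stride)]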

A simple heuristic alternative to the chunk classifier: just use the last 8,192-token window of the document, since funding statements are usually near the end. This loses a few percentage points of recall on papers whose funding info appears mid-document.

Intended use

Extraction of the rough span containing a funding acknowledgment from arXiv paper text (or similar academic markdown). Designed to be the first stage of a two-stage cascade with the cleanup LoRA, but usable on its own if you only need approximate localization.

Not intended for: classification of funding sources, downstream funder/grant/scheme parsing, or extraction from non-paper text.

Limitations

  • Trained on arXiv-derived PDFs only; behavior on other paper sources is untested.
  • Outputs a rough span — for canonical, downstream-ready text, chain with the cleanup LoRA.
  • Will occasionally pick the wrong sibling sentence when an acknowledgments section contains multiple funding statements (e.g., separate grants acknowledged for each author); this is the dominant failure mode in the strict-F1 evaluation.

Citation / acknowledgement

Trained by Comet as part of an applied research cycle on the cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test dataset.
