---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: token-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- span-extraction
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---
# ModernBERT-base Span-Head — Funding Statement Extraction
A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a
chunk of an academic paper (up to 8,192 tokens), it predicts the start and end
token positions of a funding statement, plus a "no-answer" probability for
documents with no funding statement.
This is the **rough-extraction stage** of a two-stage cascade:
1. **Stage 1 (this model)**: ModernBERT-base + span head. Finds the rough
   span (loose-span F1 ≈ 0.96 at a fuzzy-match threshold of 0.85 on the test set).
2. **Stage 2 (separate)**: `cometadata/funding-cleaning-qwen3-4b-lora` —
cleans the rough span into the canonical, normalized funding statement
(strips LaTeX markers, joins paragraph breaks, etc.).
Use this model alone if you only need approximate localization; chain with the
cleanup LoRA if you need the cleaned canonical text.
## Architecture
The architecture is a custom `SpanHead` module (included in `modeling.py`):
```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SpanHead(nn.Module):
    """ModernBERT encoder + start/end/no-answer heads."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        h = self.encoder.config.hidden_size  # 768
        self.start_head = nn.Linear(h, 1)
        self.end_head = nn.Linear(h, 1)
        self.no_answer_head = nn.Linear(h, 1)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(out.last_hidden_state)
        # One logit per token for the span start and the span end
        start_logits = self.start_head(hidden).squeeze(-1)
        end_logits = self.end_head(hidden).squeeze(-1)
        # Mean-pool over non-padded tokens for the no-answer logit
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        no_answer = self.no_answer_head(pooled).squeeze(-1)
        return start_logits, end_logits, no_answer
```
## Use
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

from modeling import SpanHead  # modeling.py is bundled in this repo

REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
# Fall back to CPU when no GPU is available (slow for 8,192-token inputs)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = SpanHead("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# `chunk_text` must be defined by the caller: a ≤8192-token chunk of the
# paper (e.g., the acknowledgments-containing region). For long papers, run
# the model on sliding 8192-token windows (stride 4096) and pick the chunk
# with the lowest no-answer probability.
enc = tokenizer(chunk_text, return_offsets_mapping=True,
                add_special_tokens=False, truncation=True, max_length=8192)
ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
attn = torch.ones_like(ids)  # single unpadded sequence, so attend everywhere

with torch.no_grad():
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        start_logits, end_logits, no_answer = model(ids, attn)

start_logits = start_logits.squeeze(0).float().cpu()
end_logits = end_logits.squeeze(0).float().cpu()
no_answer_prob = torch.sigmoid(no_answer.float()).item()

if no_answer_prob >= 0.5:
    pred_span = ""  # this chunk has no funding statement
else:
    start = int(start_logits.argmax())
    # Constrain end to be at or after start and within ~300 tokens
    end_window = end_logits[start:start + 300]
    end = start + int(end_window.argmax())
    # Map token indices back to character offsets in the original text
    offsets = enc["offset_mapping"]
    char_s = offsets[start][0]
    char_e = offsets[end][1]
    pred_span = chunk_text[char_s:char_e].strip()
```
## Training data
Built from the 2,384 training rows of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
For each positive doc (1,416 rows):
- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Locate the gold funding statement in `vlm_markdown` via verbatim substring,
or via `rapidfuzz.partial_ratio_alignment` if not verbatim. Convert
char-span to token-span.
- Pick the 8,192-token sliding window (stride 4,096) that contains the gold
span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
- Training labels: `start_tok` and `end_tok` indices within the chunk;
`no_answer = 0`.
For each negative doc (968 rows):
- Use the last 8,192-token chunk of the doc (since funding statements, when
they exist, are typically near the end).
- Training labels: `start_tok = end_tok = 0`; `no_answer = 1`.
About 5% of positive rows, those for which no fuzzy alignment ≥ 0.7 could be
found, are dropped. Final training set: ~3,300 chunks.
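The char-span to token-span conversion and the window selection described above can be sketched as follows. This is a minimal illustration, not the training script: the helper names `char_span_to_token_span` and `pick_window` are hypothetical, and the sketch assumes `offsets` is the list of `(char_start, char_end)` pairs that a fast tokenizer returns with `return_offsets_mapping=True`.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], char_start: int, char_end: int
) -> Optional[Tuple[int, int]]:
    """Inclusive (start_tok, end_tok) of tokens overlapping [char_start, char_end)."""
    hits = [i for i, (s, e) in enumerate(offsets)
            if s < char_end and e > char_start]
    if not hits:
        return None  # no token overlaps the gold span
    return hits[0], hits[-1]


def pick_window(
    n_tokens: int, start_tok: int, end_tok: int,
    max_tok: int = 8192, stride: int = 4096,
) -> Optional[Tuple[int, int, int]]:
    """First sliding window that fully contains the span.

    Returns (window_offset, start_in_window, end_in_window), or None if
    no window contains the whole span.
    """
    if n_tokens <= max_tok:
        return 0, start_tok, end_tok  # whole doc is a single chunk
    for off in range(0, n_tokens, stride):
        if off <= start_tok and end_tok < off + max_tok:
            return off, start_tok - off, end_tok - off
        if off + max_tok >= n_tokens:
            break  # this window already reaches the end of the doc
    return None


# Toy example: "Funded by NSF grant 123." with the gold span "NSF grant 123"
offsets = [(0, 6), (7, 9), (10, 13), (14, 19), (20, 23), (23, 24)]
span = char_span_to_token_span(offsets, 10, 23)
window = pick_window(n_tokens=20000, start_tok=9000, end_tok=9100)
```

Returning the first containing window is an assumption; with a stride of half the window length, a span is often covered by two windows, and the source does not say which one was used.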
## Loss
```
loss = CE(start_logits[no_answer==0], gold_start)
     + CE(end_logits[no_answer==0], gold_end)
     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)
```
The start/end CE is masked out on negative chunks; the no-answer BCE is
computed on all chunks. Padded positions in `start_logits`/`end_logits` are
masked to `-1e4` so they can't be argmax'd.
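One way to write this loss in PyTorch is sketched below. The function name `span_loss` and its argument names are hypothetical, and the use of mean reduction for both CE and BCE is an assumption; the source specifies only the three terms, the masking of negatives, and the `-1e4` padding fill.

```python
import torch
import torch.nn.functional as F


def span_loss(start_logits, end_logits, no_answer_logit, attention_mask,
              gold_start, gold_end, no_answer_label, no_answer_weight=1.0):
    """Start/end cross-entropy on positive chunks plus no-answer BCE on all."""
    # Padded positions get a large negative logit so they cannot win the argmax
    pad = attention_mask == 0
    start_logits = start_logits.masked_fill(pad, -1e4)
    end_logits = end_logits.masked_fill(pad, -1e4)

    # No-answer BCE is computed on every chunk in the batch
    loss = no_answer_weight * F.binary_cross_entropy_with_logits(
        no_answer_logit, no_answer_label.float())

    # Start/end CE only on chunks that contain a funding statement
    pos = no_answer_label == 0
    if pos.any():
        loss = loss + F.cross_entropy(start_logits[pos], gold_start[pos])
        loss = loss + F.cross_entropy(end_logits[pos], gold_end[pos])
    return loss


# Toy batch: chunk 0 is positive, chunk 1 is negative; sequence length 5
logits = torch.zeros(2, 5)
mask = torch.ones(2, 5, dtype=torch.long)
labels = torch.tensor([0, 1])
loss = span_loss(logits, logits, torch.zeros(2), mask,
                 torch.tensor([1, 0]), torch.tensor([3, 0]), labels)
```

The `pos.any()` guard avoids a NaN from averaging over an empty selection when a batch contains only negative chunks.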
## Hyperparameters
- Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (30 steps) + cosine decay
- Epochs: 4
- Batch: 4 per device × 4 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
## Evaluation
On the 597-row test split of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
At inference we ran this model on the top-2 chunks selected by a separate
ModernBERT-base chunk classifier (binary funding-yes, mean-pooled
classification head) and kept the chunk with the lower no-answer probability.
| Metric | Precision | Recall | F1 | F0.5 |
|---------------------------------------|-----------|--------|--------|--------|
| Binary detection | 0.9887 | 0.9510 | 0.9694 | 0.9809 |
| Strict span (`token_sort_ratio≥0.95`) | 0.7365 | 0.7084 | 0.7222 | 0.7307 |
| Loose span (max-of-4 fuzz ≥ 0.85) | 0.9745 | 0.9373 | 0.9556 | 0.9668 |
**Hard ceiling note**: ~28% of test gold statements are not verbatim
substrings of any source representation in the dataset (the dataset's labels
were normalized by frontier models — whitespace, LaTeX markers, paragraph
joins). The 0.95 strict threshold is unforgiving of those normalizations even
on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any
single-stage extractive model. The loose-span F1 of 0.96 is closer to the
practical extractive ceiling.
For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora`
which cleans the rough span into the canonical text.
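The F1 and F0.5 columns in the table follow from precision and recall through the standard F-beta formula. The sketch below (the helper name `f_beta` is ours) recomputes them from the rounded table values, so results agree with the table only to within rounding.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rows of the evaluation table as (precision, recall)
rows = {
    "binary": (0.9887, 0.9510),
    "strict": (0.7365, 0.7084),
    "loose": (0.9745, 0.9373),
}
scores = {name: (f_beta(p, r), f_beta(p, r, beta=0.5))
          for name, (p, r) in rows.items()}
```

F0.5 exceeds F1 in every row because precision is higher than recall throughout.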
## Cascade pipeline
For long papers (> 8,192 tokens), use a chunk-classifier first to pick the
chunk most likely to contain the funding statement:
```python
# Pseudocode for the full cascade
chunks = sliding_windows(doc, max_tok=8192, stride=4096)
chunk_probs = [chunk_classifier(c) for c in chunks]
top_chunk = chunks[argmax(chunk_probs)]
rough_span = spanhead_model(top_chunk) # this model
clean_span = cleanup_lora(rough_span, top_chunk) # other model
```
A simple heuristic alternative to the chunk classifier (also works fine):
just use the last 8,192-token window of the document — funding statements are
usually near the end. This loses a few percentage points of recall on papers
with funding info mid-document.
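The chunking step and the last-window heuristic can be sketched in plain Python over a list of token ids. The helper names (`sliding_windows`, `last_window`, `pick_chunk`) are hypothetical, and `no_answer_prob` stands in for any callable that scores a chunk, for example a wrapper around this model's no-answer head.

```python
from typing import Callable, List, Sequence, Tuple

Window = Tuple[int, List[int]]  # (token offset in the doc, token ids)


def sliding_windows(token_ids: Sequence[int], max_tok: int = 8192,
                    stride: int = 4096) -> List[Window]:
    """Overlapping windows covering the whole document."""
    if len(token_ids) <= max_tok:
        return [(0, list(token_ids))]
    windows = []
    for off in range(0, len(token_ids), stride):
        windows.append((off, list(token_ids[off:off + max_tok])))
        if off + max_tok >= len(token_ids):
            break  # the document's tail is already covered
    return windows


def last_window(token_ids: Sequence[int], max_tok: int = 8192) -> Window:
    """Heuristic alternative: just the final max_tok tokens."""
    off = max(0, len(token_ids) - max_tok)
    return off, list(token_ids[off:])


def pick_chunk(windows: List[Window],
               no_answer_prob: Callable[[List[int]], float]) -> Window:
    """Keep the window the scorer considers most likely to hold an answer."""
    return min(windows, key=lambda w: no_answer_prob(w[1]))


# Toy document of 10 tokens with a small window so the overlap is visible
doc = list(range(10))
wins = sliding_windows(doc, max_tok=4, stride=2)
tail = last_window(doc, max_tok=4)
best = pick_chunk(wins, lambda ids: 0.1 if 5 in ids else 0.9)
```

Keeping the token offset alongside each window makes it possible to map a predicted span back to positions in the full document.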
## Intended use
Extraction of the **rough span** containing a funding acknowledgment from
arXiv paper text (or similar academic markdown). Designed to be the first
stage of a two-stage cascade with the cleanup LoRA, but usable on its own if
you only need approximate localization.
Not intended for: classification of funding sources, downstream
funder/grant/scheme parsing, or extraction from non-paper text.
## Limitations
- Trained on arXiv-derived PDFs only; behavior on other paper sources is
untested.
- Outputs a rough span — for canonical, downstream-ready text, chain with the
cleanup LoRA.
- Will occasionally pick the wrong sibling sentence when an acknowledgments
section contains multiple funding statements (each person's own grants);
this is the dominant failure mode of the strict-F1 evaluation.
## Citation / acknowledgement
Trained as part of an applied research cycle on the
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
dataset by Comet.