---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: token-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- span-extraction
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---

# ModernBERT-base Span-Head — Funding Statement Extraction

A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a
chunk of an academic paper (up to 8,192 tokens), it predicts the start and end
token positions of a funding statement, plus a "no-answer" probability for
documents with no funding statement.

This is the **rough-extraction stage** of a two-stage cascade:

1. **Stage 1 (this model)**: ModernBERT-base + span head. Finds the rough
   span (loose-span F1 ≈ 0.96 at the 0.85 fuzzy-match threshold on the test set).
2. **Stage 2 (separate)**: `cometadata/funding-cleaning-qwen3-4b-lora` —
   cleans the rough span into the canonical, normalized funding statement
   (strips LaTeX markers, joins paragraph breaks, etc.).

Use this model alone if you only need approximate localization; chain with the
cleanup LoRA if you need the cleaned canonical text.

## Architecture

The architecture is a custom `SpanHead` module (included in `modeling.py`):

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SpanHead(nn.Module):
    """ModernBERT encoder + start/end/no-answer heads."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        h = self.encoder.config.hidden_size  # 768
        self.start_head = nn.Linear(h, 1)
        self.end_head = nn.Linear(h, 1)
        self.no_answer_head = nn.Linear(h, 1)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(out.last_hidden_state)
        start_logits = self.start_head(hidden).squeeze(-1)
        end_logits = self.end_head(hidden).squeeze(-1)
        # Mean-pool for no-answer
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        no_answer = self.no_answer_head(pooled).squeeze(-1)
        return start_logits, end_logits, no_answer
```

## Use

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from modeling import SpanHead  # bundled in this repo

REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = SpanHead("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the
# acknowledgments-containing region). For long papers, run the model on
# sliding 8192-tok windows (stride 4096) and pick the chunk with the lowest
# no-answer probability.

enc = tokenizer(chunk_text, return_offsets_mapping=True,
                 add_special_tokens=False, truncation=True, max_length=8192)
ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
attn = torch.ones_like(ids)

with torch.no_grad():
    with torch.amp.autocast(device_type=device, dtype=torch.bfloat16):
        start_logits, end_logits, no_answer = model(ids, attn)

start_logits = start_logits.squeeze(0).float().cpu()
end_logits = end_logits.squeeze(0).float().cpu()
no_answer_prob = torch.sigmoid(no_answer).item()

if no_answer_prob >= 0.5:
    pred_span = ""  # this chunk has no funding statement
else:
    start = int(start_logits.argmax())
    # Constrain the end to lie at or after the start, within a 300-token window
    end_window = end_logits[start:start + 300]
    end = start + int(end_window.argmax())
    offsets = enc["offset_mapping"]
    char_s = offsets[start][0]
    char_e = offsets[end][1]
    pred_span = chunk_text[char_s:char_e].strip()
```
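
The sliding-window step mentioned in the comments above can be sketched as follows. This is a minimal illustration, not the code used to train or evaluate the model: `sliding_windows` is a hypothetical helper, and it assumes the final window should always reach the last token of the document.

```python
def sliding_windows(token_ids, max_tok=8192, stride=4096):
    """Return (start_index, window) pairs that together cover every token."""
    if len(token_ids) <= max_tok:
        # Short documents fit in a single chunk.
        return [(0, list(token_ids))]
    windows = []
    start = 0
    while True:
        windows.append((start, list(token_ids[start:start + max_tok])))
        if start + max_tok >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


# Toy sizes so the overlap is easy to see; real use keeps the defaults.
for start, window in sliding_windows(list(range(20)), max_tok=8, stride=4):
    print(start, window[0], window[-1])
```

Each window can then be passed through the model as in the snippet above, keeping the window whose `no_answer_prob` is lowest. The returned `start_index` lets you map token positions in a window back to positions in the full document.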

## Training data

Built from the 2,384 training rows of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.

For each positive doc (1,416 rows):
- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Locate the gold funding statement in `vlm_markdown` via verbatim substring,
  or via `rapidfuzz.partial_ratio_alignment` if not verbatim. Convert
  char-span to token-span.
- Pick the 8,192-token sliding window (stride 4,096) that contains the gold
  span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
- Training labels: `start_tok` and `end_tok` indices within the chunk;
  `no_answer = 0`.

For each negative doc (968 rows):
- Use the last 8,192-token chunk of the doc (since funding statements, when
  they exist, are typically near the end).
- Training labels: `start_tok = end_tok = 0`; `no_answer = 1`.

About 5% of positive rows, those with no fuzzy alignment scoring ≥ 0.7, are
dropped. Final training set: ~3,300 chunks.
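
The char-span to token-span conversion described above might look roughly like this. It is a sketch under assumptions: `char_span_to_token_span` is a hypothetical helper, a whitespace split stands in for the ModernBERT tokenizer's `offset_mapping`, and only the verbatim-substring case is shown (the fuzzy fallback would supply `char_start` and `char_end` from the alignment instead).

```python
import re


def char_span_to_token_span(offsets, char_start, char_end):
    """Map a half-open character span to inclusive (first, last) token indices.

    `offsets` holds one (start, end) character pair per token, in the form a
    fast tokenizer returns with return_offsets_mapping=True.
    Returns None when no token overlaps the span.
    """
    overlapping = [
        i for i, (s, e) in enumerate(offsets)
        if e > s and s < char_end and e > char_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


text = "Thanks. Funded by NSF."
gold = "Funded by NSF."
# Stand-in for tokenizer offsets: one "token" per whitespace-separated word.
offsets = [m.span() for m in re.finditer(r"\S+", text)]

char_start = text.find(gold)  # verbatim substring match
token_span = char_span_to_token_span(offsets, char_start, char_start + len(gold))
print(token_span)  # (1, 3)
```

Taking every token that overlaps the character span, rather than only tokens fully inside it, keeps the label usable when the gold boundary falls in the middle of a token.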

## Loss

```
loss = CE(start_logits[no_answer==0], gold_start)
     + CE(end_logits[no_answer==0], gold_end)
     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)
```

The start/end CE is masked out on negative chunks; the no-answer BCE is
computed on all chunks. Padded positions in `start_logits`/`end_logits` are
masked to `-1e4` so they can't be argmax'd.
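
A PyTorch sketch of this loss is given below. The function name `span_loss`, the tensor layout (one chunk per batch row) and the mean reduction are assumptions made for illustration; only the three terms, the negative-chunk masking and the `-1e4` padding value come from the description above.

```python
import torch
import torch.nn.functional as F


def span_loss(start_logits, end_logits, no_answer_logit,
              gold_start, gold_end, no_answer_label, attention_mask):
    # Push padded positions to -1e4 so they can never win the softmax or argmax.
    pad = attention_mask == 0
    start_logits = start_logits.masked_fill(pad, -1e4)
    end_logits = end_logits.masked_fill(pad, -1e4)

    # The no-answer BCE term uses every chunk in the batch.
    loss = F.binary_cross_entropy_with_logits(no_answer_logit, no_answer_label.float())

    # Start/end cross-entropy only uses chunks that contain a funding statement.
    positive = no_answer_label == 0
    if positive.any():
        loss = loss + F.cross_entropy(start_logits[positive], gold_start[positive])
        loss = loss + F.cross_entropy(end_logits[positive], gold_end[positive])
    return loss


# Toy batch: chunk 0 is positive (span 1..3), chunk 1 is negative.
logits = torch.zeros(2, 5)
mask = torch.ones(2, 5, dtype=torch.long)
print(span_loss(logits, logits, torch.zeros(2),
                torch.tensor([1, 0]), torch.tensor([3, 0]),
                torch.tensor([0, 1]), mask))
```

Guarding on `positive.any()` avoids a NaN from taking the mean cross-entropy over an empty selection when a batch happens to contain only negative chunks.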

## Hyperparameters

- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (30 steps) + cosine decay
- Epochs: 4
- Batch: 4 per device × 4 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
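
For reference, the warmup-plus-cosine schedule listed above can be written as a small function. This is a sketch, not the training script: it assumes the cosine phase decays to zero at the final step, and `total_steps=800` is a hypothetical value chosen for the example since the card does not state the step count.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=30):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 15, 30, 415, 800):
    print(step, lr_at(step, total_steps=800))
```

The rate climbs linearly to the 5e-5 peak at step 30, is at half the peak midway through the decay phase, and reaches zero at the last step.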

## Evaluation

Evaluated on the 597-row test split of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
At inference, this model was run on the top-2 chunks selected by a separate
ModernBERT-base chunk classifier (binary funding-yes, mean-pooled
classification head), and the chunk with the lower no-answer probability was kept.

| Metric                                | Precision | Recall | F1     | F0.5   |
|---------------------------------------|-----------|--------|--------|--------|
| Binary detection                      | 0.9887    | 0.9510 | 0.9694 | 0.9809 |
| Strict span (`token_sort_ratio≥0.95`) | 0.7365    | 0.7084 | 0.7222 | 0.7307 |
| Loose span (max-of-4 fuzz ≥ 0.85)     | 0.9745    | 0.9373 | 0.9556 | 0.9668 |
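
The F1 and F0.5 columns follow from precision and recall through the standard F-beta formula. The helper below is a small illustration for readers who want to recompute them; values derived from the rounded precision and recall can differ from the table in the last digit.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


rows = {
    "Binary detection": (0.9887, 0.9510),
    "Strict span": (0.7365, 0.7084),
    "Loose span": (0.9745, 0.9373),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1={f_beta(p, r):.4f}  F0.5={f_beta(p, r, beta=0.5):.4f}")
```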

**Hard ceiling note**: ~28% of test gold statements are not verbatim
substrings of any source representation in the dataset (the dataset's labels
were normalized by frontier models — whitespace, LaTeX markers, paragraph
joins). The 0.95 strict threshold is unforgiving of those normalizations even
on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any
single-stage extractive model. The loose-span F1 of 0.96 is closer to the
practical extractive ceiling.

For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora`,
which cleans the rough span into the canonical text.

## Cascade pipeline

For long papers (> 8,192 tokens), use a chunk-classifier first to pick the
chunk most likely to contain the funding statement:

```python
# Pseudocode for the full cascade
chunks = sliding_windows(doc, max_tok=8192, stride=4096)
chunk_probs = [chunk_classifier(c) for c in chunks]
top_chunk = chunks[argmax(chunk_probs)]
rough_span = spanhead_model(top_chunk)        # this model
clean_span = cleanup_lora(rough_span, top_chunk)  # other model
```

A simple heuristic alternative to the chunk classifier (also works fine):
just use the last 8,192-token window of the document — funding statements are
usually near the end. This loses a few percentage points of recall on papers
with funding info mid-document.
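
The Evaluation section uses a top-2 variant of this selection: the chunk classifier shortlists two chunks, and the span model's no-answer probability breaks the tie. A minimal sketch of that rule follows; `pick_chunk` and its arguments are hypothetical names, and the probabilities are made-up numbers standing in for real model outputs.

```python
def pick_chunk(chunk_probs, no_answer_prob, k=2):
    """Return the index of the chunk to extract the span from.

    chunk_probs: classifier probability, per chunk, that it contains funding text.
    no_answer_prob: callable mapping a chunk index to this model's no-answer probability.
    """
    if not chunk_probs:
        raise ValueError("no chunks to choose from")
    # Shortlist the k chunks the classifier is most confident about.
    ranked = sorted(range(len(chunk_probs)), key=lambda i: chunk_probs[i], reverse=True)
    # Among those, keep the chunk the span model is most confident has an answer.
    return min(ranked[:k], key=no_answer_prob)


classifier_probs = [0.10, 0.90, 0.80]        # illustrative values only
span_no_answer = {0: 0.05, 1: 0.60, 2: 0.20}  # illustrative values only
print(pick_chunk(classifier_probs, span_no_answer.get))  # 2
```

Passing `no_answer_prob` as a callable means the span model only needs to run on the shortlisted chunks, not on every window of the document.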

## Intended use

Extraction of the **rough span** containing a funding acknowledgment from
arXiv paper text (or similar academic markdown). Designed to be the first
stage of a two-stage cascade with the cleanup LoRA, but usable on its own if
you only need approximate localization.

Not intended for: classification of funding sources, downstream
funder/grant/scheme parsing, or extraction from non-paper text.

## Limitations

- Trained on arXiv-derived PDFs only; behavior on other paper sources is
  untested.
- Outputs a rough span — for canonical, downstream-ready text, chain with the
  cleanup LoRA.
- Will occasionally pick the wrong sibling sentence when an acknowledgments
  section contains multiple funding statements (each person's own grants);
  this is the dominant failure mode of the strict-F1 evaluation.

## Citation / acknowledgement

Trained as part of an applied research cycle on the
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
dataset by Comet.