---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: text-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- chunk-classification
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---

# ModernBERT-base Chunk Classifier — Funding Statement Localization

A binary classifier on top of `answerdotai/ModernBERT-base` that scores a
single chunk of up to 8,192 tokens from an academic paper for the presence of
a funding statement. Used as **stage 1 of a three-stage funding-extraction
cascade** to narrow a long PDF down to the most likely chunks before running
the more expensive span-extraction and cleanup stages.

The full cascade:

1. **Stage 1 (this model)**: For each ≤8,192-token chunk of the paper,
   predict a scalar `P(this chunk contains a funding statement)`. Take top-K
   chunks above a threshold (we use top-2 above 0.4).
2. **Stage 2 — span head**:
   [`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead)
   — picks the exact start/end token within the top chunk.
3. **Stage 3 — cleanup LoRA**:
   [`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora)
   — strips LaTeX markers and normalizes whitespace in the extracted span.

You can use this model standalone if you only need to flag whether a chunk
(or doc) contains funding language at all (binary F1 0.97 on the test set).

## Architecture

The architecture is a custom `ChunkClassifier` module (included in
`modeling.py`):

```python
import torch.nn as nn
from transformers import AutoModel


class ChunkClassifier(nn.Module):
    """ModernBERT encoder + mean-pool + binary head."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pool over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)   # one logit per chunk
```
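
To make the masked mean pool in `forward` concrete, here is a dependency-free sketch on a toy sequence. The helper `masked_mean_pool` and the toy values are illustrative assumptions and are not part of `modeling.py`; it only mirrors the arithmetic of the pooling line above.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1; padding positions are ignored.

    hidden: list of token vectors (each a list of floats) for one sequence.
    mask:   list of 0/1 ints, same length as `hidden`.
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for i in range(dim):
            total[i] += vec[i] * m
    # Counterpart of `.clamp(min=1)`: avoid dividing by zero on an all-padding row.
    count = max(sum(mask), 1)
    return [t / count for t in total]


# Three real tokens followed by one padding token (hypothetical values).
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector does not affect the result, which is why the attention mask has to be passed alongside `input_ids` whenever chunks are padded in a batch.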

## Use

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from modeling import ChunkClassifier  # bundled in this repo

REPO = "cometadata/funding-chunk-classifier-modernbert-base"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# For a long paper, slide an 8192-token window with stride 4096.
def chunks_of(text, max_tok=8192, stride=4096):
    enc = tokenizer(text, add_special_tokens=False, truncation=False)
    ids = enc["input_ids"]
    if len(ids) <= max_tok:
        yield ids, 0, len(ids)
        return
    for st in range(0, len(ids), stride):
        en = min(st + max_tok, len(ids))
        yield ids[st:en], st, en
        if en == len(ids):
            break

# `paper_text` is the full text of one paper as a single string.
probs = []
for chunk_ids, st, en in chunks_of(paper_text):
    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
    attn = torch.ones_like(ids_t)
    with torch.no_grad():
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            logit = model(ids_t, attn).float()
    probs.append((torch.sigmoid(logit).item(), st, en))

# Top-K chunks above threshold
top_k = sorted(probs, key=lambda p: -p[0])[:2]
top_k = [p for p in top_k if p[0] >= 0.4]
# `top_k` is the list to hand off to the span-head model.
```

## Training data

Built from the 2,384 training rows of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.

For each train doc:
- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
- For each chunk, label `1` iff the gold funding statement (located via
  verbatim substring match or `rapidfuzz.fuzz.partial_ratio_alignment ≥ 0.7`)
  overlaps the chunk's character range by more than half of the statement's
  length, else `0`.
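
The windowing and labeling steps above can be sketched as follows. This is a minimal illustration and not the actual data-building script: `window_spans` and `label_chunk` are hypothetical helpers, both ranges are assumed to be expressed in the same unit, and "half its length" is read as half of the gold statement's length.

```python
def window_spans(n_tokens, max_tok=8192, stride=4096):
    """Return the (start, end) spans produced by the sliding window."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]
    spans = []
    for start in range(0, n_tokens, stride):
        end = min(start + max_tok, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
    return spans


def label_chunk(chunk_range, gold_range):
    """1 if more than half of the gold statement lies inside the chunk, else 0."""
    overlap = min(chunk_range[1], gold_range[1]) - max(chunk_range[0], gold_range[0])
    gold_len = gold_range[1] - gold_range[0]
    return int(gold_len > 0 and overlap > gold_len / 2)


# Hypothetical document of 20,000 units with a gold statement at [3900, 4200).
spans = window_spans(20_000)
labels = [label_chunk(span, (3900, 4200)) for span in spans]
print(spans)   # [(0, 8192), (4096, 12288), (8192, 16384), (12288, 20000)]
print(labels)  # [1, 0, 0, 0]
```

In this example the second window covers only 104 of the statement's 300 units, which is below the half-length cutoff, so it is labeled negative.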

Negative docs (no funding statement) contribute only negative chunks. Positive
docs contribute one positive chunk (the one containing the gold statement)
plus several negative chunks from the rest of the doc, so the negative class
is naturally dominant (roughly 8× more negatives than positives).

Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700
negative).

## Loss

Binary cross-entropy with `pos_weight = n_examples / n_positives` to
counteract the class imbalance:

```python
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
loss = loss_fn(logits, labels)
```
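
For intuition about the size of this weight, the sketch below plugs in the approximate counts from the training-data section (~21,000 chunks, ~2,300 positive) and evaluates the weighted loss for a single example by hand. The exact counts used in training are not stated here, so the numbers are assumptions; `weighted_bce` is a hypothetical helper that follows the documented `BCEWithLogitsLoss` formula.

```python
import math


def weighted_bce(logit, label, pos_weight):
    """Per-example BCE-with-logits loss where the positive term is scaled."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))


n_examples, n_positives = 21_000, 2_300  # approximate counts (assumption)
pos_weight = n_examples / n_positives
print(round(pos_weight, 2))  # 9.13

# At logit 0 the model is maximally uncertain; a missed positive costs
# `pos_weight` times as much as a missed negative.
loss_pos = weighted_bce(0.0, 1, pos_weight)
loss_neg = weighted_bce(0.0, 0, pos_weight)
print(round(loss_pos / loss_neg, 2))  # 9.13
```

With the same approximate counts, the negatives-to-positives ratio would be about 8.13, so this weighting leans slightly further toward the positive class.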

## Hyperparameters

- Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (20 steps) + cosine decay
- Epochs: 3
- Batch: 2 per device × 8 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
- Saved checkpoint: `pytorch_model.bin` is the state dict from the final
  epoch (epoch index 2, counting from 0)
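
The warmup-plus-cosine schedule listed above can be sketched as a plain function. This is an illustration under stated assumptions, not the training script: `lr_at` is a hypothetical helper, decay to zero is assumed, and the total step count is estimated from the approximate chunk count (~21,000) and the effective batch size of 16.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=20):
    """Learning rate at `step`: linear warmup, then cosine decay to 0 (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Rough estimate: ~21,000 chunks / 16 per optimizer step, for 3 epochs.
steps_per_epoch = math.ceil(21_000 / 16)
total_steps = 3 * steps_per_epoch

for step in (0, 10, 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With only 20 warmup steps out of several thousand, the schedule is almost entirely cosine decay from the peak rate of 5e-5.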

## Evaluation

On the 597-row test split of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`,
treated as a **per-document binary task** (does the doc contain any funding
statement?). Each candidate chunk is scored, and the maximum chunk probability
is used as the document-level score, thresholded at 0.5.

| Metric                       | Precision | Recall | F1     | F0.5   |
|------------------------------|-----------|--------|--------|--------|
| Doc-level funding detection  | 0.9831    | 0.9537 | 0.9682 | 0.9771 |

Confusion-matrix counts at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
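
The table values follow directly from these counts. The sketch below recomputes them and also shows the max-over-chunks aggregation used for the document-level score; `doc_score` and `precision_recall_fbeta` are hypothetical helper names, and returning 0.0 for a document with no chunks is an assumption.

```python
def doc_score(chunk_probs):
    """Document-level score: the maximum chunk probability (0.0 if no chunks)."""
    return max(chunk_probs) if chunk_probs else 0.0


def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# A doc is predicted positive if any chunk clears the threshold.
print(doc_score([0.03, 0.91, 0.12]) >= 0.5)  # True

p, r, f1 = precision_recall_fbeta(tp=350, fp=6, fn=17)
_, _, f05 = precision_recall_fbeta(tp=350, fp=6, fn=17, beta=0.5)
print(f"{p:.4f} {r:.4f} {f1:.4f} {f05:.4f}")  # 0.9831 0.9537 0.9682 0.9771
```

F0.5 weights precision more heavily than recall, which suits a filtering stage where false positives trigger unnecessary downstream extraction.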

**Chunk-recall caveat**: even when the doc-level prediction is correct, the
**top-1 chunk** contains the gold statement verbatim only ~68% of the time
(top-2 covers ~88%). This is why the downstream cascade uses **top-K=2**
chunks: it raises the chance that the gold-containing chunk is fed to the
span head.

## Intended use

Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and
stage-1 of the funding-extraction cascade. Useful when you want to skip
expensive span extraction on most papers (a sizable fraction of arXiv papers
have no funding statement).

Not intended for: extraction (it only classifies chunks; pair with the
span-head model for spans), classification of funding sources, or text
outside the academic-paper domain.

## Limitations

- Trained only on arXiv-derived PDFs; behavior on other paper sources is
  untested.
- Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use
  top-K ≥ 2 if you need recall.
- Mean-pooling over 8,192 tokens dilutes the signal from a short funding
  statement (median length ~272 characters), so the false-negative rate at a
  strict threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the
  span-head model's `no_answer` head to suppress empty chunks.

## Citation / acknowledgement

Trained as part of an applied research cycle on the
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
dataset by Comet.