---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: text-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- chunk-classification
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---
# ModernBERT-base Chunk Classifier — Funding Statement Localization
A binary classifier on top of `answerdotai/ModernBERT-base` that scores a
single chunk of an academic paper (up to 8,192 tokens) for the presence of a
funding statement. It serves as **stage 1 of a three-stage funding-extraction
cascade**, narrowing a long PDF down to the most likely chunks before the more
expensive span-extraction and cleanup stages run.

The full cascade:

1. **Stage 1 (this model)**: for each chunk of at most 8,192 tokens, predict
   a scalar `P(this chunk contains a funding statement)`, then keep the top-K
   chunks above a threshold (we use top-2 above 0.4).
2. **Stage 2, span head**
   ([`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead)):
   picks the exact start/end token within the top chunk.
3. **Stage 3, cleanup LoRA**
   ([`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora)):
   strips LaTeX markers and normalizes whitespace in the extracted span.

You can use this model standalone if you only need to flag whether a chunk
(or document) contains funding language at all (document-level F1 0.97 on the
test set).
## Architecture
The architecture is a custom `ChunkClassifier` module (included in
`modeling.py`):
```python
import torch.nn as nn
from transformers import AutoModel


class ChunkClassifier(nn.Module):
    """ModernBERT encoder + mean-pool + binary head."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pool over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)  # one logit per chunk
```
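
In equation form, the forward pass above is masked mean pooling followed by a single linear unit. The symbols are notation introduced here for illustration, not names from `modeling.py`: $h_t$ is the encoder's last hidden state at token $t$, $m_t \in \{0, 1\}$ is the attention mask, and $w$, $b$ are the head's weight and bias. The module itself returns the logit $z$; the sigmoid is applied by the caller.

```latex
\bar{h} = \frac{\sum_{t=1}^{T} m_t \, h_t}{\max\!\left(\sum_{t=1}^{T} m_t,\; 1\right)},
\qquad
z = w^{\top} \bar{h} + b,
\qquad
P(\text{funding} \mid \text{chunk}) = \sigma(z) = \frac{1}{1 + e^{-z}}
```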
## Use
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

from modeling import ChunkClassifier  # bundled in this repo

REPO = "cometadata/funding-chunk-classifier-modernbert-base"
device = "cuda"  # the autocast context below also assumes CUDA

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()


# For a long paper, slide an 8192-token window with stride 4096.
def chunks_of(text, max_tok=8192, stride=4096):
    enc = tokenizer(text, add_special_tokens=False, truncation=False)
    ids = enc["input_ids"]
    if len(ids) <= max_tok:
        yield ids, 0, len(ids)
        return
    for st in range(0, len(ids), stride):
        en = min(st + max_tok, len(ids))
        yield ids[st:en], st, en
        if en == len(ids):
            break


# `paper_text` is the full text of one paper (a str) that you supply.
probs = []
for chunk_ids, st, en in chunks_of(paper_text):
    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
    attn = torch.ones_like(ids_t)  # single unpadded chunk: every token is real
    with torch.no_grad():
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            logit = model(ids_t, attn).float()
    # Store (probability, start token offset, end token offset)
    probs.append((torch.sigmoid(logit).item(), st, en))

# Top-K chunks above threshold
top_k = sorted(probs, key=lambda p: -p[0])[:2]
top_k = [p for p in top_k if p[0] >= 0.4]
# `top_k` is the list to hand off to the span-head model.
```
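
To make the windowing in `chunks_of` concrete, the sketch below reproduces only its offset arithmetic, with no tokenizer or model. `window_bounds` is a hypothetical helper name, and the 20,000-token length is an illustrative input, not a figure from the dataset.

```python
def window_bounds(n_tokens, max_tok=8192, stride=4096):
    """Return (start, end) token offsets, mirroring the loop in chunks_of."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]
    bounds = []
    for start in range(0, n_tokens, stride):
        end = min(start + max_tok, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:  # last window reached the end of the document
            break
    return bounds


# A hypothetical 20,000-token paper yields four overlapping windows.
print(window_bounds(20_000))
# [(0, 8192), (4096, 12288), (8192, 16384), (12288, 20000)]
```

Because the stride is half the window, a statement that straddles a window edge lies wholly inside a neighbouring window, as long as it is no longer than 4,096 tokens.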
## Training data
Built from the 2,384 training rows of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
For each train doc:
- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
- For each chunk, label `1` iff the gold funding statement (located via
  verbatim substring or `rapidfuzz.fuzz.partial_ratio_alignment ≥ 0.7`)
  overlaps the chunk's character range by more than half of the statement's
  length, else `0`.
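
The labeling rule in the last bullet can be sketched as a simple interval-overlap check. This is an illustration rather than the training script: `chunk_label` is a hypothetical name, the offsets are toy values, and it assumes that "half its length" refers to the gold statement rather than to the chunk.

```python
def chunk_label(gold_span, chunk_span):
    """Return 1 if the chunk covers more than half of the gold statement.

    Both arguments are (start, end) character offsets, end exclusive.
    Assumption: the "half" is measured against the statement's length.
    """
    gold_start, gold_end = gold_span
    chunk_start, chunk_end = chunk_span
    overlap = max(0, min(gold_end, chunk_end) - max(gold_start, chunk_start))
    return int(overlap > 0.5 * (gold_end - gold_start))


# Toy offsets: a 300-character statement at characters 1000-1300.
gold = (1000, 1300)
print(chunk_label(gold, (0, 1200)))     # 200 of 300 chars overlap -> 1
print(chunk_label(gold, (1200, 5000)))  # 100 of 300 chars overlap -> 0
```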
Negative docs (no funding statement) contribute only negative chunks; positive
docs contribute one positive chunk (the one containing the gold statement)
plus several negative chunks from the rest of the doc, so the negative class
is naturally dominant (~8× more negatives than positives).

Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700
negative).
## Loss
Binary cross-entropy with `pos_weight = n_examples / n_positives` to
counteract the class imbalance:
```python
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
loss = loss_fn(logits, labels)
```
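
For intuition, the per-example loss that `BCEWithLogitsLoss` computes with a `pos_weight` can be written out in plain Python. This is a sketch for reading, not the training code: `weighted_bce_with_logits` is a hypothetical helper, and the counts are the approximate figures quoted in the training-data section.

```python
import math


def weighted_bce_with_logits(logit, label, pos_weight):
    """-[pos_weight * y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]."""
    log_p = -math.log1p(math.exp(-logit))     # log(sigmoid(z))
    log_not_p = -math.log1p(math.exp(logit))  # log(1 - sigmoid(z))
    return -(pos_weight * label * log_p + (1 - label) * log_not_p)


# Approximate counts from the training-data section.
n_examples, n_positives = 21_000, 2_300
pos_weight = n_examples / n_positives
print(round(pos_weight, 2))  # 9.13

# At logit 0 a positive example costs pos_weight times as much as a negative.
print(weighted_bce_with_logits(0.0, 1, pos_weight))
print(weighted_bce_with_logits(0.0, 0, pos_weight))
```

The plain `math.exp` calls overflow for very large logits; the PyTorch implementation uses a numerically stable form.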
## Hyperparameters
- Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (20 steps) + cosine decay
- Epochs: 3
- Batch: 2 per device × 8 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
- Saved checkpoint: `pytorch_model.bin` is the state dict after the final
  epoch (epoch index 2, counting from 0)
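
The schedule above (20 warmup steps, then cosine decay from the 5e-5 peak) can be sketched as a plain function of the step index. Two things here are assumptions, not facts from this card: the decay is taken to end at zero, and `total_steps = 3_900` is a placeholder, since the card does not state the step count. `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=20):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 3_900  # hypothetical; the card does not state the step count
for step in (0, 10, 20, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps))
```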
## Evaluation
Evaluated on the 597-row test split of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`,
treated as a **per-document binary task** (does the doc have any funding
statement?). We score each candidate chunk and use the maximum chunk
probability as the document-level score, with a decision threshold of 0.5.
| Metric | Precision | Recall | F1 | F0.5 |
|------------------------------|-----------|--------|--------|--------|
| Doc-level funding detection | 0.9831 | 0.9537 | 0.9682 | 0.9771 |
Sub-stats at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
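
The table values follow directly from these confusion counts. The short check below recomputes them; `prf` is a hypothetical helper name, and only the four counts come from this card.

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


tp, fp, fn, tn = 350, 6, 17, 224
precision, recall, f1 = prf(tp, fp, fn)
_, _, f_half = prf(tp, fp, fn, beta=0.5)
print(round(precision, 4), round(recall, 4), round(f1, 4), round(f_half, 4))
# 0.9831 0.9537 0.9682 0.9771
print(tp + fp + fn + tn)  # 597, the size of the test split
```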
**Chunk-recall caveat**: even when the doc-level prediction is correct, the
**top-1 chunk** contains the gold statement verbatim only ~68% of the time
(top-2 covers ~88%). This is why the downstream cascade uses **top-K=2**
chunks: it raises the chance that the gold-containing chunk is fed to the
span head.
## Intended use
Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and
stage-1 of the funding-extraction cascade. Useful when you want to skip
expensive span extraction on most papers (a sizable fraction of arXiv papers
have no funding statement).
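
For doc-level filtering, the chunk probabilities can be reduced to one decision by taking their maximum, as in the evaluation setup. A minimal sketch, assuming you already have a list of per-chunk probabilities; `doc_has_funding` is a hypothetical name, the inclusive `>=` comparison is an assumption, and the numbers are toy values rather than model outputs.

```python
def doc_has_funding(chunk_probs, threshold=0.5):
    """Doc-level decision: max chunk probability compared against a threshold."""
    if not chunk_probs:
        return False  # assumption: a document with no chunks counts as negative
    return max(chunk_probs) >= threshold


# Toy probabilities, not real model outputs.
print(doc_has_funding([0.03, 0.10, 0.92]))  # True
print(doc_has_funding([0.03, 0.10, 0.31]))  # False
```

Documents that come back `False` can skip the span-head and cleanup stages entirely.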
Not intended for: extraction (it only classifies chunks; pair with the
span-head model for spans), classification of funding sources, or text
outside the academic-paper domain.
## Limitations
- Trained only on arXiv-derived PDFs; behavior on other paper sources is
untested.
- Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use
top-K ≥ 2 if you need recall.
- Mean-pooling over 8,192 tokens dilutes the signal from a short funding
  statement (median ~272 characters), so the false-negative rate at a strict
  threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the
  `no_answer` head of the span-head model to suppress empty chunks.
## Citation / acknowledgement
Trained as part of an applied research cycle on the
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
dataset by Comet.