| license: cc0-1.0 | |
| base_model: answerdotai/ModernBERT-base | |
| library_name: transformers | |
| pipeline_tag: token-classification | |
| tags: | |
| - funding-extraction | |
| - arxiv | |
| - scholarly-communication | |
| - span-extraction | |
| - modernbert | |
| language: | |
| - en | |
| datasets: | |
| - cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test | |
| # ModernBERT-base Span-Head — Funding Statement Extraction | |
| A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a | |
| chunk of an academic paper (up to 8,192 tokens), it predicts the start and end | |
| token positions of a funding statement, plus a "no-answer" probability for | |
| documents with no funding statement. | |
| This is the **rough-extraction stage** of a two-stage cascade: | |
| 1. **Stage 1 (this model)**: ModernBERT-base + span head. Finds the rough | |
| span (loose-span F1 ≈ 0.96 at the 0.85 fuzzy-match threshold on the test set). | |
| 2. **Stage 2 (separate)**: `cometadata/funding-cleaning-qwen3-4b-lora` — | |
| cleans the rough span into the canonical, normalized funding statement | |
| (strips LaTeX markers, joins paragraph breaks, etc.). | |
| Use this model alone if you only need approximate localization; chain with the | |
| cleanup LoRA if you need the cleaned canonical text. | |
| ## Architecture | |
| The architecture is a custom `SpanHead` module (included in `modeling.py`): | |
| ```python | |
| import torch | |
| import torch.nn as nn | |
| from transformers import AutoModel | |
| class SpanHead(nn.Module): | |
|     """ModernBERT encoder + start/end/no-answer heads.""" | |
|     def __init__(self, base="answerdotai/ModernBERT-base"): | |
|         super().__init__() | |
|         self.encoder = AutoModel.from_pretrained(base) | |
|         h = self.encoder.config.hidden_size  # 768 | |
|         self.start_head = nn.Linear(h, 1) | |
|         self.end_head = nn.Linear(h, 1) | |
|         self.no_answer_head = nn.Linear(h, 1) | |
|         self.dropout = nn.Dropout(0.1) | |
|     def forward(self, input_ids, attention_mask): | |
|         out = self.encoder(input_ids=input_ids, attention_mask=attention_mask) | |
|         hidden = self.dropout(out.last_hidden_state) | |
|         start_logits = self.start_head(hidden).squeeze(-1) | |
|         end_logits = self.end_head(hidden).squeeze(-1) | |
|         # Mean-pool for no-answer | |
|         mask = attention_mask.unsqueeze(-1).float() | |
|         pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1) | |
|         no_answer = self.no_answer_head(pooled).squeeze(-1) | |
|         return start_logits, end_logits, no_answer | |
| ``` | |
| ## Use | |
| ```python | |
| import torch | |
| from huggingface_hub import hf_hub_download | |
| from transformers import AutoTokenizer | |
| from modeling import SpanHead  # bundled in this repo | |
| REPO = "cometadata/funding-extraction-modernbert-base-spanhead" | |
| # Fall back to CPU when no GPU is available. | |
| device = "cuda" if torch.cuda.is_available() else "cpu" | |
| tokenizer = AutoTokenizer.from_pretrained(REPO) | |
| model = SpanHead("answerdotai/ModernBERT-base").to(device) | |
| state_dict = torch.load( | |
|     hf_hub_download(REPO, "pytorch_model.bin"), | |
|     map_location=device, weights_only=True, | |
| ) | |
| model.load_state_dict(state_dict) | |
| model.eval() | |
| # `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the | |
| # acknowledgments-containing region). For long papers, run the model on | |
| # sliding 8192-token windows (stride 4096) and pick the chunk with the lowest | |
| # no-answer probability. | |
| enc = tokenizer(chunk_text, return_offsets_mapping=True, | |
|                 add_special_tokens=False, truncation=True, max_length=8192) | |
| ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device) | |
| attn = torch.ones_like(ids) | |
| with torch.no_grad(): | |
|     with torch.amp.autocast(device, dtype=torch.bfloat16): | |
|         start_logits, end_logits, no_answer = model(ids, attn) | |
| start_logits = start_logits.squeeze(0).float().cpu() | |
| end_logits = end_logits.squeeze(0).float().cpu() | |
| no_answer_prob = torch.sigmoid(no_answer.float()).item() | |
| if no_answer_prob >= 0.5: | |
|     pred_span = ""  # this chunk has no funding statement | |
| else: | |
|     start = int(start_logits.argmax()) | |
|     # Constrain end to be at or after start and within ~300 tokens | |
|     end_window = end_logits[start:start + 300] | |
|     end = start + int(end_window.argmax()) | |
|     offsets = enc["offset_mapping"] | |
|     char_s = offsets[start][0] | |
|     char_e = offsets[end][1] | |
|     pred_span = chunk_text[char_s:char_e].strip() | |
| ``` | |
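The sliding-window advice in the comment above can be sketched in plain Python. This is a minimal illustration, not the authors' code: `sliding_windows` and `pick_chunk` are hypothetical helpers, and `no_answer_probs` is assumed to hold one sigmoid no-answer probability per window, obtained by running the model on each window as shown above.

```python
def sliding_windows(token_ids, max_tok=8192, stride=4096):
    """Return (start, end) token offsets of overlapping windows over token_ids."""
    n = len(token_ids)
    if n <= max_tok:
        return [(0, n)]  # short document: a single chunk
    spans = []
    start = 0
    while True:
        end = min(start + max_tok, n)
        spans.append((start, end))
        if end == n:  # the last window reaches the end of the document
            break
        start += stride
    return spans


def pick_chunk(no_answer_probs):
    """Index of the window with the lowest no-answer probability."""
    return min(range(len(no_answer_probs)), key=no_answer_probs.__getitem__)


# Toy example: small numbers stand in for 8192 / 4096.
spans = sliding_windows(list(range(10)), max_tok=4, stride=2)
best = pick_chunk([0.9, 0.2, 0.7, 0.95])  # hypothetical per-window probabilities
```

Working with token offsets rather than re-tokenized text keeps the character offsets of each window aligned with the original document.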
| ## Training data | |
| Built from the 2,384 training rows of | |
| `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`. | |
| For each positive doc (1,416 rows): | |
| - Tokenize `vlm_markdown` with the ModernBERT tokenizer. | |
| - Locate the gold funding statement in `vlm_markdown` via verbatim substring, | |
| or via `rapidfuzz.partial_ratio_alignment` if not verbatim. Convert | |
| char-span to token-span. | |
| - Pick the 8,192-token sliding window (stride 4,096) that contains the gold | |
| span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk. | |
| - Training labels: `start_tok` and `end_tok` indices within the chunk; | |
| `no_answer = 0`. | |
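The char-span to token-span conversion and the window choice described above might look roughly like this. It is a sketch under stated assumptions: both function names are hypothetical, `offsets` is the per-token `(start, end)` character list a fast tokenizer returns with `return_offsets_mapping=True`, and taking the first window that contains the span is a guess (the card does not say which window wins when two overlap the span).

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Map the character span [char_start, char_end) to inclusive token indices."""
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if e <= char_start or s >= char_end:
            continue  # token lies entirely outside the span
        if start_tok is None:
            start_tok = i
        end_tok = i
    return start_tok, end_tok


def window_containing(start_tok, end_tok, n_tokens, max_tok=8192, stride=4096):
    """First sliding window (w_start, w_end) that fully contains the span, else None."""
    if n_tokens <= max_tok:
        return (0, n_tokens)  # whole document fits in one chunk
    w_start = 0
    while True:
        w_end = min(w_start + max_tok, n_tokens)
        if w_start <= start_tok and end_tok < w_end:
            return (w_start, w_end)
        if w_end == n_tokens:
            return None
        w_start += stride


# Toy example: whitespace "tokens" of "Funded by NSF grant 123."
offsets = [(0, 6), (7, 9), (10, 13), (14, 19), (20, 24)]
tok_span = char_span_to_token_span(offsets, 10, 19)  # characters of "NSF grant"
window = window_containing(5, 7, n_tokens=10, max_tok=4, stride=2)
labels = (5 - window[0], 7 - window[0])  # start_tok / end_tok relative to the chunk
```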
| For each negative doc (968 rows): | |
| - Use the last 8,192-token chunk of the doc (since funding statements, when | |
| they exist, are typically near the end). | |
| - Training labels: `start_tok = end_tok = 0`; `no_answer = 1`. | |
| About 5% of positive rows, those where no fuzzy alignment ≥ 0.7 could be | |
| found, are dropped. Final training set: ~3,300 chunks. | |
| ## Loss | |
| ``` | |
| loss = CE(start_logits[no_answer==0], gold_start) | |
| + CE(end_logits[no_answer==0], gold_end) | |
| + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label) | |
| ``` | |
| The start/end CE is masked out on negative chunks; the no-answer BCE is | |
| computed on all chunks. Padded positions in `start_logits`/`end_logits` are | |
| masked to `-1e4` so they can't be argmax'd. | |
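To make the arithmetic of the loss concrete, here is a framework-free sketch in plain Python. It is illustrative only: the function names and the batch dictionary layout are hypothetical, and mean reduction over chunks is an assumption (the card does not state the reduction). In actual training the same quantities would be computed with PyTorch tensors.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `target`, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mask_padding(logits, attention_mask, fill=-1e4):
    """Replace logits at padded positions so they cannot win the argmax."""
    return [x if keep else fill for x, keep in zip(logits, attention_mask)]


def span_loss(batch):
    """Start/end CE on positive chunks plus no-answer BCE on all chunks."""
    span_terms, na_terms = [], []
    for ex in batch:
        na_terms.append(bce_with_logits(ex["no_answer_logit"], ex["no_answer"]))
        if ex["no_answer"] == 0:  # start/end CE is skipped on negative chunks
            s = mask_padding(ex["start_logits"], ex["attention_mask"])
            e = mask_padding(ex["end_logits"], ex["attention_mask"])
            span_terms.append(cross_entropy(s, ex["start_tok"])
                              + cross_entropy(e, ex["end_tok"]))
    span = sum(span_terms) / len(span_terms) if span_terms else 0.0
    return span + 1.0 * sum(na_terms) / len(na_terms)  # assumed mean reduction


# Toy batch: one positive chunk (last position is padding) and one negative.
pos = {"start_logits": [0.0, 0.0, 5.0], "end_logits": [0.0, 0.0, 5.0],
       "attention_mask": [1, 1, 0], "start_tok": 0, "end_tok": 1,
       "no_answer_logit": 0.0, "no_answer": 0}
neg = {"start_logits": [0.0, 0.0, 0.0], "end_logits": [0.0, 0.0, 0.0],
       "attention_mask": [1, 1, 1], "start_tok": 0, "end_tok": 0,
       "no_answer_logit": 0.0, "no_answer": 1}
loss = span_loss([pos, neg])
```

In the toy batch the padded logit of 5.0 is replaced by `-1e4`, so it contributes nothing to the softmax over start or end positions.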
| ## Hyperparameters | |
| - Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context) | |
| - Optimizer: AdamW, lr 5e-5, weight decay 0.01 | |
| - Schedule: linear warmup (30 steps) + cosine decay | |
| - Epochs: 4 | |
| - Batch: 4 per device × 4 grad accum = 16 effective | |
| - Mixed precision: bfloat16 | |
| - Max sequence: 8,192 tokens | |
| - Trained on 1× H100 80GB | |
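The warmup-plus-cosine schedule listed above can be written as a small function of the optimizer step. This is a sketch of the usual shape of such a schedule, not the training script: `lr_at` is a hypothetical name, decay to zero is an assumption, and `total_steps=800` is a made-up figure for illustration, since the card does not report the total step count.

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_steps=30):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


total_steps = 800  # hypothetical; not stated in the card
schedule = [lr_at(s, total_steps) for s in (0, 15, 30, 415, 800)]
```

The sampled steps cover the start of warmup, its midpoint, the peak, the midpoint of the decay, and the final step.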
| ## Evaluation | |
| On the 597-row test split of | |
| `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`. | |
| At inference we ran this model on the top-2 chunks selected by a separate | |
| ModernBERT-base chunk classifier (binary funding-yes, mean-pooled | |
| classification head) and picked the chunk with the lower no-answer prob. | |
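The selection rule used at evaluation time (take the top-2 chunks by classifier score, then keep the one with the lower no-answer probability) can be sketched as follows. `select_chunk` is a hypothetical helper; `funding_scores` stands in for the separate chunk classifier's outputs, and `no_answer_fn` for a call into this model.

```python
def select_chunk(funding_scores, no_answer_fn, k=2):
    """Among the k highest-scoring chunks, return the index with the lowest
    no-answer probability.

    funding_scores: one classifier score per chunk (higher = more likely funding).
    no_answer_fn: callable mapping a chunk index to the span model's
        no-answer probability; it is only called for the top-k chunks.
    """
    ranked = sorted(range(len(funding_scores)),
                    key=lambda i: funding_scores[i], reverse=True)
    return min(ranked[:k], key=no_answer_fn)


# Toy example with made-up numbers.
scores = [0.10, 0.80, 0.75, 0.05]
no_answer = {1: 0.40, 2: 0.15}  # span model run only on the top-2 chunks
chosen = select_chunk(scores, no_answer.__getitem__)
```

Passing a callable rather than a precomputed list means the span model runs on only `k` chunks per document.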
| | Metric | Precision | Recall | F1 | F0.5 | | |
| |---------------------------------------|-----------|--------|--------|--------| | |
| | Binary detection | 0.9887 | 0.9510 | 0.9694 | 0.9809 | | |
| | Strict span (`token_sort_ratio≥0.95`) | 0.7365 | 0.7084 | 0.7222 | 0.7307 | | |
| | Loose span (max-of-4 fuzz ≥ 0.85) | 0.9745 | 0.9373 | 0.9556 | 0.9668 | | |
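The F1 and F0.5 columns follow from precision and recall through the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R); F0.5 weights precision more heavily than recall. The short check below recomputes both columns from the table's rounded precision and recall, so small differences in the last decimal are expected.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Binary detection": (0.9887, 0.9510),
    "Strict span": (0.7365, 0.7084),
    "Loose span": (0.9745, 0.9373),
}
recomputed = {name: (f_beta(p, r), f_beta(p, r, beta=0.5))
              for name, (p, r) in rows.items()}
for name, (f1, f05) in recomputed.items():
    print(f"{name}: F1={f1:.4f} F0.5={f05:.4f}")
```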
| **Hard ceiling note**: ~28% of test gold statements are not verbatim | |
| substrings of any source representation in the dataset (the dataset's labels | |
| were normalized by frontier models — whitespace, LaTeX markers, paragraph | |
| joins). The 0.95 strict threshold is unforgiving of those normalizations even | |
| on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any | |
| single-stage extractive model. The loose-span F1 of 0.96 is closer to the | |
| practical extractive ceiling. | |
| For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora` | |
| which cleans the rough span into the canonical text. | |
| ## Cascade pipeline | |
| For long papers (> 8,192 tokens), use a chunk-classifier first to pick the | |
| chunk most likely to contain the funding statement: | |
| ```python | |
| # Pseudocode for the full cascade | |
| chunks = sliding_windows(doc, max_tok=8192, stride=4096) | |
| chunk_probs = [chunk_classifier(c) for c in chunks] | |
| top_chunk = chunks[argmax(chunk_probs)] | |
| rough_span = spanhead_model(top_chunk) # this model | |
| clean_span = cleanup_lora(rough_span, top_chunk) # other model | |
| ``` | |
| A simple heuristic alternative to the chunk classifier (also works fine): | |
| just use the last 8,192-token window of the document — funding statements are | |
| usually near the end. This loses a few percentage points of recall on papers | |
| with funding info mid-document. | |
| ## Intended use | |
| Extraction of the **rough span** containing a funding acknowledgment from | |
| arXiv paper text (or similar academic markdown). Designed to be the first | |
| stage of a two-stage cascade with the cleanup LoRA, but usable on its own if | |
| you only need approximate localization. | |
| Not intended for: classification of funding sources, downstream | |
| funder/grant/scheme parsing, or extraction from non-paper text. | |
| ## Limitations | |
| - Trained on arXiv-derived PDFs only; behavior on other paper sources is | |
| untested. | |
| - Outputs a rough span — for canonical, downstream-ready text, chain with the | |
| cleanup LoRA. | |
| - Will occasionally pick the wrong sibling sentence when an acknowledgments | |
| section contains multiple funding statements (each person's own grants); | |
| this is the dominant failure mode of the strict-F1 evaluation. | |
| ## Citation / acknowledgement | |
| Trained as part of an applied research cycle on the | |
| `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test` | |
| dataset by Comet. | |