---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: token-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- span-extraction
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---

# ModernBERT-base Span-Head: Funding Statement Extraction

A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a chunk of an academic paper (up to 8,192 tokens), it predicts the start and end token positions of a funding statement, plus a "no-answer" probability for chunks that contain no funding statement.

This is the **rough-extraction stage** of a two-stage cascade:

1. **Stage 1 (this model)**: ModernBERT-base + span head. Finds the rough span (loose-span F1 ≈ 0.96 on the test set at the 0.85 fuzzy-match threshold).
2. **Stage 2 (separate)**: `cometadata/funding-cleaning-qwen3-4b-lora`. Cleans the rough span into the canonical, normalized funding statement (strips LaTeX markers, joins paragraph breaks, etc.).

Use this model alone if you only need approximate localization; chain with the cleanup LoRA if you need the cleaned canonical text.

## Architecture

The architecture is a custom `SpanHead` module (included in `modeling.py`):

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SpanHead(nn.Module):
    """ModernBERT encoder + start/end/no-answer heads."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        h = self.encoder.config.hidden_size  # 768
        self.start_head = nn.Linear(h, 1)
        self.end_head = nn.Linear(h, 1)
        self.no_answer_head = nn.Linear(h, 1)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(out.last_hidden_state)
        start_logits = self.start_head(hidden).squeeze(-1)
        end_logits = self.end_head(hidden).squeeze(-1)
        # Mean-pool over non-padded tokens for the no-answer head
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        no_answer = self.no_answer_head(pooled).squeeze(-1)
        return start_logits, end_logits, no_answer
```

## Use

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

from modeling import SpanHead  # bundled in this repo

REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = SpanHead("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device,
    weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

# `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the
# acknowledgments-containing region). For long papers, run the model on
# sliding 8192-token windows (stride 4096) and pick the chunk with the lowest
# no-answer probability.
enc = tokenizer(chunk_text, return_offsets_mapping=True,
                add_special_tokens=False, truncation=True, max_length=8192)
ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
attn = torch.ones_like(ids)

with torch.no_grad():
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        start_logits, end_logits, no_answer = model(ids, attn)

start_logits = start_logits.squeeze(0).float().cpu()
end_logits = end_logits.squeeze(0).float().cpu()
no_answer_prob = torch.sigmoid(no_answer).item()

if no_answer_prob >= 0.5:
    pred_span = ""  # this chunk has no funding statement
else:
    start = int(start_logits.argmax())
    # Constrain end to be at or after start and within 300 tokens of it
    end_window = end_logits[start:start + 300]
    end = start + int(end_window.argmax())
    offsets = enc["offset_mapping"]
    char_s = offsets[start][0]
    char_e = offsets[end][1]
    pred_span = chunk_text[char_s:char_e].strip()
```

## Training data

Built from the 2,384 training rows of `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.

For each positive doc (1,416 rows):

- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Locate the gold funding statement in `vlm_markdown` via verbatim substring match, or via `rapidfuzz.partial_ratio_alignment` if there is no verbatim match. Convert the character span to a token span.
- Pick the 8,192-token sliding window (stride 4,096) that fully contains the gold span. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
- Training labels: `start_tok` and `end_tok` indices within the chunk; `no_answer = 0`.

For each negative doc (968 rows):

- Use the last 8,192-token chunk of the doc (since funding statements, when they exist, are typically near the end).
- Training labels: `start_tok = end_tok = 0`; `no_answer = 1`.

About 5% of positive rows, those for which no fuzzy alignment with score ≥ 0.7 could be found, are dropped. Final training set: ~3,300 chunks.

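
The window-selection and labeling steps described above can be sketched in plain Python. This is a minimal illustration, not the training script: the helper names (`sliding_windows`, `positive_labels`, `negative_labels`) are hypothetical, gold token indices are assumed to be inclusive, the first window that fully contains the gold span is chosen (the card does not say how ties between overlapping windows are broken), and "the last 8,192-token chunk" is assumed to mean the final 8,192 tokens of the document.

```python
from typing import List, Optional, Tuple

Window = Tuple[int, int]  # [start, end) token range within the document


def sliding_windows(n_tokens: int, max_len: int = 8192, stride: int = 4096) -> List[Window]:
    """Return overlapping [start, end) windows covering a document."""
    if n_tokens <= max_len:
        return [(0, n_tokens)]  # short doc: one chunk
    windows = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows


def positive_labels(n_tokens: int, gold_start: int, gold_end: int,
                    max_len: int = 8192, stride: int = 4096
                    ) -> Optional[Tuple[Window, int, int, int]]:
    """Pick a window that fully contains the gold span (inclusive indices).

    Returns (window, start_tok, end_tok, no_answer) with chunk-relative
    indices, or None if no window contains the whole span.
    """
    for w_start, w_end in sliding_windows(n_tokens, max_len, stride):
        if w_start <= gold_start and gold_end < w_end:
            return (w_start, w_end), gold_start - w_start, gold_end - w_start, 0
    return None


def negative_labels(n_tokens: int, max_len: int = 8192) -> Tuple[Window, int, int, int]:
    """Use the final max_len tokens; start/end are dummies, no_answer = 1."""
    w_start = max(0, n_tokens - max_len)
    return (w_start, n_tokens), 0, 0, 1
```

For a 20,000-token document with a gold span at tokens 9,000 to 9,100, `positive_labels(20000, 9000, 9100)` selects the window starting at token 4,096 and returns chunk-relative labels 4,904 and 5,004.
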
## Loss

```
loss = CE(start_logits[no_answer==0], gold_start)
     + CE(end_logits[no_answer==0], gold_end)
     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)
```

The start/end CE is masked out on negative chunks; the no-answer BCE is computed on all chunks. Padded positions in `start_logits`/`end_logits` are masked to `-1e4` so they cannot be selected by argmax.

## Hyperparameters

- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (30 steps) + cosine decay
- Epochs: 4
- Batch: 4 per device × 4 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB

## Evaluation

Evaluated on the 597-row test split of `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`. At inference time, this model was run on the top-2 chunks selected by a separate ModernBERT-base chunk classifier (binary funding-yes, mean-pooled classification head), and the chunk with the lower no-answer probability was kept.

| Metric                                | Precision | Recall | F1     | F0.5   |
|---------------------------------------|-----------|--------|--------|--------|
| Binary detection                      | 0.9887    | 0.9510 | 0.9694 | 0.9809 |
| Strict span (`token_sort_ratio≥0.95`) | 0.7365    | 0.7084 | 0.7222 | 0.7307 |
| Loose span (max-of-4 fuzz ≥ 0.85)     | 0.9745    | 0.9373 | 0.9556 | 0.9668 |

**Hard ceiling note**: ~28% of test gold statements are not verbatim substrings of any source representation in the dataset, because the dataset's labels were normalized by frontier models (whitespace, LaTeX markers, paragraph joins). The 0.95 strict threshold is unforgiving of those normalizations even on perfectly extracted source spans, so strict F1 is capped near 0.73 for any single-stage extractive model. The loose-span F1 of 0.96 is closer to the practical extractive ceiling. For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora`, which cleans the rough span into the canonical text.

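
As a consistency check on the evaluation table above, the F1 and F0.5 columns can be recomputed from the reported precision and recall with the standard F-beta formula. The sketch below assumes the card uses that standard definition; `f_beta` is an illustrative helper, not part of this repo.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the evaluation table
rows = {
    "Binary detection": (0.9887, 0.9510),
    "Strict span": (0.7365, 0.7084),
    "Loose span": (0.9745, 0.9373),
}

for name, (p, r) in rows.items():
    f1 = f_beta(p, r)
    f05 = f_beta(p, r, beta=0.5)
    print(f"{name}: F1={f1:.4f}  F0.5={f05:.4f}")
```

Because the table's precision and recall are themselves rounded, the recomputed scores can differ from the reported ones in the fourth decimal place.
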
## Cascade pipeline

For long papers (> 8,192 tokens), use a chunk classifier first to pick the chunk most likely to contain the funding statement:

```python
# Pseudocode for the full cascade
chunks = sliding_windows(doc, max_tok=8192, stride=4096)
chunk_probs = [chunk_classifier(c) for c in chunks]
top_chunk = chunks[argmax(chunk_probs)]
rough_span = spanhead_model(top_chunk)           # this model
clean_span = cleanup_lora(rough_span, top_chunk)  # other model
```

A simple heuristic alternative to the chunk classifier also works well: use the last 8,192-token window of the document, since funding statements are usually near the end. This loses a few percentage points of recall on papers with funding information mid-document.

## Intended use

Extraction of the **rough span** containing a funding acknowledgment from arXiv paper text (or similar academic markdown). Designed to be the first stage of a two-stage cascade with the cleanup LoRA, but usable on its own if you only need approximate localization.

Not intended for: classification of funding sources, downstream funder/grant/scheme parsing, or extraction from non-paper text.

## Limitations

- Trained on arXiv-derived PDFs only; behavior on other paper sources is untested.
- Outputs a rough span; for canonical, downstream-ready text, chain with the cleanup LoRA.
- Will occasionally pick the wrong sibling sentence when an acknowledgments section contains multiple funding statements (for example, each author's own grants); this is the dominant failure mode in the strict-F1 evaluation.

## Citation / acknowledgement

Trained as part of an applied research cycle on the `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test` dataset by Comet.