---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: text-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- chunk-classification
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---

# ModernBERT-base Chunk Classifier: Funding Statement Localization

A binary classifier on top of `answerdotai/ModernBERT-base` that scores a single 8,192-token chunk of an academic paper for the presence of a funding statement. Used as **stage 1 of a three-stage funding-extraction cascade** to narrow a long PDF down to the most likely chunk before running expensive span-extraction and cleanup.

The full cascade:

1. **Stage 1 (this model)**: For each ≤8,192-token chunk of the paper, predict a scalar `P(this chunk contains a funding statement)`. Take top-K chunks above a threshold (we use top-2 above 0.4).
2. **Stage 2, span head**: [`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead) picks the exact start/end token within the top chunk.
3. **Stage 3, cleanup LoRA**: [`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora) strips LaTeX markers and normalizes whitespace in the extracted span.

You can use this model standalone if you only need to flag whether a chunk (or doc) contains funding language at all (binary F1 0.97 on the test set).
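For standalone document-level flagging, the Evaluation section of this card takes the maximum chunk probability and compares it to a threshold of 0.5. The sketch below shows that aggregation on its own, without loading the model. The helper name `doc_has_funding` and the example scores are hypothetical; the tuple layout `(probability, start_token, end_token)` mirrors the `probs` list built in the usage example.

```python
from typing import List, Tuple

# (probability, start_token, end_token) for one scored chunk.
ChunkScore = Tuple[float, int, int]


def doc_has_funding(scores: List[ChunkScore], threshold: float = 0.5) -> bool:
    """Flag a document when its best-scoring chunk reaches the threshold.

    Hypothetical helper: a document with no chunks is treated as negative.
    """
    if not scores:
        return False
    return max(prob for prob, _, _ in scores) >= threshold


# Made-up chunk scores for a three-chunk paper.
example = [(0.12, 0, 8192), (0.31, 4096, 12288), (0.91, 8192, 15000)]
print(doc_has_funding(example))  # True, because the best chunk scores 0.91
```

Lowering `threshold` trades precision for recall; the value 0.5 is the one reported in the evaluation, not a tuned recommendation for other corpora.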
## Architecture

The architecture is a custom `ChunkClassifier` module (included in `modeling.py`):

```python
import torch.nn as nn
from transformers import AutoModel


class ChunkClassifier(nn.Module):
    """ModernBERT encoder + mean-pool + binary head."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pool over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)  # one logit per chunk
```

## Use

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

from modeling import ChunkClassifier  # bundled in this repo; must be on your import path

REPO = "cometadata/funding-chunk-classifier-modernbert-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device,
    weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()


# For a long paper, slide an 8192-token window with stride 4096.
def chunks_of(text, max_tok=8192, stride=4096):
    enc = tokenizer(text, add_special_tokens=False, truncation=False)
    ids = enc["input_ids"]
    if not ids:  # nothing to score for empty text
        return
    if len(ids) <= max_tok:
        yield ids, 0, len(ids)
        return
    for st in range(0, len(ids), stride):
        en = min(st + max_tok, len(ids))
        yield ids[st:en], st, en
        if en == len(ids):
            break


# `paper_text` is a string you supply: the full text of one paper.
probs = []
for chunk_ids, st, en in chunks_of(paper_text):
    ids_t = torch.tensor(chunk_ids, dtype=torch.long).unsqueeze(0).to(device)
    attn = torch.ones_like(ids_t)
    with torch.no_grad():
        with torch.amp.autocast(device, dtype=torch.bfloat16):
            logit = model(ids_t, attn).float()
    probs.append((torch.sigmoid(logit).item(), st, en))

# Top-K chunks above threshold
top_k = sorted(probs, key=lambda p: -p[0])[:2]
top_k = [p for p in top_k if p[0] >= 0.4]
# `top_k` is the list to hand off to the span-head model.
```

## Training data

Built from the 2,384 training rows of `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`. For each train doc:

- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
- For each chunk, label `1` iff the gold funding statement (located via verbatim substring or `rapidfuzz.partial_ratio_alignment ≥ 0.7`) overlaps the chunk's character range by more than half its length, else `0`.

Negative docs (no funding statement) contribute negative chunks; positive docs contribute one positive chunk (the one containing the gold) plus several negative chunks from the rest of the doc, so the negative class is naturally dominant (roughly 8× more negatives than positives).

Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700 negative).
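The labeling rule above can be sketched as a small overlap test on character ranges. This is an illustration, not the training script: the function names and the example spans are hypothetical, the step that locates the gold statement (verbatim or fuzzy match) is left out, and "more than half its length" is assumed to refer to the length of the gold statement rather than the chunk.

```python
from typing import List, Tuple


def char_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Characters shared by half-open ranges [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_chunks(chunk_spans: List[Tuple[int, int]],
                 gold_span: Tuple[int, int]) -> List[int]:
    """Label a chunk 1 if it covers more than half of the gold statement.

    Assumption: the "half its length" test is measured against the
    statement's length, not the chunk's.
    """
    g_start, g_end = gold_span
    gold_len = g_end - g_start
    labels = []
    for c_start, c_end in chunk_spans:
        shared = char_overlap(c_start, c_end, g_start, g_end)
        labels.append(1 if gold_len > 0 and shared * 2 > gold_len else 0)
    return labels


# Hypothetical character ranges for three overlapping chunks of one document.
chunks = [(0, 30000), (15000, 45000), (30000, 52000)]
gold = (44900, 45200)  # a 300-character statement straddling a chunk boundary

print(label_chunks(chunks, gold))  # [0, 0, 1]
```

In this example the second chunk holds only 100 of the statement's 300 characters, so it stays negative. A statement that falls entirely inside the region shared by two adjacent windows would be labeled positive in both by this sketch; the card describes one positive chunk per document, so the real pipeline may apply a tie-break that is not shown here.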
## Loss

Binary cross-entropy with `pos_weight = n_examples / n_positives` to counteract the class imbalance:

```python
# `labels` must be a float tensor (0.0 / 1.0) with the same shape as `logits`.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
loss = loss_fn(logits, labels)
```

## Hyperparameters

- Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (20 steps) + cosine decay
- Epochs: 3
- Batch: 2 per device × 8 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
- Saved checkpoint: `pytorch_model.bin` is the epoch-2 (final) state dict

## Evaluation

On the 597-row test split of `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`, treated as a **per-document binary task** (does the doc have any funding statement?): we score each candidate chunk and use the max probability as the document-level prediction. Threshold = 0.5.

| Metric                      | Precision | Recall | F1     | F0.5   |
|-----------------------------|-----------|--------|--------|--------|
| Doc-level funding detection | 0.9831    | 0.9537 | 0.9682 | 0.9771 |

Confusion counts at threshold 0.5: TP=350, FP=6, FN=17, TN=224.

**Chunk-recall caveat**: even when the doc-level prediction is correct, the **top-1 chunk** contains the gold statement verbatim only ~68% of the time (top-2 covers ~88%). This is why the downstream cascade uses **top-K=2** chunks: it raises the chance that the gold-containing chunk is fed to the span head.

## Intended use

Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and stage 1 of the funding-extraction cascade. Useful when you want to skip expensive span extraction on most papers (a sizable fraction of arXiv papers have no funding statement).

Not intended for: extraction (it only classifies chunks; pair with the span-head model for spans), classification of funding sources, or text outside the academic-paper domain.
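The document-level figures in the Evaluation table can be re-derived from the reported confusion counts (TP=350, FP=6, FN=17, TN=224). The short check below uses the standard precision, recall, and F-beta definitions; the helper name `prf` is hypothetical and is not part of this repo.

```python
def prf(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Return (precision, recall, F-beta) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Counts reported in the Evaluation section at threshold 0.5.
tp, fp, fn, tn = 350, 6, 17, 224

precision, recall, f1 = prf(tp, fp, fn, beta=1.0)
_, _, f05 = prf(tp, fp, fn, beta=0.5)

print(f"docs={tp + fp + fn + tn} P={precision:.4f} R={recall:.4f} "
      f"F1={f1:.4f} F0.5={f05:.4f}")
# docs=597 P=0.9831 R=0.9537 F1=0.9682 F0.5=0.9771
```

The counts sum to the 597 test documents, and F0.5 is higher than F1 because it weights precision more heavily than recall.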
## Limitations

- Trained only on arXiv-derived PDFs; behavior on other paper sources is untested.
- Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use top-K ≥ 2 if you need recall.
- Mean-pooling over 8,192 tokens dilutes the signal from a short funding statement (median ~272 characters), so the false-negative rate at a strict threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the span head's `no_answer` head to suppress empty chunks.

## Citation / acknowledgement

Trained as part of an applied research cycle on the `cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test` dataset by Comet.