---
license: cc0-1.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: text-classification
tags:
- funding-extraction
- arxiv
- scholarly-communication
- chunk-classification
- modernbert
language:
- en
datasets:
- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
---

# ModernBERT-base Chunk Classifier: Funding Statement Localization

A binary classifier on top of `answerdotai/ModernBERT-base` that scores a
single chunk (up to 8,192 tokens) of an academic paper for the presence of a
funding statement. Used as **stage 1 of a three-stage funding-extraction
cascade** to narrow a long PDF down to the most likely chunks before running
expensive span extraction and cleanup.

The full cascade:

1. **Stage 1 (this model)**: For each ≤8,192-token chunk of the paper,
   predict a scalar `P(this chunk contains a funding statement)`. Take the
   top-K chunks above a threshold (we use top-2 above 0.4).
2. **Stage 2, span head**:
   [`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead)
   picks the exact start/end token within the top chunk.
3. **Stage 3, cleanup LoRA**:
   [`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora)
   strips LaTeX markers and normalizes whitespace in the extracted span.

You can use this model standalone if you only need to flag whether a chunk
(or doc) contains funding language at all (doc-level binary F1 0.97 on the
test set).

## Architecture

The architecture is a custom `ChunkClassifier` module (included in
`modeling.py`):

```python
import torch.nn as nn
from transformers import AutoModel


class ChunkClassifier(nn.Module):
    """ModernBERT encoder + mean-pool + binary head."""

    def __init__(self, base="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pool over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)  # one logit per chunk
```
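
To make the pooling step concrete, here is a framework-free sketch of the same masked mean on a toy input. The function name `masked_mean_pool` and the sample values are illustrative and not part of `modeling.py`.

```python
def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
    n_real = max(sum(mask), 1)  # mirrors .clamp(min=1): no division by zero
    return [t / n_real for t in total]


# Three token vectors; the last one is padding and must not affect the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the mean to roughly `[34.7, 35.3]`, which is why the model multiplies by `attention_mask` before summing.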

## Use

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# modeling.py is bundled in this repo; it must be on your import path
# (for example, downloaded next to this script).
from modeling import ChunkClassifier

REPO = "cometadata/funding-chunk-classifier-modernbert-base"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
state_dict = torch.load(
    hf_hub_download(REPO, "pytorch_model.bin"),
    map_location=device, weights_only=True,
)
model.load_state_dict(state_dict)
model.eval()

paper_text = "..."  # placeholder: replace with the full text of one paper


# For a long paper, slide an 8192-token window with stride 4096.
def chunks_of(text, max_tok=8192, stride=4096):
    enc = tokenizer(text, add_special_tokens=False, truncation=False)
    ids = enc["input_ids"]
    if len(ids) <= max_tok:
        yield ids, 0, len(ids)
        return
    for st in range(0, len(ids), stride):
        en = min(st + max_tok, len(ids))
        yield ids[st:en], st, en
        if en == len(ids):
            break


probs = []  # (probability, start token, end token) per chunk
for chunk_ids, st, en in chunks_of(paper_text):
    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
    attn = torch.ones_like(ids_t)  # single unpadded chunk: every token is real
    with torch.no_grad():
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            logit = model(ids_t, attn).float()
    probs.append((torch.sigmoid(logit).item(), st, en))

# Top-K chunks above threshold
top_k = sorted(probs, key=lambda p: -p[0])[:2]
top_k = [p for p in top_k if p[0] >= 0.4]
# `top_k` is the list to hand off to the span-head model.
```

## Training data

Built from the 2,384 training rows of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
For each train doc:

- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
- For each chunk, label `1` iff the gold funding statement (located via
  verbatim substring or `rapidfuzz.fuzz.partial_ratio_alignment ≥ 0.7`)
  overlaps the chunk's character range by more than half of the statement's
  length, else `0`.

Negative docs (no funding statement) contribute negative chunks; positive
docs contribute one positive chunk (the one containing the gold) plus several
negative chunks from the rest of the doc, so the negative class is
naturally dominant (~8× more negatives than positives).

Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700
negative).
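
The windowing and labelling rules above can be sketched as follows. This is an illustration under stated assumptions, not the training script: `window_bounds` and `chunk_label` are hypothetical names, spans are `(start, end)` character offsets, "half its length" is read as half of the gold statement's length, and the mapping from token windows to character ranges is not shown.

```python
def window_bounds(n_tokens, max_tok=8192, stride=4096):
    """Return the (start, end) token offsets of each sliding window."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]
    bounds = []
    for start in range(0, n_tokens, stride):
        end = min(start + max_tok, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:  # the last window reaches the end of the doc
            break
    return bounds


def chunk_label(chunk_span, gold_span):
    """1 if more than half of the gold statement lies inside the chunk.

    ASSUMPTION: both spans are (start, end) character offsets.
    """
    overlap = min(chunk_span[1], gold_span[1]) - max(chunk_span[0], gold_span[0])
    gold_len = gold_span[1] - gold_span[0]
    return int(overlap > 0.5 * gold_len)


print(window_bounds(20000))
# [(0, 8192), (4096, 12288), (8192, 16384), (12288, 20000)]
print(chunk_label((500, 1500), (900, 1172)))  # 1: statement fully inside
print(chunk_label((0, 1000), (900, 1172)))    # 0: only 100 of 272 chars inside
```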

## Loss

Binary cross-entropy with `pos_weight = n_examples / n_positives` to
counteract the class imbalance:

```python
import torch
import torch.nn as nn

# logits: model outputs for a batch; labels: float tensor of 0.0 / 1.0
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
loss = loss_fn(logits, labels)
```
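
As a rough worked example, the sketch below plugs the approximate chunk counts from the training-data section into the weight and evaluates the per-example loss by hand. `weighted_bce_with_logits` is an illustrative helper that follows the documented `BCEWithLogitsLoss` formula for a single example; the exact counts used in training may differ from these rounded figures.

```python
import math


def weighted_bce_with_logits(logit, label, pos_weight):
    """Per-example loss: -(w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))


n_examples, n_positives = 21_000, 2_300  # approximate counts from this card
pos_weight = n_examples / n_positives
print(round(pos_weight, 2))  # 9.13

# At logit 0 the model is maximally unsure (p = 0.5); a missed positive
# costs pos_weight times as much as a missed negative.
loss_pos = weighted_bce_with_logits(0.0, 1, pos_weight)
loss_neg = weighted_bce_with_logits(0.0, 0, pos_weight)
```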

## Hyperparameters

- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01
- Schedule: linear warmup (20 steps) + cosine decay
- Epochs: 3
- Batch: 2 per device × 8 grad accum = 16 effective
- Mixed precision: bfloat16
- Max sequence: 8,192 tokens
- Trained on 1× H100 80GB
- Saved checkpoint: `pytorch_model.bin` is the epoch-2 (final) state dict
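
The schedule line can be read as the following learning-rate curve. This is a sketch under assumptions: the card does not state the total step count or the final learning rate, so `total_steps` below is a rough estimate (3 epochs over about 21,000 chunks at an effective batch of 16) and the decay is assumed to reach zero, as in the usual half-cosine schedule.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=20):
    """Linear warmup to peak_lr, then half-cosine decay (assumed to end at 0)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# ASSUMPTION: rough estimate, not a figure reported by the card.
total_steps = 3 * math.ceil(21_000 / 16)

for step in (0, 10, 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```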

## Evaluation

On the 597-row test split of
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`,
treated as a **per-document binary task** (does the doc have any funding
statement?): we score each candidate chunk and use the max probability as
the document-level prediction. Threshold = 0.5.

| Task                        | Precision | Recall | F1     | F0.5   |
|-----------------------------|-----------|--------|--------|--------|
| Doc-level funding detection | 0.9831    | 0.9537 | 0.9682 | 0.9771 |

Confusion counts at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
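
The reported metrics follow directly from these counts. A short check, using the standard precision, recall and F-beta definitions (the helper name `prf` is illustrative):

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


tp, fp, fn, tn = 350, 6, 17, 224  # counts reported above; they sum to 597
precision, recall, f1 = prf(tp, fp, fn)
_, _, f_half = prf(tp, fp, fn, beta=0.5)
print(f"{precision:.4f} {recall:.4f} {f1:.4f} {f_half:.4f}")
# 0.9831 0.9537 0.9682 0.9771
```

F0.5 weights precision more heavily than recall, which is why it sits above F1 here: the model makes fewer false positives (6) than false negatives (17).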

**Chunk-recall caveat**: even when the doc-level prediction is correct, the
**top-1 chunk** contains the gold statement verbatim only ~68% of the time
(top-2 covers ~88%). This is why the downstream cascade uses **top-K=2**
chunks: it raises the chance that the gold-containing chunk is fed to the
span head.
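
The two ways of consuming the chunk scores (a document-level decision from the max, and a top-K hand-off to the span head) can be sketched as below. The function names and the scores are made up for illustration, and treating the thresholds as inclusive is an assumption.

```python
def doc_has_funding(chunk_probs, threshold=0.5):
    """Document-level call: compare the max chunk probability to the threshold."""
    return max(chunk_probs, default=0.0) >= threshold


def top_k_chunks(scored, k=2, min_prob=0.4):
    """scored: list of (prob, start, end). Keep the k best chunks at or above min_prob."""
    best = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    return [item for item in best if item[0] >= min_prob]


# Hypothetical scores for a four-chunk paper.
scored = [(0.12, 0, 8192), (0.55, 4096, 12288),
          (0.91, 8192, 16384), (0.35, 12288, 20000)]

print(doc_has_funding([p for p, _, _ in scored]))  # True
print(top_k_chunks(scored))
# [(0.91, 8192, 16384), (0.55, 4096, 12288)]
```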

## Intended use

Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and
stage 1 of the funding-extraction cascade. Useful when you want to skip
expensive span extraction on most papers (a sizable fraction of arXiv papers
have no funding statement).

Not intended for: extraction (it only classifies chunks; pair with the
span-head model for spans), classification of funding sources, or text
outside the academic-paper domain.

## Limitations

- Trained only on arXiv-derived PDFs; behavior on other paper sources is
  untested.
- Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use
  top-K ≥ 2 if you need recall.
- Mean-pooling over 8,192 tokens dilutes the signal from a short
  (~272-char-median) funding statement, so the false-negative rate at a
  strict threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the
  span head's `no_answer` head to suppress empty chunks.

## Citation / acknowledgement

Trained as part of an applied research cycle on the
`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
dataset by Comet.