Initial upload: ModernBERT-base chunk classifier (stage 1 of funding-extraction cascade)

Files changed (5)

README.md +203 -0
modeling.py +34 -0
pytorch_model.bin +3 -0
tokenizer.json +0 -0
tokenizer_config.json +16 -0

README.md ADDED

	@@ -0,0 +1,203 @@

+---
+license: cc0-1.0
+base_model: answerdotai/ModernBERT-base
+library_name: transformers
+pipeline_tag: text-classification
+tags:
+- funding-extraction
+- arxiv
+- scholarly-communication
+- chunk-classification
+- modernbert
+language:
+- en
+datasets:
+- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
+---
+# ModernBERT-base Chunk Classifier — Funding Statement Localization
+A binary classifier on top of `answerdotai/ModernBERT-base` that scores a
+single 8,192-token chunk of an academic paper for the presence of a funding
+statement. Used as **stage 1 of a three-stage funding-extraction cascade** to
+narrow a long PDF down to the most-likely chunk before running expensive
+span-extraction and cleanup.
+The full cascade:
+1. **Stage 1 (this model)**: For each ≤8,192-token chunk of the paper,
+   predict a scalar `P(this chunk contains a funding statement)`. Take top-K
+   chunks above a threshold (we use top-2 above 0.4).
+2. **Stage 2 — span head**:
+   [`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead)
+   — picks the exact start/end token within the top chunk.
+3. **Stage 3 — cleanup LoRA**:
+   [`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora)
+   — strips LaTeX markers and normalizes whitespace in the extracted span.
+
+You can also use this model standalone if you only need to flag whether a
+chunk (or a whole document) contains funding language at all (document-level
+binary F1 of 0.97 on the test split).
+## Architecture
+The architecture is a custom `ChunkClassifier` module (included in
+`modeling.py`):
+```python
+import torch.nn as nn
+from transformers import AutoModel
+class ChunkClassifier(nn.Module):
+    """ModernBERT encoder + mean-pool + binary head."""
+    def __init__(self, base="answerdotai/ModernBERT-base"):
+        super().__init__()
+        self.encoder = AutoModel.from_pretrained(base)
+        self.head = nn.Linear(self.encoder.config.hidden_size, 1)
+    def forward(self, input_ids, attention_mask):
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        # Mean pool over real (non-padding) tokens
+        mask = attention_mask.unsqueeze(-1).float()
+        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
+        return self.head(pooled).squeeze(-1)   # one logit per chunk
+```
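The pooling step in `forward` can be illustrated without loading the model. The following is a minimal pure-Python sketch of masked mean pooling; `masked_mean` and the toy values are hypothetical and only mirror the arithmetic of the tensor expression above.

```python
def masked_mean(hidden, mask):
    """Average the rows of `hidden` whose mask entry is 1.

    hidden: list of per-token vectors (lists of floats)
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for j in range(dim):
            total[j] += vec[j] * m
    count = max(sum(mask), 1)  # same guard as .clamp(min=1) in the module
    return [t / count for t in total]


# Two real tokens followed by one padding position (toy values).
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(hidden, mask)
```

The padding row does not affect `pooled`, and an all-zero mask yields a zero vector instead of a division by zero.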
+## Use
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import AutoTokenizer
+from modeling import ChunkClassifier  # bundled in this repo
+REPO = "cometadata/funding-chunk-classifier-modernbert-base"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained(REPO)
+model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
+state_dict = torch.load(
+    hf_hub_download(REPO, "pytorch_model.bin"),
+    map_location=device, weights_only=True,
+)
+model.load_state_dict(state_dict)
+model.eval()
+# For a long paper, slide an 8192-token window with stride 4096.
+def chunks_of(text, max_tok=8192, stride=4096):
+    enc = tokenizer(text, add_special_tokens=False, truncation=False)
+    ids = enc["input_ids"]
+    if len(ids) <= max_tok:
+        yield ids, 0, len(ids)
+        return
+    for st in range(0, len(ids), stride):
+        en = min(st + max_tok, len(ids))
+        yield ids[st:en], st, en
+        if en == len(ids):
+            break
+# `paper_text` is the full text of one paper as a single string (supplied by the caller).
+probs = []
+for chunk_ids, st, en in chunks_of(paper_text):
+    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
+    attn = torch.ones_like(ids_t)  # single unpadded chunk, so every token is real
+    with torch.no_grad():
+        with torch.amp.autocast(device, dtype=torch.bfloat16):
+            logit = model(ids_t, attn).float()
+    probs.append((torch.sigmoid(logit).item(), st, en))
+# Top-K chunks above threshold
+top_k = sorted(probs, key=lambda p: -p[0])[:2]
+top_k = [p for p in top_k if p[0] >= 0.4]
+# `top_k` is the list to hand off to the span-head model.
+```
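To check the windowing and top-K logic without downloading weights, the same steps can be run on token counts and made-up scores. This is a sketch only: `chunk_spans`, `select_top_k`, and the `fake_probs` values are hypothetical stand-ins, not part of the repository.

```python
def chunk_spans(n_tokens, max_tok=8192, stride=4096):
    """Return (start, end) token offsets, mirroring the windowing in chunks_of."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]
    spans = []
    for st in range(0, n_tokens, stride):
        en = min(st + max_tok, n_tokens)
        spans.append((st, en))
        if en == n_tokens:
            break
    return spans


def select_top_k(probs, k=2, threshold=0.4):
    """Keep the k highest-scoring (prob, start, end) triples that clear the threshold."""
    ranked = sorted(probs, key=lambda p: -p[0])[:k]
    return [p for p in ranked if p[0] >= threshold]


spans = chunk_spans(20000)
# Hypothetical scores, one per span, standing in for real model output.
fake_probs = [0.05, 0.10, 0.35, 0.92]
scored = [(p, st, en) for p, (st, en) in zip(fake_probs, spans)]
selected = select_top_k(scored)
```

Because the threshold is applied after taking the top 2, a document can hand off zero, one, or two chunks to the span head.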
+## Training data
+Built from the 2,384 training rows of
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
+For each train doc:
+- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
+- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
+- For each chunk, label `1` iff the gold funding statement (located via
+  verbatim substring match or `rapidfuzz.fuzz.partial_ratio_alignment ≥ 0.7`)
+  overlaps the chunk's character range by more than half its length, else `0`.
+
+Negative docs (no funding statement) contribute only negative chunks. Positive
+docs contribute one positive chunk (the one containing the gold statement)
+plus several negative chunks from the rest of the doc, so the negative class
+is naturally dominant (roughly 8 negatives for every positive).
+
+Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700
+negative).
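The overlap rule in the labeling step can be sketched as follows. This is one possible reading, not the training code: it assumes "half its length" refers to the gold statement's length and that both ranges are half-open character offsets. `overlap_len`, `chunk_label`, and the offsets are hypothetical.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Length of the intersection of two half-open ranges [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def chunk_label(chunk_start, chunk_end, gold_start, gold_end):
    """Return 1 if more than half of the gold span lies inside the chunk, else 0."""
    gold_len = gold_end - gold_start
    if gold_len <= 0:
        return 0  # no gold statement located in this document
    covered = overlap_len(chunk_start, chunk_end, gold_start, gold_end)
    return int(covered > gold_len / 2)


# Hypothetical character offsets: a 272-character statement starting at 30,000.
gold = (30_000, 30_272)
labels = [
    chunk_label(0, 25_000, *gold),       # chunk ends before the statement
    chunk_label(20_000, 30_100, *gold),  # covers only the first 100 characters
    chunk_label(20_000, 45_000, *gold),  # fully contains the statement
]
```

Under this reading, a statement that straddles a chunk boundary is assigned to whichever chunk holds the larger share of it.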
+## Loss
+Binary cross-entropy with `pos_weight = n_examples / n_positives` to
+counteract the class imbalance:
+```python
+loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
+loss = loss_fn(logits, labels)
+```
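To make the effect of `pos_weight` concrete, here is a pure-Python sketch of the per-example loss that `BCEWithLogitsLoss` computes when `pos_weight` is set. The counts are the approximate figures quoted in the training-data section, and `weighted_bce_with_logits` is a hypothetical helper written for illustration.

```python
import math


def weighted_bce_with_logits(logit, label, pos_weight):
    """Per-example loss: -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    # log(sigmoid(x)) = -log(1 + exp(-x)), written in a numerically stable form
    log_sig = -(max(-logit, 0.0) + math.log1p(math.exp(-abs(logit))))
    # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    log_one_minus_sig = log_sig - logit
    return -(pos_weight * label * log_sig + (1 - label) * log_one_minus_sig)


# Approximate counts from the training-set description (assumed, not exact).
n_examples, n_positives = 21_000, 2_300
pos_weight = n_examples / n_positives

loss_pos = weighted_bce_with_logits(0.0, 1, pos_weight)  # positive chunk, logit 0
loss_neg = weighted_bce_with_logits(0.0, 0, pos_weight)  # negative chunk, logit 0
```

With these counts `pos_weight` is about 9.1, so at the same logit a misclassified positive chunk costs roughly nine times as much as a misclassified negative one.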
+## Hyperparameters
+- Base: `answerdotai/ModernBERT-base` (149M, 8,192-token context)
+- Optimizer: AdamW, lr 5e-5, weight decay 0.01
+- Schedule: linear warmup (20 steps) + cosine decay
+- Epochs: 3
+- Batch: 2 per device × 8 grad accum = 16 effective
+- Mixed precision: bfloat16
+- Max sequence: 8,192 tokens
+- Trained on 1× H100 80GB
+- Saved checkpoint: `pytorch_model.bin` is the state dict after epoch 2
+  (zero-indexed), i.e. the third and final epoch
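The warmup-plus-cosine schedule listed above has the following general shape. This is a sketch under stated assumptions: the decay target of 0 and the total step count are not given in the README, and `lr_at` is a hypothetical helper, not the training script's scheduler.

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_steps=20):
    """Linear warmup from 0 to base_lr, then cosine decay toward 0 (assumed floor)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Hypothetical step count: ~21,000 chunks / 16 per optimizer step, for 3 epochs.
total_steps = math.ceil(21_000 / 16) * 3
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
```

With only 20 warmup steps out of several thousand, the learning rate spends almost the entire run on the cosine curve.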
+## Evaluation
+On the 597-row test split of
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`,
+treated as a **per-document binary task** (does the doc have any funding
+statement?): we score each candidate chunk and use the max probability as
+the document-level prediction. Threshold = 0.5.
+
+| Metric                       | Precision | Recall | F1     | F0.5   |
+|------------------------------|-----------|--------|--------|--------|
+| Doc-level funding detection  | 0.9831    | 0.9537 | 0.9682 | 0.9771 |
+
+Confusion-matrix counts at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
+
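The reported metrics follow directly from the confusion-matrix counts. The short check below recomputes them; `prf` is a hypothetical helper, and the F-beta formula is the standard one.

```python
def prf(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


tp, fp, fn, tn = 350, 6, 17, 224
precision, recall, f1 = prf(tp, fp, fn)
_, _, f_half = prf(tp, fp, fn, beta=0.5)
n_docs = tp + fp + fn + tn  # should equal the 597-row test split
```

F0.5 weights precision more heavily than recall, which is why it sits above F1 here.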
+**Chunk-recall caveat**: even when the doc-level prediction is correct, the
+**top-1 chunk** contains the gold statement verbatim only ~68% of the time
+(top-2 covers ~88%). This is why the downstream cascade uses **top-K=2**
+chunks: it raises the chance that the gold-containing chunk is fed to the
+span head.
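The document-level decision described in this section reduces to a max over chunk scores. A minimal sketch follows, assuming "at or above" as the comparison (the README does not say whether the threshold is inclusive); `doc_prediction` and the example scores are hypothetical.

```python
def doc_prediction(chunk_probs, threshold=0.5):
    """Score a document by its best chunk; predict positive at or above the threshold."""
    score = max(chunk_probs, default=0.0)  # an empty document scores 0.0
    return score, score >= threshold


# Hypothetical per-chunk probabilities for two documents.
with_funding = doc_prediction([0.03, 0.08, 0.91])
without_funding = doc_prediction([0.02, 0.11])
```

Taking the max means a single confident chunk is enough to flag the document, which suits short statements that appear in only one place.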
+## Intended use
+Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and
+stage 1 of the funding-extraction cascade. Useful when you want to skip
+expensive span extraction on papers with no funding statement (230 of the
+597 test documents have none).
+
+Not intended for: extraction (it only classifies chunks; pair with the
+span-head model for spans), classification of funding sources, or text
+outside the academic-paper domain.
+## Limitations
+- Trained only on arXiv-derived PDFs; behavior on other paper sources is
+  untested.
+- Top-1 chunk is wrong ~32% of the time even when doc-level is correct. Use
+  top-K ≥ 2 if you need recall.
+- Mean-pooling over up to 8,192 tokens dilutes the signal from a short
+  funding statement (median length ~272 characters), so the false-negative
+  rate at a strict threshold such as 0.9 is non-trivial. Use 0.5 (or lower)
+  and rely on the span-head model's `no_answer` head to suppress empty chunks.
+## Citation / acknowledgement
+Trained as part of an applied research cycle on the
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
+dataset by Comet.

modeling.py ADDED

	@@ -0,0 +1,34 @@

+"""Custom model class for funding-chunk-classifier-modernbert-base.
+Usage:
+    import torch
+    from huggingface_hub import hf_hub_download
+    from transformers import AutoTokenizer
+    from modeling import ChunkClassifier
+    REPO = "cometadata/funding-chunk-classifier-modernbert-base"
+    tokenizer = AutoTokenizer.from_pretrained(REPO)
+    model = ChunkClassifier().to("cuda")
+    sd = torch.load(hf_hub_download(REPO, "pytorch_model.bin"),
+                     map_location="cuda", weights_only=True)
+    model.load_state_dict(sd)
+    model.eval()
+"""
+import torch.nn as nn
+from transformers import AutoModel
+class ChunkClassifier(nn.Module):
+    """ModernBERT-base encoder + mean-pool + binary head for funding-chunk detection."""
+    def __init__(self, base: str = "answerdotai/ModernBERT-base"):
+        super().__init__()
+        self.encoder = AutoModel.from_pretrained(base)
+        self.head = nn.Linear(self.encoder.config.hidden_size, 1)
+    def forward(self, input_ids, attention_mask):
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        # Mean pool over real (non-padding) tokens
+        mask = attention_mask.unsqueeze(-1).float()
+        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
+        return self.head(pooled).squeeze(-1)  # one logit per chunk

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:566e172d7db3c9201011533a74503592384882622568f171b27bd47f1708e5ba
+size 596119575

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": false,
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}