Initial upload: ModernBERT-base span head for funding statement extraction


Files changed (5)

README.md +235 -0
modeling.py +43 -0
pytorch_model.bin +3 -0
tokenizer.json +0 -0
tokenizer_config.json +16 -0

README.md ADDED

	@@ -0,0 +1,235 @@

+---
+license: cc0-1.0
+base_model: answerdotai/ModernBERT-base
+library_name: transformers
+pipeline_tag: token-classification
+tags:
+- funding-extraction
+- arxiv
+- scholarly-communication
+- span-extraction
+- modernbert
+language:
+- en
+datasets:
+- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
+---
+# ModernBERT-base Span-Head — Funding Statement Extraction
+A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a
+chunk of an academic paper (up to 8,192 tokens), it predicts the start and end
+token positions of a funding statement, plus a "no-answer" logit (apply a sigmoid
+to get a probability) that flags chunks with no funding statement.
+This is the **rough-extraction stage** of a two-stage cascade:
+1. **Stage 1 (this model)**: ModernBERT-base + span head — finds the rough
+   span (loose-span F1 ≈ 0.96 at the 0.85 fuzzy-match threshold on the test set).
+2. **Stage 2 (separate)**: `cometadata/funding-cleaning-qwen3-4b-lora` —
+   cleans the rough span into the canonical, normalized funding statement
+   (strips LaTeX markers, joins paragraph breaks, etc.).
+Use this model alone if you only need approximate localization; chain with the
+cleanup LoRA if you need the cleaned canonical text.
+## Architecture
+The architecture is a custom `SpanHead` module (included in `modeling.py`):
+```python
+import torch
+import torch.nn as nn
+from transformers import AutoModel
+class SpanHead(nn.Module):
+    """ModernBERT encoder + start/end/no-answer heads."""
+    def __init__(self, base="answerdotai/ModernBERT-base"):
+        super().__init__()
+        self.encoder = AutoModel.from_pretrained(base)
+        h = self.encoder.config.hidden_size  # 768
+        self.start_head = nn.Linear(h, 1)
+        self.end_head = nn.Linear(h, 1)
+        self.no_answer_head = nn.Linear(h, 1)
+        self.dropout = nn.Dropout(0.1)
+    def forward(self, input_ids, attention_mask):
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        hidden = self.dropout(out.last_hidden_state)
+        start_logits = self.start_head(hidden).squeeze(-1)
+        end_logits = self.end_head(hidden).squeeze(-1)
+        # Mean-pool for no-answer
+        mask = attention_mask.unsqueeze(-1).float()
+        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
+        no_answer = self.no_answer_head(pooled).squeeze(-1)
+        return start_logits, end_logits, no_answer
+```
+## Use
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import AutoTokenizer
+from modeling import SpanHead  # bundled in this repo
+REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
+device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
+tokenizer = AutoTokenizer.from_pretrained(REPO)
+model = SpanHead("answerdotai/ModernBERT-base").to(device)
+state_dict = torch.load(
+    hf_hub_download(REPO, "pytorch_model.bin"),
+    map_location=device, weights_only=True,
+)
+model.load_state_dict(state_dict)
+model.eval()
+# `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the
+# acknowledgments-containing region). For long papers, run the model on
+# sliding 8192-tok windows (stride 4096) and pick the chunk with the lowest
+# no-answer probability.
+enc = tokenizer(chunk_text, return_offsets_mapping=True,
+                 add_special_tokens=False, truncation=True, max_length=8192)
+ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
+attn = torch.ones_like(ids)
+with torch.no_grad():
+    with torch.amp.autocast(device, dtype=torch.bfloat16):
+        start_logits, end_logits, no_answer = model(ids, attn)
+start_logits = start_logits.squeeze(0).float().cpu()
+end_logits = end_logits.squeeze(0).float().cpu()
+no_answer_prob = torch.sigmoid(no_answer).item()
+if no_answer_prob >= 0.5:
+    pred_span = ""  # this chunk has no funding statement
+else:
+    start = int(start_logits.argmax())
+    # Constrain end to lie at or after start, within a 300-token window
+    end_window = end_logits[start:start + 300]
+    end = start + int(end_window.argmax())
+    offsets = enc["offset_mapping"]
+    char_s = offsets[start][0]
+    char_e = offsets[end][1]
+    pred_span = chunk_text[char_s:char_e].strip()
+```
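The sliding-window scheme mentioned in the comments above (8,192-token windows, stride 4,096) could be sketched as follows. This is a minimal illustration, not code from this repo: `sliding_windows` is a hypothetical helper, and letting the final window be shorter than 8,192 tokens (rather than re-anchoring it to the end of the document) is an assumption.

```python
def sliding_windows(n_tokens, max_tok=8192, stride=4096):
    """Return (start, end) token ranges that cover a document of n_tokens.

    Hypothetical helper: `end` is exclusive, consecutive windows overlap by
    max_tok - stride tokens, and the last window may be shorter than max_tok.
    """
    if n_tokens <= max_tok:
        return [(0, n_tokens)]  # short document: a single chunk
    windows = []
    start = 0
    while True:
        end = min(start + max_tok, n_tokens)
        windows.append((start, end))
        if end == n_tokens:  # the document end has been covered
            break
        start += stride
    return windows


# A 20,000-token document is covered by four overlapping windows.
windows = sliding_windows(20_000)
```

Each `(start, end)` pair can then be used to slice the token ids (or, via the offset mapping, the text) before running the model on that chunk.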
+## Training data
+Built from the 2,384 training rows of
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
+For each positive doc (1,416 rows):
+- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
+- Locate the gold funding statement in `vlm_markdown` via verbatim substring,
+  or via `rapidfuzz.partial_ratio_alignment` if not verbatim. Convert
+  char-span to token-span.
+- Pick the 8,192-token sliding window (stride 4,096) that contains the gold
+  span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
+- Training labels: `start_tok` and `end_tok` indices within the chunk;
+  `no_answer = 0`.
+For each negative doc (968 rows):
+- Use the last 8,192-token chunk of the doc (since funding statements, when
+  they exist, are typically near the end).
+- Training labels: `start_tok = end_tok = 0`; `no_answer = 1`.
+About 5% of positive rows where no fuzzy alignment ≥ 0.7 could be found are
+dropped. Final training set: ~3,300 chunks.
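The label construction described above (character span to token span, then picking a window that fully contains the span) might look roughly like this. It is a sketch under assumptions: both helper names are hypothetical, the offsets are written by hand instead of coming from the ModernBERT tokenizer, and only the verbatim-substring case is shown (the `rapidfuzz` fallback is omitted).

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Map a [char_start, char_end) character span to inclusive token indices.

    `offsets` is a list of (start, end) character offsets, one per token,
    in the shape a fast tokenizer returns with return_offsets_mapping=True.
    """
    tok_start = tok_end = None
    for i, (s, e) in enumerate(offsets):
        if s == e:
            continue  # zero-width entries (e.g. special tokens) carry no text
        if e <= char_start or s >= char_end:
            continue  # token lies entirely outside the gold span
        if tok_start is None:
            tok_start = i
        tok_end = i
    return tok_start, tok_end


def locate_in_window(windows, tok_start, tok_end):
    """Return (window, start_in_chunk, end_in_chunk) for the first window that
    fully contains the token span, or None if no window does."""
    for w_start, w_end in windows:
        if w_start <= tok_start and tok_end < w_end:
            return (w_start, w_end), tok_start - w_start, tok_end - w_start
    return None


# Toy example with hand-written offsets (illustrative, not real tokenizer output).
text = "We thank NSF grant 123."
gold = "NSF grant 123"
offsets = [(0, 2), (3, 8), (9, 12), (13, 18), (19, 22), (22, 23)]
char_start = text.find(gold)  # verbatim match; -1 would trigger the fuzzy path
tok_span = char_span_to_token_span(offsets, char_start, char_start + len(gold))
```

The chunk-relative indices returned by `locate_in_window` correspond to the `start_tok` and `end_tok` labels; a `None` result would mean the span straddles every candidate window.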
+## Loss
+```
+loss = CE(start_logits[no_answer==0], gold_start)
+     + CE(end_logits[no_answer==0], gold_end)
+     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)
+```
+The start/end CE is masked out on negative chunks; the no-answer BCE is
+computed on all chunks. Padded positions in `start_logits`/`end_logits` are
+masked to `-1e4` so they can't be argmax'd.
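To make the masking concrete, here is the arithmetic of the loss spelled out with the standard library only, so it can be checked by hand. It is an illustrative sketch, not the training code: the dictionary layout and function names are hypothetical, and mean reduction over the batch is an assumption. A real training loop would more likely use `torch.nn.functional.cross_entropy` and `torch.nn.functional.binary_cross_entropy_with_logits`.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def span_loss(batch, pad_value=-1e4, no_answer_weight=1.0):
    """Start/end CE on positive chunks only, plus no-answer BCE on all chunks."""
    span_terms, na_terms = [], []
    for ex in batch:
        na_terms.append(bce_with_logits(ex["no_answer_logit"], ex["no_answer"]))
        if ex["no_answer"] == 1:
            continue  # negative chunk: start/end CE is masked out
        mask = ex["attention_mask"]
        # Padded positions are pushed to pad_value so they get ~zero probability.
        start = [x if m else pad_value for x, m in zip(ex["start_logits"], mask)]
        end = [x if m else pad_value for x, m in zip(ex["end_logits"], mask)]
        span_terms.append(
            cross_entropy(start, ex["gold_start"]) + cross_entropy(end, ex["gold_end"])
        )
    span = sum(span_terms) / len(span_terms) if span_terms else 0.0
    return span + no_answer_weight * sum(na_terms) / len(na_terms)


batch = [
    {  # positive chunk: two real tokens followed by two padded positions
        "start_logits": [0.0, 0.0, 0.0, 0.0], "end_logits": [0.0, 0.0, 0.0, 0.0],
        "attention_mask": [1, 1, 0, 0], "gold_start": 0, "gold_end": 1,
        "no_answer_logit": 0.0, "no_answer": 0,
    },
    {  # negative chunk: contributes only to the no-answer term
        "start_logits": [0.0, 0.0, 0.0, 0.0], "end_logits": [0.0, 0.0, 0.0, 0.0],
        "attention_mask": [1, 1, 1, 1], "gold_start": 0, "gold_end": 0,
        "no_answer_logit": 0.0, "no_answer": 1,
    },
]
loss = span_loss(batch)
```

With uniform logits the masked softmax spreads its mass over the two real tokens only, so each CE term is log 2 rather than log 4; that is the effect of the `-1e4` fill.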
+## Hyperparameters
+- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
+- Optimizer: AdamW, lr 5e-5, weight decay 0.01
+- Schedule: linear warmup (30 steps) + cosine decay
+- Epochs: 4
+- Batch: 4 per device × 4 grad accum = 16 effective
+- Mixed precision: bfloat16
+- Max sequence: 8,192 tokens
+- Trained on 1× H100 80GB
+## Evaluation
+On the 597-row test split of
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
+At inference we ran this model on the top-2 chunks selected by a separate
+ModernBERT-base chunk classifier (binary funding-yes, mean-pooled
+classification head) and picked the chunk with the lower no-answer probability.
+| Metric                                | Precision | Recall | F1     | F0.5   |
+|---------------------------------------|-----------|--------|--------|--------|
+| Binary detection                      | 0.9887    | 0.9510 | 0.9694 | 0.9809 |
+| Strict span (`token_sort_ratio≥0.95`) | 0.7365    | 0.7084 | 0.7222 | 0.7307 |
+| Loose span (max-of-4 fuzz ≥ 0.85)     | 0.9745    | 0.9373 | 0.9556 | 0.9668 |
+**Hard ceiling note**: ~28% of test gold statements are not verbatim
+substrings of any source representation in the dataset (the dataset's labels
+were normalized by frontier models — whitespace, LaTeX markers, paragraph
+joins). The 0.95 strict threshold is unforgiving of those normalizations even
+on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any
+single-stage extractive model. The loose-span F1 of 0.96 is closer to the
+practical extractive ceiling.
+For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora`
+which cleans the rough span into the canonical text.
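For readers who want to recompute the F1 and F0.5 columns of the table above from the reported precision and recall, the standard F-beta formula is enough. The function name is illustrative; small differences in the last digit are expected because the table's precision and recall are already rounded.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Precision/recall pairs copied from the evaluation table.
rows = {
    "binary": (0.9887, 0.9510),
    "strict": (0.7365, 0.7084),
    "loose": (0.9745, 0.9373),
}
scores = {name: (f_beta(p, r), f_beta(p, r, beta=0.5)) for name, (p, r) in rows.items()}
```

F0.5 exceeds F1 in every row because precision is higher than recall throughout.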
+## Cascade pipeline
+For long papers (> 8,192 tokens), use a chunk-classifier first to pick the
+chunk most likely to contain the funding statement:
+```python
+# Pseudocode for the full cascade
+chunks = sliding_windows(doc, max_tok=8192, stride=4096)
+chunk_probs = [chunk_classifier(c) for c in chunks]
+top_chunk = chunks[argmax(chunk_probs)]
+rough_span = spanhead_model(top_chunk)        # this model
+clean_span = cleanup_lora(rough_span, top_chunk)  # other model
+```
+A simple heuristic alternative to the chunk classifier (also works fine):
+just use the last 8,192-token window of the document — funding statements are
+usually near the end. This loses a few percentage points of recall on papers
+with funding info mid-document.
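The evaluation setup described earlier (run the span model on the top-2 chunks from the chunk classifier, keep the one with the lower no-answer probability) can be sketched as a small selection routine. Everything here is a hypothetical stand-in: `select_chunk` is not part of this repo, and `no_answer_fn` represents a call to this model that returns the sigmoid of its no-answer logit.

```python
def select_chunk(chunks, chunk_scores, no_answer_fn, top_k=2):
    """Return the index of the chunk to extract from.

    chunks        list of chunk texts
    chunk_scores  funding-yes scores from a chunk classifier, one per chunk
    no_answer_fn  callable mapping a chunk to a no-answer probability
    """
    # Keep the top_k chunks according to the classifier.
    ranked = sorted(range(len(chunks)), key=lambda i: chunk_scores[i], reverse=True)
    candidates = ranked[:top_k]
    # Among those, prefer the chunk the span model is most confident about.
    return min(candidates, key=lambda i: no_answer_fn(chunks[i]))


# Toy stand-ins for the two models.
chunks = ["intro", "methods", "acknowledgments"]
chunk_scores = [0.1, 0.9, 0.8]
fake_no_answer = {"intro": 0.0, "methods": 0.7, "acknowledgments": 0.2}.get
best = select_chunk(chunks, chunk_scores, fake_no_answer)
```

In the toy example the first chunk has the lowest no-answer probability but is never considered, because the classifier ranked it outside the top 2. Setting `top_k=1` reduces this to the argmax selection used in the pseudocode.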
+## Intended use
+Extraction of the **rough span** containing a funding acknowledgment from
+arXiv paper text (or similar academic markdown). Designed to be the first
+stage of a two-stage cascade with the cleanup LoRA, but usable on its own if
+you only need approximate localization.
+Not intended for: classification of funding sources, downstream
+funder/grant/scheme parsing, or extraction from non-paper text.
+## Limitations
+- Trained on arXiv-derived PDFs only; behavior on other paper sources is
+  untested.
+- Outputs a rough span — for canonical, downstream-ready text, chain with the
+  cleanup LoRA.
+- Will occasionally pick the wrong sibling sentence when an acknowledgments
+  section contains multiple funding statements (each person's own grants);
+  this is the dominant failure mode of the strict-F1 evaluation.
+## Citation / acknowledgement
+Trained as part of an applied research cycle on the
+`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
+dataset by Comet.

modeling.py ADDED

	@@ -0,0 +1,43 @@

+"""Custom model class for funding-extraction-modernbert-base-spanhead.
+Usage:
+    import torch
+    from huggingface_hub import hf_hub_download
+    from transformers import AutoTokenizer
+    from modeling import SpanHead
+    REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
+    tokenizer = AutoTokenizer.from_pretrained(REPO)
+    model = SpanHead().to("cuda")
+    sd = torch.load(hf_hub_download(REPO, "pytorch_model.bin"),
+                     map_location="cuda", weights_only=True)
+    model.load_state_dict(sd)
+    model.eval()
+"""
+import torch
+import torch.nn as nn
+from transformers import AutoModel
+class SpanHead(nn.Module):
+    """ModernBERT-base encoder + start/end/no-answer heads for funding span extraction."""
+    def __init__(self, base: str = "answerdotai/ModernBERT-base"):
+        super().__init__()
+        self.encoder = AutoModel.from_pretrained(base)
+        h = self.encoder.config.hidden_size  # 768
+        self.start_head = nn.Linear(h, 1)
+        self.end_head = nn.Linear(h, 1)
+        self.no_answer_head = nn.Linear(h, 1)
+        self.dropout = nn.Dropout(0.1)
+    def forward(self, input_ids, attention_mask):
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        hidden = self.dropout(out.last_hidden_state)
+        start_logits = self.start_head(hidden).squeeze(-1)
+        end_logits = self.end_head(hidden).squeeze(-1)
+        # Mean-pool for no-answer
+        mask = attention_mask.unsqueeze(-1).float()
+        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
+        no_answer = self.no_answer_head(pooled).squeeze(-1)
+        return start_logits, end_logits, no_answer

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8a5f09370d87bf87db1fedb3502a17327b6eca1f6d34fc75b2187be1dde37bc0
+size 596127249

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": false,
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}