	---
	license: cc0-1.0
	base_model: answerdotai/ModernBERT-base
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- funding-extraction
	- arxiv
	- scholarly-communication
	- chunk-classification
	- modernbert
	language:
	- en
	datasets:
	- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
	---

	# ModernBERT-base Chunk Classifier — Funding Statement Localization

	A binary classifier on top of `answerdotai/ModernBERT-base` that scores a
	single chunk (up to 8,192 tokens) of an academic paper for the presence of a
	funding statement. It serves as stage 1 of a three-stage funding-extraction
	cascade, narrowing a long PDF down to its most likely chunks before the more
	expensive span-extraction and cleanup stages run.

	The full cascade:

	1. Stage 1 (this model): for each ≤8,192-token chunk of the paper, predict a
	scalar `P(this chunk contains a funding statement)`, then take the top-K
	chunks above a threshold (we use top-2 above 0.4).
	2. Stage 2 (span head):
	[`cometadata/funding-extraction-modernbert-base-spanhead`](https://huggingface.co/cometadata/funding-extraction-modernbert-base-spanhead)
	picks the exact start/end token within the top chunk.
	3. Stage 3 (cleanup LoRA):
	[`cometadata/funding-cleaning-qwen3-4b-lora`](https://huggingface.co/cometadata/funding-cleaning-qwen3-4b-lora)
	strips LaTeX markers and normalizes whitespace in the extracted span.

	You can use this model standalone if you only need to flag whether a chunk
	(or document) contains funding language at all (document-level F1 of 0.97 on
	the test set).

	## Architecture

	The architecture is a custom `ChunkClassifier` module (included in
	`modeling.py`):

	```python
	import torch.nn as nn
	from transformers import AutoModel


	class ChunkClassifier(nn.Module):
	    """ModernBERT encoder + mean-pool + binary head."""

	    def __init__(self, base="answerdotai/ModernBERT-base"):
	        super().__init__()
	        self.encoder = AutoModel.from_pretrained(base)
	        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

	    def forward(self, input_ids, attention_mask):
	        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
	        # Mean pool over real (non-padding) tokens
	        mask = attention_mask.unsqueeze(-1).float()
	        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
	        return self.head(pooled).squeeze(-1)  # one logit per chunk
	```
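To make the pooling step concrete, here is a minimal pure-Python sketch of masked mean pooling on toy lists rather than tensors. The helper name `masked_mean_pool` and the toy values are illustrative assumptions and are not part of `modeling.py`; the function only mirrors the arithmetic in `forward` for a single sequence.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding.

    hidden_states: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, one per token.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                totals[i] += value
    # Mirror clamp(min=1): avoid dividing by zero on an all-padding input.
    count = max(sum(attention_mask), 1)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (hypothetical values).
pooled = masked_mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
print(pooled)  # [2.0, 3.0]
```

The padding vector has no effect on the result, which is the point of multiplying by the mask before summing.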

	## Use

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoTokenizer
	from modeling import ChunkClassifier  # bundled in this repo

	REPO = "cometadata/funding-chunk-classifier-modernbert-base"
	device = "cuda"

	tokenizer = AutoTokenizer.from_pretrained(REPO)
	model = ChunkClassifier("answerdotai/ModernBERT-base").to(device)
	state_dict = torch.load(
	    hf_hub_download(REPO, "pytorch_model.bin"),
	    map_location=device, weights_only=True,
	)
	model.load_state_dict(state_dict)
	model.eval()

	# For a long paper, slide an 8192-token window with stride 4096.
	def chunks_of(text, max_tok=8192, stride=4096):
	    enc = tokenizer(text, add_special_tokens=False, truncation=False)
	    ids = enc["input_ids"]
	    if len(ids) <= max_tok:
	        yield ids, 0, len(ids)
	        return
	    for st in range(0, len(ids), stride):
	        en = min(st + max_tok, len(ids))
	        yield ids[st:en], st, en
	        if en == len(ids):
	            break

	# `paper_text` is the full text of the paper as a string (supply your own).
	probs = []
	for chunk_ids, st, en in chunks_of(paper_text):
	    ids_t = torch.tensor(chunk_ids).unsqueeze(0).to(device)
	    attn = torch.ones_like(ids_t)
	    with torch.no_grad():
	        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
	            logit = model(ids_t, attn).float()
	    probs.append((torch.sigmoid(logit).item(), st, en))

	# Top-K chunks above threshold
	top_k = sorted(probs, key=lambda p: -p[0])[:2]
	top_k = [p for p in top_k if p[0] >= 0.4]
	# `top_k` is the list to hand off to the span-head model.
	```
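As a quick sanity check of the windowing logic above, the sketch below applies the same loop to a bare token count instead of tokenizer output. `window_spans` is a hypothetical helper that returns only the `(start, end)` offsets, and the document lengths are made-up examples.

```python
def window_spans(n_tokens, max_tok=8192, stride=4096):
    """Return (start, end) token offsets of each sliding window."""
    if n_tokens <= max_tok:
        return [(0, n_tokens)]
    spans = []
    for start in range(0, n_tokens, stride):
        end = min(start + max_tok, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # this window already reaches the end of the document
    return spans


print(window_spans(5000))   # [(0, 5000)]
print(window_spans(20000))  # [(0, 8192), (4096, 12288), (8192, 16384), (12288, 20000)]
```

Because consecutive windows overlap by 4,096 tokens, any span of at most 4,096 tokens falls entirely inside at least one window.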

	## Training data

	Built from the 2,384 training rows of
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.

	For each train doc:
	- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
	- Slide an 8,192-token window with stride 4,096 over the tokenized doc.
	- For each chunk, label `1` iff the gold funding statement (located via
	verbatim substring match or `rapidfuzz.fuzz.partial_ratio_alignment` with a
	score ≥ 0.7) overlaps the chunk's character range by more than half of the
	statement's length, else `0`.
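The labeling rule in the last bullet can be sketched as a small overlap check. This is an illustration under stated assumptions, not the training script: `label_chunk` is a hypothetical helper, spans are half-open character offsets, "half its length" is read as half of the statement's length, and the step that locates the statement (substring or fuzzy match) is not shown.

```python
def label_chunk(chunk_span, statement_span):
    """Return 1 if more than half of the statement lies inside the chunk.

    Both spans are (start, end) character offsets, end exclusive.
    """
    chunk_start, chunk_end = chunk_span
    stmt_start, stmt_end = statement_span
    overlap = max(0, min(chunk_end, stmt_end) - max(chunk_start, stmt_start))
    stmt_len = stmt_end - stmt_start
    return int(stmt_len > 0 and overlap > stmt_len / 2)


# Hypothetical offsets: a 200-character statement near a chunk boundary.
print(label_chunk((0, 1000), (800, 1000)))    # 1: fully inside the chunk
print(label_chunk((0, 1000), (900, 1100)))    # 0: exactly half is not "more than half"
print(label_chunk((500, 1500), (900, 1100)))  # 1: the overlapping neighbor chunk covers it
```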

	Negative docs (no funding statement) contribute only negative chunks;
	positive docs contribute one positive chunk (the one containing the gold
	statement) plus several negative chunks from the rest of the doc, so the
	negative class is naturally dominant (roughly 8 negatives per positive, going
	by the approximate counts below).

	Final training set: roughly 21,000 chunks (~2,300 positive / ~18,700
	negative).

	## Loss

	Binary cross-entropy with `pos_weight = n_examples / n_positives` to
	counteract the class imbalance:

	```python
	import torch
	import torch.nn as nn

	loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_examples / n_positives))
	loss = loss_fn(logits, labels.float())  # BCEWithLogitsLoss expects float targets
	```
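For intuition, the per-example loss that `BCEWithLogitsLoss` computes with `pos_weight` can be written out by hand. The sketch below is plain Python with a hypothetical helper name; the counts are the approximate figures from the training-data section, so the resulting weight is an estimate and not the exact value used in training.

```python
import math


def weighted_bce_with_logits(logit, label, pos_weight):
    """Per-example BCE-with-logits where the positive term is scaled by pos_weight."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(prob)
             + (1 - label) * math.log(1.0 - prob))


# Approximate counts from this model card (not exact training values).
n_examples, n_positives = 21_000, 2_300
pos_weight = n_examples / n_positives
print(round(pos_weight, 2))  # 9.13

# At logit 0 an unweighted example costs ln(2); a positive costs pos_weight times that.
negative_loss = weighted_bce_with_logits(0.0, 0, pos_weight)
positive_loss = weighted_bce_with_logits(0.0, 1, pos_weight)
print(positive_loss / negative_loss)
```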

	## Hyperparameters

	- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
	- Optimizer: AdamW, lr 5e-5, weight decay 0.01
	- Schedule: linear warmup (20 steps) + cosine decay
	- Epochs: 3
	- Batch: 2 per device × 8 grad accum = 16 effective
	- Mixed precision: bfloat16
	- Max sequence: 8,192 tokens
	- Trained on 1× H100 80GB
	- Saved checkpoint: `pytorch_model.bin` is the final state dict (epoch 2,
	counting from 0, of the 3 training epochs)
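The warmup-plus-cosine schedule listed above can be sketched as a plain function of the step index. This is a minimal illustration, assuming the learning rate decays to zero at the last step (the shape used by `get_cosine_schedule_with_warmup` in `transformers` with default settings); the card does not say which implementation was used, and both `lr_at` and the total step count are hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_steps=20):
    """Linear warmup to base_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical total: about 21,000 chunks / 16 per step * 3 epochs.
total = 3_900
for step in (0, 10, 20, total // 2, total):
    print(step, lr_at(step, total))
```

With only 20 warmup steps out of several thousand, nearly the whole run is spent on the cosine portion of the curve.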

	## Evaluation

	Evaluated on the 597-row test split of
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`,
	treated as a per-document binary task (does the document contain any funding
	statement?). We score each candidate chunk and use the maximum chunk
	probability as the document-level prediction, with a threshold of 0.5.

	| Metric | Precision | Recall | F1 | F0.5 |
	|------------------------------|-----------|--------|--------|--------|
	| Doc-level funding detection | 0.9831 | 0.9537 | 0.9682 | 0.9771 |

	Confusion-matrix counts at threshold 0.5: TP=350, FP=6, FN=17, TN=224.
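The reported metrics follow directly from these four counts. The short check below recomputes them; `prf` is a hypothetical helper written for this illustration and uses the standard F-beta formula.

```python
def prf(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


tp, fp, fn, tn = 350, 6, 17, 224
precision, recall, f1 = prf(tp, fp, fn)
_, _, f_half = prf(tp, fp, fn, beta=0.5)

print(tp + fp + fn + tn)  # 597
print(round(precision, 4), round(recall, 4), round(f1, 4), round(f_half, 4))
# 0.9831 0.9537 0.9682 0.9771
```

F0.5 weights precision more heavily than recall, which is why it sits above F1 here: the model makes fewer false positives (6) than false negatives (17).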

	Chunk-recall caveat: even when the doc-level prediction is correct, the
	top-1 chunk contains the gold statement verbatim only ~68% of the time
	(top-2 covers ~88%). This is why the downstream cascade uses top-K=2
	chunks: it raises the chance that the gold-containing chunk is fed to the
	span head.
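The top-K selection described here amounts to a sort followed by a threshold filter. A minimal sketch, assuming chunk scores arrive as `(probability, start, end)` tuples as in the usage example; `select_top_k` and the scores below are hypothetical.

```python
def select_top_k(scored_chunks, k=2, threshold=0.4):
    """Keep the k highest-scoring chunks, then drop any below the threshold.

    scored_chunks: list of (probability, start, end) tuples.
    """
    ranked = sorted(scored_chunks, key=lambda item: item[0], reverse=True)
    return [item for item in ranked[:k] if item[0] >= threshold]


# Hypothetical scores for a four-chunk paper.
scores = [
    (0.12, 0, 8192),
    (0.55, 4096, 12288),
    (0.81, 8192, 16384),
    (0.35, 12288, 20000),
]
print(select_top_k(scores))       # [(0.81, 8192, 16384), (0.55, 4096, 12288)]
print(select_top_k(scores, k=1))  # [(0.81, 8192, 16384)]
```

Filtering after ranking means a paper with only one confident chunk hands a single chunk to the span head, and a paper with none hands over an empty list.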

	## Intended use

	Doc-level filtering of arXiv-derived PDFs for funding-statement presence, and
	stage-1 of the funding-extraction cascade. Useful when you want to skip
	expensive span extraction on most papers (a sizable fraction of arXiv papers
	have no funding statement).
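For doc-level filtering, the decision rule from the evaluation section (maximum chunk probability against a threshold) fits in a few lines. This is a sketch with a hypothetical helper name and made-up probabilities; treating an empty document as negative is an assumption of this example.

```python
def doc_has_funding(chunk_probs, threshold=0.5):
    """Document-level decision: highest chunk probability against a threshold."""
    return bool(chunk_probs) and max(chunk_probs) >= threshold


print(doc_has_funding([0.03, 0.07, 0.91]))  # True
print(doc_has_funding([0.03, 0.07, 0.21]))  # False
```

Papers for which this returns `False` can skip the span-extraction and cleanup stages entirely.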

	Not intended for: extraction (it only classifies chunks; pair with the
	span-head model for spans), classification of funding sources, or text
	outside the academic-paper domain.

	## Limitations

	- Trained only on arXiv-derived PDFs; behavior on other paper sources is
	untested.
	- The top-1 chunk misses the gold statement ~32% of the time even when the
	doc-level prediction is correct. Use top-K ≥ 2 if you need recall.
	- Mean-pooling over 8,192 tokens dilutes the signal from a short funding
	statement (median ~272 characters), so the false-negative rate at a strict
	threshold of 0.9 is non-trivial. Use 0.5 (or lower) and rely on the
	`no_answer` head of the span-head model to suppress empty chunks.

	## Citation / acknowledgement

	Trained as part of an applied research cycle on the
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
	dataset by Comet.