	---
	license: cc0-1.0
	base_model: answerdotai/ModernBERT-base
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- funding-extraction
	- arxiv
	- scholarly-communication
	- span-extraction
	- modernbert
	language:
	- en
	datasets:
	- cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
	---

	# ModernBERT-base Span-Head — Funding Statement Extraction

	A custom span-extraction head on top of `answerdotai/ModernBERT-base`. Given a
	chunk of an academic paper (up to 8,192 tokens), it predicts the start and end
	token positions of a funding statement, plus a "no-answer" probability for
	documents with no funding statement.

	This is the rough-extraction stage of a two-stage cascade:

	1. Stage 1 (this model): ModernBERT-base + span head. Finds the rough
	span (loose-span F1 ≈ 0.96 on the test set, at a fuzzy-match threshold of 0.85).
	2. Stage 2 (separate): `cometadata/funding-cleaning-qwen3-4b-lora` —
	cleans the rough span into the canonical, normalized funding statement
	(strips LaTeX markers, joins paragraph breaks, etc.).

	Use this model alone if you only need approximate localization; chain with the
	cleanup LoRA if you need the cleaned canonical text.

	## Architecture

	The architecture is a custom `SpanHead` module (included in `modeling.py`):

	```python
	import torch
	import torch.nn as nn
	from transformers import AutoModel


	class SpanHead(nn.Module):
	    """ModernBERT encoder + start/end/no-answer heads."""

	    def __init__(self, base="answerdotai/ModernBERT-base"):
	        super().__init__()
	        self.encoder = AutoModel.from_pretrained(base)
	        h = self.encoder.config.hidden_size  # 768
	        self.start_head = nn.Linear(h, 1)
	        self.end_head = nn.Linear(h, 1)
	        self.no_answer_head = nn.Linear(h, 1)
	        self.dropout = nn.Dropout(0.1)

	    def forward(self, input_ids, attention_mask):
	        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
	        hidden = self.dropout(out.last_hidden_state)
	        start_logits = self.start_head(hidden).squeeze(-1)
	        end_logits = self.end_head(hidden).squeeze(-1)
	        # Mean-pool for no-answer
	        mask = attention_mask.unsqueeze(-1).float()
	        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
	        no_answer = self.no_answer_head(pooled).squeeze(-1)
	        return start_logits, end_logits, no_answer
	```

	## Use

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoTokenizer
	from modeling import SpanHead # bundled in this repo

	REPO = "cometadata/funding-extraction-modernbert-base-spanhead"
	device = "cuda"

	tokenizer = AutoTokenizer.from_pretrained(REPO)
	model = SpanHead("answerdotai/ModernBERT-base").to(device)
	state_dict = torch.load(
	    hf_hub_download(REPO, "pytorch_model.bin"),
	    map_location=device, weights_only=True,
	)
	model.load_state_dict(state_dict)
	model.eval()

	# `chunk_text` should be a ≤8192-token chunk of the paper (e.g., the
	# acknowledgments-containing region). For long papers, run the model on
	# sliding 8192-tok windows (stride 4096) and pick the chunk with the lowest
	# no-answer probability.

	enc = tokenizer(chunk_text, return_offsets_mapping=True,
	                add_special_tokens=False, truncation=True, max_length=8192)
	ids = torch.tensor(enc["input_ids"]).unsqueeze(0).to(device)
	attn = torch.ones_like(ids)

	with torch.no_grad():
	    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
	        start_logits, end_logits, no_answer = model(ids, attn)

	start_logits = start_logits.squeeze(0).float().cpu()
	end_logits = end_logits.squeeze(0).float().cpu()
	no_answer_prob = torch.sigmoid(no_answer).item()

	if no_answer_prob >= 0.5:
	    pred_span = ""  # this chunk has no funding statement
	else:
	    start = int(start_logits.argmax())
	    # Constrain end to be after start and within ~300 tokens
	    end_window = end_logits[start:start + 300]
	    end = start + int(end_window.argmax())
	    offsets = enc["offset_mapping"]
	    char_s = offsets[start][0]
	    char_e = offsets[end][1]
	    pred_span = chunk_text[char_s:char_e].strip()
	```
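
The sliding-window advice in the comment above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the paper is tokenized once and sliced by token index, and `sliding_windows`, `pick_best_window`, and the `no_answer_prob` callable are hypothetical names (the callable would wrap the forward pass shown above and return `sigmoid(no_answer)`).

```python
from typing import Callable, List, Sequence, Tuple


def sliding_windows(token_ids: Sequence[int], max_tok: int = 8192,
                    stride: int = 4096) -> List[Tuple[int, Sequence[int]]]:
    """Return (start_offset, window) pairs that together cover the sequence.

    The final window may be shorter than max_tok.
    """
    if len(token_ids) <= max_tok:
        return [(0, token_ids)]
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + max_tok]))
        if start + max_tok >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def pick_best_window(windows: List[Tuple[int, Sequence[int]]],
                     no_answer_prob: Callable[[Sequence[int]], float]):
    """Choose the window with the lowest no-answer probability.

    `no_answer_prob` is a hypothetical scoring function supplied by the caller.
    Ties go to the earliest window.
    """
    return min(windows, key=lambda w: no_answer_prob(w[1]))
```

The returned start offset lets you map predicted token positions in the winning window back to positions in the full document.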

	## Training data

	Built from the 2,384 training rows of
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.

	For each positive doc (1,416 rows):
	- Tokenize `vlm_markdown` with the ModernBERT tokenizer.
	- Locate the gold funding statement in `vlm_markdown` via verbatim substring,
	or via `rapidfuzz.partial_ratio_alignment` if not verbatim. Convert
	char-span to token-span.
	- Pick the 8,192-token sliding window (stride 4,096) that contains the gold
	span fully. If the doc is ≤ 8,192 tokens, use the whole doc as one chunk.
	- Training labels: `start_tok` and `end_tok` indices within the chunk;
	`no_answer = 0`.

	For each negative doc (968 rows):
	- Use the last 8,192-token chunk of the doc (since funding statements, when
	they exist, are typically near the end).
	- Training labels: `start_tok = end_tok = 0`; `no_answer = 1`.

	About 5% of positive rows, those for which no fuzzy alignment scoring ≥ 0.7
	could be found, are dropped. Final training set: ~3,300 chunks.
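
The char-span to token-span conversion and the window choice for positive docs can be sketched as below. This is an illustration under assumptions, not the actual preprocessing script: `offsets` is the tokenizer's `offset_mapping` (a list of `(char_start, char_end)` pairs), the helper names are hypothetical, and taking the first window that fully contains the span is an assumed tie-break.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(offsets: List[Tuple[int, int]], char_start: int,
                            char_end: int) -> Optional[Tuple[int, int]]:
    """Map a [char_start, char_end) span to inclusive token indices."""
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s == e:
            continue  # skip zero-width entries such as special tokens
        if e > char_start and s < char_end:  # token overlaps the span
            if start_tok is None:
                start_tok = i
            end_tok = i
    if start_tok is None:
        return None
    return start_tok, end_tok


def window_containing(start_tok: int, end_tok: int, n_tokens: int,
                      max_tok: int = 8192, stride: int = 4096) -> Optional[int]:
    """Offset of the first sliding window that fully contains the span.

    Returns 0 for documents that fit in one window, and None if no window
    contains the whole span.
    """
    if n_tokens <= max_tok:
        return 0
    offset = 0
    while True:
        if offset <= start_tok and end_tok < offset + max_tok:
            return offset
        if offset + max_tok >= n_tokens:
            return None
        offset += stride
```

The in-chunk labels are then `start_tok - offset` and `end_tok - offset`.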

	## Loss

	```
	loss = CE(start_logits[no_answer==0], gold_start)
	     + CE(end_logits[no_answer==0], gold_end)
	     + 1.0 * BCE_with_logits(no_answer_logit, no_answer_label)
	```

	The start/end CE is masked out on negative chunks; the no-answer BCE is
	computed on all chunks. Padded positions in `start_logits`/`end_logits` are
	masked to `-1e4` so they can't be argmax'd.
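
To make the masking concrete, here is a dependency-free reference of the loss arithmetic on plain Python lists. It is a sketch rather than the training code: mean reduction over chunks is an assumption, and all function names are hypothetical. A tensor implementation would more likely use `torch.nn.functional.cross_entropy` and `binary_cross_entropy_with_logits`.

```python
import math
from typing import Sequence

PAD_LOGIT = -1e4  # value assigned to padded positions, as described above


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy on a raw logit, in the numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def span_loss(start_logits, end_logits, no_answer_logits,
              gold_start, gold_end, no_answer_labels, attention_mask) -> float:
    """Start/end CE on positive chunks only, plus no-answer BCE on all chunks.

    Each argument is a per-chunk list. Mean reduction is an assumption.
    """
    span_terms, bce_terms = [], []
    for s, e, na, gs, ge, label, mask in zip(
            start_logits, end_logits, no_answer_logits,
            gold_start, gold_end, no_answer_labels, attention_mask):
        bce_terms.append(bce_with_logits(na, float(label)))
        if label == 1:
            continue  # negative chunk: start/end CE is masked out
        s = [x if m else PAD_LOGIT for x, m in zip(s, mask)]
        e = [x if m else PAD_LOGIT for x, m in zip(e, mask)]
        span_terms.append(cross_entropy(s, gs) + cross_entropy(e, ge))
    span = sum(span_terms) / len(span_terms) if span_terms else 0.0
    return span + 1.0 * sum(bce_terms) / len(bce_terms)
```

A small reference like this can serve as a unit-test oracle for a batched tensor implementation.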

	## Hyperparameters

	- Base: `answerdotai/ModernBERT-base` (149M parameters, 8,192-token context)
	- Optimizer: AdamW, lr 5e-5, weight decay 0.01
	- Schedule: linear warmup (30 steps) + cosine decay
	- Epochs: 4
	- Batch: 4 per device × 4 grad accum = 16 effective
	- Mixed precision: bfloat16
	- Max sequence: 8,192 tokens
	- Trained on 1× H100 80GB
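
The warmup-plus-cosine schedule above can be written as a small function. This is a sketch under assumptions: the decay is taken to reach zero at the final step, and `lr_at` and `total_steps` are hypothetical names. The default of 825 steps is only a rough estimate (about 3,300 chunks / 16 per step × 4 epochs), not a reported figure.

```python
import math


def lr_at(step: int, total_steps: int = 825, base_lr: float = 5e-5,
          warmup_steps: int = 30) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Under these assumptions the rate peaks at `base_lr` on step 30 and falls to zero at the last step.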

	## Evaluation

	On the 597-row test split of
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`.
	At inference, we ran this model on the top-2 chunks selected by a separate
	ModernBERT-base chunk classifier (binary funding-yes, mean-pooled
	classification head) and kept the chunk with the lower no-answer probability.

	| Metric                                | Precision | Recall | F1     | F0.5   |
	|---------------------------------------|-----------|--------|--------|--------|
	| Binary detection                      | 0.9887    | 0.9510 | 0.9694 | 0.9809 |
	| Strict span (`token_sort_ratio≥0.95`) | 0.7365    | 0.7084 | 0.7222 | 0.7307 |
	| Loose span (max-of-4 fuzz ≥ 0.85)     | 0.9745    | 0.9373 | 0.9556 | 0.9668 |
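
The F1 and F0.5 columns follow from precision and recall through the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R). The short helper below (a hypothetical name, not part of the evaluation code) recomputes them. Because it starts from the rounded precision and recall shown in the table, results can differ from the reported values in the fourth decimal place.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily (F0.5); beta = 1 gives F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Recompute the table rows from their precision/recall columns
rows = {
    "binary": (0.9887, 0.9510),
    "strict": (0.7365, 0.7084),
    "loose": (0.9745, 0.9373),
}
scores = {name: (f_beta(p, r), f_beta(p, r, beta=0.5))
          for name, (p, r) in rows.items()}
```

F0.5 exceeds F1 in every row because precision is higher than recall throughout.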

	Hard ceiling note: ~28% of test gold statements are not verbatim
	substrings of any source representation in the dataset (the dataset's labels
	were normalized by frontier models — whitespace, LaTeX markers, paragraph
	joins). The 0.95 strict threshold is unforgiving of those normalizations even
	on perfectly extracted source-spans, so strict F1 is capped near 0.73 for any
	single-stage extractive model. The loose-span F1 of 0.96 is closer to the
	practical extractive ceiling.

	For higher strict F1, chain with `cometadata/funding-cleaning-qwen3-4b-lora`,
	which cleans the rough span into the canonical text.

	## Cascade pipeline

	For long papers (> 8,192 tokens), use a chunk-classifier first to pick the
	chunk most likely to contain the funding statement:

	```python
	# Pseudocode for the full cascade
	chunks = sliding_windows(doc, max_tok=8192, stride=4096)
	chunk_probs = [chunk_classifier(c) for c in chunks]
	top_chunk = chunks[argmax(chunk_probs)]
	rough_span = spanhead_model(top_chunk) # this model
	clean_span = cleanup_lora(rough_span, top_chunk) # other model
	```

	A simpler alternative to the chunk classifier, which also works well, is to
	use only the last 8,192-token window of the document, since funding statements
	are usually near the end. This costs a few percentage points of recall on
	papers whose funding information appears mid-document.
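
Both selection strategies fit in a few lines. This is a sketch with hypothetical names: `chunk_probs` stands for the chunk classifier's funding-yes scores and `no_answer_prob` for a callable wrapping this model, neither of which is defined in this repo. `select_chunk` mirrors the top-2 procedure described in the Evaluation section, and `last_window` is the heuristic fallback.

```python
from typing import Callable, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")


def last_window(token_ids: Sequence[int], max_tok: int = 8192) -> Sequence[int]:
    """Heuristic: keep only the final max_tok tokens of the document."""
    return token_ids[-max_tok:]


def select_chunk(chunks: List[Chunk], chunk_probs: Sequence[float],
                 no_answer_prob: Callable[[Chunk], float],
                 top_k: int = 2) -> int:
    """Index of the chunk to extract from.

    Ranks chunks by classifier score, keeps the top_k, and returns the one
    with the lowest no-answer probability from the span model.
    """
    ranked = sorted(range(len(chunks)), key=lambda i: chunk_probs[i],
                    reverse=True)[:top_k]
    return min(ranked, key=lambda i: no_answer_prob(chunks[i]))
```

Setting `top_k=1` reduces `select_chunk` to the plain argmax used in the pseudocode above.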

	## Intended use

	Extraction of the rough span containing a funding acknowledgment from
	arXiv paper text (or similar academic markdown). Designed to be the first
	stage of a two-stage cascade with the cleanup LoRA, but usable on its own if
	you only need approximate localization.

	Not intended for: classification of funding sources, downstream
	funder/grant/scheme parsing, or extraction from non-paper text.

	## Limitations

	- Trained on arXiv-derived PDFs only; behavior on other paper sources is
	untested.
	- Outputs a rough span — for canonical, downstream-ready text, chain with the
	cleanup LoRA.
	- Will occasionally pick the wrong sibling sentence when an acknowledgments
	section contains multiple funding statements (each person's own grants);
	this is the dominant failure mode of the strict-F1 evaluation.

	## Citation / acknowledgement

	Trained as part of an applied research cycle on the
	`cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test`
	dataset by Comet.