thesanogoeffect
/

bibr-parser-v1



	---
	license: apache-2.0
	library_name: pytorch
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: token-classification
	tags:
	- bibr
	- scientific-references
	- sequence-labeling
	- crf
	---

	# bibr-parser-v1

	ModernBERT-base + CRF reference parser for the
	[bibr](https://github.com/scienceverse/bibr) scientific-paper extraction
	pipeline. Takes a single segmented reference string and emits BIO field
	labels over 19 field types (TITLE, AUTHOR, YEAR, CONTAINER, VOLUME,
	ISSUE, PAGES, DOI, URL, ARXIV, PMID, PUBLISHER, EDITOR, EDITION,
	SERIES, ACCESS_DATE, NOTE, PAGE_RANGE_START, PAGE_RANGE_END).
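
To make the output format concrete, here is a minimal sketch of how per-token BIO labels could be collapsed into field spans. It is not bibr's implementation: the function name `bio_to_fields`, the whitespace-level tokens, and the sample reference are all hypothetical, and the real parser presumably operates on ModernBERT subword tokens.

```python
from typing import List, Optional, Tuple


def bio_to_fields(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (field, text) spans, in reading order."""
    spans: List[Tuple[str, List[str]]] = []
    current_field: Optional[str] = None
    for token, label in zip(tokens, labels):
        if label == "O":
            # Outside any field: close the open span, if there is one.
            current_field = None
            continue
        prefix, field = label.split("-", 1)
        if prefix == "B" or field != current_field:
            # A B- tag, or an I- tag with no matching open span, starts a new span.
            spans.append((field, [token]))
        else:
            spans[-1][1].append(token)
        current_field = field
    return [(field, " ".join(words)) for field, words in spans]


# Hypothetical, hand-labelled example (not taken from the training corpus).
tokens = ["Smith", ",", "J.", "(", "2020", ")", ".", "Deep", "parsing", "."]
labels = ["B-AUTHOR", "I-AUTHOR", "I-AUTHOR", "O", "B-YEAR", "O", "O",
          "B-TITLE", "I-TITLE", "O"]
fields = bio_to_fields(tokens, labels)
print(fields)
```

Joining tokens with a single space is a simplification; a production decoder would more likely map spans back to character offsets in the original string.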

	- Architecture: `answerdotai/ModernBERT-base` encoder + linear head +
	`torchcrf.CRF` decode with BIO transition constraints.
	- Tag set: 39 BIO tags, i.e. `B-`/`I-` for each of the 19 fields plus `O`
	(see `bibr/ner/tags.py`).
	- Training corpus: gold-anchored, judge-corrected references from
	psych250 + Directorate-General Economics + MDPI Social Sciences
	(~440 papers, 2026-05-09/10).
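
The tag count and the BIO transition constraints can be sketched as follows. The field names come from the list above; the tag ordering and the helper `is_allowed` are assumptions made for illustration, and the authoritative definitions live in `bibr/ner/tags.py`.

```python
FIELDS = [
    "TITLE", "AUTHOR", "YEAR", "CONTAINER", "VOLUME", "ISSUE", "PAGES",
    "DOI", "URL", "ARXIV", "PMID", "PUBLISHER", "EDITOR", "EDITION",
    "SERIES", "ACCESS_DATE", "NOTE", "PAGE_RANGE_START", "PAGE_RANGE_END",
]

# One O tag plus a B- and an I- tag per field: 1 + 2 * 19 = 39.
TAGS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]


def is_allowed(prev_tag: str, next_tag: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; all other moves are allowed."""
    if not next_tag.startswith("I-"):
        return True
    field = next_tag[2:]
    return prev_tag in (f"B-{field}", f"I-{field}")


# Every (previous, next) tag pair that the constraints rule out.
forbidden = [(a, b) for a in TAGS for b in TAGS if not is_allowed(a, b)]

print(len(TAGS), len(forbidden))
```

One common way to impose such constraints on a CRF layer is to set the scores of the forbidden transitions to a large negative value before decoding; whether bibr does exactly that is not stated here.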

	## Held-out validation (2026-05-10)

	- DOI: 94%, up from 43% with the prior silver-trained checkpoint.
	- Year, title, and authors: ≥95% each.

	## psych250 eval (50 papers, 2026-05-11, vs gold)

	- `ref_matching_f1`: 0.969. The LLM baseline scores 1.000, but that
	comparison is partially circular: the gold labels were derived from LLM
	output with judge corrections.
	- `ref_title_acc`: 0.995.
	- `ref_year_acc`: 0.997.
	- `ref_count_ratio`: 0.988.
	- `ref_doi_recall` (raw): 0.004, due to a URL-encoded DOI corner case in
	the gold corpus; Crossref enrichment recovers 75.7%.

	## Loading

	```python
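	from bibr.ner.parser import RefParser

	# The resolver fetches the checkpoint through huggingface_hub.
	parser = RefParser("thesanogoeffect/bibr-parser-v1")
	# `one_reference_string` is a single, already-segmented reference.
	fields = parser.parse(one_reference_string)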
	```

	The `RefParser` resolver downloads the checkpoint via
	`huggingface_hub.hf_hub_download` (filename: `parser_v3_gold.pt`).
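
If you only need the checkpoint file itself, the download step can be reproduced by hand. This is a sketch, not the resolver's actual code: the wrapper `fetch_checkpoint` is a hypothetical helper, and only the repository id and filename are taken from this card.

```python
REPO_ID = "thesanogoeffect/bibr-parser-v1"
CHECKPOINT_FILENAME = "parser_v3_gold.pt"


def fetch_checkpoint(repo_id: str = REPO_ID,
                     filename: str = CHECKPOINT_FILENAME) -> str:
    """Download the checkpoint (or reuse the local cache) and return its path."""
    # Imported lazily so this module loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


if __name__ == "__main__":
    # Requires network access and `pip install huggingface_hub`.
    print(fetch_checkpoint())
```

How the returned file is deserialized depends on how the checkpoint was saved, so `RefParser` remains the supported entry point.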

	## Caveats

	- In-distribution: APA-style social-science references.
	- Out-of-distribution: arXiv-style `[N] Author... CoRR, abs/...` patterns,
	where container coverage drops to ~65% (vs 94% for the LLM). Fix path:
	expand the gold corpus.
	- URL-encoded DOIs (`info%3Adoi%2F10...`) are not in the training
	distribution; rely on downstream Crossref enrichment for those.
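
Until such DOIs are covered, a small normalization step before or after parsing may help. The sketch below is an assumption about how that could look, not part of bibr: `normalize_doi` and `DOI_PATTERN` are hypothetical names, the regular expression is a common heuristic rather than a full DOI grammar, and the sample DOIs are made up.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Heuristic: "10.", a 4-9 digit registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: str) -> Optional[str]:
    """Percent-decode a string and return the bare, lower-cased DOI, if any."""
    decoded = unquote(raw.strip())  # "info%3Adoi%2F10..." -> "info:doi/10..."
    match = DOI_PATTERN.search(decoded)
    if match is None:
        return None
    # Drop trailing sentence punctuation; DOIs are case-insensitive.
    return match.group(0).rstrip(".,;").lower()


print(normalize_doi("info%3Adoi%2F10.1234%2Fexample.5678"))
print(normalize_doi("https://doi.org/10.1234/Example.5678."))
```

The same function also strips `info:doi/` and `https://doi.org/` prefixes, since it keeps only the part that matches the DOI pattern.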