---
license: apache-2.0
library_name: pytorch
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
- bibr
- scientific-references
- sequence-labeling
- crf
---

# bibr-parser-v1

ModernBERT-base + CRF reference parser for the [bibr](https://github.com/scienceverse/bibr) scientific-paper extraction pipeline. Takes a single segmented reference string and emits BIO field labels over 19 field types (TITLE, AUTHOR, YEAR, CONTAINER, VOLUME, ISSUE, PAGES, DOI, URL, ARXIV, PMID, PUBLISHER, EDITOR, EDITION, SERIES, ACCESS_DATE, NOTE, PAGE_RANGE_START, PAGE_RANGE_END).

- **Architecture:** `answerdotai/ModernBERT-base` encoder + linear head + `torchcrf.CRF` decode with BIO transition constraints.
- **Tag set:** 39 BIO tags (see `bibr/ner/tags.py`).
- **Training corpus:** gold-anchored, judge-corrected references from psych250 + Directorate-General Economics + MDPI Social Sciences (~440 papers, 2026-05-09/10).

## Held-out validation (2026-05-10)

- DOI: 43% → 94% vs the prior silver-trained checkpoint.
- Year / title / authors: ≥95%.

## psych250 eval (50 papers, 2026-05-11, vs gold)

- `ref_matching_f1`: 0.969 (LLM baseline: 1.000; partially circular, since gold is derived from the LLM with judge corrections).
- `ref_title_acc`: 0.995.
- `ref_year_acc`: 0.997.
- `ref_count_ratio`: 0.988.
- `ref_doi_recall` (raw): 0.004 (URL-encoded DOI corner case in the gold corpus; Crossref enrichment recovers 75.7%).

## Loading

```python
from bibr.ner.parser import RefParser

# Resolve and load the checkpoint from the Hugging Face Hub.
parser = RefParser("thesanogoeffect/bibr-parser-v1")

# `one_reference_string` is a single, already-segmented reference string.
fields = parser.parse(one_reference_string)
```

The `RefParser` resolver downloads the checkpoint via `huggingface_hub.hf_hub_download` (filename: `parser_v3_gold.pt`).

## Caveats

- In-distribution: APA-style social-science references.
- Out-of-distribution: arXiv `[N] Author... CoRR, abs/...` patterns, where container coverage drops to ~65% (vs 94% LLM). Fix path: expand gold corpus.
- URL-encoded DOIs (`info%3Adoi%2F10...`) are not in the training distribution; rely on downstream Crossref enrichment for those.
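
For the URL-encoded DOI caveat, one possible local workaround is to percent-decode a candidate string before looking for a DOI in it. The sketch below is illustrative only: `normalize_doi` is a hypothetical helper and not part of bibr, the regular expression is a common loose DOI pattern rather than a full validator, and the DOI in the usage lines is made up. It does not replace Crossref enrichment.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not the pattern bibr uses.
_DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: str) -> Optional[str]:
    """Hypothetical helper: percent-decode `raw` and extract a bare DOI."""
    decoded = unquote(raw)  # "info%3Adoi%2F10..." -> "info:doi/10..."
    match = _DOI_RE.search(decoded)
    if match is None:
        return None
    # Drop trailing sentence punctuation; DOIs are case-insensitive, so
    # lowercasing makes later comparisons simpler.
    return match.group(0).rstrip(".,;").lower()


# Usage with a made-up DOI:
print(normalize_doi("info%3Adoi%2F10.1234%2Fexample.5678"))
print(normalize_doi("no identifier in this string"))
```

Decoding before matching means the `info:doi/` prefix and any `https://doi.org/` prefix are both skipped by the search, so the same helper covers encoded and plain forms.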