| --- |
| license: apache-2.0 |
| library_name: pytorch |
| base_model: answerdotai/ModernBERT-base |
| pipeline_tag: token-classification |
| tags: |
| - bibr |
| - scientific-references |
| - sequence-labeling |
| - crf |
| --- |
| |
| # bibr-parser-v1 |
|
|
| ModernBERT-base + CRF reference parser for the |
| [bibr](https://github.com/scienceverse/bibr) scientific-paper extraction |
| pipeline. Takes a single segmented reference string and emits BIO field |
| labels over 19 field types (TITLE, AUTHOR, YEAR, CONTAINER, VOLUME, |
| ISSUE, PAGES, DOI, URL, ARXIV, PMID, PUBLISHER, EDITOR, EDITION, |
| SERIES, ACCESS_DATE, NOTE, PAGE_RANGE_START, PAGE_RANGE_END). |
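
To make the BIO output concrete, here is a minimal sketch of how per-token BIO labels can be collapsed into per-field text spans. The `collapse_bio` helper, the whitespace-joined tokens, and the example reference are illustrative assumptions, not part of the `bibr` API; the actual return type of `RefParser.parse` may differ.

```python
from typing import Dict, List, Optional, Tuple


def collapse_bio(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-labelled tokens into {field: [span text, ...]}."""
    fields: Dict[str, List[str]] = {}
    current: Optional[Tuple[str, List[str]]] = None  # (field, tokens) of the open span

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        # An I- tag that continues the open span of the same field extends it.
        if prefix == "I" and current is not None and current[0] == field:
            current[1].append(token)
            continue
        # Any other label closes the open span, if there is one.
        if current is not None:
            fields.setdefault(current[0], []).append(" ".join(current[1]))
            current = None
        # B- starts a new span; a stray I- is treated as a span start too.
        if prefix in ("B", "I"):
            current = (field, [token])

    if current is not None:
        fields.setdefault(current[0], []).append(" ".join(current[1]))
    return fields


# Made-up example reference, pre-split into tokens (hypothetical data).
tokens = ["Smith,", "J.", "(2020).", "A", "made-up", "title.", "Journal", "of", "Examples."]
labels = ["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE", "I-TITLE",
          "B-CONTAINER", "I-CONTAINER", "I-CONTAINER"]
example_fields = collapse_bio(tokens, labels)
```

Each field maps to a list because a field such as AUTHOR can occur in several separate spans within one reference.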
| |
| - **Architecture:** `answerdotai/ModernBERT-base` encoder + linear head + |
| `torchcrf.CRF` decode with BIO transition constraints. |
| - **Tag set:** 39 BIO tags (see `bibr/ner/tags.py`). |
| - **Training corpus:** gold-anchored, judge-corrected references from |
| psych250 + Directorate-General Economics + MDPI Social Sciences |
| (~440 papers, 2026-05-09/10). |
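
The tag count follows from the field list: 19 fields × 2 prefixes (B, I) + 1 outside tag = 39. The sketch below rebuilds such a tag set and expresses the usual BIO transition rule as a boolean matrix. The tag order, the `O` tag name, and the `is_allowed` helper are assumptions for illustration; the authoritative definitions live in `bibr/ner/tags.py`, and how the checkpoint applies the constraints to the CRF is not shown here.

```python
FIELDS = [
    "TITLE", "AUTHOR", "YEAR", "CONTAINER", "VOLUME", "ISSUE", "PAGES",
    "DOI", "URL", "ARXIV", "PMID", "PUBLISHER", "EDITOR", "EDITION",
    "SERIES", "ACCESS_DATE", "NOTE", "PAGE_RANGE_START", "PAGE_RANGE_END",
]

# One outside tag plus a B-/I- pair per field (order is an assumption).
TAGS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
assert len(TAGS) == 39


def is_allowed(prev_tag: str, next_tag: str) -> bool:
    """Standard BIO rule: I-X may only follow B-X or I-X."""
    if not next_tag.startswith("I-"):
        return True  # O and B-* can follow any tag
    field = next_tag[2:]
    return prev_tag in (f"B-{field}", f"I-{field}")


def is_allowed_start(tag: str) -> bool:
    """A sequence cannot open with an I- tag."""
    return not tag.startswith("I-")


# allowed[i][j] is True when TAGS[i] -> TAGS[j] is a valid transition.
allowed = [[is_allowed(prev, nxt) for nxt in TAGS] for prev in TAGS]
```

One common way to use such a matrix with a CRF layer is to set the transition scores of disallowed pairs to a large negative value before decoding, so the Viterbi path never selects them.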
| |
| ## Held-out validation (2026-05-10) |
| |
| - DOI: 94%, up from 43% for the prior silver-trained checkpoint. |
| - Year / title / authors: ≥95%. |
| |
| ## psych250 eval (50 papers, 2026-05-11, vs gold) |
| |
| - `ref_matching_f1`: 0.969. The LLM baseline scores 1.000, but that |
| comparison is partially circular: the gold labels were derived from LLM |
| output and then judge-corrected. |
| - `ref_title_acc`: 0.995. |
| - `ref_year_acc`: 0.997. |
| - `ref_count_ratio`: 0.988. |
| - `ref_doi_recall` (raw): 0.004. The near-zero score traces to a |
| URL-encoded DOI corner case in the gold corpus; Crossref enrichment |
| recovers 75.7%. |
| |
| ## Loading |
| |
| ```python |
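| from bibr.ner.parser import RefParser |
| |
| # The resolver fetches the checkpoint from the Hugging Face Hub. |
| parser = RefParser("thesanogoeffect/bibr-parser-v1") |
| # `one_reference_string` is a single, already segmented reference (str). |
| fields = parser.parse(one_reference_string) |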
| ``` |
| |
| The `RefParser` resolver downloads the checkpoint via |
| `huggingface_hub.hf_hub_download` (filename: `parser_v3_gold.pt`). |
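
For cases where only the raw checkpoint file is needed, the download step can be reproduced by hand. This is a minimal sketch: `fetch_checkpoint` is a hypothetical helper, and the internal layout of the `.pt` file is not documented here, so the sketch stops at returning the local path.

```python
def fetch_checkpoint(
    repo_id: str = "thesanogoeffect/bibr-parser-v1",
    filename: str = "parser_v3_gold.pt",
) -> str:
    """Download the checkpoint (or reuse the local cache) and return its path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

Calling `fetch_checkpoint()` requires network access on the first run; later calls reuse the local Hugging Face cache.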
|
|
| ## Caveats |
|
|
| - In-distribution: APA-style social-science references. |
| - Out-of-distribution: arXiv-style `[N] Author... CoRR, abs/...` patterns. |
| Container coverage drops to ~65% (vs 94% for the LLM). Fix path: expand |
| the gold corpus. |
| - URL-encoded DOIs (`info%3Adoi%2F10...`) are not in the training |
| distribution; rely on downstream Crossref enrichment for those. |
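
As an illustration of what the URL-encoded form looks like once decoded, the sketch below percent-decodes a DOI string and strips common resolver prefixes. `normalize_doi` and the example DOI are hypothetical and not part of `bibr`; this is one possible normalization step, not a replacement for the Crossref enrichment recommended above.

```python
from urllib.parse import unquote

# Prefixes commonly wrapped around a bare DOI (assumed list, lowercase).
_DOI_PREFIXES = (
    "info:doi/",
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Percent-decode a DOI string and drop a leading resolver prefix."""
    decoded = unquote(raw.strip())  # e.g. %3A -> ":", %2F -> "/"
    lowered = decoded.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            return decoded[len(prefix):]
    return decoded


# Made-up DOI used purely for illustration.
example = normalize_doi("info%3Adoi%2F10.1000%2Fxyz123")
```

Here `example` evaluates to `10.1000/xyz123`, the bare form that string comparison against a plain DOI would expect.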
|
|