metadata
license: apache-2.0
library_name: pytorch
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
  - bibr
  - scientific-references
  - sequence-labeling
  - crf

bibr-parser-v1

ModernBERT-base + CRF reference parser for the bibr scientific-paper extraction pipeline. Takes a single segmented reference string and emits BIO field labels over 19 field types (TITLE, AUTHOR, YEAR, CONTAINER, VOLUME, ISSUE, PAGES, DOI, URL, ARXIV, PMID, PUBLISHER, EDITOR, EDITION, SERIES, ACCESS_DATE, NOTE, PAGE_RANGE_START, PAGE_RANGE_END).

  • Architecture: answerdotai/ModernBERT-base encoder + linear head + torchcrf.CRF decode with BIO transition constraints.
  • Tag set: 39 BIO tags, i.e. a B- and an I- tag for each of the 19 fields plus O (see bibr/ner/tags.py).
  • Training corpus: gold-anchored, judge-corrected references from psych250 + Directorate-General Economics + MDPI Social Sciences (~440 papers, 2026-05-09/10).
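As a rough illustration of the tag set and the BIO transition constraints mentioned above, the following sketch builds 39 tags from the 19 field names and derives a boolean mask of allowed transitions. The tag ordering, the `is_allowed` helper, and the `ALLOWED` mask are assumptions made for this example; the actual definitions live in bibr/ner/tags.py and may differ.

```python
# The 19 field types listed in this model card.
FIELDS = [
    "TITLE", "AUTHOR", "YEAR", "CONTAINER", "VOLUME", "ISSUE", "PAGES",
    "DOI", "URL", "ARXIV", "PMID", "PUBLISHER", "EDITOR", "EDITION",
    "SERIES", "ACCESS_DATE", "NOTE", "PAGE_RANGE_START", "PAGE_RANGE_END",
]

# "O" first, then B-/I- per field. This ordering is an assumption.
TAGS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]


def is_allowed(prev_tag: str, next_tag: str) -> bool:
    """Return True if next_tag may follow prev_tag under standard BIO rules."""
    if not next_tag.startswith("I-"):
        # "O" and any "B-X" may follow any tag.
        return True
    field = next_tag[2:]
    # "I-X" may only continue a span of the same field X.
    return prev_tag in (f"B-{field}", f"I-{field}")


# ALLOWED[i][j] is True when TAGS[j] may directly follow TAGS[i].
ALLOWED = [[is_allowed(prev, nxt) for nxt in TAGS] for prev in TAGS]

print(len(TAGS))
```

A common way to apply such a mask is to push the CRF transition scores of disallowed pairs to a large negative value so that decoding never selects them; whether bibr does exactly this is not stated here.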

Held-out validation (2026-05-10)

  • DOI: 94%, up from 43% with the prior silver-trained checkpoint.
  • Year / title / authors: ≥95%.

psych250 eval (50 papers, 2026-05-11, vs gold)

  • ref_matching_f1: 0.969 (LLM baseline: 1.000, which is partially circular because the gold was derived from LLM output with judge corrections).
  • ref_title_acc: 0.995.
  • ref_year_acc: 0.997.
  • ref_count_ratio: 0.988.
  • ref_doi_recall (raw): 0.004, caused by a URL-encoded DOI corner case in the gold corpus; Crossref enrichment recovers 75.7%.

Loading

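from bibr.ner.parser import RefParser
# Load the checkpoint from the Hugging Face Hub by repo id.
parser = RefParser("thesanogoeffect/bibr-parser-v1")
# Placeholder input: a single, already-segmented reference string.
one_reference_string = "Doe, J. (2020). An example title. Journal of Examples, 1(2), 3-4."
fields = parser.parse(one_reference_string)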

The RefParser resolver downloads the checkpoint via huggingface_hub.hf_hub_download (filename: parser_v3_gold.pt).
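The structure of the value returned by parse is not documented in this card. As a generic illustration of how token-level BIO labels can be collapsed into field spans, here is a minimal sketch; the `collapse_bio` helper, the whitespace tokenization, and the example reference are hypothetical and not part of the bibr API.

```python
from typing import List, Optional, Tuple


def collapse_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (field, text) spans, in order of appearance."""
    spans: List[Tuple[str, str]] = []
    current_field: Optional[str] = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        # "I-X" extends the open span only if it belongs to the same field.
        if tag.startswith("I-") and tag[2:] == current_field:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_field is not None:
            spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = None, []
        # "B-X" opens a new span; "O" and stray "I-" tags are dropped.
        if tag.startswith("B-"):
            current_field, current_tokens = tag[2:], [token]

    # Flush the last open span.
    if current_field is not None:
        spans.append((current_field, " ".join(current_tokens)))
    return spans


# Made-up example input, tokenized on whitespace.
tokens = ["Doe", "J.", "(2020)", "An", "example", "title"]
tags = ["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE", "I-TITLE"]
print(collapse_bio(tokens, tags))
```

Because the CRF decode already enforces BIO transitions, stray I- tags should not occur in practice; the sketch drops them only so that it stays well-defined on arbitrary input.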

Caveats

  • In-distribution: APA-style social-science references.
  • Out-of-distribution: arXiv-style references of the form "[N] Author... CoRR, abs/..."; container coverage drops to ~65% (vs 94% for the LLM baseline). Fix path: expand the gold corpus.
  • URL-encoded DOIs (info%3Adoi%2F10...) are not in the training distribution; rely on downstream Crossref enrichment for those.
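For the URL-encoded DOI case, one possible workaround is to percent-decode the string and extract the DOI before matching against gold or querying Crossref. The sketch below is an assumption about how such a step could look, not part of bibr; `normalize_doi`, the regular expression, and the example DOIs are illustrative only.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: str) -> Optional[str]:
    """Percent-decode a DOI-like string and return the bare, lowercased DOI."""
    decoded = unquote(raw.strip())
    match = DOI_PATTERN.search(decoded)
    # DOIs are case-insensitive, so lowercasing gives a stable comparison key.
    return match.group(0).lower() if match else None


# Made-up DOIs for illustration.
print(normalize_doi("info%3Adoi%2F10.1000%2Fxyz123"))
print(normalize_doi("https://doi.org/10.1000/ABC"))
```

The pattern is deliberately loose and will keep trailing punctuation attached to the suffix, so a real pipeline would likely need additional cleanup.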