bibr-segmenter-v1

ModernBERT-base + CRF reference segmenter for the bibr scientific-paper extraction pipeline. Takes the references section of a paper and emits per-character BIO labels (O, B-REF, I-REF) to split it into individual references.
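To make the label scheme concrete, here is a minimal sketch of how per-character BIO labels could be turned back into reference strings. The function name `bio_to_references` is hypothetical and not part of bibr; treating a stray I-REF as the start of a reference is an assumption made for leniency, not documented behaviour.

```python
from typing import List


def bio_to_references(text: str, labels: List[str]) -> List[str]:
    """Split `text` into references using one BIO label per character.

    Hypothetical helper: illustrates the O / B-REF / I-REF scheme only.
    """
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")

    refs: List[str] = []
    start = None  # index where the currently open reference began
    for i, label in enumerate(labels):
        if label == "B-REF" or (label == "I-REF" and start is None):
            # A new reference begins here; close the previous one if open.
            # (Assumption: a stray I-REF after O also opens a reference.)
            if start is not None:
                refs.append(text[start:i].strip())
            start = i
        elif label == "O" and start is not None:
            # An O label ends the open reference.
            refs.append(text[start:i].strip())
            start = None
    if start is not None:
        refs.append(text[start:].strip())
    return [r for r in refs if r]
```

Called on a two-line references section where each line starts with B-REF and the newline is labelled O, this yields one string per line.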

  • Architecture: answerdotai/ModernBERT-base encoder + linear head + torchcrf.CRF decode with BIO transition constraints.
  • Input: the raw concatenated references-section text, one logical reference per line. Newlines are the segmenter's training-time boundary marker, so internal column-wrap whitespace should be collapsed per row first (see bibr/extract/extractor.py:_extract_references_ner).
  • Output: a list of reference strings.
  • Inference: sliding window with window=2048, stride=1536; overlapping predictions are merged via a per-position trust score.
  • Training corpus: gold-anchored, judge-corrected reference lines from psych250 + Directorate-General Economics + MDPI Social Sciences (~440 papers, 2026-05-09/10).
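The sliding-window inference listed above can be sketched as follows. This is an illustration under stated assumptions, not the bibr implementation: `window_starts` and `merge_windows` are hypothetical names, and the trust score used here (distance from the nearest window edge) is an assumed definition, since the card does not say how trust is computed.

```python
from typing import List, Tuple


def window_starts(n_positions: int, window: int = 2048, stride: int = 1536) -> List[int]:
    """Start offsets of windows that together cover positions 0..n_positions-1."""
    if n_positions <= window:
        return [0]
    starts = list(range(0, n_positions - window, stride))
    starts.append(n_positions - window)  # last window sits flush with the end
    return starts


def merge_windows(n_positions: int, windows: List[Tuple[int, List[str]]]) -> List[str]:
    """Merge per-window labels, keeping the most trusted label per position.

    Assumption: trust = distance to the nearest window edge, so predictions
    made with more context on both sides win in the overlap region.
    """
    best_trust = [-1] * n_positions
    merged = ["O"] * n_positions
    for start, labels in windows:
        size = len(labels)
        for offset, label in enumerate(labels):
            trust = min(offset, size - 1 - offset)
            pos = start + offset
            if trust > best_trust[pos]:
                best_trust[pos] = trust
                merged[pos] = label
    return merged
```

With window=2048 and stride=1536, consecutive windows overlap by 512 positions, which is where the merge rule decides between two competing predictions.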

Held-out validation (2026-05-10)

  • 87% of papers within ±10% of gold reference count.
  • 99% within ±20%.
  • Exact ref-count match on the 3-paper psych APA holdout.
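For readers who want to reproduce the count-based metrics above on their own data, a minimal sketch follows. The function name `share_within_tolerance` and the (predicted, gold) pair layout are assumptions; the numbers in the usage comment are made-up illustrations, not results from this model.

```python
from typing import Iterable, Tuple


def share_within_tolerance(pairs: Iterable[Tuple[int, int]], tolerance: float) -> float:
    """Fraction of papers whose predicted reference count is within
    +/- `tolerance` (e.g. 0.10 for 10%) of the gold count."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    hits = sum(
        1 for predicted, gold in pairs
        if abs(predicted - gold) <= tolerance * gold
    )
    return hits / len(pairs)


# Illustrative counts only: (predicted, gold) per paper.
example = [(10, 10), (19, 20), (13, 10), (23, 20)]
print(share_within_tolerance(example, 0.10))
print(share_within_tolerance(example, 0.20))
```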

Loading

from bibr.ner.segmenter import RefSegmenter
seg = RefSegmenter("thesanogoeffect/bibr-segmenter-v1")
refs = seg.segment(references_section_text)
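
Because newlines are the boundary marker, the input usually needs light normalization before `seg.segment` is called. A minimal sketch, assuming each logical reference already sits on its own row; `collapse_row_whitespace` is a hypothetical helper (not the bibr function referenced above), and dropping blank rows is an assumption.

```python
def collapse_row_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs inside each row to a single space,
    keeping one reference per line. Blank rows are dropped (assumption)."""
    rows = (" ".join(row.split()) for row in text.splitlines())
    return "\n".join(row for row in rows if row)


raw = "Smith, J.  (2020).\tTitle  here.\n\n  Doe, A. (2021).  Other."
references_section_text = collapse_row_whitespace(raw)
```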

The RefSegmenter resolver downloads the checkpoint via huggingface_hub.hf_hub_download (filename: seg_v3_gold.pt).
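
To fetch the checkpoint file directly, without going through RefSegmenter, something along these lines should work. This is a sketch, not the resolver's actual code: `resolve_checkpoint` and its `download` parameter are hypothetical, and only `hf_hub_download(repo_id=..., filename=...)` is the real huggingface_hub call.

```python
from typing import Callable, Optional


def resolve_checkpoint(
    repo_id: str = "thesanogoeffect/bibr-segmenter-v1",
    filename: str = "seg_v3_gold.pt",
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path to the checkpoint, downloading it if needed."""
    if download is None:
        # Imported lazily so the helper can be exercised with a stub
        # downloader when huggingface_hub or network access is unavailable.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return download(repo_id=repo_id, filename=filename)
```

hf_hub_download caches files locally, so repeated calls should not re-download the checkpoint.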

Caveats

  • In-distribution: APA-style social-science references.
  • Out-of-distribution: arXiv-style bracketed-numbered refs ([1] Author...) and IEEE conference proceedings. Paired with bibr-parser-v1, these recover ~94% of the LLM ref count, but container-field accuracy drops.
  • Acknowledgement lines that leak into the references-section input occasionally pass through as junk "refs"; bibr's downstream completeness filter (title or authors) catches most.
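
The "title or authors" completeness check mentioned in the last caveat can be sketched as below. The record layout (a dict with `title` and `authors` keys) and the function names are assumptions for illustration; bibr's actual filter may differ.

```python
from typing import Dict, List


def is_complete(record: Dict) -> bool:
    """Keep a parsed reference if it has a non-empty title or any author."""
    title = (record.get("title") or "").strip()
    authors = [a for a in (record.get("authors") or []) if a and a.strip()]
    return bool(title) or bool(authors)


def filter_complete(records: List[Dict]) -> List[Dict]:
    """Drop parsed records that look like junk (no title and no authors)."""
    return [r for r in records if is_complete(r)]
```

A leaked acknowledgement line typically parses to a record with neither field, so it is dropped, while a reference with a title but no recognized authors is kept.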