bibr-segmenter-v1

ModernBERT-base + CRF reference segmenter for the bibr scientific-paper extraction pipeline. It takes the references section of a paper and emits per-character BIO labels (O, B-REF, I-REF), which are then used to split the section into individual references.
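As an illustration of how per-character BIO labels turn into reference strings, here is a minimal sketch. The function name `bio_to_references` is hypothetical and is not part of bibr; the lenient handling of an I-REF that follows O is an assumption of this sketch (a CRF with BIO transition constraints should not emit that sequence in the first place).

```python
from typing import List, Optional


def bio_to_references(text: str, labels: List[str]) -> List[str]:
    """Split `text` into reference strings using one BIO label per character."""
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")
    refs: List[str] = []
    start: Optional[int] = None  # start offset of the reference being read
    for i, label in enumerate(labels):
        if label == "B-REF" or (label == "I-REF" and start is None):
            # B-REF always opens a new reference. A stray I-REF after O is
            # treated as an opening too (assumption of this sketch).
            if start is not None:
                refs.append(text[start:i].strip())
            start = i
        elif label == "O" and start is not None:
            # O closes the reference currently being read.
            refs.append(text[start:i].strip())
            start = None
    if start is not None:
        refs.append(text[start:].strip())
    return [r for r in refs if r]


text = "Smith 2020.\nDoe 2021."
labels = ["B-REF"] + ["I-REF"] * 10 + ["O"] + ["B-REF"] + ["I-REF"] * 8
print(bio_to_references(text, labels))
```

Two adjacent references with no O between them are still separated, because the second B-REF closes the first span.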

Architecture: answerdotai/ModernBERT-base encoder + linear head + torchcrf.CRF decode with BIO transition constraints.
Input: the raw concatenated references-section text, one logical reference per line. Newlines are the boundary marker the segmenter saw at training time, so whitespace introduced by column wrapping inside a reference should be collapsed per row first (see bibr/extract/extractor.py:_extract_references_ner).
Output: a list of reference strings.
Inference: sliding window with window=2048 and stride=1536 (consecutive windows overlap by 512 positions); predictions in the overlap are merged via a per-position trust score.
Training corpus: gold-anchored, judge-corrected reference lines from psych250 + Directorate-General Economics + MDPI Social Sciences (~440 papers, 2026-05-09/10).
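The sliding-window inference described above can be sketched as follows. This is not the bibr implementation: the card does not define the trust score, so this sketch assumes a simple one (distance to the nearest window edge, so predictions near the middle of a window win in the overlap). The names `window_starts`, `trust`, and `merge_windows` are hypothetical.

```python
from typing import List, Tuple


def window_starts(n: int, window: int = 2048, stride: int = 1536) -> List[int]:
    """Start offsets of windows that together cover positions 0..n-1."""
    if n <= window:
        return [0]
    starts = list(range(0, n - window, stride))
    starts.append(n - window)  # last window sits flush with the end
    return starts


def trust(pos_in_window: int, length: int) -> int:
    """Assumed trust score: distance to the nearest window edge."""
    return min(pos_in_window, length - 1 - pos_in_window)


def merge_windows(n: int, windows: List[Tuple[int, List[str]]]) -> List[str]:
    """Merge per-window labels, keeping the most trusted label per position."""
    merged = ["O"] * n
    best = [-1] * n  # best trust score seen so far at each position
    for start, labels in windows:
        for j, label in enumerate(labels):
            score = trust(j, len(labels))
            if score > best[start + j]:
                best[start + j] = score
                merged[start + j] = label
    return merged


print(window_starts(5000))
```

With the default window and stride, a 5000-position input yields windows starting at 0, 1536, and 2952; the last start is shifted back so the final window ends exactly at the end of the input.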

Held-out validation (2026-05-10)

87% of papers within ±10% of gold reference count.
99% of papers within ±20%.
Exact ref-count match on the 3-paper psych APA holdout.
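For readers who want to reproduce this kind of count-based check on their own data, here is a minimal sketch. It assumes "within ±10%" means the predicted count differs from the gold count by at most 10% of the gold count; the card does not spell out the exact definition. The function name `fraction_within` and the counts below are made up for illustration and are not the validation data.

```python
from typing import Iterable, Tuple


def fraction_within(counts: Iterable[Tuple[int, int]], tol: float) -> float:
    """Share of papers whose predicted ref count is within ±tol of gold."""
    pairs = list(counts)
    if not pairs:
        raise ValueError("no papers to score")
    hits = sum(1 for pred, gold in pairs if abs(pred - gold) <= tol * gold)
    return hits / len(pairs)


# (predicted, gold) reference counts per paper; illustrative values only.
example = [(50, 50), (46, 50), (58, 50), (30, 40)]
print(fraction_within(example, 0.10), fraction_within(example, 0.20))
```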

Loading

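from bibr.ner.segmenter import RefSegmenter
# Resolve and load the checkpoint from the Hugging Face Hub.
seg = RefSegmenter("thesanogoeffect/bibr-segmenter-v1")
# references_section_text: the references section as a single string, one logical reference per line.
refs = seg.segment(references_section_text)  # list of reference strings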

The RefSegmenter resolver downloads the checkpoint via huggingface_hub.hf_hub_download (filename: seg_v3_gold.pt).
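Because the segmenter expects one logical reference per line, it can help to normalize whitespace before calling seg.segment. The sketch below assumes the caller already has the text grouped into rows (one logical reference each); how bibr builds those rows in _extract_references_ner is not shown here, and `collapse_rows` is a hypothetical helper name.

```python
import re
from typing import Iterable


def collapse_rows(rows: Iterable[str]) -> str:
    """Collapse whitespace inside each row and join rows with newlines."""
    # Any run of spaces, tabs, or wrapped-line newlines becomes one space.
    cleaned = (re.sub(r"\s+", " ", row).strip() for row in rows)
    # Drop rows that are empty after cleaning, so no blank lines are emitted.
    return "\n".join(row for row in cleaned if row)


rows = [
    "Smith, J. (2020).  A title\n   that wraps.",
    "  ",
    "Doe, A. (2021). Another\ttitle.",
]
references_section_text = collapse_rows(rows)
print(references_section_text)
```

The result contains exactly one newline per reference boundary, which matches the boundary marker described under Input.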

Caveats

In-distribution: APA-style social-science references.
Out-of-distribution: arXiv-style bracketed, numbered references ([1] Author...) and IEEE conference proceedings. Paired with bibr-parser-v1, these recover ~94% of the LLM ref count, but container-field accuracy drops.
Acknowledgement lines that leak into the references-section input occasionally pass through as junk "refs"; bibr's downstream completeness filter (title or authors) catches most of them.
