bibr-segmenter-v1
ModernBERT-base + CRF reference segmenter for the bibr scientific-paper
extraction pipeline. Takes the references section of a paper and emits
per-character BIO labels (O, B-REF, I-REF) to split it into individual
references.
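As an illustration of how per-character BIO labels can be turned into reference strings, here is a minimal sketch. The function name `bio_to_references` and the choice to treat a stray I-REF after O as the start of a new reference are assumptions made for this example, not bibr's actual decoding code.

```python
from typing import List, Optional


def bio_to_references(text: str, labels: List[str]) -> List[str]:
    """Split `text` into references given one BIO label per character.

    Hypothetical helper: labels are "O", "B-REF" or "I-REF".
    """
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")

    refs: List[str] = []
    start: Optional[int] = None  # start offset of the currently open reference
    for i, label in enumerate(labels):
        if label == "B-REF" or (label == "I-REF" and start is None):
            # A new reference begins here; close the open one, if any.
            if start is not None:
                refs.append(text[start:i])
            start = i
        elif label == "O" and start is not None:
            # Outside any reference: close the open one.
            refs.append(text[start:i])
            start = None
    if start is not None:
        refs.append(text[start:])

    # Drop surrounding whitespace and empty spans.
    return [r.strip() for r in refs if r.strip()]
```

A CRF with BIO transition constraints should never emit an I-REF directly after O, so the fallback branch only matters if labels come from an unconstrained source.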
- Architecture: answerdotai/ModernBERT-base encoder + linear head + torchcrf.CRF decode with BIO transition constraints.
- Input: the raw concatenated references-section text, one logical reference per line. Newlines are the segmenter's training-time boundary marker, so internal column-wrap whitespace should be collapsed per row first; see bibr/extract/extractor.py:_extract_references_ner.
- Output: a list of reference strings.
- Inference: sliding-window with window=2048, stride=1536, overlap-merge via per-position trust score.
- Training corpus: gold-anchored, judge-corrected reference lines from psych250 + Directorate-General Economics + MDPI Social Sciences (~440 papers, 2026-05-09/10).
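The sliding-window inference with overlap merging can be sketched as follows. This is a simplified illustration, not the bibr implementation. The card does not define the trust score, so the sketch assumes it is the distance from a position to the nearest edge of its window, which favours predictions made with more context. The names `window_starts`, `merge_windows` and the `predict` callback are hypothetical, and the unit of `window` and `stride` (tokens or characters) is left abstract.

```python
from typing import Callable, List


def window_starts(n: int, window: int = 2048, stride: int = 1536) -> List[int]:
    """Start offsets of windows that together cover positions 0..n-1."""
    if n <= window:
        return [0]
    starts = list(range(0, n - window, stride))
    starts.append(n - window)  # final window is flush with the end of the input
    return starts


def merge_windows(
    n: int,
    predict: Callable[[int, int], List[str]],
    window: int = 2048,
    stride: int = 1536,
) -> List[str]:
    """Label n positions by running `predict(start, end)` on each window.

    Where windows overlap, keep the label whose position has the higher
    trust score (assumed here: distance to the nearest window edge).
    """
    best_label = ["O"] * n
    best_trust = [-1] * n
    for start in window_starts(n, window, stride):
        end = min(start + window, n)
        labels = predict(start, end)  # one label per position in [start, end)
        last = end - start - 1
        for k, label in enumerate(labels):
            trust = min(k, last - k)
            if trust > best_trust[start + k]:
                best_trust[start + k] = trust
                best_label[start + k] = label
    return best_label
```

With window=2048 and stride=1536, consecutive windows share 512 positions, and under this assumed score each window contributes the half of the overlap that lies further from its own edge.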
Held-out validation (2026-05-10)
- 87% of papers within ±10% of gold reference count.
- 99% within ±20%.
- Exact ref-count match on the 3-paper psych APA holdout.
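For readers who want to reproduce this kind of metric on their own data, the "within ±X% of gold reference count" figure can be computed roughly as below. The helper name `fraction_within` and the per-paper count lists are assumptions for illustration; the actual evaluation script is not shown in this card.

```python
from typing import Sequence


def fraction_within(
    pred_counts: Sequence[int], gold_counts: Sequence[int], tol: float
) -> float:
    """Fraction of papers whose predicted reference count lies within
    a relative tolerance `tol` (e.g. 0.10 for ±10%) of the gold count."""
    if len(pred_counts) != len(gold_counts):
        raise ValueError("need one predicted and one gold count per paper")
    if not gold_counts:
        raise ValueError("need at least one paper")
    hits = sum(
        1 for p, g in zip(pred_counts, gold_counts) if abs(p - g) <= tol * g
    )
    return hits / len(gold_counts)


# Toy counts, not the model's real validation data.
pred = [48, 58, 30]
gold = [50, 50, 30]
share_10 = fraction_within(pred, gold, 0.10)
share_20 = fraction_within(pred, gold, 0.20)
```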
Loading
from bibr.ner.segmenter import RefSegmenter
seg = RefSegmenter("thesanogoeffect/bibr-segmenter-v1")
refs = seg.segment(references_section_text)
The RefSegmenter resolver downloads the checkpoint via
huggingface_hub.hf_hub_download (filename: seg_v3_gold.pt).
Caveats
- In-distribution: APA-style social-science references.
- Out-of-distribution: arXiv-style bracketed-numbered refs ([1] Author...) and IEEE conference proceedings. Paired with bibr-parser-v1, these recover ~94% of LLM ref count, but container-field accuracy drops.
- Acknowledgement lines that leak into the references-section input occasionally pass through as junk "refs"; bibr's downstream completeness filter (title or authors) catches most.
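A completeness filter of the kind mentioned above might look like the sketch below. It assumes each parsed reference is a dict with optional "title" and "authors" keys and keeps a reference when either is non-empty. Both the record shape and the function names are hypothetical; bibr's real filter may differ.

```python
from typing import Any, Dict, List


def is_complete(parsed: Dict[str, Any]) -> bool:
    """True if the parsed reference has a non-empty title or any author."""
    title = (parsed.get("title") or "").strip()
    authors = [a for a in (parsed.get("authors") or []) if str(a).strip()]
    return bool(title) or bool(authors)


def drop_incomplete(parsed_refs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove junk entries (e.g. leaked acknowledgement lines) that parsed
    to neither a title nor authors."""
    return [ref for ref in parsed_refs if is_complete(ref)]


# Toy records for illustration only.
parsed_refs = [
    {"title": "A study of things", "authors": []},
    {"title": "", "authors": ["Smith, J."]},
    {"title": None, "authors": None},
    {"raw": "We thank the reviewers for their comments."},
]
kept = drop_incomplete(parsed_refs)
```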