XLM-RoBERTa Sentence Splitter

Fine-tuned XLM-RoBERTa-large (560M params) for sentence boundary detection on 8 Universal Dependencies treebanks (4 English, 4 Italian).

Results

Model       Macro F1
NLTK        0.9411
spaCy       0.9519
This model  0.9863

Usage

git clone https://github.com/LucaTamSapienza/sentence_splitter.git
cd sentence_splitter
pip install -r requirements.txt
python download_model.py
python src/predict.py --input input/ --output output/ --model_path checkpoints/best_xlmr_model.pt

Architecture

XLM-RoBERTa-large -> Dropout(0.1) -> Linear(1024, 1) -> per-token sigmoid
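The classification head above can be sketched as follows. This is a minimal illustration of the described Dropout(0.1) -> Linear(1024, 1) -> sigmoid stack; the class and variable names are illustrative, not taken from the repository.

```python
import torch
import torch.nn as nn

class SentenceSplitterHead(nn.Module):
    """Per-token boundary head: Dropout(0.1) -> Linear(1024, 1) -> sigmoid.

    Sketch only; names are assumptions, not repository code.
    """
    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, 1024) from XLM-RoBERTa-large
        logits = self.classifier(self.dropout(hidden_states)).squeeze(-1)
        # per-token probability that the token ends a sentence
        return torch.sigmoid(logits)
```

At inference time, tokens whose probability exceeds a threshold (e.g. 0.5) are treated as sentence boundaries.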

Trained with focal loss (alpha=0.75, gamma=2.0), sliding windows (510 tokens, stride 256), FP16, AdamW (lr=2e-5).
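The focal loss used here follows the standard binary formulation (Lin et al.) with the stated alpha=0.75 and gamma=2.0, which down-weights easy negatives since boundary tokens are rare. The sketch below is an assumption of that standard form, not code copied from the repository.

```python
import torch

def focal_loss(probs: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss over per-token sigmoid probabilities.

    probs: per-token boundary probabilities in [0, 1]
    targets: 0/1 labels (1 = token ends a sentence)
    Sketch of the standard formulation, with the paper's alpha/gamma.
    """
    eps = 1e-8
    # p_t: probability the model assigns to the true class
    p_t = torch.where(targets == 1, probs, 1.0 - probs)
    # alpha_t: weight positives by alpha, negatives by (1 - alpha)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1.0 - alpha))
    # (1 - p_t)^gamma down-weights well-classified (easy) tokens
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    return loss.mean()
```

Note how a confidently correct prediction contributes almost nothing, while an uncertain one is penalized much more heavily.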

Citation

@misc{tam2026sentencesplitter,
  author = {Luca Tam},
  title = {Sentence Splitter: Fine-tuning XLM-RoBERTa for Sentence Boundary Detection},
  year = {2026},
  url = {https://github.com/LucaTamSapienza/sentence_splitter}
}