---
library_name: transformers
license: mit
pipeline_tag: token-classification
tags:
- sentence-boundary-detection
- sentence-splitting
- token-classification
- multilingual
language:
- it
- en
base_model:
- FacebookAI/xlm-roberta-base
datasets:
- LorenzoVentrone/SentenceSplitter-dataset
---

# Sentence Boundary Disambiguation for Complex and Legal Texts

## Model Description

This model is a multilingual sentence boundary disambiguation system built by fine-tuning XLM-RoBERTa (base) for token classification.

For each token it predicts:
- 1 for end-of-sentence tokens
- 0 for non-boundary tokens

The model is optimized for difficult formatting and punctuation patterns, including legal citations, abbreviations, decimals, nested punctuation, and mixed Italian/English text.

Current model version: SentenceSplitterModelV4
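The 0/1 labeling scheme can be turned back into sentences with a simple grouping step. The sketch below is illustrative only: it assumes pre-tokenized whitespace tokens for readability (the actual model operates on XLM-RoBERTa subword tokens), and `labels_to_sentences` is a hypothetical helper, not part of the repository.

```python
from typing import List

def labels_to_sentences(tokens: List[str], labels: List[int]) -> List[str]:
    """Group tokens into sentences: label 1 marks the last token of a sentence."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no closing boundary prediction
        sentences.append(" ".join(current))
    return sentences

tokens = ["See", "art.", "5", "c.p.c.", "for", "details", ".", "Next", "sentence", "."]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(labels_to_sentences(tokens, labels))
# ['See art. 5 c.p.c. for details .', 'Next sentence .']
```

Note how the abbreviation periods in `art.` and `c.p.c.` carry label 0, which is exactly the disambiguation this model learns.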

## Data and Splits

Training data is built with a unified pipeline from:
1. The professor corpus in `sent_split_data.tar.gz`
2. The MultiLegalSBD corpus
3. Italian and English Wikipedia

Important update in this version:
- Only professor files ending in `-train.sent_split` are used
- Only legal files ending in `train.jsonl` are used
- This avoids contamination from dev and test files when building the training data
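The suffix-based file filtering described above can be sketched in a few lines; the filter logic matches the rules stated here, but the function name and the example paths are hypothetical, not the repository's actual pipeline code.

```python
from pathlib import Path
from typing import Iterable, List

def select_training_files(paths: Iterable[Path]) -> List[Path]:
    """Keep only training files, per the rules above: professor files ending in
    '-train.sent_split' and legal files ending in 'train.jsonl'.
    Dev and test files are dropped to avoid split contamination."""
    return [
        p for p in paths
        if p.name.endswith("-train.sent_split") or p.name.endswith("train.jsonl")
    ]

# Hypothetical example paths
files = [Path(s) for s in [
    "prof/it-train.sent_split",
    "prof/it-dev.sent_split",
    "legal/corpus_train.jsonl",
    "legal/corpus_test.jsonl",
]]
print([str(p) for p in select_training_files(files)])
# ['prof/it-train.sent_split', 'legal/corpus_train.jsonl']
```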

Published dataset repo:
- LorenzoVentrone/SentenceSplitter-dataset

Published splits:
- train
- validation
- test_adversarial

Upload pipeline update:
- Model and tokenizer are pushed to LorenzoVentrone/SentenceSplitter-it-en
- Dataset splits are pushed to LorenzoVentrone/SentenceSplitter-dataset in the same run

## Training Procedure

Backbone:
- xlm-roberta-base

Context setup:
- Window size: 128 tokens
- Stride: 100 tokens
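The window/stride setup implies overlapping chunks over long documents. A minimal sketch of the index arithmetic, assuming "stride" means the step between window starts (so consecutive windows overlap by 28 tokens); this is an illustration, not the repository's chunking code:

```python
from typing import Iterator, Tuple

def sliding_windows(n_tokens: int, window: int = 128, stride: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering n_tokens with overlapping windows."""
    start = 0
    while True:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride

spans = list(sliding_windows(300))
print(spans)
# [(0, 128), (100, 228), (200, 300)]
```

The overlap lets predictions near a window edge be taken from a window where that position sits closer to the center, where the model has context on both sides.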

Training hyperparameters:
- Epochs: 4
- Train batch size: 16
- Eval batch size: 16
- Learning rate: 2e-5
- Weight decay: 0.01
- Warmup steps: 480
- Eval strategy: epoch
- Save strategy: epoch
- Best-model selection metric: eval_loss
- Seed: 42
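These hyperparameters map directly onto Hugging Face `TrainingArguments`. A hedged configuration sketch, not the repository's actual training script: `output_dir` is a placeholder, and `eval_strategy` was named `evaluation_strategy` in older transformers releases.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sentence-splitter-checkpoints",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=480,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,  # lower eval_loss is better
    seed=42,
)
```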

## Evaluation on Adversarial Test Set

Classification report for SentenceSplitterModel:

| Class | Precision | Recall | F1-score | Support |
|---|---:|---:|---:|---:|
| Word (0) | 0.9992 | 0.9759 | 0.9874 | 1244 |
| Sentence Boundary (1) | 0.8454 | 0.9939 | 0.9136 | 165 |
| Accuracy | | | 0.9780 | 1409 |
| Macro avg | 0.9223 | 0.9849 | 0.9505 | 1409 |
| Weighted avg | 0.9812 | 0.9780 | 0.9788 | 1409 |
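As a quick sanity check on the table, overall accuracy should equal the support-weighted average of the per-class recalls (since recall is correct predictions divided by support). A small pure-Python verification using the figures above:

```python
# Per-class recall and support taken from the table above
recall = {0: 0.9759, 1: 0.9939}
support = {0: 1244, 1: 165}

total = sum(support.values())                          # 1409
correct = sum(recall[c] * support[c] for c in recall)  # correct = recall * support
accuracy = correct / total
print(round(accuracy, 4))
# 0.978
```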

## Notes on Behavior

The model strongly prioritizes boundary recall on adversarial data, which is useful when missed sentence boundaries are costly. In some edge cases this can slightly reduce precision, producing extra splits around ambiguous punctuation.

## Intended Use

- Legal and academic text pre-processing
- Robust multilingual sentence splitting in noisy or punctuation-dense documents
- Downstream pipelines that require high sentence-boundary recall

## Limitations

- Extremely ambiguous punctuation patterns can still produce occasional false positives
- Performance can vary on domains far removed from legal, academic, or general encyclopedic text