LorenzoVentrone committed
Commit 941fa52 · verified · 1 Parent(s): 6ee48df

Update README.md

Files changed (1):
  1. README.md +73 -44

README.md CHANGED
@@ -1,70 +1,99 @@
  ---
  library_name: transformers
  license: mit
- datasets:
- - wikimedia/wikipedia
  language:
  - it
  - en
  base_model:
  - FacebookAI/xlm-roberta-base
  ---
- # Sentence Boundary Disambiguation (SBD) for Complex & Legal Texts

- ## 📖 Model Description
- This model is a robust, cross-lingual Sentence Boundary Disambiguation (SBD) system built by fine-tuning **XLM-RoBERTa** (`xlm-roberta-base`). It is specifically engineered to handle highly complex formatting, such as legal documents, academic papers, nested parentheses, decimals, and obscure abbreviations (e.g., *n.d.r.*, *S.p.A.*, *U.S.A.*, *et al.*), without erroneously splitting sentences.

- - **Developed for:** NLP Hackathon
- - **Language(s):** Multilingual (heavily optimized for Italian and English)
- - **Base Model:** `xlm-roberta-base`
- - **Task:** Token Classification (NER-style binary classification: `1` for End-Of-Sentence, `0` otherwise)

- ## 🗂️ Training Data (Hybrid Approach)
- To prevent domain overfitting (bias collapse) and ensure both strict grammatical accuracy and resilience to edge cases, the model was trained on a carefully balanced hybrid dataset (~25,000 chunks):
- 1. **Target Domain Data (~40%):** Custom academic and hackathon-specific texts.
- 2. **MultiLegalSBD (IT & EN) (~25%):** "Gold standard" legal texts containing extreme edge cases, citations, and numbering to teach the model not to split on legal abbreviations.
- 3. **Wikimedia/Wikipedia (IT & EN) (~35%):** Generalist texts (bootstrapped via NLTK) to recalibrate the model's weights and teach it standard punctuation rules (e.g., handling closing parentheses followed by periods).

- ## ⚙️ Training Procedure
- The model was fine-tuned using the Hugging Face `Trainer` with the following hyperparameters:
- - **Epochs:** 3
- - **Batch Size:** 16
- - **Learning Rate:** 2e-5
- - **Weight Decay:** 0.01
- - **Warmup Steps:** ~10% of total training steps
- - **Optimizer:** AdamW
- - **Context Window:** 128 tokens with a sliding-window stride of 100 to prevent context loss.
- ## 💻 How to Use
- You can easily load this model within your inference pipeline using the `transformers` library:

- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification

- # Ensure you have your Hugging Face token ready if the repository is private
- model_name = "YOUR_USERNAME/SentenceSplitter-MultiLegal-V2"
- hf_token = "YOUR_HF_TOKEN"

- tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
- model = AutoModelForTokenClassification.from_pretrained(model_name, token=hf_token)

- print("Model loaded successfully!")
- ```
- ## Evaluation Results

- Evaluation was run with `evaluation.py` on the test split generated from `unified_training_dataset`.

- ### Classification Report

  | Class | Precision | Recall | F1-score | Support |
  |---|---:|---:|---:|---:|
- | Word (0) | 0.9985 | 0.9983 | 0.9984 | 242929 |
- | Sentence Boundary (1) | 0.9685 | 0.9710 | 0.9697 | 12709 |
- | Accuracy | | | 0.9970 | 255638 |
- | Macro Avg | 0.9835 | 0.9847 | 0.9841 | 255638 |
- | Weighted Avg | 0.9970 | 0.9970 | 0.9970 | 255638 |

- ## ⚠️ Limitations & Bias
- While the model generalizes extremely well, it may occasionally exhibit "hyper-caution" when encountering nested citations combined with multiple punctuation marks at the end of paragraphs, opting not to split in order to preserve legal/academic quotation integrity.

  ---
  library_name: transformers
  license: mit
+ pipeline_tag: token-classification
+ tags:
+ - sentence-boundary-detection
+ - sentence-splitting
+ - token-classification
+ - multilingual
  language:
  - it
  - en
  base_model:
  - FacebookAI/xlm-roberta-base
+ datasets:
+ - LorenzoVentrone/SentenceSplitter-dataset
  ---
 
+ # Sentence Boundary Disambiguation for Complex and Legal Texts
+
+ ## Model Description
+ This model is a multilingual Sentence Boundary Disambiguation (SBD) system built by fine-tuning XLM-RoBERTa base for token classification.
+
+ It predicts:
+ - 1 for end-of-sentence tokens
+ - 0 for non-boundary tokens
+
+ The model is optimized for difficult formatting and punctuation patterns, including legal citations, abbreviations, decimals, nested punctuation, and mixed Italian/English text.
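Downstream, the per-token 0/1 predictions are turned into sentences by cutting after each token labeled 1. A minimal sketch of that decoding step, assuming tokens and labels are already aligned (model loading and tokenization are omitted; `labels_to_sentences` is an illustrative helper, not part of the released code):

```python
def labels_to_sentences(tokens, labels):
    """Group tokens into sentences, cutting after each token labeled 1 (end of sentence)."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a final boundary
        sentences.append(" ".join(current))
    return sentences

tokens = ["La", "S.p.A.", "ha", "sede", "a", "Roma", ".", "Art.", "5", "si", "applica", "."]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(labels_to_sentences(tokens, labels))
# → ['La S.p.A. ha sede a Roma .', 'Art. 5 si applica .']
```

Note that abbreviation periods like "S.p.A." carry label 0, so they never trigger a split.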
 
 
 
+
+ Current model version: SentenceSplitterModelV4
+ ## Data and Splits
+
+ Training data is built with a unified pipeline from three sources:
+ 1. The professor-provided corpus in sent_split_data.tar.gz
+ 2. The MultiLegalSBD corpus
+ 3. Wikipedia (IT and EN)
+
+ Important update in this version:
+ - Only professor files ending with -train.sent_split are used
+ - Only legal files ending with train.jsonl are used
+ - This avoids contamination from dev and test files during training-data creation
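The train-only filter above amounts to a simple filename-suffix check; a sketch (the actual pipeline script is not shown here, and `is_training_file` is an illustrative name):

```python
def is_training_file(name):
    """Keep only training shards; dev/test files are excluded to avoid contamination."""
    return name.endswith("-train.sent_split") or name.endswith("train.jsonl")

files = ["it-train.sent_split", "it-dev.sent_split", "legal_train.jsonl", "legal_test.jsonl"]
print([f for f in files if is_training_file(f)])
# → ['it-train.sent_split', 'legal_train.jsonl']
```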
+ Published dataset repo:
+ - LorenzoVentrone/SentenceSplitter-dataset
+
+ Published splits:
+ - train
+ - validation
+ - test_adversarial
+
+ Upload pipeline update:
+ - The model and tokenizer are pushed to LorenzoVentrone/SentenceSplitter-it-en
+ - The dataset splits are pushed to LorenzoVentrone/SentenceSplitter-dataset in the same run
+
+ ## Training Procedure
+
+ Backbone:
+ - xlm-roberta-base
+
+ Context setup:
+ - Window size: 128 tokens
+ - Stride: 100 (overlapping windows prevent context loss at chunk edges)
+
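With a 128-token window and stride 100, consecutive chunks overlap by 28 tokens, so no boundary decision is made at a hard chunk edge without surrounding context. A hedged sketch of the chunking (integer IDs stand in for real tokenizer output; `sliding_windows` is an illustrative helper):

```python
def sliding_windows(token_ids, window=128, stride=100):
    """Split a long token sequence into overlapping windows (overlap = window - stride)."""
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return windows

ids = list(range(300))
chunks = sliding_windows(ids)
print([len(c) for c in chunks])  # → [128, 128, 100]
print(chunks[1][0])              # second window starts at token 100 → 100
```

At inference time, predictions in the overlapped region can be taken from the window where the token sits further from the edge.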
+ Training hyperparameters:
+ - Epochs: 4
+ - Train batch size: 16
+ - Eval batch size: 16
+ - Learning rate: 2e-5
+ - Weight decay: 0.01
+ - Warmup steps: 480
+ - Eval strategy: epoch
+ - Save strategy: epoch
+ - Best-model selection metric: eval_loss
+ - Seed: 42
+
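These settings correspond to a Hugging Face `TrainingArguments` configuration along the following lines (a sketch, not the exact training script; the output directory is illustrative, and the argument names follow recent `transformers` releases):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sentence-splitter-v4",  # illustrative path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=480,
    eval_strategy="epoch",              # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,            # lower eval_loss is better
    seed=42,
)
```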
+ ## Evaluation on Adversarial Test Set
+
+ Classification report for SentenceSplitterModel on the test_adversarial split:
 
  | Class | Precision | Recall | F1-score | Support |
  |---|---:|---:|---:|---:|
+ | Word (0) | 0.9992 | 0.9759 | 0.9874 | 1244 |
+ | Sentence Boundary (1) | 0.8454 | 0.9939 | 0.9136 | 165 |
+ | Accuracy | | | 0.9780 | 1409 |
+ | Macro avg | 0.9223 | 0.9849 | 0.9505 | 1409 |
+ | Weighted avg | 0.9812 | 0.9780 | 0.9788 | 1409 |
+
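The aggregate rows follow the standard definitions (macro = unweighted mean over the two classes, weighted = support-weighted mean); a quick arithmetic check against the per-class values above:

```python
# Per-class values from the table: (precision, recall, f1, support)
word = (0.9992, 0.9759, 0.9874, 1244)
boundary = (0.8454, 0.9939, 0.9136, 165)
total = word[3] + boundary[3]  # 1409 tokens in the adversarial test set

macro_f1 = round((word[2] + boundary[2]) / 2, 4)
weighted_precision = round((word[0] * word[3] + boundary[0] * boundary[3]) / total, 4)

print(macro_f1)            # → 0.9505
print(weighted_precision)  # → 0.9812
```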
+ ## Notes on Behavior
+ The model strongly prioritizes boundary recall on adversarial data, which is useful when missed sentence boundaries are costly. In some edge cases this can slightly reduce precision, producing extra splits around ambiguous punctuation.
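If the extra splits matter downstream, a common mitigation (a standard post-processing sketch, not part of this model's released code) is to replace the argmax decision with a probability threshold on the boundary class, splitting only when the model is sufficiently confident:

```python
import math

def boundary_decisions(logits, threshold=0.5):
    """Turn per-token (non-boundary, boundary) logit pairs into 0/1 labels.

    threshold=0.5 reproduces argmax; raising it trades recall for precision.
    """
    labels = []
    for non_b, b in logits:
        p_boundary = 1.0 / (1.0 + math.exp(non_b - b))  # softmax over the two classes
        labels.append(1 if p_boundary >= threshold else 0)
    return labels

logits = [(2.0, -1.0), (0.1, 0.4), (-3.0, 2.5)]
print(boundary_decisions(logits))                 # argmax-equivalent → [0, 1, 1]
print(boundary_decisions(logits, threshold=0.9))  # stricter: fewer splits → [0, 0, 1]
```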
+
+ ## Intended Use
+ - Legal and academic text pre-processing
+ - Robust multilingual sentence splitting in noisy or punctuation-dense documents
+ - Downstream pipelines requiring high sentence-boundary recall
+
+ ## Limitations
+ - Extremely ambiguous punctuation patterns can still produce occasional false positives
+ - Performance can vary on domains very distant from legal, academic, or general encyclopedic text