LorenzoVentrone committed
Commit 941fa52 · verified · 1 Parent(s): 6ee48df

Update README.md

Files changed (1):
  1. README.md +73 -44

README.md CHANGED
@@ -1,70 +1,99 @@
  ---
  library_name: transformers
  license: mit
- datasets:
- - wikimedia/wikipedia
  language:
  - it
  - en
  base_model:
  - FacebookAI/xlm-roberta-base
  ---
- # Sentence Boundary Disambiguation (SBD) for Complex & Legal Texts

- ## 📖 Model Description
- This model is a robust, cross-lingual Sentence Boundary Disambiguation (SBD) system built by fine-tuning **XLM-RoBERTa** (`xlm-roberta-base`). It is specifically engineered to handle highly complex formatting, such as legal documents, academic papers, nested parentheses, decimals, and obscure abbreviations (e.g., *n.d.r.*, *S.p.A.*, *U.S.A.*, *et al.*), without erroneously splitting sentences.

- - **Developed for:** NLP Hackathon
- - **Language(s):** Multilingual (heavily optimized for Italian and English)
- - **Base Model:** `xlm-roberta-base`
- - **Task:** Token Classification (NER-style binary classification: `1` for End-Of-Sentence, `0` otherwise)

- ## 🗂️ Training Data (Hybrid Approach)
- To prevent domain overfitting (bias collapse) and ensure both strict grammatical accuracy and resilience to edge cases, the model was trained on a carefully balanced hybrid dataset (~25,000 chunks):
- 1. **Target Domain Data (~40%):** Custom academic and hackathon-specific texts.
- 2. **MultiLegalSBD (IT & EN) (~25%):** "Gold standard" legal texts containing extreme edge cases, citations, and numbering to teach the model not to split on legal abbreviations.
- 3. **Wikimedia/Wikipedia (IT & EN) (~35%):** Generalist texts (bootstrapped via NLTK) to recalibrate the model's weights and teach it standard punctuation rules (e.g., handling closing parentheses followed by periods).

- ## ⚙️ Training Procedure
- The model was fine-tuned using the Hugging Face `Trainer` with the following hyperparameters:
- - **Epochs:** 3
- - **Batch Size:** 16
- - **Learning Rate:** 2e-5
- - **Weight Decay:** 0.01
- - **Warmup Steps:** ~10% of total training steps
- - **Optimizer:** AdamW
- - **Context Window:** 128 tokens with a sliding-window stride of 100 to prevent context loss.
- ## 💻 How to Use
- You can easily load this model within your inference pipeline using the `transformers` library:

- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification

- # Ensure you have your Hugging Face token ready if the repository is private
- model_name = "YOUR_USERNAME/SentenceSplitter-MultiLegal-V2"
- hf_token = "YOUR_HF_TOKEN"

- tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
- model = AutoModelForTokenClassification.from_pretrained(model_name, token=hf_token)

- print("Model loaded successfully!")
- ```
- ## Evaluation Results

- Evaluation was run with `evaluation.py` on the test split generated from `unified_training_dataset`.

- ### Classification Report

  | Class | Precision | Recall | F1-score | Support |
  |---|---:|---:|---:|---:|
- | Word (0) | 0.9985 | 0.9983 | 0.9984 | 242929 |
- | Sentence Boundary (1) | 0.9685 | 0.9710 | 0.9697 | 12709 |
- | Accuracy | | | 0.9970 | 255638 |
- | Macro Avg | 0.9835 | 0.9847 | 0.9841 | 255638 |
- | Weighted Avg | 0.9970 | 0.9970 | 0.9970 | 255638 |

- ## ⚠️ Limitations & Bias
- While the model generalizes extremely well, it may occasionally exhibit "hyper-caution" when encountering nested citations combined with multiple punctuation marks at the end of paragraphs, opting not to split in order to preserve legal/academic quotation integrity.

  ---
  library_name: transformers
  license: mit
+ pipeline_tag: token-classification
+ tags:
+ - sentence-boundary-detection
+ - sentence-splitting
+ - token-classification
+ - multilingual
  language:
  - it
  - en
  base_model:
  - FacebookAI/xlm-roberta-base
+ datasets:
+ - LorenzoVentrone/SentenceSplitter-dataset
  ---
 
+ # Sentence Boundary Disambiguation for Complex and Legal Texts
+
+ ## Model Description
+ This model is a multilingual Sentence Boundary Disambiguation (SBD) system built by fine-tuning XLM-RoBERTa base for token classification.
+
+ It predicts:
+ - 1 for end-of-sentence tokens
+ - 0 for non-boundary tokens
+
+ The model is optimized for difficult formatting and punctuation patterns, including legal citations, abbreviations, decimals, nested punctuation, and mixed Italian/English text.
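Downstream, the per-token 0/1 predictions are turned into sentences by cutting after each token labeled 1. A minimal sketch of that decoding step, assuming tokens and labels are already aligned (model loading and tokenization are omitted; `labels_to_sentences` is an illustrative helper, not part of the released code):

```python
def labels_to_sentences(tokens, labels):
    """Group tokens into sentences, cutting after each token labeled 1 (end of sentence)."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a final boundary
        sentences.append(" ".join(current))
    return sentences

tokens = ["La", "S.p.A.", "ha", "sede", "a", "Roma", ".", "Art.", "5", "si", "applica", "."]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(labels_to_sentences(tokens, labels))
# → ['La S.p.A. ha sede a Roma .', 'Art. 5 si applica .']
```

Note that abbreviation periods like "S.p.A." carry label 0, so they never trigger a split.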
 
 
 
+
+ Current model version: SentenceSplitterModelV4
+ ## Data and Splits
+
+ Training data is built with a unified pipeline from three sources:
+ 1. The professor-provided corpus in sent_split_data.tar.gz
+ 2. The MultiLegalSBD corpus
+ 3. Wikipedia (IT and EN)
+
+ Important update in this version:
+ - Only professor files ending with -train.sent_split are used
+ - Only legal files ending with train.jsonl are used
+ - This avoids contamination from dev and test files during training-data creation
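The train-only filter above amounts to a simple filename-suffix check; a sketch (the actual pipeline script is not shown here, and `is_training_file` is an illustrative name):

```python
def is_training_file(name):
    """Keep only training shards; dev/test files are excluded to avoid contamination."""
    return name.endswith("-train.sent_split") or name.endswith("train.jsonl")

files = ["it-train.sent_split", "it-dev.sent_split", "legal_train.jsonl", "legal_test.jsonl"]
print([f for f in files if is_training_file(f)])
# → ['it-train.sent_split', 'legal_train.jsonl']
```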
+ Published dataset repo:
+ - LorenzoVentrone/SentenceSplitter-dataset
+
+ Published splits:
+ - train
+ - validation
+ - test_adversarial
+
+ Upload pipeline update:
+ - The model and tokenizer are pushed to LorenzoVentrone/SentenceSplitter-it-en
+ - The dataset splits are pushed to LorenzoVentrone/SentenceSplitter-dataset in the same run
+
+ ## Training Procedure
+
+ Backbone:
+ - xlm-roberta-base
+
+ Context setup:
+ - Window size: 128 tokens
+ - Stride: 100 (overlapping windows prevent context loss at chunk edges)
+
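With a 128-token window and stride 100, consecutive chunks overlap by 28 tokens, so no boundary decision is made at a hard chunk edge without surrounding context. A hedged sketch of the chunking (integer IDs stand in for real tokenizer output; `sliding_windows` is an illustrative helper):

```python
def sliding_windows(token_ids, window=128, stride=100):
    """Split a long token sequence into overlapping windows (overlap = window - stride)."""
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return windows

ids = list(range(300))
chunks = sliding_windows(ids)
print([len(c) for c in chunks])  # → [128, 128, 100]
print(chunks[1][0])              # second window starts at token 100 → 100
```

At inference time, predictions in the overlapped region can be taken from the window where the token sits further from the edge.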
+ Training hyperparameters:
+ - Epochs: 4
+ - Train batch size: 16
+ - Eval batch size: 16
+ - Learning rate: 2e-5
+ - Weight decay: 0.01
+ - Warmup steps: 480
+ - Eval strategy: epoch
+ - Save strategy: epoch
+ - Best-model selection metric: eval_loss
+ - Seed: 42
+
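These settings correspond to a Hugging Face `TrainingArguments` configuration along the following lines (a sketch, not the exact training script; the output directory is illustrative, and the argument names follow recent `transformers` releases):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sentence-splitter-v4",  # illustrative path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=480,
    eval_strategy="epoch",              # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,            # lower eval_loss is better
    seed=42,
)
```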
+ ## Evaluation on Adversarial Test Set
+
+ Classification report for SentenceSplitterModel on the test_adversarial split:
 
  | Class | Precision | Recall | F1-score | Support |
  |---|---:|---:|---:|---:|
+ | Word (0) | 0.9992 | 0.9759 | 0.9874 | 1244 |
+ | Sentence Boundary (1) | 0.8454 | 0.9939 | 0.9136 | 165 |
+ | Accuracy | | | 0.9780 | 1409 |
+ | Macro avg | 0.9223 | 0.9849 | 0.9505 | 1409 |
+ | Weighted avg | 0.9812 | 0.9780 | 0.9788 | 1409 |
+
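The aggregate rows follow the standard definitions (macro = unweighted mean over the two classes, weighted = support-weighted mean); a quick arithmetic check against the per-class values above:

```python
# Per-class values from the table: (precision, recall, f1, support)
word = (0.9992, 0.9759, 0.9874, 1244)
boundary = (0.8454, 0.9939, 0.9136, 165)
total = word[3] + boundary[3]  # 1409 tokens in the adversarial test set

macro_f1 = round((word[2] + boundary[2]) / 2, 4)
weighted_precision = round((word[0] * word[3] + boundary[0] * boundary[3]) / total, 4)

print(macro_f1)            # → 0.9505
print(weighted_precision)  # → 0.9812
```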
+ ## Notes on Behavior
+ The model strongly prioritizes boundary recall on adversarial data, which is useful when missed sentence boundaries are costly. In some edge cases this can slightly reduce precision, producing extra splits around ambiguous punctuation.
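If the extra splits matter downstream, a common mitigation (a standard post-processing sketch, not part of this model's released code) is to replace the argmax decision with a probability threshold on the boundary class, splitting only when the model is sufficiently confident:

```python
import math

def boundary_decisions(logits, threshold=0.5):
    """Turn per-token (non-boundary, boundary) logit pairs into 0/1 labels.

    threshold=0.5 reproduces argmax; raising it trades recall for precision.
    """
    labels = []
    for non_b, b in logits:
        p_boundary = 1.0 / (1.0 + math.exp(non_b - b))  # softmax over the two classes
        labels.append(1 if p_boundary >= threshold else 0)
    return labels

logits = [(2.0, -1.0), (0.1, 0.4), (-3.0, 2.5)]
print(boundary_decisions(logits))                 # argmax-equivalent → [0, 1, 1]
print(boundary_decisions(logits, threshold=0.9))  # stricter: fewer splits → [0, 0, 1]
```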
+
+ ## Intended Use
+ - Legal and academic text pre-processing
+ - Robust multilingual sentence splitting in noisy or punctuation-dense documents
+ - Downstream pipelines requiring high sentence-boundary recall
+
+ ## Limitations
+ - Extremely ambiguous punctuation patterns can still produce occasional false positives
+ - Performance can vary on domains very distant from legal, academic, or general encyclopedic text