## Fine-tuning Method
The model was fine-tuned on Italian legal texts using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
- **Whole Word Masking** ensures that when a word is masked, all its subword tokens are masked together.
- **Masking probability**: 15%
- Implemented via `DataCollatorForWholeWordMask` from Hugging Face Transformers.
This approach improves the model’s understanding of complex legal terminology and encourages learning deeper contextual representations.
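To make the masking strategy concrete, here is a minimal, self-contained sketch of whole-word masking over WordPiece tokens (where continuation pieces start with `##`). This is an illustration of the idea only; the actual training used `DataCollatorForWholeWordMask` from Hugging Face Transformers, and the function name and tokens below are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Mask whole words: a word and all of its '##' continuation
    subword tokens are masked together, never independently."""
    rng = rng or random.Random(0)
    # Group subword indices into whole words (WordPiece convention:
    # a token starting with '##' continues the previous word).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        # Each whole word is masked with probability mask_prob;
        # when it is, every one of its subword tokens is replaced.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK_TOKEN
    return masked

# Hypothetical tokenization of "il decreto legislativo":
tokens = ["il", "decreto", "legis", "##lativo"]
print(whole_word_mask(tokens, mask_prob=0.15))
```

The key property is the invariant this enforces: "legis" and "##lativo" are always masked or kept together, so the model must predict the full word "legislativo" from context rather than recovering it from a surviving subword.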
## Dataset
- **Source**: Provided by *Infocube*, a company specializing in legal data solutions.
- **Size**: 15,646 documents.
- **Language**: Italian.
- **Domain**: Legal and administrative texts.
- **Characteristics**:
  - Formal and technical legal language.
  - Frequent references to laws, decrees, and legislative articles.
  - Structured format with numbered provisions and cross-citations.
  - Average length: ~909 words (≈2,193 tokens) per document; some documents exceed 11k tokens.
- **Confidentiality**: Due to contractual agreements, the raw dataset cannot be shared.