## Fine-tuning Method
The model was fine-tuned on Italian legal texts using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
- **Whole Word Masking** ensures that when a word is masked, all its subword tokens are masked together.
- **Masking probability**: 15%
- Implemented via `DataCollatorForWholeWordMask` from Hugging Face Transformers.
This approach improves the model’s understanding of complex legal terminology and encourages learning deeper contextual representations.
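To make the masking strategy concrete, here is a minimal, self-contained sketch of whole-word masking over WordPiece tokens (where continuation pieces start with `##`). This is an illustration of the idea only; the actual training used `DataCollatorForWholeWordMask` from Hugging Face Transformers, and the function name and tokens below are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Mask whole words: a word and all of its '##' continuation
    subword tokens are masked together, never independently."""
    rng = rng or random.Random(0)
    # Group subword indices into whole words (WordPiece convention:
    # a token starting with '##' continues the previous word).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        # Each whole word is masked with probability mask_prob;
        # when it is, every one of its subword tokens is replaced.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK_TOKEN
    return masked

# Hypothetical tokenization of "il decreto legislativo":
tokens = ["il", "decreto", "legis", "##lativo"]
print(whole_word_mask(tokens, mask_prob=0.15))
```

The key property is the invariant this enforces: "legis" and "##lativo" are always masked or kept together, so the model must predict the full word "legislativo" from context rather than recovering it from a surviving subword.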
## Dataset
- **Source**: Provided by *Infocube*, a company specializing in legal data solutions.
- **Size**: 15,646 documents.
- **Language**: Italian.
- **Domain**: Legal and administrative texts.
- **Characteristics**:
  - Formal and technical legal language.
  - Frequent references to laws, decrees, and legislative articles.
  - Structured format with numbered provisions and cross-citations.
  - Average length: ~909 words (≈2,193 tokens) per document; some documents exceed 11k tokens.
- **Confidentiality**: Due to contractual agreements, the raw dataset cannot be shared.