HYDARIM7 committed on
Commit 50444a8 · verified · 1 Parent(s): 69c7bbf

Update README.md


## Fine-tuning Method
The model was fine-tuned on Italian legal texts using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
- **Whole Word Masking** ensures that when a word is masked, all its subword tokens are masked together.
- **Masking probability**: 15%
- Implemented via `DataCollatorForWholeWordMask` from Hugging Face Transformers.

This approach improves the model’s understanding of complex legal terminology and encourages it to learn deeper contextual representations.
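The WWM behaviour described above can be illustrated with a small, self-contained sketch. Note this is a toy illustration, not the training code: the actual fine-tuning uses `DataCollatorForWholeWordMask` from Hugging Face Transformers, and the tokens and helper function below are hypothetical.

```python
import random

def whole_word_mask(tokens, mlm_probability=0.15, seed=0):
    """Toy illustration of Whole Word Masking over WordPiece tokens.

    Subwords prefixed with "##" continue the previous word. When a word
    is selected for masking, every one of its subword tokens is replaced
    by [MASK] together -- the guarantee WWM adds over plain token-level
    masking. (In practice, DataCollatorForWholeWordMask handles this.)
    """
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mlm_probability:
            for i in word:
                masked[i] = "[MASK]"
    return masked

# "legislativo" is split into two subwords; masking it masks both pieces.
print(whole_word_mask(["decreto", "legis", "##lativo"], mlm_probability=1.0))
# → ['[MASK]', '[MASK]', '[MASK]']
```

With token-level masking, only one of `legis` / `##lativo` might be masked, leaking part of the word; masking the whole word forces the model to predict it from the surrounding legal context.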
## Dataset
- **Source**: Provided by *Infocube*, a company specializing in legal data solutions.
- **Size**: 15,646 documents.
- **Language**: Italian.
- **Domain**: Legal and administrative texts.
- **Characteristics**:
  - Formal and technical legal language.
  - Frequent references to laws, decrees, and legislative articles.
  - Structured format with numbered provisions and cross-citations.
  - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens.
- **Confidentiality**: Due to contractual agreements, the raw dataset cannot be shared.
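Length statistics like those quoted above can be computed with a short sketch. The corpus itself is confidential, so a toy document list stands in for it here, and whitespace splitting is used as a proxy for word counts (real token counts would come from the model's tokenizer); the `length_stats` helper is hypothetical.

```python
def length_stats(documents):
    """Word-level length statistics for a list of document strings."""
    word_counts = [len(doc.split()) for doc in documents]
    return {
        "num_documents": len(documents),
        "avg_words": sum(word_counts) / len(word_counts),
        "max_words": max(word_counts),
    }

# Toy stand-ins for the confidential Infocube documents.
stats = length_stats([
    "Visto il decreto legislativo n. 50 del 2016",
    "Ai sensi dell'articolo 3 comma 1",
])
print(stats)
# → {'num_documents': 2, 'avg_words': 7.0, 'max_words': 8}
```

For the token-level figures (≈2,193 tokens per document), the same loop would count `len(tokenizer(doc)["input_ids"])` instead of whitespace-split words.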

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
```diff
@@ -8,4 +8,8 @@ tags:
 - legal
 - italian
 - infocube
+metrics:
+- perplexity
+pipeline_tag: fill-mask
+library_name: transformers
 ---
```