---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---

# Model Card for LexCube

This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.

### Model Description

This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.

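
The grouping step behind whole word masking can be illustrated with a small, dependency-free sketch. It is not the training code used for this model: the function name `whole_word_mask`, the 15% default masking rate, and the example tokenization are assumptions made for illustration, and the sketch always substitutes `[MASK]` instead of applying BERT's usual mix of mask, random, and unchanged replacements.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words in a list of WordPiece tokens (illustrative helper).

    Pieces that start with "##" continue the previous word, so they are
    grouped with it and masked (or kept) together.
    """
    rng = random.Random(seed)

    # Group token positions into words: a "##" piece joins the previous group.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for positions in words:
        # Special tokens such as [CLS] and [SEP] are never masked.
        if tokens[positions[0]] in ("[CLS]", "[SEP]"):
            continue
        # One random draw per word decides the fate of all its pieces.
        if rng.random() < mask_prob:
            for i in positions:
                masked[i] = mask_token
    return masked


# Hypothetical WordPiece split; the real tokenizer may segment differently.
tokens = ["[CLS]", "il", "decreto", "legis", "##la", "##tivo",
          "viene", "abroga", "##to", "[SEP]"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

Because the random draw happens once per word and not once per token, a term such as `legis ##la ##tivo` is either fully visible or fully hidden. The `transformers` library has also shipped a `DataCollatorForWholeWordMask` collator; whether it was used for this model is not stated here.
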
- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / Infocube
- **Model type:** Transformer, BERT-based Masked Language Model
- **Language(s):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`

## Uses

### Direct Use

The model can be used for:
- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks
- Transfer learning for other Italian legal NLP applications

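
For the embedding use case, one common approach is to average the encoder's token vectors while ignoring padding. The sketch below assumes that approach; the card does not prescribe a pooling strategy, and the helper names `mean_pool` and `embed` are hypothetical. Loading the checkpoint with `AutoModel` keeps only the encoder, so `transformers` may warn about unused masked-LM head weights.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over hidden dims
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


def embed(texts, model_name="InfocubeSrl/LexCube"):
    """Return one vector per input text (downloads the model on first call)."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Inputs longer than 512 tokens are truncated in this sketch.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden, batch["attention_mask"])
```

Calling `embed(["La legge è approvata dal parlamento."])` would return a tensor with one row per text; for a BERT-base encoder each row has 768 dimensions.
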
## Bias, Risks, and Limitations

- Not suitable for general-purpose Italian NLP outside legal text.

### Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"

# Load the tokenizer and the fine-tuned masked language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "La legge [MASK] approvata dal parlamento."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)

# Position of the first [MASK] token in the input sequence
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
mask_index = mask_positions[0]

# Highest-scoring vocabulary id at that position
predicted_id = outputs.logits[0, mask_index].argmax().item()
predicted_token = tokenizer.decode([predicted_id])

print("Prediction:", predicted_token)
```

### Training Data

- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
- **Avg. length:** ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research

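
The document lengths above matter in practice because a BERT-base encoder accepts at most 512 positions per input, so a typical document has to be split before the model can read it. How long documents were handled during fine-tuning is not stated; the sketch below shows one common option, overlapping windows, under that assumption. The function name `sliding_windows` and the `max_len` and `stride` defaults are illustrative choices.

```python
def sliding_windows(token_ids, max_len=510, stride=128):
    """Split a long sequence of token ids into overlapping windows.

    max_len=510 leaves room for [CLS] and [SEP] within 512 positions;
    stride is the number of tokens shared by consecutive windows.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")

    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# Stand-in for a tokenized document of about 11k tokens.
doc = list(range(11_000))
chunks = sliding_windows(doc)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

The overlap keeps some context on both sides of every cut, which helps when a citation or provision spans a window boundary. Hugging Face tokenizers can produce similar windows directly through the `stride` and `return_overflowing_tokens` arguments.
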