---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- delibera
- municipal
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---

# Model Card for LexCube

This model is a BERT-based masked language model fine-tuned on Italian legal texts (municipal delibera domain*). It is designed to predict masked tokens in legal documents and to capture domain-specific semantic and syntactic structures.

*A delibera is a formal decision or resolution made by a local government body, such as a city council or municipal committee, that has official and legal effect.

### Model Description

This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**. WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
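
In the Hugging Face `transformers` library, WWM is available through `DataCollatorForWholeWordMask`. The exact training script and hyperparameters for this model are not published here, so the following is only an illustrative sketch (`mlm_probability=0.15` is the library default, assumed for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
# Masks all WordPiece pieces of a randomly chosen word together
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Words that split into multiple WordPiece pieces are masked as a unit,
# so the model cannot recover the word from an unmasked subpiece.
encoding = tokenizer("La presente deliberazione è immediatamente eseguibile.")
batch = collator([{"input_ids": encoding["input_ids"]}])
print(tokenizer.decode(batch["input_ids"][0]))  # randomly chosen words -> [MASK]
```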

- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / Infocube
- **Model type:** Transformer, BERT-based masked language model
- **Language(s):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`
## Uses

### Direct Use

The model can be used for:

- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks (see the sketch below)
- Transfer learning for other Italian legal NLP applications
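
For embedding use, the MLM head is not needed; the encoder can be loaded with `AutoModel` instead. A minimal sketch, assuming mean pooling over the last hidden state (one common pooling choice, not one prescribed by this card):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "InfocubeSrl/LexCube"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # encoder only, no MLM head
model.eval()

text = "Vista la deliberazione del Consiglio Comunale."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)

# Mean pooling over real tokens (padding excluded via the attention mask)
mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)                           # torch.Size([1, 768])
```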

## Bias, Risks, and Limitations

- Not suitable for general-purpose Italian NLP outside legal text.
### Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
## How to Get Started with the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InfocubeSrl/LexCube"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()  # inference mode: disable dropout

# Examples with [MASK]
examples = [
    "[MASK] il Decreto Legislativo 18 agosto 2000, n. 267 (Testo Unico delle leggi sull'ordinamento degli Enti Locali)",
    "ACQUISITI, ai sensi dell'art. [MASK] del D.Lgs. 267/2000, i pareri favorevoli di regolarità tecnica e di regolarità contabile",
    "Visto gli art. [MASK] e 42 del D.Lgs n.267/2000, Testo unico degli enti locali.",
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed at inference
        outputs = model(**inputs)

    # Find the position of the [MASK] token
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    # Take the highest-scoring vocabulary entry at the masked position
    predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_id)

    print(f"Input: {text}")
    print(f"Prediction: {predicted_token}\n")
```
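
For quick experiments, the `fill-mask` pipeline wraps the same steps and returns the top candidates with their scores:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="InfocubeSrl/LexCube")

# Top candidates for the masked token, with probabilities
for candidate in fill_mask(
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] "
    "ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000."
):
    print(candidate["token_str"], round(candidate["score"], 3))
```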

### Training Data

- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts (municipal delibera domain)
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
- **Avg. length:** ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research
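
Because BERT-style encoders accept at most 512 tokens, documents of this length must be segmented before MLM training. The segmentation strategy used for this model is not documented; below is a minimal sketch of one common approach, overlapping fixed-size windows via the tokenizer's overflow support (the stride of 64 is an illustrative choice, not the card's recipe):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")

def chunk_document(text: str, max_length: int = 512, stride: int = 64):
    """Tokenize `text` into overlapping windows that fit the encoder."""
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,                    # tokens shared between adjacent windows
        truncation=True,
        return_overflowing_tokens=True,   # emit every window, not just the first
    )
    return enc["input_ids"]               # list of windows, each <= 512 ids

windows = chunk_document("VISTO il D.Lgs. 18 agosto 2000, n. 267. " * 400)
print(len(windows), [len(w) for w in windows[:3]])
```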