---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- delibera
- municipal
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---
# Model Card for LexCube
This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts (municipal delibera domain*). It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.
*A delibera is a formal decision or resolution made by a local government body, like a city council or municipal committee, that has official and legal effect.
### Model Description
This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / infocube
- **Model type:** Transformer, BERT-based Masked Language Model
- **Language(s):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`
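
The whole-word masking idea described above can be illustrated with a minimal, self-contained sketch. It is not the training code used for this model: the helper names (`group_whole_words`, `whole_word_mask`), the 15% default masking rate, and the example tokenization are assumptions for illustration only. It also simplifies the usual BERT recipe by always substituting `[MASK]` and by ignoring special tokens such as `[CLS]` and `[SEP]`.

```python
import random


def group_whole_words(tokens):
    """Group WordPiece token indices into words ('##' marks a continuation piece)."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)  # continuation: attach to the previous word
        else:
            words.append([i])  # start of a new word
    return words


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words: every subword piece of a chosen word is replaced together."""
    rng = random.Random(seed)
    words = group_whole_words(tokens)
    # Hypothetical policy: mask at least one word, otherwise ~mask_prob of the words.
    n_to_mask = max(1, round(len(words) * mask_prob))
    chosen = rng.sample(words, n_to_mask)
    masked = list(tokens)
    for word in chosen:
        for i in word:
            masked[i] = mask_token
    return masked


# Hypothetical WordPiece split of a short phrase (not actual tokenizer output).
tokens = ["la", "deliber", "##azione", "è", "immediatamente", "esegui", "##bile"]
print(whole_word_mask(tokens, mask_prob=0.3, seed=0))
```

The point of the grouping step is that `deliber` and `##azione` are never masked separately, so the model cannot recover one piece of a legal term from the other.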
## Uses
### Direct Use
The model can be used for:
- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks
- Transfer learning for other Italian legal NLP applications
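
For the embedding use case, one common approach is to average the encoder's token vectors while ignoring padding (masked mean pooling). The card does not say which pooling strategy is recommended, so the sketch below is an assumption, and `mean_pool` is a hypothetical helper written in plain Python with toy vectors rather than real model outputs.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padded positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy 2-dimensional vectors standing in for the encoder's hidden states.
vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and must not affect the result
print(mean_pool(vectors, mask))  # [2.0, 3.0]
```

With the actual model, the token vectors would typically come from the `last_hidden_state` of the encoder loaded with `AutoModel.from_pretrained`, and the mask from the tokenizer's `attention_mask`.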
## Bias, Risks, and Limitations
- Not suitable for general-purpose Italian NLP outside legal text.
### Recommendations
Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Examples with [MASK]
examples = [
    "[MASK] il Decreto Legislativo 18 agosto 2000, n. 267 (Testo Unico delle leggi sull'ordinamento degli Enti Locali)",
    "ACQUISITI, ai sensi dell'art. [MASK] del D.Lgs. 267/2000, i pareri favorevoli di regolarità tecnica e di regolarità contabile",
    "Visto gli art. [MASK] e 42 del D.Lgs n.267/2000, Testo unico degli enti locali.",
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)

    # Find mask token position
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    # Get top prediction
    predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_id)

    print(f"Input: {text}")
    print(f"Prediction: {predicted_token}\n")
```
### Training Data
- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts (municipal delibera domain)
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
- **Average length:** ~909 words (≈2,193 tokens) per document; some documents exceed 11,000 tokens
- **Confidentiality:** The raw dataset cannot be shared because of contractual agreements; it has been analyzed statistically and linguistically for research purposes.
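
The document lengths above are well beyond the 512-token input limit of standard BERT-base models, so long documents have to be split before they reach the model. The card does not describe how this was handled during training. The sketch below shows one common option, overlapping windows, purely as an assumption; `chunk_token_ids` and its default `overlap` of 128 tokens are hypothetical choices, not the authors' settings.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=128, n_special=2):
    """Split token ids into overlapping windows that fit max_len once
    n_special special tokens (e.g. [CLS] and [SEP]) are added to each window."""
    window = max_len - n_special
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if not token_ids:
        return []
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return chunks


# A stand-in for an average-length document of 2,193 tokens.
doc = list(range(2193))
chunks = chunk_token_ids(doc)
print(len(chunks), max(len(c) for c in chunks))  # 6 510
```

The overlap keeps some shared context between neighbouring windows, which matters for cross-citations that would otherwise be cut at a window boundary.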