---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- delibera
- municipal
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---
# Model Card for LexCube

This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts (municipal delibera domain*). It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.

*A delibera is a formal decision or resolution adopted by a local government body, such as a city council or municipal committee, and has official legal effect.

### Model Description

This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.  
WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
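
To make the grouping idea concrete, here is a minimal sketch of whole word masking over WordPiece tokens. It is illustrative only and is not the training code used for this model: the function name `whole_word_mask`, the 15% default masking rate, and the simplification of always substituting `[MASK]` (rather than the usual 80/10/10 replacement scheme) are all assumptions made for the example.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words in a list of WordPiece tokens (hypothetical helper).

    Tokens starting with "##" are continuation pieces and are grouped with
    the token before them, so a word is masked entirely or not at all.
    Returns the masked token list and the sorted masked positions.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Group token indices into words; special tokens are never candidates.
    words = []
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if tok.startswith("##") and words and words[-1][-1] == i - 1:
            words[-1].append(i)
        else:
            words.append([i])

    # Choose how many words to mask (at least one if any word exists).
    n_to_mask = max(1, round(len(words) * mask_prob)) if words else 0
    chosen = rng.sample(words, n_to_mask)

    # Replace every subword piece of each chosen word.
    masked = list(tokens)
    for word in chosen:
        for i in word:
            masked[i] = mask_token
    return masked, sorted(i for word in chosen for i in word)


tokens = ["[CLS]", "la", "delibera", "##zione", "è", "esegui", "##bile", "[SEP]"]
masked, positions = whole_word_mask(tokens, mask_prob=0.5, seed=0)
print(masked)
```

In the example, `delibera` and `##zione` always share the same fate, which is the property that distinguishes WWM from masking subword tokens independently.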



- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / Infocube
- **Model type:** Transformer, BERT-based Masked Language Model  
- **Language(s):** Italian  
- **License:** Apache-2.0  
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`  


## Uses


### Direct Use

The model can be used for:
- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)  
- Embedding legal text for downstream NLP tasks  
- Transfer learning for other Italian legal NLP applications
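
For the embedding use case, the card does not prescribe a pooling strategy. One common choice is masked mean pooling: average the per-token vectors while ignoring padding positions. The sketch below shows only the arithmetic in plain Python; the helper name `masked_mean_pool` is hypothetical. With `transformers`, the same computation would typically be applied to the encoder's `last_hidden_state` and the tokenizer's `attention_mask` using tensor operations.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0.

    hidden_states: list of token vectors (each a list of floats)
    attention_mask: list of 0/1 flags, one per token (1 = real token)
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask differ in length")

    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, flag in zip(hidden_states, attention_mask) if flag]
    if not kept:
        raise ValueError("attention_mask selects no tokens")

    # Average each dimension over the kept tokens.
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: two real tokens followed by one padding token.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(vectors, mask))
```

Excluding padded positions matters when sentences of different lengths are batched together; otherwise padding vectors would pull short texts toward a common point.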


## Bias, Risks, and Limitations

- Not suitable for general-purpose Italian NLP outside legal text.  


### Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.


## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Examples with [MASK]
examples = [
    "[MASK] il Decreto Legislativo 18 agosto 2000, n. 267 (Testo Unico delle leggi sull'ordinamento degli Enti Locali)",
    "ACQUISITI, ai sensi dell'art. [MASK] del D.Lgs. 267/2000, i pareri favorevoli di regolarità tecnica e di regolarità contabile",
    "Visto gli art. [MASK] e 42 del D.Lgs n.267/2000, Testo unico degli enti locali.",
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000."
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)

    # Find the position of the [MASK] token (each example contains exactly one)
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    # Take the highest-scoring vocabulary id at the masked position
    predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_id)

    print(f"Input: {text}")
    print(f"Prediction: {predicted_token}\n")
```


### Training Data

- **Source:** Provided by *Infocube*  
- **Size:** 15,646 documents  
- **Language:** Italian  
- **Domain:** Legal and administrative texts (municipal delibera domain)
  - Formal and technical legal language  
  - Frequent references to laws, decrees, and legislative articles  
  - Structured format with numbered provisions and cross-citations  
  - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens  
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research purposes.
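
Since the average document (≈2,193 tokens) is far longer than the 512 positions a BERT-base encoder typically accepts, long documents have to be split before they reach the model. The card does not describe how this was done during training; the sketch below shows one common approach, overlapping sliding windows. The function name `split_into_windows` and the `window` and `overlap` values are assumptions chosen for illustration.

```python
def split_into_windows(token_ids, window=510, overlap=64):
    """Split a long list of token ids into overlapping windows.

    window=510 leaves room for [CLS] and [SEP] in a 512-position model.
    Consecutive windows share `overlap` tokens so that context is not
    cut abruptly at a window boundary.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if not token_ids:
        return []

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows


# A document of average length for this corpus (ids are placeholders).
doc = list(range(2193))
chunks = split_into_windows(doc)
print(len(chunks), [len(c) for c in chunks])
```

Each window would then be wrapped with the special tokens and processed independently; how predictions or embeddings from the windows are combined afterwards depends on the downstream task.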