HYDARIM7 committed
Commit c19fe8d · verified · 1 Parent(s): 50444a8

Update README.md

LexCube is a BERT-based Masked Language Model fine-tuned on Italian legal texts to predict masked tokens ([MASK]) and capture domain-specific semantic and syntactic structures. It uses Masked Language Modeling (MLM) with Whole Word Masking (WWM), masking entire words rather than subword tokens to encourage deeper contextual learning of multi-token legal terms and complex sentence structures.
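
For illustration, this whole-word masking scheme matches what the Hugging Face `DataCollatorForWholeWordMask` implements. The sketch below applies it to the base checkpoint named later in this card; the 15% masking probability is an assumption (the standard BERT default), not a value confirmed here.

```python
# Minimal sketch of Whole Word Masking with Hugging Face transformers.
# The 15% masking probability is an assumption (the standard BERT default).
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# All subword pieces of a selected word are masked together, so a multi-token
# legal term is either fully visible or fully hidden.
encoding = tokenizer("Il decreto legislativo è stato pubblicato in Gazzetta Ufficiale.")
batch = collator([encoding])
print(tokenizer.decode(batch["input_ids"][0]))
```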

The model was trained on 15,646 Italian legal documents from Infocube, including legislative acts, court rulings, resolutions, and regulatory communications. The texts are formal, technical, and highly structured, with numbered provisions and frequent references to laws and decrees. Tokenization produces an average of ~2,193 tokens per document, with some exceeding 11,000 tokens.
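
Since BERT-style encoders accept at most 512 tokens while these documents average ~2,193, long texts have to be split into windows before MLM training. How this was done for LexCube is not documented; the snippet below is one plausible sketch, with the 128-token stride an arbitrary illustrative choice.

```python
# Sketch: split a long legal document into overlapping 512-token windows.
# The 128-token stride is an illustrative assumption, not a documented setting.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")

def chunk_document(text: str, max_length: int = 512, stride: int = 128):
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,  # keep every window, not just the first
    )
    return enc["input_ids"]  # list of token-id windows, each at most max_length

windows = chunk_document("Art. 1. La presente legge disciplina ... " * 300)
print(len(windows), len(windows[0]))
```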

LexCube is suitable for:

- Predicting masked tokens in legal texts
- Generating embeddings for legal NLP tasks (see the sketch after this list)
- Transfer learning for downstream applications such as classification, NER, or legal QA
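
To illustrate the embedding use case above, a common recipe (a sketch, not an officially documented one for this model) is to load the encoder with `AutoModel` and mean-pool the last hidden state over non-padding positions:

```python
# Sketch: sentence embeddings via attention-masked mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "InfocubeSrl/LexCube"  # repo id taken from the usage example below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # encoder only, no MLM head

sentences = [
    "Il contratto è nullo per difetto di forma.",
    "La sentenza è stata impugnata in appello.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (batch, seq_len, hidden)

mask = inputs["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # e.g. torch.Size([2, 768])
```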

Due to confidentiality agreements, the raw dataset cannot be shared. However, statistical and linguistic analyses confirm its suitability for MLM pretraining in the Italian legal domain.

Limitations:

- Not suitable for general-purpose Italian NLP outside legal text
- Outputs should not be used for legal decision-making without expert supervision

Files changed (1): README.md (+78 -1)
README.md CHANGED
@@ -12,4 +12,81 @@ metrics:
  - perplexity
  pipeline_tag: fill-mask
  library_name: transformers
- ---
+ ---
+ # Model Card for LexCube
+
+ This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.
+
+ ### Model Description
+
+ This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
+ WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
+
+ - **Developed by:** Mohammad Mahdi Heydari Asl / Infocube
+ - **Funded by:** [More Information Needed]
+ - **Shared by:** HYDARIM7
+ - **Model type:** Transformer, BERT-based Masked Language Model
+ - **Language(s) (NLP):** Italian
+ - **License:** Apache-2.0
+ - **Finetuned from model:** `dbmdz/bert-base-italian-uncased`
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used for:
+ - Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
+ - Embedding legal text for downstream NLP tasks
+ - Transfer learning for other Italian legal NLP applications
+
+ ## Bias, Risks, and Limitations
+
+ - Not suitable for general-purpose Italian NLP outside legal text.
+
+ ### Recommendations
+
+ Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
+
+ ## How to Get Started with the Model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model_name = "InfocubeSrl/LexCube"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForMaskedLM.from_pretrained(model_name)
+
+ text = "La legge [MASK] approvata dal parlamento."
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Locate the [MASK] position and take the highest-scoring token there.
+ mask_token_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
+ predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
+ print("Prediction:", tokenizer.decode(predicted_token_id))
+ ```
+
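+ Equivalently, since the front matter declares the `fill-mask` pipeline tag, the same prediction can be made with the high-level pipeline API (a brief sketch):
+
+ ```python
+ from transformers import pipeline
+
+ fill = pipeline("fill-mask", model="InfocubeSrl/LexCube")
+ # Each candidate is a dict with "token_str", "score", and the filled "sequence".
+ print(fill("La legge [MASK] approvata dal parlamento.")[0]["token_str"])
+ ```
+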
+ ### Training Data
+
+ - **Source:** Provided by *Infocube*
+ - **Size:** 15,646 documents
+ - **Language:** Italian
+ - **Domain:** Legal and administrative texts
+ - Formal and technical legal language
+ - Frequent references to laws, decrees, and legislative articles
+ - Structured format with numbered provisions and cross-citations
+ - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
+ - **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed to confirm its suitability for MLM pretraining