HYDARIM7 committed
Commit c19fe8d · verified · 1 Parent(s): 50444a8

Update README.md

LexCube is a BERT-based Masked Language Model fine-tuned on Italian legal texts to predict masked tokens ([MASK]) and capture domain-specific semantic and syntactic structures. It uses Masked Language Modeling (MLM) with Whole Word Masking (WWM), masking entire words rather than subword tokens to encourage deeper contextual learning of multi-token legal terms and complex sentence structures.
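
For illustration, this whole-word masking scheme matches what the Hugging Face `DataCollatorForWholeWordMask` implements. The sketch below applies it to the base checkpoint named later in this card; the 15% masking probability is an assumption (the standard BERT default), not a value confirmed here.

```python
# Minimal sketch of Whole Word Masking with Hugging Face transformers.
# The 15% masking probability is an assumption (the standard BERT default).
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# All subword pieces of a selected word are masked together, so a multi-token
# legal term is either fully visible or fully hidden.
encoding = tokenizer("Il decreto legislativo è stato pubblicato in Gazzetta Ufficiale.")
batch = collator([encoding])
print(tokenizer.decode(batch["input_ids"][0]))
```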

The model was trained on 15,646 Italian legal documents from Infocube, including legislative acts, court rulings, resolutions, and regulatory communications. The texts are formal, technical, and highly structured, with numbered provisions and frequent references to laws and decrees. Tokenization produces an average of ~2,193 tokens per document, with some exceeding 11,000 tokens.
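
Since BERT-style encoders accept at most 512 tokens while these documents average ~2,193, long texts have to be split into windows before MLM training. How this was done for LexCube is not documented; the snippet below is one plausible sketch, with the 128-token stride an arbitrary illustrative choice.

```python
# Sketch: split a long legal document into overlapping 512-token windows.
# The 128-token stride is an illustrative assumption, not a documented setting.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")

def chunk_document(text: str, max_length: int = 512, stride: int = 128):
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,  # keep every window, not just the first
    )
    return enc["input_ids"]  # list of token-id windows, each at most max_length

windows = chunk_document("Art. 1. La presente legge disciplina ... " * 300)
print(len(windows), len(windows[0]))
```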

LexCube is suitable for:

- Predicting masked tokens in legal texts
- Generating embeddings for legal NLP tasks (see the sketch after this list)
- Transfer learning for downstream applications such as classification, NER, or legal QA
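
To illustrate the embedding use case above, a common recipe (a sketch, not an officially documented one for this model) is to load the encoder with `AutoModel` and mean-pool the last hidden state over non-padding positions:

```python
# Sketch: sentence embeddings via attention-masked mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "InfocubeSrl/LexCube"  # repo id taken from the usage example below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # encoder only, no MLM head

sentences = [
    "Il contratto è nullo per difetto di forma.",
    "La sentenza è stata impugnata in appello.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (batch, seq_len, hidden)

mask = inputs["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # e.g. torch.Size([2, 768])
```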

Due to confidentiality agreements, the raw dataset cannot be shared. However, statistical and linguistic analyses confirm its suitability for MLM pretraining in the Italian legal domain.

Limitations:

- Not suitable for general-purpose Italian NLP outside legal text
- Outputs should not be used for legal decision-making without expert supervision

Files changed (1): README.md (+78 -1)
README.md CHANGED
@@ -12,4 +12,81 @@ metrics:
  - perplexity
  pipeline_tag: fill-mask
  library_name: transformers
- ---
+ ---
+ # Model Card for LexCube
+
+ This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.
+
+ ### Model Description
+
+ This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**.
+ WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
+
+ - **Developed by:** Mohammad Mahdi Heydari Asl / Infocube
+ - **Funded by:** [More Information Needed]
+ - **Shared by:** HYDARIM7
+ - **Model type:** Transformer, BERT-based Masked Language Model
+ - **Language(s) (NLP):** Italian
+ - **License:** Apache-2.0
+ - **Finetuned from model:** `dbmdz/bert-base-italian-uncased`
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used for:
+ - Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
+ - Embedding legal text for downstream NLP tasks
+ - Transfer learning for other Italian legal NLP applications
+
+ ## Bias, Risks, and Limitations
+
+ - Not suitable for general-purpose Italian NLP outside legal text.
+
+ ### Recommendations
+
+ Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
+
+ ## How to Get Started with the Model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model_name = "InfocubeSrl/LexCube"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForMaskedLM.from_pretrained(model_name)
+
+ text = "La legge [MASK] approvata dal parlamento."
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Locate the [MASK] position and take the highest-scoring token there.
+ mask_token_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
+ predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
+ print("Prediction:", tokenizer.decode(predicted_token_id))
+ ```
+
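+ Equivalently, since the front matter declares the `fill-mask` pipeline tag, the same prediction can be made with the high-level pipeline API (a brief sketch):
+
+ ```python
+ from transformers import pipeline
+
+ fill = pipeline("fill-mask", model="InfocubeSrl/LexCube")
+ # Each candidate is a dict with "token_str", "score", and the filled "sequence".
+ print(fill("La legge [MASK] approvata dal parlamento.")[0]["token_str"])
+ ```
+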
+ ### Training Data
+
+ - **Source:** Provided by *Infocube*
+ - **Size:** 15,646 documents
+ - **Language:** Italian
+ - **Domain:** Legal and administrative texts
+ - Formal and technical legal language
+ - Frequent references to laws, decrees, and legislative articles
+ - Structured format with numbered provisions and cross-citations
+ - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
+ - **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed to confirm its suitability for MLM pretraining