---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---

# Model Card for LexCube

This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and to capture domain-specific semantic and syntactic structures.

### Model Description

This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**. WWM masks all subword tokens of a selected word together, so the model cannot recover a masked piece from its sibling pieces and must rely on the surrounding context. This is especially useful for complex legal terminology, which the tokenizer often splits into several subwords.

- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / Infocube
- **Model type:** Transformer, BERT-based Masked Language Model
- **Language(s):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`

## Uses

### Direct Use

The model can be used for:

- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks
- Transfer learning for other Italian legal NLP applications

## Bias, Risks, and Limitations

- Not suitable for general-purpose Italian NLP outside legal text.

### Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "La legge [MASK] approvata dal parlamento."
inputs = tokenizer(text, return_tensors="pt")

# Inference only: no gradients needed
with torch.no_grad():
    outputs = model(**inputs)

# Position of the (first) [MASK] token in the input sequence
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
mask_index = mask_positions[0].item()

# Highest-scoring vocabulary entry at the masked position
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1).item()
predicted_token = tokenizer.decode([predicted_id])
print("Prediction:", predicted_token)
```

### Training Data

- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
  - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research
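
Because the average document (≈2,193 tokens) is longer than the 512-token input limit that BERT-base models normally have, long legal texts need to be split before they can be passed to the model. This card does not state how long documents were handled during fine-tuning; the following is only a minimal sketch of one common option, overlapping fixed-size windows. The function name `chunk_token_ids`, the `overlap` value, and the placeholder special-token IDs are assumptions for illustration. In practice the IDs would come from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int],
    cls_id: int,
    sep_id: int,
    max_len: int = 512,   # assumed model input limit (typical for BERT-base)
    overlap: int = 64,    # hypothetical overlap between consecutive windows
) -> List[List[int]]:
    """Split a token-ID sequence (without special tokens) into overlapping
    windows, each wrapped as [CLS] ... [SEP] and at most max_len long."""
    body_len = max_len - 2  # reserve two slots for [CLS] and [SEP]
    if body_len <= 0 or not 0 <= overlap < body_len:
        raise ValueError("need max_len > 2 and 0 <= overlap < max_len - 2")

    step = body_len - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        body = token_ids[start:start + body_len]
        windows.append([cls_id] + body + [sep_id])
        # Stop once this window has reached the end of the document
        if start + body_len >= len(token_ids):
            break
        start += step
    return windows


# Example with dummy IDs standing in for an average-length document
doc_ids = list(range(1000, 1000 + 2193))
windows = chunk_token_ids(doc_ids, cls_id=1, sep_id=2)
print(len(windows), [len(w) for w in windows])
```

Overlapping windows keep some shared context across chunk boundaries, which can help when a citation or provision straddles two windows, at the cost of processing some tokens twice.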