---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---
# Model Card for LexCube

This model is a BERT-based masked language model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and to capture domain-specific semantic and syntactic structures.
## Model Description
This model is fine-tuned from dbmdz/bert-base-italian-uncased using Masked Language Modeling (MLM) with Whole Word Masking (WWM).
WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
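To illustrate the idea (this is a standalone sketch, not the training code used for this model), whole word masking groups BERT-style `##` continuation pieces with their head token so that a word is always masked in full:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Illustrative whole-word masking: subword pieces (prefixed '##')
    are grouped with their head token and masked together."""
    random.seed(seed)
    # Group token indices into whole words ('##' marks a continuation piece).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for group in words:
        # Decide once per word, then mask every piece of that word.
        if random.random() < mask_prob:
            for i in group:
                masked[i] = "[MASK]"
    return masked

tokens = ["la", "giuris", "##prudenza", "costituzionale"]
print(whole_word_mask(tokens, mask_prob=1.0))
# With mask_prob=1.0 every word is masked, and 'giuris'/'##prudenza'
# are masked together as a single word.
```

In practice this behavior is provided by data collators such as `transformers.DataCollatorForWholeWordMask`; the function above only demonstrates the grouping principle.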
- **Developed by:** Mohammad Mahdi Heydari Asl / Infocube
- **Shared by:** HYDARIM7
- **Model type:** Transformer, BERT-based masked language model
- **Language(s) (NLP):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** dbmdz/bert-base-italian-uncased
## Uses

### Direct Use
The model can be used for:
- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks
- Transfer learning for other Italian legal NLP applications
## Bias, Risks, and Limitations

- The model is domain-specific and not suitable for general-purpose Italian NLP outside legal and administrative text.
### Recommendations
Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "La legge [MASK] approvata dal parlamento."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there.
mask_token_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print("Prediction:", tokenizer.decode(predicted_token_id))
```
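For quick experiments, the same prediction can also be obtained with the high-level `fill-mask` pipeline; the `top_k` value below is illustrative:

```python
from transformers import pipeline

# The fill-mask pipeline handles tokenization, mask lookup,
# and decoding in a single call.
fill_mask = pipeline("fill-mask", model="InfocubeSrl/LexCube")

for pred in fill_mask("La legge [MASK] approvata dal parlamento.", top_k=3):
    # Each prediction carries the token string and its softmax score.
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```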
### Training Data
- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts
- **Characteristics:**
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
- **Avg. length:** ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research purposes.