---
license: apache-2.0
language:
  - it
base_model:
  - dbmdz/bert-base-italian-uncased
tags:
  - legal
  - italian
  - infocube
metrics:
  - perplexity
pipeline_tag: fill-mask
library_name: transformers
---

Model Card for LexCube

This model is a BERT-based Masked Language Model fine-tuned on Italian legal texts. It is designed to predict masked tokens in legal documents and capture domain-specific semantic and syntactic structures.

Model Description

This model is fine-tuned from dbmdz/bert-base-italian-uncased using Masked Language Modeling (MLM) with Whole Word Masking (WWM).
WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
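As an illustrative sketch (not the actual training code), the core WWM idea can be reduced to two steps: group WordPiece subwords (tokens prefixed with `##`) back into whole words, then mask all subwords of a selected word together. The `transformers` library implements this, with additional 80/10/10 replacement rules, in its whole-word-masking data collator; the simplified version below is for intuition only.

```python
import random

def whole_word_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Simplified Whole Word Masking over WordPiece tokens.

    Groups subwords into words (a token starting with "##" continues
    the previous word), then masks every subword of a chosen word.
    """
    rng = rng or random.Random(0)

    # Group token indices into whole words.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)          # continuation of the previous word
        else:
            if current:
                words.append(current)
            current = [i]              # start of a new word
    if current:
        words.append(current)

    # Mask entire words at once, never individual subwords.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked
```

For example, if "giurisdizione" is tokenized as `["giuris", "##dizione"]`, both pieces are masked or kept together, so the model must reconstruct the full legal term from context.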

  • Developed by: Mohammad Mahdi Heydari Asl / Infocube
  • Model type: Transformer, BERT-based Masked Language Model
  • Language(s): Italian
  • License: Apache-2.0
  • Finetuned from model: dbmdz/bert-base-italian-uncased

Uses

Direct Use

The model can be used for:

  • Predicting masked tokens in Italian legal texts ([MASK] prediction)
  • Embedding legal text for downstream NLP tasks
  • Transfer learning for other Italian legal NLP applications
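For the embedding use case, a common recipe is to mean-pool the model's final hidden states over non-padding tokens. The pooling step itself is independent of the model; a minimal NumPy sketch (assuming you already have the hidden states and attention mask for one sequence):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings into one sentence vector,
    ignoring padding positions.

    hidden_states: (seq_len, dim) array of final-layer states.
    attention_mask: (seq_len,) array of 0/1 values.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=0)   # sum real tokens only
    count = np.maximum(mask.sum(), 1e-9)          # avoid division by zero
    return summed / count
```

In practice the `(seq_len, dim)` states would come from running the model with `output_hidden_states=True` (or from the base encoder's `last_hidden_state`) and converting to NumPy.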

Bias, Risks, and Limitations

  • The model is domain-specific: it is not suitable for general-purpose Italian NLP outside legal and administrative text.

Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.

How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "InfocubeSrl/LexCube"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "La legge [MASK] approvata dal parlamento."
inputs = tokenizer(text, return_tensors="pt")

# Inference only: no gradients needed
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
predicted_id = outputs.logits[0, mask_index].argmax()
predicted_token = tokenizer.decode(predicted_id)

print("Prediction:", predicted_token)
```
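To inspect more than the single best candidate, you can rank the whole vocabulary by probability at the mask position. The ranking step is shown below as a dependency-free sketch, assuming `logits` is the score vector at the mask index (e.g. `outputs.logits[0, mask_index].squeeze().numpy()` from the snippet above):

```python
import numpy as np

def top_k_predictions(logits, k=5):
    """Softmax over the vocabulary scores, then return (token_id, prob)
    pairs for the k highest-probability tokens."""
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    top = np.argsort(probs)[::-1][:k]      # indices of the k largest probs
    return [(int(i), float(probs[i])) for i in top]
```

Each returned id can then be turned into text with `tokenizer.decode(token_id)`.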

Training Data

  • Source: Provided by Infocube
  • Size: 15,646 documents
  • Language: Italian
  • Domain: Legal and administrative texts
    • Formal and technical legal language
    • Frequent references to laws, decrees, and legislative articles
    • Structured format with numbered provisions and cross-citations
    • Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
  • Confidentiality: The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research purposes.