---
license: apache-2.0
language:
- it
base_model:
- dbmdz/bert-base-italian-uncased
tags:
- legal
- italian
- delibera
- municipal
- infocube
metrics:
- perplexity
pipeline_tag: fill-mask
library_name: transformers
---

# Model Card for LexCube

This model is a BERT-based masked language model fine-tuned on Italian legal texts (municipal delibera domain*). It is designed to predict masked tokens in legal documents and to capture domain-specific semantic and syntactic structures.

*A delibera is a formal decision or resolution made by a local government body, such as a city council or municipal committee, that has official and legal effect.

### Model Description

This model is fine-tuned from `dbmdz/bert-base-italian-uncased` using **Masked Language Modeling (MLM) with Whole Word Masking (WWM)**. WWM ensures that all subword tokens of a selected word are masked together, encouraging the model to learn deeper contextual representations, especially for complex legal terminology.
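
In the Hugging Face `transformers` library, WWM is available through `DataCollatorForWholeWordMask`. The exact training script and hyperparameters for this model are not published here, so the following is only an illustrative sketch (`mlm_probability=0.15` is the library default, assumed for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
# Masks all WordPiece pieces of a randomly chosen word together
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Words that split into multiple WordPiece pieces are masked as a unit,
# so the model cannot recover the word from an unmasked subpiece.
encoding = tokenizer("La presente deliberazione è immediatamente eseguibile.")
batch = collator([{"input_ids": encoding["input_ids"]}])
print(tokenizer.decode(batch["input_ids"][0]))  # randomly chosen words -> [MASK]
```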

- **Developed by:** [Mohammad Mahdi Heydari Asl](https://huggingface.co/HYDARIM7) / Infocube
- **Model type:** Transformer, BERT-based masked language model
- **Language(s):** Italian
- **License:** Apache-2.0
- **Finetuned from model:** `dbmdz/bert-base-italian-uncased`
## Uses

### Direct Use

The model can be used for:

- Predicting masked tokens in Italian legal texts (`[MASK]` prediction)
- Embedding legal text for downstream NLP tasks (see the sketch below)
- Transfer learning for other Italian legal NLP applications
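
For embedding use, the MLM head is not needed; the encoder can be loaded with `AutoModel` instead. A minimal sketch, assuming mean pooling over the last hidden state (one common pooling choice, not one prescribed by this card):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "InfocubeSrl/LexCube"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # encoder only, no MLM head
model.eval()

text = "Vista la deliberazione del Consiglio Comunale."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)

# Mean pooling over real tokens (padding excluded via the attention mask)
mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)                           # torch.Size([1, 768])
```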

## Bias, Risks, and Limitations

- Not suitable for general-purpose Italian NLP outside legal text.
### Recommendations

Users should verify outputs and avoid relying on predictions for legal decision-making without expert supervision.
## How to Get Started with the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InfocubeSrl/LexCube"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()  # inference mode: disable dropout

# Examples with [MASK]
examples = [
    "[MASK] il Decreto Legislativo 18 agosto 2000, n. 267 (Testo Unico delle leggi sull'ordinamento degli Enti Locali)",
    "ACQUISITI, ai sensi dell'art. [MASK] del D.Lgs. 267/2000, i pareri favorevoli di regolarità tecnica e di regolarità contabile",
    "Visto gli art. [MASK] e 42 del D.Lgs n.267/2000, Testo unico degli enti locali.",
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed at inference
        outputs = model(**inputs)

    # Find the position of the [MASK] token
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    # Take the highest-scoring vocabulary entry at the masked position
    predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_id)

    print(f"Input: {text}")
    print(f"Prediction: {predicted_token}\n")
```
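
For quick experiments, the `fill-mask` pipeline wraps the same steps and returns the top candidates with their scores:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="InfocubeSrl/LexCube")

# Top candidates for the masked token, with probabilities
for candidate in fill_mask(
    "DI DICHIARARE la presente deliberazione immediatamente [MASK] "
    "ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000."
):
    print(candidate["token_str"], round(candidate["score"], 3))
```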

### Training Data

- **Source:** Provided by *Infocube*
- **Size:** 15,646 documents
- **Language:** Italian
- **Domain:** Legal and administrative texts (municipal delibera domain)
  - Formal and technical legal language
  - Frequent references to laws, decrees, and legislative articles
  - Structured format with numbered provisions and cross-citations
- **Avg. length:** ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
- **Confidentiality:** The raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research
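
Because BERT-style encoders accept at most 512 tokens, documents of this length must be segmented before MLM training. The segmentation strategy used for this model is not documented; below is a minimal sketch of one common approach, overlapping fixed-size windows via the tokenizer's overflow support (the stride of 64 is an illustrative choice, not the card's recipe):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")

def chunk_document(text: str, max_length: int = 512, stride: int = 64):
    """Tokenize `text` into overlapping windows that fit the encoder."""
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,                    # tokens shared between adjacent windows
        truncation=True,
        return_overflowing_tokens=True,   # emit every window, not just the first
    )
    return enc["input_ids"]               # list of windows, each <= 512 ids

windows = chunk_document("VISTO il D.Lgs. 18 agosto 2000, n. 267. " * 400)
print(len(windows), [len(w) for w in windows[:3]])
```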