InfocubeSrl/LexCube


HYDARIM7 committed on Oct 6, 2025

Commit c4636e3 · verified · 1 Parent(s): 28a7f7f

Update README.md

Files changed (1)

README.md +23 -8

@@ -67,15 +67,28 @@ model_name = "InfocubeSrl/LexCube"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForMaskedLM.from_pretrained(model_name)
-text = "La legge [MASK] approvata dal parlamento."
-inputs = tokenizer(text, return_tensors="pt")
-outputs = model(**inputs)
-mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
-predicted_id = outputs.logits[0, mask_index].argmax()
-predicted_token = tokenizer.decode(predicted_id)
-print("Prediction:", predicted_token)
+# Examples with [MASK]
+examples = [
+    "[MASK] il Decreto Legislativo 18 agosto 2000, n. 267 (Testo Unico delle leggi sull'ordinamento degli Enti Locali)",
+    "ACQUISITI, ai sensi dell'art. [MASK] del D.Lgs. 267/2000, i pareri favorevoli di regolarità tecnica e di regolarità contabile",
+    "Visto gli art. [MASK] e 42 del D.Lgs n.267/2000, Testo unico degli enti locali.",
+    "DI DICHIARARE la presente deliberazione immediatamente [MASK] ai sensi dell'art. 134, comma 4, del D.Lgs. n. 267/2000."
+]
+for text in examples:
+    inputs = tokenizer(text, return_tensors="pt")
+    outputs = model(**inputs)
+    # Find mask token position
+    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
+    # Get top prediction
+    predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
+    predicted_token = tokenizer.decode(predicted_id)
+    print(f"Input: {text}")
+    print(f"Prediction: {predicted_token}\n")
 ```
@@ -90,3 +103,5 @@ print("Prediction:", predicted_token)
   - Structured format with numbered provisions and cross-citations
   - Avg. length: ~909 words (≈2,193 tokens per document); some documents exceed 11k tokens
 - **Confidentiality:** Raw dataset cannot be shared due to contractual agreements, but it has been statistically and linguistically analyzed for research
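
Reviewer note: the two commented steps in the new README snippet ("Find mask token position" and "Get top prediction") can be illustrated without downloading the model. The sketch below is plain Python on toy data and is not part of the commit. The vocabulary, token ids, `MASK_ID`, and scores are all invented for illustration and do not reflect LexCube's tokenizer or outputs. It also returns the top two candidates rather than only the argmax, which may be useful when inspecting predictions.

```python
# Toy illustration only: vocabulary, ids and scores are made up, not LexCube's.
MASK_ID = 4  # hypothetical id standing in for tokenizer.mask_token_id
vocab = {0: "Visto", 1: "il", 2: "Decreto", 3: "Richiamato", 4: "[MASK]"}


def mask_positions(input_ids, mask_id):
    """Return every index in input_ids that holds the mask token."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_id]


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Stand-in for "[MASK] il Decreto": one row of scores per input position,
# one column per vocabulary entry (the role played by outputs.logits[0]).
input_ids = [4, 1, 2]
logits = [
    [2.5, 0.1, 0.3, 1.9, -1.0],
    [0.0, 3.0, 0.2, 0.1, -1.0],
    [0.1, 0.2, 4.0, 0.0, -1.0],
]

positions = mask_positions(input_ids, MASK_ID)
predictions = {pos: [vocab[i] for i in top_k(logits[pos], 2)] for pos in positions}
print(predictions)
```

With a real model, the same two steps correspond to the `nonzero(as_tuple=True)` lookup and the `argmax(dim=-1)` call in the diff above.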