---
library_name: transformers
tags:
- legal
datasets:
- Moryjj/Iranian-Courts-Ruling
language:
- fa
metrics:
- rouge
- bertscore
base_model:
- Ahmad/parsT5-base
---
# 🧾 Persian Legal Text Simplification with ParsT5 + Unlimiformer
This model is part of the first benchmark for Persian legal text simplification. It fine-tunes ParsT5-base, an encoder-decoder model, with the Unlimiformer extension so that long legal documents can be processed efficiently, without truncation.
🔗 Project GitHub: mrjoneidi/Simplification-Legal-Texts
## 🧠 Model Description
- Base Model: Ahmad/parsT5-base
- Extended with: Unlimiformer for long-input attention
- Training Data: 5,000+ Persian legal rulings, simplified using GPT-4o prompts
- Max Input Length: ~16,000 tokens (with Unlimiformer)
- Fine-tuned on: the simplified dataset generated from the judicial rulings
- Task: Text simplification (formal legal Persian → public-friendly Persian)
## ✨ Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("mrjoneidi/parsT5-legal-simplification")
tokenizer = AutoTokenizer.from_pretrained("mrjoneidi/parsT5-legal-simplification")

# "The court, having reviewed the available evidence, rules to dismiss the claim..."
input_text = "دادگاه با بررسی مدارک موجود، حکم به رد دعوی صادر مینماید..."

# Short inputs fit in the base model; very long rulings need the
# Unlimiformer-extended setup to avoid truncation.
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=512)
simplified = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(simplified)
```
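The model card lists ROUGE and BERTScore as evaluation metrics. As a minimal illustration of how a simplified output can be scored against a reference, here is a toy unigram ROUGE-1 F1 in plain Python. This is only a sketch: the actual benchmark presumably used a standard library such as `rouge_score` or `evaluate`, and proper Persian evaluation also requires suitable tokenization.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Compare a hypothetical simplified output against a reference simplification
score = rouge1_f1("the court dismissed the claim", "the court dismissed the case")
print(f"{score:.2f}")  # → 0.80
```

Whitespace splitting works for this English toy example; for Persian text, a word-level tokenizer (e.g. from Hazm) would be the more realistic choice.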
## 🛠 Training Details
- Platform: Kaggle P100 GPU
- Optimizer: AdamW (best performance among AdamW, LAMB, and SGD)
- Configurations: 1 vs. 3 unfrozen encoder-decoder blocks; the 3-block configuration gave the best results
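The partial-unfreezing setup above can be sketched as follows. This is an illustrative snippet, not the project's actual training code: it uses a tiny randomly initialized T5 as a stand-in for ParsT5-base, and the number of unfrozen blocks (3) is taken from the best-performing configuration reported above.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random T5 stand-in for ParsT5-base (illustration only)
config = T5Config(vocab_size=64, d_model=16, d_kv=4, d_ff=32,
                  num_layers=4, num_decoder_layers=4, num_heads=2)
model = T5ForConditionalGeneration(config)

# Freeze every parameter, then unfreeze only the last 3 encoder
# and decoder blocks for fine-tuning.
for param in model.parameters():
    param.requires_grad = False

n_unfrozen_blocks = 3
for stack in (model.encoder, model.decoder):
    for block in list(stack.block)[-n_unfrozen_blocks:]:
        for param in block.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable}/{total}")
```

Only the trainable (unfrozen) parameters would then be passed to the AdamW optimizer, which keeps memory and compute manageable on a single P100.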