# PersianPunc: ParsBERT for Persian Punctuation Restoration

This model is a fine-tuned version of ParsBERT for Persian punctuation restoration.
## Model Description
- Language: Persian (Farsi)
- Task: Token Classification (Punctuation Restoration)
- Base Model: ParsBERT
- Dataset: PersianPunc (17M samples)
- Training Subset: 1M samples
## Performance

Evaluated on 1,000 test sentences:
- Macro-averaged F1: 91.33%
- Micro-averaged F1: 97.28%
- Full Sentence Match Rate: 61.80%
### Per-class Performance
| Punctuation | Precision | Recall | F1-Score |
|---|---|---|---|
| Persian Comma (،) | 84.08% | 76.35% | 80.03% |
| Period (.) | 98.55% | 98.86% | 98.71% |
| Question Mark (؟) | 87.50% | 90.32% | 88.89% |
| Colon (:) | 91.37% | 89.55% | 90.45% |
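For clarity on the two averages reported above: macro F1 weights each punctuation class equally, while micro F1 pools the counts, so frequent classes (here, the period) dominate. A minimal sketch of the difference, using made-up per-class counts that only roughly mirror the table (not the actual evaluation counts):

```python
# Hedged sketch: macro vs micro F1 from illustrative per-class counts.
# Only the averaging formulas reflect how the reported metrics work;
# the (tp, fp, fn) numbers below are invented for demonstration.

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (tp, fp, fn) per punctuation class -- illustrative numbers only
counts = {
    "COMMA": (80, 15, 25),
    "PERIOD": (950, 14, 11),
    "QUESTION": (28, 4, 3),
    "COLON": (60, 6, 7),
}

# Macro F1: average the per-class F1 scores (each class weighs equally)
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, then compute a single F1
# (the frequent period class dominates, pushing the score up)
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.4f}, micro F1: {micro_f1:.4f}")
```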
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "MohammadJRanjbar/parsbert-persian-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input (text without punctuation)
text = "سلام چطوری امروز هوا خیلی خوبه"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Label mapping
id2label = {
    0: "EMPTY",
    1: "COMMA",
    2: "QUESTION",
    3: "PERIOD",
    4: "COLON",
}

# Map predictions to punctuation labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Reconstruct text with punctuation
punct_map = {
    "COMMA": "،",
    "QUESTION": "؟",
    "PERIOD": ".",
    "COLON": ":",
}

result = []
for token, label in zip(tokens, labels):
    if token in ["[CLS]", "[SEP]", "[PAD]"]:
        continue
    result.append(token)
    if label in punct_map:
        result.append(punct_map[label])

# Merge WordPiece subwords (prefixed with "##") back into words
punctuated_text = " ".join(result).replace(" ##", "")
print(punctuated_text)
```
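The reconstruction above works token by token; with WordPiece subwords it can be cleaner to first aggregate predictions to one label per word and only then attach the marks. A minimal sketch of that word-level step, assuming labels have already been aggregated per word (the helper name and the sample labels are illustrative, not part of the model's API):

```python
# Hedged sketch: attach one predicted punctuation mark per word.
# Assumes subword predictions were already aggregated to word level
# (e.g. by taking the label of each word's first subword).

PUNCT_MAP = {"COMMA": "،", "QUESTION": "؟", "PERIOD": ".", "COLON": ":"}

def attach_punctuation(words, labels):
    """Append each word's predicted mark; "EMPTY" adds nothing."""
    pieces = []
    for word, label in zip(words, labels):
        pieces.append(word + PUNCT_MAP.get(label, ""))
    return " ".join(pieces)

# Illustrative labels, not actual model output
print(attach_punctuation(
    ["سلام", "چطوری", "امروز", "هوا", "خیلی", "خوبه"],
    ["COMMA", "QUESTION", "EMPTY", "EMPTY", "EMPTY", "PERIOD"],
))
```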
## Training Details
- Training Data: 989,000 samples from PersianPunc dataset
- Validation Data: 10,000 samples
- Test Data: 1,000 samples
- Epochs: 3
- Batch Size: 680 (effective, with gradient accumulation)
- Learning Rate: 2e-5
- Optimizer: AdamW
- Weight Decay: 0.01
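The effective batch size of 680 is reached by accumulating gradients over several smaller forward/backward passes before each optimizer step. A minimal sketch of the arithmetic, with an assumed per-device batch size (the 40 × 17 split is illustrative, not the authors' exact configuration):

```python
# Hedged sketch: gradient accumulation arithmetic.
# The per-device batch size and step count are assumptions for
# illustration; only the product matching 680 comes from the card.

per_device_batch = 40     # assumed micro-batch that fits in memory
accumulation_steps = 17   # gradients summed over this many batches

effective_batch = per_device_batch * accumulation_steps
print(effective_batch)  # 680

def count_optimizer_updates(num_micro_batches, accumulation_steps):
    """Count optimizer steps when updating every `accumulation_steps`."""
    updates = 0
    for step in range(1, num_micro_batches + 1):
        # loss.backward() would accumulate gradients here
        if step % accumulation_steps == 0:
            # optimizer.step(); optimizer.zero_grad()
            updates += 1
    return updates

print(count_optimizer_updates(34, accumulation_steps))  # 2
```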
## Citation

If you use this model, please cite:

```bibtex
@inproceedings{ranjbar2026persianpunc,
  title={PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration},
  author={Ranjbar Kalahroodi, Mohammad Javad and Faili, Heshaam and Shakery, Azadeh},
  booktitle={Proceedings of SilkRoadNLP 2026},
  year={2026}
}
```
## License

MIT License

## Authors
- Mohammad Javad Ranjbar Kalahroodi (University of Tehran)
- Heshaam Faili (University of Tehran)
- Azadeh Shakery (University of Tehran & IPM)
## Contact
For questions or issues, please contact: mohammadjranjbar@ut.ac.ir