PersianPunc: ParsBERT for Persian Punctuation Restoration

This model is a fine-tuned version of ParsBERT for Persian punctuation restoration.

Model Description

  • Language: Persian (Farsi)
  • Task: Token Classification (Punctuation Restoration)
  • Base Model: ParsBERT
  • Dataset: PersianPunc (17M samples)
  • Training Subset: 1M samples (989,000 training, 10,000 validation, 1,000 test)

Performance

Evaluated on 1,000 test sentences:

  • Macro-averaged F1: 91.33%
  • Micro-averaged F1: 97.28%
  • Full Sentence Match Rate: 61.80%

Per-class Performance:

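| Punctuation       | Precision | Recall | F1-Score |
|-------------------|-----------|--------|----------|
| Persian Comma (،) | 84.08%    | 76.35% | 80.03%   |
| Period (.)        | 98.55%    | 98.86% | 98.71%   |
| Question (؟)      | 87.50%    | 90.32% | 88.89%   |
| Colon (:)         | 91.37%    | 89.55% | 90.45%   |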
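
As a sanity check on the per-class figures, the sketch below recomputes each F1 score as the harmonic mean of the reported precision and recall. The numbers are copied from the table above; the helper name `f1_score` and the variable names are illustrative and not part of the released code.

```python
# Per-class (precision, recall, reported F1) in percent, copied from the table.
reported = {
    "COMMA":    (84.08, 76.35, 80.03),
    "PERIOD":   (98.55, 98.86, 98.71),
    "QUESTION": (87.50, 90.32, 88.89),
    "COLON":    (91.37, 89.55, 90.45),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

for name, (p, r, f1) in reported.items():
    # Small gaps are expected because P and R are rounded to two decimals.
    print(f"{name:<9} reported={f1:.2f} recomputed={recomputed[name]:.2f}")

# Unweighted mean over the four listed classes only.
mean_f1_listed = sum(f1 for _, _, f1 in reported.values()) / len(reported)
print(f"mean F1 over listed classes: {mean_f1_listed:.2f}")
```

The recomputed values agree with the table to within rounding. The unweighted mean over these four rows is about 89.52%, which is lower than the reported macro-averaged F1 of 91.33%. The reported macro average presumably also covers a class that is not listed here (for example the no-punctuation label), but that is an inference, not something this card states.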

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "MohammadJRanjbar/parsbert-persian-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input (text without punctuation)
text = "سلام چطوری امروز هوا خیلی خوبه"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Label mapping
id2label = {
    0: "EMPTY",
    1: "COMMA",
    2: "QUESTION",
    3: "PERIOD",
    4: "COLON"
}

# Map predictions to punctuation
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Reconstruct text with punctuation
punct_map = {
    "COMMA": "،",
    "QUESTION": "؟",
    "PERIOD": ".",
    "COLON": ":"
}

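# Merge WordPiece pieces back into words and attach punctuation to the word,
# so that no space is inserted before a punctuation mark
words = []  # each entry is [word, punctuation]
for token, label in zip(tokens, labels):
    if token in ["[CLS]", "[SEP]", "[PAD]"]:
        continue
    if token.startswith("##") and words:
        words[-1][0] += token[2:]  # continuation piece: extend the current word
    else:
        words.append([token, ""])
    if label in punct_map:
        words[-1][1] = punct_map[label]  # keep punctuation predicted on any piece

punctuated_text = " ".join(word + punct for word, punct in words)
print(punctuated_text)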
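
Because the tokenizer call above uses `truncation=True` with `max_length=512`, anything beyond 512 tokens is silently dropped. One possible workaround, sketched below, is to split long input into smaller pieces and run each piece through the model separately. The helper `chunk_words` and its default of 200 words are hypothetical choices for illustration, not part of the released code. A single word can map to several WordPiece tokens, so keeping `truncation=True` as a safety net is still advisable.

```python
from typing import Iterator, List


def chunk_words(text: str, max_words: int = 200) -> Iterator[str]:
    """Yield consecutive chunks of at most `max_words` whitespace-separated words.

    `max_words` is an illustrative budget, chosen to stay well below the
    512-token limit even when words split into several sub-tokens.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words: List[str] = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


# Tiny demonstration with the example sentence and an artificially small budget.
text = "سلام چطوری امروز هوا خیلی خوبه"
chunks = list(chunk_words(text, max_words=4))
print(chunks)
```

Each chunk can then be passed through the tokenizer and model exactly as in the example above, and the punctuated pieces joined with spaces. Fixed-size chunking can cut a sentence in the middle, so predictions near chunk boundaries may be less reliable.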

Training Details

  • Training Data: 989,000 samples from PersianPunc dataset
  • Validation Data: 10,000 samples
  • Test Data: 1,000 samples
  • Epochs: 3
  • Batch Size: 680 (effective, with gradient accumulation)
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Weight Decay: 0.01
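
To put these numbers in perspective, the sketch below estimates how many optimizer steps the listed settings imply. It assumes one optimizer step per effective batch and that the final partial batch of each epoch is kept. Both are assumptions: the card does not state them, nor does it say how the effective batch of 680 is split between per-device batch size and gradient accumulation steps.

```python
import math

# Values taken from the list above.
train_samples = 989_000
effective_batch_size = 680
epochs = 3

# Assumption: the last, smaller batch of each epoch still triggers a step.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total:  {total_steps}")
```

Under these assumptions the run takes roughly 1,455 optimizer steps per epoch, or about 4,365 in total.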

Citation

If you use this model, please cite:

@inproceedings{ranjbar2026persianpunc,
  title={PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration},
  author={Ranjbar Kalahroodi, Mohammad Javad and Faili, Heshaam and Shakery, Azadeh},
  booktitle={Proceedings of SilkRoadNLP 2026},
  year={2026}
}

License

MIT License

Authors

  • Mohammad Javad Ranjbar Kalahroodi (University of Tehran)
  • Heshaam Faili (University of Tehran)
  • Azadeh Shakery (University of Tehran & IPM)

Contact

For questions or issues, please contact: mohammadjranjbar@ut.ac.ir

Model size: 0.2B params (Safetensors, tensor type F32)