# PersianPunc: ParsBERT for Persian Punctuation Restoration

This model is a fine-tuned version of ParsBERT for Persian punctuation restoration.
## Model Description
- Language: Persian (Farsi)
- Task: Token Classification (Punctuation Restoration)
- Base Model: ParsBERT
- Dataset: PersianPunc (17M samples)
- Training Subset: 1M samples
## Performance

Evaluated on 1,000 test sentences:
- Macro-averaged F1: 91.33%
- Micro-averaged F1: 97.28%
- Full Sentence Match Rate: 61.80%
### Per-class Performance
| Punctuation | Precision | Recall | F1-Score |
|---|---|---|---|
| Persian Comma (،) | 84.08% | 76.35% | 80.03% |
| Period (.) | 98.55% | 98.86% | 98.71% |
| Question Mark (؟) | 87.50% | 90.32% | 88.89% |
| Colon (:) | 91.37% | 89.55% | 90.45% |
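For clarity on the two averages reported above: macro F1 weights each punctuation class equally, while micro F1 pools the counts, so frequent classes (here, the period) dominate. A minimal sketch of the difference, using made-up per-class counts that only roughly mirror the table (not the actual evaluation counts):

```python
# Hedged sketch: macro vs micro F1 from illustrative per-class counts.
# Only the averaging formulas reflect how the reported metrics work;
# the (tp, fp, fn) numbers below are invented for demonstration.

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (tp, fp, fn) per punctuation class -- illustrative numbers only
counts = {
    "COMMA": (80, 15, 25),
    "PERIOD": (950, 14, 11),
    "QUESTION": (28, 4, 3),
    "COLON": (60, 6, 7),
}

# Macro F1: average the per-class F1 scores (each class weighs equally)
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, then compute a single F1
# (the frequent period class dominates, pushing the score up)
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.4f}, micro F1: {micro_f1:.4f}")
```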
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "MohammadJRanjbar/parsbert-persian-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input (text without punctuation)
text = "سلام چطوری امروز هوا خیلی خوبه"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Label mapping
id2label = {
    0: "EMPTY",
    1: "COMMA",
    2: "QUESTION",
    3: "PERIOD",
    4: "COLON",
}

# Map predictions to punctuation labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Reconstruct text with punctuation
punct_map = {
    "COMMA": "،",
    "QUESTION": "؟",
    "PERIOD": ".",
    "COLON": ":",
}

result = []
for token, label in zip(tokens, labels):
    if token in ["[CLS]", "[SEP]", "[PAD]"]:
        continue
    result.append(token)
    if label in punct_map:
        result.append(punct_map[label])

# Merge WordPiece subwords (prefixed with "##") back into words
punctuated_text = " ".join(result).replace(" ##", "")
print(punctuated_text)
```
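The reconstruction above works token by token; with WordPiece subwords it can be cleaner to first aggregate predictions to one label per word and only then attach the marks. A minimal sketch of that word-level step, assuming labels have already been aggregated per word (the helper name and the sample labels are illustrative, not part of the model's API):

```python
# Hedged sketch: attach one predicted punctuation mark per word.
# Assumes subword predictions were already aggregated to word level
# (e.g. by taking the label of each word's first subword).

PUNCT_MAP = {"COMMA": "،", "QUESTION": "؟", "PERIOD": ".", "COLON": ":"}

def attach_punctuation(words, labels):
    """Append each word's predicted mark; "EMPTY" adds nothing."""
    pieces = []
    for word, label in zip(words, labels):
        pieces.append(word + PUNCT_MAP.get(label, ""))
    return " ".join(pieces)

# Illustrative labels, not actual model output
print(attach_punctuation(
    ["سلام", "چطوری", "امروز", "هوا", "خیلی", "خوبه"],
    ["COMMA", "QUESTION", "EMPTY", "EMPTY", "EMPTY", "PERIOD"],
))
```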
## Training Details
- Training Data: 989,000 samples from PersianPunc dataset
- Validation Data: 10,000 samples
- Test Data: 1,000 samples
- Epochs: 3
- Batch Size: 680 (effective, with gradient accumulation)
- Learning Rate: 2e-5
- Optimizer: AdamW
- Weight Decay: 0.01
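The effective batch size of 680 is reached by accumulating gradients over several smaller forward/backward passes before each optimizer step. A minimal sketch of the arithmetic, with an assumed per-device batch size (the 40 × 17 split is illustrative, not the authors' exact configuration):

```python
# Hedged sketch: gradient accumulation arithmetic.
# The per-device batch size and step count are assumptions for
# illustration; only the product matching 680 comes from the card.

per_device_batch = 40     # assumed micro-batch that fits in memory
accumulation_steps = 17   # gradients summed over this many batches

effective_batch = per_device_batch * accumulation_steps
print(effective_batch)  # 680

def count_optimizer_updates(num_micro_batches, accumulation_steps):
    """Count optimizer steps when updating every `accumulation_steps`."""
    updates = 0
    for step in range(1, num_micro_batches + 1):
        # loss.backward() would accumulate gradients here
        if step % accumulation_steps == 0:
            # optimizer.step(); optimizer.zero_grad()
            updates += 1
    return updates

print(count_optimizer_updates(34, accumulation_steps))  # 2
```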
## Citation

If you use this model, please cite:

```bibtex
@inproceedings{ranjbar2026persianpunc,
  title={PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration},
  author={Ranjbar Kalahroodi, Mohammad Javad and Faili, Heshaam and Shakery, Azadeh},
  booktitle={Proceedings of SilkRoadNLP 2026},
  year={2026}
}
```
## License

MIT License

## Authors
- Mohammad Javad Ranjbar Kalahroodi (University of Tehran)
- Heshaam Faili (University of Tehran)
- Azadeh Shakery (University of Tehran & IPM)
## Contact
For questions or issues, please contact: mohammadjranjbar@ut.ac.ir