Persian Ezafe Detection (ALBERT)

This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from HooshvareLab/albert-fa-zwnj-base-v2.

It was designed to resolve phonemization ambiguities where the Ezafe (kasra) sound is grammatically required but not written in the text. For example, کتاب من is pronounced ketāb-e man ("my book"), yet the /e/ linking the two words is almost never written.

Project Context

This model was developed as part of the Piper with LCA Phonemizer project. You can find the full source code, implementation details, and the custom PiperTTS build at the official repository:

🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer

Training Pipeline Notebook

The full end-to-end training pipeline used to build this model (dataset preparation, SpaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook here:

📓 ezafe_detection_training_pipeline.ipynb
🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb
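For orientation, the ONNX export and quantization steps at the end of that pipeline could look roughly like the sketch below, using optimum. The exact quantization configuration used for this model is in the notebook; the dynamic avx2 config and checkpoint path here are assumptions.

from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned PyTorch checkpoint to ONNX
onnx_model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-checkpoint",  # hypothetical local checkpoint
    export=True,
)
onnx_model.save_pretrained("onnx")

# Dynamic int8 quantization (assumed configuration, not necessarily the project's)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
qconfig = AutoQuantizationConfig.avx2(is_static=False)
quantizer.quantize(save_dir="quantized", quantization_config=qconfig)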

Repository Structure

The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the quantized/ directory) for efficient inference.

.
├── config.json                  # PyTorch model configuration
├── model.safetensors            # PyTorch model weights
├── quantized/                   # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx     # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model                 # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json

Model Details

  • Task: Token Classification (Ezafe Detection)
  • Base Model: HooshvareLab/albert-fa-zwnj-base-v2
  • Size: ~11.1M parameters (F32 safetensors)
  • Format:
    • PyTorch (root directory)
    • ONNX Quantized (inside quantized/ directory)
  • Labels:
    • 0: NO_EZAFE (no sound added)
    • 1: NEEDS_EZAFE (add the /e/ or /ye/ sound, as sketched below)
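As a rough illustration of what the positive label means downstream, here is a hypothetical post-processing helper (not part of this repository): Persian uses /ye/ after vowel-final words and plain /e/ after consonant-final ones.

def apply_ezafe(pronunciation: str) -> str:
    """Hypothetical helper: attach the Ezafe sound to a pronunciation.

    /ye/ after vowel-final words (xāne -> xāne-ye),
    /e/ after consonant-final words (ketāb -> ketāb-e).
    """
    vowels = set("aeiouāīū")
    return pronunciation + ("-ye" if pronunciation[-1] in vowels else "-e")

print(apply_ezafe("ketāb"))  # ketāb-e
print(apply_ezafe("xāne"))   # xāne-ye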

Performance

The model was trained on a refined subset of the HomoRich dataset, using context-aware labeling.

Metric     Score
F1 Score   98.73%

Evaluated on a held-out test set.
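The exact evaluation protocol is in the training notebook. For reference, a token-level binary F1 of this kind can be computed with scikit-learn; the labels below are toy values for illustration only.

from sklearn.metrics import f1_score

# Toy gold/predicted sequences (0 = NO_EZAFE, 1 = NEEDS_EZAFE),
# flattened over all non-special tokens of a test set.
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1]

print(f"F1: {f1_score(y_true, y_pred):.4f}")  # 0.8000 on this toy data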

Usage

1. Using Transformers (PyTorch)

Use the root path to load the PyTorch model for training or standard inference.

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace("▁", ""):<15} | {id2label[label_id]}")

2. Using ONNX Runtime (Production)

For production environments, load from the quantized subfolder. This is the format used in the Piper-with-LCA-Phonemizer Docker container.

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
    model_id,
    subfolder="quantized",
    file_name="model_quantized.onnx"
)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace("▁", ""):<15} | {id2label[label_id]}")

Training Data

The model was trained on the HomoRich dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
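The actual context-aware labeling logic lives in the training notebook. As a loose illustration of the idea only, one could flag a word as carrying Ezafe when its pronunciation ends in an Ezafe sound that the written form does not already account for; the heuristic below is an assumption, not the project's implementation.

def looks_like_ezafe(grapheme: str, phoneme: str) -> bool:
    """Loose heuristic sketch, not the project's labeling code."""
    # Pronunciation ends in the Ezafe sound /e/ or /ye/ ...
    ends_in_ezafe_sound = phoneme.endswith(("e", "ye"))
    # ... that a final ه or ی in the written form would already explain.
    written_final_vowel = grapheme.endswith(("ه", "ی"))
    return ends_in_ezafe_sound and not written_final_vowel

print(looks_like_ezafe("کتاب", "ketābe"))  # True: unwritten Ezafe
print(looks_like_ezafe("خانه", "xāne"))    # False: final /e/ is written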
