# Persian Ezafe Detection (ALBERT)
This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from HooshvareLab/albert-fa-zwnj-base-v2.
It resolves a phonemization ambiguity: the Ezafe (kasra, /e/) is grammatically required in many noun phrases but is not written in standard Persian orthography. For example, کتاب من ("my book") carries no written marker, yet is pronounced *ketâb-e man*.
## Project Context
This model was developed as part of the Piper with LCA Phonemizer project. You can find the full source code, implementation details, and the custom PiperTTS build at the official repository:
🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer
## Training Pipeline Notebook
The full end-to-end training pipeline used to build this model (dataset preparation, SpaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook here:
📓 [ezafe_detection_training_pipeline.ipynb](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb)
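For orientation, the export and quantization stage might look roughly like the sketch below, assuming the Hugging Face Optimum ONNX Runtime tooling; the notebook remains the authoritative version, and the local checkpoint path is a placeholder.

```python
# Sketch of the ONNX export + dynamic INT8 quantization stage, assuming the
# Optimum ONNX Runtime tooling; see the notebook for the actual pipeline.
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned PyTorch checkpoint (placeholder path) to ONNX.
ort_model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-ezafe-albert", export=True
)

# Apply dynamic quantization (weights to INT8) for faster CPU inference.
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized", quantization_config=qconfig)
```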
## Repository Structure
The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the quantized/ directory) for efficient inference.
```
.
├── config.json              # PyTorch model configuration
├── model.safetensors        # PyTorch model weights
├── quantized/               # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model             # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```
## Model Details
- Task: Token Classification (Ezafe Detection)
- Base Model: HooshvareLab/albert-fa-zwnj-base-v2
- Format:
  - PyTorch (root directory)
  - ONNX Quantized (inside `quantized/` directory)
- Labels:
  - `0`: NO_EZAFE (no sound added)
  - `1`: NEEDS_EZAFE (add /e/ or /ye/ sound)
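For reference, the same convention as a minimal Python mapping (the exact label strings stored in the model's `config.json` may differ):

```python
# Minimal reference mapping for the two classes listed above; the exact
# strings in the model's config.json may differ.
id2label = {0: "NO_EZAFE", 1: "NEEDS_EZAFE"}
label2id = {label: idx for idx, label in id2label.items()}
```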
## Performance
The model was trained on a refined subset of the HomoRich dataset, using context-aware labeling.
| Metric | Score |
|---|---|
| F1 Score | 98.73% |
Evaluated on a held-out test set.
## Usage
### 1. Using Transformers (PyTorch)
Use the root path to load the PyTorch model for training or standard inference.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشمنوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label for each token.
predictions = torch.argmax(logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # Strip the SentencePiece word-boundary marker for readability.
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
### 2. Using ONNX Runtime (Production)
For production environments, load from the quantized subfolder. This is the format used in the Piper-with-LCA-Phonemizer Docker container.
```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch

model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
    model_id,
    subfolder="quantized",
    file_name="model_quantized.onnx",
)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشمنوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token.
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # Strip the SentencePiece word-boundary marker for readability.
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
## Training Data
The model was trained on the HomoRich dataset, a high-quality resource for Persian grapheme-to-phoneme (G2P) tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
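The exact preprocessing lives in the training pipeline notebook linked above. As a rough, hypothetical illustration of the idea (not the project's actual code): a word can be tagged NEEDS_EZAFE when its phonemic transcription ends in an /e/ that the written form does not encode.

```python
# Hypothetical illustration of the labeling idea, NOT the project's actual
# preprocessing. A word gets label 1 (NEEDS_EZAFE) when its phonemic form
# ends in /e/ (which also covers /ye/) that the spelling does not encode;
# a final ه already writes a final /e/, so it is excluded (simplified).
def ezafe_label(word: str, phonemes: str) -> int:
    return 1 if phonemes.endswith("e") and not word.endswith("ه") else 0

assert ezafe_label("کتاب", "ketAbe") == 1  # ketâb-e → unwritten Ezafe
assert ezafe_label("دارد", "dArad") == 0   # no Ezafe sound
```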