---
language:
- fa
tags:
- token-classification
- ezafe
- nlp
- onnx
- piper-tts
- text-to-speech
base_model: HooshvareLab/albert-fa-zwnj-base-v2
datasets:
- MahtaFetrat/HomoRich-G2P-Persian
metrics:
- f1
- accuracy
- precision
- recall
model-index:
- name: Persian Ezafe Detection (ALBERT)
  results:
  - task:
      type: token-classification
      name: Ezafe Detection
    dataset:
      name: HomoRich-G2P-Persian
      type: MahtaFetrat/HomoRich-G2P-Persian
    metrics:
    - name: F1
      type: f1
      value: 0.9873
---

# Persian Ezafe Detection (ALBERT)

This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from **HooshvareLab/albert-fa-zwnj-base-v2**.

It was designed to resolve phonemization ambiguities where the Ezafe (kasra) sound is grammatically required but not written in the text.

## Project Context

This model was developed as part of the **Piper with LCA Phonemizer** project. The full source code, implementation details, and the custom PiperTTS build are available in the official repository:

🔗 **[https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer)**

## Training Pipeline Notebook

The full **end-to-end training pipeline** used to build this model (dataset preparation, SpaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook:

📓 **ezafe_detection_training_pipeline.ipynb**
🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb

## Repository Structure

The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the `quantized/` directory) for efficient inference.

```text
.
├── config.json              # PyTorch model configuration
├── model.safetensors        # PyTorch model weights
├── quantized/               # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model             # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```

## Model Details

- **Task:** Token Classification (Ezafe Detection)
- **Base Model:** `HooshvareLab/albert-fa-zwnj-base-v2`
- **Formats:**
  - PyTorch (root directory)
  - Quantized ONNX (`quantized/` directory)
- **Labels:**
  - `0`: **NO_EZAFE** (no sound added)
  - `1`: **NEEDS_EZAFE** (add an /e/ or /ye/ sound)

## Performance

The model was trained on a refined subset of the [HomoRich](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian) dataset, utilizing context-aware labelling.

| Metric | Score |
| :--- | :--- |
| **F1 Score** | **98.73%** |

*Evaluated on a held-out test set.*
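
For reference, the reported metrics relate through the standard binary-classification formulas. A small self-contained sketch (the confusion counts below are made up for illustration, not the model's actual numbers):

```python
# Illustrative only: tp/fp/fn are invented counts, not this model's
# real confusion matrix. Shows how precision, recall, and F1 combine.

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=987, fp=13, fn=13)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.987 0.987 0.987
```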

## Usage

### 1. Using Transformers (PyTorch)

Use the root path to load the PyTorch model for training or standard inference.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشمنوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
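
If you need the Ezafe made explicit in the text itself, word-level predictions can be rendered back, for example as a kasra diacritic. This is a deliberately naive, hypothetical post-processing sketch (`annotate_ezafe` is not part of the released code, and it ignores words whose Ezafe is pronounced /ye/):

```python
# Hypothetical post-processing: append the kasra "ِ" (U+0650) to words
# flagged as NEEDS_EZAFE. Naive on purpose: it does not handle the
# /ye/ spelling cases (e.g. words ending in ه).

def annotate_ezafe(words, labels):
    """Join words, suffixing a kasra wherever the label is 1."""
    return " ".join(
        w + "\u0650" if lab == 1 else w for w, lab in zip(words, labels)
    )

print(annotate_ezafe(["کتابخانه", "مرکزی", "دانشگاه", "شریف"], [1, 1, 1, 0]))
```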

### 2. Using ONNX Runtime (Production)

For production environments, load from the `quantized` subfolder. This is the format used in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) Docker container.

```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
    model_id,
    subfolder="quantized",
    file_name="model_quantized.onnx",
)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشمنوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```

## Training Data

The model uses the **HomoRich** dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels based on phonemic transcriptions.
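
The exact extraction logic lives in the training pipeline notebook; as a rough, hypothetical illustration of the idea, a word can be flagged when its phonemic transcription ends in an Ezafe vowel that its written form does not encode. The `ezafe_label` rule below is an assumption for illustration, not the project's actual code:

```python
# Hypothetical simplified labeling rule: a word NEEDS_EZAFE (1) when its
# phonemes carry a trailing /e/ or /ye/ that the written form does not
# already express with a final ه or ی. The real pipeline is more involved.

def ezafe_label(grapheme: str, phonemes: str) -> int:
    """Return 1 (NEEDS_EZAFE) if the phonemes end in an unwritten Ezafe."""
    if phonemes.endswith("ye") and not grapheme.endswith("ی"):
        return 1
    if phonemes.endswith("e") and not grapheme.endswith("ه"):
        return 1
    return 0

# "ketAb-e" (book of ...) vs. plain "ketAb":
print(ezafe_label("کتاب", "ketAbe"))  # 1 -> Ezafe required
print(ezafe_label("کتاب", "ketAb"))   # 0 -> no Ezafe
```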
|