---
language:
- fa
tags:
- token-classification
- ezafe
- nlp
- onnx
- piper-tts
- text-to-speech
base_model: HooshvareLab/albert-fa-zwnj-base-v2
datasets:
- MahtaFetrat/HomoRich-G2P-Persian
metrics:
- f1
- accuracy
- precision
- recall
model-index:
- name: Persian Ezafe Detection (ALBERT)
  results:
  - task:
      type: token-classification
      name: Ezafe Detection
    dataset:
      name: HomoRich-G2P-Persian
      type: MahtaFetrat/HomoRich-G2P-Persian
    metrics:
    - name: F1
      type: f1
      value: 0.9873
---

# Persian Ezafe Detection (ALBERT)

This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from **HooshvareLab/albert-fa-zwnj-base-v2**. It resolves phonemization ambiguities in which the Ezafe (kasra) sound is grammatically required but not written in the text.

## Project Context

This model was developed as part of the **Piper with LCA Phonemizer** project. The full source code, implementation details, and the custom Piper TTS build are available at the official repository:

🔗 **[https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer)**

## Training Pipeline Notebook

The full **end-to-end training pipeline** used to build this model (dataset preparation, SpaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook:

📓 **ezafe_detection_training_pipeline.ipynb**

🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb

## Repository Structure

The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the `quantized/` directory) for efficient inference.

```text
.
├── config.json              # PyTorch model configuration
├── model.safetensors        # PyTorch model weights
├── quantized/               # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model             # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```

## Model Details

- **Task:** Token Classification (Ezafe Detection)
- **Base Model:** `HooshvareLab/albert-fa-zwnj-base-v2`
- **Formats:**
  - PyTorch (root directory)
  - Quantized ONNX (inside the `quantized/` directory)
- **Labels:**
  - `0`: **NO_EZAFE** (no sound added)
  - `1`: **NEEDS_EZAFE** (add an /e/ or /ye/ sound)

## Performance

The model was trained on a refined subset of the [HomoRich](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian) dataset, using context-aware labeling.

| Metric | Score |
| :--- | :--- |
| **F1 Score** | **98.73%** |

*Evaluated on a held-out test set.*

## Usage

### 1. Using Transformers (PyTorch)

Use the root path to load the PyTorch model for training or standard inference.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
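Because the model predicts per SentencePiece subword rather than per word, a post-processing step is needed before the Ezafe sound can be attached. The sketch below shows one way to do this; `merge_subwords` and `apply_ezafe` are hypothetical helpers, not part of this repository, and the letter-based rule for choosing /ye/ over /e/ is a simplification (a final ه pronounced as the consonant /h/, as in دانشگاه, would need phonemic information to handle correctly).

```python
# Hedged sketch: turning per-token NEEDS_EZAFE predictions into pronounceable
# words. Both helpers are hypothetical and not part of this repository.

VOWEL_FINALS = set("اوهی")  # word-final letters usually pronounced as vowels


def merge_subwords(tokens, labels):
    """Group SentencePiece pieces (new words start with '▁') into whole words,
    keeping the label of each word's last piece (Ezafe attaches word-finally)."""
    words, word_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("▁") or not words:
            words.append(tok.lstrip("▁"))
            word_labels.append(lab)
        else:
            words[-1] += tok
            word_labels[-1] = lab
    return words, word_labels


def apply_ezafe(words, labels):
    """Append the Ezafe sound to each word flagged NEEDS_EZAFE (label 1)."""
    out = []
    for word, label in zip(words, labels):
        if label == 1:
            # Simplified rule: vowel-final words take /ye/, others take /e/.
            out.append(word + ("-ye" if word[-1] in VOWEL_FINALS else "-e"))
        else:
            out.append(word)
    return out


tokens = ["▁کتاب", "خانه", "▁مرکزی", "▁شهر", "▁تهران"]
labels = [1, 1, 1, 1, 0]
words, word_labels = merge_subwords(tokens, labels)
print(apply_ezafe(words, word_labels))
# ['کتابخانه-ye', 'مرکزی-ye', 'شهر-e', 'تهران']
```

The merged output reads as "ketābkhāne-ye markazi-ye shahr-e Tehrān" ("the central library of the city of Tehran"), with each predicted Ezafe realized as a suffix.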
### 2. Using ONNX Runtime (Production)

For production environments, load from the `quantized` subfolder. This is the format used in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) Docker container.

```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
    model_id,
    subfolder="quantized",
    file_name="model_quantized.onnx"
)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```

## Training Data

The model is trained on the **HomoRich** dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
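To illustrate what "extracting binary labels from phonemic transcriptions" can mean, here is a toy heuristic. The `ezafe_label` function below is hypothetical and is not the project's actual preprocessing code; it assumes word-aligned (grapheme, phoneme) pairs in a romanized phonemic scheme where Ezafe surfaces as a final "e" or "je".

```python
# Hypothetical illustration of binary label extraction; NOT the actual
# preprocessing code used for this model. Assumes word-aligned
# (grapheme, phoneme) pairs with romanized phonemes.

def ezafe_label(grapheme, phoneme):
    """Return 1 if the phonemic form ends in an Ezafe vowel that is
    absent from the written form, else 0."""
    if phoneme.endswith("je"):
        return 1  # /je/ variant after vowel-final words
    if phoneme.endswith("e") and not grapheme.endswith("ه"):
        return 1  # bare /e/ not explained by a written final heh
    return 0

print(ezafe_label("کتاب", "ketābe"))   # Ezafe /e/ is unwritten -> 1
print(ezafe_label("خانه", "xāne"))     # final /e/ is the written heh -> 0
print(ezafe_label("خانه", "xāneje"))   # Ezafe /je/ after a vowel -> 1
```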