---
language:
- fa
tags:
- token-classification
- ezafe
- nlp
- onnx
- piper-tts
- text-to-speech
base_model: HooshvareLab/albert-fa-zwnj-base-v2
datasets:
- MahtaFetrat/HomoRich-G2P-Persian
metrics:
- f1
- accuracy
- precision
- recall
model-index:
- name: Persian Ezafe Detection (ALBERT)
results:
- task:
type: token-classification
name: Ezafe Detection
dataset:
name: HomoRich-G2P-Persian
type: MahtaFetrat/HomoRich-G2P-Persian
metrics:
- name: F1
type: f1
value: 0.9873
---
# Persian Ezafe Detection (ALBERT)
This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from **HooshvareLab/albert-fa-zwnj-base-v2**.
It resolves a common phonemization ambiguity: the "Ezafe" (kasra) sound is grammatically required but usually unwritten. For example, «کتاب من» is pronounced *ketâb-e man* ("my book"), even though the /e/ never appears in the text.
## Project Context
This model was developed as part of the **Piper with LCA Phonemizer** project. You can find the full source code, implementation details, and the custom PiperTTS build at the official repository:
🔗 **[https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer)**
## Training Pipeline Notebook
The full **end-to-end training pipeline** used to build this model (dataset preparation, spaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook here:
📓 **ezafe_detection_training_pipeline.ipynb**
🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb
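The notebook above is the authoritative pipeline. For orientation, the export and quantization steps roughly follow the standard 🤗 Optimum flow; the sketch below only illustrates that idea, and the local paths and quantization config are assumptions, not necessarily what the notebook uses:
```python
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned PyTorch checkpoint to ONNX (path is a placeholder)
onnx_model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-checkpoint", export=True
)
onnx_model.save_pretrained("onnx_export/")

# Dynamic INT8 quantization; produces model_quantized.onnx by default
quantizer = ORTQuantizer.from_pretrained("onnx_export/")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized/", quantization_config=qconfig)
```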
## Repository Structure
The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the `quantized/` directory) for efficient inference.
```text
.
├── config.json                 # PyTorch model configuration
├── model.safetensors           # PyTorch model weights
├── quantized/                  # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx    # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model                # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```
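If you only need individual files rather than the whole repo, `huggingface_hub` can fetch them by the paths shown above, for example the quantized ONNX weights:
```python
from huggingface_hub import hf_hub_download

# Download just the quantized ONNX weights (path follows the tree above)
onnx_path = hf_hub_download(
    repo_id="abreza/persian-ezafe-albert",
    filename="model_quantized.onnx",
    subfolder="quantized",
)
print(onnx_path)  # local cache path of the downloaded file
```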
## Model Details
- **Task:** Token Classification (Ezafe Detection)
- **Base Model:** `HooshvareLab/albert-fa-zwnj-base-v2`
- **Format:**
- PyTorch (Root directory)
- ONNX Quantized (Inside `quantized/` directory)
- **Labels:**
- `0`: **NO_EZAFE** (No sound added)
- `1`: **NEEDS_EZAFE** (Add /e/ or /ye/ sound)
## Performance
The model was trained on a refined subset of the [HomoRich](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian) dataset, using context-aware labeling.
| Metric | Score |
| :--- | :--- |
| **F1 Score** | **98.73%** |
*Evaluated on a held-out test set.*
## Usage
### 1. Using Transformers (PyTorch)
Use the root path to load the PyTorch model for training or standard inference.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "-", 1: "EZAFE"}
for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # strip the SentencePiece word-boundary marker before printing
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
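The snippet above prints one label per subword token. To get one label per word instead, a minimal sketch (assuming the fast tokenizer returned by `AutoTokenizer`, whose `word_ids()` maps token positions back to whitespace-separated words) is:
```python
word_ids = inputs.word_ids(0)  # token position -> source word index
word_labels = {}
for pos, wid in enumerate(word_ids):
    if wid is not None and wid not in word_labels:
        # keep the prediction of the first subword of each word
        word_labels[wid] = id2label[predictions[0][pos].item()]

# assumes word indices line up with whitespace splitting of the input
for wid, word in enumerate(text.split()):
    print(f"{word} -> {word_labels.get(wid, '-')}")
```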
### 2. Using ONNX Runtime (Production)
For production environments, load from the `quantized` subfolder. This is the format used in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) Docker container.
```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch
model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
model_id,
subfolder="quantized",
file_name="model_quantized.onnx"
)
text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "-", 1: "EZAFE"}
for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # strip the SentencePiece word-boundary marker before printing
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
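Downstream, a `NEEDS_EZAFE` label means the phonemizer appends /e/ after a consonant-final word or /ye/ after a vowel-final one. The actual rules live in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) repo; the function below is only a hypothetical illustration with a simplified vowel inventory:
```python
def apply_ezafe(word_phonemes: str, needs_ezafe: bool) -> str:
    """Append the Ezafe phonemes to a word's transcription (illustrative only)."""
    if not needs_ezafe or not word_phonemes:
        return word_phonemes
    vowels = set("aeiouâ")  # simplified inventory; an assumption, not the repo's
    return word_phonemes + ("ye" if word_phonemes[-1] in vowels else "e")

print(apply_ezafe("ketâb", True))  # -> ketâbe  (consonant-final: /e/)
print(apply_ezafe("xâne", True))   # -> xâneye (vowel-final: /ye/)
```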
## Training Data
The model was trained on the **HomoRich** dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
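As a rough illustration of that extraction step (the real preprocessing is in the training notebook; this heuristic and all names here are assumptions):
```python
def ezafe_labels(words: list[str], phoneme_words: list[str]) -> list[int]:
    """Label a word 1 when its phonemes end in an Ezafe the spelling lacks."""
    labels = []
    for word, phones in zip(words, phoneme_words):
        # /ye/ after vowel-final words; a written final ه already
        # accounts for a final /e/ sound, so it is not Ezafe
        has_ezafe = phones.endswith("ye") or (
            phones.endswith("e") and not word.endswith("ه")
        )
        labels.append(int(has_ezafe))
    return labels

# «کتاب من» ("my book") transcribed as ketâb-e man
print(ezafe_labels(["کتاب", "من"], ["ketâbe", "man"]))  # -> [1, 0]
```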