---
language:
- fa
tags:
- token-classification
- ezafe
- nlp
- onnx
- piper-tts
- text-to-speech
base_model: HooshvareLab/albert-fa-zwnj-base-v2
datasets:
- MahtaFetrat/HomoRich-G2P-Persian
metrics:
- f1
- accuracy
- precision
- recall
model-index:
- name: Persian Ezafe Detection (ALBERT)
  results:
  - task:
      type: token-classification
      name: Ezafe Detection
    dataset:
      name: HomoRich-G2P-Persian
      type: MahtaFetrat/HomoRich-G2P-Persian
    metrics:
    - name: F1
      type: f1
      value: 0.9873
---

# Persian Ezafe Detection (ALBERT)

This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from **HooshvareLab/albert-fa-zwnj-base-v2**. It resolves phonemization ambiguities in which the Ezafe (kasra) sound is grammatically required but not written in the text.

## Project Context

This model was developed as part of the **Piper with LCA Phonemizer** project. The full source code, implementation details, and the custom Piper TTS build are available at the official repository:

🔗 **[https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer)**

## Training Pipeline Notebook

The full **end-to-end training pipeline** used to build this model (dataset preparation, SpaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook:

📓 **ezafe_detection_training_pipeline.ipynb**

🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb

## Repository Structure

The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the `quantized/` directory) for efficient inference.

```text
.
├── config.json              # PyTorch model configuration
├── model.safetensors        # PyTorch model weights
├── quantized/               # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model             # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```

## Model Details

- **Task:** Token Classification (Ezafe Detection)
- **Base Model:** `HooshvareLab/albert-fa-zwnj-base-v2`
- **Formats:**
  - PyTorch (root directory)
  - Quantized ONNX (inside the `quantized/` directory)
- **Labels:**
  - `0`: **NO_EZAFE** (no sound added)
  - `1`: **NEEDS_EZAFE** (add an /e/ or /ye/ sound)

## Performance

The model was trained on a refined subset of the [HomoRich](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian) dataset, using context-aware labeling.

| Metric | Score |
| :--- | :--- |
| **F1 Score** | **98.73%** |

*Evaluated on a held-out test set.*

## Usage

### 1. Using Transformers (PyTorch)

Use the root path to load the PyTorch model for training or standard inference.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
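Because the model predicts per SentencePiece subword rather than per word, a post-processing step is needed before the Ezafe sound can be attached. The sketch below shows one way to do this; `merge_subwords` and `apply_ezafe` are hypothetical helpers, not part of this repository, and the letter-based rule for choosing /ye/ over /e/ is a simplification (a final ه pronounced as the consonant /h/, as in دانشگاه, would need phonemic information to handle correctly).

```python
# Hedged sketch: turning per-token NEEDS_EZAFE predictions into pronounceable
# words. Both helpers are hypothetical and not part of this repository.

VOWEL_FINALS = set("اوهی")  # word-final letters usually pronounced as vowels


def merge_subwords(tokens, labels):
    """Group SentencePiece pieces (new words start with '▁') into whole words,
    keeping the label of each word's last piece (Ezafe attaches word-finally)."""
    words, word_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("▁") or not words:
            words.append(tok.lstrip("▁"))
            word_labels.append(lab)
        else:
            words[-1] += tok
            word_labels[-1] = lab
    return words, word_labels


def apply_ezafe(words, labels):
    """Append the Ezafe sound to each word flagged NEEDS_EZAFE (label 1)."""
    out = []
    for word, label in zip(words, labels):
        if label == 1:
            # Simplified rule: vowel-final words take /ye/, others take /e/.
            out.append(word + ("-ye" if word[-1] in VOWEL_FINALS else "-e"))
        else:
            out.append(word)
    return out


tokens = ["▁کتاب", "خانه", "▁مرکزی", "▁شهر", "▁تهران"]
labels = [1, 1, 1, 1, 0]
words, word_labels = merge_subwords(tokens, labels)
print(apply_ezafe(words, word_labels))
# ['کتابخانه-ye', 'مرکزی-ye', 'شهر-e', 'تهران']
```

The merged output reads as "ketābkhāne-ye markazi-ye shahr-e Tehrān" ("the central library of the city of Tehran"), with each predicted Ezafe realized as a suffix.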
### 2. Using ONNX Runtime (Production)

For production environments, load from the `quantized` subfolder. This is the format used in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) Docker container.

```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch

model_id = "abreza/persian-ezafe-albert"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
    model_id,
    subfolder="quantized",
    file_name="model_quantized.onnx"
)

text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
id2label = {0: "-", 1: "EZAFE"}

for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```

## Training Data

The model is trained on the **HomoRich** dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
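To illustrate what "extracting binary labels from phonemic transcriptions" can mean, here is a toy heuristic. The `ezafe_label` function below is hypothetical and is not the project's actual preprocessing code; it assumes word-aligned (grapheme, phoneme) pairs in a romanized phonemic scheme where Ezafe surfaces as a final "e" or "je".

```python
# Hypothetical illustration of binary label extraction; NOT the actual
# preprocessing code used for this model. Assumes word-aligned
# (grapheme, phoneme) pairs with romanized phonemes.

def ezafe_label(grapheme, phoneme):
    """Return 1 if the phonemic form ends in an Ezafe vowel that is
    absent from the written form, else 0."""
    if phoneme.endswith("je"):
        return 1  # /je/ variant after vowel-final words
    if phoneme.endswith("e") and not grapheme.endswith("ه"):
        return 1  # bare /e/ not explained by a written final heh
    return 0

print(ezafe_label("کتاب", "ketābe"))   # Ezafe /e/ is unwritten -> 1
print(ezafe_label("خانه", "xāne"))     # final /e/ is the written heh -> 0
print(ezafe_label("خانه", "xāneje"))   # Ezafe /je/ after a vowel -> 1
```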