---
language:
- fa
tags:
- token-classification
- ezafe
- nlp
- onnx
- piper-tts
- text-to-speech
base_model: HooshvareLab/albert-fa-zwnj-base-v2
datasets:
- MahtaFetrat/HomoRich-G2P-Persian
metrics:
- f1
- accuracy
- precision
- recall
model-index:
- name: Persian Ezafe Detection (ALBERT)
results:
- task:
type: token-classification
name: Ezafe Detection
dataset:
name: HomoRich-G2P-Persian
type: MahtaFetrat/HomoRich-G2P-Persian
metrics:
- name: F1
type: f1
value: 0.9873
---
# Persian Ezafe Detection (ALBERT)
This is a lightweight, high-performance Ezafe detection model for Persian, fine-tuned from **HooshvareLab/albert-fa-zwnj-base-v2**.
It resolves a common phonemization ambiguity: the "Ezafe" (kasra) sound is grammatically required but usually unwritten. For example, «کتاب من» is pronounced *ketâb-e man* ("my book"), even though the /e/ never appears in the text.
## Project Context
This model was developed as part of the **Piper with LCA Phonemizer** project. You can find the full source code, implementation details, and the custom PiperTTS build at the official repository:
🔗 **[https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer)**
## Training Pipeline Notebook
The full **end-to-end training pipeline** used to build this model (dataset preparation, spaCy-based labeling, ALBERT fine-tuning, ONNX export, and quantization) is available as a notebook here:
📓 **ezafe_detection_training_pipeline.ipynb**
🔗 https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer/blob/main/albert_finetuning/ezafe_detection_training_pipeline.ipynb
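The notebook above is the authoritative pipeline. For orientation, the export and quantization steps roughly follow the standard 🤗 Optimum flow; the sketch below only illustrates that idea, and the local paths and quantization config are assumptions, not necessarily what the notebook uses:
```python
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned PyTorch checkpoint to ONNX (path is a placeholder)
onnx_model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-checkpoint", export=True
)
onnx_model.save_pretrained("onnx_export/")

# Dynamic INT8 quantization; produces model_quantized.onnx by default
quantizer = ORTQuantizer.from_pretrained("onnx_export/")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized/", quantization_config=qconfig)
```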
## Repository Structure
The repository contains both the standard PyTorch weights (root) and a quantized ONNX model (in the `quantized/` directory) for efficient inference.
```text
.
├── config.json                 # PyTorch model configuration
├── model.safetensors           # PyTorch model weights
├── quantized/                  # Optimized ONNX model folder
│   ├── config.json
│   ├── model_quantized.onnx    # Quantized ONNX weights
│   ├── ort_config.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── Readme.md
├── special_tokens_map.json
├── spiece.model                # SentencePiece model
├── tokenizer_config.json
└── tokenizer.json
```
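If you only need individual files rather than the whole repo, `huggingface_hub` can fetch them by the paths shown above, for example the quantized ONNX weights:
```python
from huggingface_hub import hf_hub_download

# Download just the quantized ONNX weights (path follows the tree above)
onnx_path = hf_hub_download(
    repo_id="abreza/persian-ezafe-albert",
    filename="model_quantized.onnx",
    subfolder="quantized",
)
print(onnx_path)  # local cache path of the downloaded file
```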
## Model Details
- **Task:** Token Classification (Ezafe Detection)
- **Base Model:** `HooshvareLab/albert-fa-zwnj-base-v2`
- **Format:**
- PyTorch (Root directory)
- ONNX Quantized (Inside `quantized/` directory)
- **Labels:**
- `0`: **NO_EZAFE** (No sound added)
- `1`: **NEEDS_EZAFE** (Add /e/ or /ye/ sound)
## Performance
The model was trained on a refined subset of the [HomoRich](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian) dataset, using context-aware labeling.
| Metric | Score |
| :--- | :--- |
| **F1 Score** | **98.73%** |
*Evaluated on a held-out test set.*
## Usage
### 1. Using Transformers (PyTorch)
Use the root path to load the PyTorch model for training or standard inference.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "-", 1: "EZAFE"}
for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # strip the SentencePiece word-boundary marker before printing
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
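The snippet above prints one label per subword token. To get one label per word instead, a minimal sketch (assuming the fast tokenizer returned by `AutoTokenizer`, whose `word_ids()` maps token positions back to whitespace-separated words) is:
```python
word_ids = inputs.word_ids(0)  # token position -> source word index
word_labels = {}
for pos, wid in enumerate(word_ids):
    if wid is not None and wid not in word_labels:
        # keep the prediction of the first subword of each word
        word_labels[wid] = id2label[predictions[0][pos].item()]

# assumes word indices line up with whitespace splitting of the input
for wid, word in enumerate(text.split()):
    print(f"{word} -> {word_labels.get(wid, '-')}")
```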
### 2. Using ONNX Runtime (Production)
For production environments, load from the `quantized` subfolder. This is the format used in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) Docker container.
```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer
import torch
model_id = "abreza/persian-ezafe-albert"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="quantized")
model = ORTModelForTokenClassification.from_pretrained(
model_id,
subfolder="quantized",
file_name="model_quantized.onnx"
)
text = "کتابخانه مرکزی دانشگاه شریف، فضای وسیع و چشم‌نوازی دارد."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "-", 1: "EZAFE"}
for token, label_id in zip(tokens, predictions[0].tolist()):
    if token not in tokenizer.all_special_tokens:
        # strip the SentencePiece word-boundary marker before printing
        print(f"{token.replace('▁', ''):<15} | {id2label[label_id]}")
```
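Downstream, a `NEEDS_EZAFE` label means the phonemizer appends /e/ after a consonant-final word or /ye/ after a vowel-final one. The actual rules live in the [Piper-with-LCA-Phonemizer](https://github.com/MahtaFetrat/Piper-with-LCA-Phonemizer) repo; the function below is only a hypothetical illustration with a simplified vowel inventory:
```python
def apply_ezafe(word_phonemes: str, needs_ezafe: bool) -> str:
    """Append the Ezafe phonemes to a word's transcription (illustrative only)."""
    if not needs_ezafe or not word_phonemes:
        return word_phonemes
    vowels = set("aeiouâ")  # simplified inventory; an assumption, not the repo's
    return word_phonemes + ("ye" if word_phonemes[-1] in vowels else "e")

print(apply_ezafe("ketâb", True))  # -> ketâbe  (consonant-final: /e/)
print(apply_ezafe("xâne", True))   # -> xâneye (vowel-final: /ye/)
```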
## Training Data
The model was trained on the **HomoRich** dataset, a high-quality resource for Persian grapheme-to-phoneme tasks. The dataset was preprocessed to extract binary Ezafe labels from its phonemic transcriptions.
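As a rough illustration of that extraction step (the real preprocessing is in the training notebook; this heuristic and all names here are assumptions):
```python
def ezafe_labels(words: list[str], phoneme_words: list[str]) -> list[int]:
    """Label a word 1 when its phonemes end in an Ezafe the spelling lacks."""
    labels = []
    for word, phones in zip(words, phoneme_words):
        # /ye/ after vowel-final words; a written final ه already
        # accounts for a final /e/ sound, so it is not Ezafe
        has_ezafe = phones.endswith("ye") or (
            phones.endswith("e") and not word.endswith("ه")
        )
        labels.append(int(has_ezafe))
    return labels

# «کتاب من» ("my book") transcribed as ketâb-e man
print(ezafe_labels(["کتاب", "من"], ["ketâbe", "man"]))  # -> [1, 0]
```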