metadata
library_name: transformers
tags: []

Quran Phoneme-Level Tokenizer

This is a custom tokenizer for phoneme-level transcription of Quranic Arabic, including diacritics (harakat).
It is designed for phoneme-level fine-tuning of Whisper or other speech-to-text models.


Features

  • Handles Quranic text transliterated in Buckwalter format (e.g., bi, {ll~ahi, r~aHiym).
  • Preserves diacritics (fatha, kasra, damma, shadda, sukun).
  • Outputs phoneme-level tokens suitable for speech recognition fine-tuning.
  • Includes special tokens: <pad>, <s>, </s>, <unk>.
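
To make the Buckwalter examples above concrete, here is a minimal sketch of how the diacritics can be read off a transliterated word. It relies only on the standard Buckwalter symbols for the five harakat listed (a, i, u, ~, o). The helper name split_units and the grouping rule (attach each diacritic to the letter before it) are assumptions for illustration; this is not necessarily how this tokenizer segments its input.

```python
# Standard Buckwalter symbols for the harakat listed above.
DIACRITICS = {
    "a": "fatha",
    "i": "kasra",
    "u": "damma",
    "~": "shadda",
    "o": "sukun",
}

def split_units(word):
    """Group each base letter with the diacritics that follow it.

    Hypothetical helper: the grouping rule is an assumption, not the
    tokenizer's documented behavior.
    """
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            units[-1] += ch   # attach the diacritic to the previous letter
        else:
            units.append(ch)  # start a new unit with a base letter
    return units

for word in ["bi", "{ll~ahi", "r~aHiym"]:
    print(word, "->", split_units(word))
# bi -> ['bi']
# {ll~ahi -> ['{', 'l', 'l~a', 'hi']
# r~aHiym -> ['r~a', 'Hi', 'y', 'm']
```

Keeping the diacritics attached to their letter means no vowel or gemination information is lost, which is what a phoneme-level target for speech recognition needs.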

How to use

from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("bahriddin/quran-phoneme-tokenizer")

# Encode a phoneme string into token ids
phoneme_text = "b_i s_sukun m_i"
inputs = tokenizer(phoneme_text)
print(inputs["input_ids"])

# Decode the ids back into text
decoded = tokenizer.decode(inputs["input_ids"])
print(decoded)
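
The role of the special tokens in this round trip can be illustrated with a small pure-Python sketch. Everything in it is hypothetical: the toy vocabulary, the ids, and the choice to wrap each sequence in <s> ... </s> are assumptions for illustration, and the published tokenizer's real vocabulary and post-processing may differ.

```python
# Hypothetical toy vocabulary; the published tokenizer's real ids may differ.
SPECIALS = ["<pad>", "<s>", "</s>", "<unk>"]
PHONEMES = ["b_i", "s_sukun", "m_i"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + PHONEMES)}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}

def encode(text):
    """Map whitespace-separated tokens to ids, wrapped in <s> ... </s>."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]
    return [VOCAB["<s>"]] + ids + [VOCAB["</s>"]]

def pad_batch(sequences):
    """Right-pad every sequence with <pad> to the longest length in the batch."""
    width = max(len(seq) for seq in sequences)
    return [seq + [VOCAB["<pad>"]] * (width - len(seq)) for seq in sequences]

def decode(ids):
    """Drop <pad>, <s>, </s> and join the remaining tokens with spaces."""
    skip = {VOCAB["<pad>"], VOCAB["<s>"], VOCAB["</s>"]}
    return " ".join(ID_TO_TOKEN[i] for i in ids if i not in skip)

# "x_a" is not in the toy vocabulary, so it maps to <unk>.
batch = pad_batch([encode("b_i s_sukun m_i"), encode("b_i x_a")])
print(batch)                           # [[1, 4, 5, 6, 2], [1, 4, 3, 2, 0]]
print([decode(ids) for ids in batch])  # ['b_i s_sukun m_i', 'b_i <unk>']
```

With the real tokenizer, the equivalent batching step is usually done by passing a list of strings with padding=True, provided a pad token is configured.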