Question about correct NeMo ASR model and configuration for Arabic / Quran recitation (streaming & fine-tuning)

by NovaCon-AI - opened

Assalamu aleykum wa rahmatullahi wa barakatuhu,

I am currently working on an Arabic ASR project focused specifically on Quran recitation (tajweed-style, clear recitation, not conversational speech).

My goal is:
• Automatic Speech Recognition (ASR) for Quran recitation
• Ideally suitable for streaming / near-real-time use
• With the option to fine-tune on a custom Quran recitation dataset

While exploring NeMo-based setups (including this repository / endpoint), I have run into a fundamental problem:
I cannot tell which ASR model is intended to be used, or what the recommended configuration is for this use case.

Specifically, I am unclear about:

  1. Which NeMo ASR model is recommended for Arabic Quran recitation (CTC vs RNNT, Conformer vs Parakeet, etc.)
  2. Whether this setup is intended for offline or streaming ASR
  3. How the model is selected (for example via config.repository or environment variables)
  4. What the correct and stable configuration would be for fine-tuning on Arabic Quran data
  5. Whether there are known limitations or best practices for Quran-style recitation (long vowels, pauses, tajweed)

At the moment, the code and Docker setup do not make it obvious which pretrained model is expected, and the README does not clarify this either.

I would really appreciate guidance on:
• The correct model choice
• A recommended configuration
• Whether this approach is suitable for fine-tuning on Quran recitation data

Barakallahu feek!
