from pathlib import Path

# Paths for requested models and evaluation requests
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")

# Static assets
BANNER = "assets/banner.svg"
CSV_PATH = "assets/benchmark-data.csv"

TITLE = "Open Persian Automatic Speech Recognition Leaderboard"

INTRODUCTION_TEXT = "📐 The Open Persian ASR Leaderboard ranks and evaluates speech recognition models \
    on the Hugging Face Hub. \
    \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and Average [CER](https://huggingface.co/spaces/evaluate-metric/cer) (⬇️ lower is better). Check the 📈 Metrics tab to understand how the models are evaluated. \
    \nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨. \
    \nThe leaderboard includes Persian/Farsi ASR evaluation benchmarks. \
    \nWe created our own high-quality evaluation dataset, the [Persian ASR Benchmark](https://huggingface.co/datasets/C1Tech/Persian-ASR-Benchmark), which is used to evaluate the models listed here."

CITATION_TEXT = """@misc{open-asr-leaderboard,
    title = {Open Persian ASR Leaderboard},
    author = {Arash Azma and Parsa Sinichi and Mohammad Hosseini},
    year = {2025},
    publisher = {C1tech},
    howpublished = {\\url{https://huggingface.co/spaces/C1Tech/Open_Persian_ASR_Leaderboard}}
}
"""

METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.

## Metrics

Models are evaluated using Character Error Rate (CER) and Word Error Rate (WER). CER measures the accuracy of a system at the character level, so a misspelling, a missing letter, or another small deviation is counted in proportion to its size, whereas WER registers it as a whole-word error. A lower CER indicates better accuracy in reproducing the reference transcript character by character.

WER is also reported to provide a word-level perspective, but models are ranked primarily by their CER, which emphasizes fine-grained transcription quality.

For details on reproducing the benchmark numbers, refer to the [Persian-ASR-Leaderboard GitHub repository](https://github.com/c1tech-group/Persian-ASR-Leaderboard).


---

### Character Error Rate (CER)

Character Error Rate measures the **accuracy** of automatic speech recognition systems at the **character level**. It is the number of character edits (substitutions, insertions, and deletions) separating the system's output from the reference (correct) transcript, divided by the number of characters in the reference. **A lower CER value indicates higher accuracy**.

Take the following example:

**Reference**: علی کتاب خواند

**Prediction**: علی کتاه خاند

| Reference: | د | ن | ا | و | خ | | ب | ا | ت | ک | | ی | ل | ع |
| ----------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prediction: | د | ن | ا | - | خ | | ه | ا | ت | ک | | ی | ل | ع |
| Label: | ✅ | ✅ | ✅ | D | ✅ | ✅ | S | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

**Explanation of labels**:

* **S (Substitution)**: ب was substituted with ه
* **D (Deletion)**: و was deleted

```
Total reference characters (N) = 14 (including the two spaces)
Errors = 1 substitution (ب→ه) + 1 deletion (و) = 2 errors
CER = (S + I + D) / N = 2 / 14
```

**Final CER ≈ 0.1429 (≈ 14.3%)**

---

### Word Error Rate (WER)

Word Error Rate measures the **accuracy** of automatic speech recognition systems at the word level. It is the number of word edits (substitutions, insertions, and deletions) separating the system's output from the reference (correct) transcript, divided by the number of words in the reference. **A lower WER value indicates higher accuracy**.

Take the following example:

| Reference: | رفتند | مدرسه | به | پارسا | و | آرش |
| ----------- | -------- | ----- | --- | --------- | --- | --- |
| Prediction: | **رفتن** | مدرسه | | **بارسا** | و | آرش |
| Label: | S | ✅ | D | S | ✅ | ✅ |

Here, we have:

* 2 substitutions ("پارسا" → "بارسا" and "رفتند" → "رفتن")
* 0 insertions
* 1 deletion ("به" is missing)

This gives **3 errors in total**. To get the word error rate (WER), we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in the reference (N), which for this example is 6:

```
WER = (S + I + D) / N = (2 + 0 + 1) / 6 = 0.5
```

This gives a **WER of 0.5**, or **50%**.

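
As a rough illustration of the two worked examples above, the sketch below computes CER and WER with a plain Levenshtein edit distance. It is a minimal standalone version and not the leaderboard's evaluation code: the helper names `edit_distance`, `cer`, and `wer` are made up for this example, no text normalization is applied, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences
    # (lists of words, or strings treated as sequences of characters).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from prediction
                curr[j - 1] + 1,     # insertion: extra item in prediction
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    # Character-level errors divided by the number of reference characters
    # (spaces count as characters, as in the example above).
    return edit_distance(reference, prediction) / len(reference)


def wer(reference, prediction):
    # Word-level errors divided by the number of reference words.
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


# CER example: 2 errors over 14 reference characters
print(cer("علی کتاب خواند", "علی کتاه خاند"))

# WER example: 3 errors over 6 reference words
print(wer("آرش و پارسا به مدرسه رفتند", "آرش و بارسا مدرسه رفتن"))
```

Both functions share the same distance routine; only the unit of comparison changes (characters for CER, whitespace-separated words for WER).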
For a fair comparison, we calculate normalized CER and WER for all model checkpoints, meaning punctuation and casing are removed from both the references and the predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model).

---

## Limitations of WER for Persian

Persian has several linguistic features that make **Word Error Rate (WER)** less reliable as a metric.

### 1. Formal vs. Informal Variations

Persian often has multiple valid forms for the same sentence depending on formality:

- **Formal:** کتابم را از علی گرفتم
- **Informal:** کتابم رو از علی گرفتم

Both sentences are correct, but WER counts the difference between را and رو as a full word error, penalizing the model unfairly.

### 2. Morphological Complexity

Persian words often include clitics or attached pronouns (e.g., کتابم, رفتم), which can be split differently depending on tokenization. WER can exaggerate errors in these cases.

### 3. Word Segmentation Ambiguity

Persian writing does not always use spaces consistently, especially with prepositions, conjunctions, and enclitics. WER is sensitive to such inconsistencies, which can inflate error rates.

Taking the formal sentence from section 1 as the reference and the informal one as the prediction, the two metrics compare as follows:

#### Word Error Rate (WER) Calculation

- Substitution: را → رو counts as **1 word error**
- Total words in reference: 5
- **WER = 1 / 5 = 0.2**

#### Character Error Rate (CER) Calculation

- Character-level difference: ا → و (**1 character error**)
- Total characters in reference: 21 (including spaces)
- **CER = 1 / 21 ≈ 0.0476**

## How to reproduce our results

The ASR Leaderboard is an ongoing effort to benchmark open source/access speech recognition models where possible. Along with the Leaderboard, we're open-sourcing the codebase used for running these evaluations. For more details, head over to our repo: [Persian-ASR-Leaderboard GitHub repository](https://github.com/c1tech-group/Persian-ASR-Leaderboard)

P.S. We'd love to know which other models you'd like us to benchmark next.
Contributions are more than welcome! ♥️

## Benchmark datasets

| Dataset | Total Duration (h) | License |
| ------------------------------------------------------------------------------------- | ------------------ | --------------- |
| [FLEURS](https://huggingface.co/datasets/google/fleurs) | - | CC-BY-4.0 |
| [Persian-ASR-Benchmark](https://huggingface.co/datasets/C1Tech/Persian-ASR-Benchmark) | - | CC-BY-4.0 |
| [common_voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | - | CC0 |
| [ManaTTS](https://huggingface.co/datasets/MahtaFetrat/Mana-TTS) (only parts 70-77) | - | CC0 |

### Dataset and Normalization

During preprocessing, we noticed that some Persian words were written in Arabic forms (e.g., **دایرة المعارف**), which added unnecessary complexity and confused the model. We normalized such words to their standard Persian forms to improve consistency and model understanding. For more information about our normalization methods, please refer to our [GitHub page](https://github.com/c1tech-group/Persian-ASR-Leaderboard), where we describe our preprocessing pipeline in detail.

> Since many models do not release their training data, we created an **evaluation dataset** using audio recorded **after the public release dates (2 November 2025)** of those models. This keeps the comparison fair and prevents data leakage, as none of these samples were used during training.

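
To make the normalization idea in this section more concrete, here is a minimal sketch of what such a step could look like. It is only an assumption for illustration, not the preprocessing pipeline used for the leaderboard: the mapping `ARABIC_TO_PERSIAN` and the function `normalize_text` are hypothetical names, and the rules (mapping a few Arabic letters to Persian ones, dropping combining marks, replacing punctuation with spaces) are common choices rather than the documented ones.

```python
import unicodedata

# Hypothetical mapping from Arabic letters to their usual Persian counterparts:
# Arabic yeh, Arabic kaf, and teh marbuta.
ARABIC_TO_PERSIAN = str.maketrans({"ي": "ی", "ك": "ک", "ة": "ه"})


def normalize_text(text):
    # Step 1: replace Arabic letter forms with Persian ones.
    text = text.translate(ARABIC_TO_PERSIAN)

    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":
            # Step 2: drop combining marks such as short-vowel diacritics.
            continue
        if category.startswith("P"):
            # Step 3: turn punctuation (including ، and ؟) into a space.
            kept.append(" ")
            continue
        kept.append(ch)

    # Step 4: collapse the whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


# The Arabic-form example from this section
print(normalize_text("دایرة المعارف"))
```

Applying the same function to both references and predictions before scoring keeps the two sides comparable; the exact rules should be taken from the repository linked above.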
"""

MULTILINGUAL_TAB_TEXT = """
## 🌍 Multilingual ASR Evaluation
"""

LONGFORM_TAB_TEXT = """
## 📝 Long-form ASR Evaluation
"""

LEADERBOARD_CSS = """
#leaderboard-table th .header-content {
    white-space: nowrap;
}

#multilingual-table th .header-content {
    white-space: nowrap;
}

#multilingual-table th:hover {
    background-color: var(--table-row-focus);
}

#longform-table th .header-content {
    white-space: nowrap;
}

#longform-table th:hover {
    background-color: var(--table-row-focus);
}

.language-detail-modal {
    background: var(--background-fill-primary);
    border: 1px solid var(--border-color-primary);
    border-radius: 8px;
    padding: 1rem;
    margin: 1rem 0;
}
"""