SaraAlthubaiti committed
Commit b48d4e2 · verified · Parent(s): dcdec03

Update README.md

Files changed (1): README.md (+54 −0)

README.md CHANGED
@@ -88,8 +88,62 @@ output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "trans

  print("Generated Text:", output)
  ```
 
+
+ ## 🧪 Evaluation Results
+
+ ### 🎙️ ASR Performance (WER ↓)
+
+ | **Dataset** | **Ar-Octopus** | **Bilingual-Octopus** | **Trans-Octopus** | **Whisper-large-v3** | **SeamlessM4T** |
+ |:-------------|:---------------:|:---------------------:|:-----------------:|:--------------------:|:----------------:|
+ | **MGB2 (Arabic)** | 16.5 \| 6.5 | 15.2 \| 6.8 | **13.3 \| 5.9** | 16.2 \| 7.9 | 17.2 \| 8.4 |
+ | **test-clean (English)** | 82.5 \| 92.4 | **2.6 \| 1.4** | 67.3 \| 79.4 | 2.86 \| 0.98 | 2.68 \| 0.88 |
+ | **test-other (English)** | 86.9 \| 95.1 | **5.1 \| 3.4** | 71.5 \| 87.8 | 5.00 \| 2.05 | **5.07 \| 1.94** |
+ | **tedlium (English)** | 101.9 \| 77.4 | **5.1 \| 3.9** | 85.2 \| 63.6 | 11.9 \| 4.4 | 86.5 \| 62.2 |
+ | **Escwa (Code-Switched)** | 42.5 \| 26.3 | **40.8 \| 27.1** | 41.8 \| 25.1 | 47.3 \| 31.0 | 52.0 \| 35.3 |
+ | **Mixat-ALL (Code-Switched)** | 22.0 \| 9.0 | **23.4 \| 10.3** | 34.1 \| 10.6 | 29.0 \| 15.0 | 32.8 \| 16.9 |
+ | **Mixat-CS (Code-Switched)** | 26.4 \| 12.4 | **28.5 \| 14.9** | 27.8 \| 13.3 | 34.8 \| 20.6 | 38.2 \| 21.8 |
+ | **In-house Long-form** | 25.4 \| 13.0 | 24.9 \| 12.5 | **24.1 \| 12.1** | 26.7 \| 15.2 | 29.3 \| 18.6 |
+
+ > A **+86% English improvement** was observed with the addition of language tokens for the bilingual and translation variants.
+
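All scores in these tables are WER-style error rates (lower is better): the word-level edit distance between the hypothesis and the reference transcript, normalized by the reference length. As a rough reference only, a minimal stdlib sketch of that metric — not the exact scoring script used to produce these tables:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```

Production evaluations normally also apply text normalization (case, punctuation, diacritics) before scoring, which this sketch omits.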
+ ---
+
+ ### 🪶 Tiny-Octopus & Fine-Tuning (WER ↓)
+
+ | **Dataset** | **TinyOctopus LLaMA-3 1B** | **Fine-tuned LLaMA-3 1B** | **TinyOctopus DeepSeek 1.5B** | **Fine-tuned DeepSeek 1.5B** |
+ |:-------------|:-------------------------:|:-------------------------:|:-----------------------------:|:-----------------------------:|
+ | **MGB2 (Arabic)** | 22.6 \| 15.7 | 16.1 \| **9.5** | 23.2 \| 15.8 | **15.5 \| 9.2** |
+ | **test-clean (English)** | 7.5 \| 5.7 | **3.1 \| 1.3** | 7.7 \| 5.8 | 7.6 \| 5.7 |
+ | **test-other (English)** | 11.3 \| 8.0 | **6.9 \| 3.5** | 11.5 \| 8.2 | 11.3 \| 8.0 |
+ | **Escwa (Code-Switched)** | 42.5 \| 26.9 | **40.3 \| 24.4** | 43.6 \| 27.8 | 41.8 \| 26.3 |
+ | **Mixat-All** | 35.2 \| 19.6 | **34.1 \| 19.3** | 37.1 \| 21.1 | 35.5 \| 19.9 |
+ | **Mixat-CS** | 40.2 \| 24.2 | **36.2 \| 21.4** | 41.2 \| 25.2 | 39.9 \| 24.2 |
+ | **In-house Long-files** | 44.3 \| 29.1 | **42.8 \| 26.9** | 47.0 \| 32.7 | 43.7 \| 31.5 |
+
+ > **Code-Switch TTS** augmentation yielded a **≈20% WER reduction** across multilingual evaluation sets.
+
  ---

+ ### 🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
+
+ | **Model / System** | **CoVoST2 (Ar→En)** | **FLEURS (Ar→En)** |
+ |:--------------------|:------------------:|:-----------------:|
+ | Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
+ | SeamlessM4T | 33.7 / 0.55 | **23.9 / 0.56** |
+ | **Trans-Octopus** | **38.6 / 0.64** | **23.2 / 0.58** |
+ | TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
+ | TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |
+
+ > **Trans-Octopus** achieves the best BLEU and BERT-F1 on **CoVoST2** and competitive results on **FLEURS**, surpassing SeamlessM4T in low-resource conditions.
+
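BLEU (higher is better) combines modified n-gram precision with a brevity penalty. The sketch below is an illustrative single-reference, sentence-level implementation without smoothing — a simplification of corpus-level BLEU, not the evaluation toolkit used for the numbers above:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Unsmoothed sentence BLEU (0-100) against a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngr, ref_ngr = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngr & ref_ngr).values())   # clipped n-gram matches
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / sum(hyp_ngr.values())))
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```

The BERT-F1 column is a separate embedding-based similarity score (BERTScore-style) and is not covered by this sketch.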
+ ---
+
+ ### 🏷️ Dialect Identification
+
+ For **dialect identification**, the **Tiny-Octopus** models achieved **87–89% accuracy** across all 17 dialects in **ADI-17**.
+ The confusion matrices reveal clear separation among the **Gulf**, **Levantine**, **North-African**, and **Egyptian** clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
+
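Accuracy and confusion counts of this kind can be derived from per-utterance predictions. A minimal sketch — the dialect codes below are illustrative placeholders, not necessarily ADI-17's exact label set:

```python
from collections import Counter


def accuracy_and_confusions(y_true, y_pred):
    """Overall accuracy plus a Counter of (true, predicted) label pairs."""
    assert len(y_true) == len(y_pred), "label lists must align"
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    confusions = Counter(zip(y_true, y_pred))
    return correct / len(y_true), confusions


# Hypothetical per-utterance labels for a few dialects
y_true = ["EGY", "EGY", "KSA", "LEV", "LEV", "MOR"]
y_pred = ["EGY", "LEV", "KSA", "LEV", "EGY", "MOR"]

acc, conf = accuracy_and_confusions(y_true, y_pred)
print(f"accuracy = {acc:.2f}")  # accuracy = 0.67
print(conf[("EGY", "LEV")])     # 1 Egyptian utterance predicted as Levantine
```

Off-diagonal entries of `conf` are exactly the confusion-matrix cells discussed above; a clean dialect cluster shows mass concentrated on the diagonal.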
  ## Examples

  ### Example 1: Arabic Speech Recognition