Update README.md

```python
output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "trans
print("Generated Text:", output)
```
## 🧪 Evaluation Results

### 🎙️ ASR Performance (WER ↓)

| **Dataset** | **Ar-Octopus** | **Bilingual-Octopus** | **Trans-Octopus** | **Whisper-large-v3** | **SeamlessM4T** |
|:-------------|:---------------:|:---------------------:|:-----------------:|:--------------------:|:----------------:|
| **MGB2 (Arabic)** | 16.5 \| 6.5 | 15.2 \| 6.8 | **13.3 \| 5.9** | 16.2 \| 7.9 | 17.2 \| 8.4 |
| **test-clean (English)** | 82.5 \| 92.4 | **2.6 \| 1.4** | 67.3 \| 79.4 | 2.86 \| 0.98 | 2.68 \| 0.88 |
| **test-other (English)** | 86.9 \| 95.1 | **5.1 \| 3.4** | 71.5 \| 87.8 | 5.00 \| 2.05 | **5.07 \| 1.94** |
| **tedlium (English)** | 101.9 \| 77.4 | **5.1 \| 3.9** | 85.2 \| 63.6 | 11.9 \| 4.4 | 86.5 \| 62.2 |
| **Escwa (Code-Switched)** | 42.5 \| 26.3 | **40.8 \| 27.1** | 41.8 \| 25.1 | 47.3 \| 31.0 | 52.0 \| 35.3 |
| **Mixat-ALL (Code-Switched)** | 22.0 \| 9.0 | **23.4 \| 10.3** | 34.1 \| 10.6 | 29.0 \| 15.0 | 32.8 \| 16.9 |
| **Mixat-CS (Code-Switched)** | 26.4 \| 12.4 | **28.5 \| 14.9** | 27.8 \| 13.3 | 34.8 \| 20.6 | 38.2 \| 21.8 |
| **In-house Long-form** | 25.4 \| 13.0 | 24.9 \| 12.5 | **24.1 \| 12.1** | 26.7 \| 15.2 | 29.3 \| 18.6 |

> A **+86% English improvement** is observed with the addition of language tokens for the bilingual and translation variants.
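The tables report word error rate (WER, lower is better): the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal pure-Python sketch of the metric, for reference only (not the evaluation script behind these numbers):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the dog sat"))  # one substitution / three words ≈ 0.33
```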
---

### 🪶 Tiny-Octopus & Fine-Tuning (WER ↓)

| **Dataset** | **TinyOctopus LLaMA-3 1B** | **Fine-tuned LLaMA-3 1B** | **TinyOctopus DeepSeek 1.5B** | **Fine-tuned DeepSeek 1.5B** |
|:-------------|:-------------------------:|:-------------------------:|:-----------------------------:|:-----------------------------:|
| **MGB2 (Arabic)** | 22.6 \| 15.7 | 16.1 \| **9.5** | 23.2 \| 15.8 | **15.5 \| 9.2** |
| **test-clean (English)** | 7.5 \| 5.7 | **3.1 \| 1.3** | 7.7 \| 5.8 | 7.6 \| 5.7 |
| **test-other (English)** | 11.3 \| 8.0 | **6.9 \| 3.5** | 11.5 \| 8.2 | 11.3 \| 8.0 |
| **Escwa (Code-Switched)** | 42.5 \| 26.9 | **40.3 \| 24.4** | 43.6 \| 27.8 | 41.8 \| 26.3 |
| **Mixat-All** | 35.2 \| 19.6 | **34.1 \| 19.3** | 37.1 \| 21.1 | 35.5 \| 19.9 |
| **Mixat-CS** | 40.2 \| 24.2 | **36.2 \| 21.4** | 41.2 \| 25.2 | 39.9 \| 24.2 |
| **In-house Long-files** | 44.3 \| 29.1 | **42.8 \| 26.9** | 47.0 \| 32.7 | 43.7 \| 31.5 |

> **Code-Switch TTS** augmentation yielded a **≈20% WER reduction** across multilingual evaluation sets.
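The ≈20% figure is a *relative* WER reduction, i.e. the fraction of the baseline error that was removed. A quick sketch of the arithmetic (the 40.0 → 32.0 numbers below are illustrative, not taken from the tables):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction: what fraction of the baseline error was removed."""
    return (baseline - improved) / baseline

# Illustrative numbers only: a WER drop from 40.0 to 32.0 is a 20% relative reduction.
print(f"{relative_reduction(40.0, 32.0):.0%}")  # 20%
```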
---

### 🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)

| **Model / System** | **CoVoST2 (Ar→En)** | **FLEURS (Ar→En)** |
|:--------------------|:------------------:|:-----------------:|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | **23.9 / 0.56** |
| **Trans-Octopus** | **38.6 / 0.64** | **23.2 / 0.58** |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |

> **Trans-Octopus** achieves the best BLEU and BERT-F1 on **CoVoST2** and competitive results on **FLEURS**, surpassing SeamlessM4T in low-resource conditions.
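BLEU (higher is better) is a modified n-gram precision combined with a brevity penalty. A simplified, self-contained sketch for intuition; the scores above would normally come from a standard toolkit such as sacreBLEU, which adds canonical tokenization, multi-reference support, and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Single-pair BLEU: clipped n-gram precision with a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        if not hyp_counts:
            return 0.0  # hypothesis shorter than n words
        ref_counts = ngrams(ref, n)
        # Clip: each hypothesis n-gram is credited at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / sum(hyp_counts.values())))
    # Penalize hypotheses shorter than the reference.
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```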
---

### 🏷️ Dialect Identification

For **dialect identification**, the **Tiny-Octopus** models achieved **87–89% accuracy** across all 17 dialects in **ADI-17**.
The confusion matrices reveal clear separation among the **Gulf**, **Levantine**, **North African**, and **Egyptian** clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
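Accuracy and confusion matrices of this kind can be produced with a simple tally over (true, predicted) label pairs. A minimal sketch with hypothetical dialect labels; the function and data below are illustrative, not part of the released code or the ADI-17 predictions:

```python
from collections import defaultdict

def confusion_and_accuracy(y_true, y_pred):
    """Tally a confusion matrix {true_label: {predicted_label: count}} and accuracy."""
    matrix = defaultdict(lambda: defaultdict(int))
    correct = 0
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
        correct += (t == p)
    return {t: dict(row) for t, row in matrix.items()}, correct / len(y_true)

# Hypothetical labels for illustration -- not model output.
y_true = ["Gulf", "Gulf", "Levantine", "Egyptian"]
y_pred = ["Gulf", "Levantine", "Levantine", "Egyptian"]
matrix, accuracy = confusion_and_accuracy(y_true, y_pred)
print(accuracy)        # 0.75
print(matrix["Gulf"])  # {'Gulf': 1, 'Levantine': 1}
```

Off-diagonal entries (here, one Gulf clip predicted as Levantine) are exactly what the confusion matrices in the analysis visualize.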
## Examples

### Example 1: Arabic Speech Recognition