|
|
--- |
|
|
datasets: |
|
|
- rsalshalan/QASR |
|
|
- DynamicSuperb/DialectIdentification_ADI17 |
|
|
- openslr/librispeech_asr |
|
|
- LIUM/tedlium |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
metrics: |
|
|
- bleu |
|
|
- wer |
|
|
- accuracy |
|
|
base_model: |
|
|
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
|
|
- meta-llama/Llama-3.2-1B |
|
|
pipeline_tag: audio-text-to-text |
|
|
--- |
|
|
|
|
|
# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite
|
|
|
|
|
## 📢 Overview
|
|
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech. |
|
|
It unifies audio, text, and reasoning within one multimodal framework, supporting: |
|
|
|
|
|
- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
|
|
- **Speech Translation** (Arabic → English and vice versa) 🌍
|
|
- **Arabic Dialect Identification (DID)** 🏷️
|
|
|
|
|
The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs. |
|
|
|
|
|
|
|
|
## 🧩 Architecture
|
|
### Core Components |
|
|
The **Octopus** family scales across several encoderโdecoder configurations, combining complementary strengths in acoustic understanding and text generation. |
|
|
|
|
|
1. **Audio Encoders** |
|
|
   - **Distil-Whisper (distil-large-v3)** – a lightweight frozen encoder producing compact speech embeddings.
|
|
   - **Whisper-large-v3** – a high-capacity encoder for robust transcription and multilingual coverage.
|
|
   - **BEATs (Microsoft)** – a self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
|
|
|
|
|
2. **Alignment & Fusion** |
|
|
   - **Cross-Attention Projection Layer** – a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
|
|
|
|
|
3. **Language / Decoder Models** |
|
|
   - **DeepSeek 1.5B** – an efficient generative decoder for reasoning, dialogue, and translation.
|
|
   - **LLaMA 3.2 1B** – a compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
|
|
   - **ALLaM 13B** – a large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
|
|
|
|
|
Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
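
The cross-attention projection can be pictured as text-side queries attending over the frozen audio embeddings and re-projecting them into the decoder's embedding space. A minimal single-head NumPy sketch (the dimensions, weight names, and single-head form here are illustrative assumptions, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_project(audio_emb, text_queries, W_q, W_k, W_v):
    """One cross-attention step: text queries attend over audio keys/values."""
    Q = text_queries @ W_q            # (T_text, d) query projections
    K = audio_emb @ W_k               # (T_audio, d) keys from audio frames
    V = audio_emb @ W_v               # (T_audio, d) values from audio frames
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)   # each text position mixes audio frames
    return attn @ V                   # (T_text, d) vectors in the LLM space

# toy shapes: 50 audio frames (encoder dim 80) attended by 8 text positions (LLM dim 64)
rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 80))
queries = rng.normal(size=(8, 64))
out = cross_attention_project(audio, queries,
                              W_q=rng.normal(size=(64, 64)),
                              W_k=rng.normal(size=(80, 64)),
                              W_v=rng.normal(size=(80, 64)))
print(out.shape)  # (8, 64)
```

Only the projection weights would be trained in this arrangement; the audio embeddings arrive from a frozen encoder, which keeps the bridge cheap to adapt.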
|
|
|
|
|
|
|
|
## 📚 Training Datasets
|
|
|
|
|
The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling **≈25,000 hours** of high-quality data for ASR, translation, and dialect identification.
|
|
|
|
|
| **Task / Domain** | **Dataset** | **Train (h)** | **Dev (h)** | **Description** | |
|
|
|:------------------|:-------------|:--------------:|:------------:|:----------------| |
|
|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. | |
|
|
| | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. | |
|
|
| **ASR (English)** | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. | |
|
|
| | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. | |
|
|
| **ASR (Ar–En Code-Switching)** | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
|
|
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
|
|
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. | |
|
|
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. | |
|
|
|
|
|
> **Total Coverage:** ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.
|
|
|
|
|
These datasets jointly provide: |
|
|
- Balanced representation across dialects. |
|
|
- Both natural and synthetic speech sources for enhanced robustness. |
|
|
- Parallel ArabicโEnglish pairs enabling bilingual text generation and translation. |
|
|
|
|
|
|
|
|
## 🧮 Model Weights & Resources
|
|
|
|
|
The full set of model weights (including large checkpoints) is publicly available here: |
|
|
➡️ [Octopus Model Weights](https://drive.google.com/drive/folders/1602VHm77oyQV4p08x5Xug0ziw7u0p2Ju?usp=sharing)
|
|
|
|
|
|
|
|
## ⚙️ Installation & Usage
|
|
### **💻 Install Dependencies**
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
### Inference
|
|
|
|
|
```python
|
|
from inference import transcribe |
|
|
|
|
|
audio_path = "path/to/audio.wav" # Replace with your actual audio file |
|
|
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation" |
|
|
|
|
|
print("Generated Text:", output) |
|
|
``` |
|
|
|
|
|
## 🧪 Evaluation Results
|
|
|
|
|
### 🎙️ ASR Performance (WER ↓)
|
|
|
|
|
| **Dataset** | **Ar-Octopus** | **Bilingual-Octopus** | **Trans-Octopus** | **Whisper-large-v3** | **SeamlessM4T** | |
|
|
|:-------------|:---------------:|:---------------------:|:-----------------:|:--------------------:|:----------------:| |
|
|
| **MGB2 (Arabic)** | 16.5 \| 6.5 | 15.2 \| 6.8 | **13.3 \| 5.9** | 16.2 \| 7.9 | 17.2 \| 8.4 | |
|
|
| **test-clean (English)** | 82.5 \| 92.4 | **2.6 \| 1.4** | 67.3 \| 79.4 | 2.86 \| 0.98 | 2.68 \| 0.88 | |
|
|
| **test-other (English)** | 86.9 \| 95.1 | **5.1 \| 3.4** | 71.5 \| 87.8 | 5.00 \| 2.05 | **5.07 \| 1.94** | |
|
|
| **TED-LIUM (English)** | 101.9 \| 77.4 | **5.1 \| 3.9** | 85.2 \| 63.6 | 11.9 \| 4.4 | 86.5 \| 62.2 |
|
|
| **Escwa (Code-Switched)** | 42.5 \| 26.3 | **40.8 \| 27.1** | 41.8 \| 25.1 | 47.3 \| 31.0 | 52.0 \| 35.3 | |
|
|
| **Mixat-ALL (Code-Switched)** | 22.0 \| 9.0 | **23.4 \| 10.3** | 34.1 \| 10.6 | 29.0 \| 15.0 | 32.8 \| 16.9 | |
|
|
| **Mixat-CS (Code-Switched)** | 26.4 \| 12.4 | **28.5 \| 14.9** | 27.8 \| 13.3 | 34.8 \| 20.6 | 38.2 \| 21.8 | |
|
|
| **In-house Long-form** | 25.4 \| 13.0 | 24.9 \| 12.5 | **24.1 \| 12.1** | 26.7 \| 15.2 | 29.3 \| 18.6 | |
|
|
|
|
|
> **+86% English improvement** observed after adding language tokens for the bilingual and translation variants.
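
The WER figures above follow the standard definition: word-level edit distance divided by reference length. A minimal pure-Python sketch (the actual scoring pipeline may additionally normalize punctuation and diacritics, which matters for Arabic):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substitution (sat->sit) + one deletion (the) over 6 reference words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions count as errors, WER can exceed 100% on heavy hallucination, which is exactly what the 80-to-100-range English cells for the Arabic-only model reflect.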
|
|
|
|
|
--- |
|
|
|
|
|
### 🪶 Tiny-Octopus & Fine-Tuning (WER ↓)
|
|
|
|
|
| **Dataset** | **TinyOctopus LLaMA-3 1B** | **Fine-tuned LLaMA-3 1B** | **TinyOctopus DeepSeek 1.5B** | **Fine-tuned DeepSeek 1.5B** | |
|
|
|:-------------|:-------------------------:|:-------------------------:|:-----------------------------:|:-----------------------------:| |
|
|
| **MGB2 (Arabic)** | 22.6 \| 15.7 | 16.1 \| **9.5** | 23.2 \| 15.8 | **15.5 \| 9.2** | |
|
|
| **test-clean (English)** | 7.5 \| 5.7 | **3.1 \| 1.3** | 7.7 \| 5.8 | 7.6 \| 5.7 | |
|
|
| **test-other (English)** | 11.3 \| 8.0 | **6.9 \| 3.5** | 11.5 \| 8.2 | 11.3 \| 8.0 | |
|
|
| **Escwa (Code-Switched)** | 42.5 \| 26.9 | **40.3 \| 24.4** | 43.6 \| 27.8 | 41.8 \| 26.3 | |
|
|
| **Mixat-All** | 35.2 \| 19.6 | **34.1 \| 19.3** | 37.1 \| 21.1 | 35.5 \| 19.9 | |
|
|
| **Mixat-CS** | 40.2 \| 24.2 | **36.2 \| 21.4** | 41.2 \| 25.2 | 39.9 \| 24.2 | |
|
|
| **In-house Long-files** | 44.3 \| 29.1 | **42.8 \| 26.9** | 47.0 \| 32.7 | 43.7 \| 31.5 | |
|
|
|
|
|
> **Code-Switch TTS** augmentation yielded an **≈20% WER reduction** across multilingual evaluation sets.
|
|
|
|
|
--- |
|
|
|
|
|
### 🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
|
|
|
|
|
| **Model / System** | **CoVoST2 (ArโEn)** | **FLEURS (ArโEn)** | |
|
|
|:--------------------|:------------------:|:-----------------:| |
|
|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 | |
|
|
| SeamlessM4T | 33.7 / 0.55 | **23.9 / 0.56** | |
|
|
| **Trans-Octopus** | **38.6 / 0.64** | **23.2 / 0.58** | |
|
|
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 | |
|
|
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 | |
|
|
|
|
|
> **Trans-Octopus** achieves the best BLEU and BERT-F1 on **CoVoST2** and competitive results on **FLEURS**, where it trails SeamlessM4T slightly on BLEU but leads on BERT-F1.
|
|
|
|
|
--- |
|
|
|
|
|
### 🏷️ Dialect Identification
|
|
|
|
|
For **dialect identification**, the **Tiny-Octopus** models achieved **87–89% accuracy** across all 17 dialects in **ADI-17**.
|
|
The confusion matrices reveal clear separation among **Gulf**, **Levantine**, **North-African**, and **Egyptian** clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
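
Reported accuracy is simply the diagonal mass of the confusion matrix over its total. A toy sketch with three hypothetical dialect labels (the real evaluation covers all 17 ADI-17 classes):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion(y_true, y_pred):
    """Confusion counts keyed by (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

# hypothetical predictions over three dialect labels
true = ["KSA", "EGY", "LEV", "KSA", "EGY"]
pred = ["KSA", "EGY", "KSA", "KSA", "LEV"]
print(accuracy(true, pred))                   # 0.6
print(confusion(true, pred)[("LEV", "KSA")])  # 1
```

Off-diagonal entries such as `("LEV", "KSA")` are what the confusion-matrix analysis inspects to find which dialect clusters the model conflates.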
|
|
|
|
|
|
|
|
## Examples |
|
|
|
|
|
### Example 1: Arabic Speech Recognition |
|
|
🎵 **Audio Input (Arabic)**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Transcribe the audio |
|
|
or |
|
|
> ูู
ุจุชูุฑูุบ ุงูู
ูุทุน ุงูุตูุชู |
|
|
|
|
|
💡 **System Response**:
|
|
> أهلاً بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
|
|
|
|
|
🎵 **Audio Input (English)**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Transcribe the audio |
|
|
or |
|
|
> ูู
ุจุชูุฑูุบ ุงูู
ูุทุน ุงูุตูุชู |
|
|
|
|
|
💡 **System Response**:
|
|
> NO IT'S NOT TOO SOON |
|
|
|
|
|
--- |
|
|
|
|
|
### Example 2: Arabic to English Translation |
|
|
🎵 **Audio Input**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Translate the following Arabic speech into English |
|
|
or |
|
|
> ูู
ุจุชุฑุฌู
ุฉ ุงูู
ูุทุน ููุฅูุฌููุฒูุฉ |
|
|
|
|
|
💡 **System Response**:
|
|
> I took a loan a certain amount of money to pay off the debt |
|
|
|
|
|
--- |
|
|
|
|
|
### Example 3: Dialect Identification |
|
|
🎵 **Audio Input**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Identify the dialect of the given speech |
|
|
or |
|
|
> ู
ุงูู ููุฌุฉ ุงูู
ุชุญุฏุซุ |
|
|
|
|
|
💡 **System Response**:
|
|
> KSA |
|
|
|
|
|
--- |