ArabicSpeech/Octopus

Audio-Text-to-Text

Arabic

English

SaraAlthubaiti committed on Nov 8, 2025

Commit

dcdec03

verified ·

1 Parent(s): 06d3db5

Update README.md

Files changed (1)

README.md +91 -17

README.md CHANGED

@@ -29,7 +29,6 @@ It unifies audio, text, and reasoning within one multimodal framework, supportin
 The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs.
----
 ## 🧩 Architecture
 ### Core Components
@@ -51,30 +50,105 @@ The **Octopus** family scales across several encoder–decoder configurations, c
 Together these components enable the **Octopus** line—from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to full **ALLaM-Octopus** (Whisper large v3 + BEATs + ALLaM 13 B) to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
----
 ## 📚 Training Datasets
-The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, spanning over **25,000 hours** of high-quality data covering ASR, translation, and dialect identification tasks.
-| **Task / Domain** | **Dataset** | **# of Hours (Train | Dev)** | **Description** |
-|:------------------|:------------|:-----------------------------:|:----------------|
-| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 \| 9.6 | Broadcast Arabic from Al Jazeera News, multi-dialect, with punctuation + speaker tags. |
-|  | In-house Arabic Corpus | 13,392.1 \| 142.7 | Internal large-scale Arabic dataset spanning Gulf, Levantine, and North African dialects. |
-| **ASR (English)** | LibriSpeech | 960.0 \| 10.5 | Read English speech corpus widely used for ASR benchmarking. |
-|  | TED-LIUM | 453.8 \| 1.6 | English TED talk recordings for spontaneous speech recognition. |
-| **ASR (Ar–En Code Switching)** | Synthetic (In-house TTS) | 119.5 \| – | Synthetic bilingual segments generated via TTS to enhance robustness to mixed speech. |
-| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 \| 9.6 | Machine-translated version of QASR aligned with Arabic speech segments. |
-|  | Translated In-house Arabic (via GPT-4o) | 7,229.2 \| 141.9 | Large Arabic speech corpus automatically translated to English via GPT-4o for parallel training. |
-| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 \| 19.0 | YouTube-sourced speech from 17 Arabic dialects for dialect recognition and domain adaptation. |
-> **Total Coverage:** ≈ 25,000 hours of speech across Arabic, English, and mixed-language domains, ensuring wide generalization for ASR, translation, and dialect ID tasks.
----
 These datasets jointly provide:
 - Balanced representation across dialects.
 - Both natural and synthetic speech sources for enhanced robustness.
 - Parallel Arabic–English pairs enabling bilingual text generation and translation.
 ---

 ## 📚 Training Datasets
+The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling **≈25,000 hours** of high-quality data for ASR, translation, and dialect identification.
+| **Task / Domain** | **Dataset** | **Train (h)** | **Dev (h)** | **Description** |
+|:------------------|:-------------|:--------------:|:------------:|:----------------|
+| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
+|  | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. |
+| **ASR (English)** | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
+|  | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
+| **ASR (Ar–En Code Switching)** | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
+| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
+|  | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
+| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
+> **Total Coverage:** ≈25,000 hours of speech across Arabic, English, and mixed-language domains — enabling broad generalization for ASR, translation, and dialect identification.
 These datasets jointly provide:
 - Balanced representation across dialects.
 - Both natural and synthetic speech sources for enhanced robustness.
 - Parallel Arabic–English pairs enabling bilingual text generation and translation.
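The per-row hours in the table can be tallied with a few lines of Python. This is only a sketch: the figures are copied from the rows above, the task keys are illustrative names, the "–" Dev entry is treated as 0.0, and the README does not say how the ≈25,000-hour figure counts the translation rows (which appear to reuse the QASR and in-house audio), so both totals are computed.

```python
# Hours copied from the table above: (task, dataset, train_h, dev_h).
# Task keys are illustrative labels, not names used by the repository.
# The "–" Dev entry for the synthetic corpus is treated as 0.0 (assumption).
ROWS = [
    ("asr_ar", "QASR", 1880.5, 9.6),
    ("asr_ar", "In-house Arabic Corpus", 13392.1, 142.7),
    ("asr_en", "LibriSpeech", 960.0, 10.5),
    ("asr_en", "TED-LIUM", 453.8, 1.6),
    ("asr_cs", "Synthetic (In-house TTS)", 119.5, 0.0),
    ("translation", "Translated QASR", 1858.4, 9.6),
    ("translation", "Translated In-house Arabic", 7229.2, 141.9),
    ("dialect_id", "ADI17", 2241.5, 19.0),
]


def total_hours(rows, skip_tasks=()):
    """Sum Train + Dev hours over all rows whose task is not in skip_tasks."""
    total = sum(train + dev for task, _, train, dev in rows if task not in skip_tasks)
    return round(total, 1)


# Total over every row, counting translation rows as separate hours.
print("All rows:", total_hours(ROWS))
# Total when the translation rows are left out.
print("Without translation rows:", total_hours(ROWS, skip_tasks=("translation",)))
```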
+## ⚙️ Installation & Usage
+### **💻 Install Dependencies**
+```bash
+pip install -r requirements.txt
+```
+### Inference
+```python
+from inference import transcribe
+audio_path = "path/to/audio.wav"  # Replace with your actual audio file
+output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "translation"
+print("Generated Text:", output)
+```
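To run one clip through all three tasks listed in the snippet above, a small wrapper can loop over the task names. This is a minimal sketch, not part of the repository: `run_all_tasks` and `TASKS` are hypothetical names, and the wrapper assumes only that `transcribe` accepts an audio path and a `task` keyword, as shown in the README.

```python
from typing import Callable, Dict

# Task names taken from the comment in the README snippet above.
TASKS = ("asr", "translation", "dialect")


def run_all_tasks(audio_path: str, transcribe_fn: Callable[..., str]) -> Dict[str, str]:
    """Call transcribe_fn once per task and collect the outputs by task name.

    transcribe_fn is expected to behave like the repository's `transcribe`:
    it takes the audio path plus a `task` keyword and returns text.
    """
    return {task: transcribe_fn(audio_path, task=task) for task in TASKS}


# Usage with the repository's function (assumes inference.py is importable):
#   from inference import transcribe
#   results = run_all_tasks("path/to/audio.wav", transcribe)
#   for task, text in results.items():
#       print(task, "->", text)
```

Passing the function in as an argument keeps the wrapper independent of the model code, so it can be exercised with a stub before any weights are loaded.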
+---
+## Examples
+### Example 1: Arabic Speech Recognition
+🎵 **Audio Input (Arabic)**:
+<audio controls>
+  <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
+</audio>
+📝 **User Prompt**:
+> Transcribe the audio
+or
+> قم بتفريغ المقطع الصوتي
+💡 **System Response**:
+> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
+🎵 **Audio Input (English)**:
+<audio controls>
+  <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
+</audio>
+📝 **User Prompt**:
+> Transcribe the audio
+or
+> قم بتفريغ المقطع الصوتي
+💡 **System Response**:
+> NO IT'S NOT TOO SOON
+---
+### Example 2: Arabic to English Translation
+🎵 **Audio Input**:
+<audio controls>
+  <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
+</audio>
+📝 **User Prompt**:
+> Translate the following Arabic speech into English
+or
+> قم بترجمة المقطع للإنجليزية
+💡 **System Response**:
+> I took a loan a certain amount of money to pay off the debt
+---
+### Example 3: Dialect Identification
+🎵 **Audio Input**:
+<audio controls>
+  <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
+</audio>
+📝 **User Prompt**:
+> Identify the dialect of the given speech
+or
+> ماهي لهجة المتحدث؟
+💡 **System Response**:
+> KSA
 ---