---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
- openslr/librispeech_asr
- LIUM/tedlium
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- meta-llama/Llama-3.2-1B
pipeline_tag: audio-text-to-text
---

# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite

## 📢 Overview
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:

- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
- **Speech Translation** (Arabic → English and vice versa) 🌍
- **Arabic Dialect Identification (DID)** 🏷️

The lightweight variant, **TinyOctopus**, keeps the same modular design but is optimized for efficiency on smaller GPUs.

---

## 🧩 Architecture
### Core Components
The **Octopus** family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.

1. **Audio Encoders**
   - **Distil-Whisper (distil-large-v3)** → lightweight frozen encoder producing compact speech embeddings.
   - **Whisper-large-v3** → high-capacity encoder for robust transcription and multilingual coverage.
   - **BEATs (Microsoft)** → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.

2. **Alignment & Fusion**
   - **Cross-Attention Projection Layer** → a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.

3. **Language / Decoder Models**
   - **DeepSeek 1.5B** → efficient generative decoder for reasoning, dialogue, and translation.
   - **LLaMA 3.2 1B** → compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
   - **ALLaM 13B** → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.

Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
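The cross-attention bridge can be illustrated with a minimal single-head sketch in NumPy. The function name, dimensions, and single-head form are illustrative assumptions for this card, not the released implementation: text-side queries attend over frozen audio-encoder frames, and the value projection maps them into the text embedding space.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_project(text_queries, audio_feats, Wq, Wk, Wv):
    """Map frozen audio features into the text-language space.

    text_queries: (T, d_txt) decoder-side query vectors
    audio_feats:  (A, d_aud) frozen audio-encoder outputs
    Wq (d_txt, d), Wk (d_aud, d), Wv (d_aud, d_txt): the trainable bridge
    """
    Q = text_queries @ Wq                    # (T, d)
    K = audio_feats @ Wk                     # (A, d)
    V = audio_feats @ Wv                     # (A, d_txt), back in text space
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product attention
    attn = softmax(scores, axis=-1)          # each text step attends over audio frames
    return attn @ V                          # (T, d_txt)
```

In practice a batched, multi-head version with layer norms would replace this; the sketch only shows the data flow from audio frames into the decoder's embedding space.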

---

## 📚 Training Datasets

The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, spanning over **25,000 hours** of high-quality data covering ASR, translation, and dialect identification tasks.

| **Task / Domain** | **Dataset** | **# of Hours (Train \| Dev)** | **Description** |
|:------------------|:------------|:-----------------------------:|:----------------|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 \| 9.6 | Broadcast Arabic from Al Jazeera News, multi-dialect, with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 \| 142.7 | Internal large-scale Arabic dataset spanning Gulf, Levantine, and North African dialects. |
| **ASR (English)** | LibriSpeech | 960.0 \| 10.5 | Read English speech corpus widely used for ASR benchmarking. |
| | TED-LIUM | 453.8 \| 1.6 | English TED talk recordings for spontaneous speech recognition. |
| **ASR (Ar–En Code-Switching)** | Synthetic (In-house TTS) | 119.5 \| – | Synthetic bilingual segments generated via TTS to improve robustness to mixed speech. |
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 \| 9.6 | Machine-translated version of QASR aligned with Arabic speech segments. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 \| 141.9 | Large Arabic speech corpus automatically translated to English via GPT-4o for parallel training. |
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 \| 19.0 | YouTube-sourced speech from 17 Arabic dialects for dialect recognition and domain adaptation. |

> **Total Coverage:** ≈ 25,000 hours of speech across Arabic, English, and mixed-language domains, ensuring wide generalization for ASR, translation, and dialect ID tasks.
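For reference, the WER metric listed in this card's metadata is the word-level Levenshtein distance between hypothesis and reference, normalized by the number of reference words. A minimal self-contained sketch of the textbook definition (not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution or match
        prev = cur
    return prev[len(hyp)] / len(ref)

# wer("السلام عليكم ورحمة", "السلام عليكم")  → 1/3: one deleted word out of three
```

Reported numbers in practice usually also apply text normalization (diacritics, punctuation, case) before scoring, which this sketch omits.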

---

These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel Arabic–English pairs enabling bilingual text generation and translation.

---