---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
- openslr/librispeech_asr
- LIUM/tedlium
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- meta-llama/Llama-3.2-1B
pipeline_tag: audio-text-to-text
---

# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite

## 📢 Overview
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:

- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
- **Speech Translation** (Arabic → English and vice versa) 🌍
- **Arabic Dialect Identification (DID)** 🏷️

The lightweight variant, **TinyOctopus**, keeps the same modular design but is optimized for efficiency on smaller GPUs.

---

## 🧩 Architecture
### Core Components
The **Octopus** family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.

1. **Audio Encoders**
   - **Distil-Whisper (distil-large-v3)** → lightweight frozen encoder producing compact speech embeddings.
   - **Whisper-large-v3** → high-capacity encoder for robust transcription and multilingual coverage.
   - **BEATs (Microsoft)** → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.

2. **Alignment & Fusion**
   - **Cross-Attention Projection Layer** → a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.

3. **Language / Decoder Models**
   - **DeepSeek 1.5B** → efficient generative decoder for reasoning, dialogue, and translation.
   - **LLaMA 3.2 1B** → compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
   - **ALLaM 13B** → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.

Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
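The cross-attention bridge can be illustrated with a minimal single-head sketch in NumPy. The function name, dimensions, and single-head form are illustrative assumptions for this card, not the released implementation: text-side queries attend over frozen audio-encoder frames, and the value projection maps them into the text embedding space.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_project(text_queries, audio_feats, Wq, Wk, Wv):
    """Map frozen audio features into the text-language space.

    text_queries: (T, d_txt) decoder-side query vectors
    audio_feats:  (A, d_aud) frozen audio-encoder outputs
    Wq (d_txt, d), Wk (d_aud, d), Wv (d_aud, d_txt): the trainable bridge
    """
    Q = text_queries @ Wq                    # (T, d)
    K = audio_feats @ Wk                     # (A, d)
    V = audio_feats @ Wv                     # (A, d_txt), back in text space
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product attention
    attn = softmax(scores, axis=-1)          # each text step attends over audio frames
    return attn @ V                          # (T, d_txt)
```

In practice a batched, multi-head version with layer norms would replace this; the sketch only shows the data flow from audio frames into the decoder's embedding space.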

---

## 📚 Training Datasets

The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, spanning over **25,000 hours** of high-quality data covering ASR, translation, and dialect identification tasks.

| **Task / Domain** | **Dataset** | **# of Hours (Train \| Dev)** | **Description** |
|:------------------|:------------|:-----------------------------:|:----------------|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 \| 9.6 | Broadcast Arabic from Al Jazeera News, multi-dialect, with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 \| 142.7 | Internal large-scale Arabic dataset spanning Gulf, Levantine, and North African dialects. |
| **ASR (English)** | LibriSpeech | 960.0 \| 10.5 | Read English speech corpus widely used for ASR benchmarking. |
| | TED-LIUM | 453.8 \| 1.6 | English TED talk recordings for spontaneous speech recognition. |
| **ASR (Ar–En Code-Switching)** | Synthetic (In-house TTS) | 119.5 \| – | Synthetic bilingual segments generated via TTS to improve robustness to mixed speech. |
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 \| 9.6 | Machine-translated version of QASR aligned with Arabic speech segments. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 \| 141.9 | Large Arabic speech corpus automatically translated to English via GPT-4o for parallel training. |
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 \| 19.0 | YouTube-sourced speech from 17 Arabic dialects for dialect recognition and domain adaptation. |

> **Total Coverage:** ≈ 25,000 hours of speech across Arabic, English, and mixed-language domains, ensuring wide generalization for ASR, translation, and dialect ID tasks.
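For reference, the WER metric listed in this card's metadata is the word-level Levenshtein distance between hypothesis and reference, normalized by the number of reference words. A minimal self-contained sketch of the textbook definition (not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution or match
        prev = cur
    return prev[len(hyp)] / len(ref)

# wer("السلام عليكم ورحمة", "السلام عليكم")  → 1/3: one deleted word out of three
```

Reported numbers in practice usually also apply text normalization (diacritics, punctuation, case) before scoring, which this sketch omits.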

---

These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel Arabic–English pairs enabling bilingual text generation and translation.

---