---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
- openslr/librispeech_asr
- LIUM/tedlium
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- meta-llama/Llama-3.2-1B
pipeline_tag: audio-text-to-text
---

# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite

## 📢 Overview
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:

- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
- **Speech Translation** (Arabic → English and vice versa) 🌍
- **Arabic Dialect Identification (DID)** 🏷️

The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs.

---

## 🧩 Architecture
### Core Components
The **Octopus** family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.

1. **Audio Encoders** (a feature-extraction sketch follows this list)
   - **Distil-Whisper (distil-large-v3)** → lightweight frozen encoder producing compact speech embeddings.
   - **Whisper-large-v3** → high-capacity encoder for robust transcription and multilingual coverage.
   - **BEATs (Microsoft)** → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.

2. **Alignment & Fusion**
   - **Cross-Attention Projection Layer** → a trainable bridge that aligns audio representations with the language model's embedding space through cross-modal attention (sketched after the summary paragraph below).

3. **Language / Decoder Models**
   - **DeepSeek 1.5B** → efficient generative decoder for reasoning, dialogue, and translation.
   - **LLaMA 3.2 1B** → compact Arabic–English language model optimized for code-switching and reasoning on limited hardware.
   - **ALLaM 13B** → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.

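As referenced in the list above, the following is a minimal feature-extraction sketch, not the released training code: it pulls frame-level embeddings from the Distil-Whisper encoder named in this card. The LibriSpeech clip is just a convenient public input, and loading details may vary with your `datasets` version.

```python
# Hedged sketch: frame-level speech embeddings from the Distil-Whisper encoder.
# The checkpoint ID comes from this card; the sample input is illustrative.
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperModel

processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v3")
model = WhisperModel.from_pretrained("distil-whisper/distil-large-v3")

# Any 16 kHz waveform works; a LibriSpeech validation clip is used here.
sample = load_dataset("openslr/librispeech_asr", "clean", split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

with torch.no_grad():
    audio_states = model.encoder(inputs.input_features).last_hidden_state

print(audio_states.shape)  # (1, 1500, 1280): 30 s of padded audio becomes 1500 frames
```
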
Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio-understanding and speech-to-text reasoning tasks across Arabic and English.

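The projection layer itself is not published in this card, so the block below is only a minimal sketch of the general cross-attention bridging technique: a fixed set of learned query tokens attends over encoder states (such as `audio_states` above) and emits soft tokens in the decoder's embedding space. The widths (1280 for the Whisper-large encoder family, 1536 for DeepSeek-R1-Distill-Qwen-1.5B) and the query count of 64 are assumptions, not confirmed hyperparameters.

```python
# Minimal sketch of a cross-attention projection layer (assumed design, not
# the released implementation). Widths and query count are guesses:
# 1280 = Whisper-large-family encoder, 1536 = DeepSeek-R1-Distill-Qwen-1.5B.
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Maps a variable-length audio sequence to a fixed number of soft tokens
    in the text decoder's embedding space."""

    def __init__(self, audio_dim=1280, text_dim=1536, num_queries=64, num_heads=8):
        super().__init__()
        # Learned query tokens that cross-attend over the audio frames.
        self.queries = nn.Parameter(torch.randn(num_queries, text_dim) * 0.02)
        self.audio_proj = nn.Linear(audio_dim, text_dim)  # lift audio to text width
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, audio_states):                # (B, T_audio, audio_dim)
        kv = self.audio_proj(audio_states)          # (B, T_audio, text_dim)
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)             # queries attend over audio
        return self.norm(fused)                     # (B, num_queries, text_dim)

projector = CrossAttentionProjector()
soft_tokens = projector(torch.randn(2, 1500, 1280))  # torch.Size([2, 64, 1536])
```

In designs like this, the soft tokens are concatenated with the embedded text prompt before decoding, which lets the encoder stay frozen (as stated above for Distil-Whisper) while the bridge trains.
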
---

## 📚 Training Datasets

The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, spanning over **25,000 hours** of high-quality data covering ASR, translation, and dialect identification tasks.

| **Task / Domain** | **Dataset** | **# of Hours (Train \| Dev)** | **Description** |
|:------------------|:------------|:-----------------------------:|:----------------|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 \| 9.6 | Broadcast Arabic from Al Jazeera News, multi-dialect, with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 \| 142.7 | Internal large-scale Arabic dataset spanning Gulf, Levantine, and North African dialects. |
| **ASR (English)** | LibriSpeech | 960.0 \| 10.5 | Read English speech corpus widely used for ASR benchmarking. |
| | TED-LIUM | 453.8 \| 1.6 | English TED talk recordings for spontaneous speech recognition. |
| **ASR (Ar–En Code-Switching)** | Synthetic (in-house TTS) | 119.5 \| – | Synthetic bilingual segments generated via TTS to enhance robustness to mixed speech. |
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 \| 9.6 | Machine-translated version of QASR aligned with the Arabic speech segments. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 \| 141.9 | Large Arabic speech corpus automatically translated to English via GPT-4o for parallel training. |
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 \| 19.0 | YouTube-sourced speech from 17 Arabic dialects for dialect recognition and domain adaptation. |

> **Total Coverage:** ≈ 25,000 hours of speech across Arabic, English, and mixed-language domains, supporting broad generalization for ASR, translation, and dialect ID tasks.

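The public portions of this mixture are hosted on the Hub under the dataset IDs in this card's metadata. The loading sketch below is hedged: config and split names are assumptions (check each dataset card), and some corpora are gated or require accepting terms.

```python
# Hedged sketch: pulling the public corpora listed in this card's metadata.
# Split/config names are assumptions; gated datasets need `huggingface-cli login`
# and accepted terms on their dataset pages.
from datasets import load_dataset

librispeech = load_dataset("openslr/librispeech_asr", "clean", split="validation")
tedlium = load_dataset("LIUM/tedlium", "release3", split="validation")
qasr = load_dataset("rsalshalan/QASR", split="train")  # Arabic broadcast ASR
adi17 = load_dataset("DynamicSuperb/DialectIdentification_ADI17", split="test")

clip = librispeech[0]
print(clip["audio"]["sampling_rate"], clip["text"][:60])
```
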
---

These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel Arabic–English pairs enabling bilingual text generation and translation.

---