SaraAlthubaiti committed on
Commit
dcdec03
·
verified ·
1 Parent(s): 06d3db5

Update README.md

Files changed (1)
  1. README.md +91 -17
README.md CHANGED
@@ -29,7 +29,6 @@ It unifies audio, text, and reasoning within one multimodal framework, supportin
29
 
30
  The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs.
31
 
32
- ---
33
 
34
  ## 🧩 Architecture
35
  ### Core Components
@@ -51,30 +50,105 @@ The **Octopus** family scales across several encoder–decoder configurations, c
51
  Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper large v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
52
 
53
 
54
- ---
55
-
56
  ## 📚 Training Datasets
57
 
58
- The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, spanning over **25,000 hours** of high-quality data covering ASR, translation, and dialect identification tasks.
59
 
60
- | **Task / Domain** | **Dataset** | **# of Hours (Train | Dev)** | **Description** |
61
- |:------------------|:------------|:-----------------------------:|:----------------|
62
- | **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 \| 9.6 | Broadcast Arabic from Al Jazeera News, multi-dialect, with punctuation + speaker tags. |
63
- | | In-house Arabic Corpus | 13,392.1 \| 142.7 | Internal large-scale Arabic dataset spanning Gulf, Levantine, and North African dialects. |
64
- | **ASR (English)** | LibriSpeech | 960.0 \| 10.5 | Read English speech corpus widely used for ASR benchmarking. |
65
- | | TED-LIUM | 453.8 \| 1.6 | English TED talk recordings for spontaneous speech recognition. |
66
- | **ASR (Ar–En Code Switching)** | Synthetic (In-house TTS) | 119.5 \| – | Synthetic bilingual segments generated via TTS to enhance robustness to mixed speech. |
67
- | **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 \| 9.6 | Machine-translated version of QASR aligned with Arabic speech segments. |
68
- | | Translated In-house Arabic (via GPT-4o) | 7,229.2 \| 141.9 | Large Arabic speech corpus automatically translated to English via GPT-4o for parallel training. |
69
- | **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 \| 19.0 | YouTube-sourced speech from 17 Arabic dialects for dialect recognition and domain adaptation. |
70
 
71
- > **Total Coverage:** ≈ 25,000 hours of speech across Arabic, English, and mixed-language domains, ensuring wide generalization for ASR, translation, and dialect ID tasks.
72
-
73
- ---
74
 
75
  These datasets jointly provide:
76
  - Balanced representation across dialects.
77
  - Both natural and synthetic speech sources for enhanced robustness.
78
  - Parallel Arabic–English pairs enabling bilingual text generation and translation.
79
 
80
  ---
 
29
 
30
  The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs.
31
 
 
32
 
33
  ## 🧩 Architecture
34
  ### Core Components
 
50
  Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper large v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
51
 
52
 
 
 
53
  ## 📚 Training Datasets
54
 
55
+ The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling **≈25,000 hours** of high-quality data for ASR, translation, and dialect identification.
56
 
57
+ | **Task / Domain** | **Dataset** | **Train (h)** | **Dev (h)** | **Description** |
58
+ |:------------------|:-------------|:--------------:|:------------:|:----------------|
59
+ | **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 | 9.6 | Broadcast Arabic from Al Jazeera; multi-dialect with punctuation and speaker tags. |
60
+ | | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North African dialects. |
61
+ | **ASR (English)** | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
62
+ | | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
63
+ | **ASR (Ar–En Code Switching)** | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
64
+ | **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
65
+ | | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
66
+ | **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
67
 
68
+ > **Total Coverage:** ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.
 
 
69
 
70
  These datasets jointly provide:
71
  - Balanced representation across dialects.
72
  - Both natural and synthetic speech sources for enhanced robustness.
73
  - Parallel Arabic–English pairs enabling bilingual text generation and translation.
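
The per-task hours in the table above can be tallied with a short script. This is only an illustrative sketch: the `ROWS` list and the `hours_per_task` helper are hypothetical names, not part of the repository, and the numbers are copied from the table. The translated sets appear to be derived from the QASR and in-house corpora (an assumption based on their names), so adding up every row is not the same as counting unique audio hours.

```python
from collections import defaultdict

# (task, dataset, train_hours, dev_hours), copied from the dataset table.
# dev_hours is 0.0 where the table lists no dev split.
ROWS = [
    ("ASR (Arabic)", "QASR", 1880.5, 9.6),
    ("ASR (Arabic)", "In-house Arabic Corpus", 13392.1, 142.7),
    ("ASR (English)", "LibriSpeech", 960.0, 10.5),
    ("ASR (English)", "TED-LIUM", 453.8, 1.6),
    ("ASR (Ar-En Code Switching)", "Synthetic (In-house TTS)", 119.5, 0.0),
    ("Translation (Ar->En)", "Translated QASR", 1858.4, 9.6),
    ("Translation (Ar->En)", "Translated In-house Arabic", 7229.2, 141.9),
    ("Dialect Identification", "ADI17", 2241.5, 19.0),
]


def hours_per_task(rows):
    """Sum train and dev hours for each task, rounded to one decimal place."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for task, _dataset, train, dev in rows:
        totals[task][0] += train
        totals[task][1] += dev
    return {task: (round(t, 1), round(d, 1)) for task, (t, d) in totals.items()}


if __name__ == "__main__":
    for task, (train, dev) in hours_per_task(ROWS).items():
        print(f"{task}: train={train} h, dev={dev} h")
```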
74
 
75
+
76
+ ## ⚙️ Installation & Usage
77
+ ### **💻 Install Dependencies**
78
+ ```bash
79
+ pip install -r requirements.txt
80
+ ```
81
+ ### Inference
82
+
83
+ ```python
84
+ from inference import transcribe
85
+
86
+ audio_path = "path/to/audio.wav" # Replace with your actual audio file
87
+ output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation"
88
+
89
+ print("Generated Text:", output)
90
+ ```
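
To run the same clip through all three tasks, the call above can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `run_all_tasks` and `TASKS` are hypothetical names introduced here, and the helper only assumes that `transcribe` accepts an audio path plus a `task` keyword, as in the snippet above. The function is passed in as an argument so the helper can be tried with a stand-in before the model is loaded.

```python
from typing import Callable, Dict

# Task names taken from the options listed in the inference snippet.
TASKS = ("asr", "translation", "dialect")


def run_all_tasks(audio_path: str, transcribe: Callable[..., str]) -> Dict[str, str]:
    """Call `transcribe` once per task and collect the outputs by task name."""
    return {task: transcribe(audio_path, task=task) for task in TASKS}


# Usage with the repository's function (requires the repo and its dependencies):
#   from inference import transcribe
#   results = run_all_tasks("path/to/audio.wav", transcribe)
#   for task, text in results.items():
#       print(task, "->", text)
```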
91
+ ---
92
+
93
+ ## Examples
94
+
95
+ ### Example 1: Arabic Speech Recognition
96
+ 🎵 **Audio Input (Arabic)**:
97
+ <audio controls>
98
+ <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
99
+ </audio>
100
+
101
+ 📝 **User Prompt**:
102
+ > Transcribe the audio
103
+ or
104
+ > قم بتفريغ المقطع الصوتي
105
+
106
+ 💡 **System Response**:
107
+ > أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
108
+
109
+ 🎵 **Audio Input (English)**:
110
+ <audio controls>
111
+ <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
112
+ </audio>
113
+
114
+ 📝 **User Prompt**:
115
+ > Transcribe the audio
116
+ or
117
+ > قم بتفريغ المقطع الصوتي
118
+
119
+ 💡 **System Response**:
120
+ > NO IT'S NOT TOO SOON
121
+
122
+ ---
123
+
124
+ ### Example 2: Arabic to English Translation
125
+ 🎵 **Audio Input**:
126
+ <audio controls>
127
+ <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
128
+ </audio>
129
+
130
+ 📝 **User Prompt**:
131
+ > Translate the following Arabic speech into English
132
+ or
133
+ > قم بترجمة المقطع للإنجليزية
134
+
135
+ 💡 **System Response**:
136
+ > I took a loan a certain amount of money to pay off the debt
137
+
138
+ ---
139
+
140
+ ### Example 3: Dialect Identification
141
+ 🎵 **Audio Input**:
142
+ <audio controls>
143
+ <source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
144
+ </audio>
145
+
146
+ 📝 **User Prompt**:
147
+ > Identify the dialect of the given speech
148
+ or
149
+ > ماهي لهجة المتحدث؟
150
+
151
+ 💡 **System Response**:
152
+ > KSA
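
The prompts shown in the three examples can be gathered into a lookup table for scripting. This is a sketch, not the repository's API: `PROMPTS` and `prompt_for` are hypothetical names, the pairing of each prompt with a `task` option name is an assumption, and nothing here confirms that `transcribe` uses these exact strings internally.

```python
# Prompts copied verbatim from the examples; the keys reuse the `task`
# option names from the inference snippet (an assumed pairing).
PROMPTS = {
    "asr": {
        "en": "Transcribe the audio",
        "ar": "قم بتفريغ المقطع الصوتي",
    },
    "translation": {
        "en": "Translate the following Arabic speech into English",
        "ar": "قم بترجمة المقطع للإنجليزية",
    },
    "dialect": {
        "en": "Identify the dialect of the given speech",
        "ar": "ماهي لهجة المتحدث؟",
    },
}


def prompt_for(task: str, lang: str = "en") -> str:
    """Return the example prompt for a task in English ("en") or Arabic ("ar")."""
    try:
        return PROMPTS[task][lang]
    except KeyError:
        raise ValueError(f"unsupported task/language: {task!r}/{lang!r}") from None
```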
153
+
154
  ---