|
|
--- |
|
|
datasets: |
|
|
- rsalshalan/QASR |
|
|
- DynamicSuperb/DialectIdentification_ADI17 |
|
|
- openslr/librispeech_asr |
|
|
- LIUM/tedlium |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
metrics: |
|
|
- bleu |
|
|
- wer |
|
|
- accuracy |
|
|
base_model: |
|
|
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
|
|
- meta-llama/Llama-3.2-1B |
|
|
pipeline_tag: audio-text-to-text |
|
|
--- |
|
|
|
|
|
# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite
|
|
|
|
|
## 📢 Overview
|
|
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech. |
|
|
It unifies audio, text, and reasoning within one multimodal framework, supporting: |
|
|
|
|
|
- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
|
|
- **Speech Translation** (Arabic → English and vice versa) 🌍
|
|
- **Arabic Dialect Identification (DID)** 🏷️
|
|
|
|
|
The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs. |
|
|
|
|
|
|
|
|
## 🧩 Architecture
|
|
### Core Components |
|
|
The **Octopus** family scales across several encoderโdecoder configurations, combining complementary strengths in acoustic understanding and text generation. |
|
|
|
|
|
1. **Audio Encoders** |
|
|
   - **Distil-Whisper (distil-large-v3)** – a lightweight frozen encoder producing compact speech embeddings.
|
|
   - **Whisper-large-v3** – a high-capacity encoder for robust transcription and multilingual coverage.
|
|
   - **BEATs (Microsoft)** – a self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
|
|
|
|
|
2. **Alignment & Fusion** |
|
|
   - **Cross-Attention Projection Layer** – a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
|
|
|
|
|
3. **Language / Decoder Models** |
|
|
   - **DeepSeek 1.5B** – an efficient generative decoder for reasoning, dialogue, and translation.
|
|
   - **LLaMA 3.2 1B** – a compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
|
|
   - **ALLaM 13B** – a large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
|
|
|
|
|
Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
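
The cross-attention projection can be pictured as text-side queries attending over the frozen audio embeddings and re-projecting them into the decoder's embedding space. A minimal single-head NumPy sketch (the dimensions, weight names, and single-head form here are illustrative assumptions, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_project(audio_emb, text_queries, W_q, W_k, W_v):
    """One cross-attention step: text queries attend over audio keys/values."""
    Q = text_queries @ W_q            # (T_text, d) query projections
    K = audio_emb @ W_k               # (T_audio, d) keys from audio frames
    V = audio_emb @ W_v               # (T_audio, d) values from audio frames
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)   # each text position mixes audio frames
    return attn @ V                   # (T_text, d) vectors in the LLM space

# toy shapes: 50 audio frames (encoder dim 80) attended by 8 text positions (LLM dim 64)
rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 80))
queries = rng.normal(size=(8, 64))
out = cross_attention_project(audio, queries,
                              W_q=rng.normal(size=(64, 64)),
                              W_k=rng.normal(size=(80, 64)),
                              W_v=rng.normal(size=(80, 64)))
print(out.shape)  # (8, 64)
```

Only the projection weights would be trained in this arrangement; the audio embeddings arrive from a frozen encoder, which keeps the bridge cheap to adapt.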
|
|
|
|
|
|
|
|
## 📚 Training Datasets
|
|
|
|
|
The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling **≈25,000 hours** of high-quality data for ASR, translation, and dialect identification.
|
|
|
|
|
| **Task / Domain** | **Dataset** | **Train (h)** | **Dev (h)** | **Description** | |
|
|
|:------------------|:-------------|:--------------:|:------------:|:----------------| |
|
|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. | |
|
|
| | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. | |
|
|
| **ASR (English)** | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. | |
|
|
| | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. | |
|
|
| **ASR (Ar–En Code-Switching)** | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
|
|
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
|
|
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. | |
|
|
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. | |
|
|
|
|
|
> **Total Coverage:** ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.
|
|
|
|
|
These datasets jointly provide: |
|
|
- Balanced representation across dialects. |
|
|
- Both natural and synthetic speech sources for enhanced robustness. |
|
|
- Parallel ArabicโEnglish pairs enabling bilingual text generation and translation. |
|
|
|
|
|
|
|
|
## 🧮 Model Weights & Resources
|
|
|
|
|
The full set of model weights (including large checkpoints) is publicly available here: |
|
|
➡️ [Octopus Model Weights](https://drive.google.com/drive/folders/1602VHm77oyQV4p08x5Xug0ziw7u0p2Ju?usp=sharing)
|
|
|
|
|
|
|
|
## ⚙️ Installation & Usage
|
|
### **💻 Install Dependencies**
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
### Inference
|
|
|
|
|
```python
|
|
from inference import transcribe |
|
|
|
|
|
audio_path = "path/to/audio.wav" # Replace with your actual audio file |
|
|
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation" |
|
|
|
|
|
print("Generated Text:", output) |
|
|
``` |
|
|
|
|
|
## 🧪 Evaluation Results
|
|
|
|
|
### 🎙️ ASR Performance (WER ↓)
|
|
|
|
|
| **Dataset** | **Ar-Octopus** | **Bilingual-Octopus** | **Trans-Octopus** | **Whisper-large-v3** | **SeamlessM4T** | |
|
|
|:-------------|:---------------:|:---------------------:|:-----------------:|:--------------------:|:----------------:| |
|
|
| **MGB2 (Arabic)** | 16.5 \| 6.5 | 15.2 \| 6.8 | **13.3 \| 5.9** | 16.2 \| 7.9 | 17.2 \| 8.4 | |
|
|
| **test-clean (English)** | 82.5 \| 92.4 | **2.6 \| 1.4** | 67.3 \| 79.4 | 2.86 \| 0.98 | 2.68 \| 0.88 | |
|
|
| **test-other (English)** | 86.9 \| 95.1 | **5.1 \| 3.4** | 71.5 \| 87.8 | 5.00 \| 2.05 | **5.07 \| 1.94** | |
|
|
| **TED-LIUM (English)** | 101.9 \| 77.4 | **5.1 \| 3.9** | 85.2 \| 63.6 | 11.9 \| 4.4 | 86.5 \| 62.2 |
|
|
| **Escwa (Code-Switched)** | 42.5 \| 26.3 | **40.8 \| 27.1** | 41.8 \| 25.1 | 47.3 \| 31.0 | 52.0 \| 35.3 | |
|
|
| **Mixat-ALL (Code-Switched)** | 22.0 \| 9.0 | **23.4 \| 10.3** | 34.1 \| 10.6 | 29.0 \| 15.0 | 32.8 \| 16.9 | |
|
|
| **Mixat-CS (Code-Switched)** | 26.4 \| 12.4 | **28.5 \| 14.9** | 27.8 \| 13.3 | 34.8 \| 20.6 | 38.2 \| 21.8 | |
|
|
| **In-house Long-form** | 25.4 \| 13.0 | 24.9 \| 12.5 | **24.1 \| 12.1** | 26.7 \| 15.2 | 29.3 \| 18.6 | |
|
|
|
|
|
> **+86% English improvement** observed after adding language tokens for the bilingual and translation variants.
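
The WER figures above follow the standard definition: word-level edit distance divided by reference length. A minimal pure-Python sketch (the actual scoring pipeline may additionally normalize punctuation and diacritics, which matters for Arabic):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substitution (sat->sit) + one deletion (the) over 6 reference words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions count as errors, WER can exceed 100% on heavy hallucination, which is exactly what the 80-to-100-range English cells for the Arabic-only model reflect.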
|
|
|
|
|
--- |
|
|
|
|
|
### 🪶 Tiny-Octopus & Fine-Tuning (WER ↓)
|
|
|
|
|
| **Dataset** | **TinyOctopus LLaMA-3 1B** | **Fine-tuned LLaMA-3 1B** | **TinyOctopus DeepSeek 1.5B** | **Fine-tuned DeepSeek 1.5B** | |
|
|
|:-------------|:-------------------------:|:-------------------------:|:-----------------------------:|:-----------------------------:| |
|
|
| **MGB2 (Arabic)** | 22.6 \| 15.7 | 16.1 \| **9.5** | 23.2 \| 15.8 | **15.5 \| 9.2** | |
|
|
| **test-clean (English)** | 7.5 \| 5.7 | **3.1 \| 1.3** | 7.7 \| 5.8 | 7.6 \| 5.7 | |
|
|
| **test-other (English)** | 11.3 \| 8.0 | **6.9 \| 3.5** | 11.5 \| 8.2 | 11.3 \| 8.0 | |
|
|
| **Escwa (Code-Switched)** | 42.5 \| 26.9 | **40.3 \| 24.4** | 43.6 \| 27.8 | 41.8 \| 26.3 | |
|
|
| **Mixat-All** | 35.2 \| 19.6 | **34.1 \| 19.3** | 37.1 \| 21.1 | 35.5 \| 19.9 | |
|
|
| **Mixat-CS** | 40.2 \| 24.2 | **36.2 \| 21.4** | 41.2 \| 25.2 | 39.9 \| 24.2 | |
|
|
| **In-house Long-files** | 44.3 \| 29.1 | **42.8 \| 26.9** | 47.0 \| 32.7 | 43.7 \| 31.5 | |
|
|
|
|
|
> **Code-Switch TTS** augmentation yielded an **≈20% WER reduction** across multilingual evaluation sets.
|
|
|
|
|
--- |
|
|
|
|
|
### 🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
|
|
|
|
|
| **Model / System** | **CoVoST2 (ArโEn)** | **FLEURS (ArโEn)** | |
|
|
|:--------------------|:------------------:|:-----------------:| |
|
|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 | |
|
|
| SeamlessM4T | 33.7 / 0.55 | **23.9 / 0.56** | |
|
|
| **Trans-Octopus** | **38.6 / 0.64** | **23.2 / 0.58** | |
|
|
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 | |
|
|
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 | |
|
|
|
|
|
> **Trans-Octopus** achieves the best BLEU and BERT-F1 on **CoVoST2** and competitive results on **FLEURS**, where it trails SeamlessM4T slightly on BLEU but leads on BERT-F1.
|
|
|
|
|
--- |
|
|
|
|
|
### 🏷️ Dialect Identification
|
|
|
|
|
For **dialect identification**, the **Tiny-Octopus** models achieved **87–89% accuracy** across all 17 dialects in **ADI-17**.
|
|
The confusion matrices reveal clear separation among **Gulf**, **Levantine**, **North-African**, and **Egyptian** clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
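
Reported accuracy is simply the diagonal mass of the confusion matrix over its total. A toy sketch with three hypothetical dialect labels (the real evaluation covers all 17 ADI-17 classes):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion(y_true, y_pred):
    """Confusion counts keyed by (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

# hypothetical predictions over three dialect labels
true = ["KSA", "EGY", "LEV", "KSA", "EGY"]
pred = ["KSA", "EGY", "KSA", "KSA", "LEV"]
print(accuracy(true, pred))                   # 0.6
print(confusion(true, pred)[("LEV", "KSA")])  # 1
```

Off-diagonal entries such as `("LEV", "KSA")` are what the confusion-matrix analysis inspects to find which dialect clusters the model conflates.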
|
|
|
|
|
|
|
|
## Examples |
|
|
|
|
|
### Example 1: Arabic Speech Recognition |
|
|
🎵 **Audio Input (Arabic)**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Transcribe the audio |
|
|
or |
|
|
> ูู
ุจุชูุฑูุบ ุงูู
ูุทุน ุงูุตูุชู |
|
|
|
|
|
💡 **System Response**:
|
|
> أهلاً بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
|
|
|
|
|
🎵 **Audio Input (English)**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Transcribe the audio |
|
|
or |
|
|
> ูู
ุจุชูุฑูุบ ุงูู
ูุทุน ุงูุตูุชู |
|
|
|
|
|
💡 **System Response**:
|
|
> NO IT'S NOT TOO SOON |
|
|
|
|
|
--- |
|
|
|
|
|
### Example 2: Arabic to English Translation |
|
|
🎵 **Audio Input**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Translate the following Arabic speech into English |
|
|
or |
|
|
> ูู
ุจุชุฑุฌู
ุฉ ุงูู
ูุทุน ููุฅูุฌููุฒูุฉ |
|
|
|
|
|
💡 **System Response**:
|
|
> I took a loan a certain amount of money to pay off the debt |
|
|
|
|
|
--- |
|
|
|
|
|
### Example 3: Dialect Identification |
|
|
🎵 **Audio Input**:
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
📝 **User Prompt**:
|
|
> Identify the dialect of the given speech |
|
|
or |
|
|
> ู
ุงูู ููุฌุฉ ุงูู
ุชุญุฏุซุ |
|
|
|
|
|
💡 **System Response**:
|
|
> KSA |
|
|
|
|
|
--- |