metadata
language:
- dv
- ar
- en
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- mms
- ctc
- trilingual
- dhivehi
- arabic
- english
datasets:
- shunyalabs/arabic-speech-dataset
- shiimi/dhivehi-audio-casts-processed
- Serialtechlab/dhivehi-mms-v5-combined
- openslr/librispeech_asr
metrics:
- wer
base_model: Serialtechlab/mms-trilingual-dv-ar-en
MMS Trilingual ASR v2 - Dhivehi + Arabic + English
Fine-tuned version of Serialtechlab/mms-trilingual-dv-ar-en (v1) with improved recognition of:
- Conversational Arabic (FLEURS Arabic)
- Melodic Dhivehi (Madhaha and podcasts)
Changes from v1
- Added conversational Arabic data (FLEURS) to replace Quranic-only training
- Added melodic Dhivehi (audio casts) so that Madhaha is no longer confused with Arabic
- Removed Quranic recitation data
Training Data
- Arabic: ~2500 samples from FLEURS (conversational)
- Dhivehi Melodic: 1000 samples from audio casts
- Dhivehi Normal: ~1500 samples
- English: ~500 samples from LibriSpeech
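The approximate language balance implied by these counts can be worked out with a few lines of Python. This is only a sketch: the counts are the rounded figures listed above, and the dictionary keys are labels chosen here for illustration.

```python
# Approximate sample counts copied from the list above (rounded figures).
counts = {
    "Arabic (FLEURS)": 2500,
    "Dhivehi melodic": 1000,
    "Dhivehi normal": 1500,
    "English (LibriSpeech)": 500,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print each subset's share of the training mix.
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Combined share of both Dhivehi subsets.
dhivehi_share = shares["Dhivehi melodic"] + shares["Dhivehi normal"]
print(f"Dhivehi combined: {dhivehi_share:.1%}")
```

On these figures, Arabic and the two Dhivehi subsets together each make up a little under half of the mix, with English the smallest portion.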
Performance
- Final WER: 0.2820 (28.20%)
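For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch of that standard definition. The name `word_error_rate` is hypothetical, and the card does not state which tool or evaluation set produced the figure above, so this is not necessarily how it was computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER of 0.2820 therefore means roughly 28 word-level errors per 100 reference words.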
Usage
from transformers import AutoProcessor, Wav2Vec2ForCTC
import torch
processor = AutoProcessor.from_pretrained("Serialtechlab/mms-trilingual-dv-ar-en-v2")
model = Wav2Vec2ForCTC.from_pretrained("Serialtechlab/mms-trilingual-dv-ar-en-v2")
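# audio_array: 1-D float waveform sampled at 16 kHz (resample first if needed)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
# Inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits
# Greedy CTC decoding: pick the most likely token per frame, then collapse to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]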
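The last two lines of the usage snippet perform greedy CTC decoding: take the most likely token for every audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse step in plain Python. The toy vocabulary, the helper name `ctc_greedy_collapse`, and the choice of `0` as the blank id are assumptions for illustration; the real ids come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove the blank id."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Toy vocabulary (hypothetical, not the model's real one).
vocab = {0: "<pad>", 1: "h", 2: "i", 3: "l", 4: "o"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one clip.
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
text = "".join(vocab[t] for t in ctc_greedy_collapse(frames))

# A blank between two identical ids keeps them as separate characters.
double_l = ctc_greedy_collapse([3, 3, 0, 3, 4])
```

Here `text` is `"hi"`, and `double_l` keeps both `l` tokens because a blank separates them, which is how CTC can emit doubled letters.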
Supported Languages
- Dhivehi (Thaana script) - including melodic/Madhaha
- Arabic (Arabic script) - conversational style
- English (Latin script)
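Because each supported language is written in a different script, the script of a transcription gives a rough hint of which language the model recognized. The helper below is a minimal sketch that counts characters by Unicode block (Thaana U+0780 to U+07BF, Arabic U+0600 to U+06FF, ASCII letters for Latin). The function name `dominant_script` is hypothetical and is not part of the model or the transformers library.

```python
def dominant_script(text: str) -> str:
    """Return 'thaana', 'arabic', 'latin', or 'unknown' by character count."""
    counts = {"thaana": 0, "arabic": 0, "latin": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0780 <= cp <= 0x07BF:        # Thaana block
            counts["thaana"] += 1
        elif 0x0600 <= cp <= 0x06FF:      # Arabic block
            counts["arabic"] += 1
        elif ch.isascii() and ch.isalpha():  # basic Latin letters
            counts["latin"] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


print(dominant_script("hello world"))
print(dominant_script("مرحبا"))
print(dominant_script("ދިވެހި"))
```

This is a crude heuristic: Arabic punctuation also falls in the Arabic block, and mixed-language output is reduced to a single label.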