---
language:
- dv
- ar
- en
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- mms
- ctc
- trilingual
- dhivehi
- arabic
- english
- madhaha
datasets:
- shiimi/dhivehi-audio-casts-processed
- Serialtechlab/dhivehi-mms-v5-combined
metrics:
- wer
base_model: Serialtechlab/mms-trilingual-dv-ar-en-v2
---

# MMS Trilingual ASR v3 - Dhivehi + Arabic + English (Madhaha Fix)

Fine-tuned version of `Serialtechlab/mms-trilingual-dv-ar-en-v2` with **fixed Madhaha recognition**.

## Problem Solved

The v2 model confused melodic Dhivehi (Madhaha, religious songs) with Arabic and output Arabic script instead of Thaana. This version fixes that issue.

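One quick way to spot the script confusion described above is to check which Unicode block dominates a transcription. The sketch below is illustrative and not part of the model's tooling: the helper name `dominant_script` is hypothetical, and it assumes the Thaana block (U+0780 to U+07BF) and the basic Arabic block (U+0600 to U+06FF) are enough to tell the two scripts apart.

```python
def dominant_script(text):
    """Return 'thaana', 'arabic', 'latin', or 'none' by counting characters per script."""
    counts = {"thaana": 0, "arabic": 0, "latin": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0780 <= cp <= 0x07BF:      # Thaana Unicode block
            counts["thaana"] += 1
        elif 0x0600 <= cp <= 0x06FF:    # basic Arabic Unicode block
            counts["arabic"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    best = max(counts, key=counts.get)
    # Text with no recognized letters (empty, digits, punctuation) maps to 'none'
    return best if counts[best] > 0 else "none"
```

Running this over transcriptions of Madhaha clips and counting how many come back as `'arabic'` would give a rough measure of how often the confusion occurs.
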
## Training Strategy

- Started from v2 model (preserves improved English/Arabic recognition)
- Trained ONLY on Dhivehi data (no Arabic interference)
- Oversampled melodic Dhivehi 3x to emphasize the pattern
- Higher learning rate (3e-05) to change associations aggressively
- 5 epochs for stronger reinforcement

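The 3x oversampling step can be sketched in plain Python as follows. This is a minimal illustration, not the actual training script: the function name `oversample`, the seed, and the shuffle are assumptions, and real training would operate on dataset records rather than simple lists.

```python
import random


def oversample(melodic, normal, factor=3, seed=0):
    """Repeat every melodic example `factor` times, append the normal examples, and shuffle."""
    combined = list(melodic) * factor + list(normal)
    # Deterministic shuffle so repeated copies are not adjacent in every epoch
    random.Random(seed).shuffle(combined)
    return combined
```

With `factor=3`, each melodic clip appears three times per pass over the data, so the model sees the melodic pattern more often than its raw share of the corpus would give.
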
## Training Data

- Melodic Dhivehi: ~3000 samples (oversampled from audio casts)
- Normal Dhivehi: ~1500 samples

## Performance

- Final WER: 0.2153

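For context, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below shows that definition for a single sentence pair. It is not the evaluation code behind the figure above, and this card does not state which tool or evaluation set produced it; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.2153 corresponds to roughly one word-level error for every five reference words.
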
## Usage

```python
from transformers import AutoProcessor, Wav2Vec2ForCTC
import torch

processor = AutoProcessor.from_pretrained("Serialtechlab/mms-trilingual-dv-ar-en-v3")
model = Wav2Vec2ForCTC.from_pretrained("Serialtechlab/mms-trilingual-dv-ar-en-v3")

# Process audio: audio_array is a mono waveform sampled at 16 kHz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
```

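The snippet above leaves `audio_array` undefined. One way to produce it, sketched here with only the standard library, is to read a WAV file that is already 16 kHz, mono, 16-bit PCM. The helper name `load_wav_16k_mono` is hypothetical, and the sketch assumes the processor accepts a plain list of floats; audio in other formats or sample rates would need resampling with an audio library first.

```python
import struct
import wave


def load_wav_16k_mono(path):
    """Read a 16-bit PCM, mono, 16 kHz WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError("expected 16 kHz audio; resample the file first")
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    # Unpack little-endian signed 16-bit samples and scale them to floats
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The result can then be passed as `audio_array` in the usage example, for instance `audio_array = load_wav_16k_mono("clip.wav")`, where `clip.wav` is a placeholder path.
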
## Supported Languages

- Dhivehi (Thaana script) - including melodic/Madhaha
- Arabic (Arabic script) - preserved from v2
- English (Latin script) - preserved from v2 (improved Thaana transliteration)

## Changes from v2

- v3 specifically targets the Madhaha confusion issue
- Melodic Dhivehi now correctly outputs Thaana script
- Preserves v2's improved English and Arabic recognition