FarukSTT: Tunisian Derja Speech-to-Text

FarukSTT is the best publicly available ASR model for Tunisian Arabic (Derja), fine-tuned from Whisper Large v3. It handles real-world Tunisian speech including natural code-switching between Arabic, French, and English.

Out of the box, Whisper Large v3 scores above 50% word error rate (WER) on Tunisian Derja. Targeted fine-tuning on real Tunisian conversational data brings that down to about 32%.

Performance

Model                          WER
Whisper Large v3 (baseline)    >50%
FarukSTT v1                    33.26%
FarukSTT v2                    31.99%

Evaluated on the validation split of FARUKxAUTO/tunisian-asr-cleaned.
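For readers unfamiliar with the metric, here is a minimal sketch of how WER is usually defined: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This card does not say which tool or text normalization produced the scores above, so the function name `word_error_rate` and the example strings are illustrative assumptions, not the evaluation script used for FarukSTT.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: no text normalization (casing, punctuation,
    diacritics) is applied, and real evaluations often add one.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion over five reference words.
example_wer = word_error_rate("the meeting starts at nine", "the meeting start at")
print(f"{example_wer:.2%}")
```

A corpus-level score is typically computed as total errors over total reference words across all utterances, rather than as the mean of per-utterance scores.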

Key Features

  • Tunisian Derja: real dialectal Arabic, not MSA
  • Code-switching: Arabic + French + English mid-sentence
  • Real-world speech: podcasts, interviews, conversations
  • 54k training samples: 12.7 GB of audio data

Intended Use

  • Tunisian Arabic transcription
  • Meeting and interview transcription for Tunisian speakers
  • Input for downstream NLP tasks in Derja

Limitations

  • Optimized for Tunisian dialect, not Modern Standard Arabic
  • WER of about 32%: suitable for assisted transcription, not for verbatim-accurate output
  • May struggle in very noisy environments

Training Data

Fine-tuned on FARUKxAUTO/tunisian-asr-cleaned, which contains 54,156 audio-transcription pairs of real Tunisian speech, including natural code-switching.

Training Procedure

v2 Hyperparameters

  • learning_rate: 1e-6
  • train_batch_size: 2
  • gradient_accumulation_steps: 16 (effective batch: 32)
  • warmup_steps: 200
  • fp16: True
  • optimizer: AdamW
  • Framework: Transformers 4.45.0, PyTorch 2.11.0
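As a rough sketch, the v2 settings above can be written as a plain dictionary whose keys follow the argument names of `Seq2SeqTrainingArguments` in Transformers. Two things here are assumptions rather than facts from this card: that `train_batch_size` corresponds to `per_device_train_batch_size` on a single device, and that "AdamW" means the `adamw_torch` optimizer option. The helper `effective_batch_size` is hypothetical and only reproduces the arithmetic behind the stated effective batch of 32.

```python
# v2 hyperparameters from this card, keyed by the corresponding
# Seq2SeqTrainingArguments argument names.
v2_hyperparameters = {
    "learning_rate": 1e-6,
    "per_device_train_batch_size": 2,   # assumption: single-device training
    "gradient_accumulation_steps": 16,
    "warmup_steps": 200,
    "fp16": True,
    "optim": "adamw_torch",             # assumption: the card only says "AdamW"
}


def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


batch = effective_batch_size(
    v2_hyperparameters["per_device_train_batch_size"],
    v2_hyperparameters["gradient_accumulation_steps"],
)
print(batch)
```

With one device this gives 2 × 16 = 32, which matches the effective batch size listed above.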

Training Results (v1)

Step    Validation Loss    WER
500     0.3228             34.02%
2000    0.3121             32.79%
4000    0.3110             33.26%

Training Results (v2)

Step    Validation Loss    WER
250     0.3114             32.90%
500     0.3060             31.99%

Usage

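from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
pipe = pipeline("automatic-speech-recognition", model="FARUKxAUTO/FarukSTT")

# Transcribe a local audio file; the pipeline returns a dict with a "text" key.
result = pipe("audio.wav")
print(result["text"])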
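Meeting and interview recordings are usually longer than the 30-second window Whisper models process at once. The sketch below shows one possible way to handle that with the `chunk_length_s` option of the Transformers ASR pipeline. The helper names `asr_pipeline_kwargs` and `transcribe_long`, and the 30-second chunk length, are illustrative assumptions rather than settings recommended by this card.

```python
def asr_pipeline_kwargs(model_id: str = "FARUKxAUTO/FarukSTT",
                        chunk_length_s: float = 30) -> dict:
    """Keyword arguments for transformers.pipeline with chunked inference."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        # Split long audio into windows of this many seconds (assumed value).
        "chunk_length_s": chunk_length_s,
    }


def transcribe_long(path: str, **overrides) -> str:
    """Transcribe an audio file of arbitrary length and return the text."""
    # Imported lazily so the kwargs helper works without transformers installed.
    from transformers import pipeline

    pipe = pipeline(**{**asr_pipeline_kwargs(), **overrides})
    return pipe(path)["text"]


# Example call (downloads the model on first use):
# print(transcribe_long("interview.wav"))
```

Chunked inference can introduce small errors at chunk boundaries, so it is worth spot-checking the output on a few long recordings.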

Citation

If you use this model, please credit:

FarukSTT by FARUK BATTIKH, Tunisian Derja ASR, 2024
