	---
	language: ar
	license: apache-2.0
	tags:
	- whisper
	- automatic-speech-recognition
	- asr
	- audio
	- arabic
	- egyptian-arabic
	datasets:
	- MAdel121/arabic-egy-cleaned
	metrics:
	- wer
	- cer
	base_model: openai/whisper-medium
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	model-index:
	- name: whisper-medium-egy
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: MAdel121/arabic-egy-cleaned (validation split)
	      type: MAdel121/arabic-egy-cleaned
	      config: ar
	      split: validation
	    metrics:
	    - name: WER
	      type: wer
	      value: 18.029990439289488
	    - name: CER
	      type: cer
	      value: 13.375029793807732
	---

	# Whisper Medium Egyptian Arabic (whisper-medium-egy)

	This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on a custom dataset of approximately 72 hours of Egyptian Arabic speech. It is designed for automatic speech recognition (ASR) of the Egyptian Arabic dialect.

	## Model Description

	* Base Model: `openai/whisper-medium`
	* Language: Arabic (ar), specifically focused on Egyptian dialect (arz)
	* Fine-tuning Dataset: `MAdel121/arabic-egy-cleaned` (approx. 72 hours)
	* Total Training Steps: 7299
	* Epochs: 10
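
As a rough sanity check, the step and epoch counts above can be combined with the batch settings listed under Training Procedure (loader batch size 8, gradient accumulation factor 2, 1000 warmup steps). The sketch below assumes that "Total Training Steps" counts optimizer updates rather than individual forward passes. The card does not say which, so the derived figures are estimates only.

```python
# Values taken from this model card.
total_steps = 7299
epochs = 10
loader_batch_size = 8
grad_accum_factor = 2
warmup_steps = 1000

# Samples consumed per optimizer update.
effective_batch_size = loader_batch_size * grad_accum_factor

# ASSUMPTION: total_steps counts optimizer updates, not micro-batches.
steps_per_epoch = total_steps / epochs
approx_samples_per_epoch = steps_per_epoch * effective_batch_size

# Share of training spent ramping the learning rate up.
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"approx. samples per epoch: {approx_samples_per_epoch:.0f}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

If the step count instead refers to micro-batches, the per-epoch sample estimate halves.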

	## Intended Uses & Limitations

	This model is intended for transcribing speech in Egyptian Arabic.

	Intended Use:
	* Automatic transcription of audio recordings and live speech in Egyptian Arabic.
	* Assisting with content creation, subtitling, and voice-controlled applications for Egyptian Arabic speakers.

	Limitations:
	* Performance may degrade in highly noisy environments or with very strong, non-Egyptian accents.
	* The model was fine-tuned on a specific dataset; its performance on significantly different domains or audio characteristics might vary.
	* The sources and domains of the training data are not described in this card. Performance is likely to be better on audio that resembles the training set.

	## How to Use

	You can use this model with the `transformers` library and the `pipeline` interface for ease of use.

	```python
	from transformers import pipeline
	import torch

	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model="YOUR_HF_USERNAME/whisper-medium-egy",  # Replace YOUR_HF_USERNAME with your Hugging Face username
	    device=device,
	)

	# Example with a local audio file
	# audio_file = "path/to/your/egyptian_arabic_audio.wav"
	# transcription = pipe(audio_file, generate_kwargs={"language": "arabic"})["text"]
	# print(transcription)

	# Example with a Hugging Face dataset audio sample
	# from datasets import load_dataset
	# ds = load_dataset("MAdel121/arabic-egy-cleaned", "ar", split="validation") # Or your test split
	# sample = ds[0]["audio"] # Make sure your dataset has an "audio" column
	# result = pipe(sample.copy(), generate_kwargs={"language": "arabic"})
	# print(result["text"])
	```
	Replace `"YOUR_HF_USERNAME/whisper-medium-egy"` with the actual model ID. Passing `generate_kwargs={"language": "arabic"}` tells Whisper to decode in Arabic instead of relying on its automatic language detection, which helps avoid transcripts in the wrong language.
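
For the subtitling use case listed under Intended Uses, the pipeline can also return segment timestamps. The following is a minimal sketch, not part of the original card: `format_srt_time`, `chunks_to_srt`, and `transcribe_to_srt` are hypothetical helper names, the 30-second chunk length is an assumed setting for long recordings, and the model ID is the same placeholder used above.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert pipeline chunks ({"timestamp": (start, end), "text": ...}) to SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The final segment may have no end time; fall back to its start.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


def transcribe_to_srt(audio_path, model_id="YOUR_HF_USERNAME/whisper-medium-egy"):
    """Transcribe an audio file and return SRT subtitles (hypothetical helper)."""
    # Imported here so the formatting helpers can be used without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # ASSUMPTION: chunked inference for long recordings
    )
    result = pipe(
        audio_path,
        return_timestamps=True,
        generate_kwargs={"language": "arabic"},
    )
    return chunks_to_srt(result["chunks"])
```

Calling `transcribe_to_srt("path/to/your/egyptian_arabic_audio.wav")` would return a string that can be written to a `.srt` file. Segment boundaries come from Whisper's own timestamp predictions, so they may need manual adjustment for publication-quality subtitles.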

	## Training Data

	The model was fine-tuned on the `MAdel121/arabic-egy-cleaned` dataset available on the Hugging Face Hub. This dataset contains approximately 72 hours of Egyptian Arabic audio paired with transcripts.

	## Training Procedure

	The model was trained using the `transformers` library. The fine-tuning process involved the following key hyperparameters:

	* Base Model: `openai/whisper-medium`
	* Optimizer: AdamW
	* Learning Rate: 1e-5 (0.00001)
	* Warmup Steps: 1000
	* Weight Decay: 0.05
	* Gradient Accumulation Factor: 2
	* Batch Size (loader_batch_size): 8 (effective batch size would be 8 * 2 = 16)
	* Number of Epochs: 10
	* Max Grad Norm: 5
	* Augmentations Used:
	  * `use_drop_freq`: true
	  * `use_drop_chunk`: true
	  * `use_drop_bit_resolution`: true
	  * Other augmentations like `use_add_noise`, `use_speed_perturb`, `use_pitch_shift`, `use_add_reverb`, `use_codec_augment`, `use_gain` were set to `false`
	* Task: transcribe
	* Language: ar
	* Seed: 1986
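
To give an intuition for two of the enabled augmentations, here is a simplified, pure-Python illustration of chunk dropping and bit-resolution reduction. It is an assumption-laden sketch of the general ideas, not the code used in training: the function names, default parameters, and the choice to operate on plain lists of floats in [-1, 1] are all hypothetical, and frequency dropping is omitted because it requires a filtering library.

```python
import random


def drop_chunks(samples, num_chunks=2, max_len=1000, rng=None):
    """Zero out random contiguous spans of a waveform (illustrative only)."""
    rng = rng or random.Random()
    out = list(samples)
    if not out:
        return out
    for _ in range(num_chunks):
        # Pick a span length, then a start position that keeps it in range.
        length = rng.randint(1, min(max_len, len(out)))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length
    return out


def reduce_bit_resolution(samples, bits=8):
    """Quantize samples in [-1, 1] to a coarser grid (illustrative only)."""
    levels = 2 ** (bits - 1)
    quantized = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        quantized.append(round(clipped * levels) / levels)
    return quantized


# Example on a short synthetic signal with a fixed seed for repeatability.
signal = [0.5] * 100
augmented = reduce_bit_resolution(
    drop_chunks(signal, num_chunks=1, max_len=10, rng=random.Random(1986)),
    bits=8,
)
```

Both transforms keep the signal length unchanged, so the transcripts paired with the audio remain valid.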

	Training ran on a single A100 (80 GB) GPU on Modal Labs.

	The training was managed and tracked using Weights & Biases under the project `whisper-medium-egyptian-arabic` with resume ID `r3sz4v27`.

	## Training Code

	The training script is available [on GitHub](https://github.com/moadel321/Fine-tuning-whisper-on-Modal-Labs-with-speech-brain-augmentations-/blob/c85312785faa2b927cbc217fe43acb8ed660d2ee/train_whisper_modal.py).

	## Weights & Biases

	The training run is available on Weights & Biases: https://wandb.ai/m-adelomar1/whisper-medium-egyptian-arabic/

	## Evaluation Results

	The model was evaluated on the `validation` split of the `MAdel121/arabic-egy-cleaned` dataset.

	* Word Error Rate (WER): 18.03%
	* Character Error Rate (CER): 13.38%

	These metrics indicate the performance of the model on the validation set. Lower values are better.
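
For readers unfamiliar with these metrics, both are edit distances normalized by reference length: WER counts word-level substitutions, insertions, and deletions, while CER does the same at the character level. The sketch below shows the definitions in plain Python. It is illustrative only: the card does not state which tool or text normalization produced the reported numbers, and pooling edits over the whole corpus (rather than averaging per utterance) is an assumption here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return 100.0 * total_edits / total_length


# One substitution and one deletion against a four-word reference.
wer = error_rate(["a b c d"], ["a x c"], unit="word")
# One substituted character against a four-character reference.
cer = error_rate(["abcd"], ["abxd"], unit="char")
```

Because Arabic transcripts can differ in diacritics and letter variants, the normalization applied before scoring can shift both numbers noticeably, so scores are only comparable when computed with the same preprocessing.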

	## BibTeX Citation

	```bibtex
	% Replace the URL below with the actual model URL.
	@misc{madel_2025_whisper_medium_egy,
	  author = {Madel},
	  title = {Whisper Medium Fine-tuned for Egyptian Arabic},
	  year = {2025},
	  publisher = {Hugging Face},
	  journal = {Hugging Face Hub},
	  howpublished = {\url{https://huggingface.co/MAdel121/whisper-medium-egy}}
	}
	```