HebArabNlpProject/WhisperLevantine

@@ -20,22 +20,18 @@ license: apache-2.0
 This model is a fine-tuned version of [Whisper Medium](https://github.com/openai/whisper) tailored specifically for transcribing Levantine Arabic, focusing on the Israeli dialect. It is designed to improve automatic speech recognition (ASR) performance for this particular variant of Arabic.
-- **Base Model**: Whisper Medium
 - **Fine-tuned for**: Levantine Arabic (Israeli Dialect)
-- **WER on test set**: 14%
 ## Training Data
 The dataset used for training and fine-tuning this model consists of approximately 2,200 hours of transcribed audio, primarily featuring Israeli Levantine Arabic, along with some general Levantine Arabic content. The data sources include:
 1. **Self-maintained Collection**: 2,000 hours of audio data curated by the team, covering a wide range of Israeli Levantine Arabic speech.
-2. **[MGB-2 Corpus (Filtered)](https://huggingface.co/datasets/BelalElhossany/mgb2_audios_transcriptions_preprocessed)**: 200 hours of broadcast media in Arabic.
-3. **[CommonVoice18 (Filtered)](https://huggingface.co/datasets/fsicoli/common_voice_18_0)**: A filtered portion of the CommonVoice18 dataset.
-Filtering was applied using the [AlcLaM](https://arxiv.org/abs/2407.13097) Arabic language model to ensure relevance to Levantine Arabic.
-- **Total Dataset Size**: ~2,200 hours
-- **Sampling Rate**: 16kHz
 - **Annotation**: Human-transcribed and annotated for high accuracy.
 ## How to Use
@@ -43,20 +39,10 @@ Filtering was applied using the [AlcLaM](https://arxiv.org/abs/2407.13097) Arabi
 The model is compatible with 16kHz audio input. Ensure your files are at the same sample rate for optimal results. You can load the model as follows:
 ```python
-from transformers import WhisperProcessor, WhisperForConditionalGeneration
-import torch
-# Load the model and processor
-processor = WhisperProcessor.from_pretrained("HebArabNlpProject/whisperLevantine")
-model = WhisperForConditionalGeneration.from_pretrained("HebArabNlpProject/whisperLevantine").to("cuda" if torch.cuda.is_available() else "cpu")
-# Example usage: processing audio input
-file_path = ...  # wav filepath goes here
-audio_input, samplerate = torchaudio.load(file_path)
-inputs = processor(audio_input.squeeze(), return_tensors="pt", sampling_rate=samplerate).to("cuda" if torch.cuda.is_available() else "cpu")
-# Run inference
 with torch.no_grad():
-    generated_ids = model.generate(inputs["input_features"])
-transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
-print(transcription[0])

 This model is a fine-tuned version of [Whisper Medium](https://github.com/openai/whisper) tailored specifically for transcribing Levantine Arabic, focusing on the Israeli dialect. It is designed to improve automatic speech recognition (ASR) performance for this particular variant of Arabic.
+- **Base Model**: Whisper Large V3
 - **Fine-tuned for**: Levantine Arabic (Israeli Dialect)
+- **WER on test set**: 35%
 ## Training Data
 The dataset used for training and fine-tuning this model consists of approximately 2,200 hours of transcribed audio, primarily featuring Israeli Levantine Arabic, along with some general Levantine Arabic content. The data sources include:
 1. **Self-maintained Collection**: 2,000 hours of audio data curated by the team, covering a wide range of Israeli Levantine Arabic speech.
+- **Total Dataset Size**: ~1,200 hours
+- **Sampling Rate**: 8kHz - upsampled to 16kHz
 - **Annotation**: Human-transcribed and annotated for high accuracy.
 ## How to Use
 The model is compatible with 16kHz audio input. Ensure your files are at the same sample rate for optimal results. You can load the model as follows:
 ```python
+import faster_whisper
 with torch.no_grad():
+    audio_data, sample_rate = librosa.load(audio_file)
+    audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
+    segs, _ = model.transcribe(audio_data, language='ar')
+    transcript = ' '.join(s.text for s in segs)
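
Within this hunk, the updated snippet uses `model`, `librosa`, and `torch` without importing or defining them. A minimal self-contained sketch of the same flow follows. It assumes the repository can be loaded directly by `faster_whisper.WhisperModel` (that is, that it ships CTranslate2-format weights), and the helper names `load_audio_16k`, `transcribe`, and `run_demo` are hypothetical, introduced only for illustration. Because faster-whisper runs on CTranslate2 rather than PyTorch, the `torch.no_grad()` context should not be needed here.

```python
TARGET_SR = 16_000  # the model card states the model expects 16 kHz input


def load_audio_16k(path):
    """Decode an audio file to a mono float32 array at 16 kHz."""
    import librosa  # imported lazily so transcribe() can be used without it

    # Passing sr= makes librosa resample while loading, in a single step.
    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return audio


def transcribe(model, audio, language="ar"):
    """Run a faster-whisper style model and join the segment texts."""
    # transcribe() returns a lazy iterator of segments plus an info object;
    # decoding actually happens while the iterator is consumed.
    segments, _info = model.transcribe(audio, language=language)
    return " ".join(seg.text.strip() for seg in segments)


def run_demo(audio_path, model_id="HebArabNlpProject/WhisperLevantine"):
    """Hypothetical end-to-end usage; model_id is an assumption."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_id)  # device and compute type left at defaults
    return transcribe(model, load_audio_16k(audio_path))
```

Calling `run_demo("sample.wav")` (a placeholder path) would download the model on first use and return the transcript as a single string.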