Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers

Automatic Speech Recognition

Ayoub-Laachir committed on Oct 2, 2024

Commit d021e05 (verified) · 1 parent: 1744f1a

Update README.md

Files changed (1)

README.md +111 -0

@@ -59,6 +59,117 @@ These metrics demonstrate the model's ability to accurately transcribe Moroccan
 The fine-tuned model shows improved handling of Darija-specific words, sentence structure, and overall accuracy.
+## Audio Transcription Script with PEFT Layers
+This script shows how to transcribe audio files with the Whisper Large V3 model fine-tuned for Moroccan Darija. It uses PEFT (Parameter-Efficient Fine-Tuning) to load the LoRA adapter layers on top of the base model and merges them into it before inference.
+### Required Libraries
+Install the following libraries before running the script:
+```bash
+# In a notebook (Colab/Jupyter), prefix each command with "!"
+pip install --upgrade pip
+pip install --upgrade transformers accelerate librosa soundfile pydub
+pip install peft==0.3.0  # Install PEFT library
+```
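Before loading the model, it can help to confirm which of these packages are actually present in the environment, since the `peft` version is pinned while the others are upgraded. The snippet below is a minimal sketch using only the standard library. The helper name `installed_versions` and the `REQUIRED` list are hypothetical; `torch` is included on the assumption that it is already installed, because the script imports it but the commands above do not install it.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names from the pip commands above, plus torch (assumed preinstalled).
REQUIRED = ["transformers", "accelerate", "librosa", "soundfile", "pydub", "peft", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before the transcription script will import cleanly.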
+```python
+import torch
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+import librosa
+import soundfile as sf
+from pydub import AudioSegment
+from peft import PeftModel, PeftConfig  # Import PEFT classes
+# Set the device to GPU if available, else use CPU
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+# Configuration for the base Whisper model
+base_model_name = "openai/whisper-large-v3"  # Base model for Whisper
+processor = AutoProcessor.from_pretrained(base_model_name)  # Load the processor
+# Load your fine-tuned model configuration
+model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"  # Fine-tuned model with LoRA layers
+peft_config = PeftConfig.from_pretrained(model_name)  # Load PEFT configuration
+# Load the base model
+base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)  # Load the base model in the same dtype the pipeline uses
+# Load the PEFT model
+model = PeftModel.from_pretrained(base_model, model_name).to(device)  # Load the PEFT model
+# Merge the LoRA weights with the base model
+model = model.merge_and_unload()  # Combine the LoRA weights into the base model
+# Configuration for transcription
+config = {
+    "language": "arabic",  # Language for transcription
+    "task": "transcribe",  # Task type
+    "chunk_length_s": 30,  # Length of each audio chunk in seconds
+    "stride_length_s": 5,   # Overlap between chunks in seconds
+}
+# Initialize the automatic speech recognition pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model=model,  # Use the merged model
+    tokenizer=processor.tokenizer,
+    feature_extractor=processor.feature_extractor,
+    torch_dtype=torch_dtype,
+    device=device,
+    chunk_length_s=config["chunk_length_s"],
+    stride_length_s=config["stride_length_s"],
+)
+# Convert audio to 16kHz sampling rate
+def convert_audio_to_16khz(input_path, output_path):
+    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file
+    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
+    sf.write(output_path, audio_16k, 16000)  # Save the converted audio
+# Format time in HH:MM:SS.milliseconds
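+def format_time(seconds):
+    total_ms = int(round(seconds * 1000))  # Work in whole milliseconds so the seconds field never rounds up to 60.000
+    hours, remainder = divmod(total_ms, 3_600_000)
+    minutes, remainder = divmod(remainder, 60_000)
+    secs, ms = divmod(remainder, 1000)
+    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"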
+# Transcribe audio file
+def transcribe_audio(audio_path):
+    try:
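+        result = pipe(
+            audio_path,
+            return_timestamps=True,  # Transcribe audio and get timestamps
+            generate_kwargs={"language": config["language"], "task": config["task"]},  # Apply the language and task set in config
+        )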
+        return result["chunks"]  # Return transcription chunks
+    except Exception as e:
+        print(f"Error transcribing audio: {e}")
+        return None
+# Main function to execute the transcription process
+def main():
+    # Specify input and output audio paths (update paths as needed)
+    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
+    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path
+    # Convert audio to 16kHz
+    convert_audio_to_16khz(input_audio_path, output_audio_path)
+    # Transcribe the converted audio
+    transcription_chunks = transcribe_audio(output_audio_path)
+    if transcription_chunks:
+        print("WEBVTT\n")  # Print header for WEBVTT format
+        for chunk in transcription_chunks:
+            start, end = chunk["timestamp"]
+            if end is None:  # Whisper may omit the end time of the final chunk
+                end = start
+            start_time = format_time(start)  # Format start time
+            end_time = format_time(end)      # Format end time
+            text = chunk["text"]                              # Get the transcribed text
+            print(f"{start_time} --> {end_time}")           # Print time range
+            print(f"{text}\n")                               # Print transcribed text
+    else:
+        print("Transcription failed.")
+if __name__ == "__main__":
+    main()
+```
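The script above prints the cues to standard output. A minimal sketch of collecting the same chunks into a WebVTT string and saving it to a file could look like the following. The helper names `vtt_timestamp`, `to_webvtt`, and `save_webvtt`, as well as the `fallback_duration` parameter, are hypothetical and not part of the original script. The chunk layout (a `timestamp` pair plus `text`) is the one the script reads from `result["chunks"]`.

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, computed in whole milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(chunks, fallback_duration=2.0):
    """Build a WebVTT document from chunks shaped like
    {"timestamp": (start, end), "text": "..."}.

    Assumption: if a chunk has no end time, the cue is closed
    fallback_duration seconds after its start.
    """
    lines = ["WEBVTT", ""]
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(chunk["text"].strip())
        lines.append("")  # A blank line separates cues
    return "\n".join(lines)


def save_webvtt(chunks, path):
    """Write the cues to a UTF-8 .vtt file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_webvtt(chunks))
```

In `main()`, a call such as `save_webvtt(transcription_chunks, "transcript.vtt")` would then replace the `print` loop; the output file name here is only an example.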
 ## Challenges and Future Improvements
 ### Challenges Encountered
 - Diverse spellings of words in Moroccan Darija