saadmannan committed
Commit b79357c · 1 Parent(s): 5554ef1

app file reviewed

Files changed (3)
  1. README.md +66 -6
  2. app.py +193 -0
  3. requirements.txt +3 -22
README.md CHANGED
@@ -1,12 +1,72 @@
 ---
-title: ASR Finetuning
-emoji: 🌖
-colorFrom: red
-colorTo: green
+title: Whisper German ASR
+emoji: 🎙️
+colorFrom: blue
+colorTo: purple
 sdk: gradio
-sdk_version: 5.49.1
+sdk_version: 4.0.0
 app_file: app.py
 pinned: false
+license: mit
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🎙️ Whisper German ASR
+
+Fine-tuned Whisper model for German Automatic Speech Recognition (ASR).
+
+## Description
+
+This Space provides an interactive interface for transcribing German audio using a fine-tuned version of OpenAI's Whisper-small model. The model has been specifically optimized for German speech recognition.
+
+## How to Use
+
+1. **Upload Audio**: Click on the audio input area to upload an audio file (WAV, MP3, FLAC, etc.)
+- OR -
+2. **Record Audio**: Use the microphone button to record audio directly
+3. **Transcribe**: Click the "Transcribe" button to generate the transcription
+4. **View Results**: The transcription will appear on the right side
+
+## Model Details
+
+- **Base Model**: OpenAI Whisper-small (242M parameters)
+- **Fine-tuned on**: German MINDS14 dataset
+- **Language**: German (de)
+- **Task**: Transcription
+- **Performance**: ~13% Word Error Rate (WER)
+
+## Features
+
+- ✅ Upload audio files in various formats
+- ✅ Record audio directly from microphone
+- ✅ Real-time transcription
+- ✅ Optimized for German language
+- ✅ Support for audio up to 30 seconds
+
+## Technical Specifications
+
+- **Sample Rate**: 16kHz
+- **Max Duration**: 30 seconds
+- **Beam Search**: 5 beams
+- **Device**: CPU/GPU auto-detection
+
+## Tips for Best Results
+
+- Speak clearly and at a moderate pace
+- Minimize background noise
+- Ensure audio is in German language
+- Keep audio clips between 1-30 seconds for optimal results
+
+## Links
+
+- [GitHub Repository](https://github.com/YOUR_USERNAME/whisper-german-asr)
+- [Model Card](https://huggingface.co/YOUR_USERNAME/whisper-small-german)
+
+## License
+
+MIT License
+
+## Acknowledgments
+
+- [OpenAI Whisper](https://github.com/openai/whisper) for the base model
+- [Hugging Face](https://huggingface.co/) for Transformers library
+- [PolyAI](https://huggingface.co/datasets/PolyAI/minds14) for the MINDS14 dataset
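Note on the "~13% Word Error Rate" figure in the README: the commit does not show how it was measured. The previous requirements.txt listed `jiwer` and `evaluate`, so it was presumably computed with one of those rather than with code like the following. As a reminder of what the metric means, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The function name `word_error_rate`, the lowercasing, and the example sentences are illustrative assumptions, not part of this commit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance via dynamic programming, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences: one substituted word in an eight-word reference.
    ref = "ich möchte mein konto sperren lassen bitte sofort"
    hyp = "ich möchte mein konto sparen lassen bitte sofort"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

With one wrong word out of eight, the example yields 0.125, so a WER of about 13% corresponds to roughly one word error per eight reference words. Real evaluations usually also normalise punctuation and casing before scoring, which can shift the number noticeably.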
app.py ADDED
@@ -0,0 +1,193 @@
+"""
+Gradio Demo for Whisper German ASR - HuggingFace Space
+Interactive web interface for audio transcription
+"""
+
+import gradio as gr
+import torch
+from transformers import WhisperForConditionalGeneration, WhisperProcessor
+import librosa
+import numpy as np
+import logging
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+# Global variables
+model = None
+processor = None
+device = None
+
+
+def load_model(model_name="openai/whisper-small"):
+    """Load the Whisper model from HuggingFace Hub
+
+    Args:
+        model_name: HuggingFace model ID (e.g., 'openai/whisper-small' or 'YOUR_USERNAME/whisper-small-german')
+    """
+    global model, processor, device
+
+    logger.info(f"Loading model from HuggingFace Hub: {model_name}")
+
+    try:
+        processor = WhisperProcessor.from_pretrained(model_name)
+        model = WhisperForConditionalGeneration.from_pretrained(model_name)
+
+        # Set German language conditioning
+        model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
+            language="german",
+            task="transcribe"
+        )
+
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        model = model.to(device)
+        model.eval()
+
+        logger.info(f"✓ Model loaded successfully on {device}")
+        return f"Model loaded successfully on {device}"
+    except Exception as e:
+        logger.error(f"Failed to load model: {e}")
+        raise
+
+
+def transcribe_audio(audio_input):
+    """Transcribe audio from file upload or microphone"""
+    if model is None:
+        return "❌ Error: Model not loaded. Please wait for model to load."
+
+    try:
+        # Handle different input formats
+        if audio_input is None:
+            return "❌ No audio provided. Please upload an audio file or record using the microphone."
+
+        # audio_input is a tuple (sample_rate, audio_data) from gradio
+        if isinstance(audio_input, tuple):
+            sr, audio = audio_input
+            # Convert to float32 and normalize
+            if audio.dtype == np.int16:
+                audio = audio.astype(np.float32) / 32768.0
+            elif audio.dtype == np.int32:
+                audio = audio.astype(np.float32) / 2147483648.0
+        else:
+            # File path
+            audio, sr = librosa.load(audio_input, sr=16000, mono=True)
+
+        # Downmix to mono before resampling (stereo arrives as (samples, channels))
+        if len(audio.shape) > 1:
+            audio = audio.mean(axis=1)
+
+        # Resample the 1-D signal to 16 kHz if needed
+        if sr != 16000:
+            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
+
+        duration = len(audio) / 16000
+
+        # Process audio
+        input_features = processor(
+            audio,
+            sampling_rate=16000,
+            return_tensors="pt"
+        ).input_features.to(device)
+
+        # Generate transcription
+        with torch.no_grad():
+            predicted_ids = model.generate(
+                input_features,
+                max_length=448,
+                num_beams=5,
+                early_stopping=True
+            )
+
+        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+
+        logger.info(f"Transcribed {duration:.2f}s audio: {transcription[:50]}...")
+
+        return f"🎤 **Transcription:**\n\n{transcription}\n\n📊 **Duration:** {duration:.2f} seconds"
+
+    except Exception as e:
+        logger.error(f"Transcription error: {e}")
+        return f"❌ Error: {str(e)}"
+
+
+# Load model on startup
+# IMPORTANT: Replace 'openai/whisper-small' with your fine-tuned model ID
+# e.g., 'saadmannan/whisper-small-german' after you upload your model to HF Hub
+MODEL_ID = "openai/whisper-small"  # Change this to your model ID
+
+try:
+    load_model(MODEL_ID)
+except Exception as e:
+    logger.error(f"Failed to load model: {e}")
+    logger.info("Model will need to be loaded manually")
+
+
+# Create Gradio interface
+with gr.Blocks(title="Whisper German ASR", theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        """
+        # 🎙️ Whisper German ASR
+
+        Fine-tuned Whisper model for German speech recognition.
+
+        **How to use:**
+        1. Upload an audio file (WAV, MP3, FLAC, etc.) or record using your microphone
+        2. Click the "Transcribe" button
+        3. Wait for the transcription to appear
+
+        **Features:**
+        - Supports multiple audio formats
+        - Microphone recording
+        - Optimized for German language
+
+        **Model:** Whisper-small fine-tuned on German MINDS14 dataset
+        """
+    )
+
+    with gr.Row():
+        with gr.Column():
+            audio_input = gr.Audio(
+                sources=["upload", "microphone"],
+                type="numpy",
+                label="Upload Audio or Record"
+            )
+            transcribe_btn = gr.Button("🎯 Transcribe", variant="primary", size="lg")
+
+        with gr.Column():
+            output_text = gr.Markdown(label="Transcription Result")
+
+    transcribe_btn.click(
+        fn=transcribe_audio,
+        inputs=audio_input,
+        outputs=output_text
+    )
+
+    gr.Markdown(
+        """
+        ---
+        ## 📋 About This Model
+
+        This is a fine-tuned version of OpenAI's Whisper-small model,
+        specifically optimized for German speech recognition.
+
+        ### Performance
+        - **Word Error Rate (WER):** ~13%
+        - **Sample Rate:** 16kHz
+        - **Max Duration:** 30 seconds
+        - **Language:** German (de)
+
+        ### Tips for Best Results
+        - Speak clearly and at a moderate pace
+        - Minimize background noise
+        - Audio should be in German language
+        - Best results with 1-30 second clips
+
+        ### Links
+        - [GitHub Repository](https://github.com/YOUR_USERNAME/whisper-german-asr)
+        - [Model Card](https://huggingface.co/YOUR_USERNAME/whisper-small-german)
+        """
+    )
+
+
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
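Note on the 30-second limit: both the README and the "About This Model" panel state a maximum duration of 30 seconds, but `transcribe_audio` never checks the clip length. As far as I know, the Whisper feature extractor pads or truncates its input to 30 seconds by default, so anything beyond that would be dropped without warning. One option is to tile longer clips into windows and transcribe each in turn. The helper below is a minimal sketch of the tiling arithmetic only; `window_bounds`, `TARGET_SR`, and `MAX_SECONDS` are hypothetical names that do not appear in app.py.

```python
from typing import List, Tuple

TARGET_SR = 16_000   # sample rate app.py resamples to
MAX_SECONDS = 30.0   # limit quoted in the README


def window_bounds(num_samples: int,
                  sample_rate: int = TARGET_SR,
                  max_seconds: float = MAX_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices tiling a clip in windows of at most max_seconds."""
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    # Guard against a zero-length window for very small sample_rate * max_seconds.
    window = max(1, int(sample_rate * max_seconds))
    return [(start, min(start + window, num_samples))
            for start in range(0, num_samples, window)]


if __name__ == "__main__":
    # A 70 s clip at 16 kHz is covered by three windows: 30 s, 30 s, and 10 s.
    for start, end in window_bounds(70 * TARGET_SR):
        print(f"{start / TARGET_SR:5.1f}s -> {end / TARGET_SR:5.1f}s")
```

Each `(start, end)` pair could be used to slice the mono 16 kHz array before it goes through the processor and `model.generate`, with the partial transcriptions joined afterwards. Hard cuts like these can split a word at a boundary, so overlapping windows or simply rejecting over-length clips with a clear message may be the better choice for this Space.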
requirements.txt CHANGED
@@ -1,25 +1,6 @@
-# Core ML/DL frameworks
-torch>=2.2.0
 transformers>=4.42.0
-datasets>=2.19.0
-accelerate>=0.30.0
-
-# Audio processing
+torch>=2.2.0
+gradio>=4.0.0
 librosa>=0.10.1
-soundfile>=0.12.1
-
-# Metrics and evaluation
-jiwer>=3.0.4
-evaluate>=0.4.1
-
-# Utilities
 numpy>=1.24.0
-sentencepiece>=0.2.0
-einops>=0.7.0
-
-# Logging and visualization
-tensorboard>=2.16.0
-tensorboardX>=2.6.2
-
-# Optional: Flash Attention 2 (requires CUDA)
-# flash-attn>=2.5.0  # Uncomment if you have CUDA toolkit installed
+soundfile>=0.12.1
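Note on the trimmed requirements.txt: every entry is a lower bound only, so a local environment can drift without anyone noticing. A quick way to check an environment against these minimums before launching app.py might look like the sketch below. It uses only the standard library and handles just the `name>=X.Y.Z` form used in this file; the helper names (`parse_min_requirement`, `installed_version`, `meets`) are hypothetical, and a real project would more likely lean on `pip check` or the `packaging` library.

```python
from importlib import metadata
from typing import Optional, Tuple

Version = Tuple[int, ...]


def parse_min_requirement(line: str) -> Optional[Tuple[str, Version]]:
    """Parse 'name>=X.Y.Z' into (name, (X, Y, Z)); return None for blanks and comments."""
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    name, sep, version = line.partition(">=")
    if not sep:
        raise ValueError(f"only 'name>=version' lines are handled here: {line!r}")
    return name.strip(), tuple(int(part) for part in version.strip().split("."))


def installed_version(name: str) -> Optional[Version]:
    """Return the leading numeric components of the installed version, or None if absent."""
    try:
        raw = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    parts = []
    for piece in raw.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets(found: Version, minimum: Version) -> bool:
    """Compare versions after zero-padding, so (2, 2) counts as equal to (2, 2, 0)."""
    width = max(len(found), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(found) >= pad(minimum)


# The six entries from the new requirements.txt in this commit.
REQUIREMENTS = """\
transformers>=4.42.0
torch>=2.2.0
gradio>=4.0.0
librosa>=0.10.1
numpy>=1.24.0
soundfile>=0.12.1
"""

if __name__ == "__main__":
    for line in REQUIREMENTS.splitlines():
        parsed = parse_min_requirement(line)
        if parsed is None:
            continue
        name, minimum = parsed
        found = installed_version(name)
        status = "missing" if found is None else ("ok" if meets(found, minimum) else "too old")
        print(f"{name:<14} need >= {'.'.join(map(str, minimum))}: {status}")
```

The output depends entirely on the local environment, so none is shown here. Pre-release and local version suffixes (for example `+cu121`) are ignored by this simplified parser.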