AfroLogicInsect committed
Commit 4dfa880 · verified · 1 Parent(s): 39fcfe7

Update app.py

Files changed (1)
  1. app.py +1 -605
app.py CHANGED
@@ -1,607 +1,3 @@
- testing_huggingface_speech_to_text.py (py, 108 lines)
-
- please assist in debugging:
-
- ---------------------------------------------------------------------------
- ImportError                               Traceback (most recent call last)
- /usr/local/lib/python3.11/dist-packages/transformers/utils/import_utils.py in _get_module(self, module_name)
-    1559 try:
- -> 1560     return importlib.import_module("." + module_name, self.__name__)
-    1561 except Exception as e:
-
- 14 frames
-
- ImportError: cannot import name 'EncoderDecoderCache' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)
- The above exception was the direct cause of the following exception:
-
- RuntimeError                              Traceback (most recent call last)
- /usr/local/lib/python3.11/dist-packages/transformers/utils/import_utils.py in _get_module(self, module_name)
-    1560     return importlib.import_module("." + module_name, self.__name__)
-    1561 except Exception as e:
- -> 1562     raise RuntimeError(
-    1563         f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
-    1564         f" traceback):\n{e}"
-
- RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
- cannot import name 'EncoderDecoderCache' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)
-
- The error you're encountering is a version-compatibility issue between the transformers library and the saved model. The EncoderDecoderCache class was introduced in newer versions of transformers, but your current installation doesn't have it.
-
- Here's how to fix this issue:
-
- Fixed HuggingFace Speech-to-Text Script
- Code
-
- The main issues in your code, and their fixes:
-
- Problem Analysis:
- Version compatibility: your transformers version (4.42.0) is missing the EncoderDecoderCache class that your saved model requires
- Missing imports: WhisperForConditionalGeneration and WhisperProcessor weren't imported
- Device handling: potential GPU/CPU device-mismatch issues
- Language detection: multilingual Whisper models can behave unexpectedly without an explicit language setting
-
- Key Fixes:
- Updated transformers to 4.44.0 (includes EncoderDecoderCache)
- Added the missing imports for the Whisper classes
- Added error handling and a fallback to the base Whisper model if your fine-tuned model fails to load
- Pinned the language to avoid the multilingual-detection issue
- Added two testing methods: direct model inference and the pipeline API (the pipeline is more reliable)
- Better device management with explicit GPU/CPU handling
- Enhanced debugging with file-existence checks and error tracebacks
-
- Quick Fix Commands:
- If you just want to update your current environment:
-
- bash
- pip install transformers==4.44.0 --upgrade
-
- Then add these imports at the top of your script:
-
- python
- from transformers import WhisperForConditionalGeneration, WhisperProcessor
-
- The updated script provides both methods to test your model; the pipeline method is more robust and handles edge cases better.
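-
- A quick sanity check after upgrading (a minimal sketch; the exact release that first shipped EncoderDecoderCache may differ, so treat 4.44.0 as the version this thread verified):
-
- python
- import transformers
-
- # This thread found 4.42.0 failing and 4.44.0 working
- print(transformers.__version__)
-
- # If this import succeeds, the class the saved model needs is available
- from transformers import EncoderDecoderCache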
-
- Model ready for testing! Use either: 1. test_model_inference('path/to/audio/file.wav') 2. test_pipeline_inference('path/to/audio/file.wav') # Recommended Testing with videoplayback.m4a... === Method 1: Direct model inference === Processing audio file: videoplayback.m4a /tmp/ipython-inp
-
- error on both methods
-
- The error you're encountering is a data-type mismatch between the model weights (stored in half precision, float16) and the input data (float32). This is a common issue when loading models that were saved with mixed-precision training.
-
- Fixed Data Type Speech-to-Text Script
- Code
-
- Root Cause Analysis:
- The error "Input type (float) and bias type (c10::Half) should be the same" occurs because:
-
- Your model was saved in mixed precision (float16/half precision)
- Input features are float32 by default
- PyTorch requires matching data types between model weights and inputs (see the one-line fix below)
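-
- In practice the fix is a one-liner before generation (a minimal sketch, assuming `model` and `input_features` as in the scripts above):
-
- python
- # Cast the input features to whatever dtype the model weights use
- input_features = input_features.to(dtype=model.dtype)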
-
- Key Fixes Applied:
-
- 1. Data Type Consistency
- Detects the model's dtype and converts inputs to match
- Handles both float16 and float32 scenarios
- Provides fallback conversions if needed
-
- 2. Better Audio Loading
- Multiple fallback methods for audio loading
- Handles the PySoundFile warning you're seeing
- Uses soundfile as a backup when librosa fails
-
- 3. CPU-Only Inference Option
- Completely avoids GPU dtype issues
- The most stable option for problematic models
- Slower, but the most reliable
-
- 4. Enhanced Error Handling
- Specific handling for dtype mismatches
- Progressive fallback strategies
- Clear diagnostic messages
-
- Quick Solutions (in order of preference):
-
- Option A: Try the CPU-only method first
- python
- result = test_cpu_only_inference('videoplayback.m4a')
-
- Option B: Convert your model to a consistent dtype
- python
- # If you have access to the model training/saving code
- model = model.float()  # Convert to float32
- model.save_pretrained("./whisper-finetuned-float32")
-
- Option C: Use a different audio format
- python
- # Convert M4A to WAV first
- import subprocess
- subprocess.run(['ffmpeg', '-i', 'videoplayback.m4a', 'videoplayback.wav'], check=True)
-
- The script provides three different approaches; the CPU-only method is the most likely to work around your dtype issues, and the pipeline method that partially worked for you should behave better with the dtype fixes applied. A sketch of the CPU-only helper follows below.
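-
- The CPU-only helper lives in the artifact, which isn't reproduced in this log; a minimal sketch of what test_cpu_only_inference could look like (the function name and the ./whisper-finetuned-final path come from this thread, everything else is an assumption):
-
- python
- import torch
- import librosa
- from transformers import WhisperForConditionalGeneration, WhisperProcessor
-
- def test_cpu_only_inference(audio_path, model_dir="./whisper-finetuned-final"):
-     # Load everything on CPU in float32 to sidestep half-precision mismatches
-     processor = WhisperProcessor.from_pretrained(model_dir)
-     model = WhisperForConditionalGeneration.from_pretrained(
-         model_dir, torch_dtype=torch.float32
-     ).to("cpu").eval()
-
-     audio, _ = librosa.load(audio_path, sr=16000)  # Whisper expects 16 kHz mono
-     inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
-     with torch.no_grad():
-         ids = model.generate(inputs.input_features, language="en", task="transcribe")
-     return processor.batch_decode(ids, skip_special_tokens=True)[0]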
-
- how to push both models to the hub?
-
- I'll help you push both models to the Hugging Face Hub. Here's a comprehensive guide:
-
- Push Models to Hugging Face Hub Guide
- Code
-
- Step-by-Step Guide to Push Your Models:
-
- 1. Prerequisites Setup
- bash
- # Get your Hugging Face token
- # Go to: https://huggingface.co/settings/tokens
- # Create a new token with "Write" permissions
-
- 2. Quick Configuration
- Before running the script, update these variables:
-
- python
- # Replace these with your actual values:
- HF_TOKEN = "hf_your_actual_token_here"
- MODELS_CONFIG = {
-     "original": {
-         "local_path": "./whisper-finetuned-final",
-         "hub_name": "your-actual-username/whisper-finetuned-original",
-         "description": "Fine-tuned Whisper model for speech recognition"
-     },
-     "float32": {
-         "local_path": "./whisper-finetuned-float32",
-         "hub_name": "your-actual-username/whisper-finetuned-float32",
-         "description": "Fine-tuned Whisper model (float32 version) for speech recognition"
-     }
- }
-
- 3. Execute the Upload
- python
- # Run this to push both models
- push_all_models_to_hub(private=False, create_pr=False)
-
- What This Script Does:
-
- ✅ Model Management
- Creates a float32 version of your original model (fixes the dtype issues)
- Handles both models with consistent naming
- Safe serialization for better compatibility
-
- ✅ Hub Integration
- Automatic login to Hugging Face
- Pushes model weights and tokenizer/processor
- Creates professional model cards with usage examples
- Handles private/public repos
-
- ✅ Quality Assurance
- Error handling for each step
- Progress tracking with clear status messages
- Post-upload testing to verify the models work
- Professional documentation generation
-
- Advanced Options:
-
- Private Repository
- python
- push_all_models_to_hub(private=True, create_pr=False)
-
- Create Pull Request (for organization repos)
- python
- push_all_models_to_hub(private=False, create_pr=True)
-
- Push Individual Models
- python
- # Push only one model
- push_model_to_hub("original", private=False)
- push_model_to_hub("float32", private=False)
-
- After Upload - Usage Examples:
-
- Load from Hub
- python
- from transformers import pipeline
-
- # Use your uploaded model
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model="your-username/whisper-finetuned-float32"
- )
-
- result = pipe("path/to/audio.wav")
- print(result['text'])
-
- Direct Model Loading
- python
- from transformers import WhisperProcessor, WhisperForConditionalGeneration
-
- processor = WhisperProcessor.from_pretrained("your-username/whisper-finetuned-float32")
- model = WhisperForConditionalGeneration.from_pretrained("your-username/whisper-finetuned-float32")
-
- The script will handle everything automatically - just update the configuration with your details and run it! A sketch of the core push helper follows below.
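-
- The helper functions above live in the artifact, which isn't reproduced in this log. A minimal sketch of the core push step, built on the standard push_to_hub API; note the signature here differs from the artifact's (which looks models up in MODELS_CONFIG by key), and all names below are assumptions:
-
- python
- from huggingface_hub import login
- from transformers import WhisperForConditionalGeneration, WhisperProcessor
-
- def push_model_to_hub(local_path, hub_name, token, private=False):
-     """Push one local checkpoint plus its processor to the Hub."""
-     login(token=token)
-     model = WhisperForConditionalGeneration.from_pretrained(local_path)
-     processor = WhisperProcessor.from_pretrained(local_path)
-     # push_to_hub creates the repo if it doesn't exist yet
-     model.push_to_hub(hub_name, private=private, safe_serialization=True)
-     processor.push_to_hub(hub_name, private=private)
-
- push_model_to_hub("./whisper-finetuned-float32",
-                   "your-username/whisper-finetuned-float32",
-                   token="hf_your_actual_token_here")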
-
- This worked best for me:
-
- import os
- import librosa
- import torch
-
- # assumes module-level globals: `model`, `processor`, and `model_dtype` (the model's torch dtype)
- def test_model_inference_fixed(audio_file_path=None):
-     """Fixed version with proper dtype handling"""
-     try:
-         if not audio_file_path or not os.path.exists(audio_file_path):
-             print(f"❌ Audio file not found: {audio_file_path}")
-             return None
-         print(f"🎵 Processing audio file: {audio_file_path}")
-         # Load audio file with better error handling
-         try:
-             audio_array, sr = librosa.load(audio_file_path, sr=16000)
-             print(f"✅ Audio loaded: {len(audio_array)} samples at {sr}Hz")
-         except Exception as audio_error:
-             print(f"❌ Audio loading failed: {audio_error}")
-             # Try alternative loading methods
-             try:
-                 import soundfile as sf
-                 audio_array, sr = sf.read(audio_file_path)
-                 if sr != 16000:
-                     audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=16000)
-                     sr = 16000
-                 print(f"✅ Audio loaded with soundfile: {len(audio_array)} samples at {sr}Hz")
-             except Exception:
-                 print("❌ All audio loading methods failed")
-                 return None
-         # Process with processor - ensure correct dtype
-         inputs = processor(
-             audio_array,
-             sampling_rate=16000,
-             return_tensors="pt"
-         )
-         input_features = inputs.input_features
-         # CRITICAL: Match input dtype to model dtype
-         if model_dtype == torch.float16:
-             input_features = input_features.half()  # Convert to float16
-         else:
-             input_features = input_features.float()  # Ensure float32
-         print(f"🔧 Input dtype: {input_features.dtype}, Model dtype: {model_dtype}")
-         # Move to same device as model
-         input_features = input_features.to(model.device)
-         # Generate transcription with proper parameters
-         with torch.no_grad():
-             try:
-                 predicted_ids = model.generate(
-                     input_features,
-                     language="en",
-                     task="transcribe",
-                     max_length=448,
-                     num_beams=1,
-                     do_sample=False,
-                     use_cache=True
-                 )
-                 transcription = processor.batch_decode(
-                     predicted_ids,
-                     skip_special_tokens=True
-                 )[0]
-                 print(f"✅ Transcription: {transcription}")
-                 return transcription
-             except RuntimeError as gen_error:
-                 if "Input type" in str(gen_error) and "bias type" in str(gen_error):
-                     print(f"❌ Still getting dtype error: {gen_error}")
-                     print("🔄 Trying to convert model to float32...")
-                     # Convert model to float32 as last resort
-                     model.float()
-                     input_features = input_features.float()
-                     predicted_ids = model.generate(
-                         input_features,
-                         language="en",
-                         task="transcribe",
-                         max_length=448,
-                         num_beams=1,
-                         do_sample=False
-                     )
-                     transcription = processor.batch_decode(
-                         predicted_ids,
-                         skip_special_tokens=True
-                     )[0]
-                     print(f"✅ Transcription (after float32 conversion): {transcription}")
-                     return transcription
-                 else:
-                     raise gen_error
-     except Exception as e:
-         print(f"❌ Inference failed: {e}")
-         import traceback
-         traceback.print_exc()
-         return None
-
- Can I have it so that the results include timestamps from the audio input along with the transcription output?
-
- Absolutely! Here's an enhanced version that produces timestamped transcriptions:
-
- Timestamped Whisper Transcription
- Code
-
- This enhanced version provides timestamped transcriptions with multiple approaches:
-
- 🎯 Key Features:
-
- 1. Built-in Whisper Timestamps
- Uses Whisper's native timestamp functionality
- More accurate alignment with the audio
- Automatic segment detection
-
- 2. Chunked Processing
- Fallback method if built-in timestamps fail
- Processes audio in overlapping chunks
- Manual timestamp calculation
- Better for long audio files
-
- 3. Rich Output Format
- Full transcription text
- Individual timestamped segments
- Metadata (duration, chunks, etc.)
- Formatted display with MM:SS timestamps
- JSON export for further processing
-
- 📊 Sample Output:
- 🎯 TIMESTAMPED TRANSCRIPTION
- ==================================================
-
- 📝 FULL TEXT:
- This Saturday afternoon my parents had gone to attend the church members' wedding...
-
- 📊 METADATA:
- total_duration: 180.5
- num_chunks: 6
- chunk_length: 30
-
- 🕐 TIMESTAMPED SEGMENTS:
- ------------------------------
- 1. [00:00 - 00:05] This Saturday afternoon my parents had gone to attend
- 2. [00:05 - 00:12] the church members' wedding and since they cook didn't bring us
- 3. [00:12 - 00:18] with them My sister and I homework knocked out and bored
-
- 🚀 Usage Examples:
-
- Basic Usage:
- python
- result = test_model_inference_with_timestamps('videoplayback.m4a')
- print(result['text'])  # Full transcription
- for segment in result['segments']:
-     print(f"[{segment['start']}-{segment['end']}s]: {segment['text']}")
-
- Custom Chunk Length:
- python
- # For shorter, more precise segments
- result = test_model_inference_with_timestamps('audio.wav', chunk_length=15)
-
- # For longer segments (faster processing)
- result = test_model_inference_with_timestamps('audio.wav', chunk_length=45)
-
- Full Test with Formatting:
- python
- # This gives you the nicely formatted output and saves to JSON
- test_with_timestamps('videoplayback.m4a', save_results=True)
-
- 🔧 Technical Advantages:
- Dtype Handling: maintains your working dtype fix
- Error Recovery: falls back gracefully if one method fails
- Memory Efficient: processes long audio in manageable chunks
- Overlap Handling: prevents word cutoffs at chunk boundaries
- Export Ready: JSON format for integration with other tools
-
- The timestamps will help you create subtitles, analyze specific parts of the audio, or build interactive transcription viewers! A sketch of the chunked approach follows below.
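-
- The artifact code isn't reproduced in this log; a minimal sketch of the chunked approach (assuming `model` and `processor` are already loaded as in the earlier scripts; the function name here is a shortened stand-in for the one above, and this simple version does not do the overlap handling):
-
- python
- import librosa
- import torch
-
- def transcribe_chunked(audio_path, chunk_length=30):
-     audio, sr = librosa.load(audio_path, sr=16000)
-     step = chunk_length * sr
-     segments = []
-     for start in range(0, len(audio), step):
-         chunk = audio[start:start + step]
-         feats = processor(chunk, sampling_rate=sr, return_tensors="pt").input_features
-         feats = feats.to(model.device, dtype=model.dtype)  # keep dtypes matched
-         with torch.no_grad():
-             ids = model.generate(feats, language="en", task="transcribe")
-         text = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
-         segments.append({
-             "start": start / sr,
-             "end": min(start + step, len(audio)) / sr,
-             "text": text,
-         })
-     return {"text": " ".join(s["text"] for s in segments), "segments": segments}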
-
- Well, this is the full transcribed text:
-
- THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE WHY AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING AND SINCE THE COOKO DIDN'T BRING US WITH So This Saturday afternoon my parents had gone to attend the church members wedding And since the cook didn't bring us with them my sister and I homework knocked out and bored We had to find our own fun, right? So We stepped out of our compound hmm something we had never dared to do before I was so afraid SO WE STEPPED OUT OF OUR COMPOUND SOMETHING WE HAD NEVER DEAD TO DO BEFORE I WAS SO AFRAID NOW THAT I THINK OF IT BE LIKE SEDA GEDE SNICK OUT ONCE ONCE OR MY LIFE SAYS SHAR ANYWAY THAT WAS HOW PLACE AROUND THE LEAK SOMEWHERE EVEN SWIMPING AND THEN SUDDENLY I NOTICED THAT I COULDN'T FIND MY SISTER I COLD FOR HER AND GOT NO And then suddenly I noticed that I couldn't find my sister I called for her and got no answer Well after BUT SHE WAS GONE I STARTED TO SCREAM I DIDN'T KNOW WHAT ELSE TO DO THEN THE MAD MAN CHOSED TO SHOW UP IN HIS VEST AND SHORTS EVERYONE'S CUTTED THEY LET MY LIFELESS SISTER AND LITTLE HELP LESS ME BY THE LAKE THEN THIS MAD WENT ON TO GIVE MY SISTAR WHAT I UNDERSTAND NOW TO BE CPR THE MAD MAN SAVED MY SISTAR'S LIFE THIS DATTU IS IN REMEMBERANCE OF MISTATI WILL NOW OF BLESSARD MEMORY AND HIS TWIN SISTER WHO HAD Died IN THAT SIEM LEAGUE WHEN THEY WERE MUCH YOUNGER HE HAD THIS EXACT DATSU ON HIS SHOULDER WOULD YOU BELIEVE ME IF I TOLD YOU THAT IT WAS BECAUSE OF THIS DATSU THAT HE CALLED HIM MAD BECAUSE OF DATSU YES HIS NAME WAS TAIWU THE SO-CALLED MAD MAN MY SISTAR'S LIFE SAVE'ER AND BY EXTENSION MY BECAUSE TRUST ME MY PEOPRIENTS WOULD HAVE KILLED ME THAT VERY DAYand by extension mine because trust me my parents would have killed me that very day
-
- And the saved .json is:
-
- {
-   "text": "THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE WHY WELL AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING",
-   "segments": [
-     {
-       "start": 0.0,
-       "end": 0.0,
-       "text": "THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE"
-     },
-     {
-       "start": 8.52,
-       "end": 8.52,
-       "text": "WHY WELL AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER"
-     },
-     {
-       "start": 18.52,
-       "end": 18.52,
-       "text": "SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING"
-     }
-   ],
-   "metadata": {
-     "method": "built-in_timestamps",
-     "total_segments": 3
-   }
- }
-
- I can see several issues with the timestamp implementation:
-
- Incomplete transcription: the JSON only contains the first 3 segments
- Duplicate timestamps: all segments show the same start and end times
- Built-in timestamp parsing failed: the parser isn't correctly extracting Whisper's timestamp tokens
-
- Let me fix this with a more robust implementation:
-
- Fixed Robust Timestamped Whisper Transcription
- Code
-
- 🔧 Key Fixes Applied:
-
- 1. Proper Chunking Strategy
- Non-overlapping time boundaries for accurate timestamps
- Overlapping audio processing to prevent word cutoffs
- Smaller chunks (15-20 s) for more precise timestamps
-
- 2. Overlap Detection & Removal
- Detects duplicate phrases between chunks
- Removes redundant text while preserving timestamps
- Maintains speech-flow continuity
-
- 3. Multiple Output Formats
- JSON - full structured data
- SRT - standard subtitle format
- VTT - web subtitle format
- TXT - human-readable transcript
-
- 4. Better Text Processing
- Cleans up transcription artifacts
- Removes repetitive filler words
- Proper capitalization
- Word-level timestamp estimation
-
- 🎯 Expected Results:
- With your audio file, you should now get:
-
- json
- {
-   "text": "This Saturday afternoon my parents had gone to attend the church members wedding...",
-   "segments": [
-     {"start": 0.0, "end": 15.2, "text": "This Saturday afternoon my parents had gone to attend the church members wedding"},
-     {"start": 15.2, "end": 28.7, "text": "And since the cook didn't bring us with them my sister and I were bored"},
-     {"start": 28.7, "end": 42.1, "text": "We had to find our own fun right so we stepped out of our compound"},
-     // ... more accurate segments
-   ],
-   "metadata": {
-     "total_duration": 180.5,
-     "successful_chunks": 12,
-     "method": "robust_chunked_timestamps"
-   }
- }
-
- 🚀 Usage:
- python
- # Test with better settings
- result = test_robust_timestamps('videoplayback.m4a', chunk_length=15, save_all=True)
-
- # This will create:
- # - videoplayback_timestamped.json
- # - videoplayback_timestamped.srt
- # - videoplayback_timestamped.vtt
- # - videoplayback_timestamped.txt
-
- The 15-second chunks should give you much more accurate timestamps, and the overlap removal will prevent the duplicate text you were seeing. The SRT file can be used directly with video players as subtitles! A sketch of the SRT conversion follows below.
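-
- The SRT writer itself is in the artifact; a minimal sketch of the conversion (assuming segments shaped like the JSON above; to_srt is a hypothetical name):
-
- python
- def to_srt(segments):
-     """Render [{'start', 'end', 'text'}, ...] as an SRT subtitle string."""
-     def fmt(t):
-         h, rem = divmod(int(t), 3600)
-         m, s = divmod(rem, 60)
-         ms = int(round((t - int(t)) * 1000))
-         return f"{h:02}:{m:02}:{s:02},{ms:03}"  # SRT uses a comma before milliseconds
-
-     blocks = []
-     for i, seg in enumerate(segments, start=1):
-         blocks.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text']}\n")
-     return "\n".join(blocks)
-
- with open("videoplayback_timestamped.srt", "w") as f:
-     f.write(to_srt(result["segments"]))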
-
- This worked beautifully!
-
- Finally, I'm on Hugging Face and want to host a Gradio Space where anyone can try out the model: they can either upload or record their voice (no more than 3 minutes) and have a transcript displayed (which can also be downloaded).
-
- Excellent! Let's create a polished Gradio interface for your Whisper model. Here's a complete Gradio app that handles both file uploads and voice recording:
-
- Gradio Whisper Transcription App
- Code
-
- 🚀 Complete Gradio App for Hugging Face Spaces
- This creates a professional transcription service with:
-
- ✨ Key Features:
-
- 🎤 Dual Input Methods
- File upload for existing audio
- Live microphone recording
- 3-minute limit for fair usage
-
- 📊 Rich Output Formats
- Display: formatted text with timestamps
- JSON: complete data structure
- SRT: ready-to-use subtitle files
-
- ⚡ Performance Optimized
- 15-second chunking for accuracy
- Overlap removal to prevent duplicates
- GPU acceleration when available
- Queue system for multiple users
-
- 🎨 Professional UI
- Clean tabbed interface
- Progress indicators
- Error handling with helpful messages
- Mobile-responsive design
-
- 📝 Setup for Hugging Face Spaces:
-
- 1. Create New Space
- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Choose:
-   SDK: Gradio
-   Hardware: CPU Basic (or GPU for faster processing)
-   Visibility: Public
-
- 2. Required Files:
- app.py (the code above)
-
- requirements.txt:
- torch>=2.0.0
- torchaudio>=2.0.0
- transformers>=4.44.0
- gradio>=4.0.0
- librosa>=0.10.0
- numpy>=1.21.0
- soundfile>=0.12.0
-
- README.md:
- markdown
- ---
- title: Whisper Speech Transcription
- emoji: 🎙️
- colorFrom: blue
- colorTo: purple
- sdk: gradio
- sdk_version: 4.44.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- # 🎙️ Whisper Speech Transcription
-
- AI-powered speech-to-text with timestamps using a fine-tuned Whisper model.
-
- ## Features
- - 📁 Upload audio files (MP3, WAV, M4A, etc.)
- - 🎤 Record voice directly in the browser
- - 🕐 Get precise timestamped transcriptions
- - 📄 Download JSON and SRT subtitle files
- - ⚡ Fast processing with chunked audio
- - 🎯 Optimized for English speech
-
- ## Usage
- 1. Choose either the "Upload Audio File" or "Record Voice" tab
- 2. Process your audio (maximum 3 minutes)
- 3. View the transcription with timestamps
- 4. Download the results in JSON or SRT format
-
- ## Powered by
- - **Whisper AI** for speech recognition
- - **Gradio** for the web interface
- - **Hugging Face** for model hosting
-
- 3. Important Configuration:
- Update this line in the code:
-
- python
- MODEL_NAME = "your-username/whisper-finetuned-float32"  # Replace with your actual model
-
- 4. Deploy Steps:
- Upload the files to your Space
- Wait for the build (takes 2-3 minutes)
- Test the interface
- Share your Space URL!
-
- 🎯 Expected User Experience:
-
- Upload Flow:
- User uploads an audio file
- Sees a "Processing..." status
- Gets a formatted transcription display
- Downloads the JSON/SRT files
-
- Recording Flow:
- User clicks the record button
- Records up to 3 minutes
- Clicks "Transcribe Recording"
- Gets instant results
-
- Output Example:
- 🎯 TRANSCRIPTION RESULTS
- ==================================================
- 📊 Duration: 85.3s
- 📝 Segments: 6
-
- 📄 FULL TRANSCRIPT:
- This Saturday afternoon my parents had gone to attend the church members wedding...
-
- 🕐 TIMESTAMPED SEGMENTS:
- ------------------------------
- 1. [00:00 - 00:15] This Saturday afternoon my parents had gone to attend the church members wedding
- 2. [00:15 - 00:28] And since the cook didn't bring us with them my sister and I were bored
- 3. [00:28 - 00:42] We had to find our own fun right so we stepped out of our compound
- ...
-
- 🚀 Advanced Features:
- Error Recovery: falls back to base Whisper if your model fails to load
- Memory Management: clears the GPU cache between requests
- File Management: auto-cleanup of temporary files
- Usage Limits: 3-minute max to prevent abuse
- Queue System: handles multiple users gracefully
-
- The app will automatically handle model loading and audio processing, and give users a seamless way to test your Whisper model! A stripped-down sketch of the app's shape follows below; the full app.py is the file this commit trims down to.
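-
- A minimal sketch of the app's shape (the real app.py retained below is more elaborate; the model id is a placeholder, and return_timestamps=True is the stock pipeline feature rather than the custom chunking described above):
-
- python
- import gradio as gr
- from transformers import pipeline
-
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model="your-username/whisper-finetuned-float32",
-     return_timestamps=True,
- )
-
- def transcribe(audio_path):
-     if audio_path is None:
-         return "Please upload or record audio first."
-     result = pipe(audio_path)
-     lines = [result["text"].strip(), ""]
-     for chunk in result.get("chunks", []):
-         start, end = chunk["timestamp"]
-         lines.append(f"[{start} - {end}] {chunk['text'].strip()}")
-     return "\n".join(lines)
-
- demo = gr.Interface(
-     fn=transcribe,
-     inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
-     outputs=gr.Textbox(label="Transcript"),
-     title="🎙️ Whisper Speech Transcription",
- )
-
- if __name__ == "__main__":
-     demo.launch()
-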
  import gradio as gr
  import torch
  import librosa
@@ -1259,4 +655,4 @@ if __name__ == "__main__":
  show_error=True,
  enable_queue=True, # Handle multiple users
  max_threads=10 # Limit concurrent processing
- )
+ )