Spaces: Update app.py

app.py CHANGED
@@ -1,607 +1,3 @@
testing_huggingface_speech_to_text.py (108 lines, py)

please assist debug:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/utils/import_utils.py in _get_module(self, module_name)
   1559             try:
-> 1560                 return importlib.import_module("." + module_name, self.__name__)
   1561             except Exception as e:

14 frames

ImportError: cannot import name 'EncoderDecoderCache' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/utils/import_utils.py in _get_module(self, module_name)
   1560                 return importlib.import_module("." + module_name, self.__name__)
   1561             except Exception as e:
-> 1562                 raise RuntimeError(
   1563                     f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
   1564                     f" traceback):\n{e}"

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
cannot import name 'EncoderDecoderCache' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)
The error you're encountering is a version compatibility issue between the transformers library and the saved model. The EncoderDecoderCache class was introduced in newer versions of transformers, but your current installation doesn't have it.

Here's how to fix this issue:

Fixed HuggingFace Speech-to-Text Script
Code

The main issues in your code and their fixes:

Problem Analysis:
1. Version Compatibility: Your transformers version (4.42.0) is missing the EncoderDecoderCache class that your saved model requires
2. Missing Imports: WhisperForConditionalGeneration and WhisperProcessor weren't imported
3. Device Handling: Potential GPU/CPU device mismatch issues
4. Language Detection: Multilingual Whisper models can cause unexpected behavior without proper language specification

Key Fixes:
- Updated the transformers version to 4.44.0 (includes EncoderDecoderCache)
- Added proper imports for the Whisper classes
- Added error handling and a fallback to the base Whisper model if your fine-tuned model fails
- Fixed the language specification to avoid the multilingual detection issue
- Added two testing methods: direct model inference and pipeline (the pipeline is more reliable)
- Better device management with proper GPU/CPU handling
- Enhanced debugging with file existence checks and error tracebacks

Quick Fix Commands:
If you want to just update your current environment:

```bash
pip install transformers==4.44.0 --upgrade
```

Then add these imports at the top of your script:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
```

The updated script provides both methods to test your model, with the pipeline method being more robust and handling edge cases better.
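A version mismatch like this can also be caught up front, before any model loading. A minimal sketch of such a pre-flight check; the `needs_upgrade` helper and the 4.44.0 threshold are illustrative, not part of the original script:

```python
# Illustrative pre-flight check: compare the installed transformers version
# string against an assumed minimum that ships EncoderDecoderCache.
def version_tuple(version: str) -> tuple:
    # "4.42.0" -> (4, 42, 0); non-numeric parts are skipped
    return tuple(int(part) for part in version.split(".")[:3] if part.isdigit())

def needs_upgrade(installed: str, required: str = "4.44.0") -> bool:
    # Tuple comparison handles multi-digit components correctly ("4.9" < "4.10")
    return version_tuple(installed) < version_tuple(required)
```

In practice the installed version would come from `importlib.metadata.version("transformers")`; failing fast with a clear "please upgrade" message beats the nested ImportError/RuntimeError above.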
Model ready for testing! Use either: 1. test_model_inference('path/to/audio/file.wav') 2. test_pipeline_inference('path/to/audio/file.wav') # Recommended Testing with videoplayback.m4a... === Method 1: Direct model inference === Processing audio file: videoplayback.m4a /tmp/ipython-inp
(pasted)

error on both methods
The error you're encountering is a data type mismatch between the model weights (stored in half-precision/float16) and the input data (float32). This is a common issue when loading models that were saved with mixed precision training.

Fixed Data Type Speech-to-Text Script
Code

Root Cause Analysis:
The error "Input type (float) and bias type (c10::Half) should be the same" occurs because:

1. Your model was saved with mixed precision (float16/half precision)
2. Input features are in float32 by default
3. PyTorch requires matching data types between model weights and inputs

Key Fixes Applied:

1. Data Type Consistency
- Detects the model's dtype and converts inputs to match
- Handles both float16 and float32 scenarios
- Provides fallback conversions if needed

2. Better Audio Loading
- Multiple fallback methods for audio loading
- Handles the PySoundFile warning you're seeing
- Uses soundfile as a backup when librosa fails

3. CPU-Only Inference Option
- Completely avoids GPU dtype issues
- Most stable for problematic models
- Slower but guaranteed to work

4. Enhanced Error Handling
- Specific handling for dtype mismatches
- Progressive fallback strategies
- Clear diagnostic messages

Quick Solutions (in order of preference):

Option A: Try the CPU-only method first

```python
result = test_cpu_only_inference('videoplayback.m4a')
```

Option B: Convert your model to a consistent dtype

```python
# If you have access to model training/saving code
model = model.float()  # Convert to float32
model.save_pretrained("./whisper-finetuned-float32")
```

Option C: Use a different audio format

```python
# Convert M4A to WAV first
import subprocess
subprocess.run(['ffmpeg', '-i', 'videoplayback.m4a', 'videoplayback.wav'])
```

The script provides three different approaches, with the CPU-only method being most likely to work around your dtype issues. The pipeline method that partially worked for you should work better with the dtype fixes applied.
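The mismatch behind that error message can be reproduced and resolved in a few lines. A standalone sketch of the casting step, using a random tensor as a stand-in for fp16 model weights rather than the actual Whisper model:

```python
import torch

# Stand-ins: fp16 "weights" as saved by mixed-precision training, fp32 inputs.
weights = torch.randn(3, 3, dtype=torch.float16)
features = torch.randn(1, 3)  # torch tensors default to float32

assert features.dtype != weights.dtype  # this mismatch triggers the error
features = features.to(weights.dtype)   # the fix: cast inputs to the model's dtype
```

Reading the target dtype off the loaded model (e.g. `next(model.parameters()).dtype`) and casting the input features to it is the same idea the fixed script applies.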
how to push both models to the hub
I'll help you push both models to the Hugging Face Hub. Here's a comprehensive guide:

Push Models to Hugging Face Hub Guide
Code

Step-by-Step Guide to Push Your Models:

1. Prerequisites Setup

```bash
# Get your Hugging Face token
# Go to: https://huggingface.co/settings/tokens
# Create a new token with "Write" permissions
```

2. Quick Configuration
Before running the script, update these variables:

```python
# Replace these with your actual values:
HF_TOKEN = "hf_your_actual_token_here"
MODELS_CONFIG = {
    "original": {
        "local_path": "./whisper-finetuned-final",
        "hub_name": "your-actual-username/whisper-finetuned-original",
        "description": "Fine-tuned Whisper model for speech recognition"
    },
    "float32": {
        "local_path": "./whisper-finetuned-float32",
        "hub_name": "your-actual-username/whisper-finetuned-float32",
        "description": "Fine-tuned Whisper model (float32 version) for speech recognition"
    }
}
```

3. Execute the Upload

```python
# Run this to push both models
push_all_models_to_hub(private=False, create_pr=False)
```

What This Script Does:

✅ Model Management
- Creates a float32 version of your original model (fixes the dtype issues)
- Handles both models with consistent naming
- Safe serialization for better compatibility

✅ Hub Integration
- Automatic login to Hugging Face
- Pushes model weights and the tokenizer/processor
- Creates professional model cards with usage examples
- Handles private/public repos

✅ Quality Assurance
- Error handling for each step
- Progress tracking with clear status messages
- Post-upload testing to verify the models work
- Professional documentation generation

Advanced Options:

Private Repository

```python
push_all_models_to_hub(private=True, create_pr=False)
```

Create Pull Request (for organization repos)

```python
push_all_models_to_hub(private=False, create_pr=True)
```

Push Individual Models

```python
# Push only one model
push_model_to_hub("original", private=False)
push_model_to_hub("float32", private=False)
```

After Upload - Usage Examples:

Load from Hub

```python
from transformers import pipeline

# Use your uploaded model
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-finetuned-float32"
)

result = pipe("path/to/audio.wav")
print(result['text'])
```

Direct Model Loading

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("your-username/whisper-finetuned-float32")
model = WhisperForConditionalGeneration.from_pretrained("your-username/whisper-finetuned-float32")
```

The script will handle everything automatically - just update the configuration with your details and run it!
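Before pushing, it can help to validate MODELS_CONFIG so a typo doesn't surface halfway through an upload. A small sketch of such a pre-flight check; the `validate_config` helper is illustrative and assumes the config shape shown above:

```python
import os

# Illustrative subset of the MODELS_CONFIG shown above.
MODELS_CONFIG = {
    "original": {
        "local_path": "./whisper-finetuned-final",
        "hub_name": "your-actual-username/whisper-finetuned-original",
    },
}

def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means ready to push."""
    problems = []
    for key, entry in config.items():
        # Hub repo ids must be namespaced as "username/repo-name"
        if "/" not in entry["hub_name"]:
            problems.append(f"{key}: hub_name must look like 'username/repo-name'")
        # The checkpoint directory has to exist locally before upload
        if not os.path.isdir(entry["local_path"]):
            problems.append(f"{key}: local checkpoint not found at {entry['local_path']}")
    return problems
```

Running this before `push_all_models_to_hub` turns a mid-upload failure into an immediate, readable report.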
This worked best for me:

```python
import os

import librosa
import torch

# Assumes `model`, `processor`, and `model_dtype` were set up earlier in the script.
def test_model_inference_fixed(audio_file_path=None):
    """Fixed version with proper dtype handling"""
    try:
        if not audio_file_path or not os.path.exists(audio_file_path):
            print(f"❌ Audio file not found: {audio_file_path}")
            return None

        print(f"🎵 Processing audio file: {audio_file_path}")

        # Load audio file with better error handling
        try:
            audio_array, sr = librosa.load(audio_file_path, sr=16000)
            print(f"✅ Audio loaded: {len(audio_array)} samples at {sr}Hz")
        except Exception as audio_error:
            print(f"❌ Audio loading failed: {audio_error}")
            # Try alternative loading methods
            try:
                import soundfile as sf
                audio_array, sr = sf.read(audio_file_path)
                if sr != 16000:
                    audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=16000)
                    sr = 16000
                print(f"✅ Audio loaded with soundfile: {len(audio_array)} samples at {sr}Hz")
            except Exception:
                print("❌ All audio loading methods failed")
                return None

        # Process with the processor - ensure correct dtype
        inputs = processor(
            audio_array,
            sampling_rate=16000,
            return_tensors="pt"
        )
        input_features = inputs.input_features

        # CRITICAL: Match input dtype to model dtype
        if model_dtype == torch.float16:
            input_features = input_features.half()  # Convert to float16
        else:
            input_features = input_features.float()  # Ensure float32

        print(f"🔧 Input dtype: {input_features.dtype}, Model dtype: {model_dtype}")

        # Move to the same device as the model
        input_features = input_features.to(model.device)

        # Generate transcription with proper parameters
        with torch.no_grad():
            try:
                predicted_ids = model.generate(
                    input_features,
                    language="en",
                    task="transcribe",
                    max_length=448,
                    num_beams=1,
                    do_sample=False,
                    use_cache=True
                )
                transcription = processor.batch_decode(
                    predicted_ids,
                    skip_special_tokens=True
                )[0]
                print(f"✅ Transcription: {transcription}")
                return transcription
            except RuntimeError as gen_error:
                if "Input type" in str(gen_error) and "bias type" in str(gen_error):
                    print(f"❌ Still getting dtype error: {gen_error}")
                    print("🔄 Trying to convert model to float32...")
                    # Convert the model to float32 as a last resort
                    model.float()
                    input_features = input_features.float()
                    predicted_ids = model.generate(
                        input_features,
                        language="en",
                        task="transcribe",
                        max_length=448,
                        num_beams=1,
                        do_sample=False
                    )
                    transcription = processor.batch_decode(
                        predicted_ids,
                        skip_special_tokens=True
                    )[0]
                    print(f"✅ Transcription (after float32 conversion): {transcription}")
                    return transcription
                else:
                    raise gen_error
    except Exception as e:
        print(f"❌ Inference failed: {e}")
        import traceback
        traceback.print_exc()
        return None
```

Can I have it so that the results include timestamps from the audio input alongside the transcription output?
Absolutely! Here's the enhanced version that provides timestamped transcriptions:

Timestamped Whisper Transcription
Code

This enhanced version provides timestamped transcriptions with multiple approaches:

🎯 Key Features:

1. Built-in Whisper Timestamps
- Uses Whisper's native timestamp functionality
- More accurate alignment with the audio
- Automatic segment detection

2. Chunked Processing
- Fallback method if built-in timestamps fail
- Processes audio in overlapping chunks
- Manual timestamp calculation
- Better for long audio files

3. Rich Output Format
- Full transcription text
- Individual timestamped segments
- Metadata (duration, chunks, etc.)
- Formatted display with MM:SS timestamps
- JSON export for further processing

📊 Sample Output:

```
🎯 TIMESTAMPED TRANSCRIPTION
==================================================

📝 FULL TEXT:
This Saturday afternoon my parents had gone to attend the church members' wedding...

📊 METADATA:
total_duration: 180.5
num_chunks: 6
chunk_length: 30

🕐 TIMESTAMPED SEGMENTS:
------------------------------
1. [00:00 - 00:05] This Saturday afternoon my parents had gone to attend
2. [00:05 - 00:12] the church members' wedding and since they cook didn't bring us
3. [00:12 - 00:18] with them My sister and I homework knocked out and bored
```

🚀 Usage Examples:

Basic Usage:

```python
result = test_model_inference_with_timestamps('videoplayback.m4a')
print(result['text'])  # Full transcription
for segment in result['segments']:
    print(f"[{segment['start']}-{segment['end']}s]: {segment['text']}")
```

Custom Chunk Length:

```python
# For shorter, more precise segments
result = test_model_inference_with_timestamps('audio.wav', chunk_length=15)

# For longer segments (faster processing)
result = test_model_inference_with_timestamps('audio.wav', chunk_length=45)
```

Full Test with Formatting:

```python
# This gives you the nicely formatted output and saves to JSON
test_with_timestamps('videoplayback.m4a', save_results=True)
```

🔧 Technical Advantages:
- Dtype Handling: Maintains your working dtype fix
- Error Recovery: Falls back gracefully if one method fails
- Memory Efficient: Processes long audio in manageable chunks
- Overlap Handling: Prevents word cutoffs at chunk boundaries
- Export Ready: JSON format for integration with other tools

The timestamps will help you create subtitles, analyze specific parts of the audio, or build interactive transcription viewers!
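The MM:SS display used in the segment listing above boils down to a tiny helper. A sketch; the `format_timestamp` name is illustrative:

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as the MM:SS form used in the segment list."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"
```

Applying it to a segment's `start` and `end` values produces labels like `[01:15 - 01:28]`.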
well, this is the full transcribed text:

THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE WHY AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING AND SINCE THE COOKO DIDN'T BRING US WITH So This Saturday afternoon my parents had gone to attend the church members wedding And since the cook didn't bring us with them my sister and I homework knocked out and bored We had to find our own fun, right? So We stepped out of our compound hmm something we had never dared to do before I was so afraid SO WE STEPPED OUT OF OUR COMPOUND SOMETHING WE HAD NEVER DEAD TO DO BEFORE I WAS SO AFRAID NOW THAT I THINK OF IT BE LIKE SEDA GEDE SNICK OUT ONCE ONCE OR MY LIFE SAYS SHAR ANYWAY THAT WAS HOW PLACE AROUND THE LEAK SOMEWHERE EVEN SWIMPING AND THEN SUDDENLY I NOTICED THAT I COULDN'T FIND MY SISTER I COLD FOR HER AND GOT NO And then suddenly I noticed that I couldn't find my sister I called for her and got no answer Well after BUT SHE WAS GONE I STARTED TO SCREAM I DIDN'T KNOW WHAT ELSE TO DO THEN THE MAD MAN CHOSED TO SHOW UP IN HIS VEST AND SHORTS EVERYONE'S CUTTED THEY LET MY LIFELESS SISTER AND LITTLE HELP LESS ME BY THE LAKE THEN THIS MAD WENT ON TO GIVE MY SISTAR WHAT I UNDERSTAND NOW TO BE CPR THE MAD MAN SAVED MY SISTAR'S LIFE THIS DATTU IS IN REMEMBERANCE OF MISTATI WILL NOW OF BLESSARD MEMORY AND HIS TWIN SISTER WHO HAD Died IN THAT SIEM LEAGUE WHEN THEY WERE MUCH YOUNGER HE HAD THIS EXACT DATSU ON HIS SHOULDER WOULD YOU BELIEVE ME IF I TOLD YOU THAT IT WAS BECAUSE OF THIS DATSU THAT HE CALLED HIM MAD BECAUSE OF DATSU YES HIS NAME WAS TAIWU THE SO-CALLED MAD MAN MY SISTAR'S LIFE SAVE'ER AND BY EXTENSION MY BECAUSE TRUST ME MY PEOPRIENTS WOULD HAVE KILLED ME THAT VERY DAYand by extension mine because trust me my parents would have killed me that very day

And the saved .json is:

```json
{
  "text": "THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE WHY WELL AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING",
  "segments": [
    {
      "start": 0.0,
      "end": 0.0,
      "text": "THIS THAT TOO I GOT IT RIGHT AFTER I BEKIN MY NURSE"
    },
    {
      "start": 8.52,
      "end": 8.52,
      "text": "WHY WELL AS THE URBASI IN GOES NOW WHO ASKED QUESTION ONES A BIT OF MATTER"
    },
    {
      "start": 18.52,
      "end": 18.52,
      "text": "SO THIS SATS DAY AFTERNOON MY PARENTS HAD GONE TO ATTEND THE CHURCH MEMBERS' WEDDING"
    }
  ],
  "metadata": {
    "method": "built-in_timestamps",
    "total_segments": 3
  }
}
```
I can see several issues with the timestamp implementation:

1. Incomplete transcription - the JSON only contains the first 3 segments
2. Duplicate timestamps - every segment shows the same start and end time
3. Built-in timestamp parsing failed - the parser isn't correctly extracting Whisper's timestamp tokens

Let me fix this with a more robust implementation:

Fixed Robust Timestamped Whisper Transcription
Code

🔧 Key Fixes Applied:

1. Proper Chunking Strategy
- Non-overlapping time boundaries for accurate timestamps
- Overlapping audio processing to prevent word cutoffs
- Smaller chunks (15-20s) for more precise timestamps

2. Overlap Detection & Removal
- Detects duplicate phrases between chunks
- Removes redundant text while preserving timestamps
- Maintains speech flow continuity

3. Multiple Output Formats
- JSON - full structured data
- SRT - standard subtitle format
- VTT - web subtitle format
- TXT - human-readable transcript

4. Better Text Processing
- Cleans up transcription artifacts
- Removes repetitive filler words
- Proper capitalization
- Word-level timestamp estimation

🎯 Expected Results:
With your audio file, you should now get something like:

```json
{
  "text": "This Saturday afternoon my parents had gone to attend the church members wedding...",
  "segments": [
    {"start": 0.0, "end": 15.2, "text": "This Saturday afternoon my parents had gone to attend the church members wedding"},
    {"start": 15.2, "end": 28.7, "text": "And since the cook didn't bring us with them my sister and I were bored"},
    {"start": 28.7, "end": 42.1, "text": "We had to find our own fun right so we stepped out of our compound"}
  ],
  "metadata": {
    "total_duration": 180.5,
    "successful_chunks": 12,
    "method": "robust_chunked_timestamps"
  }
}
```

🚀 Usage:

```python
# Test with better settings
result = test_robust_timestamps('videoplayback.m4a', chunk_length=15, save_all=True)

# This will create:
# - videoplayback_timestamped.json
# - videoplayback_timestamped.srt
# - videoplayback_timestamped.vtt
# - videoplayback_timestamped.txt
```

The 15-second chunks should give you much more accurate timestamps, and the overlap removal will prevent the duplicate text you were seeing. The SRT file can be used directly with video players for subtitles!
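The SRT export mentioned above is mostly timestamp bookkeeping. A self-contained sketch of that step; the `segments_to_srt` helper is illustrative, and the segment dicts use the same keys as the JSON above:

```python
def srt_timestamp(seconds: float) -> str:
    """SRT timestamps use the HH:MM:SS,mmm format."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Turn [{'start', 'end', 'text'}, ...] into the contents of an .srt file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)
```

Writing the returned string to `something.srt` gives a file most video players accept as subtitles; the VTT format differs mainly in its header and in using `.` instead of `,` before the milliseconds.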
This worked beautifully!

Finally, I'm on Hugging Face and want to host a Gradio Space for anyone to try out the model: they can either upload or record their voice (not more than 3 minutes) and have a transcript displayed (which can also be downloaded).
Excellent! Let's create a beautiful Gradio interface for your Whisper model. Here's a complete Gradio app that handles both file uploads and voice recording:

Gradio Whisper Transcription App
Code

🚀 Complete Gradio App for Hugging Face Spaces

This creates a professional transcription service with:

✨ Key Features:

🎤 Dual Input Methods
- File upload for existing audio
- Live microphone recording
- 3-minute limit for fair usage

📊 Rich Output Formats
- Display: formatted text with timestamps
- JSON: complete data structure
- SRT: ready-to-use subtitle files

⚡ Performance Optimized
- 15-second chunking for accuracy
- Overlap removal to prevent duplicates
- GPU acceleration when available
- Queue system for multiple users

🎨 Professional UI
- Clean tabbed interface
- Progress indicators
- Error handling with helpful messages
- Mobile-responsive design

📝 Setup for Hugging Face Spaces:

1. Create New Space
- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Choose:
  - SDK: Gradio
  - Hardware: CPU Basic (or GPU if you want faster processing)
  - Visibility: Public

2. Required Files:

app.py (the code above)

requirements.txt:

```
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.44.0
gradio>=4.0.0
librosa>=0.10.0
numpy>=1.21.0
soundfile>=0.12.0
```

README.md:

```markdown
---
title: Whisper Speech Transcription
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🎙️ Whisper Speech Transcription

AI-powered speech-to-text with timestamps using a fine-tuned Whisper model.

## Features
- 📁 Upload audio files (MP3, WAV, M4A, etc.)
- 🎤 Record voice directly in browser
- 🕐 Get precise timestamped transcriptions
- 📄 Download JSON and SRT subtitle files
- ⚡ Fast processing with chunked audio
- 🎯 Optimized for English speech

## Usage
1. Choose either "Upload Audio File" or "Record Voice" tab
2. Process your audio (maximum 3 minutes)
3. View transcription with timestamps
4. Download results in JSON or SRT format

## Powered by
- **Whisper AI** for speech recognition
- **Gradio** for the web interface
- **Hugging Face** for model hosting
```

3. Important Configuration:
Update this line in the code:

```python
MODEL_NAME = "your-username/whisper-finetuned-float32"  # Replace with your actual model
```

4. Deploy Steps:
- Upload the files to your Space
- Wait for the build (takes 2-3 minutes)
- Test the interface
- Share your Space URL!

🎯 Expected User Experience:

Upload Flow:
1. User uploads an audio file
2. Sees a "Processing..." status
3. Gets a formatted transcription display
4. Downloads JSON/SRT files

Recording Flow:
1. User clicks the record button
2. Records up to 3 minutes
3. Clicks "Transcribe Recording"
4. Gets instant results

Output Example:

```
🎯 TRANSCRIPTION RESULTS
==================================================
📊 Duration: 85.3s
📝 Segments: 6

📄 FULL TRANSCRIPT:
This Saturday afternoon my parents had gone to attend the church members wedding...

🕐 TIMESTAMPED SEGMENTS:
------------------------------
1. [00:00 - 00:15] This Saturday afternoon my parents had gone to attend the church members wedding
2. [00:15 - 00:28] And since the cook didn't bring us with them my sister and I were bored
3. [00:28 - 00:42] We had to find our own fun right so we stepped out of our compound
...
```

🚀 Advanced Features:
- Error Recovery: falls back to base Whisper if your model fails
- Memory Management: clears the GPU cache between requests
- File Management: auto-cleanup of temporary files
- Usage Limits: 3-minute max to prevent abuse
- Queue System: handles multiple users gracefully

The app will automatically handle the model loading and audio processing, and provide a seamless experience for users to test your Whisper model!
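The 3-minute cap can be enforced before any transcription work starts, straight from the decoded sample count. A sketch of that guard; the names and the 16 kHz assumption are illustrative:

```python
MAX_DURATION_SECONDS = 180  # the 3-minute limit described above

def check_duration(num_samples: int, sample_rate: int = 16000) -> float:
    """Return the clip duration in seconds, raising if it exceeds the limit."""
    duration = num_samples / sample_rate
    if duration > MAX_DURATION_SECONDS:
        raise ValueError(
            f"Audio is {duration:.1f}s long; the limit is {MAX_DURATION_SECONDS}s"
        )
    return duration
```

Calling this right after loading the upload (or recording) lets the app return a friendly error instead of burning Space compute on an over-long clip.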
import gradio as gr
import torch
import librosa

@@ -1259,4 +655,4 @@ if __name__ == "__main__":
        show_error=True,
        enable_queue=True,  # Handle multiple users
        max_threads=10  # Limit concurrent processing
-    )
+    )