Audio-To-SRT Setup Guide
Prerequisites
Since this is a private Hugging Face Space, your account token is automatically available. You just need to:
- Accept the Pyannote User Agreement
  - Visit: https://huggingface.co/pyannote/speaker-diarization-3.1
  - Click "Agree and access repository"
  - This is required for the speaker diarization model
Deployment
- Push this code to your Hugging Face Space
- The Space will automatically install dependencies from requirements.txt
- First load will take several minutes as it downloads:
  - Whisper large-v2 model (~3GB)
  - Pyannote diarization model
Usage
- Upload an audio file (MP3, Opus, WAV, M4A, etc.)
- Click "Generate SRT"
- Wait for processing (typically 2-5 minutes for a 1-hour audio file)
- Download the generated SRT file
Speaker Mapping
By default:
- SPEAKER_00 → Daniel (typically first/most frequent speaker)
- SPEAKER_01 → AC (typically second speaker)
If the mapping is reversed in your audio, you can adjust it in app.py at lines 63-66:
    def identify_speaker(speaker_label: str, pitch_info: Dict = None) -> str:
        if speaker_label == "SPEAKER_00":
            return "Daniel"  # Change if needed
        elif speaker_label == "SPEAKER_01":
            return "AC"  # Change if needed
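If the names need swapping often, one option is to keep the mapping in a single lookup table rather than an if/elif chain. The following is a minimal sketch of that idea, not the code in app.py: `SPEAKER_NAMES` and the fallback to the raw diarization label are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical lookup table; swap the two values if the names come out reversed.
SPEAKER_NAMES: Dict[str, str] = {
    "SPEAKER_00": "Daniel",
    "SPEAKER_01": "AC",
}


def identify_speaker(speaker_label: str, pitch_info: Optional[Dict] = None) -> str:
    """Map a diarization label to a display name."""
    # pitch_info is kept only to match the signature shown above; it is unused here.
    # Unknown labels (for example a third speaker) fall back to the raw label.
    return SPEAKER_NAMES.get(speaker_label, speaker_label)
```

With this layout, reversing the mapping means editing two dictionary values, and an unexpected label such as SPEAKER_02 passes through unchanged instead of returning None.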
Performance Notes
- Model Size: Using Whisper large-v2 for best accuracy
  - Change to medium or base in app.py:13 for faster processing
- GPU: Space will use GPU if available (much faster)
- Memory: Requires ~8GB RAM for large audio files
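The GPU and model-size notes above can be sketched as a small configuration check. This assumes PyTorch is the backend, which the source does not confirm; `pick_device` and `MODEL_SIZE` are hypothetical names, not identifiers from app.py.

```python
# Hypothetical constant; the source sets the model name at app.py:13.
MODEL_SIZE = "large-v2"  # "medium" or "base" trade accuracy for speed


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Whisper model: {MODEL_SIZE}, device: {pick_device()}")
```

Logging the chosen device at startup makes it easy to see in the Space logs whether the GPU is actually being used.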
Features Implemented
- Speech-to-text with word-level timestamps
- Speaker diarization (Daniel/AC identification)
- Automatic filler word removal (uh, um, like, you know, etc.)
- One speaker per subtitle block
- One sentence per subtitle block
- Standard SRT format output
- Direct download of SRT file
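For reference, the standard SRT layout named above is a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the subtitle text, and a blank line between blocks. The sketch below illustrates that layout only; the `(start, end, speaker, sentence)` tuple shape and the `Speaker: text` prefix are assumptions, not necessarily how app.py builds its output.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str, str]  # (start_sec, end_sec, speaker, sentence)


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(segments: Iterable[Segment]) -> str:
    """Render one subtitle block per segment, numbered from 1."""
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # Joining with a newline leaves one blank line between consecutive blocks.
    return "\n".join(blocks)
```

Passing one tuple per sentence keeps each block to a single speaker and a single sentence, matching the feature list.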
Troubleshooting
If speaker names are swapped:
- Edit the identify_speaker() function in app.py
If processing is slow:
- Change Whisper model to medium or base
- Check if GPU is enabled in Space settings
If pyannote fails to load:
- Verify you accepted the user agreement at https://huggingface.co/pyannote/speaker-diarization-3.1
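Accepting the agreement only helps if the app can actually see a token at runtime. As a rough diagnostic, the sketch below checks the environment for one. It assumes the token is exposed through the `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` environment variable; the source does not say how app.py reads the token, and `find_hf_token` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def find_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty Hugging Face token found in env, else None."""
    # Variable names are an assumption about how the Space exposes the token.
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No Hugging Face token found in the environment.")
    else:
        print("Token found; if loading still fails, re-check the user agreement.")
```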