Audio-To-SRT Setup Guide

Prerequisites

Since this is a private Hugging Face Space, your account token is automatically available. You just need to:

  1. Accept the Pyannote User Agreement

Deployment

  1. Push this code to your Hugging Face Space
  2. The space will automatically install dependencies from requirements.txt
  3. First load will take several minutes as it downloads:
    • Whisper large-v2 model (~3GB)
    • Pyannote diarization model

Usage

  1. Upload an audio file (MP3, Opus, WAV, M4A, etc.)
  2. Click "Generate SRT"
  3. Wait for processing (typically 2-5 minutes for a 1-hour audio file)
  4. Download the generated SRT file

Speaker Mapping

By default:

  • SPEAKER_00 → Daniel (typically first/most frequent speaker)
  • SPEAKER_01 → AC (typically second speaker)

If the mapping is reversed in your audio, adjust it in app.py at lines 63-66:

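from typing import Dict, Optional

def identify_speaker(speaker_label: str, pitch_info: Optional[Dict] = None) -> str:
    # Map pyannote's generic diarization labels to display names.
    if speaker_label == "SPEAKER_00":
        return "Daniel"  # Change if needed
    elif speaker_label == "SPEAKER_01":
        return "AC"      # Change if needed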
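
If the speakers swap often between recordings, a lookup table may be easier to edit than an if/elif chain. The following is only a sketch of that idea, not the code in app.py: SPEAKER_NAMES, map_speaker, and swapped are hypothetical names, and falling back to the raw label for unknown speakers is an assumption.

```python
from typing import Dict

# Hypothetical mapping table; the two names are the defaults listed above.
SPEAKER_NAMES: Dict[str, str] = {
    "SPEAKER_00": "Daniel",
    "SPEAKER_01": "AC",
}


def map_speaker(speaker_label: str, names: Dict[str, str] = SPEAKER_NAMES) -> str:
    """Return the display name for a diarization label.

    Unknown labels are returned unchanged (an assumed fallback).
    """
    return names.get(speaker_label, speaker_label)


def swapped(names: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with the two display names exchanged.

    Only valid for a table with exactly two entries.
    """
    (label_a, name_a), (label_b, name_b) = names.items()
    return {label_a: name_b, label_b: name_a}
```

With this shape, a reversed recording could be handled by calling map_speaker(label, swapped(SPEAKER_NAMES)) rather than editing the function body.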

Performance Notes

  • Model Size: Using Whisper large-v2 for best accuracy
    • Change to medium or base in app.py:13 for faster processing
  • GPU: Space will use GPU if available (much faster)
  • Memory: Requires ~8GB RAM for large audio files
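
One way to make the model size and device choices above adjustable without editing app.py is to read them at startup. This is a rough sketch under stated assumptions: the WHISPER_MODEL_SIZE environment variable and both function names are hypothetical, and the app itself may select its model and device differently.

```python
import os

# Hypothetical environment variable; the app hard-codes the size in app.py.
MODEL_ENV_VAR = "WHISPER_MODEL_SIZE"
ALLOWED_SIZES = ("base", "medium", "large-v2")


def choose_model_size(env=os.environ, default: str = "large-v2") -> str:
    """Pick a Whisper model size from the environment, else use the default."""
    size = env.get(MODEL_ENV_VAR, default)
    if size not in ALLOWED_SIZES:
        raise ValueError(f"Unsupported model size: {size!r}")
    return size


def choose_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # Imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Setting the variable to medium or base in the Space settings would then trade accuracy for speed without a code change.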

Features Implemented

  • Speech-to-text with word-level timestamps
  • Speaker diarization (Daniel/AC identification)
  • Automatic filler word removal (uh, um, like, you know, etc.)
  • One speaker per subtitle block
  • One sentence per subtitle block
  • Standard SRT format output
  • Direct download of SRT file
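
To illustrate how filler removal and SRT output fit together, here is a minimal sketch. It is not the implementation in app.py: the filler list is illustrative ("like" is left out because it is often a real word), the "Speaker: text" line format is an assumption, and all function names are hypothetical.

```python
import re
from typing import List, Tuple

# Illustrative filler list, matched as whole words with an optional trailing comma.
FILLER_PATTERN = re.compile(r"\b(?:uh|um|you know)\b,?\s*", re.IGNORECASE)


def remove_fillers(text: str) -> str:
    """Strip filler words and tidy the whitespace left behind."""
    cleaned = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str, str]]) -> str:
    """Render (start, end, speaker, sentence) tuples as numbered SRT blocks."""
    blocks = []
    for index, (start, end, speaker, sentence) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {remove_fillers(sentence)}\n"
        )
    # Blocks are separated by a blank line, as SRT players expect.
    return "\n".join(blocks)
```

Each tuple holds one sentence from one speaker, which matches the one-speaker, one-sentence-per-block rule listed above.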

Troubleshooting

If speaker names are swapped:

  • Edit the identify_speaker() function in app.py

If processing is slow:

  • Change Whisper model to medium or base
  • Check if GPU is enabled in Space settings

If pyannote fails to load: