danielrosehill / Audio-To-Subtitles / SETUP.md (commit 3569ef7)

preview code

raw

history blame contribute delete

2.25 kB

A newer version of the Gradio SDK is available: 6.13.0

Upgrade

Audio-To-SRT Setup Guide

Prerequisites

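Because this is a private Hugging Face Space, your account token is available to the app automatically. The only manual step is accepting the pyannote user agreements:

Accept the Pyannote User Agreement
- Visit: https://huggingface.co/pyannote/speaker-diarization-3.1
- Click "Agree and access repository"
- This is required for the speaker diarization model
- The 3.1 pipeline also loads the separately gated pyannote/segmentation-3.0 model, so accept its conditions at https://huggingface.co/pyannote/segmentation-3.0 as well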
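
If you want the app to fail early with a clear message when no token is present, a small check like the one below could run before the diarization model is loaded. This is a sketch, not code from app.py: it assumes the token is exposed as the HF_TOKEN environment variable (the name huggingface_hub reads), with HUGGING_FACE_HUB_TOKEN as the older fallback name, and resolve_hf_token is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

AGREEMENT_URL = "https://huggingface.co/pyannote/speaker-diarization-3.1"


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a Hugging Face token from the environment, or fail with a hint.

    Assumption: the token is stored in HF_TOKEN; HUGGING_FACE_HUB_TOKEN is
    checked as a legacy fallback. Empty values are treated as missing.
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(name)
        if token:
            return token
    raise RuntimeError(
        "No Hugging Face token found. Set HF_TOKEN as a Space secret and make "
        f"sure you accepted the user agreement at {AGREEMENT_URL}"
    )
```

In pyannote.audio 3.1's documented usage, the token is then passed as `Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=token)`.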

Deployment

1. Push this code to your Hugging Face Space.
2. The Space installs the dependencies from requirements.txt automatically.
3. The first load takes several minutes because it downloads:
   - the Whisper large-v2 model (~3 GB)
   - the pyannote diarization model

Usage

1. Upload an audio file (MP3, Opus, WAV, M4A, etc.).
2. Click "Generate SRT".
3. Wait for processing (typically 2-5 minutes for a 1-hour audio file; considerably longer without a GPU).
4. Download the generated SRT file.

Speaker Mapping

By default:

- SPEAKER_00 → Daniel (typically the first or most frequent speaker)
- SPEAKER_01 → AC (typically the second speaker)

If the mapping is reversed for your audio, adjust the identify_speaker() function in app.py (lines 63-66):

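    from typing import Dict, Optional

    def identify_speaker(speaker_label: str, pitch_info: Optional[Dict] = None) -> str:
        if speaker_label == "SPEAKER_00":
            return "Daniel"  # Change if needed
        elif speaker_label == "SPEAKER_01":
            return "AC"      # Change if needed
        return speaker_label  # Assumed fallback: keep the raw label for any other speaker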
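
If the fixed labels keep coming out swapped, one alternative to editing the if/elif chain is to rank speakers by total talk time and assign names in that order, which follows the "most frequent speaker" heuristic above. This is a hypothetical helper, not what app.py does as far as this guide shows; it assumes Daniel speaks the most and that diarization turns are available as (start, end, label) tuples.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical input: (start_seconds, end_seconds, speaker_label) turns.
Turn = Tuple[float, float, str]


def map_speakers_by_talk_time(turns: Iterable[Turn], names: List[str]) -> Dict[str, str]:
    """Give names[0] to the speaker with the most talk time, names[1] to the next, etc.

    Labels beyond len(names) keep their raw pyannote label.
    """
    talk_time: Dict[str, float] = defaultdict(float)
    for start, end, label in turns:
        talk_time[label] += max(0.0, end - start)
    # Most talk time first; ties broken by label so the result is deterministic.
    ranked = sorted(talk_time, key=lambda lbl: (-talk_time[lbl], lbl))
    return {
        label: names[i] if i < len(names) else label
        for i, label in enumerate(ranked)
    }
```

With a pyannote Annotation named `diarization`, the turns could be collected as `[(seg.start, seg.end, label) for seg, _, label in diarization.itertracks(yield_label=True)]`, and the returned dict looked up in place of calling identify_speaker().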

Performance Notes

- Model size: Whisper large-v2 is used for best accuracy. Change it to medium or base in app.py (line 13) for faster processing.
- GPU: the Space uses a GPU if one is available, which is much faster.
- Memory: expect to need about 8 GB of RAM for large audio files.
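
The model-size and GPU choices above could be made configurable instead of editing app.py:13 each time. The sketch below assumes the app loads Whisper through the openai-whisper package; WHISPER_MODEL is a hypothetical environment variable (it could be set as a Space variable), and whisper_model_name and pick_device are illustrative helpers, not functions from app.py.

```python
import os
from typing import Mapping, Optional

# Assumption: openai-whisper style model names; WHISPER_MODEL is hypothetical.
KNOWN_WHISPER_MODELS = ("tiny", "base", "small", "medium", "large-v2")
DEFAULT_WHISPER_MODEL = "large-v2"


def whisper_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model size from WHISPER_MODEL, defaulting to large-v2."""
    env = os.environ if env is None else env
    name = env.get("WHISPER_MODEL", DEFAULT_WHISPER_MODEL).strip()
    if name not in KNOWN_WHISPER_MODELS:
        raise ValueError(
            f"Unknown Whisper model {name!r}; expected one of {KNOWN_WHISPER_MODELS}"
        )
    return name


def pick_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # Imported lazily so the helper still works without PyTorch.
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

With openai-whisper, the two helpers would feed `whisper.load_model(whisper_model_name(), device=pick_device())`.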

Features Implemented

- Speech-to-text with word-level timestamps
- Speaker diarization (Daniel/AC identification)
- Automatic filler-word removal (uh, um, like, you know, etc.)
- One speaker per subtitle block
- One sentence per subtitle block
- Standard SRT output
- Direct download of the SRT file
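
The text-processing features above (filler removal, one speaker and one sentence per block, SRT output) can be illustrated with a short plain-Python sketch. It is not the implementation in app.py: the Word record, the filler sets, and the "Name: text" speaker prefix in each block are assumptions, and the input is assumed to be word-level timestamps already tagged with a mapped speaker name.

```python
import re
from dataclasses import dataclass
from typing import List


# Hypothetical input record: one word, its timing in seconds, and a mapped speaker name.
@dataclass
class Word:
    start: float
    end: float
    text: str
    speaker: str


# Single-word fillers. "like" is left out because it is often a real word;
# add it here to mirror the list in this guide.
FILLER_WORDS = {"uh", "um", "uhm", "erm"}
# Two-word fillers, matched naively on consecutive words.
FILLER_PHRASES = {("you", "know")}
SENTENCE_END = (".", "?", "!")


def _norm(text: str) -> str:
    """Lowercase a word and strip leading/trailing punctuation for matching."""
    return re.sub(r"^\W+|\W+$", "", text.lower())


def remove_fillers(words: List[Word]) -> List[Word]:
    """Drop filler words and filler phrases, keeping everything else in order."""
    kept: List[Word] = []
    i = 0
    while i < len(words):
        pair = tuple(_norm(w.text) for w in words[i:i + 2])
        if pair in FILLER_PHRASES:
            i += 2
        elif _norm(words[i].text) in FILLER_WORDS:
            i += 1
        else:
            kept.append(words[i])
            i += 1
    return kept


def group_blocks(words: List[Word]) -> List[List[Word]]:
    """Start a new block on a speaker change or after sentence-final punctuation."""
    blocks: List[List[Word]] = []
    for w in words:
        prev = blocks[-1][-1] if blocks else None
        if prev is None or prev.speaker != w.speaker or prev.text.endswith(SENTENCE_END):
            blocks.append([w])
        else:
            blocks[-1].append(w)
    return blocks


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words: List[Word]) -> str:
    """Render numbered SRT blocks, prefixing each with its speaker name."""
    out: List[str] = []
    for index, block in enumerate(group_blocks(remove_fillers(words)), start=1):
        out.append(str(index))
        out.append(f"{srt_timestamp(block[0].start)} --> {srt_timestamp(block[-1].end)}")
        out.append(f"{block[0].speaker}: " + " ".join(w.text for w in block))
        out.append("")  # A blank line ends each SRT block.
    return "\n".join(out)
```

Sentence splitting on ".", "?" and "!" is deliberately simple: an abbreviation such as "Dr." would also start a new block, and the "you know" match will remove legitimate uses of the phrase.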

Troubleshooting

If speaker names are swapped:

- Edit the identify_speaker() function in app.py.

If processing is slow:

- Change the Whisper model to medium or base.
- Check that GPU hardware is enabled in the Space settings.

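If pyannote fails to load:

- Verify that you accepted the user agreement at https://huggingface.co/pyannote/speaker-diarization-3.1
- Also accept the conditions for the segmentation model the 3.1 pipeline depends on, at https://huggingface.co/pyannote/segmentation-3.0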