---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.
## ✨ Features
### 📥 Multiple Input Methods

- Upload audio/video files directly
- Paste YouTube URLs
- Paste direct file URLs (HTTP/HTTPS)
### 🎯 Model Selection

- **Tiny**: Fastest processing (~150MB)
- **Small**: Balanced speed/accuracy (~450MB)
- **Medium**: Highest accuracy (~1.5GB)
### 🌍 Multi-Language Support

- Auto-detect language from 99+ languages
- Manual language selection available
- Optimized for English, Spanish, Chinese, French, German, and more
### 👥 Speaker Diarization (Optional)

- Identify different speakers in the audio
- Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
- Requires Hugging Face token
### 📄 Multiple Output Formats

- **SRT**: Standard subtitle format for video players
- **VTT**: WebVTT format for web players
- **TXT**: Plain text transcript
- **JSON**: Full data with word-level timestamps
## 🚀 Quick Start

1. Upload a file or paste a URL (YouTube or direct link)
2. Select model size (Small recommended for most cases)
3. Choose language (Auto-detect works great!)
4. Enable speaker diarization (optional, requires HF token)
5. Click Generate Transcription
6. Download your preferred format(s)
## 📁 Supported File Formats

### Audio Formats

- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

### Video Formats

- MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.
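The extraction step is handled by FFmpeg (see the Technical Stack below). As a minimal sketch of what such an invocation looks like, the helper below builds an FFmpeg command line that strips the video stream and resamples to the 16 kHz mono audio Whisper expects; `extract_audio_cmd` and its exact flags are illustrative, not taken from this Space's app.py.

```python
# Sketch: build an FFmpeg command that extracts mono 16 kHz WAV audio
# from a video (or audio) file. Helper name and flags are illustrative.

def extract_audio_cmd(src: str, dst: str = "audio.wav") -> list[str]:
    return [
        "ffmpeg",
        "-i", src,        # input video or audio file
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz (Whisper's native rate)
        "-y", dst,        # overwrite the output if it exists
    ]

# Run it with, e.g.:
# import subprocess
# subprocess.run(extract_audio_cmd("clip.mp4"), check=True)
```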
## 🔧 Advanced Features

### Large File Handling

- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
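Preserving timestamps across chunks just means adding each chunk's offset back onto its segment times. A minimal sketch of that bookkeeping, assuming Whisper-style `{"start", "end", "text"}` segment dicts; `merge_chunk_segments` and the fixed `CHUNK_SEC` constant are illustrative, not this Space's actual implementation:

```python
# Sketch: shift per-chunk segment timestamps back onto the global timeline.
CHUNK_SEC = 30 * 60  # 30-minute chunks, as described above

def merge_chunk_segments(chunks: list[list[dict]]) -> list[dict]:
    merged = []
    for i, segments in enumerate(chunks):
        offset = i * CHUNK_SEC  # seconds elapsed before this chunk starts
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```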
### Speaker Diarization

To enable speaker diarization:

1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept the terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
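A minimal sketch of how the token is picked up at runtime (Space secrets are exposed to the app as environment variables). The `get_hf_token` helper is illustrative; the commented-out pipeline call follows the documented pyannote.audio 3.x loading pattern and is left unexecuted here because it downloads model weights and requires accepting the model terms:

```python
import os

# Sketch: resolve the diarization token from the environment or Space secret.
def get_hf_token():
    return os.environ.get("HF_TOKEN")  # None if diarization is not configured

# Loading the diarization pipeline (requires pyannote.audio and the token):
#
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization-3.1",
#     use_auth_token=get_hf_token(),
# )
```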
### API Usage

This Space provides a public API endpoint. You can use it programmatically:

```python
from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")
result = client.predict(
    file_input=None,  # Or file path
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)
# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
```
## 📊 Model Comparison
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Tiny | 150MB | ~0.1x RT | Good | Quick drafts |
| Small | 450MB | ~0.3x RT | Better | Most use cases |
| Medium | 1.5GB | ~0.5x RT | Best | Production subtitles |
RT = Realtime (0.1x RT means 10min audio processes in 1min)
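The RT factors in the table translate directly into a processing-time estimate: multiply the audio length by the model's factor. A small sketch of that arithmetic (the table's approximate factors are hard-coded for illustration):

```python
# Approximate realtime factors from the comparison table above.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}

def estimated_minutes(audio_minutes: float, model: str) -> float:
    """Rough processing time in minutes for a given model size."""
    return audio_minutes * RT_FACTOR[model]

# e.g. 10 minutes of audio with "tiny" -> about 1 minute of processing
```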
## 🎯 Use Cases

- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts
## 🛠️ Technical Stack

- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface
## 📝 Output Examples

### SRT Format

```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```
### JSON Format

```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
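As a sketch of how JSON segments like those above map onto the SRT output, the following illustrative helpers (not this Space's actual app.py code) format seconds as SRT timestamps and assemble numbered subtitle blocks:

```python
# Sketch: convert Whisper-style segment dicts into SRT text.

def format_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build numbered SRT blocks, with optional [SPEAKER_XX] labels."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        speaker = f"[{seg['speaker']}]: " if "speaker" in seg else ""
        blocks.append(
            f"{i}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{speaker}{seg['text']}\n"
        )
    return "\n".join(blocks)
```

The VTT output differs mainly in using `.` instead of `,` as the millisecond separator.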
## ⚠️ Limitations

- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires an HF token and adds processing time
- YouTube downloads depend on availability and region restrictions
## 🤝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.
## 📜 License

MIT License - Feel free to use for personal or commercial projects.
## 🙏 Credits

Made with ❤️ using OpenAI Whisper and Gradio