---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.

## ✨ Features

- **📁 Multiple Input Methods**
  - Upload audio/video files directly
  - Paste YouTube URLs
  - Paste direct file URLs (HTTP/HTTPS)

- **🎯 Model Selection**
  - **Tiny**: Fastest processing (~150MB)
  - **Small**: Balanced speed/accuracy (~450MB)
  - **Medium**: Highest accuracy (~1.5GB)

- **🌍 Multi-Language Support**
  - Auto-detect language from 99+ languages
  - Manual language selection available
  - Optimized for English, Spanish, Chinese, French, German, and more

- **👥 Speaker Diarization (Optional)**
  - Identify different speakers in the audio
  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
  - Requires Hugging Face token

- **📝 Multiple Output Formats**
  - **SRT**: Standard subtitle format for video players
  - **VTT**: WebVTT format for web players
  - **TXT**: Plain text transcript
  - **JSON**: Full data with word-level timestamps

## 🚀 Quick Start

1. **Upload a file** or **paste a URL** (YouTube or direct link)
2. **Select model size** (Small recommended for most cases)
3. **Choose language** (Auto-detect works great!)
4. **Enable speaker diarization** (optional, requires HF token)
5. Click **Generate Transcription**
6. **Download** your preferred format(s)

## 📋 Supported File Formats

### Audio Formats
- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

### Video Formats
- MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.

## 🔧 Advanced Features

### Large File Handling
- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)

### Speaker Diarization
To enable speaker diarization:
1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret

### API Usage
This Space provides a public API endpoint. You can use it programmatically:

```python
from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")
result = client.predict(
    file_input=None,  # Or file path
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)

# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
```

## 📊 Model Comparison

| Model  | Size   | Speed    | Accuracy | Best For              |
|--------|--------|----------|----------|-----------------------|
| Tiny   | 150MB  | ~0.1x RT | Good     | Quick drafts          |
| Small  | 450MB  | ~0.3x RT | Better   | Most use cases        |
| Medium | 1.5GB  | ~0.5x RT | Best     | Production subtitles  |

*RT = Realtime (0.1x RT means 10min audio processes in 1min)*

## 🎯 Use Cases

- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts

## 🛠️ Technical Stack

- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface

## 📝 Output Examples

### SRT Format
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```

### JSON Format
```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```

## ⚠️ Limitations

- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires HF token and adds processing time
- YouTube download depends on availability and region restrictions

## 🤝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.

## 📄 License

MIT License - Feel free to use for personal or commercial projects.

## 🙏 Credits

- [OpenAI Whisper](https://github.com/openai/whisper)
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [Gradio](https://gradio.app)
- [Hugging Face](https://huggingface.co)

---

Made with ❤️ using OpenAI Whisper and Gradio
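
## 🧩 Appendix: Converting JSON Output to SRT

The SRT and JSON examples in this README describe the same segments in two layouts, so it can help to see how one maps onto the other. The sketch below is not the Space's own conversion code (which is not shown here). It only assumes that the JSON file has a `segments` list whose entries carry `start`, `end`, `text`, and an optional `speaker`, as in the JSON example. The function names `format_srt_time` and `segments_to_srt` are hypothetical.

```python
import json


def format_srt_time(seconds: float) -> str:
    """Convert a time in seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Build SRT text from segment dicts (assumed keys: start, end, text, speaker)."""
    cues = []
    for number, seg in enumerate(segments, start=1):  # SRT cue numbers start at 1
        text = seg["text"].strip()
        speaker = seg.get("speaker")  # only present when diarization was enabled
        if speaker:
            text = f"[{speaker}]: {text}"
        timing = f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}"
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues


if __name__ == "__main__":
    # Sample data mirroring the JSON and SRT examples in this README.
    sample = json.loads("""
    {
      "segments": [
        {"index": 0, "start": 0.0, "end": 2.5,
         "text": "Hello and welcome to the show.", "speaker": "SPEAKER_00"},
        {"index": 1, "start": 2.5, "end": 5.0,
         "text": "Thanks for having me!", "speaker": "SPEAKER_01"}
      ]
    }
    """)
    print(segments_to_srt(sample["segments"]))
```

Working in integer milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds. If the downloaded JSON uses different key names, adjust the lookups accordingly.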