| title: Whisper Transcriber | |
| emoji: 🎤 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 6.2.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| # 🎤 Whisper Transcriber | |
| Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper. | |
| ## ✨ Features | |
| - **Multiple Input Methods** | |
|   - Upload audio/video files directly | |
|   - Paste YouTube URLs | |
|   - Paste direct file URLs (HTTP/HTTPS) | |
| - **🎯 Model Selection** | |
|   - **Tiny**: Fastest processing (~150MB) | |
|   - **Small**: Balanced speed/accuracy (~450MB) | |
|   - **Medium**: Highest accuracy (~1.5GB) | |
| - **Multi-Language Support** | |
|   - Auto-detect the spoken language (Whisper covers roughly 99 languages) | |
|   - Manual language selection available | |
|   - Optimized for English, Spanish, Chinese, French, German, and more | |
| - **👥 Speaker Diarization (Optional)** | |
|   - Identify different speakers in the audio | |
|   - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.) | |
|   - Requires a Hugging Face token | |
| - **Multiple Output Formats** | |
|   - **SRT**: Standard subtitle format for video players | |
|   - **VTT**: WebVTT format for web players | |
|   - **TXT**: Plain text transcript | |
|   - **JSON**: Full data with word-level timestamps | |
| ## Quick Start | |
| 1. **Upload a file** or **paste a URL** (YouTube or direct link) | |
| 2. **Select model size** (Small recommended for most cases) | |
| 3. **Choose language** (auto-detect works well for most recordings) | |
| 4. **Enable speaker diarization** (optional, requires HF token) | |
| 5. Click **Generate Transcription** | |
| 6. **Download** your preferred format(s) | |
| ## Supported File Formats | |
| ### Audio Formats | |
| - MP3, WAV, M4A, FLAC, AAC, OGG, WMA | |
| ### Video Formats | |
| - MP4, AVI, MKV, MOV, WMV, FLV, WebM | |
| Audio is automatically extracted from video files. | |
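
The extraction step can be pictured with the sketch below. It is not the Space's actual code: the helper names are hypothetical, and the mono 16 kHz WAV target is an assumption based on the sample rate Whisper works with. The FFmpeg options used (`-y`, `-i`, `-vn`, `-ac`, `-ar`) are standard.

```python
from pathlib import Path

# Video extensions taken from the list above.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm"}


def is_video(path: str) -> bool:
    """Return True when the file extension is one of the supported video formats."""
    return Path(path).suffix.lower() in VIDEO_EXTENSIONS


def build_extract_command(source: str, target: str) -> list[str]:
    """Build an FFmpeg argument list that writes the audio track of `source` to `target`."""
    return [
        "ffmpeg",
        "-y",            # overwrite the target if it already exists
        "-i", source,    # input file
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to mono (assumption)
        "-ar", "16000",  # resample to 16 kHz (assumption)
        target,
    ]
```

With FFmpeg installed, the command could be run as `subprocess.run(build_extract_command("talk.mp4", "talk.wav"), check=True)`.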
| ## 🔧 Advanced Features | |
| ### Large File Handling | |
| - Files are automatically chunked into 30-minute segments | |
| - Timestamps are preserved across chunks | |
| - Maximum file size: ~1GB (can be increased) | |
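
A minimal sketch of how chunking can keep timestamps consistent, assuming each chunk is transcribed separately and reports times relative to its own start. The function names are illustrative and are not taken from the Space's source.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(duration: float, chunk: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration) into consecutive (start, end) windows of at most `chunk` seconds."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        start = end
    return bounds


def shift_segments(segments: list[dict], offset: float) -> list[dict]:
    """Move chunk-relative segment times onto the timeline of the full file."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]
```

Segments from the second chunk would be shifted by 1800 seconds, those from the third by 3600 seconds, and so on.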
| ### Speaker Diarization | |
| To enable speaker diarization: | |
| 1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) | |
| 2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) | |
| 3. Set `HF_TOKEN` as an environment variable or Space secret | |
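
The sketch below shows one way a token check and speaker labeling could fit together. It assumes the diarization model yields speaker turns as `(start, end, label)` tuples and gives each transcript segment the label of the turn it overlaps most. Both helpers are hypothetical and may differ from what the Space does internally.

```python
import os
from typing import Mapping, Optional


def hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the HF_TOKEN value, or None when it is unset or empty."""
    return env.get("HF_TOKEN") or None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[tuple[float, float, str]]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = None, 0.0
        for t_start, t_end, label in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        # Segments that overlap no turn keep speaker=None.
        labeled.append({**seg, "speaker": best_label})
    return labeled
```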
| ### API Usage | |
| This Space exposes a public API endpoint that you can call with `gradio_client`: | |
| ```python | |
| from gradio_client import Client | |
| client = Client("xTHExBEASTx/Whisper-Transcriber") | |
| result = client.predict( | |
|     file_input=None,  # or a local file wrapped with gradio_client.handle_file(...) | |
|     url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ", | |
|     model_size="small", | |
|     language="auto", | |
|     enable_diarization=False, | |
|     api_name="/predict", | |
| ) | |
| # result is a tuple: (preview, srt_file, vtt_file, txt_file, json_file) | |
| ``` | |
| ## Model Comparison | |
| | Model | Size | Speed | Accuracy | Best For | | |
| |--------|--------|----------|----------|-----------------------| | |
| | Tiny | 150MB | ~0.1x RT | Good | Quick drafts | | |
| | Small | 450MB | ~0.3x RT | Better | Most use cases | | |
| | Medium | 1.5GB | ~0.5x RT | Best | Production subtitles | | |
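| *RT = real-time factor, i.e. processing time divided by audio length (0.1x RT means 10 minutes of audio take about 1 minute to process). Actual speed depends on the hardware.* | |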
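
As a worked example, the RT factors from the table translate into rough processing-time estimates as follows. The figures are the approximate values listed above, not measurements, and `estimate_minutes` is an illustrative helper.

```python
# Approximate RT factors copied from the comparison table.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_minutes(audio_minutes: float, model: str) -> float:
    """Estimate processing time in minutes as audio length times the RT factor."""
    return audio_minutes * RT_FACTOR[model]


for name in RT_FACTOR:
    print(f"{name}: about {estimate_minutes(60, name):.0f} min for a 60 min recording")
```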
| ## 🎯 Use Cases | |
| - **Content Creators**: Generate subtitles for videos | |
| - **Podcasters**: Create transcripts for episodes | |
| - **Researchers**: Transcribe interviews and recordings | |
| - **Accessibility**: Add captions to media content | |
| - **Language Learning**: Study with accurate transcripts | |
| ## 🛠️ Technical Stack | |
| - **Whisper**: OpenAI's speech recognition model | |
| - **Pyannote.audio**: Speaker diarization | |
| - **FFmpeg**: Audio/video processing | |
| - **yt-dlp**: YouTube download support | |
| - **Gradio**: Web interface | |
| ## Output Examples | |
| ### SRT Format | |
| ``` | |
| 1 | |
| 00:00:00,000 --> 00:00:02,500 | |
| [SPEAKER_00]: Hello and welcome to the show. | |
| 2 | |
| 00:00:02,500 --> 00:00:05,000 | |
| [SPEAKER_01]: Thanks for having me! | |
| ``` | |
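
For reference, SRT cues like the ones above can be produced from timed segments with a few lines of Python. This is a sketch rather than the Space's exporter: the segment dictionaries mirror the JSON output (`start`, `end`, `text`, optional `speaker`), and cues are separated by a blank line as the SRT format expects.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker label when present."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        text = f"[{speaker}]: {seg['text']}" if speaker else seg["text"]
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)
```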
| ### JSON Format | |
| ```json | |
| { | |
|   "text": "Full transcript here...", | |
|   "language": "en", | |
|   "segments": [ | |
|     { | |
|       "index": 0, | |
|       "start": 0.0, | |
|       "end": 2.5, | |
|       "text": "Hello and welcome to the show.", | |
|       "speaker": "SPEAKER_00" | |
|     } | |
|   ] | |
| } | |
| ``` | |
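
A downloaded JSON file can be post-processed with the standard library. The sketch below assumes the structure shown above and merges consecutive segments from the same speaker into one line. `to_plain_text` is an illustrative helper, and the second and third segments in `SAMPLE` are made up for the example.

```python
import json

SAMPLE = """
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {"index": 0, "start": 0.0, "end": 2.5, "text": "Hello and welcome to the show.", "speaker": "SPEAKER_00"},
    {"index": 1, "start": 2.5, "end": 4.0, "text": "Today we talk about audio.", "speaker": "SPEAKER_00"},
    {"index": 2, "start": 4.0, "end": 6.0, "text": "Thanks for having me!", "speaker": "SPEAKER_01"}
  ]
}
"""


def to_plain_text(payload: dict) -> str:
    """Join segments into lines, merging consecutive segments from the same speaker."""
    lines = []
    previous = None
    for seg in payload["segments"]:
        speaker = seg.get("speaker")
        text = seg["text"].strip()
        if lines and speaker is not None and speaker == previous:
            lines[-1] += " " + text
        else:
            prefix = f"{speaker}: " if speaker else ""
            lines.append(prefix + text)
        previous = speaker
    return "\n".join(lines)


print(to_plain_text(json.loads(SAMPLE)))
```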
| ## ⚠️ Limitations | |
| - Maximum file size: ~1GB (adjustable) | |
| - Processing time depends on model size and file length | |
| - Speaker diarization requires HF token and adds processing time | |
| - YouTube download depends on availability and region restrictions | |
| ## 🤝 Contributing | |
| Found a bug or have a feature request? Please open an issue on the repository. | |
| ## License | |
| MIT License - Feel free to use for personal or commercial projects. | |
| ## Credits | |
| - [OpenAI Whisper](https://github.com/openai/whisper) | |
| - [Pyannote.audio](https://github.com/pyannote/pyannote-audio) | |
| - [Gradio](https://gradio.app) | |
| - [Hugging Face](https://huggingface.co) | |
| --- | |
| Made with ❤️ using OpenAI Whisper and Gradio |