---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 Whisper Transcriber
Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.
## ✨ Features
- **Multiple Input Methods**
  - Upload audio/video files directly
  - Paste YouTube URLs
  - Paste direct file URLs (HTTP/HTTPS)
- **🎯 Model Selection**
  - **Tiny**: Fastest processing (~150MB)
  - **Small**: Balanced speed/accuracy (~450MB)
  - **Medium**: Highest accuracy (~1.5GB)
- **Multi-Language Support**
  - Auto-detect language from 99+ languages
  - Manual language selection available
  - Optimized for English, Spanish, Chinese, French, German, and more
- **👥 Speaker Diarization (Optional)**
  - Identify different speakers in the audio
  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
  - Requires Hugging Face token
- **Multiple Output Formats**
  - **SRT**: Standard subtitle format for video players
  - **VTT**: WebVTT format for web players
  - **TXT**: Plain text transcript
  - **JSON**: Full data with word-level timestamps
## Quick Start
1. **Upload a file** or **paste a URL** (YouTube or direct link)
2. **Select model size** (Small recommended for most cases)
3. **Choose language** (Auto-detect works great!)
4. **Enable speaker diarization** (optional, requires HF token)
5. Click **Generate Transcription**
6. **Download** your preferred format(s)
## Supported File Formats
### Audio Formats
- MP3, WAV, M4A, FLAC, AAC, OGG, WMA
### Video Formats
- MP4, AVI, MKV, MOV, WMV, FLV, WebM
Audio is automatically extracted from video files.
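Extraction like this is commonly done with FFmpeg, which is part of this project's stack. The sketch below is an assumption about how it could work, not this Space's actual code: `build_extract_command` and `extract_audio` are hypothetical helpers, and the 16 kHz mono WAV target is chosen because Whisper operates on 16 kHz audio.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Build an FFmpeg command that drops the video stream and writes 16 kHz mono audio."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video file
        "-vn",             # discard the video stream
        "-ac", "1",        # downmix to a single (mono) channel
        "-ar", "16000",    # resample to 16 kHz
        audio_path,
    ]


def extract_audio(video_path: str) -> str:
    """Run FFmpeg (must be on PATH) and return the path of the extracted WAV file."""
    audio_path = str(Path(video_path).with_suffix(".wav"))
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
    return audio_path
```

Keeping command construction separate from execution makes the command easy to inspect or log before anything runs.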
## 🔧 Advanced Features
### Large File Handling
- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
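Preserving timestamps across chunks amounts to shifting each chunk's segment times by the chunk's start offset. The following is a minimal sketch under assumed names (`chunk_bounds`, `merge_chunk_segments`) and an assumed segment shape with `start`/`end` in seconds. The Space's real implementation may differ.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) pairs, in seconds, that cover the whole recording."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def merge_chunk_segments(chunks):
    """Merge per-chunk segments into one list with file-level timestamps.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment's times are relative to the start of its own chunk.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            merged.append({
                **seg,
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
            })
    return merged
```

For example, a segment at 0.5 s inside the second chunk (which starts at 1800 s) ends up at 1800.5 s in the merged transcript.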
### Speaker Diarization
To enable speaker diarization:
1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
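Space secrets are exposed to the running app as environment variables, so either option in step 3 can be read the same way. A minimal sketch, assuming hypothetical helper names (`get_hf_token`, `diarization_available`). How this Space actually reads the token is not shown here.

```python
import os


def get_hf_token():
    """Return the Hugging Face token from HF_TOKEN, or None if it is unset or blank."""
    token = os.environ.get("HF_TOKEN", "").strip()
    return token or None


def diarization_available() -> bool:
    """Diarization can only be offered when a token is present."""
    return get_hf_token() is not None
```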
### API Usage
This Space provides a public API endpoint. You can use it programmatically:
```python
from gradio_client import Client

# Connect to the hosted Space
client = Client("xTHExBEASTx/Whisper-Transcriber")

result = client.predict(
    file_input=None,  # or gradio_client.handle_file("path/to/audio.mp3") for a local file
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict",
)
# result is a tuple: (preview, srt_file, vtt_file, txt_file, json_file)
```
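Because the outputs come back positionally, it can help to give them names. The helper below is a hypothetical convenience, not part of the Space or of `gradio_client`. It assumes the five-element order listed in the comment in the example above.

```python
OUTPUT_NAMES = ("preview", "srt_file", "vtt_file", "txt_file", "json_file")


def label_outputs(result):
    """Pair each positional output with its name; fail loudly on a shape mismatch."""
    if len(result) != len(OUTPUT_NAMES):
        raise ValueError(f"expected {len(OUTPUT_NAMES)} outputs, got {len(result)}")
    return dict(zip(OUTPUT_NAMES, result))
```

With the client example, `label_outputs(result)["srt_file"]` would then refer to the SRT output. File outputs from `gradio_client` are typically local paths to downloaded copies.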
## Model Comparison
| Model | Size | Speed | Accuracy | Best For |
|--------|--------|----------|----------|-----------------------|
| Tiny | 150MB | ~0.1x RT | Good | Quick drafts |
| Small | 450MB | ~0.3x RT | Better | Most use cases |
| Medium | 1.5GB | ~0.5x RT | Best | Production subtitles |
*RT = Realtime (0.1x RT means 10min audio processes in 1min)*
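The realtime factors in the table translate into a rough processing-time estimate. This is a small illustrative sketch. `RT_FACTOR` simply copies the approximate values above, and real throughput depends on the hardware the Space runs on.

```python
# Approximate realtime factors copied from the table above
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_processing_minutes(audio_minutes: float, model_size: str) -> float:
    """Estimate processing time as audio length multiplied by the model's RT factor."""
    return audio_minutes * RT_FACTOR[model_size]
```

For instance, `estimate_processing_minutes(10, "tiny")` gives about 1 minute, matching the note under the table, while a 60-minute recording with the medium model comes out at roughly 30 minutes.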
## 🎯 Use Cases
- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts
## 🛠️ Technical Stack
- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface
## Output Examples
### SRT Format
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```
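Output in this shape can be produced from a list of segments with a few lines of code. The sketch below is illustrative only: `srt_timestamp` and `to_srt` are hypothetical names, and the segment fields (`start`, `end`, `text`, optional `speaker`) are assumed to match the JSON output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        # Prefix the speaker label only when diarization supplied one
        text = f"[{speaker}]: {seg['text']}" if speaker else seg["text"]
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)
```

Working in integer milliseconds avoids floating-point drift when splitting the time into hours, minutes, and seconds.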
### JSON Format
```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
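The JSON output is the easiest one to post-process. As a hedged example, the sketch below converts JSON of the shape shown above into WebVTT cues. `json_to_vtt` and `vtt_timestamp` are hypothetical helpers, and the use of WebVTT voice tags (`<v ...>`) for speakers is an assumption, not necessarily how this Space writes its own VTT files.

```python
import json


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a period before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def json_to_vtt(json_text: str) -> str:
    """Convert the transcript JSON into a WebVTT document."""
    data = json.loads(json_text)
    lines = ["WEBVTT", ""]  # required header, then a blank line
    for seg in data["segments"]:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        speaker = seg.get("speaker")
        lines.append(f"<v {speaker}>{seg['text']}" if speaker else seg["text"])
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```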
## ⚠️ Limitations
- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires HF token and adds processing time
- YouTube download depends on availability and region restrictions
## Contributing
Found a bug or have a feature request? Please open an issue on the repository.
## License
MIT License - Feel free to use for personal or commercial projects.
## Credits
- [OpenAI Whisper](https://github.com/openai/whisper)
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [Gradio](https://gradio.app)
- [Hugging Face](https://huggingface.co)
---
Made with ❤️ using OpenAI Whisper and Gradio