---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 Whisper Transcriber
Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.
## ✨ Features
- **📁 Multiple Input Methods**
  - Upload audio/video files directly
  - Paste YouTube URLs
  - Paste direct file URLs (HTTP/HTTPS)
- **🎯 Model Selection**
  - **Tiny**: Fastest processing (~150MB)
  - **Small**: Balanced speed/accuracy (~450MB)
  - **Medium**: Highest accuracy (~1.5GB)
- **🌍 Multi-Language Support**
  - Auto-detect language from 99+ languages
  - Manual language selection available
  - Optimized for English, Spanish, Chinese, French, German, and more
- **👥 Speaker Diarization (Optional)**
  - Identify different speakers in the audio
  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
  - Requires Hugging Face token
- **📝 Multiple Output Formats**
  - **SRT**: Standard subtitle format for video players
  - **VTT**: WebVTT format for web players
  - **TXT**: Plain text transcript
  - **JSON**: Full data with word-level timestamps
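SRT and VTT cue timings differ mainly in the decimal separator: SRT writes `00:00:02,500`, WebVTT writes `00:00:02.500`. The following is a minimal sketch of that conversion from seconds; `format_timestamp` is a hypothetical helper for illustration, not necessarily how this Space formats its output.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Format a time in seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


srt_time = format_timestamp(2.5)       # SRT style, comma separator
vtt_time = format_timestamp(2.5, ".")  # WebVTT style, period separator
print(srt_time, vtt_time)
```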
## 🚀 Quick Start
1. **Upload a file** or **paste a URL** (YouTube or direct link)
2. **Select model size** (Small recommended for most cases)
3. **Choose language** (Auto-detect works great!)
4. **Enable speaker diarization** (optional, requires HF token)
5. Click **Generate Transcription**
6. **Download** your preferred format(s)
## 📋 Supported File Formats
### Audio Formats
- MP3, WAV, M4A, FLAC, AAC, OGG, WMA
### Video Formats
- MP4, AVI, MKV, MOV, WMV, FLV, WebM
Audio is automatically extracted from video files.
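The exact FFmpeg invocation used by this Space is not shown here. As a rough sketch, audio extraction for Whisper is commonly done by dropping the video stream and resampling to 16 kHz mono; `build_extract_command` and the file names are hypothetical.

```python
def build_extract_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an FFmpeg command that writes the audio track of `src` to `dst`."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src,               # input audio or video file
        "-vn",                   # drop any video stream
        "-ac", "1",              # downmix to mono
        "-ar", str(sample_rate), # resample (16 kHz assumed here)
        dst,
    ]


cmd = build_extract_command("talk.mp4", "talk.wav")
print(" ".join(cmd))
# To actually run it (requires FFmpeg on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```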
## 🔧 Advanced Features
### Large File Handling
- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
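To illustrate how timestamps can stay correct across chunks, here is a minimal sketch: each chunk is transcribed on its own, then its segment times are shifted by the chunk's start offset. The helper names are hypothetical and the Space's own chunking code may differ.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(duration: float, chunk_seconds: float = CHUNK_SECONDS):
    """Return (start, end) pairs in seconds covering the whole file."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


def shift_segments(segments, offset: float):
    """Move chunk-relative segment times onto the full-file timeline."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]


# A segment found 2.5 s into the second chunk starts at 1802.5 s overall.
second_chunk = [{"start": 2.5, "end": 5.0, "text": "Thanks for having me!"}]
print(chunk_bounds(4000))
print(shift_segments(second_chunk, offset=1800.0))
```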
### Speaker Diarization
To enable speaker diarization:
1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
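When running the app outside of Spaces, a quick check like the one below can confirm the token is visible to the process before diarization is attempted. This is only a sketch; `get_hf_token` is a hypothetical helper and the app may read the variable differently.

```python
import os


def get_hf_token(env=None):
    """Return the HF_TOKEN value, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if get_hf_token() is None:
    print("HF_TOKEN is not set; speaker diarization will be unavailable.")
else:
    print("HF_TOKEN found; speaker diarization can be enabled.")
```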
### API Usage
This Space provides a public API endpoint. You can use it programmatically:
```python
from gradio_client import Client, handle_file

client = Client("xTHExBEASTx/Whisper-Transcriber")

# Pass either a local file or a URL, not both.
result = client.predict(
    file_input=None,  # or handle_file("path/to/audio.mp3") for a local file
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict",
)

# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
```
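Once the call returns, the five-element tuple can be unpacked and the generated files copied somewhere permanent. The sketch below assumes the file entries come back as local temporary paths (or `None` when a format was not produced); `save_outputs` is a hypothetical helper, not part of this Space.

```python
import shutil
from pathlib import Path


def save_outputs(result, dest_dir):
    """Copy the returned subtitle/transcript files into dest_dir.

    `result` is assumed to be (preview, srt_file, vtt_file, txt_file, json_file).
    Returns the preview text and the list of copied file paths.
    """
    preview, *files = result
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for path in files:
        if path:  # skip formats that were not returned
            saved.append(Path(shutil.copy(path, dest)))
    return preview, saved


# Example (after client.predict has returned `result`):
#   preview, saved = save_outputs(result, "transcripts")
```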
## 📊 Model Comparison
| Model | Size | Speed | Accuracy | Best For |
|--------|--------|----------|----------|-----------------------|
| Tiny | 150MB | ~0.1x RT | Good | Quick drafts |
| Small | 450MB | ~0.3x RT | Better | Most use cases |
| Medium | 1.5GB | ~0.5x RT | Best | Production subtitles |
*RT = real time. The figure is processing time as a fraction of audio length, so 0.1x RT means 10 minutes of audio takes about 1 minute to process.*
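As a worked example of the RT arithmetic, the sketch below multiplies audio length by the approximate factors from the table. Actual speed depends on hardware, so treat the results as rough estimates; `estimate_minutes` is a hypothetical helper.

```python
# Approximate real-time factors taken from the table above.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_minutes(audio_minutes: float, model: str) -> float:
    """Estimate processing time in minutes for a given model size."""
    return audio_minutes * RT_FACTOR[model]


for model in RT_FACTOR:
    print(f"{model}: ~{estimate_minutes(60, model):.0f} min for 60 min of audio")
```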
## 🎯 Use Cases
- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts
## 🛠️ Technical Stack
- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface
## 📝 Output Examples
### SRT Format
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```
### JSON Format
```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
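The JSON output is convenient for post-processing. As one hedged example, the sketch below totals speaking time per speaker from the `segments` list; it assumes only the fields shown above, and `speaking_time` is a hypothetical helper.

```python
import json
from collections import defaultdict


def speaking_time(transcript: dict) -> dict:
    """Sum segment durations (seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in transcript["segments"]:
        # Segments have no "speaker" key when diarization is disabled.
        speaker = seg.get("speaker", "UNKNOWN")
        totals[speaker] += seg["end"] - seg["start"]
    return dict(totals)


sample = json.loads("""
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {"index": 0, "start": 0.0, "end": 2.5,
     "text": "Hello and welcome to the show.", "speaker": "SPEAKER_00"}
  ]
}
""")
print(speaking_time(sample))
```

In practice, `sample` would be loaded from the downloaded file with `json.load(open(path))`.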
## ⚠️ Limitations
- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires HF token and adds processing time
- YouTube download depends on availability and region restrictions
## 🤝 Contributing
Found a bug or have a feature request? Please open an issue on the repository.
## 📄 License
MIT License - Feel free to use for personal or commercial projects.
## 🙏 Credits
- [OpenAI Whisper](https://github.com/openai/whisper)
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [Gradio](https://gradio.app)
- [Hugging Face](https://huggingface.co)
---
Made with ❤️ using OpenAI Whisper and Gradio