---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.

## ✨ Features

- **📁 Multiple Input Methods**
  - Upload audio/video files directly
  - Paste YouTube URLs
  - Paste direct file URLs (HTTP/HTTPS)

- **🎯 Model Selection**
  - **Tiny**: Fastest processing (~150MB)
  - **Small**: Balanced speed/accuracy (~450MB)
  - **Medium**: Highest accuracy (~1.5GB)

- **🌍 Multi-Language Support**
  - Auto-detects the spoken language (Whisper covers roughly 99 languages)
  - Manual language selection available
  - Optimized for English, Spanish, Chinese, French, German, and more

- **👥 Speaker Diarization (Optional)**
  - Identify different speakers in the audio
  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
  - Requires a Hugging Face token

- **📝 Multiple Output Formats**
  - **SRT**: Standard subtitle format for video players
  - **VTT**: WebVTT format for web players
  - **TXT**: Plain text transcript
  - **JSON**: Full data with word-level timestamps

## 🚀 Quick Start

1. **Upload a file** or **paste a URL** (YouTube or direct link)
2. **Select model size** (Small recommended for most cases)
3. **Choose language** (Auto-detect works great!)
4. **Enable speaker diarization** (optional, requires HF token)
5. Click **Generate Transcription**
6. **Download** your preferred format(s)

## 📋 Supported File Formats

### Audio Formats
- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

### Video Formats
- MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.
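
The extraction step can be sketched with FFmpeg as follows. This is a minimal illustration, not the Space's actual code: the function names are hypothetical, and the mono 16 kHz WAV target is an assumption chosen because Whisper resamples its input to 16 kHz mono anyway.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str) -> list[str]:
    """Build an FFmpeg command that writes the audio track as mono 16 kHz WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # ignore the video stream
        "-ac", "1",      # downmix to one channel (assumed target)
        "-ar", "16000",  # resample to 16 kHz (assumed target)
        wav_path,
    ]


def extract_audio(video_path: str) -> str:
    """Run FFmpeg and return the path of the extracted WAV file."""
    wav_path = str(Path(video_path).with_suffix(".wav"))
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
    return wav_path
```

`extract_audio("talk.mp4")` would write `talk.wav` next to the input, provided `ffmpeg` is on the `PATH`.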

## 🔧 Advanced Features

### Large File Handling
- Long files are automatically split into 30-minute chunks
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
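
One way to keep timestamps consistent across chunks is to shift each chunk's segment times by the chunk's start offset. The sketch below assumes segments are dictionaries with `start`, `end`, and `text` keys, as in the JSON output; the helper names are hypothetical and the Space's own implementation may differ.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(duration: float, chunk: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) pairs in seconds that cover the whole file."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        start = end
    return bounds


def merge_chunks(chunk_results: list[tuple[float, list[dict]]]) -> list[dict]:
    """Merge per-chunk segments into one list with file-level timestamps.

    chunk_results holds (chunk_start, segments) pairs, where each segment's
    start/end are relative to the beginning of its chunk.
    """
    merged = []
    for chunk_start, segments in chunk_results:
        for seg in segments:
            merged.append({
                **seg,
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
                "index": len(merged),  # renumber across the whole file
            })
    return merged
```

For example, a segment that starts 0.5 s into the second chunk ends up at 1800.5 s in the merged transcript.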

### Speaker Diarization
To enable speaker diarization:
1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
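
As a rough illustration of what happens once the token is in place, the sketch below reads `HF_TOKEN` from the environment and then labels transcript segments with the speaker whose turn overlaps them most. Both helpers are hypothetical, and the `(start, end, speaker)` turn format is an assumption about how diarization output might be represented, not the Space's actual data model.

```python
import os


def get_hf_token() -> str:
    """Read the Hugging Face token from the environment, failing early if absent."""
    token = os.environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; speaker diarization is unavailable.")
    return token


def assign_speakers(segments: list[dict], turns: list[tuple[float, float, str]]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the longest.

    turns is a list of (start, end, speaker) tuples in seconds (assumed format).
    Segments that overlap no turn get speaker=None.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```

Picking the largest overlap is a simple heuristic; segments that span a speaker change will still receive a single label.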

### API Usage

This Space provides a public API endpoint. You can use it programmatically:

```python
from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")

result = client.predict(
    file_input=None,  # Or gradio_client.handle_file("audio.mp3") for a local file
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)

# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
```
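
To make the returned tuple easier to work with, it can be wrapped in a named structure. This is a small sketch based only on the output order listed in the comment above; the `TranscriptionResult` class and `unpack_result` helper are hypothetical, and it assumes the four file outputs arrive as local file paths.

```python
from typing import NamedTuple, Sequence


class TranscriptionResult(NamedTuple):
    preview: str    # text preview shown in the UI
    srt_file: str   # path to the .srt download (assumed to be a local path)
    vtt_file: str   # path to the .vtt download
    txt_file: str   # path to the .txt download
    json_file: str  # path to the .json download


def unpack_result(result: Sequence[str]) -> TranscriptionResult:
    """Validate the five-element result and give each output a name."""
    if len(result) != 5:
        raise ValueError(f"expected 5 outputs, got {len(result)}")
    return TranscriptionResult(*result)
```

With this in place, `unpack_result(result).srt_file` reads more clearly than `result[1]`.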

## 📊 Model Comparison

| Model  | Size   | Speed    | Accuracy | Best For              |
|--------|--------|----------|----------|-----------------------|
| Tiny   | 150MB  | ~0.1x RT | Good     | Quick drafts          |
| Small  | 450MB  | ~0.3x RT | Better   | Most use cases        |
| Medium | 1.5GB  | ~0.5x RT | Best     | Production subtitles  |

*RT = real time. Speed is processing time as a fraction of audio length, so 0.1x RT means 10 minutes of audio takes about 1 minute to process.*
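
The speed figures in the table translate into a rough processing-time estimate. The sketch below simply multiplies audio length by the table's approximate factors; the function name is hypothetical, and real timings will vary with hardware and audio content.

```python
# Approximate processing-time factors taken from the table above
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_processing_minutes(audio_minutes: float, model_size: str) -> float:
    """Estimate processing time in minutes for a given model size."""
    try:
        factor = RT_FACTOR[model_size]
    except KeyError:
        raise ValueError(f"unknown model size: {model_size!r}") from None
    return audio_minutes * factor
```

By these figures, a 60-minute recording would take roughly 18 minutes with Small and roughly 30 minutes with Medium.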

## 🎯 Use Cases

- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts

## 🛠️ Technical Stack

- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface

## 📝 Output Examples

### SRT Format
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```
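
For reference, output in this shape can be produced from a list of segments with a few lines of Python. This is a minimal sketch that assumes each segment is a dictionary with `start`, `end`, `text`, and an optional `speaker` key; the function names are hypothetical and this is not necessarily how the Space builds its files.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker label if present."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            text = f"[{speaker}]: {text}"
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{number}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

SRT cues are numbered from 1 and use a comma before the milliseconds, which is the main difference from WebVTT's dot separator.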

### JSON Format
```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
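
The JSON output lends itself to simple post-processing. As an illustrative sketch, the function below totals speaking time per speaker from a document shaped like the example above; the function name and the `"UNKNOWN"` fallback label for segments without a speaker are assumptions, not part of the Space's output.

```python
import json
from collections import defaultdict


def speaking_time(json_text: str) -> dict[str, float]:
    """Sum segment durations in seconds for each speaker label."""
    data = json.loads(json_text)
    totals: defaultdict[str, float] = defaultdict(float)
    for seg in data["segments"]:
        # Segments lack a speaker when diarization is disabled (fallback label is assumed)
        speaker = seg.get("speaker", "UNKNOWN")
        totals[speaker] += seg["end"] - seg["start"]
    return dict(totals)
```

Reading the downloaded file with `Path(json_file).read_text(encoding="utf-8")` and passing the text to `speaking_time` gives a quick per-speaker summary.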

## ⚠️ Limitations

- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires a Hugging Face token and adds processing time
- YouTube downloads depend on video availability and region restrictions

## 🤝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.

## 📄 License

MIT License - Feel free to use for personal or commercial projects.

## 🙏 Credits

- [OpenAI Whisper](https://github.com/openai/whisper)
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [Gradio](https://gradio.app)
- [Hugging Face](https://huggingface.co)

---

Made with ❤️ using OpenAI Whisper and Gradio