---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.

## ✨ Features

- **📁 Multiple Input Methods**
  - Upload audio/video files directly
  - Paste YouTube URLs
  - Paste direct file URLs (HTTP/HTTPS)

- **🎯 Model Selection**
  - **Tiny**: Fastest processing (~150MB)
  - **Small**: Balanced speed/accuracy (~450MB)
  - **Medium**: Highest accuracy (~1.5GB)

- **🌍 Multi-Language Support**
  - Auto-detect language from 99+ languages
  - Manual language selection available
  - Optimized for English, Spanish, Chinese, French, German, and more

- **👥 Speaker Diarization (Optional)**
  - Identify different speakers in the audio
  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
  - Requires Hugging Face token

- **📝 Multiple Output Formats**
  - **SRT**: Standard subtitle format for video players
  - **VTT**: WebVTT format for web players
  - **TXT**: Plain text transcript
  - **JSON**: Full data with word-level timestamps

## 🚀 Quick Start

1. **Upload a file** or **paste a URL** (YouTube or direct link)
2. **Select model size** (Small recommended for most cases)
3. **Choose language** (Auto-detect works great!)
4. **Enable speaker diarization** (optional, requires HF token)
5. Click **Generate Transcription**
6. **Download** your preferred format(s)

## 📋 Supported File Formats

### Audio Formats
- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

### Video Formats
- MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.
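
The extraction step can be pictured with a minimal sketch. It assumes FFmpeg is invoked as a subprocess and that audio is resampled to 16 kHz mono, which is the input Whisper models expect; the helper names `media_kind` and `ffmpeg_extract_command` are hypothetical and not taken from this Space's `app.py`.

```python
from pathlib import Path

# Extensions copied from the lists above.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg", ".wma"}


def media_kind(path: str) -> str:
    """Return "video", "audio", or "unsupported" based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "unsupported"


def ffmpeg_extract_command(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that drops the video stream (-vn) and
    writes mono (-ac 1) audio sampled at 16 kHz (-ar 16000)."""
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]
```

The returned list can be passed to `subprocess.run(..., check=True)`; building it as a list rather than a shell string avoids quoting problems with file names that contain spaces.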

## 🔧 Advanced Features

### Large File Handling
- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
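
As an illustration of how timestamps can survive chunking, here is a minimal sketch. It assumes each chunk is transcribed separately with timestamps relative to the chunk start, and that segments are dictionaries with `start` and `end` in seconds; `chunk_bounds` and `shift_segments` are hypothetical names, not the Space's actual functions.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(duration: float, chunk_seconds: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration) into consecutive (start, end) windows in seconds."""
    duration = float(duration)
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


def shift_segments(segments: list[dict], offset: float) -> list[dict]:
    """Return copies of the segments with start/end moved by the chunk offset,
    so chunk-relative times become positions in the full recording."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]
```

For a recording of 4,000 seconds this yields three windows, and a segment that starts 1 second into the second chunk ends up at 1,801 seconds in the merged transcript.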

### Speaker Diarization
To enable speaker diarization:
1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
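
A minimal sketch of the token check, assuming the app reads `HF_TOKEN` from the process environment (Space secrets are exposed as environment variables). The function names are hypothetical, and passing the token to pyannote is left out.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the HF_TOKEN value, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def diarization_available(enable_diarization: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Diarization can run only when it is requested and a token is present."""
    return enable_diarization and resolve_hf_token(env) is not None
```

Taking the environment as a parameter keeps the check easy to test without touching real secrets.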

### API Usage

This Space exposes a public API endpoint that you can call from Python with `gradio_client`:

```python
from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")

result = client.predict(
    file_input=None,  # or a local file, e.g. gradio_client.handle_file("audio.mp3")
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)

# result is a tuple: (preview, srt_file, vtt_file, txt_file, json_file)
```

## 📊 Model Comparison

| Model  | Size   | Speed    | Accuracy | Best For              |
|--------|--------|----------|----------|-----------------------|
| Tiny   | 150MB  | ~0.1x RT | Good     | Quick drafts          |
| Small  | 450MB  | ~0.3x RT | Better   | Most use cases        |
| Medium | 1.5GB  | ~0.5x RT | Best     | Production subtitles  |

*RT = real time (0.1x RT means 10 minutes of audio is processed in about 1 minute)*
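
The real-time factors can be turned into a rough time estimate. This is a back-of-the-envelope sketch that uses the approximate factors from the table; actual speed depends on hardware and on whether diarization is enabled, and `estimate_processing_minutes` is a hypothetical helper.

```python
# Approximate real-time factors copied from the table above.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_processing_minutes(audio_minutes: float, model_size: str) -> float:
    """Estimated processing time = audio length x real-time factor."""
    return audio_minutes * RT_FACTOR[model_size]
```

With these factors, a 60-minute recording comes out at roughly 6 minutes with Tiny, 18 with Small, and 30 with Medium.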

## 🎯 Use Cases

- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts

## 🛠️ Technical Stack

- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface

## 📝 Output Examples

### SRT Format
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```

### JSON Format
```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
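
The two formats carry the same information, so the JSON segments can be rendered as SRT cues. The sketch below assumes the segment fields shown in the JSON example (`start`, `end`, `text`, and an optional `speaker`); `srt_timestamp` and `segments_to_srt` are hypothetical helpers, not the Space's own code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker when present."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"[{seg['speaker']}]: {text}"
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{number}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Working in integer milliseconds avoids rounding drift when splitting a time into hours, minutes, and seconds. WebVTT output would differ mainly in using a period instead of a comma and in starting the file with a `WEBVTT` header line.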

## ⚠️ Limitations

- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires HF token and adds processing time
- YouTube download depends on availability and region restrictions

## 🤝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.

## 📄 License

MIT License - Feel free to use for personal or commercial projects.

## 🙏 Credits

- [OpenAI Whisper](https://github.com/openai/whisper)
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [Gradio](https://gradio.app)
- [Hugging Face](https://huggingface.co)

---

Made with ❤️ using OpenAI Whisper and Gradio