	---
	title: Whisper Transcriber
	emoji: 🎤
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 6.2.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# 🎤 Whisper Transcriber

	Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.

	## ✨ Features

	- 📁 Multiple Input Methods
	  - Upload audio/video files directly
	  - Paste YouTube URLs
	  - Paste direct file URLs (HTTP/HTTPS)

	- 🎯 Model Selection
	  - Tiny: Fastest processing (~150MB)
	  - Small: Balanced speed/accuracy (~450MB)
	  - Medium: Highest accuracy (~1.5GB)

	- 🌍 Multi-Language Support
	  - Auto-detect language from 99+ languages
	  - Manual language selection available
	  - Optimized for English, Spanish, Chinese, French, German, and more

	- 👥 Speaker Diarization (Optional)
	  - Identify different speakers in the audio
	  - Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
	  - Requires Hugging Face token

	- 📝 Multiple Output Formats
	  - SRT: Standard subtitle format for video players
	  - VTT: WebVTT format for web players
	  - TXT: Plain text transcript
	  - JSON: Full data with word-level timestamps

	## 🚀 Quick Start

	1. Upload a file or paste a URL (YouTube or direct link)
	2. Select model size (Small recommended for most cases)
	3. Choose a language (auto-detect works well for most recordings)
	4. Enable speaker diarization (optional, requires HF token)
	5. Click Generate Transcription
	6. Download your preferred format(s)
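
Step 1 accepts three kinds of input. How the app tells them apart is not documented here; the sketch below shows one plausible way to do it. The function name `classify_input` and the `YOUTUBE_HOSTS` set are hypothetical and are not taken from `app.py`.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube. This set is an assumption for illustration.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(file_path=None, url=None):
    """Return 'file', 'youtube', or 'direct_url' for the given input."""
    if file_path:
        # An uploaded file takes priority over any pasted URL.
        return "file"
    if not url:
        raise ValueError("Provide either a file or a URL")
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    host = (parsed.hostname or "").lower()
    return "youtube" if host in YOUTUBE_HOSTS else "direct_url"
```

A YouTube result would then go to yt-dlp, while a direct URL could be fetched with a plain HTTP download.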

	## 📋 Supported File Formats

	### Audio Formats
	- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

	### Video Formats
	- MP4, AVI, MKV, MOV, WMV, FLV, WebM

	Audio is automatically extracted from video files.
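
As a rough illustration of the extraction step, the helper below builds an FFmpeg command that drops the video stream and writes 16 kHz mono WAV, the sample rate Whisper models work with. The helper name `build_extract_command` is hypothetical, and the exact flags this Space uses may differ.

```python
def build_extract_command(src, dst, sample_rate=16000):
    """Build an ffmpeg argument list that writes mono PCM WAV audio from src."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # input audio or video file
        "-vn",                     # drop any video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # 16-bit PCM audio
        str(dst),
    ]
```

To run it, pass the list to `subprocess.run(build_extract_command("talk.mp4", "talk.wav"), check=True)`, which requires FFmpeg on the `PATH`.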

	## 🔧 Advanced Features

	### Large File Handling
	- Files are automatically chunked into 30-minute segments
	- Timestamps are preserved across chunks
	- Maximum file size: ~1GB (can be increased)
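
The chunking and timestamp handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Space's actual code: the function names and the segment dictionary layout (`start` and `end` in seconds, relative to the chunk) are hypothetical.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def merge_chunk_segments(chunks):
    """Shift each chunk's segment times by that chunk's start offset.

    chunks: list of (chunk_start_seconds, segments) pairs, where each segment
    is a dict with chunk-relative "start" and "end" times.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            merged.append({
                **seg,
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
            })
    return merged
```

A 75-minute file would yield three chunks (30, 30, and 15 minutes), and a segment at 0.5 s in the second chunk ends up at 1800.5 s in the merged transcript.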

	### Speaker Diarization
	To enable speaker diarization:
	1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
	2. Accept terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
	3. Set `HF_TOKEN` as an environment variable or Space secret
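
Once diarization is enabled, transcript segments need labels such as `SPEAKER_00`. How this Space matches them is not documented here. A common approach, sketched below as an assumption, is to give each segment the speaker whose turns overlap it the most. The function name `assign_speakers` and the `(start, end, speaker)` tuple layout are hypothetical.

```python
def assign_speakers(segments, turns, default="UNKNOWN"):
    """Label each segment with the speaker whose turns overlap it the most.

    segments: list of dicts with "start" and "end" in seconds.
    turns: list of (start, end, speaker) tuples from a diarization step.
    """
    labeled = []
    for seg in segments:
        overlap_by_speaker = {}
        for t_start, t_end, speaker in turns:
            # Length of the intersection between the segment and this turn.
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = (
                    overlap_by_speaker.get(speaker, 0.0) + overlap
                )
        if overlap_by_speaker:
            best = max(overlap_by_speaker, key=overlap_by_speaker.get)
        else:
            best = default  # no turn overlaps this segment
        labeled.append({**seg, "speaker": best})
    return labeled
```

Segments that fall entirely in silence or outside every turn keep the `default` label instead of being assigned a guess.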

	### API Usage

	This Space provides a public API endpoint. You can use it programmatically:

	```python
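	from gradio_client import Client

	# Connect to the public Space
	client = Client("xTHExBEASTx/Whisper-Transcriber")

	# Run client.view_api() to confirm the endpoint name and its parameters
	result = client.predict(
	    file_input=None,  # or handle_file("path/to/audio.mp3") from gradio_client
	    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
	    model_size="small",
	    language="auto",
	    enable_diarization=False,
	    api_name="/predict",
	)

	# result is a tuple: (preview, srt_file, vtt_file, txt_file, json_file)
	preview, srt_file, vtt_file, txt_file, json_file = result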
	```

	## 📊 Model Comparison

	| Model  | Size  | Speed    | Accuracy | Best For             |
	|--------|-------|----------|----------|----------------------|
	| Tiny   | 150MB | ~0.1x RT | Good     | Quick drafts         |
	| Small  | 450MB | ~0.3x RT | Better   | Most use cases       |
	| Medium | 1.5GB | ~0.5x RT | Best     | Production subtitles |

	RT = realtime. Speed is processing time as a fraction of audio length, so 0.1x RT means 10 minutes of audio is processed in about 1 minute.
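
The realtime factors in the table give a quick way to estimate processing time. The sketch below copies the approximate values from the table; real speed depends on hardware and load, and the function name is hypothetical.

```python
# Approximate realtime factors from the table above (hardware dependent).
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_processing_minutes(audio_minutes, model_size):
    """Estimate processing time as audio length times the realtime factor."""
    return audio_minutes * RT_FACTOR[model_size]
```

By this estimate, a 60-minute recording takes roughly 18 minutes with Small and roughly 30 minutes with Medium.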

	## 🎯 Use Cases

	- Content Creators: Generate subtitles for videos
	- Podcasters: Create transcripts for episodes
	- Researchers: Transcribe interviews and recordings
	- Accessibility: Add captions to media content
	- Language Learning: Study with accurate transcripts

	## 🛠️ Technical Stack

	- Whisper: OpenAI's speech recognition model
	- Pyannote.audio: Speaker diarization
	- FFmpeg: Audio/video processing
	- yt-dlp: YouTube download support
	- Gradio: Web interface

	## 📝 Output Examples

	### SRT Format
	```
	1
	00:00:00,000 --> 00:00:02,500
	[SPEAKER_00]: Hello and welcome to the show.

	2
	00:00:02,500 --> 00:00:05,000
	[SPEAKER_01]: Thanks for having me!
	```
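
For reference, output in this shape can be produced from a list of segments with a few lines of Python. This is a minimal sketch, assuming each segment is a dict with `start`, `end`, `text`, and an optional `speaker` key; the function names are hypothetical and the Space's own writer may differ.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        # Prefix the text with the speaker label when one is present.
        text = f"[{speaker}]: {seg['text']}" if speaker else seg["text"]
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{number}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```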

	### JSON Format
	```json
	{
	  "text": "Full transcript here...",
	  "language": "en",
	  "segments": [
	    {
	      "index": 0,
	      "start": 0.0,
	      "end": 2.5,
	      "text": "Hello and welcome to the show.",
	      "speaker": "SPEAKER_00"
	    }
	  ]
	}
	```
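
Because the JSON output carries segment timings, it can be converted to other formats after download. The sketch below turns JSON of the shape shown above into WebVTT. It is illustrative only: the function names are hypothetical, and the speaker prefix mirrors the SRT example rather than a documented VTT layout for this Space.

```python
import json


def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def json_to_vtt(json_text):
    """Convert a transcript JSON string into WebVTT text."""
    data = json.loads(json_text)
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for seg in data["segments"]:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        speaker = seg.get("speaker")
        lines.append(f"[{speaker}]: {seg['text']}" if speaker else seg["text"])
        lines.append("")  # blank line ends the cue
    return "\n".join(lines)
```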

	## ⚠️ Limitations

	- Maximum file size: ~1GB (adjustable)
	- Processing time depends on model size and file length
	- Speaker diarization requires HF token and adds processing time
	- YouTube download depends on availability and region restrictions

	## 🤝 Contributing

	Found a bug or have a feature request? Please open an issue on the repository.

	## 📄 License

	MIT License - Feel free to use for personal or commercial projects.

	## 🙏 Credits

	- [OpenAI Whisper](https://github.com/openai/whisper)
	- [Pyannote.audio](https://github.com/pyannote/pyannote-audio)
	- [Gradio](https://gradio.app)
	- [Hugging Face](https://huggingface.co)

	---

	Made with ❤️ using OpenAI Whisper and Gradio