---
title: Whisper Transcriber
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.
## ✨ Features
### 📥 Multiple Input Methods

- Upload audio/video files directly
- Paste YouTube URLs
- Paste direct file URLs (HTTP/HTTPS)
### 🎯 Model Selection

- **Tiny**: Fastest processing (~150MB)
- **Small**: Balanced speed/accuracy (~450MB)
- **Medium**: Highest accuracy (~1.5GB)
### 🌍 Multi-Language Support

- Auto-detect language from 99+ languages
- Manual language selection available
- Optimized for English, Spanish, Chinese, French, German, and more
### 👥 Speaker Diarization (Optional)

- Identify different speakers in the audio
- Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
- Requires Hugging Face token
### 📄 Multiple Output Formats

- **SRT**: Standard subtitle format for video players
- **VTT**: WebVTT format for web players
- **TXT**: Plain text transcript
- **JSON**: Full data with word-level timestamps
## 🚀 Quick Start

1. Upload a file or paste a URL (YouTube or direct link)
2. Select model size (Small recommended for most cases)
3. Choose language (Auto-detect works great!)
4. Enable speaker diarization (optional, requires HF token)
5. Click Generate Transcription
6. Download your preferred format(s)
## 📁 Supported File Formats

### Audio Formats

- MP3, WAV, M4A, FLAC, AAC, OGG, WMA

### Video Formats

- MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.
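The extraction step is handled by FFmpeg (see the Technical Stack below). As a minimal sketch of what such an invocation looks like, the helper below builds an FFmpeg command line that strips the video stream and resamples to the 16 kHz mono audio Whisper expects; `extract_audio_cmd` and its exact flags are illustrative, not taken from this Space's app.py.

```python
# Sketch: build an FFmpeg command that extracts mono 16 kHz WAV audio
# from a video (or audio) file. Helper name and flags are illustrative.

def extract_audio_cmd(src: str, dst: str = "audio.wav") -> list[str]:
    return [
        "ffmpeg",
        "-i", src,        # input video or audio file
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz (Whisper's native rate)
        "-y", dst,        # overwrite the output if it exists
    ]

# Run it with, e.g.:
# import subprocess
# subprocess.run(extract_audio_cmd("clip.mp4"), check=True)
```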
## 🔧 Advanced Features

### Large File Handling

- Files are automatically chunked into 30-minute segments
- Timestamps are preserved across chunks
- Maximum file size: ~1GB (can be increased)
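Preserving timestamps across chunks just means adding each chunk's offset back onto its segment times. A minimal sketch of that bookkeeping, assuming Whisper-style `{"start", "end", "text"}` segment dicts; `merge_chunk_segments` and the fixed `CHUNK_SEC` constant are illustrative, not this Space's actual implementation:

```python
# Sketch: shift per-chunk segment timestamps back onto the global timeline.
CHUNK_SEC = 30 * 60  # 30-minute chunks, as described above

def merge_chunk_segments(chunks: list[list[dict]]) -> list[dict]:
    merged = []
    for i, segments in enumerate(chunks):
        offset = i * CHUNK_SEC  # seconds elapsed before this chunk starts
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```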
### Speaker Diarization

To enable speaker diarization:

1. Get a Hugging Face token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept the terms at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
3. Set `HF_TOKEN` as an environment variable or Space secret
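A minimal sketch of how the token is picked up at runtime (Space secrets are exposed to the app as environment variables). The `get_hf_token` helper is illustrative; the commented-out pipeline call follows the documented pyannote.audio 3.x loading pattern and is left unexecuted here because it downloads model weights and requires accepting the model terms:

```python
import os

# Sketch: resolve the diarization token from the environment or Space secret.
def get_hf_token():
    return os.environ.get("HF_TOKEN")  # None if diarization is not configured

# Loading the diarization pipeline (requires pyannote.audio and the token):
#
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization-3.1",
#     use_auth_token=get_hf_token(),
# )
```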
### API Usage

This Space provides a public API endpoint. You can use it programmatically:

```python
from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")
result = client.predict(
    file_input=None,  # Or file path
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)
# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
```
## 📊 Model Comparison
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Tiny | 150MB | ~0.1x RT | Good | Quick drafts |
| Small | 450MB | ~0.3x RT | Better | Most use cases |
| Medium | 1.5GB | ~0.5x RT | Best | Production subtitles |
RT = Realtime (0.1x RT means 10min audio processes in 1min)
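The RT factors in the table translate directly into a processing-time estimate: multiply the audio length by the model's factor. A small sketch of that arithmetic (the table's approximate factors are hard-coded for illustration):

```python
# Approximate realtime factors from the comparison table above.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}

def estimated_minutes(audio_minutes: float, model: str) -> float:
    """Rough processing time in minutes for a given model size."""
    return audio_minutes * RT_FACTOR[model]

# e.g. 10 minutes of audio with "tiny" -> about 1 minute of processing
```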
## 🎯 Use Cases

- **Content Creators**: Generate subtitles for videos
- **Podcasters**: Create transcripts for episodes
- **Researchers**: Transcribe interviews and recordings
- **Accessibility**: Add captions to media content
- **Language Learning**: Study with accurate transcripts
## 🛠️ Technical Stack

- **Whisper**: OpenAI's speech recognition model
- **Pyannote.audio**: Speaker diarization
- **FFmpeg**: Audio/video processing
- **yt-dlp**: YouTube download support
- **Gradio**: Web interface
## 📝 Output Examples

### SRT Format

```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
```
### JSON Format

```json
{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
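As a sketch of how JSON segments like those above map onto the SRT output, the following illustrative helpers (not this Space's actual app.py code) format seconds as SRT timestamps and assemble numbered subtitle blocks:

```python
# Sketch: convert Whisper-style segment dicts into SRT text.

def format_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build numbered SRT blocks, with optional [SPEAKER_XX] labels."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        speaker = f"[{seg['speaker']}]: " if "speaker" in seg else ""
        blocks.append(
            f"{i}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{speaker}{seg['text']}\n"
        )
    return "\n".join(blocks)
```

The VTT output differs mainly in using `.` instead of `,` as the millisecond separator.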
## ⚠️ Limitations

- Maximum file size: ~1GB (adjustable)
- Processing time depends on model size and file length
- Speaker diarization requires an HF token and adds processing time
- YouTube downloads depend on availability and region restrictions
## 🤝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.
## 📜 License

MIT License - Feel free to use for personal or commercial projects.
## 🙏 Credits

Made with ❤️ using OpenAI Whisper and Gradio