metadata
title: Whisper Transcriber
emoji: 🎀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit

🎀 Whisper Transcriber

Generate accurate subtitles and transcripts from audio/video files using OpenAI Whisper.

✨ Features

  • 📁 Multiple Input Methods

    • Upload audio/video files directly
    • Paste YouTube URLs
    • Paste direct file URLs (HTTP/HTTPS)
  • 🎯 Model Selection

    • Tiny: Fastest processing (~150MB)
    • Small: Balanced speed/accuracy (~450MB)
    • Medium: Highest accuracy (~1.5GB)
  • 🌍 Multi-Language Support

    • Auto-detect language from 99+ languages
    • Manual language selection available
    • Optimized for English, Spanish, Chinese, French, German, and more
  • 👥 Speaker Diarization (Optional)

    • Identify different speakers in the audio
    • Label speakers in transcripts (SPEAKER_00, SPEAKER_01, etc.)
    • Requires Hugging Face token
  • 📝 Multiple Output Formats

    • SRT: Standard subtitle format for video players
    • VTT: WebVTT format for web players
    • TXT: Plain text transcript
    • JSON: Full data with word-level timestamps

🚀 Quick Start

  1. Upload a file or paste a URL (YouTube or direct link)
  2. Select model size (Small recommended for most cases)
  3. Choose language (Auto-detect works great!)
  4. Enable speaker diarization (optional, requires HF token)
  5. Click Generate Transcription
  6. Download your preferred format(s)
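
Step 1 accepts either a YouTube URL or a direct file link. As a rough illustration of how a pasted URL could be told apart, here is a minimal sketch. The `classify_input` helper and the `YOUTUBE_HOSTS` set are hypothetical names for this example and are not taken from the Space's code.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch (an assumption, not the app's list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(url: str) -> str:
    """Return "youtube", "direct", or "invalid" for a pasted URL."""
    parsed = urlparse(url.strip())
    # Only HTTP/HTTPS links with a hostname are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return "invalid"
    if parsed.hostname in YOUTUBE_HOSTS:
        return "youtube"
    return "direct"


if __name__ == "__main__":
    print(classify_input("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
    print(classify_input("https://example.com/episode.mp3"))
```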

📋 Supported File Formats

Audio Formats

  • MP3, WAV, M4A, FLAC, AAC, OGG, WMA

Video Formats

  • MP4, AVI, MKV, MOV, WMV, FLV, WebM

Audio is automatically extracted from video files.
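
The Space's own extraction command is not shown here. As a sketch of what audio extraction with FFmpeg typically looks like, the hypothetical helper below builds a command that drops the video stream (`-vn`) and writes mono (`-ac 1`) 16 kHz (`-ar 16000`) WAV, the format Whisper resamples to internally. The helper name and output naming are assumptions.

```python
from pathlib import Path


def build_extract_command(video_path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argument list that extracts mono WAV audio from a video."""
    out_path = str(Path(video_path).with_suffix(".wav"))
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        out_path,
    ]


if __name__ == "__main__":
    # Pass the list to subprocess.run(..., check=True) to actually run ffmpeg.
    print(" ".join(build_extract_command("talk.mp4")))
```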

🔧 Advanced Features

Large File Handling

  • Files are automatically chunked into 30-minute segments
  • Timestamps are preserved across chunks
  • Maximum file size: ~1GB (can be increased)
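
To make the chunking idea concrete, the sketch below splits a duration into 30-minute windows and shifts each chunk's segment times by the chunk's start offset, which is one common way to keep timestamps continuous across chunks. The function names and the segment dictionary shape are assumptions for illustration, not the Space's implementation.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, as described above


def chunk_bounds(duration: float, chunk_seconds: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) pairs in seconds that cover the whole duration."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


def shift_segments(segments: list[dict], offset: float) -> list[dict]:
    """Move chunk-relative segment times onto the full-file timeline."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]


if __name__ == "__main__":
    print(chunk_bounds(4000.0))
    print(shift_segments([{"start": 0.0, "end": 2.5, "text": "Hello"}], 1800.0))
```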

Speaker Diarization

To enable speaker diarization:

  1. Create a Hugging Face access token at huggingface.co/settings/tokens
  2. Accept the model terms on the pyannote/speaker-diarization-3.1 model page
  3. Set HF_TOKEN as an environment variable or as a Space secret
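
Once the token is set, application code can read it from the environment. The helper below is a hypothetical sketch of that lookup; it only returns the token and leaves loading the pyannote pipeline to the caller.

```python
import os
from typing import Mapping, Optional


def diarization_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the HF_TOKEN value, or None if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    token = diarization_token()
    if token is None:
        print("HF_TOKEN not set: speaker diarization is unavailable")
    else:
        # The token would then be passed to pyannote.audio when loading
        # pyannote/speaker-diarization-3.1 via Pipeline.from_pretrained.
        print("HF_TOKEN found: speaker diarization can be enabled")
```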

API Usage

This Space provides a public API endpoint. You can use it programmatically:

from gradio_client import Client

client = Client("xTHExBEASTx/Whisper-Transcriber")

result = client.predict(
    file_input=None,  # or handle_file("audio.mp3") for a local file (from gradio_client import handle_file)
    url_input="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    model_size="small",
    language="auto",
    enable_diarization=False,
    api_name="/predict"
)

# result contains: (preview, srt_file, vtt_file, txt_file, json_file)
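
Positional tuples are easy to misindex. As an optional convenience, the five outputs can be wrapped in a named tuple. `TranscriptionResult` and `unpack_result` are hypothetical names for this sketch, and the field order follows the comment above.

```python
from typing import NamedTuple, Sequence


class TranscriptionResult(NamedTuple):
    """Named view of the five outputs returned by the endpoint."""
    preview: str
    srt_file: str
    vtt_file: str
    txt_file: str
    json_file: str


def unpack_result(result: Sequence[str]) -> TranscriptionResult:
    """Validate the output count and return the outputs with named fields."""
    if len(result) != 5:
        raise ValueError(f"expected 5 outputs, got {len(result)}")
    return TranscriptionResult(*result)


if __name__ == "__main__":
    # Placeholder values standing in for a real client.predict(...) result.
    outputs = unpack_result(("preview text", "out.srt", "out.vtt", "out.txt", "out.json"))
    print(outputs.srt_file)
```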

📊 Model Comparison

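| Model  | Size  | Speed    | Accuracy | Best For             |
|--------|-------|----------|----------|----------------------|
| Tiny   | 150MB | ~0.1x RT | Good     | Quick drafts         |
| Small  | 450MB | ~0.3x RT | Better   | Most use cases       |
| Medium | 1.5GB | ~0.5x RT | Best     | Production subtitles |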

RT = real time. Speed is processing time as a fraction of audio length, so 0.1x RT means 10 minutes of audio is processed in about 1 minute.
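
The speed figures translate directly into a rough time estimate. The sketch below multiplies audio length by the approximate factors from the comparison above. Actual times depend on hardware and file content, and the names are illustrative.

```python
# Approximate real-time factors taken from the model comparison above.
RT_FACTOR = {"tiny": 0.1, "small": 0.3, "medium": 0.5}


def estimate_minutes(audio_minutes: float, model_size: str) -> float:
    """Estimate processing time in minutes for a given audio length."""
    return audio_minutes * RT_FACTOR[model_size]


if __name__ == "__main__":
    for size in RT_FACTOR:
        print(f"{size}: about {estimate_minutes(60, size):.0f} min for 60 min of audio")
```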

🎯 Use Cases

  • Content Creators: Generate subtitles for videos
  • Podcasters: Create transcripts for episodes
  • Researchers: Transcribe interviews and recordings
  • Accessibility: Add captions to media content
  • Language Learning: Study with accurate transcripts

🛠️ Technical Stack

  • Whisper: OpenAI's speech recognition model
  • Pyannote.audio: Speaker diarization
  • FFmpeg: Audio/video processing
  • yt-dlp: YouTube download support
  • Gradio: Web interface

📝 Output Examples

SRT Format

1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00]: Hello and welcome to the show.

2
00:00:02,500 --> 00:00:05,000
[SPEAKER_01]: Thanks for having me!
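
For readers who want to produce the same layout themselves, here is a minimal sketch that renders segments as SRT cues. It assumes segments are dictionaries with `start`, `end`, `text`, and an optional `speaker` key. The function names are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        # Prefix the speaker label only when diarization provided one.
        text = f"[{speaker}]: {seg['text']}" if speaker else seg["text"]
        blocks.append(
            f"{number}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        {"start": 0.0, "end": 2.5, "text": "Hello and welcome to the show.", "speaker": "SPEAKER_00"},
        {"start": 2.5, "end": 5.0, "text": "Thanks for having me!", "speaker": "SPEAKER_01"},
    ]))
```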

JSON Format

{
  "text": "Full transcript here...",
  "language": "en",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00"
    }
  ]
}
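
The JSON output is convenient for post-processing. As one example, the sketch below totals speaking time per speaker from the `segments` array. It assumes the structure shown above, and the `UNKNOWN` fallback label is an addition for segments without a `speaker` field.

```python
import json
from collections import defaultdict


def speaking_time(json_text: str) -> dict[str, float]:
    """Sum segment durations in seconds for each speaker label."""
    data = json.loads(json_text)
    totals: dict[str, float] = defaultdict(float)
    for seg in data["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        totals[speaker] += seg["end"] - seg["start"]
    return dict(totals)


if __name__ == "__main__":
    sample = json.dumps({
        "text": "Full transcript here...",
        "language": "en",
        "segments": [
            {"index": 0, "start": 0.0, "end": 2.5,
             "text": "Hello and welcome to the show.", "speaker": "SPEAKER_00"},
        ],
    })
    print(speaking_time(sample))
```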

⚠️ Limitations

  • Maximum file size: ~1GB (adjustable)
  • Processing time depends on model size and file length
  • Speaker diarization requires HF token and adds processing time
  • YouTube download depends on availability and region restrictions

🀝 Contributing

Found a bug or have a feature request? Please open an issue on the repository.

📄 License

MIT License - Feel free to use for personal or commercial projects.

🙏 Credits

Made with ❤️ using OpenAI Whisper and Gradio