---
title: YT Video
emoji: 😻
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
---
# Video → ZIP Caption Prep
**Input:** a video file.

**Output:** a `.zip` containing:

- `frames/`: sampled JPG frames
- `transcription.txt`: from ASR
- `explanations.json`: with per-frame captions
- `manifest.json`: summary
## Models

- ASR: `distil-whisper/distil-large-v3`
- Vision captions: `Salesforce/blip-image-captioning-base`

Both models are open source on Hugging Face. A GPU is recommended; CPU works but is slower.
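Both model IDs resolve through the Transformers `pipeline` API. A minimal loading sketch, assuming a Transformers/PyTorch environment (`pick_device` and `load_models` are illustrative helpers, not names from this Space's code):

```python
def pick_device():
    # Use the first CUDA GPU if PyTorch sees one; otherwise CPU (-1).
    try:
        import torch
        return 0 if torch.cuda.is_available() else -1
    except ImportError:
        return -1

def load_models(device=None):
    """Load the ASR and captioning pipelines (downloads weights on first call)."""
    from transformers import pipeline  # lazy import: heavy dependency
    if device is None:
        device = pick_device()
    asr = pipeline("automatic-speech-recognition",
                   model="distil-whisper/distil-large-v3", device=device)
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base", device=device)
    return asr, captioner
```

Usage: `asr, captioner = load_models()`; the first call fetches the model weights from the Hub.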
## How it works
- Extract audio with FFmpeg.
- Transcribe via Whisper pipeline.
- Sample frames every N seconds with OpenCV.
- Caption each frame with BLIP.
- Package outputs into a ZIP for downstream use.
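The sampling and packaging steps above can be sketched without the model code. This is a sketch, not the Space's implementation: `sample_indices` and `write_zip` are hypothetical names, and the manifest fields shown are illustrative:

```python
import json
import zipfile
from pathlib import Path

def sample_indices(fps, total_frames, interval_s, max_frames):
    # One frame every `interval_s` seconds, capped at `max_frames`.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))[:max_frames]

def write_zip(zip_path, frame_files, transcript, captions):
    # Package frames, transcript, per-frame captions, and a summary manifest.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in frame_files:
            zf.write(f, arcname=f"frames/{Path(f).name}")
        zf.writestr("transcription.txt", transcript)
        zf.writestr("explanations.json", json.dumps(captions, indent=2))
        manifest = {"num_frames": len(frame_files),
                    "transcript_chars": len(transcript)}
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
```

For a 30 fps video sampled every 2 seconds, `sample_indices(30, 300, 2.0, 150)` picks every 60th frame: `[0, 60, 120, 180, 240]`.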
## Space usage
- Click Upload video.
- Adjust Frame interval or Max frames if needed.
- Press Process.
- Download the ZIP. Preview shows a few frames, transcript snippet, and first captions.
## Local dev

```bash
pip install -r requirements.txt
python app.py
# or CLI:
python runner.py --video path/to/video.mp4 --interval 2.0 --max_frames 150
```
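The CLI flags above suggest an `argparse` setup along these lines. This is a sketch inferred from the command, with assumed defaults of 2.0 seconds and 150 frames; `runner.py`'s actual parser may differ:

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Video -> ZIP caption prep")
    p.add_argument("--video", required=True, help="path to the input video file")
    p.add_argument("--interval", type=float, default=2.0,
                   help="seconds between sampled frames")
    p.add_argument("--max_frames", type=int, default=150,
                   help="cap on the number of sampled frames")
    return p
```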