---
title: YT Video
emoji: 😻
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
---

# Video → ZIP Caption Prep

Input: a video file. Output: a `.zip` containing:

- `frames/`: sampled JPG frames
- `transcription.txt`: from ASR
- `explanations.json`: per-frame captions
- `manifest.json`: summary

## Models

- ASR: `distil-whisper/distil-large-v3`
- Vision captions: `Salesforce/blip-image-captioning-base`

Both models are open source on Hugging Face. A GPU is recommended; CPU works but is slower.

## How it works

1. Extract audio with FFmpeg.
2. Transcribe via the Whisper pipeline.
3. Sample frames every *N* seconds with OpenCV.
4. Caption each frame with BLIP.
5. Package the outputs into a ZIP for downstream use.

## Space usage

1. Click **Upload video**.
2. Adjust **Frame interval** or **Max frames** if needed.
3. Press **Process**.
4. Download the ZIP.

The preview shows a few frames, a transcript snippet, and the first captions.

## Local dev

```bash
pip install -r requirements.txt
python app.py
# or CLI:
python runner.py --video path/to/video.mp4 --interval 2.0 --max_frames 150
```
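The frame-sampling step ("every *N* seconds, capped at a max frame count") comes down to choosing which timestamps to grab before seeking with OpenCV. A minimal sketch of that selection logic; the function name and signature are illustrative, not the Space's actual API:

```python
def sample_timestamps(duration_s: float, interval_s: float, max_frames: int) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Takes one frame every `interval_s` seconds starting at t=0,
    capped at `max_frames` frames. Hypothetical helper for illustration.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Frames fall at 0, N, 2N, ... up to and including the duration.
    count = int(duration_s // interval_s) + 1
    count = min(count, max_frames)
    return [i * interval_s for i in range(count)]

# A 10-second clip sampled every 2 s:
print(sample_timestamps(10.0, 2.0, 150))  # → [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Each returned timestamp would then be handed to OpenCV (e.g. `cv2.VideoCapture.set(cv2.CAP_PROP_POS_MSEC, t * 1000)`) to decode and save that frame.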
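The packaging step can be done entirely with the standard library. A sketch of bundling the four outputs into the ZIP layout described above, assuming hypothetical names (`build_zip`, `frames`, `captions`) that may differ from the Space's code:

```python
import io
import json
import zipfile


def build_zip(frames: dict[str, bytes], transcript: str, captions: list[dict]) -> bytes:
    """Bundle outputs into the ZIP layout described in the README.

    `frames` maps filenames (e.g. "frame_0001.jpg") to JPEG bytes.
    Names here are illustrative, not the app's actual API.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in frames.items():
            zf.writestr(f"frames/{name}", data)          # sampled JPG frames
        zf.writestr("transcription.txt", transcript)      # ASR output
        zf.writestr("explanations.json", json.dumps(captions, indent=2))
        manifest = {
            "num_frames": len(frames),
            "has_transcript": bool(transcript.strip()),
        }
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()
```

Writing to an in-memory `BytesIO` keeps the function easy to test; the real app could just as well write to a temp file path for Gradio to serve.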