---
title: YT Video
emoji: 😻
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
---

# Video → ZIP Caption Prep

Input: a video file. Output: a `.zip` containing:

- `frames/`: sampled JPG frames
- `transcription.txt`: from ASR
- `explanations.json`: per-frame captions
- `manifest.json`: summary

## Models

- ASR: `distil-whisper/distil-large-v3`
- Vision captions: `Salesforce/blip-image-captioning-base`

Both models are open source on Hugging Face. A GPU is recommended; CPU works but is slower.

## How it works

1. Extract audio with FFmpeg.
2. Transcribe via the Whisper pipeline.
3. Sample frames every *N* seconds with OpenCV.
4. Caption each frame with BLIP.
5. Package the outputs into a ZIP for downstream use.

## Space usage

1. Click **Upload video**.
2. Adjust **Frame interval** or **Max frames** if needed.
3. Press **Process**.
4. Download the ZIP.

The preview shows a few frames, a transcript snippet, and the first captions.

## Local dev

```bash
pip install -r requirements.txt
python app.py
# or CLI:
python runner.py --video path/to/video.mp4 --interval 2.0 --max_frames 150
```
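The frame-sampling step ("every *N* seconds, capped at a max frame count") comes down to choosing which timestamps to grab before seeking with OpenCV. A minimal sketch of that selection logic; the function name and signature are illustrative, not the Space's actual API:

```python
def sample_timestamps(duration_s: float, interval_s: float, max_frames: int) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Takes one frame every `interval_s` seconds starting at t=0,
    capped at `max_frames` frames. Hypothetical helper for illustration.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Frames fall at 0, N, 2N, ... up to and including the duration.
    count = int(duration_s // interval_s) + 1
    count = min(count, max_frames)
    return [i * interval_s for i in range(count)]

# A 10-second clip sampled every 2 s:
print(sample_timestamps(10.0, 2.0, 150))  # → [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Each returned timestamp would then be handed to OpenCV (e.g. `cv2.VideoCapture.set(cv2.CAP_PROP_POS_MSEC, t * 1000)`) to decode and save that frame.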
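The packaging step can be done entirely with the standard library. A sketch of bundling the four outputs into the ZIP layout described above, assuming hypothetical names (`build_zip`, `frames`, `captions`) that may differ from the Space's code:

```python
import io
import json
import zipfile


def build_zip(frames: dict[str, bytes], transcript: str, captions: list[dict]) -> bytes:
    """Bundle outputs into the ZIP layout described in the README.

    `frames` maps filenames (e.g. "frame_0001.jpg") to JPEG bytes.
    Names here are illustrative, not the app's actual API.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in frames.items():
            zf.writestr(f"frames/{name}", data)          # sampled JPG frames
        zf.writestr("transcription.txt", transcript)      # ASR output
        zf.writestr("explanations.json", json.dumps(captions, indent=2))
        manifest = {
            "num_frames": len(frames),
            "has_transcript": bool(transcript.strip()),
        }
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()
```

Writing to an in-memory `BytesIO` keeps the function easy to test; the real app could just as well write to a temp file path for Gradio to serve.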