---
title: YT Video
emoji: 😻
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
---

# Video → ZIP Caption Prep
|
|
Input: a video file.
Output: a `.zip` containing:
- `frames/`: sampled JPG frames
- `transcription.txt`: the ASR transcript
- `explanations.json`: per-frame captions
- `manifest.json`: a run summary
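The packaging step can be sketched with the standard library alone. This is a minimal sketch, not the app's actual code: the manifest fields and the `package_outputs` helper name are illustrative assumptions; only the file names come from this README.

```python
import json
import zipfile
from pathlib import Path


def package_outputs(workdir: Path, zip_path: Path) -> None:
    """Bundle frames/, transcription.txt and explanations.json into a ZIP,
    adding a small manifest.json summary (fields here are hypothetical)."""
    manifest = {
        "frames": sorted(p.name for p in (workdir / "frames").glob("*.jpg")),
        "has_transcript": (workdir / "transcription.txt").exists(),
    }
    (workdir / "manifest.json").write_text(json.dumps(manifest, indent=2))

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in workdir.rglob("*"):
            if p.is_file():
                # Store paths relative to the work dir so the ZIP root
                # contains frames/, transcription.txt, etc. directly.
                zf.write(p, p.relative_to(workdir))
```

`ZIP_DEFLATED` keeps the archive small; JPG frames compress little, but the transcript and JSON files do.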
|
|
## Models
- ASR: `distil-whisper/distil-large-v3`
- Vision captions: `Salesforce/blip-image-captioning-base`
|
|
Both models are open source and hosted on Hugging Face. A GPU is recommended; CPU works, but inference is noticeably slower.
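One plausible way to load both pipelines with a CPU fallback is sketched below. This is an assumption-laden sketch, not the Space's actual loading code: `pick_device` and `load_pipelines` are made-up names, and only the model IDs and pipeline tasks come from the models listed above.

```python
def pick_device() -> int:
    """Return a transformers device index: 0 for the first GPU, -1 for CPU."""
    try:
        import torch
        return 0 if torch.cuda.is_available() else -1
    except ImportError:
        # torch missing entirely -> fall back to CPU
        return -1


def load_pipelines(device: int):
    """Build the ASR and captioning pipelines (downloads weights on first run)."""
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v3",
        device=device,
    )
    captioner = pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
        device=device,
    )
    return asr, captioner
```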
|
|
## How it works
1) Extract audio with FFmpeg.
2) Transcribe via the Whisper pipeline.
3) Sample frames every *N* seconds with OpenCV.
4) Caption each frame with BLIP.
5) Package outputs into a ZIP for downstream use.
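Step 3 reduces to simple index arithmetic. The sketch below assumes one frame per `interval` seconds, capped at `max_frames`; `sample_indices` is a hypothetical helper, and the app's exact logic may differ.

```python
def sample_indices(total_frames: int, fps: float,
                   interval: float, max_frames: int) -> list[int]:
    """Frame indices spaced roughly `interval` seconds apart, capped at
    `max_frames`. At 30 fps with interval=2.0, that is every 60th frame."""
    step = max(1, round(fps * interval))          # frames between samples
    indices = list(range(0, total_frames, step))  # 0, step, 2*step, ...
    return indices[:max_frames]
```

With OpenCV, each index can then be fetched via `cap.set(cv2.CAP_PROP_POS_FRAMES, i)` followed by `cap.read()` before writing the JPG.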
|
|
## Space usage
1. Click **Upload video**.
2. Adjust **Frame interval** or **Max frames** if needed.
3. Press **Process**.
4. Download the ZIP. The preview shows a few frames, a transcript snippet, and the first captions.
|
|
## Local dev
```bash
pip install -r requirements.txt
python app.py
# or CLI:
python runner.py --video path/to/video.mp4 --interval 2.0 --max_frames 150
```