---
title: FoundationMotion
emoji: 🌍
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
arxiv: "2512.10927"
---

# Video β†’ Q&A (Qwen2.5-VL-7B WolfV2)

This Space lets you drag and drop a video and ask questions about it using **Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned** (a fine-tuned Qwen2.5-VL-7B-Instruct).

## Deploy
1. Create a new Hugging Face Space (Python + Gradio).
2. Add the three files from this repo: `app.py`, `requirements.txt`, `README.md`.
3. (Optional) In the Space **Settings β†’ Variables**, set `MODEL_ID=Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned` (this is already the default).
4. (Optional) If your GPU VRAM is tight, set env var `USE_INT4=1` to enable 4-bit weight-only quantization.

> **GPU recommended.** A10/A100 or ZeroGPU works for short videos; longer/high-res videos may OOM on CPU.
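
The two environment variables above might be consumed roughly like this. This is a minimal sketch, not the actual `app.py` code; `load_kwargs` is a hypothetical helper name:

```python
import os

def load_kwargs():
    """Read MODEL_ID and USE_INT4 from the environment (illustrative helper)."""
    model_id = os.environ.get(
        "MODEL_ID", "Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned"
    )
    kwargs = {"torch_dtype": "auto", "device_map": "auto"}
    if os.environ.get("USE_INT4") == "1":
        # 4-bit weight-only quantization via bitsandbytes (GPU only).
        from transformers import BitsAndBytesConfig
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    return model_id, kwargs
```

The result can be passed straight into `from_pretrained(model_id, **kwargs)`.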

## How it works
- We construct a chat-style prompt with a video item and your question, then call `processor.apply_chat_template(..., fps=1)` and `model.generate(...)`.
- You can increase `fps` for more temporal detail. Higher fps β†’ more tokens/VRAM.
- Resolution bounds are controlled via `min_pixels`/`max_pixels` on `AutoProcessor`.
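
The steps above can be condensed into a sketch like the following (assuming Transformers `>=4.50`; `build_messages` and `answer` are illustrative names, not the actual `app.py` functions):

```python
def build_messages(video_path: str, question: str) -> list:
    """Chat-style prompt: one video item plus the user's question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": question},
        ],
    }]

def answer(video_path: str, question: str, fps: float = 1.0) -> str:
    # Heavy imports kept local so building messages stays lightweight.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned"
    # min_pixels / max_pixels bound the per-frame resolution.
    processor = AutoProcessor.from_pretrained(
        model_id, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
    )
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(video_path, question),
        tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt", fps=fps,
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens, keep only the newly generated answer.
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```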

## Tips
- If you see `KeyError: 'qwen2_5_vl'`, your Transformers version is too old; upgrade to `>=4.50.0`.
- If decoding fails for certain containers, try converting to `.mp4` (H.264 + AAC).
- To return **5 QA pairs automatically**, leave the question blank β€” the app uses a default instruction to summarize and produce 5 QAs.
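
For the container-conversion tip, the re-encode is a plain `ffmpeg` call. A sketch (the helper name is made up, and `ffmpeg` must be on your `PATH`):

```python
import subprocess

def mp4_convert_cmd(src: str, dst: str) -> list:
    """ffmpeg command re-encoding to H.264 video + AAC audio in an .mp4."""
    return ["ffmpeg", "-y", "-i", src,
            "-c:v", "libx264", "-c:a", "aac",
            "-movflags", "+faststart",  # move metadata up front for streaming
            dst]

# Usage: subprocess.run(mp4_convert_cmd("input.webm", "output.mp4"), check=True)
```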

## Acknowledgments
- Model: Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned
- Base architecture & usage patterns: Qwen/Qwen2.5-VL-7B-Instruct via πŸ€— Transformers.