---
title: FoundationMotion
emoji: 🌍
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
arxiv: "2512.10927"
---

# Video → Q&A (Qwen2.5-VL-7B WolfV2)

This Space lets you drag-and-drop a video and ask questions about it using **Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned** (a fine-tuned Qwen2.5-VL-7B-Instruct).

## Deploy

1. Create a new Hugging Face Space (Python + Gradio).
2. Add the three files from this repo: `app.py`, `requirements.txt`, `README.md`.
3. (Optional) In the Space **Settings → Variables**, set `MODEL_ID=Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned` (this is already the default).
4. (Optional) If your GPU VRAM is tight, set the environment variable `USE_INT4=1` to enable 4-bit weight-only quantization.

> **GPU recommended.** An A10/A100 or ZeroGPU works for short videos; longer or high-resolution videos may OOM on CPU.

## How it works

- The app constructs a chat-style prompt with a video item and your question, then calls `processor.apply_chat_template(..., fps=1)` and `model.generate(...)`.
- You can increase `fps` for more temporal detail; higher fps means more tokens and more VRAM.
- Resolution bounds are controlled via `min_pixels`/`max_pixels` on `AutoProcessor`.

## Tips

- If you see `KeyError: 'qwen2_5_vl'`, your Transformers version is too old; upgrade to `>=4.50.0`.
- If decoding fails for certain containers, try converting the video to `.mp4` (H.264 + AAC).
- To get **5 QA pairs automatically**, leave the question blank: the app falls back to a default instruction that summarizes the video and produces 5 QAs.

## Acknowledgments

- Model: Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned
- Base architecture & usage patterns: Qwen/Qwen2.5-VL-7B-Instruct via 🤗 Transformers.
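The chat-style prompt described in "How it works" can be sketched as follows. This is a minimal illustration, not the exact `app.py` code: the helper name `build_messages` and the placeholder video path and question are assumptions, and the commented-out processor/model calls follow the standard Qwen2.5-VL usage pattern in 🤗 Transformers.

```python
def build_messages(video_path: str, question: str) -> list:
    """Build a chat-style prompt with one video item and one text item,
    in the message format Qwen2.5-VL expects."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_messages("clip.mp4", "What happens in this video?")

# In the Space, a prompt like this is then passed to the processor and model,
# roughly (sketch only; requires a GPU and the actual model weights):
#
#   inputs = processor.apply_chat_template(
#       messages, tokenize=True, add_generation_prompt=True,
#       return_dict=True, return_tensors="pt", fps=1,
#   )
#   output_ids = model.generate(**inputs, max_new_tokens=512)
```

Raising `fps` in `apply_chat_template` samples more frames, which increases temporal detail at the cost of more visual tokens and VRAM.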