---
title: FoundationMotion
emoji: 🌍
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
arxiv: '2512.10927'
---

# Video → Q&A (Qwen2.5-VL-7B WolfV2)

This Space lets you drag and drop a video and ask questions about it using `Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned` (a fine-tune of `Qwen2.5-VL-7B-Instruct`).

## Deploy

  1. Create a new Hugging Face Space (Python + Gradio).
  2. Add the three files from this repo: app.py, requirements.txt, README.md.
  3. (Optional) In the Space Settings → Variables, set `MODEL_ID=Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned` (this is already the default).
  4. (Optional) If GPU VRAM is tight, set the env var `USE_INT4=1` to enable 4-bit weight-only quantization.
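The two optional variables above can be read at startup along these lines (a minimal sketch, not the Space's exact `app.py`; the helper names are illustrative, and the 4-bit path assumes `bitsandbytes` is installed alongside Transformers):

```python
import os

# Default matches the Space setting described above.
DEFAULT_MODEL_ID = "Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned"

def read_config() -> dict:
    """Read the optional Space variables, falling back to the defaults."""
    return {
        "model_id": os.environ.get("MODEL_ID", DEFAULT_MODEL_ID),
        # Any non-empty value other than "0" enables 4-bit loading.
        "use_int4": os.environ.get("USE_INT4", "0") not in ("", "0"),
    }

def load_model(cfg: dict):
    """Load the model, optionally 4-bit quantized (requires bitsandbytes + GPU)."""
    from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
    kwargs = {"device_map": "auto", "torch_dtype": "auto"}
    if cfg["use_int4"]:
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    return Qwen2_5_VLForConditionalGeneration.from_pretrained(cfg["model_id"], **kwargs)
```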

GPU recommended. A10/A100 or ZeroGPU works for short videos; longer/high-res videos may OOM on CPU.

## How it works

  • We construct a chat-style prompt with a video item and your question, then call `processor.apply_chat_template(..., fps=1)` and `model.generate(...)`.
  • You can increase `fps` for more temporal detail; higher fps means more tokens and more VRAM.
  • Resolution bounds are controlled via `min_pixels`/`max_pixels` on the `AutoProcessor`.
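In code, the steps above correspond roughly to the following (a sketch under the standard Qwen2.5-VL usage pattern, not the Space's exact `app.py`; the `answer` helper, its `max_new_tokens` value, and the prompt-stripping step are assumptions):

```python
def build_messages(video_path: str, question: str) -> list:
    """Chat-style prompt with one video item plus the user's question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": question},
        ],
    }]

def answer(processor, model, video_path: str, question: str, fps: int = 1) -> str:
    """One generation pass; higher fps samples more frames (more tokens/VRAM)."""
    messages = build_messages(video_path, question)
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        fps=fps,  # temporal sampling rate, as described above
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens before decoding so only the answer remains.
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```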

## Tips

  • If you see `KeyError: 'qwen2_5_vl'`, your installed Transformers is too old; upgrade to >=4.50.0.
  • If decoding fails for certain containers, try converting the video to `.mp4` (H.264 + AAC).
  • To get 5 QA pairs automatically, leave the question blank; the app then falls back to a default instruction that summarizes the video and produces 5 QAs.

## Acknowledgments

  • Model: `Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned`
  • Base architecture & usage patterns: `Qwen/Qwen2.5-VL-7B-Instruct` via 🤗 Transformers.