---
title: FoundationMotion
emoji: 🌍
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
arxiv: "2512.10927"
---
# Video → Q&A (Qwen2.5-VL-7B WolfV2)
This Space lets you drag-and-drop a video and ask questions about it using **Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned** (a fine-tuned Qwen2.5-VL-7B-Instruct).
## Deploy
1. Create a new Hugging Face Space (Python + Gradio).
2. Add the three files from this repo: `app.py`, `requirements.txt`, `README.md`.
3. (Optional) In the Space **Settings → Variables**, set `MODEL_ID=Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned` (this is already the default).
4. (Optional) If your GPU VRAM is tight, set env var `USE_INT4=1` to enable 4-bit weight-only quantization.
> **GPU recommended.** A10/A100 or ZeroGPU works for short videos; longer/high-res videos may OOM on CPU.
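The optional variables above can be resolved in the app at startup. This is a minimal sketch of that lookup; the function name `read_settings` is hypothetical, and the actual logic lives in `app.py`:

```python
import os


def read_settings(env=os.environ):
    """Resolve the optional Space variables (defaults match this README).

    MODEL_ID falls back to the fine-tuned WolfV2 checkpoint; USE_INT4=1
    opts in to 4-bit weight-only quantization.
    """
    model_id = env.get("MODEL_ID", "Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned")
    use_int4 = env.get("USE_INT4", "0") == "1"
    return model_id, use_int4
```

With no variables set, `read_settings({})` returns the default model ID and `False` for quantization.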
## How it works
- We construct a chat-style prompt with a video item and your question, then call `processor.apply_chat_template(..., fps=1)` and `model.generate(...)`.
- You can increase `fps` for more temporal detail. Higher fps → more tokens/VRAM.
- Resolution bounds are controlled via `min_pixels`/`max_pixels` on `AutoProcessor`.
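The flow above can be sketched as follows. The message schema matches the chat format Qwen2.5-VL's processor expects; the helper names (`build_messages`, `answer`) are illustrative, not the exact code in `app.py`:

```python
def build_messages(video_path: str, question: str) -> list:
    """Construct a chat-style prompt with one video item plus the user's text."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def answer(model, processor, video_path, question, fps=1, max_new_tokens=512):
    """Apply the chat template, run generate, and decode only the new tokens."""
    messages = build_messages(video_path, question)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        fps=fps,  # frames sampled per second; higher -> more tokens/VRAM
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the model's reply is decoded.
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

`min_pixels`/`max_pixels` are passed when constructing the `AutoProcessor`, so they bound how each sampled frame is resized before tokenization.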
## Tips
- If you see `KeyError: 'qwen2_5_vl'`, your Transformers is too old; upgrade to `>=4.50.0`.
- If decoding fails for certain containers, try converting to `.mp4` (H.264 + AAC).
- To return **5 QA pairs automatically**, leave the question blank; the app uses a default instruction to summarize and produce 5 QAs.
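The blank-question fallback in the last tip amounts to a simple substitution. The exact default instruction is defined in `app.py`; the wording below is a hypothetical stand-in:

```python
# Hypothetical stand-in for the app's real default prompt.
DEFAULT_INSTRUCTION = (
    "Summarize the video, then produce 5 question-answer pairs about it."
)


def resolve_question(user_text: str) -> str:
    """Return the user's question, or the default 5-QA instruction if blank."""
    return user_text.strip() or DEFAULT_INSTRUCTION
```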
## Acknowledgments
- Model: Efficient-Large-Model/qwen2_5vl-7b-wolfv2-tuned
- Base architecture & usage patterns: Qwen/Qwen2.5-VL-7B-Instruct via 🤗 Transformers.