# ST-Evidence-7B
We propose **Evidence-Backed Video Question Answering** (E-VQA), a task in which a multimodal model jointly produces a semantic **textual answer** and the associated spatiotemporal evidence. This evidence consists of **temporal segments** and **dense, tracked object segmentation masklets**. A masklet is a sequence of segmentation masks for an object, tracked across video frames.
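To make the three output components concrete, here is a minimal sketch of how an E-VQA result could be represented in Python. It is illustrative only and is not the model's actual output format: the class names `Masklet` and `EvidenceBackedAnswer`, the helper `segments_are_valid`, the nested-list mask encoding, and the example values are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical mask encoding: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


@dataclass
class Masklet:
    """Segmentation masks for one object, keyed by frame index (hypothetical)."""
    object_id: int
    masks: Dict[int, Mask] = field(default_factory=dict)

    def frame_span(self) -> Tuple[int, int]:
        """Return the first and last frame index in which the object is masked."""
        if not self.masks:
            raise ValueError("masklet contains no masks")
        frames = sorted(self.masks)
        return frames[0], frames[-1]


@dataclass
class EvidenceBackedAnswer:
    """Textual answer plus its spatiotemporal evidence (hypothetical container)."""
    answer: str
    temporal_segments: List[Tuple[float, float]]  # (start, end) in seconds
    masklets: List[Masklet]


def segments_are_valid(segments: List[Tuple[float, float]]) -> bool:
    """Check that every segment is non-negative and has start before end."""
    return all(0 <= start < end for start, end in segments)


# Made-up example values, purely for illustration.
masklet = Masklet(
    object_id=0,
    masks={12: [[0, 1], [1, 1]], 13: [[0, 1], [0, 1]]},
)
result = EvidenceBackedAnswer(
    answer="The person picks up a cup.",
    temporal_segments=[(4.0, 6.5)],
    masklets=[masklet],
)
```

Keying masks by frame index, rather than storing a dense list, lets a masklet cover only the frames in which the object is actually visible.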
Our model is fine-tuned from UniPixel, a unified model for both video question answering and mask generation that is built on Qwen2.5-VL and SAM 2.1.
This model is released for research purposes only, in support of the academic paper *Evidence-Backed Video Question Answering*.
## License
CC-BY-NC 4.0