	# ST-Evidence-7B

	We propose Evidence-Backed Video Question Answering (E-VQA), a task in which a multimodal model must jointly produce a textual answer and the spatiotemporal evidence that supports it. This evidence consists of temporal segments and dense, tracked object segmentation masklets. A masklet is a sequence of segmentation masks for a single object, tracked across video frames.
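
To make the output structure described above concrete, here is a minimal Python sketch of how an answer, its temporal segments, and its masklets could be represented. It is an illustration only: the names `Masklet`, `EvidenceBackedAnswer`, `box_mask`, and `mask_area` are hypothetical and are not part of the model's actual interface, and the choice of seconds for segment boundaries is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical representation: one binary mask as a row-major H x W grid.
Mask = List[List[bool]]


@dataclass
class Masklet:
    """One tracked object: a segmentation mask for each frame it appears in."""
    object_id: int
    masks: Dict[int, Mask] = field(default_factory=dict)  # frame index -> mask

    def frame_span(self) -> Tuple[int, int]:
        """Return the first and last frame index covered by this masklet."""
        if not self.masks:
            raise ValueError("masklet has no masks")
        return min(self.masks), max(self.masks)


@dataclass
class EvidenceBackedAnswer:
    """A textual answer plus the spatiotemporal evidence supporting it."""
    answer: str
    temporal_segments: List[Tuple[float, float]]  # (start, end); seconds assumed
    masklets: List[Masklet]


def box_mask(height: int, width: int, top: int, left: int,
             bottom: int, right: int) -> Mask:
    """Build a toy rectangular mask (True inside the box, False elsewhere)."""
    return [[top <= r < bottom and left <= c < right for c in range(width)]
            for r in range(height)]


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels in a mask."""
    return sum(sum(row) for row in mask)


# Toy example: a 2x2 object that moves one pixel to the right per frame.
cup = Masklet(object_id=0)
for frame in range(3):
    cup.masks[frame] = box_mask(4, 6, top=1, left=frame, bottom=3, right=frame + 2)

result = EvidenceBackedAnswer(
    answer="a cup",
    temporal_segments=[(0.0, 1.5)],
    masklets=[cup],
)
```

Keying the masks by frame index, rather than storing a dense list, lets a masklet skip frames in which the object is occluded or out of view.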

	Our model is fine-tuned from UniPixel, which is built on Qwen2.5-VL and SAM 2.1. UniPixel is a unified model that handles both video question answering and mask generation.

	This model is released for research purposes only, in support of the academic paper Evidence-Backed Video Question Answering.


	## License

	CC-BY-NC 4.0