Instructions to use UWGZQ/TRASER with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use UWGZQ/TRASER with Transformers:
```python
# Load model directly
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration_Insert

processor = AutoProcessor.from_pretrained("UWGZQ/TRASER")
model = Qwen2_5_VLForConditionalGeneration_Insert.from_pretrained("UWGZQ/TRASER")
```
- Notebooks
- Google Colab
- Kaggle
Improve model card metadata, author list, and architecture image path
Hi! I'm Niels, part of the community science team at Hugging Face.
This PR improves the model card for TRASER by:
- Adding `library_name: transformers` and `datasets` to the YAML metadata for better discoverability.
- Adding the missing author **Winson Han** and fixing the formatting for **Tario G. You** in the author list.
- Updating the architecture image link to `static/image.png` to correctly point to the file listed in the repository structure.
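
For reviewers who want to sanity-check the metadata change, here is a minimal sketch of reading a model card's YAML front matter and reporting missing keys. It assumes a flat front matter with unindented list items, as in this README; the helper names `parse_front_matter` and `missing_keys` and the `REQUIRED` tuple are hypothetical and not part of any Hugging Face tooling, and the sample card is abridged from the updated README.

```python
# Hypothetical helper: a deliberately simple front-matter reader, NOT a full YAML parser.
# It only understands "key: value" lines and unindented "- item" lists.

def parse_front_matter(card_text):
    """Return the top-level front-matter keys as a dict of str or list-of-str values."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at the top of the card
    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of the front matter
        if stripped.startswith("- ") and current_key is not None:
            # List item: attach it to the most recent key.
            if not isinstance(meta[current_key], list):
                meta[current_key] = []
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            meta[current_key] = value.strip()
    return meta


# Assumed set of keys to check; chosen for illustration only.
REQUIRED = ("license", "pipeline_tag", "library_name", "datasets")


def missing_keys(meta, required=REQUIRED):
    """List the required keys that are absent or empty in the parsed metadata."""
    return [key for key in required if not meta.get(key)]


# Abridged copy of the updated front matter from this PR.
CARD = """---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-scene-graph
- scene-graph-generation
datasets:
- UWGZQ/Synthetic_Visual_Genome2
---

# TRASER
"""

meta = parse_front_matter(CARD)
print(missing_keys(meta))
```

For real cards with nested or quoted YAML, a proper YAML parser would be the safer choice; the sketch only covers the flat layout used here.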
README.md
CHANGED
@@ -1,7 +1,10 @@
 ---
-
+base_model: Qwen/Qwen2.5-VL-3B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 tags:
 - video-scene-graph
 - scene-graph-generation
@@ -9,25 +12,25 @@ tags:
 - trajectory-aware
 - perceiver-resampler
 - qwen2.5-vl
-
-
+datasets:
+- UWGZQ/Synthetic_Visual_Genome2
 ---
 
-# TRASER
+# TRASER
 
 TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.
 
-**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/
+**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/abs/2602.23543)
 
 **Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)
 
-**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
+**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
 
 ---
 
 ## Model Architecture
 
-
+
 
 TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:
 
@@ -155,8 +158,8 @@
 
 TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:
 
-- **
-- **
+- **~636K videos** with dense panoptic, per-frame annotations
+- **~6.6M objects · ~52M attributes · ~6.7M relations**
 
 ---
 
@@ -173,4 +176,4 @@ TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2602.23543},
 }
-```
+```
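
The model card in this diff describes TRASER's output as a structured spatio-temporal scene graph of objects, attributes, and relations across time. As a rough illustration of what such a structure could look like in code, here is a minimal sketch. The class and field names (`ObjectNode`, `Relation`, `VideoSceneGraph`, `start_frame`, `end_frame`) are assumptions made for this example and do not reflect TRASER's actual output schema, which this page does not show.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One tracked object; `attributes` holds free-form descriptors (hypothetical schema)."""
    object_id: int
    label: str
    attributes: list = field(default_factory=list)


@dataclass
class Relation:
    """A subject-predicate-object triple valid over an inclusive frame interval."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class VideoSceneGraph:
    """Container mapping object ids to nodes, plus time-stamped relations."""
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_object(self, node):
        self.objects[node.object_id] = node

    def add_relation(self, rel):
        # Reject relations that point at unknown objects or have an inverted interval.
        if rel.subject_id not in self.objects or rel.object_id not in self.objects:
            raise KeyError("relation refers to an unknown object id")
        if rel.start_frame > rel.end_frame:
            raise ValueError("start_frame must not exceed end_frame")
        self.relations.append(rel)

    def relations_at(self, frame):
        """Return (subject label, predicate, object label) triples active at `frame`."""
        return [
            (self.objects[r.subject_id].label, r.predicate, self.objects[r.object_id].label)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]


# Toy example with made-up objects and frame ranges.
graph = VideoSceneGraph()
graph.add_object(ObjectNode(0, "person", ["standing"]))
graph.add_object(ObjectNode(1, "cup", ["white"]))
graph.add_relation(Relation(0, "holding", 1, start_frame=10, end_frame=40))

print(graph.relations_at(20))
print(graph.relations_at(50))
```

Storing an interval on each relation, rather than one graph per frame, keeps the structure compact when relations persist over many frames; whether TRASER serializes its output this way is not stated here.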