UWGZQ/TRASER

@@ -1,7 +1,10 @@
 ---
-license: apache-2.0
 language:
 - en
 tags:
 - video-scene-graph
 - scene-graph-generation
@@ -9,25 +12,25 @@ tags:
 - trajectory-aware
 - perceiver-resampler
 - qwen2.5-vl
-base_model: Qwen/Qwen2.5-VL-3B-Instruct
-pipeline_tag: video-text-to-text
 ---
-# TRASER:
 TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.
-**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543)
 **Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)
-**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
 ---
 ## Model Architecture
-![TRASER Architecture](static/model.png)
 TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:
@@ -155,8 +158,8 @@ Then follow the preprocessing steps in `inference.py`: load masks → build obje
 TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:
-- **\~636K videos** with dense panoptic, per-frame annotations
-- **\~6.6M objects · \~52M attributes · \~6.7M relations**
 ---
@@ -173,4 +176,4 @@ TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_
       primaryClass={cs.CV},
       url={https://arxiv.org/abs/2602.23543},
 }
-```

 ---
+base_model: Qwen/Qwen2.5-VL-3B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 tags:
 - video-scene-graph
 - scene-graph-generation
 - trajectory-aware
 - perceiver-resampler
 - qwen2.5-vl
+datasets:
+- UWGZQ/Synthetic_Visual_Genome2
 ---
+# TRASER
 TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.
+**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/abs/2602.23543)
 **Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)
+**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
 ---
 ## Model Architecture
+![TRASER Architecture](static/image.png)
 TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:
 TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:
+- **~636K videos** with dense panoptic, per-frame annotations
+- **~6.6M objects · ~52M attributes · ~6.7M relations**
 ---
       primaryClass={cs.CV},
       url={https://arxiv.org/abs/2602.23543},
 }
+```
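
The metadata changes in this diff (adding `library_name` and `datasets`, and reordering `license`, `base_model`, and `pipeline_tag`) can be sanity-checked with a small script. The following is a minimal sketch, not part of the repository: `parse_front_matter`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical names, the parser only handles the flat `key: value` and `- item` list forms that appear in this card, and `CARD` is copied from the added lines above, so any tag that falls outside the diff hunks is absent.

```python
# Hypothetical checker for the model card front matter shown in the diff.
# It is a hand-rolled parser for flat keys and simple lists, not a YAML parser.

REQUIRED_KEYS = ("license", "base_model", "pipeline_tag", "library_name", "datasets")

# Front matter as it appears on the added side of the diff (tags outside the hunks omitted).
CARD = """\
---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-scene-graph
- scene-graph-generation
- trajectory-aware
- perceiver-resampler
- qwen2.5-vl
datasets:
- UWGZQ/Synthetic_Visual_Genome2
---
# TRASER
"""


def parse_front_matter(text):
    """Return the front matter between the first two '---' lines as a dict."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("document does not start with a front matter block")
    meta = {}
    current_list_key = None  # key whose list items are currently being read
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if not stripped:
            continue
        if stripped.startswith("- "):
            if current_list_key is None:
                raise ValueError("list item without a preceding key: " + stripped)
            meta[current_list_key].append(stripped[2:].strip())
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError("unrecognized line: " + stripped)
        key, value = key.strip(), value.strip()
        if value:
            meta[key] = value
            current_list_key = None
        else:
            meta[key] = []  # an empty value opens a list
            current_list_key = key
    raise ValueError("front matter block is not closed")


def missing_keys(meta, required=REQUIRED_KEYS):
    """List the required keys that are absent or empty."""
    return [key for key in required if not meta.get(key)]


if __name__ == "__main__":
    meta = parse_front_matter(CARD)
    print("missing keys:", missing_keys(meta))
```

For a real card, a full YAML parser is the safer choice; the hand-rolled parser here only keeps the sketch free of third-party dependencies.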