| | --- |
| | license: apache-2.0 |
| | language: |
| | - en |
| | tags: |
| | - video-scene-graph |
| | - scene-graph-generation |
| | - video-understanding |
| | - trajectory-aware |
| | - perceiver-resampler |
| | - qwen2.5-vl |
| | base_model: Qwen/Qwen2.5-VL-3B-Instruct |
| | pipeline_tag: video-text-to-text |
| | --- |
| | |
| | # TRASER
| |
|
| | TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time. |
| |
|
| | **Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543) |
| |
|
| | **Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/) |
| | |
| | **Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota) |
| |
|
| | --- |
| |
|
| | ## Model Architecture |
| |
|
| |  |
| |
|
| | TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**: |
| |
|
| | | Module | Abbrev. | Role | |
| | |---|---|---| |
| | | Object-Trajectory Resampler | **OTR** | Aggregates all cross-frame tokens for one object into a global summary | |
| | | Temporal-Windows Resampler | **TWR** | Compresses per-object tokens within each temporal window into a fixed set of latents | |
| |
|
| | For each tracked object the LLM sees a structured token block: |
| | ``` |
| | <obj_traj_start> Object N: <|vision_start|> |
| | [OTR: N latents] |
| | <t1-t2> [TWR: N latents] |
| | <t2-t3> [TWR: N latents] |
| | ... |
| | <|vision_end|> <obj_traj_end> |
| | ``` |
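
To make the layout above concrete, here is a minimal Python sketch that assembles the text skeleton of one object's block. It is illustrative only: the `<latent>` placeholder, the latent counts, and the integer window boundaries are assumptions made for this example, and in the actual model those positions hold OTR/TWR output embeddings rather than text.

```python
def build_object_block(obj_id, windows, num_otr_latents=2, num_twr_latents=2,
                       latent_placeholder="<latent>"):
    """Return the text skeleton of one object's token block.

    `latent_placeholder` and both latent counts are hypothetical stand-ins;
    the real model fills these positions with resampler embeddings.
    """
    # Header line, then the global (OTR) summary for the whole trajectory.
    lines = [
        f"<obj_traj_start> Object {obj_id}: <|vision_start|>",
        latent_placeholder * num_otr_latents,
    ]
    # One line per temporal window, each followed by its TWR latents.
    for start, end in windows:
        lines.append(f"<{start}-{end}> " + latent_placeholder * num_twr_latents)
    lines.append("<|vision_end|> <obj_traj_end>")
    return "\n".join(lines)


# Two temporal windows for object 1 (boundaries are made-up values).
block = build_object_block(1, [(0, 2), (2, 4)])
print(block)
```

The point of the arrangement is that everything the LLM needs about one object (a global summary plus per-window detail) sits in one contiguous span.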
| | --- |
| |
|
| | ## How to Get Started |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install "transformers>=4.54.0" torch pycocotools |
| | ``` |
| |
|
| | ### Prepare Inputs |
| |
|
| | Two inputs are required alongside the video: |
| |
|
| | - **Video**: any format supported by `qwen_vl_utils` (e.g. `.mp4`) |
| | - **Mask JSON**: per-frame, per-object RLE segmentation masks in COCO `pycocotools` format: |
| |
|
| | ```json |
| | [ |
| | // frame 0 |
| | [{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...], |
| | // frame 1 |
| | [...] |
| | ] |
| | ``` |
| |
|
| | See `example/2401075277_rle.json` for a complete example. |
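
Before running inference, it can help to sanity-check that a mask file follows the list-of-frames layout shown above. The sketch below uses only the standard library. The helper name `validate_mask_frames` is hypothetical and not part of this repository. It checks only the structure visible in the schema, and rejects anything else. Whether the real pipeline accepts other entries (for example, a placeholder for an object missing from a frame) is not stated here.

```python
import json


def validate_mask_frames(frames):
    """Check the frames -> objects -> RLE dict layout.

    Returns (number_of_frames, list_of_object_counts_per_frame).
    Raises ValueError on the first structural problem found.
    """
    if not isinstance(frames, list):
        raise ValueError("top level must be a list of frames")
    objects_per_frame = []
    for f_idx, frame in enumerate(frames):
        if not isinstance(frame, list):
            raise ValueError(f"frame {f_idx} must be a list of RLE dicts")
        for o_idx, rle in enumerate(frame):
            if not isinstance(rle, dict) or "size" not in rle or "counts" not in rle:
                raise ValueError(f"frame {f_idx}, object {o_idx}: need 'size' and 'counts'")
            size = rle["size"]
            if not (isinstance(size, list) and len(size) == 2
                    and all(isinstance(v, int) and v > 0 for v in size)):
                raise ValueError(f"frame {f_idx}, object {o_idx}: 'size' must be [H, W]")
        objects_per_frame.append(len(frame))
    return len(frames), objects_per_frame


# Tiny inline stand-in for a real file; the "counts" strings are placeholders.
text = '[[{"size": [4, 6], "counts": "..."}], [{"size": [4, 6], "counts": "..."}]]'
frames = json.loads(text)
summary = validate_mask_frames(frames)
print(summary)  # (2, [1, 1])
```

For a real file, replace the inline string with `json.load(open(path))`.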
| |
|
| | ### Run Inference |
| |
|
| | ```bash |
| | python inference.py \ |
| | --model_path /path/to/vsg_release_model \ |
| | --video_path /path/to/video.mp4 \ |
| | --mask_path /path/to/masks.json \ |
| | --out_dir ./output |
| | ``` |
| |
|
| | **CLI Arguments** |
| |
|
| | | Argument | Default | Description | |
| | |---|---|---| |
| | | `--model_path` | required | Path to this model directory | |
| | | `--video_path` | required | Input video file | |
| | | `--mask_path` | required | Per-object RLE mask JSON | |
| | | `--out_dir` | `./output` | Directory to write `output.txt` | |
| | | `--max_objects` | `40` | Maximum number of objects to process per video | |
| |
|
| | ### Quickstart with the Bundled Example |
| |
|
| | ```bash |
| | python inference.py \ |
| | --model_path . \ |
| | --video_path example/2401075277.mp4 \ |
| | --mask_path example/2401075277_rle.json \ |
| | --out_dir ./output |
| | ``` |
| |
|
| | ### Python API |
| |
|
| | ```python |
| | import torch |
| | from transformers import AutoProcessor, AutoTokenizer |
| | from modeling_traser import TRASER |
| | |
| | model_path = "/path/to/vsg_release_model" |
| | device = "cuda" |
| | |
| | model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device) |
| | processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct") |
| | processor.tokenizer = AutoTokenizer.from_pretrained(model_path) |
| | ``` |
| |
|
| | Then follow the preprocessing steps in `inference.py`: load masks → build object mask tensors → `select_tokens` → `rearrange_token` → `model.generate`. |
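
The repository listing describes `token_selection.py` as mask-based visual token selection with a coverage threshold. The sketch below illustrates that general idea in plain Python: keep a visual patch for an object only if enough of the patch is covered by the object's mask. It is not the repository's `select_tokens`, whose signature is not documented here. The function name, the patch size, and the `0.5` threshold are assumptions chosen for illustration.

```python
def select_patches_by_coverage(mask, patch=2, threshold=0.5):
    """Return (row, col) indices of patches whose mask coverage >= threshold.

    mask: 2D list of 0/1 values; height and width must be divisible by `patch`.
    `patch` and `threshold` are illustrative values, not the model's settings.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask dimensions must be divisible by the patch size")
    selected = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Count foreground pixels inside this patch.
            covered = sum(mask[y][x]
                          for y in range(r, r + patch)
                          for x in range(c, c + patch))
            if covered / (patch * patch) >= threshold:
                selected.append((r // patch, c // patch))
    return selected


# A 4x4 binary mask split into four 2x2 patches.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
kept = select_patches_by_coverage(mask)
print(kept)  # [(0, 0), (1, 1)]
```

A higher threshold keeps fewer, cleaner patches per object; a lower one keeps more context at the cost of background leakage.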
| |
|
| | --- |
| |
|
| | ## Repository Structure |
| |
|
| | ``` |
| | ├── modeling_traser.py # TRASER model class |
| | ├── inference.py # End-to-end inference script |
| | ├── config.json # Model configuration |
| | ├── generation_config.json # Default generation hyperparameters |
| | ├── model-00001-of-00002.safetensors |
| | ├── model-00002-of-00002.safetensors |
| | ├── model.safetensors.index.json |
| | ├── tokenizer_config.json |
| | ├── vocab.json |
| | ├── merges.txt |
| | ├── added_tokens.json |
| | ├── special_tokens_map.json |
| | ├── chat_template.jinja |
| | ├── resampler_utils/ |
| | │   ├── token_selection.py # Mask-based visual token selection (coverage threshold) |
| | │   └── token_arrangement.py # Token sequence rearrangement with OTR/TWR injection |
| | ├── qwen_vl_vsg_utils/ # Adapted Qwen-VL video processing utilities |
| | ├── static/ |
| | │   └── image.png # Architecture diagram |
| | └── example/ |
| |     ├── 2401075277.mp4 # Example video |
| |     └── 2401075277_rle.json # Example RLE segmentation masks |
| | ``` |
| |
|
| | --- |
| |
|
| | ## Training Data |
| |
|
| | TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset: |
| |
|
| | - **\~636K videos** with dense panoptic, per-frame annotations |
| | - **\~6.6M objects · \~52M attributes · \~6.7M relations** |
| |
|
| | --- |
| |
|
| |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @misc{gao2026syntheticvisualgenome2, |
| | title={Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos}, |
| | author={Ziqi Gao and Jieyu Zhang and Wisdom Oluchi Ikezogwo and Jae Sung Park and Tario G. You and Daniel Ogbu and Chenhao Zheng and Weikai Huang and Yinuo Yang and Winson Han and Quan Kong and Rajat Saini and Ranjay Krishna}, |
| | year={2026}, |
| | eprint={2602.23543}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CV}, |
| | url={https://arxiv.org/abs/2602.23543}, |
| | } |
| | ``` |
| |
|