---
license: apache-2.0
language:
- en
tags:
- video-scene-graph
- scene-graph-generation
- video-understanding
- trajectory-aware
- perceiver-resampler
- qwen2.5-vl
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: video-text-to-text
---
# TRASER
TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.

**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543)

**Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)

**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)

---
## Model Architecture

TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:
| Module | Abbrev. | Role |
|---|---|---|
| Object-Trajectory Resampler | **OTR** | Aggregates all cross-frame tokens for one object into a global summary |
| Temporal-Windows Resampler | **TWR** | Compresses per-object tokens within each temporal window into a fixed set of latents |

For each tracked object, the LLM sees a structured token block:
```
<obj_traj_start> Object N: <|vision_start|>
[OTR: N latents]
<t1-t2> [TWR: N latents]
<t2-t3> [TWR: N latents]
...
<|vision_end|> <obj_traj_end>
```
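
The layout above can be mimicked with a few lines of Python. This is only an illustrative sketch: `build_object_block`, the `[OTR]`/`[TWR]` placeholder strings, and the latent counts `n_otr` and `n_twr` are hypothetical stand-ins, not names or values taken from `modeling_traser.py`. In the model, the placeholders correspond to resampler output embeddings rather than text.

```python
def build_object_block(obj_id, windows, n_otr=2, n_twr=2):
    """Return the token layout for one tracked object as a list of strings.

    n_otr and n_twr are illustrative latent counts (assumptions); the real
    values are set by the model configuration.
    """
    parts = ["<obj_traj_start>", f"Object {obj_id}:", "<|vision_start|>"]
    # One global trajectory summary (Object-Trajectory Resampler).
    parts += ["[OTR]"] * n_otr
    # One fixed-size group of latents per temporal window (Temporal-Windows Resampler).
    for start, end in windows:
        parts.append(f"<{start}-{end}>")
        parts += ["[TWR]"] * n_twr
    parts += ["<|vision_end|>", "<obj_traj_end>"]
    return parts


block = build_object_block(1, [("t1", "t2"), ("t2", "t3")])
print(" ".join(block))
```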
---
## How to Get Started
### Installation
```bash
pip install "transformers>=4.54.0" torch pycocotools
```
### Prepare Inputs
Two inputs are required alongside the video:
- **Video**: any format supported by `qwen_vl_utils` (e.g. `.mp4`)
- **Mask JSON**: per-frame, per-object RLE segmentation masks in COCO `pycocotools` format:
```json
[
  // frame 0
  [{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
  // frame 1
  [...]
]
```
See `example/2401075277_rle.json` for a complete example.
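
Before running inference, it can help to check that a mask file has the nested layout shown above. The sketch below uses only the standard library; `summarize_masks` is a hypothetical helper, not part of this repository, and it assumes every entry is a populated RLE dict with `size` and `counts` keys.

```python
import json


def summarize_masks(frames):
    """Validate the frame -> object -> RLE layout and return objects per frame."""
    counts = []
    for f_idx, frame in enumerate(frames):
        if not isinstance(frame, list):
            raise ValueError(f"frame {f_idx} is not a list of RLE dicts")
        for o_idx, rle in enumerate(frame):
            size = rle.get("size") if isinstance(rle, dict) else None
            if not (isinstance(size, list) and len(size) == 2 and "counts" in rle):
                raise ValueError(
                    f"frame {f_idx}, object {o_idx}: expected 'size' [H, W] and 'counts'"
                )
        counts.append(len(frame))
    return counts


# Tiny stand-in for a real mask file. The "counts" strings are placeholders,
# not valid RLE payloads.
demo = json.loads(
    '[[{"size": [4, 6], "counts": "..."}, {"size": [4, 6], "counts": "..."}],'
    ' [{"size": [4, 6], "counts": "..."}]]'
)
print(summarize_masks(demo))  # [2, 1]
```

For a real file, load it with `json.load` and pass the result to the same function. Individual entries in this format can then typically be decoded to binary arrays with `pycocotools.mask.decode`.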
### Run Inference
```bash
python inference.py \
--model_path /path/to/vsg_release_model \
--video_path /path/to/video.mp4 \
--mask_path /path/to/masks.json \
--out_dir ./output
```
**CLI Arguments**
| Argument | Default | Description |
|---|---|---|
| `--model_path` | required | Path to this model directory |
| `--video_path` | required | Input video file |
| `--mask_path` | required | Per-object RLE mask JSON |
| `--out_dir` | `./output` | Directory to write `output.txt` |
| `--max_objects` | `40` | Maximum number of objects to process per video |
### Quickstart with the Bundled Example
```bash
python inference.py \
--model_path . \
--video_path example/2401075277.mp4 \
--mask_path example/2401075277_rle.json \
--out_dir ./output
```
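
To process several videos, the same command can be assembled programmatically. The following is a sketch under assumptions: `build_command` is a hypothetical helper, and the pairing of `<stem>.mp4` with `<stem>_rle.json` simply mirrors the bundled example. It uses only the flags listed in the table above.

```python
from pathlib import Path


def build_command(model_path, video_path, mask_path, out_dir="./output", max_objects=40):
    """Assemble the argument list for one inference.py run."""
    return [
        "python", "inference.py",
        "--model_path", str(model_path),
        "--video_path", str(video_path),
        "--mask_path", str(mask_path),
        "--out_dir", str(out_dir),
        "--max_objects", str(max_objects),
    ]


# Assumed layout: each video sits next to a mask file named <stem>_rle.json.
videos = [Path("example/2401075277.mp4")]
commands = [
    build_command(".", v, v.with_name(v.stem + "_rle.json"), Path("output") / v.stem)
    for v in videos
]
print(commands[0])
# Each list can be passed to subprocess.run(cmd, check=True) to launch the run.
```

Because the script writes a file named `output.txt` into `--out_dir`, the sketch gives each video its own output directory so that results are presumably not overwritten.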
### Python API
```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from modeling_traser import TRASER
model_path = "/path/to/vsg_release_model"
device = "cuda"
model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
processor.tokenizer = AutoTokenizer.from_pretrained(model_path)
```
Then follow the preprocessing steps in `inference.py`: load masks → build object mask tensors → `select_tokens` → `rearrange_token` → `model.generate`.

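
The repository listing describes the `select_tokens` step as mask-based selection with a coverage threshold. The idea can be sketched as follows. This is not the code in `resampler_utils/token_selection.py`: the function name `select_patches`, the patch size, and the threshold value are all assumptions chosen for illustration, and plain lists are used so the example runs without dependencies.

```python
def select_patches(mask, patch=2, threshold=0.5):
    """Return (row, col) indices of patches whose mask coverage >= threshold.

    mask is a 2D list of 0/1 values whose height and width are divisible
    by patch. Each patch stands in for one visual token.
    """
    rows, cols = len(mask) // patch, len(mask[0]) // patch
    selected = []
    for r in range(rows):
        for c in range(cols):
            # Count the mask pixels that fall inside this patch.
            covered = sum(
                mask[r * patch + i][c * patch + j]
                for i in range(patch)
                for j in range(patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((r, c))
    return selected


mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
print(select_patches(mask))  # [(0, 0), (1, 1)]
```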
---
## Repository Structure
```
├── modeling_traser.py              # TRASER model class
├── inference.py                    # End-to-end inference script
├── config.json                     # Model configuration
├── generation_config.json          # Default generation hyperparameters
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── added_tokens.json
├── special_tokens_map.json
├── chat_template.jinja
├── resampler_utils/
│   ├── token_selection.py          # Mask-based visual token selection (coverage threshold)
│   └── token_arrangement.py        # Token sequence rearrangement with OTR/TWR injection
├── qwen_vl_vsg_utils/              # Adapted Qwen-VL video processing utilities
├── static/
│   └── image.png                   # Architecture diagram
└── example/
    ├── 2401075277.mp4              # Example video
    └── 2401075277_rle.json         # Example RLE segmentation masks
```
---
## Training Data
TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:
- **\~636K videos** with dense panoptic, per-frame annotations
- **\~6.6M objects · \~52M attributes · \~6.7M relations**
---
## Citation
```bibtex
@misc{gao2026syntheticvisualgenome2,
title={Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
author={Ziqi Gao and Jieyu Zhang and Wisdom Oluchi Ikezogwo and Jae Sung Park and Tario G. You and Daniel Ogbu and Chenhao Zheng and Weikai Huang and Yinuo Yang and Winson Han and Quan Kong and Rajat Saini and Ranjay Krishna},
year={2026},
eprint={2602.23543},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.23543},
}
```