---
license: apache-2.0
language:
- en
tags:
- video-scene-graph
- scene-graph-generation
- video-understanding
- trajectory-aware
- perceiver-resampler
- qwen2.5-vl
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: video-text-to-text
---

# TRASER

TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.

**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543)

**Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)
 
**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)

---

## Model Architecture

![TRASER Architecture](static/model.png)

TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:

| Module | Abbrev. | Role |
|---|---|---|
| Object-Trajectory Resampler | **OTR** | Aggregates all cross-frame tokens for one object into a global summary |
| Temporal-Windows Resampler | **TWR** | Compresses per-object tokens within each temporal window into a fixed set of latents |

For each tracked object the LLM sees a structured token block:
```
<obj_traj_start>  Object N:  <|vision_start|>
  [OTR: N latents]
  <t1-t2>  [TWR: N latents]
  <t2-t3>  [TWR: N latents]
  ...
<|vision_end|>  <obj_traj_end>
```
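
As a rough illustration of that layout, the sketch below assembles the block for one object as a list of placeholder strings. The helper name `object_block`, the `<latent>` placeholder, the latent counts (`otr_latents`, `twr_latents`), and the exact timestamp label format are assumptions made for illustration; the released values come from the model configuration and `resampler_utils/token_arrangement.py`.

```python
def object_block(obj_id, windows, otr_latents=4, twr_latents=2):
    """Return the placeholder token sequence for one tracked object.

    windows: list of (start, end) labels, one per temporal window.
    otr_latents / twr_latents: hypothetical latent counts, not the released values.
    """
    tokens = ["<obj_traj_start>", f"Object {obj_id}:", "<|vision_start|>"]
    # One global summary of the whole trajectory (OTR).
    tokens += ["<latent>"] * otr_latents
    # One timestamp label plus a fixed group of latents per temporal window (TWR).
    for start, end in windows:
        tokens.append(f"<{start}-{end}>")
        tokens += ["<latent>"] * twr_latents
    tokens += ["<|vision_end|>", "<obj_traj_end>"]
    return tokens


block = object_block(0, windows=[(0, 2), (2, 4)])
print(" ".join(block))
# Per-object length: 5 wrapper tokens + otr_latents + len(windows) * (1 + twr_latents)
print(len(block))
```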
---

## How to Get Started

### Installation

```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "transformers>=4.54.0" torch pycocotools
```

### Prepare Inputs

Inference requires two inputs:

- **Video**: any format supported by `qwen_vl_utils` (e.g. `.mp4`).
- **Mask JSON**: per-frame, per-object RLE segmentation masks in COCO `pycocotools` format, structured as a list of frames, each holding one RLE dict per object:

```json
[
  // frame 0
  [{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
  // frame 1
  [...]
]
```

See `example/2401075277_rle.json` for a complete example.
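
Before running inference, it can help to check that a mask file follows this nesting. The sketch below is a minimal validator, not part of the repository: the function names `check_mask_json` and `decode_frame` are hypothetical, the sample uses placeholder `counts` strings, and it assumes every entry is an RLE dict (how absent objects are encoded is not specified here). `pycocotools.mask.decode` is used only in the optional decoding helper.

```python
import json


def check_mask_json(frames):
    """Validate the nested layout and return (num_frames, max_objects_per_frame)."""
    if not isinstance(frames, list):
        raise ValueError("top level must be a list of frames")
    max_objects = 0
    for f_idx, frame in enumerate(frames):
        if not isinstance(frame, list):
            raise ValueError(f"frame {f_idx} must be a list of RLE dicts")
        for o_idx, rle in enumerate(frame):
            size = rle.get("size") if isinstance(rle, dict) else None
            if not (isinstance(size, list) and len(size) == 2 and "counts" in rle):
                raise ValueError(
                    f"frame {f_idx}, object {o_idx}: expected 'size' [H, W] and 'counts'"
                )
        max_objects = max(max_objects, len(frame))
    return len(frames), max_objects


def decode_frame(frame):
    """Decode one frame's RLE dicts into binary H x W arrays (needs pycocotools)."""
    from pycocotools import mask as mask_utils  # lazy import: validation works without it
    return [mask_utils.decode(rle) for rle in frame]


# Structure-only sample; the "counts" values are placeholders, not real RLE strings.
sample = json.loads(
    '[[{"size": [4, 6], "counts": "..."}],'
    ' [{"size": [4, 6], "counts": "..."}, {"size": [4, 6], "counts": "..."}]]'
)
print(check_mask_json(sample))
```

To check a real file, load it with `json.load(open("example/2401075277_rle.json"))` and pass the result to `check_mask_json`.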

### Run Inference

```bash
python inference.py \
    --model_path /path/to/vsg_release_model \
    --video_path /path/to/video.mp4 \
    --mask_path /path/to/masks.json \
    --out_dir ./output
```

**CLI Arguments**

| Argument | Default | Description |
|---|---|---|
| `--model_path` | required | Path to this model directory |
| `--video_path` | required | Input video file |
| `--mask_path` | required | Per-object RLE mask JSON |
| `--out_dir` | `./output` | Directory to write `output.txt` |
| `--max_objects` | `40` | Maximum number of objects to process per video |

### Quickstart with the Bundled Example

```bash
python inference.py \
    --model_path . \
    --video_path example/2401075277.mp4 \
    --mask_path example/2401075277_rle.json \
    --out_dir ./output
```
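
To process several videos, the same command can be assembled programmatically. The sketch below uses only the flags documented in the table above; the helper names `inference_command` and `commands_for_directory`, the `<name>.mp4` / `<name>_rle.json` pairing convention (borrowed from the bundled example), and the choice of one output directory per video are assumptions.

```python
import shlex
import sys
from pathlib import Path


def inference_command(model_path, video_path, mask_path, out_dir="./output", max_objects=40):
    """Build the argv list for one inference.py run using the documented flags."""
    return [
        sys.executable, "inference.py",
        "--model_path", str(model_path),
        "--video_path", str(video_path),
        "--mask_path", str(mask_path),
        "--out_dir", str(out_dir),
        "--max_objects", str(max_objects),
    ]


def commands_for_directory(model_path, data_dir, out_root):
    """Pair each <name>.mp4 with <name>_rle.json; give every video its own out_dir."""
    commands = []
    for video in sorted(Path(data_dir).glob("*.mp4")):
        masks = video.with_name(video.stem + "_rle.json")
        if masks.exists():  # skip videos without a matching mask file
            out_dir = Path(out_root) / video.stem
            commands.append(inference_command(model_path, video, masks, out_dir))
    return commands


cmd = inference_command(".", "example/2401075277.mp4", "example/2401075277_rle.json")
print(shlex.join(cmd))  # run it with subprocess.run(cmd, check=True)
```

A separate `--out_dir` per video is used here because each run writes a file named `output.txt`, so runs sharing one directory would presumably overwrite each other.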

### Python API

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from modeling_traser import TRASER

model_path = "/path/to/vsg_release_model"
device = "cuda"

model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
processor.tokenizer = AutoTokenizer.from_pretrained(model_path)
```

Then follow the preprocessing steps in `inference.py`: load masks → build object mask tensors → `select_tokens` → `rearrange_token` → `model.generate`.
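
The token selection step is described in this repository as mask-based with a coverage threshold. The sketch below illustrates that general idea in plain Python and is not the repository's `select_tokens`: the function name `select_patches`, the patch size, and the threshold value are all assumptions, and the real code presumably operates on tensors at the vision encoder's patch resolution.

```python
def select_patches(mask, patch=2, threshold=0.5):
    """Return (row, col) indices of patches whose mask coverage is >= threshold.

    mask: 2D list of 0/1 values whose height and width are divisible by `patch`.
    patch / threshold: illustrative values, not the released settings.
    """
    height, width = len(mask), len(mask[0])
    selected = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Count the object pixels that fall inside this patch.
            covered = sum(
                mask[y][x] for y in range(r, r + patch) for x in range(c, c + patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((r // patch, c // patch))
    return selected


mask = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
print(select_patches(mask))                  # only the fully covered top-left patch
print(select_patches(mask, threshold=0.25))  # a lower threshold keeps partially covered patches
```

A higher threshold keeps only tokens dominated by the object, while a lower one also admits boundary patches that mix object and background.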

---

## Repository Structure

```
├── modeling_traser.py           # TRASER model class
├── inference.py                 # End-to-end inference script
├── config.json                  # Model configuration
├── generation_config.json       # Default generation hyperparameters
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── added_tokens.json
├── special_tokens_map.json
├── chat_template.jinja
├── resampler_utils/
│   ├── token_selection.py       # Mask-based visual token selection (coverage threshold)
│   └── token_arrangement.py     # Token sequence rearrangement with OTR/TWR injection
├── qwen_vl_vsg_utils/           # Adapted Qwen-VL video processing utilities
├── static/
│   └── image.png                # Architecture diagram
└── example/
    ├── 2401075277.mp4           # Example video
    └── 2401075277_rle.json      # Example RLE segmentation masks
```

---

## Training Data

TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:

- **\~636K videos** with dense panoptic, per-frame annotations
- **\~6.6M objects · \~52M attributes · \~6.7M relations**

---


## Citation

```bibtex
@misc{gao2026syntheticvisualgenome2,
      title={Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos}, 
      author={Ziqi Gao and Jieyu Zhang and Wisdom Oluchi Ikezogwo and Jae Sung Park and Tario G. You and Daniel Ogbu and Chenhao Zheng and Weikai Huang and Yinuo Yang and Winson Han and Quan Kong and Rajat Saini and Ranjay Krishna},
      year={2026},
      eprint={2602.23543},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.23543}, 
}
```