	---
	license: apache-2.0
	language:
	- en
	tags:
	- video-scene-graph
	- scene-graph-generation
	- video-understanding
	- trajectory-aware
	- perceiver-resampler
	- qwen2.5-vl
	base_model: Qwen/Qwen2.5-VL-3B-Instruct
	pipeline_tag: video-text-to-text
	---

	# TRASER

	TRASER is the video scene graph generation model introduced in Synthetic Visual Genome 2 (SVG2). Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.

	Paper: [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543)

	Website: [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)

	Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)

	---

	## Model Architecture

	![TRASER Architecture](static/model.png)

	TRASER extends Qwen2.5-VL-3B-Instruct with two trainable Perceiver Resampler modules that implement Trajectory-Aligned Token Arrangement:

	| Module | Abbrev. | Role |
	|---|---|---|
	| Object-Trajectory Resampler | OTR | Aggregates all cross-frame tokens for one object into a global summary |
	| Temporal-Windows Resampler | TWR | Compresses per-object tokens within each temporal window into a fixed set of latents |

	For each tracked object the LLM sees a structured token block:
	```
	<obj_traj_start> Object N: <|vision_start|>
	[OTR: N latents]
	<t1-t2> [TWR: N latents]
	<t2-t3> [TWR: N latents]
	...
	<|vision_end|> <obj_traj_end>
	```
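
The layout above can be sketched in plain Python to make the ordering concrete. This is only a text mock-up: `build_object_block`, the `[OTR]`/`[TWR]` placeholder strings, the numeric window tags, and the latent counts are all hypothetical. The model itself injects resampler latent embeddings, not text.

```python
from typing import List, Tuple


def build_object_block(
    obj_id: int,
    windows: List[Tuple[float, float]],
    n_otr: int = 2,  # assumed latent count, for illustration only
    n_twr: int = 2,  # assumed latent count, for illustration only
) -> str:
    """Return a text mock-up of the per-object token block."""
    lines = [f"<obj_traj_start> Object {obj_id}: <|vision_start|>"]
    # One global summary of the whole trajectory (OTR).
    lines.append(" ".join(["[OTR]"] * n_otr))
    # One fixed-size group of latents per temporal window (TWR).
    for start, end in windows:
        lines.append(f"<{start:g}-{end:g}> " + " ".join(["[TWR]"] * n_twr))
    lines.append("<|vision_end|> <obj_traj_end>")
    return "\n".join(lines)


block = build_object_block(1, [(0, 2), (2, 4)])
print(block)
```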
	---

	## How to Get Started

	### Installation

	```bash
	pip install "transformers>=4.54.0" torch pycocotools
	```

	### Prepare Inputs

	Two inputs are required alongside the video:

	- Video — any format supported by `qwen_vl_utils` (e.g. `.mp4`)
	- Mask JSON — per-frame, per-object RLE segmentation masks in COCO `pycocotools` format:

	```json
	[
	// frame 0
	[{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
	// frame 1
	[...]
	]
	```

	See `example/2401075277_rle.json` for a complete example.
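
To make the nesting concrete, the sketch below checks that parsed mask data has the frame → object → RLE layout shown above. `summarize_mask_json` is a hypothetical helper, not part of this repository. It also assumes that every frame lists the same objects in the same order, which this README does not state; `inference.py` remains the authoritative loading code.

```python
import json
from typing import Any, Dict, List


def summarize_mask_json(frames: List[List[Dict[str, Any]]]) -> Dict[str, int]:
    """Validate the frame -> object -> RLE nesting and report basic counts."""
    if not frames or not frames[0]:
        raise ValueError("mask JSON contains no frames or no objects")
    num_objects = len(frames[0])
    for t, objects in enumerate(frames):
        # Assumption: every frame lists the same objects in the same order.
        if len(objects) != num_objects:
            raise ValueError(f"frame {t}: {len(objects)} objects, expected {num_objects}")
        for k, rle in enumerate(objects):
            if not {"size", "counts"} <= rle.keys():
                raise ValueError(f"frame {t}, object {k}: missing 'size' or 'counts'")
    height, width = frames[0][0]["size"]
    return {"frames": len(frames), "objects": num_objects, "height": height, "width": width}


# Tiny in-memory stand-in for a real file; the "counts" strings are placeholders, not valid RLE.
example = json.loads('[[{"size": [4, 6], "counts": "x"}], [{"size": [4, 6], "counts": "y"}]]')
summary = summarize_mask_json(example)
print(summary)
```

With `pycocotools` installed, each real entry can then be decoded to a binary `H × W` array with `pycocotools.mask.decode(rle)`.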

	### Run Inference

	```bash
	python inference.py \
	--model_path /path/to/vsg_release_model \
	--video_path /path/to/video.mp4 \
	--mask_path /path/to/masks.json \
	--out_dir ./output
	```

	CLI Arguments

	| Argument | Default | Description |
	|---|---|---|
	| `--model_path` | required | Path to this model directory |
	| `--video_path` | required | Input video file |
	| `--mask_path` | required | Per-object RLE mask JSON |
	| `--out_dir` | `./output` | Directory to write `output.txt` |
	| `--max_objects` | `40` | Maximum number of objects to process per video |
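
For readers wrapping the script in their own tooling, the table maps naturally onto an `argparse` parser. The following is an illustrative reconstruction from the table only. The function `build_parser` is hypothetical, and the real `inference.py` may declare its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(description="TRASER inference (illustrative parser)")
    parser.add_argument("--model_path", required=True, help="Path to this model directory")
    parser.add_argument("--video_path", required=True, help="Input video file")
    parser.add_argument("--mask_path", required=True, help="Per-object RLE mask JSON")
    parser.add_argument("--out_dir", default="./output", help="Directory to write output.txt")
    parser.add_argument("--max_objects", type=int, default=40,
                        help="Maximum number of objects to process per video")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--model_path", ".",
    "--video_path", "example/2401075277.mp4",
    "--mask_path", "example/2401075277_rle.json",
])
print(args.out_dir, args.max_objects)
```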

	### Quickstart with the Bundled Example

	```bash
	python inference.py \
	--model_path . \
	--video_path example/2401075277.mp4 \
	--mask_path example/2401075277_rle.json \
	--out_dir ./output
	```

	### Python API

	```python
	import torch
	from transformers import AutoProcessor, AutoTokenizer
	from modeling_traser import TRASER

	model_path = "/path/to/vsg_release_model"
	device = "cuda"

	model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
	processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
	processor.tokenizer = AutoTokenizer.from_pretrained(model_path)
	```

	Then follow the preprocessing steps in `inference.py`: load masks → build object mask tensors → `select_tokens` → `rearrange_token` → `model.generate`.

	---

	## Repository Structure

	```
	├── modeling_traser.py                 # TRASER model class
	├── inference.py                       # End-to-end inference script
	├── config.json                        # Model configuration
	├── generation_config.json             # Default generation hyperparameters
	├── model-00001-of-00002.safetensors
	├── model-00002-of-00002.safetensors
	├── model.safetensors.index.json
	├── tokenizer_config.json
	├── vocab.json
	├── merges.txt
	├── added_tokens.json
	├── special_tokens_map.json
	├── chat_template.jinja
	├── resampler_utils/
	│   ├── token_selection.py             # Mask-based visual token selection (coverage threshold)
	│   └── token_arrangement.py           # Token sequence rearrangement with OTR/TWR injection
	├── qwen_vl_vsg_utils/                 # Adapted Qwen-VL video processing utilities
	├── static/
	│   └── image.png                      # Architecture diagram
	└── example/
	    ├── 2401075277.mp4                 # Example video
	    └── 2401075277_rle.json            # Example RLE segmentation masks
	```
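
The "coverage threshold" noted for `token_selection.py` can be illustrated with a small, dependency-free sketch: a visual token (patch) is kept for an object when the object's mask covers at least a given fraction of that patch. The function name, patch size, and threshold values below are assumptions for illustration. The repository's `select_tokens` may use a different grid, threshold, or signature.

```python
from typing import List, Tuple


def select_patches(mask: List[List[int]], patch: int, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) indices of patches whose mask coverage is >= threshold."""
    height, width = len(mask), len(mask[0])
    selected = []
    # Only full patches are considered; partial patches at the border are skipped.
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            covered = sum(
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((top // patch, left // patch))
    return selected


# 4x4 binary mask split into a 2x2 grid of 2x2 patches.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
kept = select_patches(mask, patch=2, threshold=0.5)
print(kept)
```

Lowering the threshold keeps more boundary patches, while raising it keeps only patches that lie mostly inside the object.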

	---

	## Training Data

	TRASER is trained on [SVG2](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:

	- ~636K videos with dense panoptic, per-frame annotations
	- ~6.6M objects · ~52M attributes · ~6.7M relations

	---


	## Citation

	```bibtex
	@misc{gao2026syntheticvisualgenome2,
	title={Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
	author={Ziqi Gao and Jieyu Zhang and Wisdom Oluchi Ikezogwo and Jae Sung Park and Tario G. You and Daniel Ogbu and Chenhao Zheng and Weikai Huang and Yinuo Yang and Winson Han and Quan Kong and Rajat Saini and Ranjay Krishna},
	year={2026},
	eprint={2602.23543},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2602.23543},
	}
	```