---
license: apache-2.0
---

# TraceGen Benchmark Leaderboard

## Benchmark: TraceGen Evaluation Suite

We evaluate models on **5 environments** using the official TraceGen metrics.
Each environment reports **MSE**, **MAE**, and **Endpoint MSE** on held-out test sets.

## Testing on the TraceGen benchmark

Use the official evaluation code provided at:

[https://github.com/jayLEE0301/TraceGen](https://github.com/jayLEE0301/TraceGen)

### Multi-GPU

```
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
    test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
```

### Single-GPU

```
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
```

To **reproduce the environment-specific benchmark results** reported below, evaluate the **environment-specific checkpoints** `TraceGen_{EnvName}` from the [TraceGen Collection](https://huggingface.co/collections/furonghuang-lab/tracegen), each of which is trained on data from the corresponding environment only.

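A minimal sketch of fetching an environment-specific checkpoint with `huggingface_hub`. The `furonghuang-lab/TraceGen_{EnvName}` repo-id pattern is an assumption inferred from the collection and checkpoint names above; confirm the exact repo ids on the collection page before use.

```python
# Requires: pip install huggingface_hub

# Environments covered by the benchmark table below.
TRACEGEN_ENVS = {"EpicKitchen", "Droid", "Bridge", "Libero", "Robomimic"}


def checkpoint_repo_id(env_name: str) -> str:
    """Build the Hugging Face repo id for an environment-specific checkpoint.

    The naming pattern is an assumption; verify it against the collection page.
    """
    if env_name not in TRACEGEN_ENVS:
        raise ValueError(f"Unknown environment: {env_name!r}")
    return f"furonghuang-lab/TraceGen_{env_name}"


def download_checkpoint(env_name: str) -> str:
    """Download the checkpoint repo and return its local directory path."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=checkpoint_repo_id(env_name))
```

The returned directory can then be passed to the evaluation command via `--resume`.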
**Metric definition.**
All reported errors are computed in a **normalized coordinate space**:
both input images and predicted traces are scaled to the range **[0, 1]** prior to evaluation.
Accordingly, the reported MSE, MAE, and Endpoint MSE reflect **absolute errors within the normalized image space**.

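Under this definition, the three metrics can be sketched as follows. This is a minimal NumPy illustration, not the official evaluation code; the `(T, 2)` trace shape and the function name are assumptions.

```python
import numpy as np


def tracegen_metrics(pred, gt):
    """Compute MSE, MAE, and Endpoint MSE between a predicted and a
    ground-truth trace, both of shape (T, 2) with (x, y) coordinates
    already normalized to [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mse = np.mean((pred - gt) ** 2)                   # mean squared error over all coordinates
    mae = np.mean(np.abs(pred - gt))                  # mean absolute error
    endpoint_mse = np.mean((pred[-1] - gt[-1]) ** 2)  # squared error at the final trace point
    return mse, mae, endpoint_mse
```

Note that the leaderboard reports values scaled by ×1e−2, so a raw MSE of 0.00445 appears in the table as 0.445.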
| Environment | Metric       | TraceGen (×1e−2) |
| ----------- | ------------ | ---------------- |
| EpicKitchen | MSE          | 0.445 |
|             | MAE          | 2.721 |
|             | Endpoint MSE | 0.791 |
| Droid       | MSE          | 0.206 |
|             | MAE          | 1.289 |
|             | Endpoint MSE | 0.285 |
| Bridge      | MSE          | 0.653 |
|             | MAE          | 2.419 |
|             | Endpoint MSE | 0.607 |
| Libero      | MSE          | 0.276 |
|             | MAE          | 1.442 |
|             | Endpoint MSE | 0.385 |
| Robomimic   | MSE          | 0.138 |
|             | MAE          | 1.416 |
|             | Endpoint MSE | 0.151 |

### Submitting to the Leaderboard

- Use the provided evaluation script:
  https://github.com/jayLEE0301/TraceGen
- Report metrics on the **official test split**, using the corresponding dataset from:
  https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding `TraceGen_{EnvName}` checkpoint.
- Open a PR or submit results via GitHub Issues.