---
license: apache-2.0
---

# TraceGen Benchmark Leaderboard

![TraceGen Benchmark Overview](https://raw.githubusercontent.com/jayLEE0301/TraceGen/main/assets/tracegen_fig2.png)

## Benchmark: TraceGen Evaluation Suite

We evaluate models on **5 environments** using the official TraceGen metrics. Each environment reports **MSE**, **MAE**, and **Endpoint MSE** on held-out test sets.

## Test on the TraceGen benchmark

Use the official evaluation code provided at [https://github.com/jayLEE0301/TraceGen](https://github.com/jayLEE0301/TraceGen).

### Multi-GPU

```
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
  test_benchmark.py \
  --config cfg/train.yaml \
  --override \
    train.batch_size=8 \
    train.lr_decoder=1.5e-4 \
    model.decoder.num_layers=6 \
    model.decoder.num_attention_heads=12 \
    model.decoder.latent_dim=768 \
    data.num_workers=4 \
    hardware.mixed_precision=true \
    logging.use_wandb=true \
    logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```

### Single-GPU

```
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
  --config cfg/train.yaml \
  --override \
    train.batch_size=8 \
    train.lr_decoder=1.5e-4 \
    model.decoder.num_layers=6 \
    model.decoder.num_attention_heads=12 \
    model.decoder.latent_dim=768 \
    data.num_workers=4 \
    hardware.mixed_precision=true \
    logging.use_wandb=true \
    logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```

To **reproduce the environment-specific benchmark results** reported below, evaluate the **environment-specific checkpoints** `TraceGen_{EnvName}` from the [TraceGen Collection](https://huggingface.co/collections/furonghuang-lab/tracegen); each checkpoint is trained on data from the corresponding environment only.

**Metric definition.** All reported errors are computed in a **normalized coordinate space**: both input images and predicted traces are scaled to the range **[0, 1]** prior to evaluation. Accordingly, the reported MSE, MAE, and Endpoint MSE reflect **absolute errors within the normalized image space**.

| Environment | Metric | TraceGen (×1e−2) |
| ----------- | ------------ | ---------------- |
| EpicKitchen | MSE | 0.445 |
| | MAE | 2.721 |
| | Endpoint MSE | 0.791 |
| Droid | MSE | 0.206 |
| | MAE | 1.289 |
| | Endpoint MSE | 0.285 |
| Bridge | MSE | 0.653 |
| | MAE | 2.419 |
| | Endpoint MSE | 0.607 |
| Libero | MSE | 0.276 |
| | MAE | 1.442 |
| | Endpoint MSE | 0.385 |
| Robomimic | MSE | 0.138 |
| | MAE | 1.416 |
| | Endpoint MSE | 0.151 |

### Submitting to the Leaderboard

- Use the provided evaluation script: https://github.com/jayLEE0301/TraceGen
- Report metrics on the **official test split**, using the corresponding dataset from https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding `TraceGen_{EnvName}` checkpoint.
- Open a PR or submit results via GitHub Issues.
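For reference, the normalized-space metrics described above can be sketched as follows. This is a minimal illustration, not the official implementation (use `test_benchmark.py` from the repository for leaderboard submissions); the helper name `trace_metrics` and the `(T, 2)` trace layout are assumptions for this example.

```python
import numpy as np

def trace_metrics(pred, gt):
    """Compute MSE, MAE, and Endpoint MSE between a predicted and a
    ground-truth trace, both given as (T, 2) arrays of (x, y) points
    in normalized [0, 1] image coordinates.

    Note: illustrative sketch only -- see test_benchmark.py in the
    official repo for the evaluation used on the leaderboard.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    err = pred - gt
    mse = np.mean(err ** 2)               # mean squared error over all coordinates
    mae = np.mean(np.abs(err))            # mean absolute error over all coordinates
    endpoint_mse = np.mean(err[-1] ** 2)  # squared error at the final trace point
    return mse, mae, endpoint_mse

# Example: a predicted trace uniformly offset from ground truth by 0.01
gt = np.array([[0.10, 0.10], [0.50, 0.50], [0.90, 0.90]])
pred = gt + 0.01
mse, mae, ep = trace_metrics(pred, gt)
print(f"MSE={mse:.4f}  MAE={mae:.4f}  Endpoint MSE={ep:.4f}")
# -> MSE=0.0001  MAE=0.0100  Endpoint MSE=0.0001
```

Because all coordinates live in [0, 1], a table entry such as 0.445 ×1e−2 corresponds to an MSE of 0.00445 in this normalized space.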