---
license: apache-2.0
---

# TraceGen Benchmark Leaderboard

## Benchmark: TraceGen Evaluation Suite

We evaluate models on **5 environments** using the official TraceGen metrics.
Each environment reports **MSE**, **MAE**, and **Endpoint MSE** on held-out test sets.

## Testing on the TraceGen benchmark

Use the official evaluation code provided at:

[https://github.com/jayLEE0301/TraceGen](https://github.com/jayLEE0301/TraceGen)

### Multi-GPU

```
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
    test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
```

### Single-GPU

```
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
```

To **reproduce the environment-specific benchmark results** reported below, evaluate the **environment-specific checkpoints** `TraceGen_{EnvName}` from the [TraceGen Collection](https://huggingface.co/collections/furonghuang-lab/tracegen), each of which is trained on data from the corresponding environment only.

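A minimal sketch of fetching an environment-specific checkpoint with `huggingface_hub`. The `furonghuang-lab/TraceGen_{EnvName}` repo-id pattern is an assumption inferred from the collection and checkpoint names above; confirm the exact repo ids on the collection page before use.

```python
# Requires: pip install huggingface_hub

# Environments covered by the benchmark table below.
TRACEGEN_ENVS = {"EpicKitchen", "Droid", "Bridge", "Libero", "Robomimic"}


def checkpoint_repo_id(env_name: str) -> str:
    """Build the Hugging Face repo id for an environment-specific checkpoint.

    The naming pattern is an assumption; verify it against the collection page.
    """
    if env_name not in TRACEGEN_ENVS:
        raise ValueError(f"Unknown environment: {env_name!r}")
    return f"furonghuang-lab/TraceGen_{env_name}"


def download_checkpoint(env_name: str) -> str:
    """Download the checkpoint repo and return its local directory path."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=checkpoint_repo_id(env_name))
```

The returned directory can then be passed to the evaluation command via `--resume`.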
**Metric definition.**
All reported errors are computed in a **normalized coordinate space**:
both input images and predicted traces are scaled to the range **[0, 1]** prior to evaluation.
Accordingly, the reported MSE, MAE, and Endpoint MSE reflect **absolute errors within the normalized image space**.

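Under this definition, the three metrics can be sketched as follows. This is a minimal NumPy illustration, not the official evaluation code; the `(T, 2)` trace shape and the function name are assumptions.

```python
import numpy as np


def tracegen_metrics(pred, gt):
    """Compute MSE, MAE, and Endpoint MSE between a predicted and a
    ground-truth trace, both of shape (T, 2) with (x, y) coordinates
    already normalized to [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mse = np.mean((pred - gt) ** 2)                   # mean squared error over all coordinates
    mae = np.mean(np.abs(pred - gt))                  # mean absolute error
    endpoint_mse = np.mean((pred[-1] - gt[-1]) ** 2)  # squared error at the final trace point
    return mse, mae, endpoint_mse
```

Note that the leaderboard reports values scaled by ×1e−2, so a raw MSE of 0.00445 appears in the table as 0.445.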
| Environment | Metric       | TraceGen (×1e−2) |
| ----------- | ------------ | ---------------- |
| EpicKitchen | MSE          | 0.445 |
|             | MAE          | 2.721 |
|             | Endpoint MSE | 0.791 |
| Droid       | MSE          | 0.206 |
|             | MAE          | 1.289 |
|             | Endpoint MSE | 0.285 |
| Bridge      | MSE          | 0.653 |
|             | MAE          | 2.419 |
|             | Endpoint MSE | 0.607 |
| Libero      | MSE          | 0.276 |
|             | MAE          | 1.442 |
|             | Endpoint MSE | 0.385 |
| Robomimic   | MSE          | 0.138 |
|             | MAE          | 1.416 |
|             | Endpoint MSE | 0.151 |

### Submitting to the Leaderboard

- Use the provided evaluation script:
  https://github.com/jayLEE0301/TraceGen
- Report metrics on the **official test split**, using the corresponding dataset from:
  https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding `TraceGen_{EnvName}` checkpoint.
- Open a PR or submit results via GitHub Issues.