---
license: apache-2.0
---
# TraceGen Benchmark Leaderboard

## Benchmark: TraceGen Evaluation Suite
We evaluate models on **5 environments** using the official TraceGen metrics.
Each environment reports **MSE**, **MAE**, and **Endpoint MSE** on held-out test sets.
## Testing on the TraceGen Benchmark
Use the official evaluation code provided at
[https://github.com/jayLEE0301/TraceGen](https://github.com/jayLEE0301/TraceGen).
### Multi-GPU
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
  test_benchmark.py \
  --config cfg/train.yaml \
  --override \
    train.batch_size=8 \
    train.lr_decoder=1.5e-4 \
    model.decoder.num_layers=6 \
    model.decoder.num_attention_heads=12 \
    model.decoder.latent_dim=768 \
    data.num_workers=4 \
    hardware.mixed_precision=true \
    logging.use_wandb=true \
    logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```
### Single-GPU
```bash
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
  --config cfg/train.yaml \
  --override \
    train.batch_size=8 \
    train.lr_decoder=1.5e-4 \
    model.decoder.num_layers=6 \
    model.decoder.num_attention_heads=12 \
    model.decoder.latent_dim=768 \
    data.num_workers=4 \
    hardware.mixed_precision=true \
    logging.use_wandb=true \
    logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```
To **reproduce the environment-specific benchmark results** reported below,
evaluate the **environment-specific checkpoints**
`TraceGen_{EnvName}` from the [TraceGen Collection](https://huggingface.co/collections/furonghuang-lab/tracegen), each of which is trained only on data from the corresponding environment.
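To resolve an environment name to its checkpoint repository, a minimal sketch is shown below. The `checkpoint_repo` helper and the `furonghuang-lab/TraceGen_{EnvName}` repository naming are assumptions inferred from the convention above; verify the exact repository ids on the collection page.

```python
# Hypothetical helper: map a benchmark environment name to its assumed
# Hugging Face Hub repo id, following the TraceGen_{EnvName} convention.
ENVS = ["EpicKitchen", "Droid", "Bridge", "Libero", "Robomimic"]

def checkpoint_repo(env_name: str) -> str:
    if env_name not in ENVS:
        raise ValueError(f"unknown environment: {env_name}")
    return f"furonghuang-lab/TraceGen_{env_name}"

repo_id = checkpoint_repo("Libero")
# Then fetch the checkpoint files (network required), e.g. with
#   from huggingface_hub import snapshot_download
#   local_path = snapshot_download(repo_id=repo_id)
# and pass local_path to --resume in the commands above.
```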
**Metric definition.**
All reported errors are computed in a **normalized coordinate space**:
both input images and predicted traces are scaled to the range **[0, 1]** prior to evaluation.
Accordingly, the reported MSE, MAE, and Endpoint MSE reflect **absolute errors within the normalized image space**.
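As a concrete illustration of these definitions, the sketch below computes the three errors for a pair of traces already scaled to [0, 1]. The `trace_metrics` helper is hypothetical, not the official TraceGen evaluation code; use the official script for reported numbers.

```python
import numpy as np

def trace_metrics(pred, gt):
    """MSE, MAE, and Endpoint MSE between a predicted and a ground-truth
    trace. Both are (T, 2) arrays of (x, y) points assumed to be already
    normalized to [0, 1]. Illustrative sketch, not the official metric code."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mse = np.mean((pred - gt) ** 2)                   # mean squared error over all coordinates
    mae = np.mean(np.abs(pred - gt))                  # mean absolute error
    endpoint_mse = np.mean((pred[-1] - gt[-1]) ** 2)  # squared error at the final trace point
    return mse, mae, endpoint_mse

# Example: a 3-point trace in normalized image coordinates
gt = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
pred = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
mse, mae, ep = trace_metrics(pred, gt)
```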
| Environment | Metric | TraceGen (×1e−2) |
| ----------- | ------------ | ---------------- |
| EpicKitchen | MSE | 0.445 |
| | MAE | 2.721 |
| | Endpoint MSE | 0.791 |
| Droid | MSE | 0.206 |
| | MAE | 1.289 |
| | Endpoint MSE | 0.285 |
| Bridge | MSE | 0.653 |
| | MAE | 2.419 |
| | Endpoint MSE | 0.607 |
| Libero | MSE | 0.276 |
| | MAE | 1.442 |
| | Endpoint MSE | 0.385 |
| Robomimic | MSE | 0.138 |
| | MAE | 1.416 |
| | Endpoint MSE | 0.151 |
### Submitting to the Leaderboard
- Use the provided evaluation script:
https://github.com/jayLEE0301/TraceGen
- Report metrics on the **official test split**, using the corresponding dataset from:
https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding
`TraceGen_{EnvName}` checkpoint.
- Open a PR or submit results via GitHub Issues.