---
license: apache-2.0
---

# TraceGen Benchmark Leaderboard

![TraceGen Benchmark Overview](https://raw.githubusercontent.com/jayLEE0301/TraceGen/main/assets/tracegen_fig2.png)

## Benchmark: TraceGen Evaluation Suite

We evaluate models on **5 environments** using the official TraceGen metrics.
For each environment, we report **MSE**, **MAE**, and **Endpoint MSE** on held-out test sets.


## Testing on the TraceGen benchmark
Use the official evaluation code provided in:
[https://github.com/jayLEE0301/TraceGen](https://github.com/jayLEE0301/TraceGen)

### Multi-GPU
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
  test_benchmark.py \
  --config cfg/train.yaml \
  --override \
  train.batch_size=8 \
  train.lr_decoder=1.5e-4 \
  model.decoder.num_layers=6 \
  model.decoder.num_attention_heads=12 \
  model.decoder.latent_dim=768 \
  data.num_workers=4 \
  hardware.mixed_precision=true \
  logging.use_wandb=true \
  logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```
### Single-GPU
```bash
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
  --config cfg/train.yaml \
  --override \
  train.batch_size=8 \
  train.lr_decoder=1.5e-4 \
  model.decoder.num_layers=6 \
  model.decoder.num_attention_heads=12 \
  model.decoder.latent_dim=768 \
  data.num_workers=4 \
  hardware.mixed_precision=true \
  logging.use_wandb=true \
  logging.log_every=2000 \
  --resume {path_to_pretrained_checkpoint}
```

To **reproduce the environment-specific benchmark results** reported below,
evaluate the **environment-specific checkpoints**
`TraceGen_{EnvName}` from the [TraceGen Collection](https://huggingface.co/collections/furonghuang-lab/tracegen), each trained only on data from its corresponding environment.

**Metric definition.**
All reported errors are computed in a **normalized coordinate space**:
both input images and predicted traces are scaled to the range **[0, 1]** prior to evaluation.
Accordingly, the reported MSE, MAE, and Endpoint MSE reflect **absolute errors within the normalized image space**.
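As one reading of this definition, the three metrics can be sketched as below. The `(T, 2)` trace shape and the per-point averaging are our assumptions; the official evaluation code in the TraceGen repository is authoritative.

```python
import numpy as np

def trace_metrics(pred, gt):
    """pred, gt: (T, 2) arrays of (x, y) trace points scaled to [0, 1].

    Returns (MSE, MAE, Endpoint MSE) in the normalized image space.
    """
    mse = np.mean((pred - gt) ** 2)                   # mean squared error over all points
    mae = np.mean(np.abs(pred - gt))                  # mean absolute error over all points
    endpoint_mse = np.mean((pred[-1] - gt[-1]) ** 2)  # error on the final trace point only
    return mse, mae, endpoint_mse

# Toy example: a prediction offset by 0.01 in every normalized coordinate.
gt = np.array([[0.10, 0.20], [0.30, 0.40], [0.50, 0.60]])
pred = gt + 0.01
mse, mae, ep = trace_metrics(pred, gt)
print(f"MSE={mse:.6f}  MAE={mae:.6f}  EndpointMSE={ep:.6f}")
# → MSE=0.000100  MAE=0.010000  EndpointMSE=0.000100
```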

| Environment | Metric       | TraceGen (×1e−2) |
| ----------- | ------------ | ---------------- |
| EpicKitchen | MSE          | 0.445            |
|             | MAE          | 2.721            |
|             | Endpoint MSE | 0.791            |
| Droid       | MSE          | 0.206            |
|             | MAE          | 1.289            |
|             | Endpoint MSE | 0.285            |
| Bridge      | MSE          | 0.653            |
|             | MAE          | 2.419            |
|             | Endpoint MSE | 0.607            |
| Libero      | MSE          | 0.276            |
|             | MAE          | 1.442            |
|             | Endpoint MSE | 0.385            |
| Robomimic   | MSE          | 0.138            |
|             | MAE          | 1.416            |
|             | Endpoint MSE | 0.151            |
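Since the table scales all errors by 1e−2, converting a table entry back to an absolute normalized-space value is a one-liner; the RMSE interpretation below (typical per-coordinate error as a fraction of the image size) is our own gloss, using the EpicKitchen MSE row as an example.

```python
import math

reported = 0.445       # table entry, in units of 1e-2 (EpicKitchen MSE)
mse = reported * 1e-2  # absolute MSE in normalized [0, 1] coordinates
rmse = math.sqrt(mse)  # roughly: per-coordinate error as a fraction of image size
print(f"MSE={mse:.5f}, RMSE≈{rmse:.4f}")
```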


### Submitting to the Leaderboard

- Use the provided evaluation script:  
  https://github.com/jayLEE0301/TraceGen
- Report metrics on the **official test split**, using the corresponding dataset from:  
  https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding
  `TraceGen_{EnvName}` checkpoint.
- Open a PR or submit results via GitHub Issues.