tags:
- action
- discrete
- vector-quantization
- RDT 2
license: apache-2.0
pipeline_tag: robotics
---

Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes the residual at each successive level, so each level refines the approximation left by the one before.
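The following toy NumPy sketch illustrates the residual-quantization encode step (random codebooks and illustrative sizes; the released tokenizer is the `MultiVQVAE` used in the quick start below):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual VQ encode: one code index per level.

    x: (D,) vector; codebooks: list of (K, D) arrays. The sum of the
    chosen code vectors approximates x, and each level quantizes the
    residual that the previous levels left behind.
    """
    residual = x.copy()
    indices = []
    for codebook in codebooks:
        # Nearest code to the current residual
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # The next level quantizes what this level missed
        residual = residual - codebook[idx]
    return indices

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 8)) for _ in range(4)]  # 4 levels, 1024 codes each
codes = rvq_encode(rng.normal(size=8), codebooks)
```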

Here, we provide:

1. **RVQ-AT** — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
2. **Simple APIs to fit your own tokenizer** on custom action datasets.

---

We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with this [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode is supported.

```python
# Run under the repository: https://github.com/thu-ml/RDT2

import torch
import numpy as np

from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE

# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
    "<Path_to_normalizer>"  # Download from:
    # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)

# Load your RELATIVE action chunk
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
# - action_dim = 20: following the UMI setting (both arms, right to left)
#   - [0-2]:   RIGHT ARM end-effector position (x, y, z), unit: m
#   - [3-8]:   RIGHT ARM end-effector rotation (6D representation)
#   - [9]:     RIGHT ARM gripper width, unit: m
#   - [10-12]: LEFT ARM end-effector position (x, y, z), unit: m
#   - [13-18]: LEFT ARM end-effector rotation (6D representation)
#   - [19]:    LEFT ARM gripper width, unit: m
action_chunk = torch.zeros((1, 24, 20))

# Normalize actions
nsample = normalizer["action"].normalize(action_chunk).cuda()

# Encode → tokens
# tokens: torch.LongTensor of shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)

# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
```
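After a round trip, the shapes should line up with the comments above (assuming the batched `(1, 24, 20)` input from the snippet):

```python
# One chunk in → 27 discrete tokens → one chunk back
assert tuple(tokens.shape) == (1, 27)
assert tuple(recon_action_chunk.shape) == (1, 24, 20)
```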

---

## [IMPORTANT] Recommended Preprocessing

Although our Residual VQ demonstrates strong generalization across both hand-held gripper data and real robot data, if you plan to fine-tune on your own dataset we recommend that you first verify that the statistics of your data fall within the bounds of our RVQ. Afterward, evaluate the reconstruction error on your data before using it for your own purposes, especially fine-tuning (see the sketch below).
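A possible version of that check, reusing `vae` and `normalizer` from the quick start above; the helper name and print format are illustrative, not part of the released API:

```python
import torch

@torch.no_grad()
def check_fit(vae, normalizer, action_chunks):
    """action_chunks: (N, 24, 20) tensor of RELATIVE action chunks."""
    nsample = normalizer["action"].normalize(action_chunks).cuda()

    # 1) Statistics: normalized values should stay within the tokenizer's bounds
    print(f"normalized range: [{nsample.min():.3f}, {nsample.max():.3f}] "
          "(expected within [-1, 1])")

    # 2) Reconstruction: encode → decode and measure the error in normalized space
    recon = vae.decode(vae.encode(nsample))
    mse = torch.mean((recon - nsample) ** 2).item()
    print(f"reconstruction MSE: {mse:.6f}")
    return mse
```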

---

<!--
* **Compression:** 4 levels × 1 token/step → 4 tokens/step (often reduced further with temporal stride).
* **Reconstruction:** MSE ↓ 25–40% vs. single-codebook VQ at equal bitrate.
* **Latency:** <1 ms per 50×14 chunk on A100/PCIe; CPU-only real-time at 50 Hz feasible.

---
-->

RVQ-AT is a representation learning component. **Do not** deploy decoded actions on real hardware without validation.

---

## Citation

If you use RVQ-AT in your work, please cite:

```bibtex
TBD
```

---

## Contact

* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page.

---