tags:
- action
- discrete
- vector-quantization
- RDT 2
license: apache-2.0
pipeline_tag: robotics
---

Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes the residual at each successive level, so each level refines the approximation left by the one before.
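The following toy NumPy sketch illustrates the residual-quantization encode step (random codebooks and illustrative sizes; the released tokenizer is the `MultiVQVAE` used in the quick start below):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual VQ encode: one code index per level.

    x: (D,) vector; codebooks: list of (K, D) arrays. The sum of the
    chosen code vectors approximates x, and each level quantizes the
    residual that the previous levels left behind.
    """
    residual = x.copy()
    indices = []
    for codebook in codebooks:
        # Nearest code to the current residual
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # The next level quantizes what this level missed
        residual = residual - codebook[idx]
    return indices

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 8)) for _ in range(4)]  # 4 levels, 1024 codes each
codes = rvq_encode(rng.normal(size=8), codebooks)
```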

Here, we provide:

1. **RVQ-AT** — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
2. **Simple APIs to fit your own tokenizer** on custom action datasets.

---

We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with this [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode is supported.

```python
# Run under the repository: https://github.com/thu-ml/RDT2

import torch
import numpy as np

from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE

# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
    "<Path_to_normalizer>"  # Download from:
    # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)

# Load your RELATIVE action chunk
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
# - action_dim = 20: following the UMI setting (both arms, right to left)
#   - [0-2]:   RIGHT ARM end-effector position (x, y, z), unit: m
#   - [3-8]:   RIGHT ARM end-effector rotation (6D representation)
#   - [9]:     RIGHT ARM gripper width, unit: m
#   - [10-12]: LEFT ARM end-effector position (x, y, z), unit: m
#   - [13-18]: LEFT ARM end-effector rotation (6D representation)
#   - [19]:    LEFT ARM gripper width, unit: m
action_chunk = torch.zeros((1, 24, 20))

# Normalize actions
nsample = normalizer["action"].normalize(action_chunk).cuda()

# Encode → tokens
# tokens: torch.LongTensor of shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)

# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
```
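After a round trip, the shapes should line up with the comments above (assuming the batched `(1, 24, 20)` input from the snippet):

```python
# One chunk in → 27 discrete tokens → one chunk back
assert tuple(tokens.shape) == (1, 27)
assert tuple(recon_action_chunk.shape) == (1, 24, 20)
```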

---

## [IMPORTANT] Recommended Preprocessing

Although our Residual VQ demonstrates strong generalization across both hand-held gripper data and real robot data, if you plan to fine-tune on your own dataset we recommend that you first verify that the statistics of your data fall within the bounds of our RVQ. Afterward, evaluate the reconstruction error on your data before using it for your own purposes, especially fine-tuning (see the sketch below).
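A possible version of that check, reusing `vae` and `normalizer` from the quick start above; the helper name and print format are illustrative, not part of the released API:

```python
import torch

@torch.no_grad()
def check_fit(vae, normalizer, action_chunks):
    """action_chunks: (N, 24, 20) tensor of RELATIVE action chunks."""
    nsample = normalizer["action"].normalize(action_chunks).cuda()

    # 1) Statistics: normalized values should stay within the tokenizer's bounds
    print(f"normalized range: [{nsample.min():.3f}, {nsample.max():.3f}] "
          "(expected within [-1, 1])")

    # 2) Reconstruction: encode → decode and measure the error in normalized space
    recon = vae.decode(vae.encode(nsample))
    mse = torch.mean((recon - nsample) ** 2).item()
    print(f"reconstruction MSE: {mse:.6f}")
    return mse
```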

---

<!--
* **Compression:** 4 levels × 1 token/step → 4 tokens/step (often reduced further with temporal stride).
* **Reconstruction:** MSE ↓ 25–40% vs. single-codebook VQ at equal bitrate.
* **Latency:** <1 ms per 50×14 chunk on A100/PCIe; CPU-only real-time at 50 Hz feasible.

---
-->

RVQ-AT is a representation learning component. **Do not** deploy decoded actions on real hardware without validation.

---

## Citation

If you use RVQ-AT in your work, please cite:

```bibtex
TBD
```

---

## Contact

* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page.

---