Commit 571834f · verified · parent: 86c1aa6

Update README.md

Files changed (1): README.md (+39 -40)
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
  - action
  - discrete
  - vector-quantization
+ - RDT 2
  license: apache-2.0
  pipeline_tag: robotics
  ---
@@ -23,7 +24,7 @@ Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes
  Here, we provide:

- 1. **RVQ-AT (Universal)** — a general-purpose tokenizer trained on diverse manipulation & navigation logs.
+ 1. **RVQ-AT** — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
  2. **Simple APIs to fit your own tokenizer** on custom action datasets.

  ---
@@ -34,40 +35,53 @@ Here, we provide:
  We recommend chunking actions into \~**0.8 s windows** with fps = 30 and normalizing each action dimension using [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) to **\[-1, 1]** before tokenization. Batched encode/decode are supported.

  ```python
+ # Run under the repository: https://github.com/thu-ml/RDT2
+
+ import torch
  import numpy as np
- from transformers import AutoProcessor
+ from models.normalizer import LinearNormalizer
+ from vqvae.models.multivqvae import MultiVQVAE

  # Load from the Hub (replace with your repo id once published)
- proc = AutoProcessor.from_pretrained(
- "your-org/residual-vq-action-tokenizer", # e.g., "your-org/rvq-at-universal"
- trust_remote_code=True
+ vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
+ normalizer = LinearNormalizer.load(
+ "<Path_to_normalizer>" # Download from:
+ # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
  )

- # Dummy batch: [batch, T, action_dim], concretely [batch_size, 24, 20]
- action_data = np.random.uniform(-1, 1, size=(256, 50, 20)).astype("float32")
-
- # Encode tokens (List[List[int]] or np.ndarray[int])
- tokens = proc(action_data) # or proc.encode(action_data)
+ # Load your RELATIVE action chunk
+ action_chunk = torch.zeros((1, 24, 20))
+ # action_chunk shape: (B, T, action_dim)
+ # - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
+ # - action_dim = 20: following the UMI setting (both arms, right arm then left arm)
+ #   - [0-2]: RIGHT ARM end-effector position (x, y, z), unit: m
+ #   - [3-8]: RIGHT ARM end-effector rotation (6D representation)
+ #   - [9]: RIGHT ARM gripper width, unit: m
+ #   - [10-12]: LEFT ARM end-effector position (x, y, z), unit: m
+ #   - [13-18]: LEFT ARM end-effector rotation (6D representation)
+ #   - [19]: LEFT ARM gripper width, unit: m
+
+ # Normalize the action chunk
+ nsample = normalizer["action"].normalize(action_chunk).cuda()
+
+ # Encode → tokens
+ # tokens: torch.LongTensor with shape (B, num_valid_action_token)
+ # num_valid_action_token = 27, values in range [0, 1024)
+ tokens = vae.encode(nsample)

  # Decode back to continuous actions
- # The processor caches (T, action_dim) on first forward;
- # or specify explicitly:
- recon = proc.decode(tokens, time_horizon=50, action_dim=14)
- ```
+ recon_nsample = vae.decode(tokens)
+ recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)

- **Notes**
-
- * If your pipeline uses variable-length chunks, pass `time_horizon` per sample to `decode(...)`.
- * Special tokens (`pad`, `eos`, optional `chunk_sep`) are reserved and shouldn’t be used as code indices.
+ ```

  ---

- ## Recommended Preprocessing
+ ## [IMPORTANT] Recommended Preprocessing

- * **Chunking:** 0.5–1.0 s windows work well for 10–50 Hz logs.
- * **Normalization:** per-dimension robust scaling to `[-1, 1]` (e.g., 1–99% quantiles). Save stats in `preprocessor_config.json`.
- * **Padding:** for variable `T`, pad to a small multiple of stride; RVQ-AT masks paddings internally.
- * **Action spaces:** supports mixed spaces (e.g., 7-DoF joints + gripper + base). Concatenate into a flat vector per timestep.
+ Although our Residual VQ demonstrates strong generalization across both hand-held gripper data and real robot data,
+ we recommend that, if you plan to fine-tune on your own dataset, you first verify that the statistics of your data fall within the bounds of our RVQ.
+ Afterward, evaluate the reconstruction error on your data before using it for your own purposes, especially fine-tuning.

  ---
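The new "[IMPORTANT] Recommended Preprocessing" note above asks you to verify your data's statistics and reconstruction error before fine-tuning. Below is a minimal sketch of such a check, reusing the `MultiVQVAE` and `LinearNormalizer` calls from the usage snippet; the placeholder data, the out-of-range check, and any acceptable error threshold are illustrative assumptions rather than part of the released API.

```python
# Sketch: sanity-check your own relative action chunks against the RVQ
# before fine-tuning. Assumes the thu-ml/RDT2 repo layout and the APIs
# shown in the usage snippet; `your_action_chunks` is a placeholder.
import torch
from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE

vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load("<Path_to_normalizer>")

# Replace with real data of shape (B, 24, 20), same layout as above.
your_action_chunks = torch.zeros((64, 24, 20))

with torch.no_grad():
    nsample = normalizer["action"].normalize(your_action_chunks)

    # 1) Statistics check: normalized values should stay within [-1, 1],
    #    the range the tokenizer expects.
    out_of_range = (nsample.abs() > 1.0).float().mean().item()
    print(f"fraction of normalized values outside [-1, 1]: {out_of_range:.4f}")

    # 2) Round-trip reconstruction error on your data.
    tokens = vae.encode(nsample.cuda())
    recon = normalizer["action"].unnormalize(vae.decode(tokens).cpu())
    mse = torch.mean((recon - your_action_chunks) ** 2).item()
    print(f"round-trip reconstruction MSE: {mse:.6f}")
```

If the out-of-range fraction or the reconstruction error is large for your data, consider fitting your own tokenizer (see the APIs mentioned above) before fine-tuning.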
@@ -78,7 +92,6 @@ recon = proc.decode(tokens, time_horizon=50, action_dim=14)
  * **Compression:** 4 levels × 1 token/step → 4 tokens/step (often reduced further with temporal stride).
  * **Reconstruction:** MSE ↓ 25–40% vs. single-codebook VQ at equal bitrate.
  * **Latency:** <1 ms per 50×14 chunk on A100/PCIe; CPU-only real-time at 50 Hz feasible.
- * **Downstream VLA:** +1–3% SR on long-horizon tasks vs. raw-action modeling.

  ---
  -->
@@ -92,33 +105,19 @@ RVQ-AT is a representation learning component. **Do not** deploy decoded actions
  ---

- ## FAQ
-
- **Q: How do I get back a `[T, A]` matrix at decode?**
- A: RVQ-AT caches `(time_horizon, action_dim)` on first `__call__`/`encode`. You can also pass them explicitly to `decode(...)`.
-
- **Q: Can I store shorter token sequences?**
- A: Yes—enable `temporal_stride>1` to quantize a downsampled latent; the decoder upsamples.
-
- **Q: How do I integrate with `transformers` trainers?**
- A: Treat RVQ-AT output as a discrete vocabulary and feed tokens to your VLA LM. Keep special token ids consistent across datasets.
-
- ---
-
  ## Citation

  If you use RVQ-AT in your work, please cite:

  ```bibtex
-
+ TBD
  ```

  ---

  ## Contact

- * Maintainers: Your Name [you@example.com](mailto:you@example.com)
- * Issues & requests: open a GitHub issue or start a Hub discussion on the model page.
+ * Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page.

  ---