# BEST: B-Spline Encoded Sequence Tokenizer with Adaptive Knots
BEST is an action tokenizer that converts continuous robot action sequences into discrete tokens using adaptive B-splines with MILP-based knot optimization. It extends [BEAST](https://huggingface.co/zhouhongyi/beast) with forced gripper knots and adaptive compression for enhanced trajectory representation in imitation learning.
## Installation
Install the required dependencies:
```bash
pip install torch numpy scipy matplotlib tqdm pulp transformers
```
Note: CUDA is recommended for optimal performance, but CPU is also supported by setting `device="cpu"`.
## Quick Start
```python
from transformers import AutoProcessor
import torch

# Initialize the BEST processor with configuration parameters:
# - num_dof: degrees of freedom (7 for a 6-DoF robot arm + 1 gripper DOF)
# - in_seq_len: input trajectory length (number of time steps)
# - out_seq_len: output token sequence length after compression
# - vocab_size: discrete vocabulary size (256 = 8-bit tokens)
# - degree: degree of the B-spline polynomial (3 = cubic spline)
# - gripper_dof: number of gripper DOFs (1 for a binary gripper)
# - device: computation device ('cpu' or 'cuda')
best = AutoProcessor.from_pretrained(
    "Luka-He/best",
    trust_remote_code=True,
    num_dof=7,
    in_seq_len=50,
    out_seq_len=50,
    vocab_size=256,
    degree=3,
    gripper_dof=1,
    device='cuda'
)

# Create random trajectory data: 10 trajectories, each with 50 time steps, 7 dimensions
trajectories = torch.randn(10, 50, 7)

# Encode trajectories into discrete tokens
# update_bounds=True lets the processor adaptively update its quantization bounds
tokens = best.encode_discrete(trajectories, update_bounds=True)
print(f"Encoded tokens shape: {tokens.shape}")  # [10, 400] (50 * (7 + 1))

# Decode tokens back to continuous trajectories
reconstructed_trajectories = best.decode_discrete(tokens)
print(f"Reconstructed trajectories shape: {reconstructed_trajectories.shape}")  # [10, 50, 7]

# Calculate mean squared error to measure reconstruction quality
mse_loss = torch.mean((trajectories - reconstructed_trajectories) ** 2)
print(f"MSE Loss: {mse_loss.item()}")
```
### Continuous Encoding
For integration with continuous generative models:
```python
# Encode to normalized continuous parameters in [-1, 1]
params = best.encode_continuous(trajectories, update_bounds=True)
print(f"Continuous params shape: {params.shape}")  # [10, 400]

# Decode back
reconstructed = best.decode_continuous(params)
print(f"Reconstructed shape: {reconstructed.shape}")  # [10, 50, 7]
```
## Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| num_dof | Total degrees of freedom (robot joints + gripper) | 7 |
| in_seq_len | Input trajectory sequence length (number of timesteps) | 10 |
| out_seq_len | Output compressed sequence length (must be ≥ the number of control points after compression) | 5 |
| vocab_size | Discrete vocabulary size (256 = 8-bit tokens) | 256 |
| degree | B-spline polynomial degree (3 = cubic, provides smooth trajectories) | 3 |
| gripper_dof | Number of gripper DOFs, assumed to be the last dimensions; used for forced knot placement | 1 |
| do_pad | Whether to pad control points to a fixed length | True |
| device | Torch device ("cuda" or "cpu") | "cuda" |
### Token Count
The total number of tokens per trajectory is `out_seq_len * (num_dof + 1)`.
The extra dimension stores the time knots. For example, with the Quick Start settings (out_seq_len=50, num_dof=7), each trajectory yields 400 tokens (50 × 8).
**Key Difference from BEAST**: BEST uses adaptive compression where the effective number of control points can vary with trajectory complexity (up to `out_seq_len`), while BEAST uses a fixed `num_basis` control points.
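The formula above can be checked with a short helper (`token_count` is an illustrative function, not part of the processor API):

```python
# Tokens per trajectory: out_seq_len * (num_dof + 1); the extra slot per
# control point stores its time knot.
def token_count(out_seq_len: int, num_dof: int) -> int:
    return out_seq_len * (num_dof + 1)

print(token_count(50, 7))  # 400 tokens, matching the Quick Start shapes
print(token_count(5, 7))   # 40 tokens with the constructor defaults
```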
## API Reference
### Encoding Methods
`encode_discrete(trajs, update_bounds=True)`
- Input: trajectories tensor `[batch, in_seq_len, num_dof]`
- Output: discrete tokens `[batch, out_seq_len * (num_dof + 1)]` in range `[0, vocab_size - 1]`
- `update_bounds`: whether to update the internal weight bounds from this batch

`encode_continuous(trajs, update_bounds=True)`
- Input: trajectories tensor `[batch, in_seq_len, num_dof]`
- Output: normalized parameters `[batch, out_seq_len * (num_dof + 1)]` in range `[-1, 1]`
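The relationship between the discrete and continuous encodings can be illustrated with a minimal sketch, assuming uniform quantization of the normalized [-1, 1] parameters into `vocab_size` bins (the processor's exact scheme may differ; `quantize`/`dequantize` here are illustrative, not API methods):

```python
def quantize(p: float, vocab_size: int = 256) -> int:
    """Map a normalized parameter in [-1, 1] to a token in [0, vocab_size - 1]."""
    idx = round((p + 1.0) / 2.0 * (vocab_size - 1))
    return max(0, min(vocab_size - 1, idx))

def dequantize(token: int, vocab_size: int = 256) -> float:
    """Map a token back to a normalized value in [-1, 1]."""
    return token / (vocab_size - 1) * 2.0 - 1.0

print(quantize(-1.0), quantize(1.0))  # 0 255
# Round-tripping loses at most half a bin width (~1/255 for vocab_size=256):
print(round(dequantize(quantize(0.3)), 3))  # 0.302
```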
### Decoding Methods
`decode_discrete(tokens, target_length=None)`
- Input: discrete tokens `[batch, out_seq_len * (num_dof + 1)]`
- Output: reconstructed trajectories `[batch, target_length, num_dof]`
- `target_length`: output trajectory length (optional, defaults to `in_seq_len`)

`decode_continuous(params, target_length=None)`
- Input: normalized parameters `[batch, out_seq_len * (num_dof + 1)]`
- Output: reconstructed trajectories `[batch, target_length, num_dof]`
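Because decoding evaluates a B-spline, `target_length` can differ from `in_seq_len`: the same control points are simply sampled at a different temporal resolution. A standalone sketch with `scipy.interpolate.BSpline` (illustrative control points and knot vector, not the processor's internals):

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                        # cubic, matching degree=3
c = np.array([0.0, 0.5, 1.0, 0.5, 0.0])      # illustrative control points
t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1])  # clamped knot vector, len(c) + k + 1
spline = BSpline(t, c, k)

coarse = spline(np.linspace(0, 1, 10))   # decode at 10 timesteps
fine = spline(np.linspace(0, 1, 100))    # decode the same spline at 100 timesteps
print(coarse.shape, fine.shape)  # (10,) (100,)
```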
### Utility Methods
`update_weights_bounds_per_batch(batch_weights)`
- Update the min/max bounds used for normalization based on new batch data
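A minimal sketch of what such a per-batch bounds update plausibly does (the exact internals are not documented here; `RunningBounds` is illustrative): track the running min/max of the encoded weights so normalization to [-1, 1] stays consistent across batches.

```python
class RunningBounds:
    """Running min/max over batches of weight values."""

    def __init__(self):
        self.w_min = float("inf")
        self.w_max = float("-inf")

    def update(self, batch_weights):
        # batch_weights: list of per-trajectory weight lists
        flat = [w for traj in batch_weights for w in traj]
        self.w_min = min(self.w_min, min(flat))
        self.w_max = max(self.w_max, max(flat))

bounds = RunningBounds()
bounds.update([[0.2, -0.5], [1.3, 0.0]])
bounds.update([[2.0, -1.0]])
print(bounds.w_min, bounds.w_max)  # -1.0 2.0
```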
## Key Features
### Adaptive Knot Selection with MILP
Unlike BEAST's uniform knot spacing, BEST uses Mixed-Integer Linear Programming (MILP) to optimize knot placement:
- **Gripper-Driven Knots**: automatically places knots at gripper state transitions
- **Curvature-Based Optimization**: adds knots where trajectory curvature is high
- **Tolerance Control**: balances compression ratio against reconstruction accuracy
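A toy illustration of the curvature signal such an optimizer could use, with second differences as a curvature proxy (the actual MILP formulation is not reproduced here; `curvature_scores` is illustrative):

```python
def curvature_scores(traj):
    """Score each interior timestep t by |x[t-1] - 2*x[t] + x[t+1]| (second difference)."""
    return [abs(traj[t - 1] - 2 * traj[t] + traj[t + 1]) for t in range(1, len(traj) - 1)]

# A trajectory that bends sharply in the middle scores highest there,
# so the middle timesteps are the best knot candidates:
scores = curvature_scores([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
print([round(s, 2) for s in scores])  # [0.0, 0.5, 0.5, 0.0]
```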
### Forced Gripper Knots
Preserves discrete gripper states by:
- Detecting gripper state changes in the input trajectory
- Forcing B-spline knots at the transition points
- Using degree-0 splines for the gripper DOF (piecewise constant)
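The first step can be sketched as follows, assuming a binary gripper signal in the last DOF (the threshold of 0 and the helper name are illustrative):

```python
def gripper_transition_indices(gripper_signal, threshold=0.0):
    """Return the timesteps where the binarized gripper state flips."""
    states = [1 if g > threshold else 0 for g in gripper_signal]
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]

# Open at t=3, close at t=5 -> forced knots would be placed at these indices:
print(gripper_transition_indices([-1, -1, -1, 1, 1, -1]))  # [3, 5]
```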
### Performance Benchmarks (LIBERO Dataset, 100 samples)
| Action Chunk Size | Avg Time (s) | CP Min | CP Mean | CP Max | W_min | W_max | Success Rate |
|-------------------|--------------|--------|---------|--------|-------|-------|--------------|
| 5 steps | 0.104 | 5 | 5.0 | 5 | -1.0000 | 5.0000 | 100% |
| 10 steps | 0.211 | 10 | 10.0 | 10 | -1.0000 | 10.0000 | 100% |
| 15 steps | 0.427 | 15 | 15.0 | 15 | -1.3730 | 15.0000 | 100% |
| 20 steps | 0.696 | 20 | 20.0 | 20 | -1.0000 | 20.0000 | 100% |
| 25 steps | 1.904 | 25 | 25.0 | 25 | -1.0000 | 25.0000 | 100% |
| 30 steps | 3.217 | 30 | 30.0 | 30 | -1.0000 | 30.0000 | 100% |
| 35 steps | 5.372 | 35 | 35.0 | 35 | -1.0000 | 35.0000 | 93% |

Note: CP (control points) is the number of knots selected by the adaptive algorithm.
## Comparison with BEAST
| Feature | BEAST | BEST |
|---------|-------|------|
| Knot Selection | Uniform spacing | Adaptive (MILP-based) |
| Gripper Handling | Optional zero-order | Forced knots at transitions |
| Compression | Fixed basis count | Adaptive based on complexity |
| Encoding Time | ~20 ms (50 steps) | ~100 ms (5 steps) to ~5 s (35 steps) |
| Trajectory Fidelity | High (uniform) | Very high (adaptive) |
| Use Case | General trajectories | Robot manipulation with gripper |
## Uses
### Intended Use Cases
- **Robot Imitation Learning**: compress continuous demonstration trajectories with gripper states into discrete tokens for VLA-based policy learning
- **Manipulation Dataset Compression**: reduce memory footprint while preserving both motion quality and discrete gripper transitions
- **VLA Action Tokenization**: enable vision-language-action models to process robot actions as discrete token sequences with explicit gripper control
### Out-of-Scope Use Cases
- Trajectories without discrete state transitions (use BEAST instead for better speed)
- Real-time control (MILP optimization adds computational overhead)
- Non-robotic continuous signals (optimized for manipulation trajectories)
## Advanced Usage
### Custom Configuration
```python
from online_bspline_tokenizer import BestTokenizer
import torch

# Create with custom parameters
tokenizer = BestTokenizer(
    num_dof=7,
    in_seq_len=100,   # Longer input trajectories
    out_seq_len=100,  # Allow up to 100 control points
    vocab_size=512,   # Higher-resolution quantization
    degree=3,
    gripper_dof=1,
    do_pad=True,
    device='cuda'
)

# Process trajectories
trajectories = torch.randn(5, 100, 7)
tokens = tokenizer.encode_discrete(trajectories)
```
### Saving and Loading
```python
# Save processor configuration
tokenizer.save_pretrained("./my_best_tokenizer")

# Load later
from transformers import AutoProcessor
loaded_tokenizer = AutoProcessor.from_pretrained(
    "./my_best_tokenizer",
    trust_remote_code=True
)
```
## Citation
If you use BEST in your research, please cite:
```bibtex
@misc{best2026,
  title={BEST: B-Spline Encoded Sequence Tokenizer with Adaptive Knots},
  author={Hexinyu},
  year={2026},
  url={https://github.com/your-repo/best}
}
```
Based on BEAST:
```bibtex
@inproceedings{zhou2025beast,
  title={{BEAST}: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning},
  author={Hongyi Zhou and Weiran Liao and Xi Huang and Yucheng Tang and Fabian Otto and Xiaogang Jia and Xinkai Jiang and Simon Hilber and Ge Li and Qian Wang and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Nils Blank and Moritz Reuss and Rudolf Lioutikov},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=rQCl1sf62w}
}
```
## License
MIT License
## Acknowledgments
This work builds upon [BEAST](https://huggingface.co/zhouhongyi/beast) and extends it with adaptive knot selection for improved manipulation trajectory encoding.