| # MagicBot-VGA |
|
|
| This repository documents how to evaluate our RoboTwin model |
| [`zaleni/MagicBot-VGA-Robotwin`](https://huggingface.co/zaleni/MagicBot-VGA-Robotwin) |
| with the MagicBot-VGA codebase. |
|
|
| [GitHub repository](https://github.com/zaleni/MagicBot-VGA) |
| [Hugging Face model](https://huggingface.co/zaleni/MagicBot-VGA-Robotwin) |
|
|
| This README focuses on RoboTwin 2.0 environment preparation and evaluation. |
|
|
| It covers: |
|
|
| - MagicBot environment installation |
| - RoboTwin evaluation setup |
| - required external model assets |
| - single-task evaluation |
| - 50-task randomized evaluation |
| - CVPR 2026 RoboTwin Track 11-task evaluation |
| - submission package generation for the leaderboard workflow |
|
|
| ## 1. Requirements |
|
|
| The codebase is built and tested with: |
|
|
| - Python 3.10 |
| - CUDA 12.8 |
| - PyTorch 2.7.1 |
|
|
| We recommend using a Linux machine with NVIDIA GPUs. |
|
|
| ## 2. Install the MagicBot Base Environment |
|
|
| Clone the repository: |
|
|
| ```bash |
| git clone https://github.com/zaleni/MagicBot-VGA.git |
| cd MagicBot-VGA |
| ``` |
|
|
| Create a conda environment: |
|
|
| ```bash |
| conda create -y -n magicbot python=3.10 |
| conda activate magicbot |
| pip install --upgrade pip |
| ``` |
|
|
| Install the basic system dependencies used by the codebase: |
|
|
| ```bash |
| conda install -c conda-forge ffmpeg=7.1.1 svt-av1 -y |
| ``` |
|
|
| Install PyTorch for CUDA 12.8: |
|
|
| ```bash |
| pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \ |
|   --index-url https://download.pytorch.org/whl/cu128 |
| ``` |
|
|
| Install Python dependencies: |
|
|
| ```bash |
| pip install torchcodec numpy scipy transformers==4.57.1 mediapy loguru pytest omegaconf |
| pip install -e . |
| ``` |
|
|
| ## 3. Qwen3-VL Dependency |
|
|
| For `CubeV2`, the recommended dependency is the official Hugging Face `Qwen3-VL` |
| implementation provided by `transformers>=4.57.0`. |
|
|
| In this repository, `CubeV2` imports Qwen3-VL directly from: |
|
|
| ```python |
| from transformers.models.qwen3_vl import modeling_qwen3_vl |
| from transformers.models.qwen3_vl import Qwen3VLForConditionalGeneration, Qwen3VLTextModel |
| ``` |
|
|
| So for standard evaluation, you do not need to patch `transformers` if your environment |
| already uses a recent enough official version such as `transformers==4.57.1`. |
|
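Before running evaluation, it can help to confirm that the installed `transformers` meets the `>=4.57.0` requirement stated above. The sketch below is a minimal check built on the standard library only; the helper names (`parse_version`, `meets_minimum`, `check_transformers`) are hypothetical and not part of this repository. It ignores pre-release suffixes such as `.dev0`, so treat it as a quick sanity check rather than a full version parser.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '4.57.1' into (4, 57, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.57' compares equal to '4.57.0'.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_transformers(minimum: str = "4.57.0") -> str:
    """Report whether the installed transformers satisfies the minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"transformers {installed}: {status} (need >= {minimum})"


if __name__ == "__main__":
    print(check_transformers())
```

If the `packaging` library is available in your environment, its `Version` class is a more robust way to compare versions than this hand-rolled parser.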
|
| This repo also contains a vendored replacement file under: |
|
|
| ```text |
| src/lerobot/policies/cubev2/transformers_replace/models/qwen3_vl/modeling_qwen3_vl.py |
| ``` |
|
|
| That file is a repo-side override copy of the upstream module. Most users evaluating |
| `zaleni/MagicBot-VGA-Robotwin` do not need it; use it only if you intentionally want to |
| reproduce a specific locally patched behavior. |
|
|
| ## 4. Prepare RoboTwin for Evaluation |
|
|
| This section is specifically for RoboTwin evaluation. If you only want to load the |
| model or run other parts of the codebase, the extra RoboTwin setup below is not required. |
|
|
| ### Option A: initialize the bundled RoboTwin submodule |
|
|
| ```bash |
| git submodule update --init third_party/RoboTwin |
| ``` |
|
|
| ### Option B: copy an existing RoboTwin checkout |
|
|
| You do not have to download RoboTwin from scratch if you already have a prepared copy. |
| You can copy it into this repository instead. |
|
|
| The evaluation code assumes RoboTwin is located exactly at: |
|
|
| ```text |
| <repo_root>/third_party/RoboTwin |
| ``` |
|
|
| So a valid layout looks like: |
|
|
| ```text |
| MagicBot-VGA/ |
|   evaluation/ |
|   launch/ |
|   src/ |
|   third_party/ |
|     RoboTwin/ |
| ``` |
|
|
| If your RoboTwin directory already exists elsewhere, either: |
|
|
| - copy it to `third_party/RoboTwin`, or |
| - create a symlink at `third_party/RoboTwin` pointing to your existing RoboTwin directory |
|
|
| This path requirement comes from the evaluation code, which imports RoboTwin modules |
| and task configs from `third_party/RoboTwin` directly. |
|
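The symlink option above can be sketched in Python as follows. This is an illustration under stated assumptions, not a script shipped with the repository: `link_robotwin` is a hypothetical helper, and it assumes that an empty `third_party/RoboTwin` directory is only the placeholder left by an uninitialized submodule and is safe to replace.

```python
from pathlib import Path


def link_robotwin(repo_root: Path, existing_checkout: Path) -> Path:
    """Expose an existing RoboTwin checkout at <repo_root>/third_party/RoboTwin."""
    target = repo_root / "third_party" / "RoboTwin"
    if target.is_symlink():
        # A link was created earlier; leave it untouched.
        return target
    if target.is_dir():
        if any(target.iterdir()):
            # A populated copy or initialized submodule is already in place.
            return target
        # Empty placeholder directory (assumed to be an uninitialized submodule).
        target.rmdir()
    if not existing_checkout.is_dir():
        raise FileNotFoundError(f"RoboTwin checkout not found: {existing_checkout}")
    target.parent.mkdir(parents=True, exist_ok=True)
    # A symlink avoids duplicating the downloaded RoboTwin assets.
    target.symlink_to(existing_checkout.resolve(), target_is_directory=True)
    return target
```

A call such as `link_robotwin(Path("."), Path("/path/to/RoboTwin"))` from the repository root would create the link; calling it again is a no-op.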
|
| ### Install RoboTwin-specific system dependency |
|
|
| RoboTwin rendering requires Vulkan: |
|
|
| ```bash |
| sudo apt install -y libvulkan1 mesa-vulkan-drivers vulkan-tools |
| ``` |
|
|
| ### Install RoboTwin Python dependencies and assets |
|
|
| ```bash |
| cp evaluation/RoboTwin/requirements.txt third_party/RoboTwin/script/requirements.txt |
| cd third_party/RoboTwin |
| bash script/_install.sh |
| bash script/_download_assets.sh |
| cd ../../ |
| ``` |
|
|
| For more RoboTwin installation details, see the official documentation: |
| https://robotwin-platform.github.io/doc/usage/robotwin-install.html |
|
|
| ## 5. Prepare External Model Assets |
|
|
| The released checkpoint `zaleni/MagicBot-VGA-Robotwin` is intended to be lightweight, so for |
| RoboTwin action evaluation you must provide the external backbone and tokenizer assets explicitly. |
|
|
| Recommended values: |
|
|
| - Qwen3-VL backbone and processor: `Qwen/Qwen3-VL-2B-Instruct` |
| - Cosmos tokenizer: `nvidia/Cosmos-Tokenizer-CI8x8` |
|
|
| You can use either: |
|
|
| - public Hugging Face repo ids |
| - local directories downloaded in advance |
|
|
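| Example for offline/local usage, using the variable names read by `evaluation/RoboTwin/eval_randomized_50.sh`: |
|
|
| ```bash |
| export QWEN3_VL_PRETRAINED_PATH=/path/to/Qwen3-VL-2B-Instruct |
| export QWEN3_VL_PROCESSOR_PATH=/path/to/Qwen3-VL-2B-Instruct |
| export COSMOS_TOKENIZER_PATH_OR_NAME=/path/to/Cosmos-Tokenizer-CI8x8 |
| ``` |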
|
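The choice between a local directory and a public repo id can be made programmatically. The sketch below is one possible way to do it: `resolve_asset` and `download_asset` are hypothetical helpers, not functions from this codebase. `download_asset` relies on `huggingface_hub.snapshot_download`, which must be run on a machine with network access.

```python
from pathlib import Path
from typing import Optional


def resolve_asset(local_dir: Optional[str], repo_id: str) -> str:
    """Prefer a pre-downloaded local directory; otherwise fall back to the repo id."""
    if local_dir and Path(local_dir).is_dir():
        return str(Path(local_dir))
    return repo_id


def download_asset(repo_id: str, local_dir: str) -> str:
    """Fetch a Hugging Face repo into local_dir for later offline use."""
    # Imported lazily so resolve_asset works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    qwen = resolve_asset("/path/to/Qwen3-VL-2B-Instruct", "Qwen/Qwen3-VL-2B-Instruct")
    cosmos = resolve_asset("/path/to/Cosmos-Tokenizer-CI8x8", "nvidia/Cosmos-Tokenizer-CI8x8")
    print(qwen, cosmos)
```

The returned strings can be passed wherever this README uses a repo id or a local path for the Qwen3-VL and Cosmos tokenizer assets.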
|
| For standard RoboTwin action evaluation, we recommend disabling DA3 teacher instantiation: |
|
|
| ```bash |
| DISABLE_DA3_TEACHER_FOR_EVAL=true |
| ``` |
|
|
| This avoids loading the frozen DA3 teacher during evaluation while keeping the policy architecture compatible. |
|
|
| ## 6. Single-Task Evaluation |
|
|
| The most direct way to evaluate is to run `evaluation/RoboTwin/inference.py` on a single RoboTwin task. |
|
|
| Example: evaluate task `0` (`adjust_bottle`) on `demo_clean`: |
|
|
| ```bash |
| cd third_party/RoboTwin |
| |
| python ../../evaluation/RoboTwin/inference.py \ |
|   --args.ckpt_path zaleni/MagicBot-VGA-Robotwin \ |
|   --args.video_dir ../../evaluation/RoboTwin/output_magicbot/demo_clean/task_00 \ |
|   --args.task_config demo_clean \ |
|   --args.task_idx 0 \ |
|   --args.action_mode delta \ |
|   --args.stats_key aloha \ |
|   --args.dtype bfloat16 \ |
|   --args.qwen3_vl_pretrained_path Qwen/Qwen3-VL-2B-Instruct \ |
|   --args.qwen3_vl_processor_path Qwen/Qwen3-VL-2B-Instruct \ |
|   --args.cosmos_tokenizer_path_or_name nvidia/Cosmos-Tokenizer-CI8x8 \ |
|   --args.disable_3d_teacher_for_eval |
| ``` |
|
|
| If you use local asset directories, replace the public repo ids with your local paths. |
|
|
| Important arguments: |
|
|
| - `--args.ckpt_path`: model repo id or local `pretrained_model` directory |
| - `--args.task_config`: `demo_clean` or `demo_randomized` |
| - `--args.task_idx`: index into the task list defined in `evaluation/RoboTwin/inference.py` (see Section 11) |
| - `--args.action_mode`: usually `delta` for this model |
| - `--args.stats_key`: usually `aloha` for RoboTwin |
| - `--args.dtype`: `bfloat16` is recommended on modern GPUs |
|
|
| Outputs are written to `--args.video_dir`, including: |
|
|
| - replay videos |
| - `summary.json` |
| - `summary.txt` |
|
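If you prefer to drive single-task runs from Python, the command above can be assembled as an argument list. This is a sketch only: `build_inference_cmd` is a hypothetical wrapper, the flags are copied from the example in this section, and the `task_XX` output naming simply follows that example.

```python
import shlex
import sys


def build_inference_cmd(
    task_idx: int,
    task_config: str = "demo_clean",
    ckpt_path: str = "zaleni/MagicBot-VGA-Robotwin",
    qwen3_vl: str = "Qwen/Qwen3-VL-2B-Instruct",
    cosmos_tokenizer: str = "nvidia/Cosmos-Tokenizer-CI8x8",
) -> list:
    """Assemble the inference.py argv; paths are relative to third_party/RoboTwin."""
    video_dir = (
        f"../../evaluation/RoboTwin/output_magicbot/{task_config}/task_{task_idx:02d}"
    )
    return [
        sys.executable,
        "../../evaluation/RoboTwin/inference.py",
        "--args.ckpt_path", ckpt_path,
        "--args.video_dir", video_dir,
        "--args.task_config", task_config,
        "--args.task_idx", str(task_idx),
        "--args.action_mode", "delta",
        "--args.stats_key", "aloha",
        "--args.dtype", "bfloat16",
        "--args.qwen3_vl_pretrained_path", qwen3_vl,
        "--args.qwen3_vl_processor_path", qwen3_vl,
        "--args.cosmos_tokenizer_path_or_name", cosmos_tokenizer,
        "--args.disable_3d_teacher_for_eval",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_inference_cmd(task_idx=0)))
```

To execute it, pass the list to `subprocess.run(cmd, cwd="third_party/RoboTwin", check=True)` from the repository root, which mirrors the `cd third_party/RoboTwin` step in the shell example.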
|
| ## 7. 50-Task Randomized Evaluation |
|
|
| For batch evaluation on RoboTwin randomized tasks, use: |
|
|
| ```bash |
| PRETRAINED_CKPT=zaleni/MagicBot-VGA-Robotwin \ |
| QWEN3_VL_PRETRAINED_PATH=Qwen/Qwen3-VL-2B-Instruct \ |
| QWEN3_VL_PROCESSOR_PATH=Qwen/Qwen3-VL-2B-Instruct \ |
| COSMOS_TOKENIZER_PATH_OR_NAME=nvidia/Cosmos-Tokenizer-CI8x8 \ |
| DISABLE_DA3_TEACHER_FOR_EVAL=true \ |
| GPU_IDS=0,1 \ |
| MAX_JOBS_PER_GPU=2 \ |
| bash evaluation/RoboTwin/eval_randomized_50.sh |
| ``` |
|
|
| Useful environment variables: |
|
|
| - `PRETRAINED_CKPT`: model repo id or local checkpoint directory |
| - `GPU_IDS`: comma-separated GPU ids, for example `0,1,2,3` |
| - `MAX_JOBS_PER_GPU`: parallel RoboTwin jobs per GPU |
| - `TASK_CONFIG`: defaults to `demo_randomized` |
| - `TEST_NUM`: number of episodes per task |
| - `DTYPE`: `bfloat16` or `float32` |
| - `BASE_OUTPUT_PATH`: output root directory |
|
|
| This script writes: |
|
|
| - per-task logs and videos under `tasks/` |
| - aggregated `summary.json` |
| - aggregated `summary.txt` |
|
|
| ## 8. Evaluate a Continuous Task Range |
|
|
| `eval_randomized_50.sh` supports continuous ranges through: |
|
|
| - `START_TASK_IDX` |
| - `TASK_COUNT` |
|
|
| Example: evaluate tasks `10` to `19`: |
|
|
| ```bash |
| PRETRAINED_CKPT=zaleni/MagicBot-VGA-Robotwin \ |
| QWEN3_VL_PRETRAINED_PATH=Qwen/Qwen3-VL-2B-Instruct \ |
| QWEN3_VL_PROCESSOR_PATH=Qwen/Qwen3-VL-2B-Instruct \ |
| COSMOS_TOKENIZER_PATH_OR_NAME=nvidia/Cosmos-Tokenizer-CI8x8 \ |
| DISABLE_DA3_TEACHER_FOR_EVAL=true \ |
| START_TASK_IDX=10 \ |
| TASK_COUNT=10 \ |
| bash evaluation/RoboTwin/eval_randomized_50.sh |
| ``` |
|
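Note that `TASK_COUNT` is a count, not an end index: tasks `10` to `19` inclusive are ten tasks. A small helper makes the conversion explicit. `range_to_env` is a hypothetical name used only for this sketch.

```python
def range_to_env(first: int, last: int) -> dict:
    """Convert an inclusive task range into the two variables the script reads."""
    if first < 0 or last < first:
        raise ValueError(f"invalid task range: {first}..{last}")
    return {
        "START_TASK_IDX": str(first),
        # Inclusive range, so the count is last - first + 1.
        "TASK_COUNT": str(last - first + 1),
    }


if __name__ == "__main__":
    print(range_to_env(10, 19))
```

The resulting dictionary can be merged into the environment of a subprocess, for example `subprocess.run(["bash", "evaluation/RoboTwin/eval_randomized_50.sh"], env={**os.environ, **range_to_env(10, 19)}, check=True)`.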
|
| ## 9. Evaluate the CVPR 2026 RoboTwin Track 11-Task Subset |
|
|
| For the Hugging Face leaderboard |
| [`open-gigaai/CVPR-2026-RoboTwin-Track-LeaderBoard`](https://huggingface.co/spaces/open-gigaai/CVPR-2026-RoboTwin-Track-LeaderBoard), |
| we use the following 11-task subset: |
|
|
| ```text |
| [2, 3, 9, 10, 12, 15, 17, 25, 28, 30, 44] |
| ``` |
|
|
| The exact task names in `evaluation/RoboTwin/inference.py` are: |
|
|
| - `blocks_ranking_rgb` |
| - `blocks_ranking_size` |
| - `handover_mic` |
| - `hanging_mug` |
| - `move_can_pot` |
| - `move_stapler_pad` |
| - `open_microwave` |
| - `place_can_basket` |
| - `place_dual_shoes` |
| - `place_fan` |
| - `stack_blocks_three` |
|
|
| `eval_randomized_50.sh` currently accepts only continuous ranges, not a sparse task list, so we recommend a shell loop over `inference.py`: |
|
|
| ```bash |
| cd third_party/RoboTwin |
| |
| TASKS=(2 3 9 10 12 15 17 25 28 30 44) |
| for t in "${TASKS[@]}"; do |
|   python ../../evaluation/RoboTwin/inference.py \ |
|     --args.ckpt_path zaleni/MagicBot-VGA-Robotwin \ |
|     --args.video_dir ../../evaluation/RoboTwin/output_magicbot/custom_subset/task_${t} \ |
|     --args.task_config demo_randomized \ |
|     --args.task_idx "${t}" \ |
|     --args.action_mode delta \ |
|     --args.stats_key aloha \ |
|     --args.dtype bfloat16 \ |
|     --args.qwen3_vl_pretrained_path Qwen/Qwen3-VL-2B-Instruct \ |
|     --args.qwen3_vl_processor_path Qwen/Qwen3-VL-2B-Instruct \ |
|     --args.cosmos_tokenizer_path_or_name nvidia/Cosmos-Tokenizer-CI8x8 \ |
|     --args.disable_3d_teacher_for_eval |
| done |
| ``` |
|
|
| This produces one output directory per task, each containing replay videos plus `summary.json` and `summary.txt`. |
|
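As a possible alternative to the loop, the sparse subset can be split into continuous runs, each of which fits the `START_TASK_IDX` / `TASK_COUNT` interface of `eval_randomized_50.sh`. The sketch below only computes the runs; `contiguous_runs` is a hypothetical helper. Whether separate batch invocations can share one output directory is not confirmed here, so check the script before relying on this approach for packaging.

```python
def contiguous_runs(indices):
    """Group task indices into (start, count) pairs of consecutive indices."""
    runs = []
    for idx in sorted(set(indices)):
        if runs and idx == runs[-1][0] + runs[-1][1]:
            # idx directly follows the current run, so extend it by one.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            runs.append((idx, 1))
    return runs


if __name__ == "__main__":
    subset = [2, 3, 9, 10, 12, 15, 17, 25, 28, 30, 44]
    for start, count in contiguous_runs(subset):
        print(f"START_TASK_IDX={start} TASK_COUNT={count}")
```

For the 11-task subset this yields nine runs, since only `2, 3` and `9, 10` are adjacent, so the saving over the plain loop is small.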
|
| ## 10. Package the 11-Task Submission and Export Success Rates |
|
|
| After the randomized evaluation run finishes, you can package the 11 leaderboard tasks from that run into a submission-style folder with: |
|
|
| ```bash |
| python util_scripts/package_robotwin_submission.py \ |
|   --run /path/to/output_randomized_50/<run_name>/summary.txt \ |
|   --dst /path/to/output_randomized_50/<run_name>/submission_package \ |
|   --overwrite |
| ``` |
|
|
| If you also want to bundle a policy folder, add: |
|
|
| ```bash |
| --policy-dir /path/to/policy/Your_Policy |
| ``` |
|
|
| The packaging script will: |
|
|
| - create `submission_package/<task_name>/episode0.mp4`, `episode1.mp4`, ... |
| - preserve the 11-task ordering by task index |
| - write `package_manifest.txt` |
| - write `selected_task_summary.json` |
| - write `selected_task_summary.txt` |
|
|
| The selected-task summary files include: |
|
|
| - per-task `success_rate` |
| - per-task `success_count` and `test_num` |
| - `avg_task_success_rate` across the 11 tasks |
| - `overall_episode_success_rate` across all episodes in the 11-task subset |
|
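The two aggregate metrics differ when tasks have different episode counts. The sketch below shows one plausible reading of the field names: `avg_task_success_rate` as the unweighted mean of per-task rates, and `overall_episode_success_rate` as total successes over total episodes. These definitions are an assumption inferred from the names; `util_scripts/package_robotwin_submission.py` remains the authority. The function `summarize` and the counts in the demo are made up for illustration.

```python
from typing import Dict, Tuple


def summarize(results: Dict[str, Tuple[int, int]]) -> dict:
    """results maps task name -> (success_count, test_num)."""
    per_task = {name: ok / total for name, (ok, total) in results.items()}
    total_ok = sum(ok for ok, _ in results.values())
    total_episodes = sum(total for _, total in results.values())
    return {
        "success_rate": per_task,
        # Every task weighs the same, regardless of its episode count.
        "avg_task_success_rate": sum(per_task.values()) / len(per_task),
        # Every episode weighs the same, so larger tasks dominate.
        "overall_episode_success_rate": total_ok / total_episodes,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not measured results.
    demo = {"handover_mic": (8, 10), "place_fan": (10, 40)}
    print(summarize(demo))
```

When every task uses the same `test_num`, the two aggregates coincide.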
|
| This is useful when you want a leaderboard-facing summary for the competition subset rather than the full randomized-50 report. |
|
|
| ## 11. Task Index Reference |
|
|
| Task indices are defined in [`evaluation/RoboTwin/inference.py`](evaluation/RoboTwin/inference.py). |
|
|
| For example: |
|
|
| - `0`: `adjust_bottle` |
| - `2`: `blocks_ranking_rgb` |
| - `3`: `blocks_ranking_size` |
| - `9`: `handover_mic` |
| - `10`: `hanging_mug` |
| - `12`: `move_can_pot` |
| - `15`: `move_stapler_pad` |
| - `17`: `open_microwave` |
| - `25`: `place_can_basket` |
| - `28`: `place_dual_shoes` |
| - `30`: `place_fan` |
| - `44`: `stack_blocks_three` |
|
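For scripting, the leaderboard entries listed above can be kept in a small lookup table. This is a convenience copy of the indices in this README, not an import from the codebase; `evaluation/RoboTwin/inference.py` remains the source of truth, and the names `LEADERBOARD_TASKS`, `task_name`, and `task_index` are hypothetical.

```python
# Index -> task name for the CVPR 2026 RoboTwin Track 11-task subset,
# copied from the list in this README.
LEADERBOARD_TASKS = {
    2: "blocks_ranking_rgb",
    3: "blocks_ranking_size",
    9: "handover_mic",
    10: "hanging_mug",
    12: "move_can_pot",
    15: "move_stapler_pad",
    17: "open_microwave",
    25: "place_can_basket",
    28: "place_dual_shoes",
    30: "place_fan",
    44: "stack_blocks_three",
}


def task_name(idx: int) -> str:
    """Return the task name for a leaderboard task index."""
    return LEADERBOARD_TASKS[idx]


def task_index(name: str) -> int:
    """Return the index for a leaderboard task name."""
    for idx, candidate in LEADERBOARD_TASKS.items():
        if candidate == name:
            return idx
    raise KeyError(name)


if __name__ == "__main__":
    for idx in sorted(LEADERBOARD_TASKS):
        print(f"{idx:2d}: {task_name(idx)}")
```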
|
| ## 12. Common Notes |
|
|
| - `inference.py` can load checkpoints from either a local directory or a Hugging Face repo id. |
| - If your server cannot access Hugging Face online, download the external assets in advance and pass local paths. |
| - If you use the lightweight checkpoint release for action evaluation, keeping `--args.disable_3d_teacher_for_eval` enabled is recommended. |
| - If you want to inspect reconstructed future images during inference, enable `--args.decode_image_flag`, though this is not required for standard RoboTwin scoring. |
|
|
| ## 13. Model Link |
|
|
| Released RoboTwin checkpoint: |
|
|
| - https://huggingface.co/zaleni/MagicBot-VGA-Robotwin |
|
|
| ## 14. Acknowledgments |
|
|
| MagicBot-VGA is developed on top of the excellent InternVLA framework. Our codebase |
| started from that foundation and has since been substantially modified and extended |
| for our own model architecture, training pipeline, and evaluation workflow. |
|
|
| We sincerely thank the [InternVLA](https://github.com/InternRobotics/InternVLA-A1) |
| authors and contributors for open-sourcing their framework and making follow-up |
| research and development much easier. |
|
|
| We also thank the following open-source projects: |
|
|
| - [InternVLA](https://github.com/InternRobotics/InternVLA-A1) |
| - [LeRobot](https://github.com/huggingface/lerobot) |
| - [RoboTwin](https://github.com/RoboTwin-Platform/RoboTwin) |
| - [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) |
| - [NVIDIA Cosmos](https://github.com/nvidia-cosmos) |
|
|