# Mem-0 Execution Module: `m1_mix` (RMBench / RoboTwin 2.0)

A single **Mem-0** low-level execution-module checkpoint trained jointly on all five
RMBench **M1** tasks (the `m1_mix` dataset) and evaluated on each task in turn. M1 tasks
require only the execution module: **no high-level planner / vLLM is needed for
inference.**

- **Backbone:** Qwen3-VL-2B-Instruct (vision-language); *weights fine-tuned and bundled in the checkpoint*
- **Action head:** DiT-B flow-matching policy (action chunk of 30, 16-D action)
- **Memory:** MemoryBank (instant + anchor memory fusion across the episode)
- **Aux head:** subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
- **Total parameters:** ≈ 2.67 B

---

## Results

`task_config = demo_clean`, `instruction_type = unseen`, 100 episodes per task, `action_horizon = 30`.
The **same** checkpoint and **same** `m1_mix` normalization stats are used for every task.

| Task                | Success Rate | Reward |
|---------------------|:------------:|:------:|
| put_back_block      | **1.00**     | 1.00   |
| rearrange_blocks    | **0.86**     | 0.86   |
| swap_blocks         | **0.81**     | 0.81   |
| swap_T              | **0.13**     | 0.13   |
| observe_and_pickup  | **0.03**     | 0.00   |
| **Average**         | **0.566**    | n/a    |

Per-episode logs and rollout videos for all five tasks are under `eval_results/`.
See `eval_results/summary.md` for details and `task_instructions.json` for the exact
per-task language instruction used.

---

## Contents of this bundle

```
m1_mix_submit/
├── README.md                                       # this file
├── task_instructions.json                          # verbatim --global_task per task + scores
├── checkpoint/
│   ├── m1_mix_final_step50000.pt.part00 … part08   # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
│   ├── m1_mix_final_step50000.pt.sha256            # SHA-256 of the reassembled checkpoint
│   └── README_REASSEMBLE.md                        # how to cat the parts back together + verify
├── norm_stats/
│   └── norm_stats.json                             # min-max state/action stats → [-1, 1]
├── configs/
│   ├── execution_module_train_m1_mix.yaml          # training config (reproducibility)
│   └── deploy_policy.yml                           # inference / deployment config
├── qwen_base_config/                               # Qwen3-VL-2B-Instruct config/processor ONLY
│   ├── config.json, generation_config.json
│   ├── tokenizer*.json, vocab.json, merges.txt
│   ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
│   └── README_Qwen3-VL-2B-Instruct.md              # upstream model card (Apache-2.0)
└── eval_results/
    ├── summary.md
    └── <task>/                                     # _result.txt, eval_log.txt, episode*.mp4 (×100)
```

### About the checkpoint

> **Reassemble first.** The 15.3 GB checkpoint is uploaded as 9 byte-split parts
> (`m1_mix_final_step50000.pt.part00…08`) because the upload path capped the size of a
> single file and throttled the bytes transferred per time window. Concatenating the
> parts in order reproduces the original **bit-for-bit**:
>
> ```bash
> cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
> sha256sum -c m1_mix_final_step50000.pt.sha256   # -> m1_mix_final_step50000.pt: OK
> ```
>
> See `checkpoint/README_REASSEMBLE.md` for details.

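
On systems without `cat` and `sha256sum`, the same reassembly can be done in Python. The following is a minimal sketch, not part of the bundle: the helper names `reassemble`, `sha256_of`, and `expected_digest` are hypothetical, and it assumes the `.sha256` file uses the usual `sha256sum` layout (hex digest first, then the file name).

```python
import glob
import hashlib
import shutil


def reassemble(parts_glob, out_path):
    """Concatenate all files matching parts_glob, in sorted name order, into out_path."""
    parts = sorted(glob.glob(parts_glob))
    if not parts:
        raise FileNotFoundError(f"no parts match {parts_glob!r}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so memory use stays small
    return parts


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def expected_digest(sha_file):
    """Read the digest from a sha256sum-style file (assumed: '<hex>  <name>')."""
    with open(sha_file) as f:
        return f.read().split()[0].lower()


# Usage, run from inside checkpoint/:
#   reassemble("m1_mix_final_step50000.pt.part??", "m1_mix_final_step50000.pt")
#   ok = sha256_of("m1_mix_final_step50000.pt") == expected_digest(
#       "m1_mix_final_step50000.pt.sha256")
```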
Once reassembled, `m1_mix_final_step50000.pt` is the **full training checkpoint** at step 50000:

| key                    | content                                                              |
|------------------------|----------------------------------------------------------------------|
| `model_state_dict`     | 910 tensors, ≈ 2.67 B params (`qwen_model` ≈ 2.44 B, `action_model` ≈ 160 M, `memory_bank` ≈ 39 M, `classifier` ≈ 32 M); bf16 + fp32 |
| `optimizer_state_dict` | AdamW moments; for resume/fine-tune only                             |
| `scheduler_state_dict` | cosine LR scheduler state                                            |
| `global_step`          | 50000                                                                |

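
To check the per-component parameter counts in the table against a loaded checkpoint, a small tally like the one below may help. It is a sketch under one assumption that the source does not state explicitly: that each `model_state_dict` key starts with its submodule name (`qwen_model.`, `action_model.`, `memory_bank.`, `classifier.`). The helper name `params_by_prefix` is hypothetical.

```python
from collections import Counter


def params_by_prefix(state_dict):
    """Sum element counts per top-level key prefix (the text before the first '.').

    Works on any mapping whose values expose .numel(), such as torch tensors.
    """
    counts = Counter()
    for name, tensor in state_dict.items():
        counts[name.split(".", 1)[0]] += tensor.numel()
    return dict(counts)


# Usage with the reassembled checkpoint (requires torch):
#   import torch
#   ck = torch.load("m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
#   for prefix, n in params_by_prefix(ck["model_state_dict"]).items():
#       print(f"{prefix}: {n / 1e6:.0f} M")
```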
The `model_state_dict` is **self-contained**: it already includes the fine-tuned
Qwen3-VL-2B backbone weights. The bundled `qwen_base_config/` provides only the
*architecture/tokenizer/processor* config; the base model weights (`model.safetensors`,
~4 GB) are **not** redistributed here. Download them from the official repo (see below).

**Inference-only slimming** (15.3 GB → ≈ 6 GB) if you don't need to resume training:

```python
import torch

# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
# Keep the model weights and step counter; drop optimizer and scheduler state.
torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]},
           "m1_mix_inference.pt")
```

The deploy loader reads `payload["model_state_dict"]` and calls
`load_state_dict(..., strict=False)`, so either the full or the slimmed file works
unchanged.

---

## Dependencies

1. **Code:** the RMBench / Mem-0 repository (this checkpoint targets its
   `policy/Mem-0` execution module and `script/eval_policy.py`). Follow the repo README
   for the RoboTwin 2.0 simulator environment setup.
2. **Base VLM:** `Qwen/Qwen3-VL-2B-Instruct` (Apache-2.0). Required at model
   instantiation for the architecture + image/text processor. Its weights are
   overwritten by this checkpoint at load time (`strict=False`), but the directory must
   exist and contain `model.safetensors`:

   ```bash
   huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
       --local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
   ```

   The small config/processor files in `qwen_base_config/` are exactly the ones used for
   training and evaluation; you may overlay them onto the downloaded directory if the
   upstream revision differs.

---

## How to run evaluation

Point the deploy config at the checkpoint and the `m1_mix` stats, then run one task at a
time. This mirrors exactly how the numbers above were produced:

```bash
python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
    --task_name swap_blocks \
    --execution_ckpt /path/to/m1_mix_final_step50000.pt \
    --state_stats_path /path/to/norm_stats/norm_stats.json \
    --ckpt_setting m1mix \
    --global_task "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
    --action_horizon 30
```

- Change the values of `--task_name` and `--global_task` for each of the five tasks (the
  exact strings are in `task_instructions.json`). The checkpoint and `--state_stats_path`
  stay the same.
- `--ckpt_setting m1mix` only labels the output directory
  (`eval_result/<task>/Mem-0/demo_clean/m1mix/<timestamp>/`).
- `--vllm_url` is accepted but unused for M1 tasks (the global instruction is set
  directly; the planner client is constructed but never queried).
- Ensure `execution_module.qwen_vl.model_path` in `deploy_policy.yml` points to your
  downloaded Qwen3-VL-2B-Instruct directory.

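
Running all five tasks means repeating the command above with only two values changed, so it can be convenient to generate the argument list. The sketch below uses only the flags shown in this README; the helper name `build_eval_command` is hypothetical, and loading the instruction strings is left out because the schema of `task_instructions.json` is not described here.

```python
import shlex


def build_eval_command(task_name, global_task, ckpt_path, stats_path,
                       config="policy/Mem-0/deploy_policy.yml",
                       ckpt_setting="m1mix", action_horizon=30):
    """Return the eval_policy.py invocation as an argument list (no shell quoting needed)."""
    return [
        "python", "script/eval_policy.py", "--config", config, "--overrides",
        "--task_name", task_name,
        "--execution_ckpt", ckpt_path,
        "--state_stats_path", stats_path,
        "--ckpt_setting", ckpt_setting,
        "--global_task", global_task,
        "--action_horizon", str(action_horizon),
    ]


if __name__ == "__main__":
    # Placeholder instruction: copy the real string from task_instructions.json.
    cmd = build_eval_command(
        "put_back_block",
        "<instruction string from task_instructions.json>",
        "/path/to/m1_mix_final_step50000.pt",
        "/path/to/norm_stats/norm_stats.json",
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```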

---

## Model architecture (from `configs/`)

- **VLM backbone:** Qwen3-VL-2B-Instruct, 224×224 head-camera image + language
  instruction, last-layer hidden states (hidden size 2048).
- **MemoryBank:** `window_size 30`, `initial_anchor_size 1`, `num_heads 8`,
  `memory_accumulation 8`, `dropout 0.1`; fuses an instant-memory and an anchor-memory
  token; concatenated with the text feature → a 3-token summary `(B, 3, 2048)`.
- **DiT-B action head** (`FlowmatchingActionHead`): `num_layers 16`,
  `cross_attention_dim 2048`, `action_dim 16`, `state_dim 16`, `action_horizon 30`,
  `num_inference_timesteps 8`; flow-matching regression of a 30-step action chunk.
- **Subtask-end classifier:** MLP `hidden_sizes [6144, 2048, 512]`, `pos_weight 10.0`,
  `focal_gamma 1.0`, `threshold 0.5`. Drives stage transitions in Mn tasks; for M1 the
  episode is a single stage, so it does not affect rollout.

## Training (from `configs/execution_module_train_m1_mix.yaml`)

- **Data:** `m1_mix` (the five M1 tasks merged into one LeRobot dataset with globally
  unique `episode_id`s). Features: head-camera image, state, action, subtask,
  subtask_end, episode_id.
- **Schedule:** `train_steps 50000`, `batch_size 56`, cosine scheduler,
  `warmup_ratio 0.05`, `grad_clip_norm 2.5`, `weight_decay 0.005`, `seed 42`.
- **Learning rates:** base `1e-5`, qwen_model `1e-5`, action_model `1e-4`,
  classifier `1e-4` (min LRs `1e-6 / 1e-6 / 5e-6 / 5e-6`).
- **Loss:** `lambda_action 1.0`, `lambda_classifier 0.2`.

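
To make the schedule concrete: with `train_steps 50000` and `warmup_ratio 0.05`, warmup covers the first 2500 steps. The sketch below shows one common reading of "cosine scheduler with warmup", using the base rate `1e-5` and minimum `1e-6`. The linear warmup from zero and the decay to the minimum LR at the final step are assumptions, not details confirmed by the config; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, base_lr=1e-5, min_lr=1e-6, total_steps=50_000, warmup_ratio=0.05):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay to min_lr."""
    warmup_steps = round(total_steps * warmup_ratio)  # 2500 for the values above
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Usage: inspect a few points on the curve.
#   for s in (0, 1250, 2500, 26250, 50000):
#       print(s, lr_at(s))
```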
## Normalization

State and action are min-max normalized to `[-1, 1]` over the 14 arm dimensions using
`norm_stats/norm_stats.json` (`NORM_WAY = "minmax"` in `deploy_policy.py`). Use the same
stats file at inference: predicted actions are denormalized with it before being sent to
the environment. Where consecutive predicted action chunks overlap, the overlapping
actions are averaged (mean smoothing) before execution.

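
The two operations described above can be sketched in a few lines. This uses the standard min-max formula and a plain per-step mean. It does not reproduce `deploy_policy.py`: the function names are hypothetical, and the sketch does not model which 14 of the 16 dimensions are normalized or how a zero-range dimension is handled in the real code (here it maps to 0).

```python
from collections import defaultdict


def normalize(x, lo, hi):
    """Map each value from [lo, hi] to [-1, 1]; zero-range dims map to 0 (assumption)."""
    return [2.0 * (v - a) / (b - a) - 1.0 if b > a else 0.0
            for v, a, b in zip(x, lo, hi)]


def denormalize(y, lo, hi):
    """Inverse of normalize: map each value from [-1, 1] back to [lo, hi]."""
    return [(v + 1.0) * 0.5 * (b - a) + a for v, a, b in zip(y, lo, hi)]


def average_overlaps(chunks):
    """Mean-smooth overlapping chunks.

    chunks: list of (start_step, [action, ...]) where each action is a list of floats.
    Returns {step: per-dimension mean over every chunk that predicted that step}.
    """
    per_step = defaultdict(list)
    for start, actions in chunks:
        for offset, action in enumerate(actions):
            per_step[start + offset].append(action)
    return {step: [sum(dim) / len(preds) for dim in zip(*preds)]
            for step, preds in sorted(per_step.items())}


# Usage: two 2-step chunks that overlap at step 1.
#   average_overlaps([(0, [[0.0], [2.0]]), (1, [[4.0], [6.0]])])
```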
## Limitations

- **swap_T (0.13)** and **observe_and_pickup (0.03)** are weak: the former needs precise
  T-block position *and* orientation alignment; the latter needs cross-view target
  re-identification after a visual occlusion followed by a pickup. The joint `m1_mix`
  model does not solve these reliably.
- Numbers are on RoboTwin 2.0 `demo_clean` with `unseen` instruction phrasings; results
  under other task configs or domain randomization will differ.

## License & attribution

- Base VLM **Qwen3-VL-2B-Instruct** is © the Qwen team, licensed **Apache-2.0**
  (see `qwen_base_config/README_Qwen3-VL-2B-Instruct.md`). Because the checkpoint
  embeds fine-tuned Qwen weights, that license applies to the corresponding components.
- RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream
  licenses; refer to the source repository.