# Mem-0 Execution Module: `m1_mix` (RMBench / RoboTwin 2.0)

A single **Mem-0** low-level execution-module checkpoint trained jointly on all five RMBench **M1** tasks (the `m1_mix` dataset) and evaluated on each task in turn. M1 tasks require only the execution module: **no high-level planner / vLLM is needed for inference.**

- **Backbone:** Qwen3-VL-2B-Instruct (vision-language); *weights fine-tuned and bundled in the checkpoint*
- **Action head:** DiT-B flow-matching policy (action chunk of 30, 16-D action)
- **Memory:** MemoryBank (instant + anchor memory fusion across the episode)
- **Aux head:** subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
- **Total parameters:** ≈ 2.67 B

---

## Results

`task_config = demo_clean`, `instruction_type = unseen`, 100 episodes per task, `action_horizon = 30`. The **same** checkpoint and **same** `m1_mix` normalization stats are used for every task.

| Task                | Success Rate | Reward |
|---------------------|:------------:|:------:|
| put_back_block      | **1.00**     | 1.00   |
| rearrange_blocks    | **0.86**     | 0.86   |
| swap_blocks         | **0.81**     | 0.81   |
| swap_T              | **0.13**     | 0.13   |
| observe_and_pickup  | **0.03**     | 0.00   |
| **Average**         | **0.566**    | -      |

Per-episode logs and rollout videos for all five tasks are under `eval_results/`. See `eval_results/summary.md` for details and `task_instructions.json` for the exact per-task language instruction used.

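
The **Average** row in the results table is the unweighted mean of the five per-task success rates. The short check below reproduces it from the numbers in the table; the dictionary name is illustrative and not part of the repository. Because every task was run for the same 100 episodes, the unweighted mean equals the pooled success rate over all 500 episodes.

```python
# Success rates copied from the results table (100 episodes per task).
success_rate = {
    "put_back_block": 1.00,
    "rearrange_blocks": 0.86,
    "swap_blocks": 0.81,
    "swap_T": 0.13,
    "observe_and_pickup": 0.03,
}

# Unweighted mean across the five tasks.
average = sum(success_rate.values()) / len(success_rate)
print(f"{average:.3f}")  # prints 0.566
```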

---

## Contents of this bundle

```
m1_mix_submit/
├── README.md                          # this file
├── task_instructions.json             # verbatim --global_task per task + scores
├── checkpoint/
│   ├── m1_mix_final_step50000.pt.part00 … part08   # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
│   ├── m1_mix_final_step50000.pt.sha256            # SHA-256 of the reassembled checkpoint
│   └── README_REASSEMBLE.md                        # how to cat the parts back together + verify
├── norm_stats/
│   └── norm_stats.json                # min-max state/action stats → [-1, 1]
├── configs/
│   ├── execution_module_train_m1_mix.yaml   # training config (reproducibility)
│   └── deploy_policy.yml                    # inference / deployment config
├── qwen_base_config/                  # Qwen3-VL-2B-Instruct config/processor ONLY
│   ├── config.json, generation_config.json
│   ├── tokenizer*.json, vocab.json, merges.txt
│   ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
│   └── README_Qwen3-VL-2B-Instruct.md # upstream model card (Apache-2.0)
└── eval_results/
    ├── summary.md
    └── /                              # _result.txt, eval_log.txt, episode*.mp4 (×100)
```

### About the checkpoint

> **Reassemble first.** The 15.3 GB checkpoint is uploaded as 9 byte-split parts
> (`m1_mix_final_step50000.pt.part00…08`) because the upload path capped single files and
> throttled per-window bytes. Concatenation reproduces the original **bit-for-bit**:
>
> ```bash
> cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
> sha256sum -c m1_mix_final_step50000.pt.sha256   # -> m1_mix_final_step50000.pt: OK
> ```
>
> See `checkpoint/README_REASSEMBLE.md` for details.

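
Where `cat` and `sha256sum` are not available (for example on Windows), the same reassembly can be sketched in Python. This is a minimal illustration and not part of the bundle: the helper names `reassemble` and `expected_digest` are hypothetical, and the sketch assumes the `.sha256` file uses the usual `sha256sum` layout, with the hex digest as the first whitespace-separated token.

```python
import hashlib
from pathlib import Path


def reassemble(parts_dir: str, stem: str, out_path: str, chunk: int = 1 << 20) -> str:
    """Concatenate <stem>.partNN files in name order and return the SHA-256 hex digest."""
    # Two-digit suffixes sort correctly as plain strings (part00, part01, ...).
    parts = sorted(Path(parts_dir).glob(stem + ".part[0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no {stem}.partNN files found in {parts_dir}")
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in fixed-size blocks so multi-GB parts never sit fully in memory.
                while block := src.read(chunk):
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()


def expected_digest(sha_file: str) -> str:
    """Read the expected digest, assuming the 'HEX  filename' layout written by sha256sum."""
    return Path(sha_file).read_text().split()[0].lower()
```

A typical call would be `reassemble("checkpoint", "m1_mix_final_step50000.pt", "checkpoint/m1_mix_final_step50000.pt")`, followed by comparing the returned digest with `expected_digest("checkpoint/m1_mix_final_step50000.pt.sha256")`.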

Once reassembled, `m1_mix_final_step50000.pt` is the **full training checkpoint** at step 50000:

| key                    | content                                                              |
|------------------------|----------------------------------------------------------------------|
| `model_state_dict`     | 910 tensors, ≈ 2.67 B params (`qwen_model` ≈ 2.44 B, `action_model` ≈ 160 M, `memory_bank` ≈ 39 M, `classifier` ≈ 32 M); bf16 + fp32 |
| `optimizer_state_dict` | AdamW moments, for resume/fine-tune only                             |
| `scheduler_state_dict` | cosine LR scheduler state                                            |
| `global_step`          | 50000                                                                |

The `model_state_dict` is **self-contained**: it already includes the fine-tuned Qwen3-VL-2B backbone weights. The bundled `qwen_base_config/` provides only the *architecture/tokenizer/processor* config; the base model weights (`model.safetensors`, ~4 GB) are **not** re-distributed here and must be downloaded from the official repo (see below).

**Inference-only slimming** (15.3 GB → ≈ 6 GB) if you don't need to resume training:

```python
import torch

# weights_only=False unpickles arbitrary objects, so only load a checkpoint you trust.
ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
# Keep only what inference needs; the optimizer and scheduler state are dropped.
torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]}, "m1_mix_inference.pt")
```

The deploy loader reads `payload["model_state_dict"]` and calls `load_state_dict(..., strict=False)`, so either the full or the slimmed file works unchanged.

---

## Dependencies

1. **Code:** the RMBench / Mem-0 repository (this checkpoint targets its `policy/Mem-0` execution module and `script/eval_policy.py`). Follow the repo README for the RoboTwin 2.0 simulator environment setup.
2. **Base VLM:** `Qwen/Qwen3-VL-2B-Instruct` (Apache-2.0). Required at model instantiation for the architecture + image/text processor.
   Its weights are overwritten by this checkpoint at load time (`strict=False`), but the directory must exist and contain `model.safetensors`:

   ```bash
   huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
       --local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
   ```

   The small config/processor files in `qwen_base_config/` are exactly the ones used for training and evaluation; you may overlay them onto the downloaded directory if the upstream revision differs.

---

## How to run evaluation

Point the deploy config at the checkpoint and the `m1_mix` stats, then run one task at a time. This mirrors exactly how the numbers above were produced:

```bash
python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
    --task_name swap_blocks \
    --execution_ckpt /path/to/m1_mix_final_step50000.pt \
    --state_stats_path /path/to/norm_stats/norm_stats.json \
    --ckpt_setting m1mix \
    --global_task "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
    --action_horizon 30
```

- Replace `--task_name` and `--global_task` with each of the five tasks (strings in `task_instructions.json`). The checkpoint and `--state_stats_path` stay the same.
- `--ckpt_setting m1mix` only labels the output directory (`eval_result//Mem-0/demo_clean/m1mix//`).
- `--vllm_url` is accepted but unused for M1 tasks (the global instruction is set directly; the planner client is constructed but never queried).
- Ensure `execution_module.qwen_vl.model_path` in `deploy_policy.yml` points to your downloaded Qwen3-VL-2B-Instruct directory.

---

## Model architecture (from `configs/`)

- **VLM backbone:** Qwen3-VL-2B-Instruct, 224×224 head-camera image + language instruction, last-layer hidden states (hidden size 2048).
- **MemoryBank:** `window_size 30`, `initial_anchor_size 1`, `num_heads 8`, `memory_accumulation 8`, `dropout 0.1`; fuses an instant-memory and an anchor-memory token; concatenated with the text feature → a 3-token summary `(B, 3, 2048)`.
- **DiT-B action head** (`FlowmatchingActionHead`): `num_layers 16`, `cross_attention_dim 2048`, `action_dim 16`, `state_dim 16`, `action_horizon 30`, `num_inference_timesteps 8`; flow-matching regression of a 30-step action chunk.
- **Subtask-end classifier:** MLP `hidden_sizes [6144, 2048, 512]`, `pos_weight 10.0`, `focal_gamma 1.0`, `threshold 0.5`. Drives stage transitions in Mn tasks; for M1 the episode is a single stage so it does not affect rollout.

## Training (from `configs/execution_module_train_m1_mix.yaml`)

- **Data:** `m1_mix` (the five M1 tasks merged into one LeRobot dataset with globally unique `episode_id`s). Features: head-camera image, state, action, subtask, subtask_end, episode_id.
- **Schedule:** `train_steps 50000`, `batch_size 56`, cosine scheduler, `warmup_ratio 0.05`, `grad_clip_norm 2.5`, `weight_decay 0.005`, `seed 42`.
- **Learning rates:** base `1e-5`, qwen_model `1e-5`, action_model `1e-4`, classifier `1e-4` (min LRs `1e-6 / 1e-6 / 5e-6 / 5e-6`).
- **Loss:** `lambda_action 1.0`, `lambda_classifier 0.2`.

## Normalization

State and action are min-max normalized to `[-1, 1]` over the 14 arm dimensions using `norm_stats/norm_stats.json` (`NORM_WAY = "minmax"` in `deploy_policy.py`). Use the same stats file at inference; predicted actions are denormalized with it before being sent to the environment. Action chunks from overlapping predictions are averaged (mean smoothing) before execution.

## Limitations

- **swap_T (0.13)** and **observe_and_pickup (0.03)** are weak: the former needs precise T-block position *and* orientation alignment; the latter needs cross-view target re-identification after a visual occlusion followed by a pickup. The joint `m1_mix` model does not solve these reliably.
- Numbers are on RoboTwin 2.0 `demo_clean` with `unseen` instruction phrasings; other task configs / domain randomization will differ.

## License & attribution

- Base VLM **Qwen3-VL-2B-Instruct** is © the Qwen team, licensed **Apache-2.0** (see `qwen_base_config/README_Qwen3-VL-2B-Instruct.md`). Because the checkpoint embeds fine-tuned Qwen weights, that license applies to the corresponding components.
- RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream licenses; refer to the source repository.