Mem-0 Execution Module: m1_mix (RMBench / RoboTwin 2.0)
A single Mem-0 low-level execution-module checkpoint trained jointly on all five
RMBench M1 tasks (the m1_mix dataset) and evaluated on each task in turn. M1 tasks
require only the execution module; no high-level planner / vLLM is needed for
inference.
- Backbone: Qwen3-VL-2B-Instruct (vision-language); weights fine-tuned and bundled in the checkpoint
- Action head: DiT-B flow-matching policy (action chunk of 30, 16-D action)
- Memory: MemoryBank (instant + anchor memory fusion across the episode)
- Aux head: subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
- Total parameters: ≈ 2.67 B
Results
task_config = demo_clean, instruction_type = unseen, 100 episodes per task, action_horizon = 30.
The same checkpoint and same m1_mix normalization stats are used for every task.
| Task | Success Rate | Reward |
|---|---|---|
| put_back_block | 1.00 | 1.00 |
| rearrange_blocks | 0.86 | 0.86 |
| swap_blocks | 0.81 | 0.81 |
| swap_T | 0.13 | 0.13 |
| observe_and_pickup | 0.03 | 0.00 |
| Average | 0.566 | - |
Per-episode logs and rollout videos for all five tasks are under eval_results/.
See eval_results/summary.md for details and task_instructions.json for the exact
per-task language instruction used.
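The Average row can be checked directly from the per-task success rates in the table. A minimal sketch; the dictionary below simply restates the table and is not read from the bundle:

```python
# Per-task success rates copied from the Results table above.
success_rate = {
    "put_back_block": 1.00,
    "rearrange_blocks": 0.86,
    "swap_blocks": 0.81,
    "swap_T": 0.13,
    "observe_and_pickup": 0.03,
}

# Unweighted mean over the five tasks (each task has 100 episodes).
average = sum(success_rate.values()) / len(success_rate)
print(f"{average:.3f}")  # prints 0.566
```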
Contents of this bundle
m1_mix_submit/
├── README.md                          # this file
├── task_instructions.json             # verbatim --global_task per task + scores
├── checkpoint/
│   ├── m1_mix_final_step50000.pt.part00 … part08   # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
│   ├── m1_mix_final_step50000.pt.sha256            # SHA-256 of the reassembled checkpoint
│   └── README_REASSEMBLE.md                        # how to cat the parts back together + verify
├── norm_stats/
│   └── norm_stats.json                # min-max state/action stats → [-1, 1]
├── configs/
│   ├── execution_module_train_m1_mix.yaml   # training config (reproducibility)
│   └── deploy_policy.yml                    # inference / deployment config
├── qwen_base_config/                  # Qwen3-VL-2B-Instruct config/processor ONLY
│   ├── config.json, generation_config.json
│   ├── tokenizer*.json, vocab.json, merges.txt
│   ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
│   └── README_Qwen3-VL-2B-Instruct.md # upstream model card (Apache-2.0)
└── eval_results/
    ├── summary.md
    └── <task>/                        # _result.txt, eval_log.txt, episode*.mp4 (×100)
About the checkpoint
Reassemble first. The 15.3 GB checkpoint is uploaded as 9 byte-split parts (m1_mix_final_step50000.pt.part00…08) because the upload path capped single files and throttled per-window bytes. Concatenation reproduces the original bit-for-bit:
cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
sha256sum -c m1_mix_final_step50000.pt.sha256   # -> m1_mix_final_step50000.pt: OK
See checkpoint/README_REASSEMBLE.md for details.
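On systems without cat and sha256sum, the same reassembly and check can be approximated in Python. This is a minimal sketch, not part of the bundle: the helper names reassemble and expected_digest are hypothetical, and it assumes the .sha256 file uses the usual sha256sum layout (hex digest first, then the file name).

```python
import hashlib
from pathlib import Path


def reassemble(parts_dir, name, chunk_size=1 << 20):
    """Concatenate name.part00, name.part01, ... into name and hash the result."""
    parts_dir = Path(parts_dir)
    # Two-digit suffixes sort correctly as plain strings.
    parts = sorted(parts_dir.glob(name + ".part??"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {name} in {parts_dir}")
    out_path = parts_dir / name
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so multi-GB parts never sit fully in memory.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    digest.update(chunk)
    return out_path, digest.hexdigest()


def expected_digest(sha_file):
    """Read the hex digest from a sha256sum-style file (assumed layout)."""
    return Path(sha_file).read_text().split()[0].lower()


# Example usage (paths follow the bundle layout):
#   out, got = reassemble("checkpoint", "m1_mix_final_step50000.pt")
#   assert got == expected_digest("checkpoint/m1_mix_final_step50000.pt.sha256")
```

Hashing while writing avoids a second pass over the 15.3 GB file.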
Once reassembled, m1_mix_final_step50000.pt is the full training checkpoint at step 50000:
| key | content |
|---|---|
| model_state_dict | 910 tensors, ≈ 2.67 B params (qwen_model ≈ 2.44 B, action_model ≈ 160 M, memory_bank ≈ 39 M, classifier ≈ 32 M); bf16 + fp32 |
| optimizer_state_dict | AdamW moments; for resume/fine-tune only |
| scheduler_state_dict | cosine LR scheduler state |
| global_step | 50000 |
The model_state_dict is self-contained: it already includes the fine-tuned
Qwen3-VL-2B backbone weights. The bundled qwen_base_config/ provides only the
architecture/tokenizer/processor config; the base model weights (model.safetensors,
~4 GB) are not re-distributed here; download them from the official repo (see below).
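The per-module parameter counts quoted for model_state_dict can be reproduced by grouping tensors by the first component of their key. A minimal sketch, assuming the state-dict keys are prefixed with the module names listed above (qwen_model., action_model., memory_bank., classifier.); the helper name params_by_module is hypothetical, and it works on plain shapes so it can be tried without loading the checkpoint:

```python
from collections import defaultdict
from math import prod


def params_by_module(shapes):
    """Sum element counts per top-level module prefix.

    shapes maps a state-dict key (e.g. "action_model.blocks.0.weight")
    to that tensor's shape as a tuple of ints.
    """
    totals = defaultdict(int)
    for name, shape in shapes.items():
        prefix = name.split(".", 1)[0]
        totals[prefix] += prod(shape)  # prod(()) == 1, so scalars count as 1
    return dict(totals)


# Toy shapes for illustration only; these are not the real tensor shapes.
toy = {
    "qwen_model.embed.weight": (10, 4),
    "qwen_model.norm.weight": (4,),
    "action_model.proj.weight": (4, 16),
    "classifier.bias": (),
}
print(params_by_module(toy))
# prints {'qwen_model': 44, 'action_model': 64, 'classifier': 1}
```

With the reassembled checkpoint loaded as ck, the input could be built as {k: tuple(v.shape) for k, v in ck["model_state_dict"].items()}.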
Inference-only slimming (15.3 GB → ≈ 6 GB) if you don't need to resume training:
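import torch
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
# Keep only the model weights and the step counter; drop optimizer and scheduler state.
torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]},
           "m1_mix_inference.pt")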
The deploy loader reads payload["model_state_dict"] and calls
load_state_dict(..., strict=False), so either the full or the slimmed file works
unchanged.
Dependencies
- Code: the RMBench / Mem-0 repository (this checkpoint targets its policy/Mem-0 execution module and script/eval_policy.py). Follow the repo README for the RoboTwin 2.0 simulator environment setup.
- Base VLM: Qwen/Qwen3-VL-2B-Instruct (Apache-2.0). Required at model instantiation for the architecture + image/text processor. Its weights are overwritten by this checkpoint at load time (strict=False), but the directory must exist and contain model.safetensors:
  huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
    --local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
  The small config/processor files in qwen_base_config/ are exactly the ones used for training and evaluation; you may overlay them onto the downloaded directory if the upstream revision differs.
How to run evaluation
Point the deploy config at the checkpoint and the m1_mix stats, then run one task at a
time. This mirrors exactly how the numbers above were produced:
python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
--task_name swap_blocks \
--execution_ckpt /path/to/m1_mix_final_step50000.pt \
--state_stats_path /path/to/norm_stats/norm_stats.json \
--ckpt_setting m1mix \
--global_task "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
--action_horizon 30
- Replace --task_name and --global_task with each of the five tasks (strings in task_instructions.json). The checkpoint and --state_stats_path stay the same.
- --ckpt_setting m1mix only labels the output directory (eval_result/<task>/Mem-0/demo_clean/m1mix/<timestamp>/).
- --vllm_url is accepted but unused for M1 tasks (the global instruction is set directly; the planner client is constructed but never queried).
- Ensure execution_module.qwen_vl.model_path in deploy_policy.yml points to your downloaded Qwen3-VL-2B-Instruct directory.
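Running all five tasks means repeating the command above with a different task name and instruction each time. The sketch below builds that command as an argument list; build_eval_command is a hypothetical helper, only the flags shown in the command above are used, and the instruction strings are assumed to be supplied by the caller (the layout of task_instructions.json is not documented here, so it is not parsed).

```python
import shlex


def build_eval_command(task_name, global_task, ckpt_path, stats_path,
                       ckpt_setting="m1mix", action_horizon=30):
    """Return the eval_policy.py invocation for one task as an argv list."""
    return [
        "python", "script/eval_policy.py",
        "--config", "policy/Mem-0/deploy_policy.yml",
        "--overrides",
        "--task_name", task_name,
        "--execution_ckpt", ckpt_path,
        "--state_stats_path", stats_path,
        "--ckpt_setting", ckpt_setting,
        "--global_task", global_task,
        "--action_horizon", str(action_horizon),
    ]


# Placeholder instruction; the real strings are in task_instructions.json.
cmd = build_eval_command(
    "swap_blocks",
    "<instruction for swap_blocks>",
    "/path/to/m1_mix_final_step50000.pt",
    "/path/to/norm_stats/norm_stats.json",
)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

Passing the list to subprocess.run(cmd, check=True) avoids shell-quoting problems with the long instruction strings.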
Model architecture (from configs/)
- VLM backbone: Qwen3-VL-2B-Instruct, 224×224 head-camera image + language instruction, last-layer hidden states (hidden size 2048).
- MemoryBank: window_size 30, initial_anchor_size 1, num_heads 8, memory_accumulation 8, dropout 0.1; fuses an instant-memory and an anchor-memory token; concatenated with the text feature → a 3-token summary (B, 3, 2048).
- DiT-B action head (FlowmatchingActionHead): num_layers 16, cross_attention_dim 2048, action_dim 16, state_dim 16, action_horizon 30, num_inference_timesteps 8; flow-matching regression of a 30-step action chunk.
- Subtask-end classifier: MLP hidden_sizes [6144, 2048, 512], pos_weight 10.0, focal_gamma 1.0, threshold 0.5. Drives stage transitions in Mn tasks; for M1 the episode is a single stage so it does not affect rollout.
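To make num_inference_timesteps 8 concrete, the sketch below shows a generic flow-matching sampler: start from Gaussian noise with the chunk shape (30, 16) and take 8 Euler steps along a velocity field. It is an illustration under stated assumptions, not the FlowmatchingActionHead code: the Euler solver, the time direction (noise at t=0, action at t=1), and the function names are assumptions, and make_toy_velocity is a stand-in for the DiT-B network, which would also condition on the memory summary and state.

```python
import random


def euler_sample(velocity_fn, horizon=30, action_dim=16, num_steps=8, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Initial sample: standard Gaussian noise shaped like one action chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy field pointing straight at a fixed target chunk (hypothetical)."""
    def velocity(x, t):
        # On a straight noise-to-target path, the remaining displacement
        # divided by the remaining time is the required velocity.
        return [[(ti - xi) / (1.0 - t) for xi, ti in zip(x_row, t_row)]
                for x_row, t_row in zip(x, target)]
    return velocity


target = [[0.1 * step for _ in range(16)] for step in range(30)]
chunk = euler_sample(make_toy_velocity(target))
print(len(chunk), len(chunk[0]))  # prints 30 16
```

With this toy field the sampler lands on the target after the last step; a learned velocity network only approximates such a field, which is why the number of integration steps is a tunable setting.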
Training (from configs/execution_module_train_m1_mix.yaml)
- Data: m1_mix (the five M1 tasks merged into one LeRobot dataset with globally unique episode_ids). Features: head-camera image, state, action, subtask, subtask_end, episode_id.
- Schedule: train_steps 50000, batch_size 56, cosine scheduler, warmup_ratio 0.05, grad_clip_norm 2.5, weight_decay 0.005, seed 42.
- Learning rates: base 1e-5, qwen_model 1e-5, action_model 1e-4, classifier 1e-4 (min LRs 1e-6 / 1e-6 / 5e-6 / 5e-6).
- Loss: lambda_action 1.0, lambda_classifier 0.2.
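For orientation, the schedule settings above imply 2500 warmup steps (0.05 × 50000) followed by cosine decay toward each group's minimum LR. The sketch below is one common form of such a schedule; the linear warmup shape and the exact decay formula are assumptions, since the training config names only a cosine scheduler with a warmup ratio, and lr_at is a hypothetical helper.

```python
import math


def lr_at(step, base_lr, min_lr, train_steps=50000, warmup_ratio=0.05):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = round(train_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, train_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# action_model group from the list above: peak 1e-4, floor 5e-6.
for step in (0, 2500, 26250, 50000):
    print(step, lr_at(step, base_lr=1e-4, min_lr=5e-6))
```

Under these assumptions the action_model rate peaks at step 2500, passes the midpoint between 1e-4 and 5e-6 at step 26250, and ends at 5e-6.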
Normalization
State and action are min-max normalized to [-1, 1] over the 14 arm dimensions using
norm_stats/norm_stats.json (NORM_WAY = "minmax" in deploy_policy.py). Use the same
stats file at inference; predicted actions are denormalized with it before being sent to
the environment. Action chunks from overlapping predictions are averaged (mean smoothing)
before execution.
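The normalization and smoothing described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the per-dimension minima and maxima are passed in as plain lists because the key layout of norm_stats.json is not documented here, and the averaging rule (a plain mean over every chunk that covers a timestep) is an assumption about what mean smoothing does.

```python
def normalize(values, lo, hi, eps=1e-8):
    """Map each dimension from [lo, hi] to [-1, 1]."""
    return [2.0 * (v - a) / max(b - a, eps) - 1.0
            for v, a, b in zip(values, lo, hi)]


def denormalize(values, lo, hi):
    """Inverse of normalize: map [-1, 1] back to [lo, hi]."""
    return [(v + 1.0) * 0.5 * (b - a) + a for v, a, b in zip(values, lo, hi)]


def average_overlapping(chunks):
    """Average the actions predicted for the same timestep.

    chunks is a list of (start_step, actions), where actions[k] is the
    action vector predicted for timestep start_step + k.
    """
    sums, counts = {}, {}
    for start, actions in chunks:
        for offset, action in enumerate(actions):
            step = start + offset
            if step not in sums:
                sums[step] = list(action)
                counts[step] = 1
            else:
                sums[step] = [s + a for s, a in zip(sums[step], action)]
                counts[step] += 1
    return {step: [s / counts[step] for s in sums[step]] for step in sorted(sums)}


lo, hi = [0.0, -2.0], [4.0, 2.0]
print(normalize([1.0, 0.0], lo, hi))       # prints [-0.5, 0.0]
print(denormalize([-0.5, 0.0], lo, hi))    # prints [1.0, 0.0]
print(average_overlapping([(0, [[0.0], [2.0]]), (1, [[4.0], [6.0]])]))
# prints {0: [0.0], 1: [3.0], 2: [6.0]}
```

The eps guard keeps a constant dimension (max equal to min) from dividing by zero.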
Limitations
- swap_T (0.13) and observe_and_pickup (0.03) are weak: the former needs precise T-block position and orientation alignment; the latter needs cross-view target re-identification after a visual occlusion followed by a pickup. The joint m1_mix model does not solve these reliably.
- Numbers are on RoboTwin 2.0 demo_clean with unseen instruction phrasings; other task configs / domain randomization will differ.
License & attribution
- Base VLM Qwen3-VL-2B-Instruct is © the Qwen team, licensed Apache-2.0 (see qwen_base_config/README_Qwen3-VL-2B-Instruct.md). Because the checkpoint embeds fine-tuned Qwen weights, that license applies to the corresponding components.
- RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream licenses; refer to the source repository.