# Mem-0 Execution Module — `m1_mix` (RMBench / RoboTwin 2.0)

A single **Mem-0** low-level execution-module checkpoint trained jointly on all five
RMBench **M1** tasks (the `m1_mix` dataset) and evaluated on each task in turn. M1 tasks
require only the execution module — **no high-level planner / vLLM is needed for
inference.**

- **Backbone:** Qwen3-VL-2B-Instruct (vision-language) — *weights fine-tuned and bundled in the checkpoint*
- **Action head:** DiT-B flow-matching policy (action chunk of 30, 16-D action)
- **Memory:** MemoryBank (instant + anchor memory fusion across the episode)
- **Aux head:** subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
- **Total parameters:** ≈ 2.67 B

---

## Results

`task_config = demo_clean`, `instruction_type = unseen`, 100 episodes per task, `action_horizon = 30`.
The **same** checkpoint and **same** `m1_mix` normalization stats are used for every task.

| Task                | Success Rate | Reward |
|---------------------|:------------:|:------:|
| put_back_block      | **1.00**     | 1.00   |
| rearrange_blocks    | **0.86**     | 0.86   |
| swap_blocks         | **0.81**     | 0.81   |
| swap_T              | **0.13**     | 0.13   |
| observe_and_pickup  | **0.03**     | 0.00   |
| **Average**         | **0.566**    | —      |

Per-episode logs and rollout videos for all five tasks are under `eval_results/`.
See `eval_results/summary.md` for details and `task_instructions.json` for the exact
per-task language instruction used.

---

## Contents of this bundle

```
m1_mix_submit/
├── README.md                                   # this file
├── task_instructions.json                      # verbatim --global_task per task + scores
├── checkpoint/
│   ├── m1_mix_final_step50000.pt.part00 … part08   # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
│   ├── m1_mix_final_step50000.pt.sha256        # SHA-256 of the reassembled checkpoint
│   └── README_REASSEMBLE.md                    # how to cat the parts back together + verify
├── norm_stats/
│   └── norm_stats.json                         # min-max state/action stats → [-1, 1]
├── configs/
│   ├── execution_module_train_m1_mix.yaml      # training config (reproducibility)
│   └── deploy_policy.yml                       # inference / deployment config
├── qwen_base_config/                            # Qwen3-VL-2B-Instruct config/processor ONLY
│   ├── config.json, generation_config.json
│   ├── tokenizer*.json, vocab.json, merges.txt
│   ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
│   └── README_Qwen3-VL-2B-Instruct.md          # upstream model card (Apache-2.0)
└── eval_results/
    ├── summary.md
    └── <task>/                                 # _result.txt, eval_log.txt, episode*.mp4 (×100)
```

### About the checkpoint

> **Reassemble first.** The 15.3 GB checkpoint is uploaded as 9 byte-split parts
> (`m1_mix_final_step50000.pt.part00…08`) because the upload path capped the size of a
> single file and throttled the number of bytes per time window. Concatenation reproduces
> the original **bit-for-bit**:
>
> ```bash
> cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
> sha256sum -c m1_mix_final_step50000.pt.sha256   # -> m1_mix_final_step50000.pt: OK
> ```
>
> See `checkpoint/README_REASSEMBLE.md` for details.
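
Where `cat` and `sha256sum` are not available (for example on Windows), the same two steps can be done from Python. This is a minimal sketch, not part of the bundle: `reassemble` and `expected_sha256` are hypothetical helper names, and it assumes the `.sha256` file uses the standard `sha256sum` line format (`<hex digest>  <filename>`).

```python
import glob
import hashlib


def reassemble(prefix: str, out_path: str) -> str:
    """Concatenate `<prefix>.partNN` files in order into out_path; return its SHA-256 hex digest."""
    # Two-digit suffixes sort lexicographically in the right order (part00 … part08).
    parts = sorted(glob.glob(glob.escape(prefix) + ".part[0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {prefix!r}")
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                # Stream in 16 MiB chunks so the whole checkpoint never sits in memory.
                for chunk in iter(lambda: f.read(16 * 1024 * 1024), b""):
                    out.write(chunk)
                    digest.update(chunk)
    return digest.hexdigest()


def expected_sha256(sha_file: str) -> str:
    """Read the digest from a `sha256sum`-style line: '<hex>  <filename>'."""
    with open(sha_file) as f:
        return f.read().split()[0].lower()
```

From the bundle root, `reassemble("checkpoint/m1_mix_final_step50000.pt", "checkpoint/m1_mix_final_step50000.pt")` returns a digest to compare against `expected_sha256("checkpoint/m1_mix_final_step50000.pt.sha256")`. Hashing while writing avoids a second full read of the 15.3 GB file.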

Once reassembled, `m1_mix_final_step50000.pt` is the **full training checkpoint** at step 50000:

| key                    | content                                                              |
|------------------------|----------------------------------------------------------------------|
| `model_state_dict`     | 910 tensors, ≈ 2.67 B params (`qwen_model` ≈ 2.44 B, `action_model` ≈ 160 M, `memory_bank` ≈ 39 M, `classifier` ≈ 32 M); bf16 + fp32 |
| `optimizer_state_dict` | AdamW moments — for resume/fine-tune only                            |
| `scheduler_state_dict` | cosine LR scheduler state                                            |
| `global_step`          | 50000                                                                |
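
To sanity-check the per-component sizes in the table, a short sketch can tally parameters by top-level key prefix. It assumes (this README does not confirm it) that `model_state_dict` keys start with the component names listed above, such as `qwen_model.` or `action_model.`; `params_by_component` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import Mapping


def params_by_component(state_dict: Mapping[str, object]) -> Counter:
    """Sum element counts per top-level key prefix (the text before the first '.')."""
    counts: Counter = Counter()
    for name, tensor in state_dict.items():
        component = name.split(".", 1)[0]
        # math.prod over the shape works for torch tensors and NumPy arrays alike;
        # a 0-d tensor has shape () and counts as 1 element.
        counts[component] += math.prod(tensor.shape)
    return counts
```

Called on `ck["model_state_dict"]` from a loaded checkpoint, it should land near the ≈ 2.44 B / 160 M / 39 M / 32 M split if the prefix assumption holds. State dicts can also contain buffers, so small deviations from pure parameter counts are expected.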

The `model_state_dict` is **self-contained**: it already includes the fine-tuned
Qwen3-VL-2B backbone weights. The bundled `qwen_base_config/` provides only the
*architecture/tokenizer/processor* config — the base model weights (`model.safetensors`,
~4 GB) are **not** redistributed here; download them from the official repo (see below).

**Inference-only slimming** (15.3 GB → ≈ 6 GB) if you don't need to resume training:

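```python
import torch

# Load the full training checkpoint on CPU. weights_only=False unpickles arbitrary
# Python objects, so only use it on files you trust.
ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)

# Keep only the model weights and the step counter; drop optimizer/scheduler state.
torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]},
           "m1_mix_inference.pt")
```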

The deploy loader reads `payload["model_state_dict"]` and calls
`load_state_dict(..., strict=False)`, so either the full or the slimmed file works
unchanged.

---

## Dependencies

1. **Code:** the RMBench / Mem-0 repository (this checkpoint targets its
   `policy/Mem-0` execution module and `script/eval_policy.py`). Follow the repo README
   for the RoboTwin 2.0 simulator environment setup.
2. **Base VLM:** `Qwen/Qwen3-VL-2B-Instruct` (Apache-2.0). Required at model
   instantiation for the architecture + image/text processor. Its weights are
   overwritten by this checkpoint at load time (`strict=False`), but the directory must
   exist and contain `model.safetensors`:

   ```bash
   huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
       --local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
   ```

   The small config/processor files in `qwen_base_config/` are exactly the ones used for
   training and evaluation; you may overlay them onto the downloaded directory if the
   upstream revision differs.
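
The overlay step can be scripted. This is a sketch under two assumptions: `qwen_base_config/` is flat (the tree above lists only files), and the download went to the `--local-dir` used above. `overlay_qwen_config` is a hypothetical helper; it also checks that `model.safetensors` is present, since the loader needs it even though its weights are overwritten.

```python
import shutil
from pathlib import Path


def overlay_qwen_config(bundle_cfg: str, qwen_dir: str) -> list:
    """Copy the bundled config/processor files over a downloaded Qwen3-VL-2B-Instruct dir."""
    src, dst = Path(bundle_cfg), Path(qwen_dir)
    # The base weights file must exist at instantiation, even though it is overwritten later.
    weights = dst / "model.safetensors"
    if not weights.is_file():
        raise FileNotFoundError(f"{weights} is missing; download the base model first")
    copied = []
    for f in sorted(src.iterdir()):
        # Skip the upstream model card; only config/tokenizer/processor files matter here.
        if f.is_file() and not f.name.startswith("README"):
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return copied
```

For example: `overlay_qwen_config("qwen_base_config", "policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct")`.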

---

## How to run evaluation

Point the deploy config at the checkpoint and the `m1_mix` stats, then run one task at a
time. This mirrors exactly how the numbers above were produced:

```bash
python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
    --task_name        swap_blocks \
    --execution_ckpt   /path/to/m1_mix_final_step50000.pt \
    --state_stats_path /path/to/norm_stats/norm_stats.json \
    --ckpt_setting     m1mix \
    --global_task      "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
    --action_horizon   30
```

- Replace `--task_name` and `--global_task` with each of the five tasks (strings in
  `task_instructions.json`). The checkpoint and `--state_stats_path` stay the same.
- `--ckpt_setting m1mix` only labels the output directory
  (`eval_result/<task>/Mem-0/demo_clean/m1mix/<timestamp>/`).
- `--vllm_url` is accepted but unused for M1 tasks (the global instruction is set
  directly; the planner client is constructed but never queried).
- Ensure `execution_module.qwen_vl.model_path` in `deploy_policy.yml` points to your
  downloaded Qwen3-VL-2B-Instruct directory.
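
To loop over all five tasks, the command above can be generated per task. A sketch under stated assumptions: it assumes `task_instructions.json` maps each task name to an object with a `global_task` string (a hypothetical schema; the file also stores scores, so adapt the lookup to its real layout) and that it runs from the repository root. `build_eval_cmd` and `build_all` are hypothetical helpers.

```python
import json


def build_eval_cmd(task: str, instruction: str, ckpt: str, stats: str) -> list:
    """Return the argv for one eval run, mirroring the swap_blocks command shown above."""
    return [
        "python", "script/eval_policy.py",
        "--config", "policy/Mem-0/deploy_policy.yml", "--overrides",
        "--task_name", task,
        "--execution_ckpt", ckpt,
        "--state_stats_path", stats,
        "--ckpt_setting", "m1mix",
        "--global_task", instruction,
        "--action_horizon", "30",
    ]


def build_all(instructions_json: str, ckpt: str, stats: str) -> list:
    """One argv per task; checkpoint and stats stay the same for every task."""
    with open(instructions_json) as f:
        table = json.load(f)  # assumed schema: {task: {"global_task": str, ...}}
    return [build_eval_cmd(task, entry["global_task"], ckpt, stats)
            for task, entry in table.items()]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Passing the instruction as a single argv element sidesteps shell quoting of the long `--global_task` strings.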

---

## Model architecture (from `configs/`)

- **VLM backbone** — Qwen3-VL-2B-Instruct, 224×224 head-camera image + language
  instruction, last-layer hidden states (hidden size 2048).
- **MemoryBank** — `window_size 30`, `initial_anchor_size 1`, `num_heads 8`,
  `memory_accumulation 8`, `dropout 0.1`; fuses an instant-memory token and an
  anchor-memory token, which are concatenated with the text feature into a 3-token
  summary of shape `(B, 3, 2048)`.
- **DiT-B action head** (`FlowmatchingActionHead`) — `num_layers 16`,
  `cross_attention_dim 2048`, `action_dim 16`, `state_dim 16`, `action_horizon 30`,
  `num_inference_timesteps 8`; flow-matching regression of a 30-step action chunk.
- **Subtask-end classifier** — MLP `hidden_sizes [6144, 2048, 512]`, `pos_weight 10.0`,
  `focal_gamma 1.0`, `threshold 0.5`. Drives stage transitions in Mn tasks; for M1 the
  episode is a single stage so it does not affect rollout.
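
`num_inference_timesteps 8` means the action head turns noise into a 30-step action chunk in eight integration steps. Below is a minimal sketch of Euler integration of a velocity field, assuming noise at t = 0 and data at t = 1 (the module's actual time convention and solver are not documented here). The `velocity` callable stands in for the DiT-B head conditioned on the 3-token summary; `euler_sample` and `toy_velocity` are hypothetical names.

```python
import numpy as np


def euler_sample(velocity, shape=(30, 16), steps=8, rng=None):
    """Integrate dx/dt = velocity(x, t) from t=0 (Gaussian noise) to t=1 with `steps` Euler steps."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays < 1, so the toy velocity below never divides by zero
        x = x + dt * velocity(x, t)
    return x


# Stand-in for the DiT head: the exact velocity toward a fixed target chunk along the
# straight path x_t = (1 - t) * x0 + t * x1, written in terms of the current x_t.
target = np.linspace(-1.0, 1.0, 30 * 16).reshape(30, 16)


def toy_velocity(x, t):
    return (target - x) / (1.0 - t)


chunk = euler_sample(toy_velocity)
print(chunk.shape)  # (30, 16)
```

With this toy field the last Euler step lands exactly on `target`. A trained head only approximates the velocity, which is why the number of steps trades accuracy for latency.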

## Training (from `configs/execution_module_train_m1_mix.yaml`)

- **Data:** `m1_mix` (the five M1 tasks merged into one LeRobot dataset with globally
  unique `episode_id`s). Features: head-camera image, state, action, subtask,
  subtask_end, episode_id.
- **Schedule:** `train_steps 50000`, `batch_size 56`, cosine scheduler,
  `warmup_ratio 0.05`, `grad_clip_norm 2.5`, `weight_decay 0.005`, `seed 42`.
- **Learning rates:** base `1e-5`, qwen_model `1e-5`, action_model `1e-4`,
  classifier `1e-4` (min LRs `1e-6 / 1e-6 / 5e-6 / 5e-6`).
- **Loss:** `lambda_action 1.0`, `lambda_classifier 0.2`.
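
The schedule and loss settings above can be sketched as follows. The schedule shape (linear warmup over `warmup_ratio × train_steps`, then cosine decay from each group's learning rate to its minimum) is an assumption about how the cosine scheduler and min LRs combine. `lr_at`, `total_loss`, and `GROUPS` are hypothetical names, and the `base` entry is assumed to cover parameters outside the three named groups.

```python
import math

# Per-group (peak, min) learning rates from the config.
GROUPS = {
    "base": (1e-5, 1e-6),  # assumed: parameters not in a named group
    "qwen_model": (1e-5, 1e-6),
    "action_model": (1e-4, 5e-6),
    "classifier": (1e-4, 5e-6),
}


def lr_at(step, base_lr, min_lr, total_steps=50_000, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to min_lr (assumed schedule shape)."""
    warmup = round(total_steps * warmup_ratio)  # 2500 steps for this config
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def total_loss(action_loss, classifier_loss, lambda_action=1.0, lambda_classifier=0.2):
    """Weighted sum of the flow-matching action loss and the subtask-end classifier loss."""
    return lambda_action * action_loss + lambda_classifier * classifier_loss
```

For example, `lr_at(2500, *GROUPS["action_model"])` is the action head's peak `1e-4`, and at step 50000 every group reaches its minimum.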

## Normalization

State and action are min-max normalized to `[-1, 1]` over the 14 arm dimensions of the
16-D state/action vectors, using `norm_stats/norm_stats.json` (`NORM_WAY = "minmax"` in
`deploy_policy.py`). Use the same
stats file at inference; predicted actions are denormalized with it before being sent to
the environment. Action chunks from overlapping predictions are averaged (mean smoothing)
before execution.
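
A minimal sketch of the normalization round trip and of mean smoothing over overlapping chunks follows. Its assumptions are not confirmed by this README: the arm dimensions are the first 14 entries of each vector, `lo`/`hi` are the per-dimension min/max arrays read from `norm_stats.json` (whose key names are not documented here), and smoothing averages every prediction made for the same absolute timestep. `normalize`, `denormalize`, and `MeanSmoother` are hypothetical names, not the `deploy_policy.py` API.

```python
import numpy as np

ARM_DIMS = 14  # assumed: the first 14 of the 16 state/action dims are the arm dims


def normalize(x, lo, hi, eps=1e-8):
    """Min-max map x[..., :14] from [lo, hi] to [-1, 1]; remaining dims pass through."""
    out = np.array(x, dtype=np.float64, copy=True)
    out[..., :ARM_DIMS] = 2.0 * (out[..., :ARM_DIMS] - lo) / np.maximum(hi - lo, eps) - 1.0
    return out


def denormalize(y, lo, hi, eps=1e-8):
    """Inverse of normalize: map [-1, 1] back to [lo, hi] on the arm dims."""
    out = np.array(y, dtype=np.float64, copy=True)
    out[..., :ARM_DIMS] = (out[..., :ARM_DIMS] + 1.0) / 2.0 * np.maximum(hi - lo, eps) + lo
    return out


class MeanSmoother:
    """Average every chunk prediction that covers the same absolute timestep."""

    def __init__(self):
        self.preds = {}  # timestep -> list of predicted actions

    def add_chunk(self, start_t, chunk):
        for i, action in enumerate(chunk):
            self.preds.setdefault(start_t + i, []).append(np.asarray(action))

    def action_at(self, t):
        # Pop so memory does not grow over a long episode.
        return np.mean(self.preds.pop(t), axis=0)
```

The `eps` floor keeps constant dimensions (max equal to min) from dividing by zero.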

## Limitations

- **swap_T (0.13)** and **observe_and_pickup (0.03)** are weak. The former needs precise
  alignment of both the T-block's position *and* its orientation; the latter needs the
  target to be re-identified across views after a visual occlusion and then picked up.
  The joint `m1_mix` model does not solve these reliably.
- All numbers are measured on RoboTwin 2.0 `demo_clean` with `unseen` instruction
  phrasings; results under other task configs or with domain randomization may differ.

## License & attribution

- Base VLM **Qwen3-VL-2B-Instruct** is © the Qwen team, licensed **Apache-2.0**
  (see `qwen_base_config/README_Qwen3-VL-2B-Instruct.md`). Because the checkpoint
  embeds fine-tuned Qwen weights, that license applies to the corresponding components.
- RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream
  licenses; refer to the source repository.