	# Mem-0 Execution Module — `m1_mix` (RMBench / RoboTwin 2.0)

	A single Mem-0 low-level execution-module checkpoint trained jointly on all five
	RMBench M1 tasks (the `m1_mix` dataset) and evaluated on each task in turn. M1 tasks
	require only the execution module — **no high-level planner / vLLM is needed for
	inference.**

	- Backbone: Qwen3-VL-2B-Instruct (vision-language) — weights fine-tuned and bundled in the checkpoint
	- Action head: DiT-B flow-matching policy (action chunk of 30, 16-D action)
	- Memory: MemoryBank (instant + anchor memory fusion across the episode)
	- Aux head: subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
	- Total parameters: ≈ 2.67 B

	---

	## Results

	`task_config = demo_clean`, `instruction_type = unseen`, 100 episodes per task, `action_horizon = 30`.
	The same checkpoint and same `m1_mix` normalization stats are used for every task.

	| Task | Success Rate | Reward |
	|---------------------|:------------:|:------:|
	| put_back_block | 1.00 | 1.00 |
	| rearrange_blocks | 0.86 | 0.86 |
	| swap_blocks | 0.81 | 0.81 |
	| swap_T | 0.13 | 0.13 |
	| observe_and_pickup | 0.03 | 0.00 |
	| Average | 0.566 | n/a |

	Per-episode logs and rollout videos for all five tasks are under `eval_results/`.
	See `eval_results/summary.md` for details and `task_instructions.json` for the exact
	per-task language instruction used.
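
The average in the table is the unweighted mean of the five per-task success rates. A minimal check, with the values copied from the table, should reproduce 0.566:

```python
# Per-task success rates copied from the results table.
success_rates = {
    "put_back_block": 1.00,
    "rearrange_blocks": 0.86,
    "swap_blocks": 0.81,
    "swap_T": 0.13,
    "observe_and_pickup": 0.03,
}

# Unweighted mean: every task was run for the same number of episodes (100),
# so no per-task weighting is needed.
average = sum(success_rates.values()) / len(success_rates)
print(f"average success rate: {average:.3f}")
```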

	---

	## Contents of this bundle

	```
	m1_mix_submit/
	├── README.md # this file
	├── task_instructions.json # verbatim --global_task per task + scores
	├── checkpoint/
	│ ├── m1_mix_final_step50000.pt.part00 … part08 # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
	│ ├── m1_mix_final_step50000.pt.sha256 # SHA-256 of the reassembled checkpoint
	│ └── README_REASSEMBLE.md # how to cat the parts back together + verify
	├── norm_stats/
	│ └── norm_stats.json # min-max state/action stats → [-1, 1]
	├── configs/
	│ ├── execution_module_train_m1_mix.yaml # training config (reproducibility)
	│ └── deploy_policy.yml # inference / deployment config
	├── qwen_base_config/ # Qwen3-VL-2B-Instruct config/processor ONLY
	│ ├── config.json, generation_config.json
	│ ├── tokenizer*.json, vocab.json, merges.txt
	│ ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
	│ └── README_Qwen3-VL-2B-Instruct.md # upstream model card (Apache-2.0)
	└── eval_results/
	  ├── summary.md
	  └── <task>/ # _result.txt, eval_log.txt, episode*.mp4 (×100)
	```

	### About the checkpoint

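	> Reassemble first. The 15.3 GB checkpoint is uploaded as 9 byte-split parts
	> (`m1_mix_final_step50000.pt.part00…08`) because the upload path capped the size of a
	> single file and throttled the bytes accepted per upload window. Run the commands
	> below from inside `checkpoint/`. The `part??` glob expands in sorted order, so the
	> parts are concatenated from `part00` to `part08` and the result matches the original
	> bit for bit: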
	>
	> ```bash
	> cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
	> sha256sum -c m1_mix_final_step50000.pt.sha256 # -> m1_mix_final_step50000.pt: OK
	> ```
	>
	> See `checkpoint/README_REASSEMBLE.md` for details.
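
On systems without `cat` or `sha256sum`, the same reassembly can be sketched in Python. This is a minimal sketch, not the procedure from `README_REASSEMBLE.md`: the helper names `reassemble` and `expected_digest` are hypothetical, and the `.sha256` file is assumed to use the usual `sha256sum` layout (`<digest>  <filename>`).

```python
import hashlib
from pathlib import Path


def reassemble(parts_dir: str, stem: str, out_path: str, chunk_size: int = 1 << 20) -> str:
    """Concatenate `<stem>.partNN` files in sorted order; return the SHA-256 hex digest."""
    parts = sorted(Path(parts_dir).glob(f"{stem}.part[0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no files matching {stem}.partNN in {parts_dir}")
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so the full checkpoint never has to fit in memory.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    digest.update(chunk)
    return digest.hexdigest()


def expected_digest(sha256_file: str) -> str:
    """Read the hex digest from a sha256sum-style file (assumed layout, see above)."""
    return Path(sha256_file).read_text().split()[0].lower()


# Example usage (paths follow the bundle layout):
# got = reassemble("checkpoint", "m1_mix_final_step50000.pt",
#                  "checkpoint/m1_mix_final_step50000.pt")
# assert got == expected_digest("checkpoint/m1_mix_final_step50000.pt.sha256")
```

The digest is computed while the parts are copied, so the reassembled file is read only once.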

	Once reassembled, `m1_mix_final_step50000.pt` is the full training checkpoint at step 50000:

	| key | content |
	|------------------------|----------------------------------------------------------------------|
	| `model_state_dict` | 910 tensors, ≈ 2.67 B params (`qwen_model` ≈ 2.44 B, `action_model` ≈ 160 M, `memory_bank` ≈ 39 M, `classifier` ≈ 32 M); bf16 + fp32 |
	| `optimizer_state_dict` | AdamW moments; for resume/fine-tune only |
	| `scheduler_state_dict` | cosine LR scheduler state |
	| `global_step` | 50000 |

	The `model_state_dict` is self-contained: it already includes the fine-tuned
	Qwen3-VL-2B backbone weights. The bundled `qwen_base_config/` provides only the
	architecture, tokenizer, and processor configs. The base model weights
	(`model.safetensors`, ~4 GB) are not redistributed here; download them from the
	official repo (see Dependencies).
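
To check the per-component sizes listed in the table, the tensors can be grouped by their top-level key prefix. This is a minimal sketch: `count_params_by_prefix` is a hypothetical helper, and it assumes the state-dict keys start with the component names (`qwen_model.`, `action_model.`, and so on). A state dict also contains buffers, so the totals may come out slightly above the parameter counts.

```python
from collections import defaultdict


def count_params_by_prefix(state_dict):
    """Sum element counts per top-level key prefix (the text before the first '.')."""
    totals = defaultdict(int)
    for name, tensor in state_dict.items():
        prefix = name.split(".", 1)[0]
        # Works for torch.Tensor or any object that exposes numel().
        totals[prefix] += tensor.numel()
    return dict(totals)


# Example usage with the reassembled checkpoint (requires torch):
# import torch
# ck = torch.load("checkpoint/m1_mix_final_step50000.pt",
#                 map_location="cpu", weights_only=False)
# for prefix, n in count_params_by_prefix(ck["model_state_dict"]).items():
#     print(f"{prefix}: {n / 1e6:.1f} M")
```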

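	If you do not need to resume training, you can slim the checkpoint for inference
	(15.3 GB → ≈ 6 GB) by dropping the optimizer and scheduler state:

	```python
	import torch

	# Load on CPU so the conversion needs no GPU. weights_only=False unpickles
	# arbitrary objects, so only load checkpoints you trust.
	ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
	# Keep only the keys that inference needs.
	torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]},
	           "m1_mix_inference.pt")
	```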

	The deploy loader reads `payload["model_state_dict"]` and calls
	`load_state_dict(..., strict=False)`, so either the full or the slimmed file works
	unchanged.

	---

	## Dependencies

	1. Code: the RMBench / Mem-0 repository (this checkpoint targets its
	`policy/Mem-0` execution module and `script/eval_policy.py`). Follow the repo README
	for the RoboTwin 2.0 simulator environment setup.
	2. Base VLM: `Qwen/Qwen3-VL-2B-Instruct` (Apache-2.0). Required at model
	instantiation for the architecture + image/text processor. Its weights are
	overwritten by this checkpoint at load time (`strict=False`), but the directory must
	exist and contain `model.safetensors`:

	```bash
	huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
	--local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
	```

	The small config/processor files in `qwen_base_config/` are exactly the ones used for
	training and evaluation; you may overlay them onto the downloaded directory if the
	upstream revision differs.

	---

	## How to run evaluation

	Point the deploy config at the checkpoint and the `m1_mix` stats, then run one task at a
	time. This mirrors exactly how the numbers above were produced:

	```bash
	python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
	--task_name swap_blocks \
	--execution_ckpt /path/to/m1_mix_final_step50000.pt \
	--state_stats_path /path/to/norm_stats/norm_stats.json \
	--ckpt_setting m1mix \
	--global_task "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
	--action_horizon 30
	```

	- Replace `--task_name` and `--global_task` with each of the five tasks (strings in
	`task_instructions.json`). The checkpoint and `--state_stats_path` stay the same.
	- `--ckpt_setting m1mix` only labels the output directory
	(`eval_result/<task>/Mem-0/demo_clean/m1mix/<timestamp>/`).
	- `--vllm_url` is accepted but unused for M1 tasks (the global instruction is set
	directly; the planner client is constructed but never queried).
	- Ensure `execution_module.qwen_vl.model_path` in `deploy_policy.yml` points to your
	downloaded Qwen3-VL-2B-Instruct directory.
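
Running all five tasks means repeating the command with a different `--task_name` and `--global_task` each time. The sketch below builds one argument list per task using only the flags shown above. `build_eval_commands` is a hypothetical helper, and it takes a plain `{task_name: instruction}` mapping because the exact layout of `task_instructions.json` is not described here; adapt the loading step to the real file.

```python
import shlex


def build_eval_commands(instructions, ckpt, stats, ckpt_setting="m1mix", action_horizon=30):
    """Return one argv list per task, mirroring the evaluation command above."""
    commands = []
    for task_name, global_task in instructions.items():
        commands.append([
            "python", "script/eval_policy.py",
            "--config", "policy/Mem-0/deploy_policy.yml", "--overrides",
            "--task_name", task_name,
            "--execution_ckpt", ckpt,
            "--state_stats_path", stats,
            "--ckpt_setting", ckpt_setting,
            "--global_task", global_task,
            "--action_horizon", str(action_horizon),
        ])
    return commands


# Placeholder instruction text: copy the real strings from task_instructions.json.
example = {"swap_blocks": "<instruction for swap_blocks>"}
for argv in build_eval_commands(example,
                                "/path/to/m1_mix_final_step50000.pt",
                                "/path/to/norm_stats/norm_stats.json"):
    print(shlex.join(argv))  # shell-quoted command line, safe to copy and paste
```

Passing each list to `subprocess.run(argv, check=True)` from the repository root would run the tasks one after another; passing the instruction as a single list element avoids shell-quoting problems with the long instruction strings.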

	---

	## Model architecture (from `configs/`)

	- VLM backbone: Qwen3-VL-2B-Instruct, fed a 224×224 head-camera image plus the
	language instruction; its last-layer hidden states (hidden size 2048) are the features.
	- MemoryBank: `window_size 30`, `initial_anchor_size 1`, `num_heads 8`,
	`memory_accumulation 8`, `dropout 0.1`; fuses memory into an instant-memory token and
	an anchor-memory token, which are concatenated with the text feature into a 3-token
	summary `(B, 3, 2048)`.
	- DiT-B action head (`FlowmatchingActionHead`): `num_layers 16`,
	`cross_attention_dim 2048`, `action_dim 16`, `state_dim 16`, `action_horizon 30`,
	`num_inference_timesteps 8`; flow-matching regression of a 30-step action chunk.
	- Subtask-end classifier: MLP `hidden_sizes [6144, 2048, 512]`, `pos_weight 10.0`,
	`focal_gamma 1.0`, `threshold 0.5`. It drives stage transitions in Mn tasks; an M1
	episode is a single stage, so the classifier does not affect the rollout.
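
`num_inference_timesteps 8` is the number of integration steps used to turn noise into an action chunk. The sketch below shows generic flow-matching sampling with a fixed-step Euler integrator. That Mem-0 uses exactly this integrator and time convention (noise at t = 0, action at t = 1) is an assumption, and `velocity_fn` is a hypothetical stand-in for the DiT-B network conditioned on the 3-token summary.

```python
import random

ACTION_HORIZON = 30  # steps per action chunk (from the config)
ACTION_DIM = 16      # action dimensionality (from the config)
NUM_STEPS = 8        # num_inference_timesteps (from the config)


def euler_sample(velocity_fn, x0, num_steps=NUM_STEPS):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: for a straight-line path from noise to target, the ideal velocity is
# the constant (target - noise), so Euler integration lands on the target.
rng = random.Random(0)
n = ACTION_HORIZON * ACTION_DIM  # flattened chunk, 480 values
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
target = [rng.uniform(-1.0, 1.0) for _ in range(n)]
constant_v = [t_i - n_i for t_i, n_i in zip(target, noise)]

chunk = euler_sample(lambda x, t: constant_v, noise)
max_err = max(abs(c - t_i) for c, t_i in zip(chunk, target))
print(f"max abs error after {NUM_STEPS} Euler steps: {max_err:.2e}")
```

With a learned velocity field the path is not exactly straight, so the number of steps trades inference latency against integration error.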

	## Training (from `configs/execution_module_train_m1_mix.yaml`)

	- Data: `m1_mix` (the five M1 tasks merged into one LeRobot dataset with globally
	unique `episode_id`s). Features: head-camera image, state, action, subtask,
	subtask_end, episode_id.
	- Schedule: `train_steps 50000`, `batch_size 56`, cosine scheduler,
	`warmup_ratio 0.05`, `grad_clip_norm 2.5`, `weight_decay 0.005`, `seed 42`.
	- Learning rates: base `1e-5`, qwen_model `1e-5`, action_model `1e-4`,
	classifier `1e-4` (min LRs, listed in the same order: `1e-6 / 1e-6 / 5e-6 / 5e-6`).
	- Loss: `lambda_action 1.0`, `lambda_classifier 0.2`.
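
The schedule values above can be turned into a per-step learning rate. This is a minimal sketch under stated assumptions: linear warmup over the first `warmup_ratio` of training followed by cosine decay to the minimum LR is a common shape, but the exact warmup form used by the training code is not given here, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, base_lr, min_lr, train_steps=50_000, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to min_lr (assumed schedule shape)."""
    warmup_steps = round(train_steps * warmup_ratio)  # 2500 steps for the values above
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, train_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Two of the parameter groups, pairing each base LR with the min LR in the same position.
for name, base, floor in [("qwen_model", 1e-5, 1e-6), ("action_model", 1e-4, 5e-6)]:
    for step in (0, 2_500, 25_000, 50_000):
        print(f"{name:12s} step {step:6d}: lr = {lr_at(step, base, floor):.2e}")
```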

	## Normalization

	State and action are min-max normalized to `[-1, 1]` over the 14 arm dimensions using
	`norm_stats/norm_stats.json` (`NORM_WAY = "minmax"` in `deploy_policy.py`). Use the same
	stats file at inference; predicted actions are denormalized with it before being sent to
	the environment. Action chunks from overlapping predictions are averaged (mean smoothing)
	before execution.
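
The normalization and smoothing steps can be sketched as follows. This is an illustration, not the code in `deploy_policy.py`: the function names are hypothetical, the `eps` guard against zero-range dimensions is an addition, the key layout of `norm_stats.json` is not assumed (pass the per-dimension minima and maxima directly), and the way overlapping chunks are aligned is an assumption.

```python
def minmax_normalize(x, lo, hi, eps=1e-8):
    """Map each x[i] from [lo[i], hi[i]] to [-1, 1]."""
    return [2.0 * (xi - l) / max(h - l, eps) - 1.0 for xi, l, h in zip(x, lo, hi)]


def minmax_denormalize(y, lo, hi, eps=1e-8):
    """Inverse of minmax_normalize: map each y[i] from [-1, 1] back to [lo[i], hi[i]]."""
    return [(yi + 1.0) / 2.0 * max(h - l, eps) + l for yi, l, h in zip(y, lo, hi)]


def average_overlapping(chunks):
    """Average the actions that several chunks predict for the same timestep.

    `chunks` maps the timestep at which a chunk starts to its list of per-step actions.
    """
    sums, counts = {}, {}
    for start, chunk in chunks.items():
        for offset, action in enumerate(chunk):
            t = start + offset
            if t not in sums:
                sums[t] = [0.0] * len(action)
                counts[t] = 0
            sums[t] = [s + a for s, a in zip(sums[t], action)]
            counts[t] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sorted(sums)}


# Round trip on a toy 3-D vector with range [0, 10] in every dimension.
lo, hi = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
normalized = minmax_normalize([0.0, 5.0, 10.0], lo, hi)
restored = minmax_denormalize(normalized, lo, hi)
print(normalized, restored)
```

Denormalizing with a different stats file than the one used in training would rescale every action, which is why the same `norm_stats.json` has to be used for all five tasks.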

	## Limitations

	- swap_T (0.13) and observe_and_pickup (0.03) are weak: the former needs precise
	T-block position and orientation alignment; the latter needs cross-view target
	re-identification after a visual occlusion followed by a pickup. The joint `m1_mix`
	model does not solve these reliably.
	- Numbers are on RoboTwin 2.0 `demo_clean` with `unseen` instruction phrasings; other
	task configs / domain randomization will differ.

	## License & attribution

	- Base VLM Qwen3-VL-2B-Instruct is © the Qwen team, licensed Apache-2.0
	(see `qwen_base_config/README_Qwen3-VL-2B-Instruct.md`). Because the checkpoint
	embeds fine-tuned Qwen weights, that license applies to the corresponding components.
	- RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream
	licenses; refer to the source repository.