MolmoAct2-LIBERO + Grid Sampler (random-init sampler)

This checkpoint is allenai/MolmoAct2-LIBERO-LeRobot with Grid Sampler (GridS, ICML 2026) visual token pruning integrated into the vision backbone.

⚠️ The ActiveTokenSampler weights in this checkpoint are randomly initialized. This artifact is the starting point for fine-tuning; do not expect baseline task performance with pruning enabled. For the fine-tuned version see xpuenabler/molmoact2-libero_grid_sampler_fine_tuned.

What changed vs. the base checkpoint

  • Each camera image's pooled visual feature grid (14×14 = 196 tokens for a single-crop 256×256 input) is pruned to K = 16 tokens by an ActiveTokenSampler. A global-pooled feature predicts K normalized 2D coordinates, F.grid_sample bilinearly reads features at those locations, and a coordinate MLP injects geometry into the sampled tokens.
  • The processor emits exactly K image placeholder tokens per image, so the LIBERO 2-camera prompt shrinks from 483 to 123 tokens.
  • New config flags: use_grid_token_sampler=true, grid_token_sampler_num_tokens=16 (stored in config.json and in the saved processor pipeline).
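The sampling steps listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the checkpoint's ActiveTokenSampler: the class name GridTokenSamplerSketch, the helper pruned_prompt_length, the hidden width, the tanh squashing of coordinates, and the additive way the coordinate MLP output is combined with the sampled features are all assumptions made for this example.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GridTokenSamplerSketch(nn.Module):
    """Hypothetical stand-in for ActiveTokenSampler (not the released code)."""

    def __init__(self, dim: int, num_tokens: int = 16, hidden: int = 64):
        super().__init__()
        self.num_tokens = num_tokens
        # Global-pooled feature -> K (x, y) sampling locations.
        self.coord_head = nn.Linear(dim, num_tokens * 2)
        # Coordinate MLP: embeds each (x, y) so tokens carry their position.
        self.coord_mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, C) pooled visual tokens laid out on a square grid.
        b, n, c = feats.shape
        side = math.isqrt(n)
        assert side * side == n, "expected a square token grid"
        grid_feats = feats.transpose(1, 2).reshape(b, c, side, side)

        # Predict K normalized coordinates in [-1, 1] (assumed tanh squashing).
        pooled = feats.mean(dim=1)  # (B, C)
        coords = torch.tanh(self.coord_head(pooled)).view(b, self.num_tokens, 1, 2)

        # Bilinear read at the predicted locations: output is (B, C, K, 1).
        sampled = F.grid_sample(
            grid_feats, coords, mode="bilinear", align_corners=False
        )
        sampled = sampled.squeeze(-1).transpose(1, 2)  # (B, K, C)

        # Inject geometry by adding the coordinate embedding (assumed additive).
        return sampled + self.coord_mlp(coords.squeeze(2))


def pruned_prompt_length(base_len: int, per_image: int, k: int, n_images: int) -> int:
    """Prompt length after each image's placeholders shrink from per_image to k."""
    return base_len - n_images * (per_image - k)


sampler = GridTokenSamplerSketch(dim=32, num_tokens=16)
tokens = sampler(torch.randn(2, 196, 32))        # 14x14 grid per image
print(tokens.shape)                              # torch.Size([2, 16, 32])
print(pruned_prompt_length(483, 196, 16, 2))     # 123
```

Because every step (linear heads, tanh, bilinear grid_sample) is differentiable, gradients reach the coordinate predictor, which is why a randomly initialized sampler can be trained during fine-tuning. The helper also reproduces the prompt arithmetic from the list: 483 - 2 × (196 - 16) = 123.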

Usage

Requires the feat/grid-sampler-molmoact2 branch of nota-github/xpu-lerobot:

import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.molmoact2.modeling_molmoact2 import MolmoAct2Policy

# Load the saved policy config and point it at this checkpoint.
path = "xpuenabler/molmoact2-libero_grid_sampler_random_init"
cfg = PreTrainedConfig.from_pretrained(path)
cfg.pretrained_path = path
cfg.device = "cuda"
cfg.inference_action_mode = "continuous"

# Build the policy and the matching pre/post-processing pipelines.
policy = MolmoAct2Policy.from_pretrained(path, config=cfg)
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=cfg, pretrained_path=path,
    preprocessor_overrides={"device_processor": {"device": "cuda"}},
)

# Dummy observation: two 256x256 RGB camera frames, an 8-D state, and a task string.
batch = preprocessor({
    "observation.images.image": torch.rand(3, 256, 256),
    "observation.images.wrist_image": torch.rand(3, 256, 256),
    "observation.state": torch.zeros(8),
    "task": "pick up the black bowl",
})
# Predict one action and map it back through the postprocessor.
action = postprocessor(policy.select_action(batch))

Citation

@inproceedings{feng2026gridsampler,
  title     = {See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model},
  author    = {Feng, Yixu and Zhao, Zinan and Ma, Yanxiang and Xia, Chenghao and Du, Chengbin and Wang, Yunke and Xu, Chang},
  booktitle = {Forty-Third International Conference on Machine Learning (ICML)},
  year      = {2026}
}