---
base_model:
  - allenai/Molmo-7B-D-0924
datasets:
  - allenai/PRISM
language:
  - en
license: mit
pipeline_tag: robotics
tags:
  - robotics
  - grasping
  - task-oriented-grasping
  - manipulation
library_name: transformers
---

# GraspMolmo

[Paper] [arXiv] [Project Website] [Data] [Code]

GraspMolmo is a generalizable open-vocabulary task-oriented grasping (TOG) model for robotic manipulation. Given an image and a task to complete (e.g. "Pour me some tea"), GraspMolmo will point to the most appropriate grasp location, which can then be matched to the closest stable grasp.

## Paper Abstract

We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from PRISM, a novel large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation, which, along with videos, are available at this https URL.

## Code Sample

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

img = Image.open("<path_to_image>")
task = "Pour coffee from the blue mug."

# trust_remote_code is required for Molmo-based models
processor = AutoProcessor.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)

prompt = f"Point to where I should grasp to accomplish the following task: {task}"
inputs = processor.process(images=img, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate, then decode only the newly generated tokens
output = model.generate_from_batch(inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer)
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```

Running the above code could result in the following output:

```
In order to accomplish the task "Pour coffee from the blue mug.", the optimal grasp is described as follows: "The grasp is on the middle handle of the blue mug, with fingers grasping the sides of the handle.".

<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>
```
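The `<point>` tag can be parsed into pixel coordinates. A minimal sketch, assuming (as in Molmo's pointing format) that `x` and `y` are percentages of the image width and height; the image size below is hypothetical:

```python
import re

# Example GraspMolmo output containing a <point> tag
text = '<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>'

match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', text)
x_pct, y_pct = float(match.group(1)), float(match.group(2))

# Convert percentage coordinates to pixels for a hypothetical 640x480 image
width, height = 640, 480
px, py = x_pct / 100 * width, y_pct / 100 * height  # approximately (183.0, 99.4)
```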

## Grasp Inference

To predict a grasp point and match it to one of the candidate grasps, refer to the `GraspMolmo` class. First, install `graspmolmo` with

```bash
pip install "git+https://github.com/abhaybd/GraspMolmo.git#egg=graspmolmo[infer]"
```

and then inference can be run as follows:

```python
import numpy as np

from graspmolmo.inference.grasp_predictor import GraspMolmo

task = "..."
rgb, depth = get_image()  # RGB-D capture from your camera
camera_intrinsics = np.array(...)

point_cloud = backproject(rgb, depth, camera_intrinsics)
# grasps are in the camera reference frame
grasps = predict_grasps(point_cloud)  # Using your favorite grasp predictor (e.g. M2T2)

gm = GraspMolmo()
idx = gm.pred_grasp(rgb, point_cloud, task, grasps)

print(f"Predicted grasp: {grasps[idx]}")
```
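The `get_image`, `backproject`, and `predict_grasps` helpers above are left to the caller. A minimal sketch of a `backproject` implementation, assuming a standard pinhole camera model with depth in meters (not necessarily the library's own implementation):

```python
import numpy as np

def backproject(rgb, depth, K):
    """Back-project an RGB-D frame into a colored 3D point cloud in the
    camera reference frame, assuming a pinhole camera model.
    rgb: (H, W, 3) image, depth: (H, W) in meters, K: (3, 3) intrinsics."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0  # drop pixels with missing depth
    return points[valid], colors[valid]
```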

## Citation

```bibtex
@misc{deshpande2025graspmolmo,
      title={GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation},
      author={Abhay Deshpande and Yuquan Deng and Arijit Ray and Jordi Salvador and Winson Han and Jiafei Duan and Kuo-Hao Zeng and Yuke Zhu and Ranjay Krishna and Rose Hendrix},
      year={2025},
      eprint={2505.13441},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.13441},
}
```