---
license: mit
datasets:
- allenai/PRISM
language:
- en
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: robotics
tags:
- robotics
- grasping
- task-oriented-grasping
- manipulation
---

# GraspMolmo

[[Paper]](https://arxiv.org/pdf/2505.13441) [[arXiv]](https://arxiv.org/abs/2505.13441) [[Project Website]](https://abhaybd.github.io/GraspMolmo/) [[Data]](https://huggingface.co/datasets/allenai/PRISM)

GraspMolmo is a generalizable open-vocabulary task-oriented grasping (TOG) model for robotic manipulation. Given an image and a task to complete (e.g. "Pour me some tea"), GraspMolmo will point to the most appropriate grasp location, which can then be matched to the closest stable grasp.

## Code Sample

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

img = Image.open("<path_to_image>")
task = "Pour coffee from the blue mug."

# Load the processor and model (trust_remote_code is required for Molmo-based models)
processor = AutoProcessor.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)

prompt = f"Point to where I should grasp to accomplish the following task: {task}"
inputs = processor.process(images=img, text=prompt, return_tensors="pt")
# Add a batch dimension and move inputs to the model's device
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer)
# Decode only the newly generated tokens
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```

Running the above code could result in the following output:
```
In order to accomplish the task "Pour coffee from the blue mug.", the optimal grasp is described as follows: "The grasp is on the middle handle of the blue mug, with fingers grasping the sides of the handle.".

<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>
```
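
The grasp point is returned as a Molmo-style `<point>` tag, where `x` and `y` are expressed as percentages of the image width and height. Below is a minimal sketch of how the point could be parsed into pixel coordinates; the regex and the `parse_point` helper are illustrative, not part of the released API:

```python
import re

from PIL import Image


def parse_point(generated_text: str, image: Image.Image) -> tuple[float, float]:
    """Extract the predicted <point> and convert it to pixel coordinates.

    Assumes Molmo's pointing convention, i.e. x/y are percentages (0-100)
    of the image width/height.
    """
    match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', generated_text)
    if match is None:
        raise ValueError("No <point> tag found in the model output")
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    return x_pct / 100 * image.width, y_pct / 100 * image.height


# e.g. parse_point(generated_text, img) -> (x_px, y_px) in the original image
```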

## Grasp Inference

To predict a grasp point *and* match it to one of the candidate grasps, refer to the [GraspMolmo](https://github.com/abhaybd/GraspMolmo/blob/main/graspmolmo/inference/grasp_predictor.py) class.
First, install `graspmolmo` with

```bash
pip install "git+https://github.com/abhaybd/GraspMolmo.git#egg=graspmolmo[infer]"
```

and then inference can be run as follows:

```python
import numpy as np

from graspmolmo.inference.grasp_predictor import GraspMolmo

task = "..."
# get_image(), backproject() and predict_grasps() are user-supplied helpers
# (a simple backprojection sketch is given after this block)
rgb, depth = get_image()
camera_intrinsics = np.array(...)

point_cloud = backproject(rgb, depth, camera_intrinsics)
# Grasps are in the camera reference frame
grasps = predict_grasps(point_cloud)  # Using your favorite grasp predictor (e.g. M2T2)

gm = GraspMolmo()
idx = gm.pred_grasp(rgb, point_cloud, task, grasps)

print(f"Predicted grasp: {grasps[idx]}")
```
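
`get_image`, `backproject`, and `predict_grasps` are placeholders for your own camera and grasp-prediction utilities. As a reference, here is a minimal back-projection sketch assuming a standard pinhole camera model, with `depth` in meters and a 3x3 intrinsics matrix; the exact point cloud format expected by your grasp predictor (and by `pred_grasp`) may differ, so treat this as illustrative:

```python
import numpy as np


def backproject(rgb: np.ndarray, depth: np.ndarray, camera_intrinsics: np.ndarray) -> np.ndarray:
    """Back-project a depth map into an (H*W, 3) XYZ point cloud in the camera frame.

    Assumes a pinhole model: depth is (H, W) in meters and camera_intrinsics is
    [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]. rgb is accepted to mirror the call above;
    concatenate rgb.reshape(-1, 3) if your grasp predictor expects colored points.
    """
    fx, fy = camera_intrinsics[0, 0], camera_intrinsics[1, 1]
    cx, cy = camera_intrinsics[0, 2], camera_intrinsics[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel column/row indices
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```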