# SAM-Cosmos Gripper: Fine-tuned SAM3 for Gripper Segmentation
Fine-tuned version of SAM3 (Segment Anything Model 3 from Meta) specifically trained to segment robot grippers in RGB images from the DROID dataset.
The base checkpoint is the publicly available SAM3 model with a frozen vision backbone, fine-tuned exclusively on the DROID gripper annotations.
| Property | Value |
|---|---|
| Base model | SAM3 (frozen vision encoder) |
| Training dataset | DROID gripper annotations |
| Prompt | "gripper" |
| Epochs | 50 |
| Best loss | ~29.5 |
| Input resolution | 640 × 480 (other resolutions accepted) |
## Quickstart
```python
from huggingface_hub import hf_hub_download
import torch
from PIL import Image

# 1. Download the fine-tuned checkpoint
ckpt_path = hf_hub_download(
    repo_id="sazirarrwth99/sam-cosmos-gripper",
    filename="model.pt",
)

# 2. Build SAM3 (requires the sam3 package from the repo)
# pip install git+https://github.com/your-org/cosmos-predict2.5
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

REPO_ROOT = "path/to/cosmos-predict2.5"  # adjust
bpe_path = f"{REPO_ROOT}/sam3/sam3/assets/bpe_simple_vocab_16e6.txt.gz"

model = build_sam3_image_model(
    bpe_path=bpe_path,
    device="cuda",
    eval_mode=True,
    enable_segmentation=True,
    load_from_HF=False,
    checkpoint_path=None,
)
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"], strict=False)
model.eval()

processor = Sam3Processor(model, device="cuda")

# 3. Run inference
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image, state={})
state = processor.set_text_prompt(prompt="gripper", state=state)

masks = state["masks"]    # (N, H, W)
scores = state["scores"]  # (N,)

best = masks[scores.argmax()].squeeze().cpu().numpy().astype("uint8")
print("Mask shape:", best.shape, " non-zero pixels:", best.sum())
```
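For a quick sanity check of the predicted mask, you can blend it over the input frame. The sketch below is self-contained, so it substitutes a dummy image and mask for the `image` and `best` values produced by the quickstart above; swap those in when running end to end.

```python
import numpy as np
from PIL import Image

# Stand-ins for the quickstart's `image` and binary `best` mask
# (replace with the real outputs when running the full pipeline).
image = Image.new("RGB", (640, 480), (40, 40, 40))
best = np.zeros((480, 640), dtype="uint8")
best[200:280, 300:380] = 1  # dummy gripper region

# Tint masked pixels red, then blend with the original frame
overlay = np.array(image).copy()
overlay[best.astype(bool)] = (255, 0, 0)
blended = Image.blend(image, Image.fromarray(overlay), alpha=0.5)
blended.save("gripper_overlay.png")
print("Saved overlay with", int(best.sum()), "mask pixels")
```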
## Inference Docker (recommended)
The easiest way to run inference is via the provided Docker image:
```bash
docker run --gpus all --rm \
  -v /path/to/images:/images \
  -v /path/to/outputs:/outputs \
  cosmos-predict2.5-roboseg:latest \
  python /workspace/development/Training/robot_segmentation/run_inference.py \
    --model_path /model/model.pt \
    --input_dir /images \
    --output_dir /outputs \
    --prompt "gripper"
```
## Training Details
- Resumed from: `sam3_roboseg_frozen` (robot-arm fine-tuned checkpoint)
- Additional fine-tuning dataset: DROID gripper (~882 annotated frames)
- Learning rate: 1 × 10⁻⁴ (cosine schedule with 5-epoch linear warm-up)
- Batch size: 1 (gradient accumulation = 4)
- Optimizer: AdamW (weight_decay = 0.05)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
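The optimization setup above can be sketched in PyTorch as follows. This is a minimal illustration, not the actual training script: the tiny linear model and random data are placeholders, and the exact warm-up/decay shape used in training is an assumption.

```python
import math
import torch

# Hedged sketch: AdamW (weight_decay=0.05), lr 1e-4 with a 5-epoch linear
# warm-up then cosine decay over 50 epochs, gradient accumulation of 4.
model = torch.nn.Linear(8, 1)  # placeholder for the SAM3 trainable head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

EPOCHS, WARMUP, ACCUM = 50, 5, 4

def lr_lambda(epoch: int) -> float:
    # Linear warm-up for the first WARMUP epochs, then cosine decay to zero
    if epoch < WARMUP:
        return (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    for step in range(ACCUM):  # pretend mini-batches at batch size 1
        loss = model(torch.randn(1, 8)).pow(2).mean() / ACCUM
        loss.backward()
    optimizer.step()           # one update per ACCUM mini-batches
    optimizer.zero_grad()
    scheduler.step()

print(f"final lr: {optimizer.param_groups[0]['lr']:.2e}")
```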
## Citation
If you use this model, please cite the original SAM3 paper and this work.