
SAM-Cosmos Gripper: Fine-tuned SAM3 for Gripper Segmentation

A fine-tuned version of SAM3 (Segment Anything Model 3 from Meta), trained specifically to segment robot grippers in RGB images from the DROID dataset.

The base checkpoint is the publicly available SAM3 model with its vision backbone frozen; it was further fine-tuned exclusively on the DROID gripper annotations.

| Property | Value |
|---|---|
| Base model | SAM3 (frozen vision encoder) |
| Training dataset | DROID gripper annotations |
| Prompt | "gripper" |
| Epochs | 50 |
| Best loss | ~29.5 |
| Input resolution | 640 × 480 (any resolution accepted) |

Quickstart

from huggingface_hub import hf_hub_download
import torch
from PIL import Image

# 1. Download model
ckpt_path = hf_hub_download(
    repo_id="sazirarrwth99/sam-cosmos-gripper",
    filename="model.pt",
)

# 2. Build SAM3 (requires the sam3 package from the repo)
#    pip install git+https://github.com/your-org/cosmos-predict2.5
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

REPO_ROOT = "path/to/cosmos-predict2.5"  # adjust
bpe_path = f"{REPO_ROOT}/sam3/sam3/assets/bpe_simple_vocab_16e6.txt.gz"

model = build_sam3_image_model(
    bpe_path=bpe_path,
    device="cuda",
    eval_mode=True,
    enable_segmentation=True,
    load_from_HF=False,
    checkpoint_path=None,
)

ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"], strict=False)
model.eval()
processor = Sam3Processor(model, device="cuda")

# 3. Run inference
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image, state={})
state = processor.set_text_prompt(prompt="gripper", state=state)

masks  = state["masks"]   # (N, H, W)
scores = state["scores"]  # (N,)
best   = masks[scores.argmax()].squeeze().cpu().numpy().astype("uint8")
print("Mask shape:", best.shape, "  non-zero pixels:", best.sum())
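For a quick visual check, the best mask can be blended onto the input image. This is a minimal sketch using NumPy and Pillow; `overlay_mask`, the color, and the alpha value are illustrative choices, not part of this repo:

```python
import numpy as np
from PIL import Image

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a binary (H, W) mask onto an RGB image as a translucent color overlay."""
    img = np.asarray(image.convert("RGB"), dtype=np.float32)
    tint = np.zeros_like(img)
    tint[..., :] = color                       # solid overlay color
    m = mask.astype(bool)[..., None]           # (H, W, 1), broadcast over channels
    blended = np.where(m, (1 - alpha) * img + alpha * tint, img)
    return Image.fromarray(blended.astype(np.uint8))

# overlay_mask(image, best).save("gripper_overlay.png")
```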

Inference Docker (recommended)

The easiest way to run inference is via the provided Docker image:

docker run --gpus all --rm \
  -v /path/to/images:/images \
  -v /path/to/outputs:/outputs \
  cosmos-predict2.5-roboseg:latest \
  python /workspace/development/Training/robot_segmentation/run_inference.py \
    --model_path /model/model.pt \
    --input_dir  /images \
    --output_dir /outputs \
    --prompt     "gripper"
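Once the container has written masks to the output directory, a quick sanity check is to measure how much of each frame the predicted mask covers. This sketch assumes the script writes one 8-bit PNG mask per input image, which may not match the actual output layout:

```python
from pathlib import Path

import numpy as np
from PIL import Image

def mask_coverage(output_dir):
    """Return the fraction of non-zero pixels for every PNG mask in output_dir."""
    stats = {}
    for path in sorted(Path(output_dir).glob("*.png")):
        mask = np.asarray(Image.open(path).convert("L"))
        stats[path.name] = float((mask > 0).mean())
    return stats

# e.g. {"frame_0001.png": 0.031, ...} -- grippers typically occupy a small area
```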

Training Details

  • Resume from: sam3_roboseg_frozen (robot-arm fine-tuned checkpoint)
  • Additional fine-tuning dataset: DROID gripper (~882 annotated frames)
  • Learning rate: 1 × 10⁻⁴ (cosine schedule with a 5-epoch linear warm-up)
  • Batch size: 1 (gradient accumulation = 4, i.e. effective batch size 4)
  • Optimizer: AdamW (weight_decay = 0.05)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
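The schedule above (linear warm-up for the first 5 epochs, then cosine decay over the remaining 45) can be sketched as a plain function of the epoch. The decay floor of 0 and the per-epoch granularity are assumptions, not details taken from the training run:

```python
import math

BASE_LR = 1e-4
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 50

def lr_at(epoch):
    """Linear warm-up to BASE_LR, then cosine decay to 0 at TOTAL_EPOCHS (sketch)."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))
```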

Citation

If you use this model, please cite the original SAM3 paper and this work.
