How to use KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M with OpenCLIP:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M')
tokenizer = open_clip.get_tokenizer('hf-hub:KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M')

How to use KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M with sam2:
# Use SAM2 with images
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor
predictor = SAM2ImagePredictor.from_pretrained("KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)

# Use SAM2 with videos
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor
predictor = SAM2VideoPredictor.from_pretrained("KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)
    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points(state, <your_prompts>)
    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...

A CLIP model enhanced with region-phrase alignment via softplus scoring, trained on CC12M with SAM2 regions and parse-tree phrases.
| Setting | Value |
|---|---|
| Architecture | ViT-B-16 |
| Training data | CC12M (10.9M samples) |
| Region model | SAM2 (Hiera-Small, points_per_side=16) |
| Epochs | 32 |
| Batch size | 512 x 8 GPUs |
| Vision pooling | Average |
| Text pooling | Average |
| SAM loss ratio | 0.1 |
| SAM regions topk | 10 |
| Softplus tau | 0.001 |
| Softplus alpha | 0.75 |
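For a rough sense of scale, the training settings in the table can be turned into a step count. This is only a back-of-the-envelope sketch: it assumes "512 x 8 GPUs" means 512 samples per GPU, that the incomplete last batch of each epoch is dropped, and that 10.9M is rounded. The function name `training_schedule` is hypothetical and not part of PowerCLIP or OpenCLIP.

```python
# Values copied from the table; reading "512 x 8" as per-GPU batch x GPUs is an assumption.
PER_GPU_BATCH = 512
NUM_GPUS = 8
EPOCHS = 32
NUM_SAMPLES = 10_900_000  # "10.9M samples" (rounded figure)


def training_schedule(per_gpu_batch, num_gpus, epochs, num_samples):
    """Estimate global batch size and optimizer steps for a data-parallel run."""
    global_batch = per_gpu_batch * num_gpus
    # Assumes the incomplete final batch of each epoch is dropped.
    steps_per_epoch = num_samples // global_batch
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        "samples_seen": num_samples * epochs,
    }


schedule = training_schedule(PER_GPU_BATCH, NUM_GPUS, EPOCHS, NUM_SAMPLES)
print(schedule)
```

Under these assumptions the run uses a global batch of 4096 and roughly 85k optimizer steps; the real figure depends on the exact dataset size and data-loader settings.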
import torch
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")
# Load PowerCLIP checkpoint
ckpt = torch.load("epoch_latest.pt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"], strict=False)
# Switch to average pooling (PowerCLIP default)
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
model.eval()
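Once the model is loaded, zero-shot classification could look roughly like the sketch below. It is not taken from the PowerCLIP repository: the helper names `zero_shot_probs` and `classify` are hypothetical, the fixed `logit_scale` of 100 is the conventional CLIP value rather than this checkpoint's learned one, and the code assumes the standard OpenCLIP `encode_image` and `encode_text` methods work unchanged with average pooling.

```python
import torch
import torch.nn.functional as F


def zero_shot_probs(image_features, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities between images and label texts."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.T
    return logits.softmax(dim=-1)


def classify(model, preprocess, tokenizer, image_path, labels):
    """Hypothetical helper: score one image file against a list of text labels."""
    from PIL import Image  # imported lazily so the helper above needs only torch

    image = preprocess(Image.open(image_path)).unsqueeze(0)
    tokens = tokenizer(labels)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(tokens)
    return zero_shot_probs(image_features, text_features)
```

With `model` and `preprocess` from the snippet above and `tokenizer = open_clip.get_tokenizer("ViT-B-16")`, a call such as `classify(model, preprocess, tokenizer, "example.jpg", ["a photo of a cat", "a photo of a dog"])` would return one probability per label; `example.jpg` is a placeholder path.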
See PowerCLIP for training code and details.
Built on OpenCLIP.