---
license: apache-2.0
library_name: open_clip
pipeline_tag: zero-shot-image-classification
tags:
  - clip
  - vision-language
  - region-phrase-alignment
  - zero-shot
  - sam2
datasets:
  - KMasaki/cc12m-sam2-parse-tree
---

# PowerCLIP (ViT-B-16, CC12M, SAM2, 32 epochs)

A CLIP model enhanced with region-phrase alignment via softplus scoring, trained on CC12M with SAM2 regions and parse-tree phrases.

## Model Details

| Detail | Value |
|---|---|
| Architecture | ViT-B-16 |
| Training data | CC12M (10.9M samples) |
| Region model | SAM2 (Hiera-Small, `points_per_side=16`) |
| Epochs | 32 |
| Batch size | 512 × 8 GPUs |
| Vision pooling | Average |
| Text pooling | Average |
| SAM loss ratio | 0.1 |
| SAM regions top-k | 10 |
| Softplus tau | 0.001 |
| Softplus alpha | 0.75 |
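To give a sense of how the `Softplus tau`, `Softplus alpha`, and top-k settings above might interact, here is a minimal, hypothetical sketch of softplus-based region-phrase scoring. The function name, tensor shapes, and the exact aggregation are illustrative assumptions, not PowerCLIP's verbatim implementation:

```python
import torch
import torch.nn.functional as F

def region_phrase_score(region_feats, phrase_feats, tau=0.001, alpha=0.75, topk=10):
    """Illustrative region-phrase alignment score (not the exact PowerCLIP code).

    region_feats: (R, D) embeddings of SAM2 regions.
    phrase_feats: (P, D) embeddings of parse-tree phrases.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)
    sim = phrase_feats @ region_feats.T  # (P, R) cosine similarities

    # Keep only the top-k best-matching regions per phrase.
    k = min(topk, sim.shape[-1])
    topk_sim = sim.topk(k, dim=-1).values

    # tau * softplus(x / tau) is a smooth, always-positive relaxation of max(x, 0);
    # a small tau (e.g. 0.001) makes it nearly hard.
    score = tau * F.softplus(topk_sim / tau)
    return alpha * score.mean()
```

With a small `tau`, the scoring behaves like a clipped cosine similarity averaged over the top-k regions, scaled by `alpha`.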

## Usage

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")

# Load the PowerCLIP checkpoint
ckpt = torch.load("epoch_latest.pt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"], strict=False)

# Switch to average pooling (the PowerCLIP default)
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
model.eval()
```

## Training

See PowerCLIP for training code and details.
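The exact objective lives in the training code; one plausible reading of the `SAM loss ratio 0.1` entry in the table is that the region-phrase alignment term is added to the standard contrastive loss with weight 0.1. A hypothetical sketch with dummy loss values:

```python
import torch

# Hypothetical combined objective, assuming "SAM loss ratio 0.1" weights
# the region-phrase term against the standard CLIP contrastive loss.
clip_loss = torch.tensor(2.3)    # image-text contrastive loss (dummy value)
region_loss = torch.tensor(1.1)  # SAM2 region / phrase alignment loss (dummy value)

sam_loss_ratio = 0.1
total_loss = clip_loss + sam_loss_ratio * region_loss
```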

## Acknowledgement

Built on OpenCLIP.