PowerCLIP (ViT-B-16, CC12M, 32 epochs)

A CLIP model enhanced with region-phrase alignment via softplus scoring. It is trained on CC12M using SAM-extracted image regions paired with parse-tree phrases from the captions.
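The softplus scoring mentioned above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name and the exact way tau and alpha enter the score are assumptions, with the default values taken from the hyperparameter table below.

```python
import torch
import torch.nn.functional as F

def softplus_region_score(region_feats, phrase_feats, tau=0.001, alpha=0.75):
    """Hypothetical region-phrase scoring sketch (not the official PowerCLIP code).

    region_feats: (R, D) embeddings of SAM image regions
    phrase_feats: (P, D) embeddings of parse-tree caption phrases
    """
    # L2-normalize so the dot product is a cosine similarity.
    region_feats = F.normalize(region_feats, dim=-1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)
    sim = region_feats @ phrase_feats.t()  # (R, P) cosine similarities
    # Softplus with temperature tau gives a smooth, always-positive score;
    # alpha rescales it before aggregation (how alpha/tau are applied is assumed).
    return alpha * tau * F.softplus(sim / tau)
```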

Model Details

Architecture: ViT-B-16
Training data: CC12M (10.9M samples)
Epochs: 32
Batch size: 512 × 8 GPUs
Vision pooling: average
Text pooling: average
SAM loss ratio: 0.1
SAM regions top-k: 10
Softplus tau: 0.001
Softplus alpha: 0.75
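The "SAM loss ratio" and "SAM regions top-k" settings above suggest a weighted combination of the contrastive CLIP loss with a region-alignment term over the top-scoring regions. A hedged sketch, where the loss form, reduction, and tensor layout are all assumptions:

```python
import torch

def combined_loss(clip_loss, region_phrase_sims, sam_loss_ratio=0.1, topk=10):
    """Hypothetical total-loss sketch (not the official PowerCLIP code).

    clip_loss: scalar contrastive loss
    region_phrase_sims: (B, R, P) similarities between each image's SAM
        regions and its caption's parse-tree phrases
    """
    # Keep only the top-k best-matching regions per phrase,
    # mirroring the "SAM regions top-k" setting.
    k = min(topk, region_phrase_sims.size(1))
    top_vals, _ = region_phrase_sims.topk(k, dim=1)  # (B, k, P)
    # A simple alignment objective: push selected similarities toward 1.
    sam_loss = (1.0 - top_vals).mean()
    # "SAM loss ratio" weights the region term against the CLIP loss.
    return clip_loss + sam_loss_ratio * sam_loss
```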

Usage

import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")

# Load PowerCLIP checkpoint
ckpt = torch.load("epoch_32.pt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"], strict=False)

# Switch to average pooling (PowerCLIP default)
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
model.eval()

Training

See the PowerCLIP repository for training code and details.

Acknowledgement

Built on OpenCLIP.
