# PowerCLIP (ViT-B-16, CC12M, 32 epochs)
A CLIP model enhanced with region-phrase alignment via softplus scoring, trained on CC12M with SAM regions and parse-tree phrases.
## Model Details
| Setting | Value |
|---|---|
| Architecture | ViT-B-16 |
| Training data | CC12M (10.9M samples) |
| Epochs | 32 |
| Batch size | 512 × 8 GPUs |
| Vision pooling | Average |
| Text pooling | Average |
| SAM loss ratio | 0.1 |
| SAM regions top-k | 10 |
| Softplus tau | 0.001 |
| Softplus alpha | 0.75 |
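To illustrate how the tau, alpha, and top-k values above might enter a softplus-based region-phrase score, here is a hedged, minimal sketch. The function `region_phrase_scores` and its exact formula are illustrative assumptions, not the actual PowerCLIP loss; it only shows the general pattern of cosine similarity between SAM region embeddings and phrase embeddings, top-k region selection, and a temperature-scaled softplus.

```python
import torch
import torch.nn.functional as F


def region_phrase_scores(regions, phrases, tau=0.001, alpha=0.75, topk=10):
    """Hypothetical sketch of a softplus region-phrase score (not the
    exact PowerCLIP objective).

    regions: (R, D) SAM region embeddings
    phrases: (P, D) parse-tree phrase embeddings
    """
    regions = F.normalize(regions, dim=-1)
    phrases = F.normalize(phrases, dim=-1)
    sims = regions @ phrases.T                              # (R, P) cosine similarities
    top = sims.topk(min(topk, sims.size(0)), dim=0).values  # top-k regions per phrase
    # tau * softplus(x / tau) is a smooth, non-negative relaxation of relu(x)
    return alpha * tau * F.softplus(top / tau)
```

With a small tau (0.001 in the table), the scaled softplus closely approximates a ReLU, so near-zero or negative similarities contribute almost nothing.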
## Usage
```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")

# Load PowerCLIP checkpoint
ckpt = torch.load("epoch_32.pt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"], strict=False)

# Switch to average pooling (PowerCLIP default)
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
model.eval()
```
## Training
See the PowerCLIP repository for training code and details.
## Acknowledgement
Built on OpenCLIP.