---
license: mit
tags:
- vision
- clip
- lora
- multilabel-classification
- image-classification
- bitsandbytes
- 8bit
---

# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification

This is a lightweight multi-label image classifier built on [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), fine-tuned with LoRA (Low-Rank Adaptation) on top of an 8-bit quantized backbone (via `bitsandbytes`). It targets multi-label classification over 20 distinct image categories.

This repo contains only:

- The **LoRA adapter** weights (`adapter_model.safetensors`)
- The **classifier head** weights (`classifier_head.pt`)
- A sample loading script in this README

---

## 🧠 Model Architecture

- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`

---

## 🧪 Training Details

- LoRA was applied to the attention projection modules: `q_proj`, `k_proj`, `v_proj`, `out_proj` (see the configuration sketch in the appendix below)
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss (γ⁻ = 2); a sketch of this loss also appears in the appendix
- Epochs: 2, with a grid search over the learning rate and `gamma_neg`

---

## 📂 Class Labels

The model supports 20 categories:

```
Class 0, Class 1, Class 2, ..., Class 19
```

You can replace these with your own label names based on your dataset.

---

## 🚀 How to Use

### 📦 Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate
```

### 🧩 Load model

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, BitsAndBytesConfig, CLIPProcessor
from peft import PeftModel

class CLIPForMultiLabel(nn.Module):
    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels)
        )

    def forward(self, pixel_values):
        image_feats = self.backbone.get_image_features(pixel_values=pixel_values)
        # Cast to fp32: the 8-bit backbone emits half-precision features,
        # while the classifier head is kept in full precision.
        return self.classifier(image_feats.float())

# Load the 8-bit backbone and attach the LoRA adapter
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", quantization_config=quant_cfg)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")

# Load the classifier head
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu"
)
model.classifier.load_state_dict(state_dict)
model.eval()

# Load the processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```

### 🖼️ Predict on an image

```python
from PIL import Image

image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
# Move inputs to the same device as the quantized backbone
pixel_values = inputs["pixel_values"].to(next(model.parameters()).device)

with torch.no_grad():
    logits = model(pixel_values)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int().cpu().numpy()

print("Predicted multi-hot vector:", preds)
```

To map the multi-hot vector back to label names, see the snippet in the appendix below.

---

## 📜 License

This model is released under the MIT license.

---

## 💬 Citation

If you use this model in your work, please cite this repository or acknowledge it appropriately.
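
---

## 🔧 Appendix: Reference Sketches

The snippets below are illustrative sketches, not part of the released weights. First, a LoRA setup matching the documented hyperparameters might look like the following; only `r=16`, `lora_alpha=32`, and the target modules are stated in this card, so `lora_dropout` and `bias` are assumptions.

```python
from peft import LoraConfig, get_peft_model

# Sketch of the documented LoRA setup. lora_dropout and bias are
# assumed values; only r, lora_alpha, and target_modules come from
# this model card. `base` is the CLIPModel loaded above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,  # assumption
    bias="none",       # assumption
)
peft_backbone = get_peft_model(base, lora_config)
peft_backbone.print_trainable_parameters()
```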
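
The training loss is described above only as "Asymmetric Focal Loss (γ⁻ = 2)". A minimal sketch in the spirit of the common asymmetric-loss formulation is shown below; `gamma_pos=0` and the `eps` clamp are assumptions, not documented choices.

```python
import torch
import torch.nn as nn

class AsymmetricFocalLoss(nn.Module):
    """Minimal sketch of an asymmetric focal loss for multi-label logits.

    Only gamma_neg=2 is documented for this model; gamma_pos and eps
    are assumed here.
    """
    def __init__(self, gamma_neg=2.0, gamma_pos=0.0, eps=1e-8):
        super().__init__()
        self.gamma_neg = gamma_neg
        self.gamma_pos = gamma_pos
        self.eps = eps

    def forward(self, logits, targets):
        p = torch.sigmoid(logits)
        # Standard binary cross-entropy, split into positive/negative terms
        loss_pos = targets * torch.log(p.clamp(min=self.eps))
        loss_neg = (1 - targets) * torch.log((1 - p).clamp(min=self.eps))
        # Asymmetric focusing: easy negatives (p near 0) are down-weighted
        # more aggressively than positives when gamma_neg > gamma_pos
        loss_pos = loss_pos * (1 - p) ** self.gamma_pos
        loss_neg = loss_neg * p ** self.gamma_neg
        return -(loss_pos + loss_neg).mean()
```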
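
Finally, since this card ships only generic `Class 0 … Class 19` labels, here is a small snippet (continuing from the prediction example) for mapping the multi-hot output back to readable names; `class_names` is a placeholder you should replace with your dataset's actual labels.

```python
# Placeholder names -- substitute your dataset's actual labels.
class_names = [f"Class {i}" for i in range(20)]

# `preds` from the prediction example has shape (1, 20);
# collect the names of the active classes.
predicted_labels = [class_names[i] for i, flag in enumerate(preds[0]) if flag == 1]
print("Predicted labels:", predicted_labels)
```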