---
license: mit
tags:
- vision
- clip
- lora
- multilabel-classification
- image-classification
- bitsandbytes
- 8bit
---

# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification

This is a lightweight multi-label image classifier built on [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), fine-tuned with LoRA (Low-Rank Adaptation) on top of an 8-bit quantized backbone (via `bitsandbytes`). It is intended for multi-label classification over 20 distinct image categories.

This repo contains only:
- The **LoRA adapter** weights (`adapter_model.safetensors`)
- The **classifier head** weights (`classifier_head.pt`)
- A sample loading script in this README

---

## 🧠 Model Architecture

- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`

---

## 🧪 Training Details

- LoRA was applied to the attention projection modules: `q_proj`, `k_proj`, `v_proj`, `out_proj` (configuration sketch below)
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss (γ⁻ = 2; loss sketch below)
- Epochs: 2 (learning rate and `gamma_neg` selected by grid search)
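
For reference, here is a minimal sketch of how such an adapter could be configured with `peft`, using the hyperparameters listed above. This is illustrative, not the original training script; `lora_dropout` and `bias` are assumed defaults.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

# r=16, alpha=32, and the attention projections come from the details above;
# lora_dropout and bias are assumptions, not documented values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,
    bias="none",
)

base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()
```

Likewise, a hedged sketch of an asymmetric focal loss with `gamma_neg=2`. Only γ⁻ is stated above, so `gamma_pos=0` (plain BCE on positive labels) is an assumption:

```python
import torch

def asymmetric_focal_loss(logits, targets, gamma_neg=2.0, gamma_pos=0.0, eps=1e-8):
    """Multi-label asymmetric focal loss; gamma_neg down-weights easy negatives."""
    probs = torch.sigmoid(logits)
    # Positive term: -(1 - p)^gamma_pos * log(p), applied to positive labels
    pos_loss = targets * (1 - probs) ** gamma_pos * torch.log(probs.clamp(min=eps))
    # Negative term: -p^gamma_neg * log(1 - p), applied to negative labels
    neg_loss = (1 - targets) * probs ** gamma_neg * torch.log((1 - probs).clamp(min=eps))
    return -(pos_loss + neg_loss).mean()
```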

---

## 📊 Class Labels

The model outputs 20 categories:
```
Class 0, Class 1, Class 2, ..., Class 19
```
These are placeholder names; replace them with the label names from your own dataset.

---

## 🚀 How to Use

### 📦 Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate
```

### 🧩 Load model

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor, BitsAndBytesConfig
from peft import PeftModel

class CLIPForMultiLabel(nn.Module):
    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        # get_image_features returns projected embeddings of size projection_dim
        # (768 for CLIP ViT-L/14)
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, pixel_values):
        image_feats = self.backbone.get_image_features(pixel_values=pixel_values)
        return self.classifier(image_feats)

# Load the 8-bit quantized CLIP backbone and attach the LoRA adapter
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained(
    "openai/clip-vit-large-patch14",
    quantization_config=quant_cfg,
    device_map="auto",  # 8-bit weights need accelerate to place them on a device
)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")

# Load the classifier head (stored separately, since PEFT saves only the adapter)
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu",
)
model.classifier.load_state_dict(state_dict)
model.classifier.to(base.device)  # keep the head on the same device as the backbone
model.eval()

# Load the matching image processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```

### 🖼️ Predict on an image

```python
from PIL import Image

# Preprocess a single image with the CLIP processor
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(base.device)  # match the backbone's device

with torch.no_grad():
    logits = model(pixel_values)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int().cpu().numpy()  # 0.5 threshold per label

print("Predicted multi-hot vector:", preds)
```
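
To turn the multi-hot vector into readable names, pair it with your label list from the Class Labels section. `LABELS` below is a placeholder assumption; substitute your dataset's class names:

```python
# Placeholder label names; replace with your dataset's 20 classes.
LABELS = [f"Class {i}" for i in range(20)]

# preds has shape (1, 20) for a single image
predicted_labels = [LABELS[i] for i, flag in enumerate(preds[0]) if flag == 1]
print("Predicted labels:", predicted_labels)
```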

---

## 📄 License

This model is released under the MIT license.

---

## 💬 Citation

If you use this model in your work, please cite this repository or acknowledge it appropriately.