---
license: mit
tags:
- vision
- clip
- lora
- multilabel-classification
- image-classification
- bitsandbytes
- 8bit
---

# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification

This is a lightweight multi-label image classifier built on [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), fine-tuned with LoRA (Low-Rank Adaptation) on top of an 8-bit quantized backbone (via `bitsandbytes`). It targets multi-label classification over 20 distinct image categories.

This repo contains only:

- The **LoRA adapter** weights (`adapter_model.safetensors`)
- The **classifier head** weights (`classifier_head.pt`)
- A sample loading script in this README

---

## 🧠 Model Architecture

- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`

---

## 🧪 Training Details

- LoRA was applied to the attention projection modules: `q_proj`, `k_proj`, `v_proj`, `out_proj` (see the configuration sketch in the appendix below)
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss (γ⁻ = 2); a sketch of this loss also appears in the appendix
- Epochs: 2, with a grid search over the learning rate and `gamma_neg`

---

## 📂 Class Labels

The model supports 20 categories:

```
Class 0, Class 1, Class 2, ..., Class 19
```

You can replace these with your own label names based on your dataset.

---

## 🚀 How to Use

### 📦 Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate
```

### 🧩 Load model

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, BitsAndBytesConfig, CLIPProcessor
from peft import PeftModel

class CLIPForMultiLabel(nn.Module):
    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels)
        )

    def forward(self, pixel_values):
        image_feats = self.backbone.get_image_features(pixel_values=pixel_values)
        # Cast to fp32: the 8-bit backbone emits half-precision features,
        # while the classifier head is kept in full precision.
        return self.classifier(image_feats.float())

# Load the 8-bit backbone and attach the LoRA adapter
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", quantization_config=quant_cfg)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")

# Load the classifier head
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu"
)
model.classifier.load_state_dict(state_dict)
model.eval()

# Load the processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```

### 🖼️ Predict on an image

```python
from PIL import Image

image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
# Move inputs to the same device as the quantized backbone
pixel_values = inputs["pixel_values"].to(next(model.parameters()).device)

with torch.no_grad():
    logits = model(pixel_values)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int().cpu().numpy()

print("Predicted multi-hot vector:", preds)
```

To map the multi-hot vector back to label names, see the snippet in the appendix below.

---

## 📜 License

This model is released under the MIT license.

---

## 💬 Citation

If you use this model in your work, please cite this repository or acknowledge it appropriately.
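
---

## 🔧 Appendix: Reference Sketches

The snippets below are illustrative sketches, not part of the released weights. First, a LoRA setup matching the documented hyperparameters might look like the following; only `r=16`, `lora_alpha=32`, and the target modules are stated in this card, so `lora_dropout` and `bias` are assumptions.

```python
from peft import LoraConfig, get_peft_model

# Sketch of the documented LoRA setup. lora_dropout and bias are
# assumed values; only r, lora_alpha, and target_modules come from
# this model card. `base` is the CLIPModel loaded above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,  # assumption
    bias="none",       # assumption
)
peft_backbone = get_peft_model(base, lora_config)
peft_backbone.print_trainable_parameters()
```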
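
The training loss is described above only as "Asymmetric Focal Loss (γ⁻ = 2)". A minimal sketch in the spirit of the common asymmetric-loss formulation is shown below; `gamma_pos=0` and the `eps` clamp are assumptions, not documented choices.

```python
import torch
import torch.nn as nn

class AsymmetricFocalLoss(nn.Module):
    """Minimal sketch of an asymmetric focal loss for multi-label logits.

    Only gamma_neg=2 is documented for this model; gamma_pos and eps
    are assumed here.
    """
    def __init__(self, gamma_neg=2.0, gamma_pos=0.0, eps=1e-8):
        super().__init__()
        self.gamma_neg = gamma_neg
        self.gamma_pos = gamma_pos
        self.eps = eps

    def forward(self, logits, targets):
        p = torch.sigmoid(logits)
        # Standard binary cross-entropy, split into positive/negative terms
        loss_pos = targets * torch.log(p.clamp(min=self.eps))
        loss_neg = (1 - targets) * torch.log((1 - p).clamp(min=self.eps))
        # Asymmetric focusing: easy negatives (p near 0) are down-weighted
        # more aggressively than positives when gamma_neg > gamma_pos
        loss_pos = loss_pos * (1 - p) ** self.gamma_pos
        loss_neg = loss_neg * p ** self.gamma_neg
        return -(loss_pos + loss_neg).mean()
```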
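
Finally, since this card ships only generic `Class 0 … Class 19` labels, here is a small snippet (continuing from the prediction example) for mapping the multi-hot output back to readable names; `class_names` is a placeholder you should replace with your dataset's actual labels.

```python
# Placeholder names -- substitute your dataset's actual labels.
class_names = [f"Class {i}" for i in range(20)]

# `preds` from the prediction example has shape (1, 20);
# collect the names of the active classes.
predicted_labels = [class_names[i] for i, flag in enumerate(preds[0]) if flag == 1]
print("Predicted labels:", predicted_labels)
```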