---
license: mit
tags:
- vision
- clip
- lora
- multilabel-classification
- image-classification
- bitsandbytes
- 8bit
---
# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification
This model is a lightweight multi-label image classifier built on [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14). The backbone is loaded with 8-bit quantization (via `bitsandbytes`) and adapted with LoRA (Low-Rank Adaptation). It predicts 20 image categories, any number of which may apply to a single image.
This repo contains only:
- The **LoRA adapter** weights (`adapter_model.safetensors`)
- The **classifier head** weights (`classifier_head.pt`)
- A sample loading script in this README
---
## 🧠 Model Architecture
- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`
---
## 🧪 Training Details
- LoRA was applied to the attention projection modules: `q_proj`, `k_proj`, `v_proj`, `out_proj`
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss (γ⁻=2)
- Epochs: 2, with a grid search over the learning rate and `gamma_neg`
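
The card names the loss but not its exact form. As a rough illustration only, one common formulation of an asymmetric focal loss down-weights easy negatives by a factor of p^γ⁻ while leaving positives close to plain cross-entropy. The sketch below assumes `gamma_pos=0`, no probability clipping, and a sum-over-labels, mean-over-batch reduction; none of these are confirmed by this card, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F


def asymmetric_focal_loss(logits, targets, gamma_neg=2.0, gamma_pos=0.0):
    """Hypothetical helper: focal-style BCE with separate focusing for negatives."""
    probs = torch.sigmoid(logits)
    # logsigmoid(x) = log(p) and logsigmoid(-x) = log(1 - p), computed stably.
    log_p = F.logsigmoid(logits)
    log_not_p = F.logsigmoid(-logits)
    # Positives are weighted by (1 - p)^gamma_pos, negatives by p^gamma_neg.
    pos_term = targets * (1.0 - probs).pow(gamma_pos) * log_p
    neg_term = (1.0 - targets) * probs.pow(gamma_neg) * log_not_p
    # Assumed reduction: sum over labels, then mean over the batch.
    return -(pos_term + neg_term).sum(dim=1).mean()


torch.manual_seed(0)
logits = torch.randn(4, 20)                    # batch of 4 images, 20 labels
targets = (torch.rand(4, 20) > 0.8).float()    # sparse multi-hot targets
print(asymmetric_focal_loss(logits, targets).item())
```

With both gammas set to 0 this reduces to ordinary binary cross-entropy, which makes it easy to check the implementation against `F.binary_cross_entropy_with_logits`.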
---
## 📂 Class Labels
The model supports 20 categories:
```
Class 0, Class 1, Class 2, ..., Class 19
```
You can replace these with your own label names based on your dataset.
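
One way to attach names to the model's outputs is a plain list indexed by class id. The sketch below uses the placeholder names from this card; `label_names` and `decode_multi_hot` are hypothetical names, and the real label order has to match the order used in your training data.

```python
NUM_LABELS = 20

# Placeholder names from this card; replace with your dataset's labels,
# keeping the same index order that was used during training.
label_names = [f"Class {i}" for i in range(NUM_LABELS)]


def decode_multi_hot(multi_hot, names=label_names):
    """Turn a 0/1 vector into the list of active label names."""
    if len(multi_hot) != len(names):
        raise ValueError("prediction length must match the number of labels")
    return [name for name, flag in zip(names, multi_hot) if flag]


example = [0] * NUM_LABELS
example[0] = 1
example[2] = 1
print(decode_multi_hot(example))
```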
---
## 🚀 How to Use
### 📦 Install dependencies
```bash
pip install transformers peft bitsandbytes accelerate
```
### 🧩 Load model
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, BitsAndBytesConfig, CLIPProcessor
from peft import PeftModel
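class CLIPForMultiLabel(nn.Module):
    """CLIP image encoder followed by a small multi-label classification head."""

    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, pixel_values):
        # Run the image on whatever device the quantized backbone was placed on.
        backbone_device = next(self.backbone.parameters()).device
        image_feats = self.backbone.get_image_features(
            pixel_values=pixel_values.to(backbone_device)
        )
        # Match the head's device and dtype; the 8-bit backbone may emit float16.
        head_weight = self.classifier[-1].weight
        image_feats = image_feats.to(device=head_weight.device, dtype=head_weight.dtype)
        return self.classifier(image_feats)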
# Load LoRA backbone
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", quantization_config=quant_cfg)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")
# Load classifier head
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu",
)
model.classifier.load_state_dict(state_dict)
model.eval()
# Load processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
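
To sanity-check the classification head on its own, without downloading the backbone, it can be rebuilt and fed random features. This is only a sketch: the hidden size of 768 is assumed to be the `projection_dim` of `openai/clip-vit-large-patch14`, the random tensor stands in for real image features, and the weights here are untrained rather than loaded from `classifier_head.pt`.

```python
import torch
import torch.nn as nn

hidden_size = 768   # assumed projection_dim of the CLIP ViT-L/14 checkpoint
num_labels = 20

# Same layer order as the head in this card: LayerNorm, Dropout, Linear.
head = nn.Sequential(
    nn.LayerNorm(hidden_size),
    nn.Dropout(0.1),
    nn.Linear(hidden_size, num_labels),
)
head.eval()  # disables dropout for inference

fake_feats = torch.randn(2, hidden_size)  # stand-in for two image embeddings
with torch.no_grad():
    logits = head(fake_feats)

print(logits.shape)
```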
### 🖼️ Predict on an image
```python
from PIL import Image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"]
with torch.no_grad():
    logits = model(pixel_values)
probs = torch.sigmoid(logits)
preds = (probs > 0.5).int().cpu().numpy()
print("Predicted multi-hot vector:", preds)
```
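
The multi-hot vector is often easier to read as a ranked list of labels with their probabilities. The helper below is a small sketch in plain Python; `rank_labels` is a hypothetical name, the probabilities are made-up example values, and with the snippet above you would pass `probs[0].tolist()` instead. The 0.5 cut-off mirrors the threshold used above and may need tuning per class.

```python
def rank_labels(probs, names, threshold=0.5):
    """Return (name, probability) pairs above the threshold, highest first."""
    kept = [(name, p) for name, p in zip(names, probs) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


names = [f"Class {i}" for i in range(20)]   # placeholder label names

# Made-up probabilities for illustration only.
example_probs = [0.1] * 20
example_probs[3] = 0.9
example_probs[7] = 0.6

for name, p in rank_labels(example_probs, names):
    print(f"{name}: {p:.2f}")
```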
---
## 📜 License
This model is released under the MIT license.
---
## 💬 Citation
If you use this model in your work, please cite this repository or acknowledge it appropriately.