# SigLIP-Qwen-1.7B-COCO-Captioner

## Model Description

This is a custom Multimodal Large Language Model (MLLM) designed for image captioning. It combines a state-of-the-art vision encoder with a lightweight LLM through a custom Q-Former-style adapter.
- **Vision Encoder:** `google/siglip2-base-patch16-512` (frozen)
- **LLM:** `Qwen/Qwen3-1.7B` (LoRA fine-tuned)
- **Adapter:** Custom multi-layer cross-attention adapter
- **Parameters:** ~1.9B total (~30M active via LoRA & adapter)
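A Q-Former-style cross-attention adapter can be sketched roughly as below. This is an illustrative PyTorch sketch, not the repository's `model_architecture.py`; the dimensions are assumptions (SigLIP2-base hidden size 768, Qwen3-1.7B hidden size 2048, 32 learnable query tokens):

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Sketch: learnable query tokens attend to frozen vision features
    and emit a fixed-length sequence in the LLM's hidden size."""
    def __init__(self, vision_dim=768, llm_dim=2048, num_queries=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map vision features into LLM space
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, vision_feats):  # (B, num_patches, vision_dim)
        kv = self.proj(vision_feats)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, kv, kv)
            q = q + out  # residual update of the query tokens
        return q  # (B, num_queries, llm_dim), prepended to the LLM input

adapter = CrossAttentionAdapter()
feats = torch.randn(2, 1024, 768)  # e.g. 32x32 patch grid for a 512px image
print(adapter(feats).shape)  # torch.Size([2, 32, 2048])
```

The fixed number of query tokens keeps the visual prefix short regardless of image resolution, and the adapter plus LoRA are the only trained parameters, which is consistent with the ~30M active-parameter count above.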
## Intended Uses & Limitations
- **Intended Use:** Generating descriptive captions for natural images. It excels at recognizing common objects found in the COCO dataset (people, vehicles, animals, household items).
- **Limitations:**
  - **Domain specificity:** Trained primarily on COCO, so it may struggle with out-of-domain content (e.g., specific plant species, medical images, or text requiring OCR) not covered by the COCO classes.
  - **Hallucination:** Like all MLLMs, it may occasionally describe objects that are not present, especially in complex or blurry images.
## Training and Evaluation Data
- **Training Data:** MS COCO 2017 captioning dataset (~118k images).
- **Evaluation Data:**
  - COCO validation: CIDEr ≈ 0.866
  - NoCaps validation: CIDEr ≈ 0.50 (zero-shot generalization)
## Training Procedure
The model was trained in two stages on an NVIDIA H100 GPU:

- **Stage 1 (Alignment):** Pre-trained on Flickr30k to align visual features with text.
- **Stage 2 (Fine-tuning):** Fine-tuned on COCO 2017 for 10 epochs.
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5 (decayed to 5e-6 for refinement)
- **Batch Size:** 64 per step (effective batch size 128 via gradient accumulation)
- **Precision:** bfloat16
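The effective-batch-size setup above follows the standard gradient-accumulation pattern: two micro-batches of 64 accumulated per optimizer step give an effective batch of 128. A minimal sketch with a toy stand-in model (not the repository's training script):

```python
import torch

# Toy stand-in model and data; the real run fine-tunes the captioner on COCO.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 2  # 2 micro-batches of 64 -> effective batch size 128
micro_batches = [(torch.randn(64, 16), torch.randn(64, 4)) for _ in range(4)]

for step, (x, y) in enumerate(micro_batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` before `backward()` keeps the accumulated gradient equal to the mean over the full effective batch, matching what a single 128-sample step would produce.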
## 💻 How to Use (Required Code)

**Note:** This model uses custom architecture code (`model_architecture.py`), which is included in the repository. You must import the architecture class before loading the weights.
```python
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision import transforms

from model_architecture import MultiModalQwen, MMConfig

# 1. Initialize configuration & model
cfg = MMConfig()
model = MultiModalQwen(cfg)

# 2. Download and load weights
ckpt_path = hf_hub_download(
    repo_id="SatyaJaiss/SigLIP-Qwen-1.7B-COCO-Captioner",
    filename="pytorch_model.bin",
)
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model.to("cuda", dtype=torch.bfloat16).eval()  # bfloat16 to match the inputs below

# 3. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("SatyaJaiss/SigLIP-Qwen-1.7B-COCO-Captioner")

# 4. Prepare image (SigLIP normalization: mean/std 0.5)
image = Image.open("path/to/your/image.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
pixel_values = transform(image).unsqueeze(0).to("cuda", dtype=torch.bfloat16)

# 5. Generate caption
inputs = tokenizer(["Caption: "], return_tensors="pt").to("cuda")
with torch.no_grad():
    gen_ids = model.generate(
        pixel_values=pixel_values,
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=30,
        do_sample=False,
    )
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```