SigLIP-Qwen-1.7B-COCO-Captioner

Model Description

This is a custom Multimodal Large Language Model (MLLM) designed for Image Captioning. It combines a state-of-the-art vision encoder with a lightweight LLM via a custom Q-Former-style adapter.

  • Vision Encoder: google/siglip2-base-patch16-512 (Frozen)
  • LLM: Qwen/Qwen3-1.7B (LoRA Fine-tuned)
  • Adapter: Custom Multi-layer Cross-Attention Adapter
  • Parameters: ~1.9B Total (Active Parameters: ~30M via LoRA & Adapter)
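The adapter's job is to translate frozen SigLIP patch features into tokens the LLM can consume. Below is a minimal PyTorch sketch of that idea: learned query tokens cross-attend to projected vision features over a few layers. The dimensions and layer layout are assumptions for illustration (SigLIP2-base hidden size 768, Qwen3-1.7B hidden size 2048); the actual implementation lives in model_architecture.py in the repository and may differ.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Sketch of a Q-Former-style adapter: learned queries cross-attend
    to frozen vision features and emit LLM-space tokens.
    Hypothetical dimensions; the real repo code may differ."""

    def __init__(self, vision_dim=768, llm_dim=2048,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learned query tokens, shared across the batch
        self.queries = nn.Parameter(torch.randn(1, num_queries, llm_dim) * 0.02)
        # Project vision features into the LLM embedding space
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, vision_feats):            # (B, num_patches, vision_dim)
        kv = self.vision_proj(vision_feats)     # (B, num_patches, llm_dim)
        q = self.queries.expand(vision_feats.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, kv, kv)            # queries attend to vision tokens
            q = q + out                         # residual connection
        return q                                # (B, num_queries, llm_dim)

# Shape check with small toy dimensions
adapter = CrossAttentionAdapter(vision_dim=16, llm_dim=32,
                                num_queries=4, num_layers=2, num_heads=4)
out = adapter(torch.randn(2, 9, 16))            # 2 images, 9 patches each
```

Only the adapter (plus the LoRA weights) is trained, which is how the active parameter count stays around ~30M.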

Intended Uses & Limitations

  • Intended Use: Generating descriptive captions for natural images. It excels at recognizing common objects found in the COCO dataset (people, vehicles, animals, household items).
  • Limitations:
    • Domain Specificity: Trained primarily on COCO. It may struggle with "out-of-domain" objects (e.g., specific plant species, medical images, or OCR text) that are not present in the COCO classes.
    • Hallucination: Like all MLLMs, it may occasionally describe objects that are not present, especially in complex or blurry images.

Training and Evaluation Data

  • Training Data: MS COCO 2017 Captioning Dataset (~118k images).
  • Evaluation Data:
    • COCO Validation: CIDEr ~0.866
    • NoCaps Validation: CIDEr ~0.50 (zero-shot generalization)
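CIDEr, the metric reported above, scores a candidate caption by the cosine similarity of its n-gram statistics against a set of reference captions. The real metric adds corpus-level TF-IDF weighting and a length penalty (and is implemented in packages such as pycocoevalcap); the toy sketch below keeps only the core n-gram cosine idea to show what is being measured.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_like(candidate, references, max_n=2):
    """Toy CIDEr-style score: mean over n-gram orders of the cosine
    similarity between candidate and reference n-gram count vectors.
    Real CIDEr additionally applies TF-IDF weighting over the corpus."""
    order_scores = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate.split(), n))
        sims = []
        for ref in references:
            r = Counter(ngrams(ref.split(), n))
            dot = sum(cand[g] * r[g] for g in cand)
            norm = (math.sqrt(sum(v * v for v in cand.values()))
                    * math.sqrt(sum(v * v for v in r.values())))
            sims.append(dot / norm if norm else 0.0)
        order_scores.append(sum(sims) / len(sims))
    return sum(order_scores) / len(order_scores)

perfect = cider_like("a dog runs", ["a dog runs"])     # identical -> ~1.0
partial = cider_like("a dog runs", ["a cat sleeps"])   # little overlap -> low
```

Scores like the ~0.866 above are averages of such per-image similarities over the whole validation set (scaled differently in some reports, where CIDEr is multiplied by 10 or 100).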

Training Procedure

The model was trained in two stages on an NVIDIA H100 GPU:

  1. Stage 1 (Alignment): Pre-trained on Flickr30k to align visual features with text.
  2. Stage 2 (Fine-tuning): Fine-tuned on COCO 2017 for 10 epochs.
    • Optimizer: AdamW
    • Learning Rate: 2e-5 (decayed to 5e-6 for refinement)
    • Batch Size: 64 (Effective batch size 128 via gradient accumulation)
    • Precision: bfloat16
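The batch-size line above means gradients from two micro-batches of 64 are accumulated before each optimizer step, giving an effective batch of 128. A minimal PyTorch sketch of that loop, with a toy model and synthetic batches standing in for the MLLM and COCO data (the optimizer and base LR mirror the card):

```python
import torch

# Toy stand-in for the trainable parameters (adapter + LoRA);
# batches here are synthetic placeholders for COCO image-caption pairs.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

accum_steps = 2   # micro-batch 64 x 2 -> effective batch 128
batches = [(torch.randn(64, 8), torch.randn(64, 1)) for _ in range(4)]

num_updates = 0
opt.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale the loss so accumulated gradients average over the effective batch
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
        num_updates += 1
```

In the actual run this loop would also apply the learning-rate decay toward 5e-6 and run in bfloat16 autocast.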

💻 How to Use (Required Code)

Note: This model uses custom architecture code (model_architecture.py), which is included in the repository. You must load the architecture class before loading the weights.

import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from model_architecture import MultiModalQwen, MMConfig
from PIL import Image
from torchvision import transforms

# 1. Initialize Configuration & Model
cfg = MMConfig()
model = MultiModalQwen(cfg)

# 2. Download and Load Weights
ckpt_path = hf_hub_download(repo_id="SatyaJaiss/SigLIP-Qwen-1.7B-COCO-Captioner", filename="pytorch_model.bin")

state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
# Cast to bfloat16 so the weights match the bfloat16 pixel values below
model.to("cuda", dtype=torch.bfloat16).eval()

# 3. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("SatyaJaiss/SigLIP-Qwen-1.7B-COCO-Captioner")

# 4. Prepare Image
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
pixel_values = transform(image).unsqueeze(0).to("cuda").bfloat16()

# 5. Generate Caption
prompts = ["Caption: "]
inputs = tokenizer(prompts, return_tensors="pt").to("cuda")

with torch.no_grad():
    gen_ids = model.generate(
        pixel_values=pixel_values,
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=30,
        do_sample=False
    )

print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))