# AIEI-VL1-2B

A simplified Vision-Language Model (VLM) capable of image captioning and visual question answering (VQA).
## Model Description

This model integrates a SigLIP vision encoder with a Qwen3-1.7B-based LLM via a custom projector. It supports:
- Image Captioning: Generating descriptive text for images.
- Visual Question Answering: Answering questions based on visual input.
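At a high level, the projector maps patch features from the vision encoder into the LLM's embedding space. The sketch below illustrates a common two-layer MLP design for such a projector; the dimensions (1152 for the SigLIP hidden size, 2048 for the Qwen3-1.7B hidden size) and the architecture itself are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative two-layer MLP projector: vision features -> LLM embedding space.

    Dimensions are assumptions for the sketch: 1152 (SigLIP-style hidden size)
    projected to 2048 (Qwen3-1.7B-style hidden size).
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(x)

proj = VisionProjector()
patches = torch.randn(1, 196, 1152)  # one image, 196 patch tokens
print(proj(patches).shape)  # torch.Size([1, 196, 2048])
```

The projected patch tokens are then interleaved with the text token embeddings before being fed to the LLM.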
## Dependencies

To use this model, you need the following Python libraries:

```bash
pip install torch transformers pillow requests einops
```

Note: `einops` may be required by specific vision backbones, depending on the configuration.
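Before loading the model, you can verify that the dependencies are importable with a small stdlib-only check (a sketch; note that the `pillow` package is imported as `PIL`):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Package names correspond to the pip install line above ("pillow" imports as "PIL").
required = ["torch", "transformers", "PIL", "requests", "einops"]
print(missing_packages(required))  # [] means everything is installed
```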
## Inference Example

Below is a simple code snippet that loads the model and runs inference on an example image.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# 1. Load Model and Tokenizer (v0.3 with Resolution Fix)
model_id = "MANO066/AIEI-VL1-2B"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading model {model_id} (v0.3)...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="v0.3",
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# 2. Load and Preprocess Image
image_url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early on a bad download
image = Image.open(BytesIO(response.content)).convert("RGB")

# 3. Task 1: Detailed Captioning
print("\n--- Task 1: Captioning ---")
# The model uses AnyRes/High-Res automatically
captions = model.generate_caption([image], max_new_tokens=200, temperature=0.7)
print(f"Caption: {captions[0]}")

# 4. Task 2: Visual Question Answering (VQA)
print("\n--- Task 2: VQA ---")
question = "What animals are in the image?"
answers = model.generate_vqa([image], [question], max_new_tokens=50)
print(f"Q: {question}")
print(f"A: {answers[0]}")

# 5. Task 3: Visual Reasoning (Think)
print("\n--- Task 3: Reasoning ---")
reason_q = "Analyze the behavior of the animals in the scene."
reasoning = model.generate_reasoning([image], [reason_q], max_new_tokens=256)
print(f"Q: {reason_q}")
print(f"Analysis: {reasoning[0]}")
```

Note: `generate_caption`, `generate_vqa`, and `generate_reasoning` are custom methods provided by the model's remote code, which is why `trust_remote_code=True` is required.