# AIEI-VL1-2B

A simplified Vision-Language Model (VLM) capable of image captioning and visual question answering (VQA).
## Model Description

This model integrates a SigLIP vision encoder with a Qwen3-1.7B-based LLM via a custom projector. It supports:
- Image Captioning: Generating descriptive text for images.
- Visual Question Answering: Answering questions based on visual input.
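At a high level, the projector maps patch features from the vision encoder into the LLM's embedding space. The sketch below illustrates a common two-layer MLP design for such a projector; the dimensions (1152 for the SigLIP hidden size, 2048 for the Qwen3-1.7B hidden size) and the architecture itself are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative two-layer MLP projector: vision features -> LLM embedding space.

    Dimensions are assumptions for the sketch: 1152 (SigLIP-style hidden size)
    projected to 2048 (Qwen3-1.7B-style hidden size).
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(x)

proj = VisionProjector()
patches = torch.randn(1, 196, 1152)  # one image, 196 patch tokens
print(proj(patches).shape)  # torch.Size([1, 196, 2048])
```

The projected patch tokens are then interleaved with the text token embeddings before being fed to the LLM.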
## Dependencies

To use this model, you need the following Python libraries:

```bash
pip install torch transformers pillow requests einops
```

Note: `einops` may be required by specific vision backbones, depending on the configuration.
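Before loading the model, you can verify that the dependencies are importable with a small stdlib-only check (a sketch; note that the `pillow` package is imported as `PIL`):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Package names correspond to the pip install line above ("pillow" imports as "PIL").
required = ["torch", "transformers", "PIL", "requests", "einops"]
print(missing_packages(required))  # [] means everything is installed
```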
## Inference Example

Below is a simple code snippet that loads the model and runs inference on an example image.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# 1. Load Model and Tokenizer (v0.3 with Resolution Fix)
model_id = "MANO066/AIEI-VL1-2B"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading model {model_id} (v0.3)...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="v0.3",
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# 2. Load and Preprocess Image
image_url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early on a bad download
image = Image.open(BytesIO(response.content)).convert("RGB")

# 3. Task 1: Detailed Captioning
print("\n--- Task 1: Captioning ---")
# The model uses AnyRes/High-Res automatically
captions = model.generate_caption([image], max_new_tokens=200, temperature=0.7)
print(f"Caption: {captions[0]}")

# 4. Task 2: Visual Question Answering (VQA)
print("\n--- Task 2: VQA ---")
question = "What animals are in the image?"
answers = model.generate_vqa([image], [question], max_new_tokens=50)
print(f"Q: {question}")
print(f"A: {answers[0]}")

# 5. Task 3: Visual Reasoning (Think)
print("\n--- Task 3: Reasoning ---")
reason_q = "Analyze the behavior of the animals in the scene."
reasoning = model.generate_reasoning([image], [reason_q], max_new_tokens=256)
print(f"Q: {reason_q}")
print(f"Analysis: {reasoning[0]}")
```

Note: `generate_caption`, `generate_vqa`, and `generate_reasoning` are custom methods provided by the model's remote code, which is why `trust_remote_code=True` is required.