BobVLM βœ¨πŸ‘€

[Article on Medium πŸ“–] [Package on GitHub βš™οΈ]

BobVLM is an ambitious passion project that experiments with pre-training a capable multimodal language model on limited resources and hardware while still achieving impressive performance. The result is a 1.5B model, pre-trained on a single P100 GPU, that is capable of detailed image description and moderate question answering.

Model Architecture πŸ”§

[Figure: BobVLM architecture diagram]

Training Approach

To maintain efficiency and accessibility:

  • Vision and language components are frozen
  • Only the adapter layer is trained (a minimal sketch of this setup follows the list)
  • Supervised training, treating adapter training as model fine-tuning (following Houlsby et al. (2019)'s work on MLP adapters for transfer learning)
  • Trained on accessible hardware (T4 or P100 GPUs)
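
To make the setup concrete, here is a minimal PyTorch sketch of the frozen-encoders-plus-trainable-adapter idea. All class names, dimensions, and the stand-in encoder/LLM modules are illustrative assumptions for exposition, not BobVLM's actual internals:

import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Trainable MLP that projects vision features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Stand-ins for the pretrained components (dimensions are made up for the sketch).
vision_encoder = nn.Linear(768, 1024)    # pretend vision backbone
language_model = nn.Linear(2048, 32000)  # pretend LLM head

# Freeze the vision and language components; only the adapter receives gradients.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

adapter = MLPAdapter(vision_dim=1024, llm_dim=2048, hidden_dim=4096)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)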

Demo

I couldn't afford the GPU prices here, so it runs quite slooowww on CPU 🀧
Check out the demo here πŸ™ƒπŸ™ƒ: Demo on Spaces

Installation

pip install git+https://github.com/logic-ot/BobVLM.git

or in a notebook

!pip install git+https://github.com/logic-ot/BobVLM.git

Usage

Basic Usage

from BobVLM import BobVLMProcessor, load_model, pipeline

# Load model and processor
model = load_model()
processor = BobVLMProcessor()

# Create pipeline
pipe = pipeline(model, processor)

# Example with URL image and system prompt
response = pipe(
    chat=[
        {"role": "system", "content": "You are an image understanding assistant. You can see and interpret images in fine detail"},
        {"role": "user", "content": "What's in this image?"},
    ],
    images="http://images.cocodataset.org/train2017/000000436349.jpg"
)

print(response)

Model Output

The image shows a large group of trucks parked in a parking lot, with a variety of vehicles, including semi-trucks, buses, and vans, all lined up in a neat and organized manner. The trucks are parked in a row, with some of them having their doors open, while others are closed. The vehicles are all yellow, with some having white or black stripes.<|eot_id|>
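
Note that the raw generation ends with the model's end-of-turn token. If the pipeline returns a plain string, as the print call above suggests, a simple post-processing step removes it:

# Strip the end-of-turn marker from the raw output
clean = response.replace("<|eot_id|>", "").strip()
print(clean)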

Different Input Types

# 1. Local file
response = pipe(
    chat=[{"role": "user", "content": "Describe this image"}],
    images="path/to/your/image.jpg"
)

# 2. PIL Image
from PIL import Image
image = Image.open("your_image.jpg")
response = pipe(
    chat=[{"role": "user", "content": "What do you see?"}],
    images=image
)

Multiple Images

# You can pass multiple images
response = pipe(
    chat=[{"role": "user", "content": "Compare these images"}],
    images=["image1.jpg", "https://example.com/image2.jpg"]
)

Chat with Context

# Chat with context
messages = [
    {"role": "system", "content": "You are an expert at analyzing images in detail."},
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "I see a dog playing in a park."},
    {"role": "user", "content": "What breed is it?"}
]

response = pipe(
    chat=messages,
    images="dog.jpg"
)

Requirements

  • Python 3.7+
  • transformers
  • torch
  • Pillow
  • requests

Model Card

For more detailed information about the model, visit the Hugging Face model page.

Citation

If you use BobVLM in your research, please cite:

@misc{bobvlm2024,
  author = {selfDotOsman},
  title = {BobVLM: A Lightweight Vision Language Model with Efficient Adapter Architecture},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/selfDotOsman/BobVLM-1.5b}}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
