BobVLM βœ¨πŸ‘€

[Article on Medium πŸ“–] [Package on GitHub βš™οΈ]

BobVLM is an ambitious passion project that experiments with pre-training a capable multimodal language model on limited resources and hardware while still achieving impressive performance. The result is a 1.5B model, pre-trained on a single P100 GPU, that is capable of detailed image description and moderate question answering.

Model Architecture πŸ”§

[Figure: BobVLM architecture diagram]

Training Approach

To maintain efficiency and accessibility:

  • Vision and language components are frozen
  • Only the adapter layer is trained (a minimal sketch of this setup follows the list)
  • Supervised training, treating adapter training as model fine-tuning (following Houlsby et al. (2019)'s work on MLP adapters for transfer learning)
  • Trained on accessible hardware (T4 or P100 GPUs)
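
To make the setup concrete, here is a minimal PyTorch sketch of the frozen-encoders-plus-trainable-adapter idea. All class names, dimensions, and the stand-in encoder/LLM modules are illustrative assumptions for exposition, not BobVLM's actual internals:

import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Trainable MLP that projects vision features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Stand-ins for the pretrained components (dimensions are made up for the sketch).
vision_encoder = nn.Linear(768, 1024)    # pretend vision backbone
language_model = nn.Linear(2048, 32000)  # pretend LLM head

# Freeze the vision and language components; only the adapter receives gradients.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

adapter = MLPAdapter(vision_dim=1024, llm_dim=2048, hidden_dim=4096)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)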

Demo

I couldn't afford the GPU prices here, so it runs quite slooowww on CPU 🀧
Check out the demo here πŸ™ƒπŸ™ƒ: Demo on Spaces

Installation

pip install git+https://github.com/logic-ot/BobVLM.git

or in a notebook

!pip install git+https://github.com/logic-ot/BobVLM.git

Usage

Basic Usage

from BobVLM import BobVLMProcessor, load_model, pipeline

# Load model and processor
model = load_model()
processor = BobVLMProcessor()

# Create pipeline
pipe = pipeline(model, processor)

# Example with URL image and system prompt
response = pipe(
    chat=[
        {"role": "system", "content": "You are an image understanding assistant. You can see and interpret images in fine detail"},
        {"role": "user", "content": "What's in this image?"},
    ],
    images="http://images.cocodataset.org/train2017/000000436349.jpg"
)

print(response)

Model Output

The image shows a large group of trucks parked in a parking lot, with a variety of vehicles, including semi-trucks, buses, and vans, all lined up in a neat and organized manner. The trucks are parked in a row, with some of them having their doors open, while others are closed. The vehicles are all yellow, with some having white or black stripes.<|eot_id|>
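
Note that the raw generation ends with the model's end-of-turn token. If the pipeline returns a plain string, as the print call above suggests, a simple post-processing step removes it:

# Strip the end-of-turn marker from the raw output
clean = response.replace("<|eot_id|>", "").strip()
print(clean)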

Different Input Types

# 1. Local file
response = pipe(
    chat=[{"role": "user", "content": "Describe this image"}],
    images="path/to/your/image.jpg"
)

# 2. PIL Image
from PIL import Image
image = Image.open("your_image.jpg")
response = pipe(
    chat=[{"role": "user", "content": "What do you see?"}],
    images=image
)

Multiple Images

# You can pass multiple images
response = pipe(
    chat=[{"role": "user", "content": "Compare these images"}],
    images=["image1.jpg", "https://example.com/image2.jpg"]
)

Chat with Context

# Chat with context
messages = [
    {"role": "system", "content": "You are an expert at analyzing images in detail."},
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "I see a dog playing in a park."},
    {"role": "user", "content": "What breed is it?"}
]

response = pipe(
    chat=messages,
    images="dog.jpg"
)

Requirements

  • Python 3.7+
  • transformers
  • torch
  • Pillow
  • requests

Model Card

For more detailed information about the model, visit the Hugging Face model page.

Citation

If you use BobVLM in your research, please cite:

@misc{bobvlm2024,
  author = {selfDotOsman},
  title = {BobVLM: A Lightweight Vision Language Model with Efficient Adapter Architecture},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/selfDotOsman/BobVLM-1.5b}}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
