---
license: apache-2.0
datasets:
  - xiaorui638/cc3m
  - liuhaotian/LLaVA-Instruct-150K
  - Xkev/LLaVA-CoT-100k
metrics:
  - bleu
  - accuracy
base_model:
  - LiquidAI/LFM2-350M
---

# Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

> **Note:** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data.


## 🌟 Overview

Firebolt-VL is an efficient vision-language model (VLM) that replaces Transformer-based cross-attention fusion with a Cross-modal Modulator (CMM) built from the following pieces (a minimal sketch follows below):

- **Token–grid correlation**: lightweight text–image matching,
- **Top-K grid selection**: focus on relevant regions,
- **FiLM modulation**: feature-wise conditioning,
- **Structured State-Space Model (SSM)**: linear-time sequence modeling.

It is built on the Liquid Foundation Model (LFM2-350M) as the language decoder, enabling strong multimodal reasoning at lower latency.
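
To make the mechanism concrete, here is a minimal sketch of the CMM data flow (correlation → Top-K → FiLM → SSM → FiLM). All module names, shapes, and the GRU stand-in for the SSM are assumptions for exposition, not the released implementation; a real S4/Mamba-style block would take the GRU's place.

```python
import torch
import torch.nn as nn

class CrossModalModulatorSketch(nn.Module):
    """Illustrative CMM: correlate text tokens with image grids, keep the
    Top-K grids per token, FiLM-modulate the text stream, and mix it with a
    linear-time recurrent block (a GRU stands in for the SSM here)."""

    def __init__(self, d_text: int, d_vis: int, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        self.q_proj = nn.Linear(d_text, d_vis)        # text -> shared matching space
        self.film_in = nn.Linear(d_vis, 2 * d_text)   # predicts (gamma, beta) pre-SSM
        self.film_out = nn.Linear(d_vis, 2 * d_text)  # predicts (gamma, beta) post-SSM
        self.ssm = nn.GRU(d_text, d_text, batch_first=True)  # SSM placeholder

    def forward(self, text: torch.Tensor, grids: torch.Tensor) -> torch.Tensor:
        # text: (B, T, d_text), grids: (B, G, d_vis)
        B, T, _ = text.shape
        q = self.q_proj(text)                                   # (B, T, d_vis)
        corr = torch.einsum("btd,bgd->btg", q, grids)           # token-grid correlation
        corr = corr / grids.shape[-1] ** 0.5
        # Top-K grid selection: softmax only over the K most relevant grids per token
        topv, topi = corr.topk(self.top_k, dim=-1)              # (B, T, K)
        w = topv.softmax(dim=-1)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, grids.shape[-1])
        sel = torch.gather(grids.unsqueeze(1).expand(-1, T, -1, -1), 2, idx)
        vis_ctx = (w.unsqueeze(-1) * sel).sum(dim=2)            # (B, T, d_vis)
        # FiLM -> SSM -> FiLM, matching the pipeline described above
        g1, b1 = self.film_in(vis_ctx).chunk(2, dim=-1)
        x, _ = self.ssm(g1 * text + b1)
        g2, b2 = self.film_out(vis_ctx).chunk(2, dim=-1)
        return g2 * x + b2

# Shape check with toy dimensions
cmm = CrossModalModulatorSketch(d_text=512, d_vis=768, top_k=16)
out = cmm(torch.randn(2, 10, 512), torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 10, 512])
```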


## 🧠 Key Features

- ⚡ **Efficient inference.** Linear-time sequence modeling via an SSM replaces quadratic self-attention over long contexts.

- 🎯 **Fine-grained visual grounding.** Token–grid correlation plus Top-K selection helps the model focus on task-relevant visual regions.

- 🧩 **Lightweight cross-modal fusion.** FiLM-based conditioning injects visual context without heavy cross-attention (see the sketch after this list).
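
For intuition on the fusion cost, here is a minimal FiLM sketch (the class name and shapes are illustrative, not the repository's API). Applying it costs O(T·d) per layer, versus the O(T·G·d) score matrix a text-to-grid cross-attention layer would materialize.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector predicts a
    per-feature scale and shift that is broadcast over all text tokens."""

    def __init__(self, d_cond: int, d_model: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(d_cond, 2 * d_model)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model), cond: (B, d_cond) pooled visual context
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * tokens + beta.unsqueeze(1)  # broadcast over T
```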


## 🚀 Training

Firebolt-VL is trained in two stages:

1. **Stage 1 (CMM warm-up / initialization).** Freeze the vision encoder and the LFM decoder; train only the CMM on CC3M.

2. **Stage 2 (end-to-end training).** Train the full model on instruction / reasoning data (e.g., LLaVA-style instruction data plus CoT-style data).

Hardware and schedule reported in the paper: 2× H100 80GB GPUs, AdamW, 5 epochs per stage (batch size 128 in Stage 1, batch size 8 in Stage 2).
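
A hedged sketch of the two-stage freezing schedule follows; the attribute names (`vision_encoder`, `cmm`, `decoder`) are placeholders for exposition, not the released API.

```python
import torch
import torch.nn as nn

class FireboltVLStub(nn.Module):
    """Tiny stand-in model so the freezing logic below is runnable."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # stands in for SigLIP
        self.cmm = nn.Linear(8, 8)             # stands in for the CMM
        self.decoder = nn.Linear(8, 8)         # stands in for LFM2-350M

def set_stage(model: nn.Module, stage: int) -> None:
    # Stage 1: only CMM parameters receive gradients; Stage 2: everything does.
    for name, p in model.named_parameters():
        p.requires_grad = (stage != 1) or name.startswith("cmm.")

model = FireboltVLStub()
set_stage(model, stage=1)
optim = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # placeholder; tune per stage
)
```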


## 🏗️ Architecture

Main Components:

  1. 🎨 Vision Encoder (SigLIP) – extracts grid-level visual embeddings
  2. 🧩 Cross-modal Modulator (CMM) – token–grid correlation → FiLM → SSM → FiLM
  3. 🧠 LFM Decoder (LFM2-350M) – autoregressive reasoning and generation
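
A skeletal view of how the three components compose; the attribute names and call signatures below are assumptions for exposition, not the repository's actual module layout.

```python
import torch
import torch.nn as nn

class FireboltVLSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, cmm: nn.Module, decoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # SigLIP: image -> (B, G, d_vis) grids
        self.cmm = cmm                        # correlation -> Top-K -> FiLM -> SSM -> FiLM
        self.decoder = decoder                # LFM2-350M: autoregressive LM

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        grids = self.vision_encoder(pixel_values)  # grid-level visual embeddings
        fused = self.cmm(text_embeds, grids)       # visually conditioned text states
        return self.decoder(fused)                 # logits over the vocabulary
```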

## 📊 Benchmark Results

Total parameters: ~0.8B (paper setting)

| Benchmark | Split | Score |
|---|---|---|
| VQAv2 | Test | 76.6 |
| POPE | Test | 69.4 |
| AI2D | Test | 46.2 |
| MMMU | Val | 26.4 |
| MME (Perception) | - | 1376.2 |
| SQA-Image | Test | 56.7 |
| MMB | Dev | 64.6 |

**Notes.** Exact scores can vary with decoding settings (temperature, top-p, max new tokens) and with the evaluation pipeline; a deterministic decoding sketch follows below.
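
If you need run-to-run reproducibility when comparing against the table above, greedy decoding removes the sampling variance. This is a suggestion, not the paper's documented evaluation setup:

```python
# Deterministic decoding kwargs for transformers' generate() (a suggestion,
# not the paper's documented evaluation pipeline).
eval_gen_kwargs = dict(
    do_sample=False,       # greedy: no temperature / top-p variance
    num_beams=1,
    max_new_tokens=32,     # short answers suit VQA-style benchmarks
    repetition_penalty=1.0,
)
```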


## 🧩 Usage

### Option A: Use the official repository

🔗 Firebolt-VL Repository: https://github.com/huyquoctrinh/Firebolt-VL

### Option B: Minimal inference example (Transformers-style)

This is a template. Update the model class and forward kwargs to match your implementation.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM

@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ("bf16", "bfloat16") else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
        # Custom architectures on the Hub typically also need trust_remote_code=True.
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed).
    prompt = f"<image>\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    img = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=img, return_tensors="pt")
    pixel_values = image_inputs["pixel_values"].to(device=device, dtype=model.dtype)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-6),
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

    # NOTE: the image kwarg name is implementation-specific; rename
    # pixel_values to whatever your model's generate/forward expects.
    out = model.generate(**inputs, pixel_values=pixel_values, **gen_kwargs)

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    ans = generate_answer(
        model_id_or_path="YOUR_FIREBOLT_VL_PATH_OR_HF_ID",
        image_path="demo.jpg",
        question="What is written in the top right corner?",
    )
    print(ans)
```