---
license: apache-2.0
datasets:
  - xiaorui638/cc3m
  - liuhaotian/LLaVA-Instruct-150K
  - Xkev/LLaVA-CoT-100k
metrics:
  - bleu
  - accuracy
base_model:
  - LiquidAI/LFM2-350M
---

# ⚡ **Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation**

***Note:*** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data.

---

## 🌟 Overview

**Firebolt-VL** is an efficient **vision-language model (VLM)** that replaces Transformer-based cross-attention fusion with a **Cross-modal Modulator (CMM)** using:

- **Token–Grid Correlation** (lightweight text–image matching),
- **Top-K grid selection** (focus on relevant regions),
- **FiLM modulation** (feature-wise conditioning),
- a **Structured State-Space Model (SSM)** for **linear-time** sequence modeling.

It uses the **Liquid Foundation Model (LFM2-350M)** as the language decoder, enabling strong multimodal reasoning at lower latency.

---

## 🧠 Key Features

- ⚡ **Efficient inference**
  Linear-time sequence modeling via an SSM instead of quadratic self-attention over long contexts.
- 🎯 **Fine-grained visual grounding**
  Token–grid correlation with Top-K selection focuses the model on task-relevant visual regions.
- 🧩 **Lightweight cross-modal fusion**
  FiLM-based conditioning injects visual context without heavy cross-attention.

---

## 🚀 Training

Firebolt-VL is trained in **two stages**:

1. **Stage 1 (CMM warm-up / initialization)**
   Freeze the vision encoder and the LFM decoder; train the **CMM** on **CC3M** (see the freezing sketch in the Architecture section below).
2. **Stage 2 (end-to-end training)**
   Train the full model on instruction / reasoning data (e.g., LLaVA-style instruction data + CoT-style data).

> Hardware used in the paper: **2× H100 80GB** (stage 1 batch size 128, stage 2 batch size 8), AdamW, 5 epochs per stage.

---

## 🏗️ Architecture

**Main Components:**

1. 🎨 **Vision Encoder (SigLIP)** – extracts grid-level visual embeddings
2. 🧩 **Cross-modal Modulator (CMM)** – token–grid correlation → FiLM → SSM → FiLM (sketched below)
3. 🧠 **LFM Decoder (LFM2-350M)** – autoregressive reasoning and generation
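The sketch below walks through this modulation path end to end. It is a minimal stand-in, not the released implementation: module and attribute names, dimensions, and the GRU used in place of the structured SSM block are all assumptions for illustration (see the repository linked under Usage for the actual CMM).

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise linear modulation: y = (1 + gamma(cond)) * x + beta(cond)."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma) * x + beta


class CrossModalModulator(nn.Module):
    """Illustrative CMM: token-grid correlation -> Top-K selection -> FiLM -> SSM -> FiLM."""

    def __init__(self, text_dim: int = 1024, vis_dim: int = 768, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        self.vis_proj = nn.Linear(vis_dim, text_dim)
        self.film_in = FiLM(text_dim, text_dim)
        # Stand-in for the structured SSM block: a GRU also mixes the sequence
        # in linear time, which is the property this sketch needs to show.
        self.ssm = nn.GRU(text_dim, text_dim, batch_first=True)
        self.film_out = FiLM(text_dim, text_dim)

    def forward(self, text_tokens: torch.Tensor, vis_grids: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, text_dim); vis_grids: (B, G, vis_dim)
        v = self.vis_proj(vis_grids)  # (B, G, text_dim)

        # Token-grid correlation: scaled dot product of every text token
        # against every visual grid cell.
        corr = torch.einsum("btd,bgd->btg", text_tokens, v) / v.shape[-1] ** 0.5

        # Top-K grid selection: each token keeps only its K most correlated cells.
        topk_vals, topk_idx = corr.topk(self.top_k, dim=-1)  # (B, T, K)
        weights = topk_vals.softmax(dim=-1)

        # Gather the selected grid features and pool them into one visual
        # context vector per text token.
        sel = torch.gather(
            v.unsqueeze(1).expand(-1, text_tokens.shape[1], -1, -1),
            2,
            topk_idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1]),
        )                                                     # (B, T, K, text_dim)
        vis_ctx = (weights.unsqueeze(-1) * sel).sum(dim=2)    # (B, T, text_dim)

        # FiLM -> SSM -> FiLM: condition the text stream on vision, mix it
        # along the sequence, then recondition the mixed features.
        h = self.film_in(text_tokens, vis_ctx)
        h, _ = self.ssm(h)
        return self.film_out(h, vis_ctx)
```

For example, `CrossModalModulator()(torch.randn(2, 32, 1024), torch.randn(2, 196, 768))` returns a `(2, 32, 1024)` tensor of vision-conditioned text features ready for the LFM decoder.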
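Stage 1 of training updates only this module. A minimal sketch of the parameter-freezing setup, assuming the CMM above is exposed as a `cmm` attribute on the full model (an assumed layout, not necessarily the repository's):

```python
import torch


def stage1_parameters(model: torch.nn.Module):
    """Stage 1: freeze the vision encoder and LFM decoder, train only the CMM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.cmm.parameters():  # `model.cmm` is an assumed attribute name
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


# The paper uses AdamW; only the CMM parameters receive gradients:
# optimizer = torch.optim.AdamW(stage1_parameters(model))
```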
---

## 📊 Benchmark Results

**Total parameters:** ~0.8B (paper setting)

| Benchmark | Split | Score |
|---|:---:|---:|
| VQAv2 | Test | **76.6** |
| POPE | Test | **69.4** |
| AI2D | Test | **46.2** |
| MMMU | Val | **26.4** |
| MME (Perception) | - | **1376.2** |
| SQA-Image | Test | **56.7** |
| MMB | Dev | **64.6** |

**Notes.** Exact results can vary with decoding settings (temperature, top-p, max tokens) and the evaluation pipeline.

---

## 🧩 Usage

### Option A — Use the official repository

🔗 **Firebolt-VL Repository:** https://github.com/huyquoctrinh/Firebolt-VL

### Option B — Minimal inference example (Transformers-style)

> This is a template. Update the model class and forward kwargs to match your implementation.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM


@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ["bf16", "bfloat16"] else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed);
    # the "<image>" placeholder is illustrative and must match your processor.
    prompt = f"<image>\nQuestion: {question}\nAnswer:"

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens and decode only the newly generated answer.
    return tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```
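Example call (the checkpoint path, image, and question are placeholders):

```python
answer = generate_answer(
    "path/to/firebolt-vl-checkpoint",  # placeholder: point to a real checkpoint
    "example.jpg",
    "What is happening in this image?",
)
print(answer)
```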