---
license: apache-2.0
datasets:
- xiaorui638/cc3m
- liuhaotian/LLaVA-Instruct-150K
- Xkev/LLaVA-CoT-100k
metrics:
- bleu
- accuracy
base_model:
- LiquidAI/LFM2-350M
---

# ⚡ **Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation**

***Note:*** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data.

---

## 🌟 Overview

**Firebolt-VL** is an efficient **vision-language model (VLM)** that replaces Transformer-based cross-attention fusion with a **Cross-modal Modulator (CMM)** composed of:
- **Token–Grid Correlation** (lightweight text–image matching; sketched below),
- **Top-K grid selection** (focus on relevant regions),
- **FiLM modulation** (feature-wise conditioning),
- a **Structured State-Space Model (SSM)** for **linear-time** sequence modeling.

It uses the **Liquid Foundation Model (LFM2-350M)** as the language decoder, enabling strong multimodal reasoning at lower latency than attention-heavy fusion.

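To make the first two components concrete, here is a minimal, self-contained sketch of token–grid correlation with Top-K grid selection. It is illustrative only: the function name, the scoring rule (cosine similarity, max over tokens), and the dimensions are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of token-grid correlation + Top-K grid selection.
# Shapes, names, and the scoring rule are assumptions, not the released code.
def select_topk_grids(text_tokens: torch.Tensor, grid_feats: torch.Tensor, k: int = 16):
    # text_tokens: (B, T, D) projected text-token embeddings
    # grid_feats:  (B, G, D) grid-level visual embeddings from the vision encoder
    t = F.normalize(text_tokens, dim=-1)
    g = F.normalize(grid_feats, dim=-1)

    # Correlation between every text token and every grid cell: (B, T, G)
    corr = torch.einsum("btd,bgd->btg", t, g)

    # Score each grid cell by its best-matching token, keep the Top-K cells
    grid_scores = corr.max(dim=1).values              # (B, G)
    topk_idx = grid_scores.topk(k, dim=-1).indices    # (B, k)

    # Gather the selected, task-relevant grid features: (B, k, D)
    batch_idx = torch.arange(grid_feats.size(0)).unsqueeze(-1)
    return grid_feats[batch_idx, topk_idx]

# Example: 12 text tokens, an 8x8 grid (64 cells), 256-d features
selected = select_topk_grids(torch.randn(1, 12, 256), torch.randn(1, 64, 256), k=8)
print(selected.shape)  # torch.Size([1, 8, 256])
```
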
---

## 🧠 Key Features

- ⚡ **Efficient inference**
  Linear-time sequence modeling via an SSM instead of quadratic self-attention over long contexts.

- 🎯 **Fine-grained visual grounding**
  Token–grid correlation and Top-K selection help the model focus on task-relevant visual regions.

- 🧩 **Lightweight cross-modal fusion**
  FiLM-based conditioning injects visual context without heavy cross-attention (see the sketch below).

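FiLM itself is a small operation: a conditioning vector produces a per-channel scale and shift, `y = gamma(cond) * x + beta(cond)`. The snippet below is a minimal sketch of that idea; the module name, conditioning direction, and dimensions are assumptions rather than Firebolt-VL's exact code.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(cond) * x + beta(cond).
    Minimal sketch; names, dimensions, and conditioning direction are assumptions."""

    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, T, feat_dim) sequence features to modulate
        # cond: (B, cond_dim)    conditioning vector, e.g. pooled Top-K visual grids
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

film = FiLM(cond_dim=256, feat_dim=512)
out = film(torch.randn(2, 12, 512), torch.randn(2, 256))  # (2, 12, 512)
```
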
---

## 🚀 Training

Firebolt-VL is trained in **two stages**:

1. **Stage 1 (CMM warm-up / initialization)**
   Freeze the vision encoder and the LFM decoder; train only the **CMM** on **CC3M** (a freezing sketch follows the hardware note below).

2. **Stage 2 (end-to-end training)**
   Train the full model on instruction / reasoning data (e.g., LLaVA-style instruction data + CoT-style data).

> Hardware used in the paper: **2× H100 80GB**, AdamW, 5 epochs per stage (batch size 128 in Stage 1, 8 in Stage 2).

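Below is a minimal sketch of the Stage 1 setup. It assumes the model exposes `vision_encoder`, `cmm`, and `decoder` submodules; those attribute names, and the learning-rate / weight-decay values, are placeholders to adapt to the actual codebase (only the choice of AdamW comes from the paper's setup).

```python
import torch

# Stage 1 sketch: train only the CMM; keep the vision encoder and LFM decoder frozen.
# `model.vision_encoder`, `model.cmm`, and `model.decoder` are hypothetical attribute
# names; the lr / weight_decay defaults are placeholders, not the paper's settings.
def build_stage1_optimizer(model, lr: float = 1e-4, weight_decay: float = 0.01):
    for module in (model.vision_encoder, model.decoder):
        for p in module.parameters():
            p.requires_grad_(False)      # frozen in Stage 1
    for p in model.cmm.parameters():
        p.requires_grad_(True)           # only the CMM is updated

    # AdamW over the trainable (CMM) parameters
    return torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad],
        lr=lr,
        weight_decay=weight_decay,
    )
```
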
---

## 🏗️ Architecture

<div align="center">
  <a href="./">
    <img src="firebolt_vl.jpg" width="85%" alt="Firebolt-VL Architecture"/>
  </a>
</div>

**Main Components:**
1. 🎨 **Vision Encoder (SigLIP)** – extracts grid-level visual embeddings
2. 🧩 **Cross-modal Modulator (CMM)** – token–grid correlation → FiLM → SSM → FiLM (sketched below)
3. 🧠 **LFM Decoder (LFM2-350M)** – autoregressive reasoning and generation

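To show how component 2 strings the pieces together (correlation → FiLM → SSM → FiLM), here is a compact, self-contained sketch of the CMM pipeline. It reuses the ideas sketched in the Overview and Key Features sections; the names and dimensions are illustrative, and the SSM is replaced by a GRU stand-in purely so the snippet runs.

```python
import torch
import torch.nn as nn

class CrossModalModulatorSketch(nn.Module):
    """Illustrative composition of the CMM pipeline: token-grid correlation ->
    Top-K selection -> FiLM -> sequence model -> FiLM. Names and dimensions are
    assumptions; the GRU is only a runnable stand-in for the linear-time SSM."""

    def __init__(self, dim: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.film_in = nn.Linear(dim, 2 * dim)                # (gamma, beta) from visual context
        self.seq_model = nn.GRU(dim, dim, batch_first=True)   # stand-in for the SSM block
        self.film_out = nn.Linear(dim, 2 * dim)

    @staticmethod
    def _film(x, cond, proj):
        gamma, beta = proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

    def forward(self, text_tokens, grid_feats):
        # text_tokens: (B, T, D), grid_feats: (B, G, D)
        corr = torch.einsum("btd,bgd->btg", text_tokens, grid_feats)   # token-grid correlation
        topk = corr.max(dim=1).values.topk(self.k, dim=-1).indices     # Top-K grid cells (B, k)
        batch = torch.arange(grid_feats.size(0)).unsqueeze(-1)
        vis_ctx = grid_feats[batch, topk].mean(dim=1)                  # pooled visual context (B, D)

        x = self._film(text_tokens, vis_ctx, self.film_in)             # FiLM (pre)
        x, _ = self.seq_model(x)                                       # linear-time sequence mixing
        return self._film(x, vis_ctx, self.film_out)                   # FiLM (post)

cmm = CrossModalModulatorSketch(dim=512, k=8)
fused = cmm(torch.randn(2, 12, 512), torch.randn(2, 64, 512))  # (2, 12, 512)
```
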
---

## 📊 Benchmark Results

**Total parameters:** ~0.8B (paper setting)

| Benchmark | Split | Score |
|---|---:|---:|
| VQAv2 | Test | **76.6** |
| POPE | Test | **69.4** |
| AI2D | Test | **46.2** |
| MMMU | Val | **26.4** |
| MME (Perception) | - | **1376.2** |
| SQA-Image | Test | **56.7** |
| MMB | Dev | **64.6** |

**Note:** Exact results can vary with decoding settings (temperature, top-p, max tokens) and the evaluation pipeline.

---

## 🧩 Usage

### Option A — Use the official repository
🔗 **Firebolt-VL Repository:** https://github.com/huyquoctrinh/Firebolt-VL

### Option B — Minimal inference example (Transformers-style)

> This is a template. Update the model class and forward kwargs to match your implementation.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM


@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ["bf16", "bfloat16"] else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed)
    prompt = f"<image>\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Preprocess the image for the vision encoder
    img = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=img, return_tensors="pt")
    pixel_values = image_inputs["pixel_values"].to(device)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-6),
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

    # NOTE: update the image kwarg name to match your model's forward signature
    # (e.g., pixel_values / image_inputs / images)
    out = model.generate(**inputs, pixel_values=pixel_values, **gen_kwargs)

    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text.strip()


if __name__ == "__main__":
    ans = generate_answer(
        model_id_or_path="YOUR_FIREBOLT_VL_PATH_OR_HF_ID",
        image_path="demo.jpg",
        question="What is written in the top right corner?",
    )
    print(ans)
```