---
license: apache-2.0
datasets:
- xiaorui638/cc3m
- liuhaotian/LLaVA-Instruct-150K
- Xkev/LLaVA-CoT-100k
metrics:
- bleu
- accuracy
base_model:
- LiquidAI/LFM2-350M
---
|
|
|
|
|
# ⚡ **Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation** |
|
|
|
|
|
***Note:*** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🌟 Overview |
|
|
|
|
|
**Firebolt-VL** is an efficient **vision-language model (VLM)** that replaces Transformer-based cross-attention fusion with a **Cross-modal Modulator (CMM)** using: |
|
|
- **Token–Grid Correlation** (lightweight text–image matching), |
|
|
- **Top-K grid selection** (focus on relevant regions), |
|
|
- **FiLM modulation** (feature-wise conditioning), |
|
|
- **Structured State-Space Model (SSM)** for **linear-time** sequence modeling. |
|
|
|
|
|
It is built on the **Liquid Foundation Model (LFM2-350M)** as the language decoder, enabling strong multimodal reasoning at lower latency. A minimal sketch of the CMM fusion path follows.
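To make the fusion path concrete, here is a small, self-contained sketch of the CMM dataflow (token–grid correlation → Top-K selection → FiLM → SSM → FiLM). This is **not** the official implementation: the dimensions, the diagonal SSM recurrence, and the mean-pooling of selected grids are illustrative assumptions; see the repository linked in the Usage section for the real code.

```python
import torch
import torch.nn as nn


class CrossModalModulatorSketch(nn.Module):
    """Illustrative CMM: correlate, select Top-K grids, then FiLM -> SSM -> FiLM."""

    def __init__(self, d_model: int = 512, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        # FiLM generators: pooled visual context -> per-channel (gamma, beta)
        self.film_in = nn.Linear(d_model, 2 * d_model)
        self.film_out = nn.Linear(d_model, 2 * d_model)
        # Minimal diagonal SSM parameters (an assumption, not the paper's exact SSM)
        self.log_a = nn.Parameter(torch.zeros(d_model))  # state decay
        self.b = nn.Parameter(torch.ones(d_model))       # input projection
        self.c = nn.Parameter(torch.ones(d_model))       # output projection

    @staticmethod
    def film(x, ctx, proj):
        # Feature-wise affine conditioning: gamma(ctx) * x + beta(ctx)
        gamma, beta = proj(ctx).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

    def forward(self, text, grids):
        # text: (B, T, D) token embeddings; grids: (B, G, D) visual grid features
        # 1) Token-grid correlation: scaled dot-product similarity per pair
        sim = torch.einsum("btd,bgd->btg", text, grids) / text.size(-1) ** 0.5
        # 2) Top-K grid selection: keep the grids most relevant across tokens
        idx = sim.mean(dim=1).topk(self.top_k, dim=-1).indices  # (B, K)
        sel = torch.gather(grids, 1, idx.unsqueeze(-1).expand(-1, -1, grids.size(-1)))
        ctx = sel.mean(dim=1)                                   # pooled context (B, D)
        # 3) FiLM -> linear-time SSM scan over tokens -> FiLM
        x = self.film(text, ctx, self.film_in)
        a = torch.sigmoid(self.log_a)  # keep the recurrence stable in (0, 1)
        h = torch.zeros(x.size(0), x.size(-1), device=x.device, dtype=x.dtype)
        ys = []
        for t in range(x.size(1)):     # O(T): one state update per token
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return self.film(torch.stack(ys, dim=1), ctx, self.film_out)


# Example: fuse 8 text tokens with an 8x8 grid of visual features
# out = CrossModalModulatorSketch()(torch.randn(2, 8, 512), torch.randn(2, 64, 512))
```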
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Key Features |
|
|
|
|
|
- ⚡ **Efficient inference** |
|
|
  Linear-time sequence modeling via an SSM replaces quadratic self-attention over long contexts.
|
|
|
|
|
- 🎯 **Fine-grained visual grounding** |
|
|
  Token–grid correlation and Top-K selection help the model focus on task-relevant visual regions.
|
|
|
|
|
- 🧩 **Lightweight cross-modal fusion** |
|
|
FiLM-based conditioning injects visual context without heavy cross-attention. |
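For reference, FiLM conditioning is a feature-wise affine transform whose scale and shift are predicted from the visual context (standard FiLM formulation; notation is ours):

$$\hat{h} = \gamma(v) \odot h + \beta(v)$$

where $h$ is a text-side hidden feature, $v$ is the pooled visual context, and $\odot$ denotes element-wise multiplication.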
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Training |
|
|
|
|
|
Firebolt-VL is trained in **two stages**: |
|
|
|
|
|
1. **Stage 1 (CMM warm-up / initialization)** |
|
|
   Freeze the vision encoder and the LFM decoder; train only the **CMM** on **CC3M** (see the freezing sketch below).
|
|
|
|
|
2. **Stage 2 (end-to-end training)** |
|
|
   Train the full model on instruction and reasoning data (e.g., LLaVA-style instruction data plus CoT-style reasoning data).
|
|
|
|
|
> Hardware used in the paper: **2× H100 80GB** (batch size 128 in stage 1, 8 in stage 2), AdamW optimizer, 5 epochs per stage.
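As a rough sketch of the stage-1 recipe (freeze the vision encoder and LFM decoder, train only the CMM), the snippet below disables gradients on the frozen modules and builds an AdamW optimizer over what remains. The attribute names `vision_encoder`, `decoder`, and `cmm`, as well as the learning rate, are hypothetical placeholders.

```python
import torch


def configure_stage1(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage 1: freeze vision encoder + LFM decoder; train only the CMM."""
    # NOTE: attribute names below are placeholders for the actual modules.
    for module in (model.vision_encoder, model.decoder):
        for p in module.parameters():
            p.requires_grad = False
    for p in model.cmm.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    # lr / weight decay are illustrative; the paper only specifies AdamW
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
```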
|
|
|
|
|
--- |
|
|
|
|
|
## 🏗️ Architecture |
|
|
|
|
|
<div align="center"> |
|
|
<a href="./"> |
|
|
<img src="firebolt_vl.jpg" width="85%" alt="Firebolt-VL Architecture"/> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
**Main Components:** |
|
|
1. 🎨 **Vision Encoder (SigLIP)** – extracts grid-level visual embeddings |
|
|
2. 🧩 **Cross-modal Modulator (CMM)** – token–grid correlation → FiLM → SSM → FiLM |
|
|
3. 🧠 **LFM Decoder (LFM2-350M)** – autoregressive reasoning and generation |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Benchmark Results |
|
|
|
|
|
**Total parameters:** ~0.8B (paper setting) |
|
|
|
|
|
| Benchmark | Split | Score |
|---|---:|---:|
| VQAv2 | Test | **76.6** |
| POPE | Test | **69.4** |
| AI2D | Test | **46.2** |
| MMMU | Val | **26.4** |
| MME (Perception) | – | **1376.2** |
| SQA-Image | Test | **56.7** |
| MMB | Dev | **64.6** |
|
|
|
|
|
**Notes.** Exact results can vary with decoding settings (temperature, top-p, max new tokens) and the evaluation pipeline.
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Usage |
|
|
|
|
|
### Option A — Use the official repository |
|
|
🔗 **Firebolt-VL Repository:** https://github.com/huyquoctrinh/Firebolt-VL |
|
|
|
|
|
### Option B — Minimal inference example (Transformers-style) |
|
|
|
|
|
> This is a template. Update the model class and forward kwargs to match your implementation. |
|
|
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM


@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
) -> str:
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ("bf16", "bfloat16") else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed)
    prompt = f"<image>\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Preprocess the image for the vision encoder
    img = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=img, return_tensors="pt")
    pixel_values = image_inputs["pixel_values"].to(device)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-6),
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

    # NOTE: update the image kwarg name to match your model
    # (e.g., pixel_values / image_inputs)
    out = model.generate(**inputs, pixel_values=pixel_values, **gen_kwargs)

    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text.strip()


if __name__ == "__main__":
    ans = generate_answer(
        model_id_or_path="YOUR_FIREBOLT_VL_PATH_OR_HF_ID",
        image_path="demo.jpg",
        question="What is written in the top right corner?",
    )
    print(ans)
```