---
license: apache-2.0
datasets:
- xiaorui638/cc3m
- liuhaotian/LLaVA-Instruct-150K
- Xkev/LLaVA-CoT-100k
metrics:
- bleu
- accuracy
base_model:
- LiquidAI/LFM2-350M
---
|
|
|
|
|
# ⚡ **Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation** |
|
|
|
|
|
***Note:*** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🌟 Overview |
|
|
|
|
|
**Firebolt-VL** is an efficient **vision-language model (VLM)** that replaces Transformer-based cross-attention fusion with a **Cross-modal Modulator (CMM)** using: |
|
|
- **Token–Grid Correlation** (lightweight text–image matching), |
|
|
- **Top-K grid selection** (focus on relevant regions), |
|
|
- **FiLM modulation** (feature-wise conditioning), |
|
|
- **Structured State-Space Model (SSM)** for **linear-time** sequence modeling. |
|
|
|
|
|
It is built on the **Liquid Foundation Model (LFM2-350M)** as the language decoder, enabling strong multimodal reasoning at lower latency. A minimal sketch of the CMM fusion path follows.
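To make the fusion path concrete, here is a small, self-contained sketch of the CMM dataflow (token–grid correlation → Top-K selection → FiLM → SSM → FiLM). This is **not** the official implementation: the dimensions, the diagonal SSM recurrence, and the mean-pooling of selected grids are illustrative assumptions; see the repository linked in the Usage section for the real code.

```python
import torch
import torch.nn as nn


class CrossModalModulatorSketch(nn.Module):
    """Illustrative CMM: correlate, select Top-K grids, then FiLM -> SSM -> FiLM."""

    def __init__(self, d_model: int = 512, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        # FiLM generators: pooled visual context -> per-channel (gamma, beta)
        self.film_in = nn.Linear(d_model, 2 * d_model)
        self.film_out = nn.Linear(d_model, 2 * d_model)
        # Minimal diagonal SSM parameters (an assumption, not the paper's exact SSM)
        self.log_a = nn.Parameter(torch.zeros(d_model))  # state decay
        self.b = nn.Parameter(torch.ones(d_model))       # input projection
        self.c = nn.Parameter(torch.ones(d_model))       # output projection

    @staticmethod
    def film(x, ctx, proj):
        # Feature-wise affine conditioning: gamma(ctx) * x + beta(ctx)
        gamma, beta = proj(ctx).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

    def forward(self, text, grids):
        # text: (B, T, D) token embeddings; grids: (B, G, D) visual grid features
        # 1) Token-grid correlation: scaled dot-product similarity per pair
        sim = torch.einsum("btd,bgd->btg", text, grids) / text.size(-1) ** 0.5
        # 2) Top-K grid selection: keep the grids most relevant across tokens
        idx = sim.mean(dim=1).topk(self.top_k, dim=-1).indices  # (B, K)
        sel = torch.gather(grids, 1, idx.unsqueeze(-1).expand(-1, -1, grids.size(-1)))
        ctx = sel.mean(dim=1)                                   # pooled context (B, D)
        # 3) FiLM -> linear-time SSM scan over tokens -> FiLM
        x = self.film(text, ctx, self.film_in)
        a = torch.sigmoid(self.log_a)  # keep the recurrence stable in (0, 1)
        h = torch.zeros(x.size(0), x.size(-1), device=x.device, dtype=x.dtype)
        ys = []
        for t in range(x.size(1)):     # O(T): one state update per token
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return self.film(torch.stack(ys, dim=1), ctx, self.film_out)


# Example: fuse 8 text tokens with an 8x8 grid of visual features
# out = CrossModalModulatorSketch()(torch.randn(2, 8, 512), torch.randn(2, 64, 512))
```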
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Key Features |
|
|
|
|
|
- ⚡ **Efficient inference** |
|
|
  Linear-time sequence modeling via an SSM replaces quadratic self-attention over long contexts.
|
|
|
|
|
- 🎯 **Fine-grained visual grounding** |
|
|
  Token–grid correlation and Top-K selection help the model focus on task-relevant visual regions.
|
|
|
|
|
- 🧩 **Lightweight cross-modal fusion** |
|
|
FiLM-based conditioning injects visual context without heavy cross-attention. |
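For reference, FiLM conditioning is a feature-wise affine transform whose scale and shift are predicted from the visual context (standard FiLM formulation; notation is ours):

$$\hat{h} = \gamma(v) \odot h + \beta(v)$$

where $h$ is a text-side hidden feature, $v$ is the pooled visual context, and $\odot$ denotes element-wise multiplication.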
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Training |
|
|
|
|
|
Firebolt-VL is trained in **two stages**: |
|
|
|
|
|
1. **Stage 1 (CMM warm-up / initialization)** |
|
|
   Freeze the vision encoder and the LFM decoder; train only the **CMM** on **CC3M** (see the freezing sketch below).
|
|
|
|
|
2. **Stage 2 (end-to-end training)** |
|
|
   Train the full model on instruction and reasoning data (e.g., LLaVA-style instruction data plus CoT-style reasoning data).
|
|
|
|
|
> Hardware used in the paper: **2× H100 80GB** (batch size 128 in stage 1, 8 in stage 2), AdamW optimizer, 5 epochs per stage.
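As a rough sketch of the stage-1 recipe (freeze the vision encoder and LFM decoder, train only the CMM), the snippet below disables gradients on the frozen modules and builds an AdamW optimizer over what remains. The attribute names `vision_encoder`, `decoder`, and `cmm`, as well as the learning rate, are hypothetical placeholders.

```python
import torch


def configure_stage1(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage 1: freeze vision encoder + LFM decoder; train only the CMM."""
    # NOTE: attribute names below are placeholders for the actual modules.
    for module in (model.vision_encoder, model.decoder):
        for p in module.parameters():
            p.requires_grad = False
    for p in model.cmm.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    # lr / weight decay are illustrative; the paper only specifies AdamW
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
```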
|
|
|
|
|
--- |
|
|
|
|
|
## 🏗️ Architecture |
|
|
|
|
|
<div align="center"> |
|
|
<a href="./"> |
|
|
<img src="firebolt_vl.jpg" width="85%" alt="Firebolt-VL Architecture"/> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
**Main Components:** |
|
|
1. 🎨 **Vision Encoder (SigLIP)** – extracts grid-level visual embeddings |
|
|
2. 🧩 **Cross-modal Modulator (CMM)** – token–grid correlation → FiLM → SSM → FiLM |
|
|
3. 🧠 **LFM Decoder (LFM2-350M)** – autoregressive reasoning and generation |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Benchmark Results |
|
|
|
|
|
**Total parameters:** ~0.8B (paper setting) |
|
|
|
|
|
| Benchmark | Split | Score |
|---|---:|---:|
| VQAv2 | Test | **76.6** |
| POPE | Test | **69.4** |
| AI2D | Test | **46.2** |
| MMMU | Val | **26.4** |
| MME (Perception) | – | **1376.2** |
| SQA-Image | Test | **56.7** |
| MMB | Dev | **64.6** |
|
|
|
|
|
**Notes.** Exact results can vary with decoding settings (temperature, top-p, max new tokens) and the evaluation pipeline.
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Usage |
|
|
|
|
|
### Option A — Use the official repository |
|
|
🔗 **Firebolt-VL Repository:** https://github.com/huyquoctrinh/Firebolt-VL |
|
|
|
|
|
### Option B — Minimal inference example (Transformers-style) |
|
|
|
|
|
> This is a template. Update the model class and forward kwargs to match your implementation. |
|
|
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM


@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
) -> str:
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ("bf16", "bfloat16") else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed)
    prompt = f"<image>\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Preprocess the image for the vision encoder
    img = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=img, return_tensors="pt")
    pixel_values = image_inputs["pixel_values"].to(device)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-6),
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

    # NOTE: update the image kwarg name to match your model
    # (e.g., pixel_values / image_inputs)
    out = model.generate(**inputs, pixel_values=pixel_values, **gen_kwargs)

    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text.strip()


if __name__ == "__main__":
    ans = generate_answer(
        model_id_or_path="YOUR_FIREBOLT_VL_PATH_OR_HF_ID",
        image_path="demo.jpg",
        question="What is written in the top right corner?",
    )
    print(ans)
```