---
license: apache-2.0
datasets:
- xiaorui638/cc3m
- liuhaotian/LLaVA-Instruct-150K
- Xkev/LLaVA-CoT-100k
metrics:
- bleu
- accuracy
base_model:
- LiquidAI/LFM2-350M
---

# ⚡ **Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation**

***Note:*** Firebolt-VL is an efficient VLM designed for fast, fine-grained grounding. If you adapt it to a new domain, we recommend fine-tuning on your target data.

---

## 🌟 Overview

**Firebolt-VL** is an efficient **vision-language model (VLM)** that replaces Transformer-based cross-attention fusion with a **Cross-modal Modulator (CMM)** built from:
- **Token–Grid Correlation** (lightweight text–image matching),
- **Top-K grid selection** (focus on relevant regions),
- **FiLM modulation** (feature-wise conditioning),
- a **Structured State-Space Model (SSM)** for **linear-time** sequence modeling.

It uses the **Liquid Foundation Model (LFM2-350M)** as its language decoder, enabling strong multimodal reasoning at lower latency.
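
For intuition, below is a minimal PyTorch sketch of the modulator's core steps (token–grid correlation → Top-K grid selection → FiLM). All names, shapes, and the single-FiLM simplification are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class CMMSketch(nn.Module):
    """Toy Cross-modal Modulator: correlate text tokens with image grid
    features, keep the Top-K most relevant grids, then FiLM-condition the
    text stream on the pooled visual context. Shapes/names are assumptions."""

    def __init__(self, dim: int = 768, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        # FiLM generator: pooled visual context -> per-channel scale & shift
        self.film = nn.Linear(dim, 2 * dim)

    def forward(self, text: torch.Tensor, grids: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token embeddings; grids: (B, G, D) grid embeddings
        # Token-grid correlation: lightweight dot-product matching,
        # aggregated over text tokens to score each grid cell
        scores = torch.einsum("btd,bgd->bg", text, grids)            # (B, G)
        # Top-K grid selection: keep only the most text-relevant regions
        top_idx = scores.topk(self.top_k, dim=-1).indices            # (B, K)
        selected = grids.gather(
            1, top_idx.unsqueeze(-1).expand(-1, -1, grids.size(-1))
        )                                                            # (B, K, D)
        # FiLM modulation: feature-wise affine conditioning of the text stream
        gamma, beta = self.film(selected.mean(dim=1)).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * text + beta.unsqueeze(1)         # (B, T, D)
```

In the full model the SSM sits between two FiLM stages (correlation → FiLM → SSM → FiLM; see Architecture below), which this sketch omits for brevity.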

---

## 🧠 Key Features

- ⚡ **Efficient inference**
  Linear-time sequence modeling via an SSM replaces quadratic self-attention over long contexts (a toy scan illustrating this appears after this list).

- 🎯 **Fine-grained visual grounding**
  Token–grid correlation plus Top-K selection help the model focus on task-relevant visual regions.

- 🧩 **Lightweight cross-modal fusion**
  FiLM-based conditioning injects visual context without heavy cross-attention.
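
To see why the SSM path is linear-time, consider a toy diagonal state-space recurrence: each step does a constant amount of per-channel work, so a length-`T` sequence costs O(T) rather than the O(T²) of self-attention. This is a generic illustration, not Firebolt-VL's exact SSM.

```python
import torch


def diagonal_ssm_scan(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, with y_t = h_t.
    x: (B, T, D); a, b: (D,) per-channel parameters (illustrative).
    A single pass over T gives O(T * D) work overall."""
    B, T, D = x.shape
    h = x.new_zeros(B, D)
    ys = []
    for t in range(T):  # sequential scan: one state update per step
        h = a * h + b * x[:, t]
        ys.append(h)
    return torch.stack(ys, dim=1)  # (B, T, D)
```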

---

## 🚀 Training

Firebolt-VL is trained in **two stages**:

1. **Stage 1 (CMM warm-up / initialization)**
   Freeze the vision encoder and LFM decoder; train the **CMM** on **CC3M**.

2. **Stage 2 (end-to-end training)**
   Train the full model on instruction/reasoning data (e.g., LLaVA-style instruction data plus CoT-style data).

> Hardware used in the paper: **2× H100 80GB** (stage 1 batch size 128, stage 2 batch size 8), AdamW, 5 epochs per stage.
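
A minimal sketch of the Stage 1 setup is shown below: freeze the encoder and decoder, then optimize only the CMM with AdamW. The attribute names (`vision_encoder`, `cmm`, `decoder`) and the hyperparameters are placeholders, not the paper's exact configuration.

```python
import torch


def setup_stage1(model, lr: float = 1e-4, weight_decay: float = 0.01):
    # Stage 1: freeze SigLIP encoder and LFM decoder, train only the CMM.
    # Attribute names below are assumptions about the model's structure.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():
        p.requires_grad = False
    for p in model.cmm.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```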

---

## 🏗️ Architecture

<div align="center">
  <a href="./">
    <img src="firebolt_vl.jpg" width="85%" alt="Firebolt-VL Architecture"/>
  </a>
</div>

**Main Components:**
1. 🎨 **Vision Encoder (SigLIP)** – extracts grid-level visual embeddings
2. 🧩 **Cross-modal Modulator (CMM)** – token–grid correlation → FiLM → SSM → FiLM
3. 🧠 **LFM Decoder (LFM2-350M)** – autoregressive reasoning and generation
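
For the first component, grid-level embeddings can be extracted from a SigLIP checkpoint with Transformers as sketched below; `google/siglip-base-patch16-224` is only an example checkpoint and may differ from the encoder Firebolt-VL actually uses.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

# Example SigLIP checkpoint (assumption; swap in Firebolt-VL's encoder)
ckpt = "google/siglip-base-patch16-224"
encoder = SiglipVisionModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.inference_mode():
    # last_hidden_state: (1, num_patches, hidden) grid-level embeddings
    grids = encoder(pixel_values=pixels).last_hidden_state
print(grids.shape)  # (1, 196, 768) for a 224px image with 16px patches
```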

---

## 📊 Benchmark Results

**Total parameters:** ~0.8B (paper setting)

| Benchmark | Split | Score |
|---|---:|---:|
| VQAv2 | Test | **76.6** |
| POPE | Test | **69.4** |
| AI2D | Test | **46.2** |
| MMMU | Val | **26.4** |
| MME (Perception) | - | **1376.2** |
| SQA-Image | Test | **56.7** |
| MMB | Dev | **64.6** |

**Notes.** Exact results can vary with decoding settings (temperature, top-p, max tokens) and the evaluation pipeline.

---

## 🧩 Usage

### Option A — Use the official repository
🔗 **Firebolt-VL Repository:** https://github.com/huyquoctrinh/Firebolt-VL

### Option B — Minimal inference example (Transformers-style)

> This is a template. Update the model class and forward kwargs to match your implementation.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM


@torch.inference_mode()
def generate_answer(
    model_id_or_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    amp_dtype = torch.bfloat16 if dtype.lower() in ["bf16", "bfloat16"] else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path, use_fast=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path)

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()

    # Build a simple prompt (replace with your chat template if needed)
    prompt = f"<image>\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    img = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=img, return_tensors="pt")
    # Match the vision input's device and dtype to the model
    pixel_values = image_inputs["pixel_values"].to(device=device, dtype=model.dtype)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-6),
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

    # NOTE: update the kwarg name to match your model's forward signature
    # (e.g., image_inputs / pixel_values)
    out = model.generate(**inputs, pixel_values=pixel_values, **gen_kwargs)

    # Decode only the newly generated tokens, not the echoed prompt
    prompt_len = inputs["input_ids"].shape[-1]
    text = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
    return text.strip()


if __name__ == "__main__":
    ans = generate_answer(
        model_id_or_path="YOUR_FIREBOLT_VL_PATH_OR_HF_ID",
        image_path="demo.jpg",
        question="What is written in the top right corner?",
    )
    print(ans)
```