---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---
# webAI-Official/webAI-ColVec1-9b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into an aligned multi-vector embedding space.

The model was fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

These datasets were filtered, balanced, and merged into a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature               | Detail                                                                     |
| --------------------- | -------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                            |
| **Output**            | Multi-vector (Seq_Len × *2560*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                           |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                             |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                               |

---
53
+
54
+ ### Key Properties
55
+
56
+ - **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens via a vision encoder and injected into the token stream, no separate dual encoders.
57
+
58
+ - **Projection Head:** A single linear layer projects final hidden states → compact embedding space (*hidden_size → 2560 dim*).
59
+ - No activation
60
+ - Fully trained
61
+ - Replaces LM head for retrieval
62
+
63
+ - **Multi-Vector Representation:** Each token becomes an embedding → enables fine-grained token-level matching instead of single-vector pooling.
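
As a concrete illustration, the projection head described above amounts to a single linear map followed by per-token L2 normalization. The sketch below is illustrative only — the hidden size of 4096 and the bias-free layer are assumptions, not confirmed details of the released model:

```python
import torch
import torch.nn as nn

# Assumed placeholder values for illustration; the real hidden size may differ.
HIDDEN_SIZE, EMBED_DIM = 4096, 2560

# Single linear projection, no activation (bias-free here by assumption).
projection = nn.Linear(HIDDEN_SIZE, EMBED_DIM, bias=False)

def embed(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, HIDDEN_SIZE) from the VLM's last layer
    vecs = projection(hidden_states)  # (batch, seq_len, EMBED_DIM)
    # L2-normalize each token embedding to unit length
    return torch.nn.functional.normalize(vecs, p=2, dim=-1)

tokens = torch.randn(1, 8, HIDDEN_SIZE)
multi_vecs = embed(tokens)  # one embedding per token, not a pooled vector
```

The key point is that the output keeps the sequence dimension: every token contributes its own vector to the later MaxSim matching step.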

## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-9b` on the ViDoRe V1 and V3 benchmarks, alongside the other webAI `ColVec1` models. Note that the (M)MTEB leaderboards use Borda ranking: each task acts as a voter that ranks models by their performance on it, models earn more points the higher they rank on a task, and the model with the most total points across all tasks takes the top overall rank.

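The Borda aggregation described above can be sketched in a few lines (a toy illustration of the voting scheme, not the leaderboard's actual code; model names and scores are made up):

```python
from collections import defaultdict

def borda(task_scores: dict) -> list:
    # task_scores: {task_name: {model_name: metric}}, higher metric is better.
    # Each task ranks all models; a model at rank r (0 = best) among n models
    # earns n - 1 - r points; points are summed across tasks.
    points = defaultdict(int)
    for scores in task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for rank, model in enumerate(ranked):
            points[model] += n - 1 - rank
    return sorted(points, key=points.get, reverse=True)

tasks = {
    "taskA": {"m1": 0.9, "m2": 0.8, "m3": 0.7},  # m1 wins taskA
    "taskB": {"m1": 0.6, "m2": 0.9, "m3": 0.7},  # m2 wins taskB
}
overall = borda(tasks)  # overall order reflects total points, not any single task
```

Here `m2` tops the overall ranking (3 points) even though it only wins one task, because it also places well on the other.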
### ViDoRe V3 (NDCG@10)

### ViDoRe V1 (NDCG@5)

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int`, optional         | Maximum sequence length; defaults to `None` (processor default).    |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type            | Description                                                      |
| ------------ | --------------- | ---------------------------------------------------------------- |
| `texts`      | `List[str]`     | Natural-language query strings.                                  |
| `max_length` | `int`, optional | Maximum sequence length; defaults to `None` (processor default). |
106
+
107
+ ```python
108
+ batch = processor.process_queries(texts=["What is the revenue for Q3?"])
109
+ batch = {k: v.to(device) for k, v in batch.items()}
110
+ embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
111
+ ```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is taken; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Device on which scores are computed (default: `None`).                 |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```
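
For reference, the MaxSim rule computed by `score_multi_vector` can be sketched in plain PyTorch. This is an illustrative re-implementation of the scoring definition above, not the processor's actual (batched) code:

```python
import torch

def maxsim_scores(qs: list, ps: list) -> torch.Tensor:
    # qs: list of (seq_len_q, dim) query token embeddings
    # ps: list of (seq_len_p, dim) passage token embeddings
    scores = torch.zeros(len(qs), len(ps))
    for i, q in enumerate(qs):
        for j, p in enumerate(ps):
            sim = q @ p.T  # (seq_len_q, seq_len_p) token-token dot products
            # For each query token, take its best-matching passage token,
            # then sum those maxima into one scalar per (query, passage) pair.
            scores[i, j] = sim.max(dim=1).values.sum()
    return scores
```

Because every query token is matched independently, a single rare term in the query can still pin down the right document page even when the rest of the tokens match weakly.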

### Prerequisites

We strongly recommend installing `flash-attn`. If it is unavailable, switch to `attn_implementation="sdpa"` when loading the model.

Currently only `torch==2.8.0` is supported; for newer PyTorch versions, please build FlashAttention manually, otherwise throughput may be low. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.
141
+
142
+ ```bash
143
+ pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
144
+ pip install transformers pillow requests
145
+ pip install flash-attn --no-build-isolation
146
+ ```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-9b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval performance on the ViDoRe V1 & V3 benchmarks, with particularly strong results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without relying on an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Still larger than single-vector baselines despite the smaller token dimension.

### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
}
```