---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---
# webAI-Official/webAI-ColVec1-9b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into an aligned multi-vector embedding space.

The model was fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

These datasets were filtered, balanced, and merged into a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature               | Detail                                                                     |
| --------------------- | -------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                            |
| **Output**            | Multi-vector (Seq_Len × *2560*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                           |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                             |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                               |

---
53
+
54
+ ### Key Properties
55
+
56
+ - **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens via a vision encoder and injected into the token stream, no separate dual encoders.
57
+
58
+ - **Projection Head:** A single linear layer projects final hidden states → compact embedding space (*hidden_size → 2560 dim*).
59
+ - No activation
60
+ - Fully trained
61
+ - Replaces LM head for retrieval
62
+
63
+ - **Multi-Vector Representation:** Each token becomes an embedding → enables fine-grained token-level matching instead of single-vector pooling.
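
As a concrete illustration, the projection head described above amounts to a single linear map followed by per-token L2 normalization. The sketch below is illustrative only — the hidden size of 4096 and the bias-free layer are assumptions, not confirmed details of the released model:

```python
import torch
import torch.nn as nn

# Assumed placeholder values for illustration; the real hidden size may differ.
HIDDEN_SIZE, EMBED_DIM = 4096, 2560

# Single linear projection, no activation (bias-free here by assumption).
projection = nn.Linear(HIDDEN_SIZE, EMBED_DIM, bias=False)

def embed(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, HIDDEN_SIZE) from the VLM's last layer
    vecs = projection(hidden_states)  # (batch, seq_len, EMBED_DIM)
    # L2-normalize each token embedding to unit length
    return torch.nn.functional.normalize(vecs, p=2, dim=-1)

tokens = torch.randn(1, 8, HIDDEN_SIZE)
multi_vecs = embed(tokens)  # one embedding per token, not a pooled vector
```

The key point is that the output keeps the sequence dimension: every token contributes its own vector to the later MaxSim matching step.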

## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-9b` on the ViDoRe V1 and V3 benchmarks, alongside the other webAI `ColVec1` models. Note that the (M)MTEB leaderboards use Borda ranking: each task acts as a voter that ranks models by their performance on it, models earn more points the higher they rank on a task, and the model with the most total points across all tasks takes the top overall rank.

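The Borda aggregation described above can be sketched in a few lines (a toy illustration of the voting scheme, not the leaderboard's actual code; model names and scores are made up):

```python
from collections import defaultdict

def borda(task_scores: dict) -> list:
    # task_scores: {task_name: {model_name: metric}}, higher metric is better.
    # Each task ranks all models; a model at rank r (0 = best) among n models
    # earns n - 1 - r points; points are summed across tasks.
    points = defaultdict(int)
    for scores in task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for rank, model in enumerate(ranked):
            points[model] += n - 1 - rank
    return sorted(points, key=points.get, reverse=True)

tasks = {
    "taskA": {"m1": 0.9, "m2": 0.8, "m3": 0.7},  # m1 wins taskA
    "taskB": {"m1": 0.6, "m2": 0.9, "m3": 0.7},  # m2 wins taskB
}
overall = borda(tasks)  # overall order reflects total points, not any single task
```

Here `m2` tops the overall ranking (3 points) even though it only wins one task, because it also places well on the other.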
### ViDoRe V3 (NDCG@10)

### ViDoRe V1 (NDCG@5)

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int`, optional         | Maximum sequence length; defaults to `None` (processor default).    |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type            | Description                                                      |
| ------------ | --------------- | ---------------------------------------------------------------- |
| `texts`      | `List[str]`     | Natural-language query strings.                                  |
| `max_length` | `int`, optional | Maximum sequence length; defaults to `None` (processor default). |
106
+
107
+ ```python
108
+ batch = processor.process_queries(texts=["What is the revenue for Q3?"])
109
+ batch = {k: v.to(device) for k, v in batch.items()}
110
+ embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
111
+ ```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is taken; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Device on which scores are computed (default: `None`).                 |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```
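
For reference, the MaxSim rule computed by `score_multi_vector` can be sketched in plain PyTorch. This is an illustrative re-implementation of the scoring definition above, not the processor's actual (batched) code:

```python
import torch

def maxsim_scores(qs: list, ps: list) -> torch.Tensor:
    # qs: list of (seq_len_q, dim) query token embeddings
    # ps: list of (seq_len_p, dim) passage token embeddings
    scores = torch.zeros(len(qs), len(ps))
    for i, q in enumerate(qs):
        for j, p in enumerate(ps):
            sim = q @ p.T  # (seq_len_q, seq_len_p) token-token dot products
            # For each query token, take its best-matching passage token,
            # then sum those maxima into one scalar per (query, passage) pair.
            scores[i, j] = sim.max(dim=1).values.sum()
    return scores
```

Because every query token is matched independently, a single rare term in the query can still pin down the right document page even when the rest of the tokens match weakly.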

### Prerequisites

We strongly recommend installing `flash-attn`. If it is unavailable, switch to `attn_implementation="sdpa"` when loading the model.

Currently only `torch==2.8.0` is supported; for newer PyTorch versions, please build FlashAttention manually, otherwise throughput may be low. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.
141
+
142
+ ```bash
143
+ pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
144
+ pip install transformers pillow requests
145
+ pip install flash-attn --no-build-isolation
146
+ ```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-9b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval performance on the ViDoRe V1 & V3 benchmarks, with particularly strong results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without relying on an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Still larger than single-vector baselines despite the smaller token dimension.

### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
}
```