---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
  - multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
  - text
  - image
  - video
  - multimodal-embedding
  - vidore
  - colpali
  - colqwen3_5
  - multilingual-embedding
---
# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).
## 🛠️ Model Specifications

| Feature               | Detail                                                                    |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                           |
| **Output**            | Multi-vector (Seq_Len × *640*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                          |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                            |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                              |

---
### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.
- **Projection Head:** A single linear layer projects the final hidden states into a compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces the LM head for retrieval
- **Multi-Vector Representation:** Each token becomes an embedding, which enables fine-grained token-level matching instead of single-vector pooling.
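
The projection head and multi-vector output described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the model's actual code: the class name `ProjectionHeadSketch` and the toy `hidden_size=64` are hypothetical, and whether the real layer uses a bias is not stated here (the sketch keeps the `nn.Linear` default).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHeadSketch(nn.Module):
    """Hypothetical stand-in for the retrieval head: one linear layer, no activation."""

    def __init__(self, hidden_size: int, embed_dim: int = 640):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the shared backbone.
        embeddings = self.proj(hidden_states)
        # L2-normalize every token vector so dot products equal cosine similarities.
        return F.normalize(embeddings, p=2, dim=-1)


torch.manual_seed(0)
head = ProjectionHeadSketch(hidden_size=64)  # toy hidden size, not the real one
fake_hidden = torch.randn(2, 5, 64)          # 2 sequences of 5 tokens each
multi_vectors = head(fake_hidden)
print(multi_vectors.shape)                   # torch.Size([2, 5, 640])
```

Each of the 5 tokens keeps its own 640-dimensional unit vector; nothing is pooled into a single embedding.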
## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
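
The Borda idea described above can be sketched as follows. This is a plain Borda count over made-up numbers: the model names, task names, and scores are hypothetical, and the leaderboard's exact tie handling may differ from this simplified version.

```python
from collections import defaultdict


def borda_points(task_scores):
    """Sum Borda points over tasks; task_scores maps task -> {model: score}."""
    totals = defaultdict(int)
    for scores in task_scores.values():
        # Best score first; position r among n models earns n - 1 - r points.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for position, model in enumerate(ranked):
            totals[model] += len(ranked) - 1 - position
    return dict(totals)


# Hypothetical scores for three made-up models on three made-up tasks.
example_scores = {
    "task_a": {"model_x": 0.80, "model_y": 0.75, "model_z": 0.60},
    "task_b": {"model_x": 0.55, "model_y": 0.70, "model_z": 0.65},
    "task_c": {"model_x": 0.90, "model_y": 0.85, "model_z": 0.88},
}
totals = borda_points(example_scores)
leaderboard = sorted(totals, key=totals.get, reverse=True)
print(totals)       # {'model_x': 4, 'model_y': 3, 'model_z': 2}
print(leaderboard)  # ['model_x', 'model_y', 'model_z']
```

In this toy data `model_y` has the higher mean score (about 0.767 versus 0.750), yet `model_x` ranks first because it wins two of the three tasks. This is why a Borda rank and an average-score rank can disagree.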
### ViDoRe V3 (NDCG@10)

### ViDoRe V1 (NDCG@5)

---
## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int` or `None`         | Optional; defaults to `None`.                                       |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---
#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type            | Description                     |
| ------------ | --------------- | ------------------------------- |
| `texts`      | `List[str]`     | Natural-language query strings. |
| `max_length` | `int` or `None` | Optional; defaults to `None`.   |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---
#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Optional; defaults to `None`.                                          |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```
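
To make the MaxSim rule concrete, here is a minimal reference sketch in plain Python. It is written for readability rather than speed and is not the processor's implementation; the helper names `maxsim_score` and `score_matrix` and the toy 2-dimensional vectors are hypothetical. For real workloads, call `processor.score_multi_vector`.

```python
def maxsim_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the best dot product against any passage token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * pi for qi, pi in zip(q, p)) for p in passage_vecs)
    return total


def score_matrix(queries, passages):
    """Return scores[i][j] for query i and passage j."""
    return [[maxsim_score(q, p) for p in passages] for q in queries]


# Toy 2-dimensional unit vectors (hypothetical; the real model emits 640 dims).
queries = [
    [[1.0, 0.0], [0.0, 1.0]],  # query 0: two tokens
    [[1.0, 0.0]],              # query 1: one token
]
passages = [
    [[1.0, 0.0], [0.6, 0.8]],  # passage 0: two tokens
    [[0.0, 1.0]],              # passage 1: one token
]
scores = score_matrix(queries, passages)
print(scores)  # [[1.8, 1.0], [1.0, 0.0]]
```

For query 0 and passage 0, the first query token matches best with a dot product of 1.0 and the second with 0.8, giving 1.8. Because the score is a sum over query tokens, longer queries can reach higher raw scores, so scores are best compared across passages for the same query.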
### Prerequisites

We strongly recommend installing `flash-attn`. If it is not installed, pass `attn_implementation="sdpa"` when loading the model instead.

Currently only `torch==2.8.0` is supported. For newer PyTorch versions, build FlashAttention manually; otherwise throughput may be low. Note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```
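
One possible way to choose the attention backend automatically is sketched below. The helper name `pick_attn_implementation` is hypothetical; it only checks whether the `flash_attn` package is importable and does not verify GPU or build compatibility.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"


ATTN_IMPLEMENTATION = pick_attn_implementation()
print(ATTN_IMPLEMENTATION)
# Pass the value on when loading the model, for example:
# AutoModel.from_pretrained(MODEL_ID, attn_implementation=ATTN_IMPLEMENTATION, trust_remote_code=True)
```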
### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",  # use "sdpa" if flash-attn is not installed
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]


def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue  # retry with the next set of headers
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(
        f"Could not fetch image (HTTP 403) from {url}; "
        "try downloading locally and loading from file path."
    )


# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)  # one (seq_len, embed_dim) tensor per query
    return outputs


def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)  # one (seq_len, embed_dim) tensor per document
    return outputs


# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval performance on the ViDoRe V1 and V3 benchmarks, with excellent results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Because one vector is stored per token, the index is still larger than single-vector baselines despite the smaller per-token dimension.
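
A rough back-of-the-envelope calculation illustrates the storage gap. Only the 640-dimensional, 16-bit output comes from this card; the token count per page (`tokens_per_page = 1000`) and the single-vector baseline dimension (`single_dim = 1024`) are hypothetical values chosen purely for illustration.

```python
EMBED_DIM = 640      # per-token dimension, from the specification table
BYTES_PER_VALUE = 2  # bfloat16 stores each value in 2 bytes


def multi_vector_bytes(num_tokens: int) -> int:
    """Storage for one page when every token keeps its own vector."""
    return num_tokens * EMBED_DIM * BYTES_PER_VALUE


def single_vector_bytes(dim: int) -> int:
    """Storage for one page when it is pooled into a single vector."""
    return dim * BYTES_PER_VALUE


tokens_per_page = 1000  # hypothetical token count for one page image
single_dim = 1024       # hypothetical single-vector baseline dimension

mv = multi_vector_bytes(tokens_per_page)
sv = single_vector_bytes(single_dim)
print(mv, sv, mv / sv)  # 1280000 2048 625.0
```

Under these assumptions a page costs about 1.28 MB instead of about 2 KB, so the actual ratio depends mostly on how many tokens each page produces.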
### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```