---
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
library_name: transformers
tags:
- transformers
- multimodal_embedding
- embedding
- colpali
- multilingual-embedding
- colqwen3
---
# OpenSearch-AI/Ops-Colqwen3-4B

**Ops-Colqwen3-4B** is a ColPali-style multimodal embedding model based on the **Qwen3-VL-4B-Instruct** architecture, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team. It maps text queries and visual documents, such as images and PDF pages, into a shared **multi-vector embedding space**, so that visual documents can be retrieved directly from text queries.

The model is trained with a multi-stage strategy that combines large-scale text retrieval datasets with diverse visual document data. This hybrid training improves its handling of complex document understanding and retrieval tasks. On the ViDoRe v1–v3 benchmarks, **Ops-Colqwen3-4B** achieves **state-of-the-art results** among models of comparable size (see Model Performance).

## Key Features

- **Model size**: 4 billion parameters  
- **Multimodal alignment**: Enables fine-grained semantic alignment between text and images or PDF pages  
- **Multi-vector embeddings**: Following the ColPali design, each input generates multiple context-aware embedding vectors; similarity is computed using **MaxSim**, enabling high-precision matching  
- **Scalable embedding dimensions**: Supports embedding dimensions up to **2,560** during inference via an extended projection head, enabling **higher retrieval accuracy** through more expressive representations. Lower-dimensional prefixes (e.g., the first 128 or 320 dimensions) remain highly effective for lightweight applications.
- **Multilingual support**: Covers over 30 languages  
- **Context length**: Supports up to **32,000 tokens**  
- **Visual token capacity**: Handles up to **1,280 visual tokens** per page input  
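
The MaxSim scoring mentioned above can be illustrated with a small pure-Python sketch. It follows the standard late-interaction definition: for each query vector, take the maximum dot product over all document vectors, then sum over query vectors. The function name `maxsim_score` and the toy 2-dimensional vectors are illustrative assumptions, not part of this model's code, and the actual scoring routine may differ in batching or normalization.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best-matching document vector's similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vector embeddings (hypothetical values, 2 dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, 1.0]]

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Because each query vector is matched independently, a page scores highly only when it contains a good match for most of the query tokens, which is what gives multi-vector retrieval its fine-grained behavior.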

## Usage

**Requirements**
```
pillow
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
```

**Basic Usage**

```python
import torch
from PIL import Image

# Assumes scripts/ops_colqwen3_embedder.py is importable from the working directory.
from scripts.ops_colqwen3_embedder import OpsColQwen3Embedder

# Placeholder inputs: two blank images and two text queries.
images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
queries = ["Is attention really all you need?", "What is the amount of bananas farmed in Salvador?"]

embedder = OpsColQwen3Embedder(
    model_name="OpenSearch-AI/Ops-Colqwen3-4B",
    dims=2560,  # size of each embedding vector
    dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Each input yields multiple embedding vectors (multi-vector output).
query_embeddings = embedder.encode_queries(queries)
image_embeddings = embedder.encode_images(images)
print(query_embeddings[0].shape, image_embeddings[0].shape)  # (23, 2560) (18, 2560)

# Similarity scores between the queries and the images.
scores = embedder.compute_scores(query_embeddings, image_embeddings)

print(f"Scores:\n{scores}")
```
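
Key Features notes that lower-dimensional prefixes of the embeddings (for example the first 128 or 320 dimensions) remain effective. The sketch below shows the prefix idea on plain Python lists. The helper `truncate_embeddings` is hypothetical, and rescaling each truncated vector to unit length is an assumption rather than documented behavior; in practice, the `dims` argument shown in the example above is presumably the supported way to select the dimension.

```python
import math
from typing import List, Sequence


def truncate_embeddings(vectors: Sequence[Sequence[float]], dims: int) -> List[List[float]]:
    """Keep the first `dims` values of each vector and rescale to unit length."""
    truncated = []
    for vec in vectors:
        prefix = list(vec[:dims])
        norm = math.sqrt(sum(x * x for x in prefix))
        # Leave all-zero prefixes untouched to avoid dividing by zero.
        truncated.append([x / norm for x in prefix] if norm > 0 else prefix)
    return truncated


# Toy 4-dimensional vectors standing in for real 2,560-dimensional output.
full = [[3.0, 4.0, 12.0, 0.0], [0.0, 0.0, 5.0, 1.0]]
small = truncate_embeddings(full, dims=2)
print(small)
```

Queries and documents need to be truncated to the same number of dimensions before scores are computed.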

## Model Performance

### ViDoRe v1 + v2 (NDCG@5)

| Model                                      | Dim  | ViDoRe v1+v2 | ViDoRe v2 | ViDoRe v1 |
|--------------------------------------------|------|--------------|-----------|-----------|
| **Ops-Colqwen3-4B**                        | 2560 | **84.87**    | **68.7**  | **91.4**  |
| **Ops-Colqwen3-4B**                        | 1280 | 84.71        | 68.2      | 91.3      |
| **Ops-Colqwen3-4B**                        | 640  | 84.39        | 67.7      | 91.1      |
| **Ops-Colqwen3-4B**                        | 320  | 84.12        | 67.0      | 91.0      |
| **Ops-Colqwen3-4B**                        | 128  | 84.04        | 66.9      | 90.9      |
| tomoro-colqwen3-embed-8b                   | 320  | 83.52        | 65.4      | 90.8      |
| EvoQwen2.5-VL-Retriever-7B-v1              | 128  | 83.41        | 65.2      | 90.7      |
| tomoro-colqwen3-embed-4b                   | 320  | 83.18        | 64.7      | 90.6      |
| llama-nemoretriever-colembed-3b-v1         | 3072 | 83.10        | 63.3      | 91.0      |
| SauerkrautLM-ColQwen3-8b-v0.1              | 128  | 82.91        | 62.5      | 91.1      |
| EvoQwen2.5-VL-Retriever-3B-v1              | 128  | 82.76        | 63.0      | 90.7      |
| SauerkrautLM-ColQwen3-4b-v0.1              | 128  | 81.97        | 59.9      | 90.8      |
| jina-embedding-v4                          | 128  | 81.17        | 58.2      | 90.4      |


### ViDoRe v3 (NDCG@10)

| Model                                      | Dim  | PUB AVG |
|--------------------------------------------|------|---------|
| **Ops-Colqwen3-4B**                        | 2560 |  61.27  |
| **Ops-Colqwen3-4B**                        | 1280 |  **61.32**  |
| **Ops-Colqwen3-4B**                        | 640  |  61.21  |
| **Ops-Colqwen3-4B**                        | 320  |  60.88  |
| **Ops-Colqwen3-4B**                        | 128  |  60.23  |
| tomoro-colqwen3-embed-4b                   | 320  |  60.19  |
| SauerkrautLM-ColQwen3-8b-v0.1              | 128  |  58.55  |
| jina-embedding-v4                          | 128  |  57.54  |
| llama-nemoretriever-colembed-3b-v1         | 3072 |  57.07  |
| SauerkrautLM-ColQwen3-4b-v0.1              | 128  |  56.03  |


> With only **128 dimensions**, `Ops-Colqwen3-4B` scores 84.04 on ViDoRe v1+v2 and 60.23 on ViDoRe v3, ahead of other 4B-parameter models such as `tomoro-colqwen3-embed-4b` (83.18 and 60.19 at 320 dimensions), making it well-suited for latency- and memory-constrained applications.
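
To make the memory trade-off concrete, the following back-of-the-envelope sketch estimates index storage per page at each reported dimension. It assumes the worst case of 1,280 vectors per page (the visual token capacity listed in Key Features), float16 storage at 2 bytes per value, and no compression or extra special tokens; real pages may produce fewer vectors. The function name `index_bytes_per_page` is illustrative.

```python
def index_bytes_per_page(num_vectors: int, dims: int, bytes_per_value: int = 2) -> int:
    """Raw storage for one page's multi-vector embedding, in bytes."""
    return num_vectors * dims * bytes_per_value


MAX_VISUAL_TOKENS = 1280  # assumed upper bound on vectors per page

for dims in (128, 320, 640, 1280, 2560):
    size = index_bytes_per_page(MAX_VISUAL_TOKENS, dims)
    print(f"{dims:>5} dims: {size / 1024:.0f} KiB per page")
```

Under these assumptions, 128-dimensional vectors need 320 KiB per page, while 2,560-dimensional vectors need 6,400 KiB, a 20x difference.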


## Citation

If you use this model in your work, please cite:

```bibtex
@misc{ops_colqwen3_4b,
  author       = {{OpenSearch-AI}},
  title        = {{Ops-Colqwen3: State-of-the-Art Multimodal Embedding Model for Visual Document Retrieval}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B}},
}
```