---
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
library_name: transformers
tags:
- transformers
- multimodal_embedding
- embedding
- colpali
- multilingual-embedding
- colqwen3
---

# OpenSearch-AI/Ops-Colqwen3-4B

**Ops-Colqwen3-4B** is a ColPali-style multimodal embedding model based on the **Qwen3-VL-4B-Instruct** architecture, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team. It maps text queries and visual documents such as images and PDF pages into a unified, aligned **multi-vector embedding space**, enabling highly effective visual document retrieval.

The model is trained with a multi-stage strategy that combines large-scale text-based retrieval datasets with diverse visual document data. This hybrid training significantly strengthens its ability to handle complex document understanding and retrieval tasks. On the Vidore v1–v3 benchmarks, **Ops-Colqwen3-4B** achieves **state-of-the-art results** among models of comparable size.

## Key Features

- **Model size**: 4 billion parameters
- **Multimodal alignment**: Fine-grained semantic alignment between text and images or PDF pages
- **Multi-vector embeddings**: Following the ColPali design, each input generates multiple context-aware embedding vectors; similarity is computed with **MaxSim**, enabling high-precision matching (see the sketch after this list)
- **Scalable embedding dimensions**: Supports embedding dimensions up to **2,560** during inference via an extended projection head, enabling **higher retrieval accuracy** through more expressive representations; lower-dimensional prefixes (e.g., the first 128 or 320 dimensions) remain highly effective for lightweight applications
- **Multilingual support**: Covers over 30 languages
- **Context length**: Supports up to **32,000 tokens**
- **Visual token capacity**: Handles up to **1,280 visual tokens** per page

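MaxSim (late-interaction) scoring is straightforward: for each query token vector, take its maximum similarity over all document token vectors, then sum those maxima over the query tokens. Below is a minimal, self-contained sketch of that computation on random tensors; it is purely illustrative and does not depend on the embedder class used in the Usage section.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()  # best document token per query token, summed

# Random multi-vector embeddings, just to illustrate the shapes involved.
query_emb = torch.randn(23, 2560)
doc_emb = torch.randn(18, 2560)
print(maxsim_score(query_emb, doc_emb))
```
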
## Usage

**Requirements**

```
pillow
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
```

**Basic Usage**

```python
import torch
from PIL import Image

from scripts.ops_colqwen3_embedder import OpsColQwen3Embedder

# Example inputs: two placeholder images and two text queries.
images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
queries = ["Is attention really all you need?", "What is the amount of bananas farmed in Salvador?"]

# Load the model; flash_attention_2 typically requires a compatible GPU and the flash-attn package.
embedder = OpsColQwen3Embedder(
    model_name="OpenSearch-AI/Ops-Colqwen3-4B",
    dims=2560,
    dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# Each input yields a multi-vector embedding of shape (num_tokens, dims).
query_embeddings = embedder.encode_queries(queries)
image_embeddings = embedder.encode_images(images)
print(query_embeddings[0].shape, image_embeddings[0].shape)  # (23, 2560) (18, 2560)

# Late-interaction (MaxSim) scores between every query and every image.
scores = embedder.compute_scores(query_embeddings, image_embeddings)

print(f"Scores:\n{scores}")
```

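Continuing the snippet above, the score matrix can be used directly to rank images for each query. This sketch assumes `compute_scores` returns a query-by-image tensor with one MaxSim score per pair (the usual ColPali convention); check the embedder script in the repository for the exact return type.

```python
# Assumption: `scores` is a (num_queries, num_images) tensor of MaxSim scores.
ranking = scores.argsort(dim=-1, descending=True)  # image indices, best match first
for q, query in enumerate(queries):
    best = ranking[q, 0].item()
    print(f"{query!r} -> image {best} (score {scores[q, best].item():.2f})")
```
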
## Model Performance

### Vidore v1 + v2 (NDCG@5)

| Model                              | Dim  | Vidore v1+v2 | Vidore v2 | Vidore v1 |
|------------------------------------|------|--------------|-----------|-----------|
| **Ops-Colqwen3-4B**                | 2560 | **84.87**    | **68.7**  | **91.4**  |
| **Ops-Colqwen3-4B**                | 1280 | 84.71        | 68.2      | 91.3      |
| **Ops-Colqwen3-4B**                | 640  | 84.39        | 67.7      | 91.1      |
| **Ops-Colqwen3-4B**                | 320  | 84.12        | 67.0      | 91.0      |
| **Ops-Colqwen3-4B**                | 128  | 84.04        | 66.9      | 90.9      |
| tomoro-colqwen3-embed-8b           | 320  | 83.52        | 65.4      | 90.8      |
| EvoQwen2.5-VL-Retriever-7B-v1      | 128  | 83.41        | 65.2      | 90.7      |
| tomoro-colqwen3-embed-4b           | 320  | 83.18        | 64.7      | 90.6      |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 83.10        | 63.3      | 91.0      |
| SauerkrautLM-ColQwen3-8b-v0.1      | 128  | 82.91        | 62.5      | 91.1      |
| EvoQwen2.5-VL-Retriever-3B-v1      | 128  | 82.76        | 63.0      | 90.7      |
| SauerkrautLM-ColQwen3-4b-v0.1      | 128  | 81.97        | 59.9      | 90.8      |
| jina-embedding-v4                  | 128  | 81.17        | 58.2      | 90.4      |

### Vidore v3 (NDCG@10)

| Model                              | Dim  | Public Avg |
|------------------------------------|------|------------|
| **Ops-Colqwen3-4B**                | 2560 | 61.27      |
| **Ops-Colqwen3-4B**                | 1280 | **61.32**  |
| **Ops-Colqwen3-4B**                | 640  | 61.21      |
| **Ops-Colqwen3-4B**                | 320  | 60.88      |
| **Ops-Colqwen3-4B**                | 128  | 60.23      |
| tomoro-colqwen3-embed-4b           | 320  | 60.19      |
| SauerkrautLM-ColQwen3-8b-v0.1      | 128  | 58.55      |
| jina-embedding-v4                  | 128  | 57.54      |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 57.07      |
| SauerkrautLM-ColQwen3-4b-v0.1      | 128  | 56.03      |

> With only **128 dimensions**, `Ops-Colqwen3-4B` outperforms other 4B-parameter models such as `tomoro-colqwen3-embed-4b`, making it well-suited for latency- and memory-constrained applications.

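One plausible way to run in this low-dimensional setting, assuming the `dims` constructor argument accepts the smaller values benchmarked above (verify against the embedder script in the repository), is to re-instantiate the embedder with `dims=128`; alternatively, the first 128 dimensions of each token vector can be sliced from the full embeddings before scoring.

```python
# Hypothetical low-dimensional setup: same as Basic Usage, but requesting
# 128-dim token vectors. `queries` and `images` are defined as in the snippet above.
embedder_128 = OpsColQwen3Embedder(
    model_name="OpenSearch-AI/Ops-Colqwen3-4B",
    dims=128,
    dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
query_embeddings = embedder_128.encode_queries(queries)
image_embeddings = embedder_128.encode_images(images)
scores = embedder_128.compute_scores(query_embeddings, image_embeddings)
```
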
## Citation

If you use this model in your work, please cite:

```bibtex
@misc{ops_colqwen3_4b,
  author       = {{OpenSearch-AI}},
  title        = {{Ops-Colqwen3: State-of-the-Art Multimodal Embedding Model for Visual Document Retrieval}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B}},
}
```