---
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
library_name: transformers
tags:
- transformers
- multimodal_embedding
- embedding
- colpali
- multilingual-embedding
- colqwen3
---

# OpenSearch-AI/Ops-Colqwen3-4B

**Ops-Colqwen3-4B** is a ColPali-style multimodal embedding model based on the **Qwen3-VL-4B-Instruct** architecture, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team. It maps text queries and visual documents such as images and PDF pages into a unified, aligned **multi-vector embedding space**, enabling highly effective visual document retrieval.

The model is trained with a multi-stage strategy that combines large-scale text-based retrieval datasets with diverse visual document data. This hybrid training significantly strengthens its ability to handle complex document understanding and retrieval tasks. On the Vidore v1–v3 benchmarks, **Ops-Colqwen3-4B** achieves **state-of-the-art results** among models of comparable size.

## Key Features

- **Model size**: 4 billion parameters
- **Multimodal alignment**: Fine-grained semantic alignment between text and images or PDF pages
- **Multi-vector embeddings**: Following the ColPali design, each input generates multiple context-aware embedding vectors; similarity is computed with **MaxSim**, enabling high-precision matching (see the sketch after this list)
- **Scalable embedding dimensions**: Supports embedding dimensions up to **2,560** during inference via an extended projection head, enabling **higher retrieval accuracy** through more expressive representations; lower-dimensional prefixes (e.g., the first 128 or 320 dimensions) remain highly effective for lightweight applications
- **Multilingual support**: Covers over 30 languages
- **Context length**: Supports up to **32,000 tokens**
- **Visual token capacity**: Handles up to **1,280 visual tokens** per page

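MaxSim (late-interaction) scoring is straightforward: for each query token vector, take its maximum similarity over all document token vectors, then sum those maxima over the query tokens. Below is a minimal, self-contained sketch of that computation on random tensors; it is purely illustrative and does not depend on the embedder class used in the Usage section.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()  # best document token per query token, summed

# Random multi-vector embeddings, just to illustrate the shapes involved.
query_emb = torch.randn(23, 2560)
doc_emb = torch.randn(18, 2560)
print(maxsim_score(query_emb, doc_emb))
```
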
## Usage

**Requirements**

```
pillow
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
```

**Basic Usage**

```python
import torch
from PIL import Image

from scripts.ops_colqwen3_embedder import OpsColQwen3Embedder

# Example inputs: two placeholder images and two text queries.
images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
queries = ["Is attention really all you need?", "What is the amount of bananas farmed in Salvador?"]

# Load the model; flash_attention_2 typically requires a compatible GPU and the flash-attn package.
embedder = OpsColQwen3Embedder(
    model_name="OpenSearch-AI/Ops-Colqwen3-4B",
    dims=2560,
    dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# Each input yields a multi-vector embedding of shape (num_tokens, dims).
query_embeddings = embedder.encode_queries(queries)
image_embeddings = embedder.encode_images(images)
print(query_embeddings[0].shape, image_embeddings[0].shape)  # (23, 2560) (18, 2560)

# Late-interaction (MaxSim) scores between every query and every image.
scores = embedder.compute_scores(query_embeddings, image_embeddings)

print(f"Scores:\n{scores}")
```

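Continuing the snippet above, the score matrix can be used directly to rank images for each query. This sketch assumes `compute_scores` returns a query-by-image tensor with one MaxSim score per pair (the usual ColPali convention); check the embedder script in the repository for the exact return type.

```python
# Assumption: `scores` is a (num_queries, num_images) tensor of MaxSim scores.
ranking = scores.argsort(dim=-1, descending=True)  # image indices, best match first
for q, query in enumerate(queries):
    best = ranking[q, 0].item()
    print(f"{query!r} -> image {best} (score {scores[q, best].item():.2f})")
```
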
## Model Performance

### Vidore v1 + v2 (NDCG@5)

| Model                              | Dim  | Vidore v1+v2 | Vidore v2 | Vidore v1 |
|------------------------------------|------|--------------|-----------|-----------|
| **Ops-Colqwen3-4B**                | 2560 | **84.87**    | **68.7**  | **91.4**  |
| **Ops-Colqwen3-4B**                | 1280 | 84.71        | 68.2      | 91.3      |
| **Ops-Colqwen3-4B**                | 640  | 84.39        | 67.7      | 91.1      |
| **Ops-Colqwen3-4B**                | 320  | 84.12        | 67.0      | 91.0      |
| **Ops-Colqwen3-4B**                | 128  | 84.04        | 66.9      | 90.9      |
| tomoro-colqwen3-embed-8b           | 320  | 83.52        | 65.4      | 90.8      |
| EvoQwen2.5-VL-Retriever-7B-v1      | 128  | 83.41        | 65.2      | 90.7      |
| tomoro-colqwen3-embed-4b           | 320  | 83.18        | 64.7      | 90.6      |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 83.10        | 63.3      | 91.0      |
| SauerkrautLM-ColQwen3-8b-v0.1      | 128  | 82.91        | 62.5      | 91.1      |
| EvoQwen2.5-VL-Retriever-3B-v1      | 128  | 82.76        | 63.0      | 90.7      |
| SauerkrautLM-ColQwen3-4b-v0.1      | 128  | 81.97        | 59.9      | 90.8      |
| jina-embedding-v4                  | 128  | 81.17        | 58.2      | 90.4      |

### Vidore v3 (NDCG@10)

| Model                              | Dim  | Public Avg |
|------------------------------------|------|------------|
| **Ops-Colqwen3-4B**                | 2560 | 61.27      |
| **Ops-Colqwen3-4B**                | 1280 | **61.32**  |
| **Ops-Colqwen3-4B**                | 640  | 61.21      |
| **Ops-Colqwen3-4B**                | 320  | 60.88      |
| **Ops-Colqwen3-4B**                | 128  | 60.23      |
| tomoro-colqwen3-embed-4b           | 320  | 60.19      |
| SauerkrautLM-ColQwen3-8b-v0.1      | 128  | 58.55      |
| jina-embedding-v4                  | 128  | 57.54      |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 57.07      |
| SauerkrautLM-ColQwen3-4b-v0.1      | 128  | 56.03      |

> With only **128 dimensions**, `Ops-Colqwen3-4B` outperforms other 4B-parameter models such as `tomoro-colqwen3-embed-4b`, making it well-suited for latency- and memory-constrained applications.

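One plausible way to run in this low-dimensional setting, assuming the `dims` constructor argument accepts the smaller values benchmarked above (verify against the embedder script in the repository), is to re-instantiate the embedder with `dims=128`; alternatively, the first 128 dimensions of each token vector can be sliced from the full embeddings before scoring.

```python
# Hypothetical low-dimensional setup: same as Basic Usage, but requesting
# 128-dim token vectors. `queries` and `images` are defined as in the snippet above.
embedder_128 = OpsColQwen3Embedder(
    model_name="OpenSearch-AI/Ops-Colqwen3-4B",
    dims=128,
    dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
query_embeddings = embedder_128.encode_queries(queries)
image_embeddings = embedder_128.encode_images(images)
scores = embedder_128.compute_scores(query_embeddings, image_embeddings)
```
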
## Citation

If you use this model in your work, please cite:

```bibtex
@misc{ops_colqwen3_4b,
  author       = {{OpenSearch-AI}},
  title        = {{Ops-Colqwen3: State-of-the-Art Multimodal Embedding Model for Visual Document Retrieval}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B}},
}
```