	---
	pipeline_tag: visual-document-retrieval
	library_name: transformers
	language:
	- multilingual
	license: other
	license_name: webai-non-commercial-license-v1.0
	license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
	base_model: Qwen/Qwen3.5-4B
	tags:
	- text
	- image
	- video
	- multimodal-embedding
	- vidore
	- colpali
	- colqwen3_5
	- multilingual-embedding
	---

	# webAI-Official/webAI-ColVec1-4b

	## ⚡ Summary

	webAI-Official/webAI-ColVec1-4b is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B). It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

	The model has been fine-tuned on a merged multimodal dataset of ~2M question-image pairs, including:

	- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
	- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
	- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
	- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
	- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
	- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
	- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
	- Proprietary domain-specific synthetic data

	The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves competitive performance across ViDoRe V1 & V3 (English and multilingual).

	## 🛠️ Model Specifications


	| Feature           | Detail                                                                    |
	| ----------------- | ------------------------------------------------------------------------- |
	| Architecture      | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
	| Methodology       | ColBERT-style Late Interaction (MaxSim scoring)                           |
	| Output            | Multi-vector (Seq_Len × 640), L2-normalized                               |
	| Modalities        | Text Queries, Images (Documents)                                          |
	| Training Strategy | LoRA adapters + Fully-trained projection layer                            |
	| Precision         | `bfloat16` weights, FlashAttention 2 enabled                              |


	---

	### Key Properties

	- Unified Encoder (Single-Tower): A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.

	- Projection Head: A single linear layer projects the final hidden states into a compact embedding space (hidden_size → 640 dim).
	  - No activation
	  - Fully trained
	  - Replaces the LM head for retrieval

	- Multi-Vector Representation: Each token becomes an embedding, which enables fine-grained token-level matching instead of single-vector pooling.
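
The projection step can be illustrated with a minimal NumPy sketch. It is not the model's actual implementation: the weights are random, `HIDDEN_SIZE` is a toy value (the real one comes from the Qwen3.5-4B config), the helper name `project_tokens` is hypothetical, and a bias term is omitted as an assumption. The L2 normalization follows the output description in the specifications table.

```python
import numpy as np

EMBED_DIM = 640    # projection width stated in this model card
HIDDEN_SIZE = 16   # toy value for illustration only; the real hidden size differs


def project_tokens(hidden_states: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linearly project per-token hidden states, then L2-normalize each token vector."""
    # (seq_len, hidden) @ (hidden, 640); a plain linear map with no activation.
    projected = hidden_states @ weight
    norms = np.linalg.norm(projected, axis=-1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((5, HIDDEN_SIZE))   # 5 tokens of final hidden states
weight = rng.standard_normal((HIDDEN_SIZE, EMBED_DIM))  # stand-in for the trained projection

embeddings = project_tokens(hidden_states, weight)
print(embeddings.shape)  # (5, 640): one 640-dim vector per token
```

Because every token keeps its own unit-length vector, the dot products used later for scoring behave like cosine similarities.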

	## 📊 Evaluation Results

	We report results on the ViDoRe benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
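
The Borda ranking described above can be sketched as follows. This is a simplified illustration of a standard Borda count, not the exact (M)MTEB implementation (tie handling in particular may differ); the function name `borda_rank`, the model names, the task names, and all scores are made up.

```python
from collections import defaultdict


def borda_rank(task_scores: dict[str, dict[str, float]]) -> list[tuple[str, int]]:
    """Return (model, points) pairs sorted by total Borda points, highest first."""
    points: dict[str, int] = defaultdict(int)
    for scores in task_scores.values():
        # Best score first; the model at 0-based position r among n earns n - 1 - r points.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for position, model in enumerate(ranked):
            points[model] += len(ranked) - 1 - position
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for three placeholder models on three placeholder tasks.
task_scores = {
    "task_a": {"model_x": 0.80, "model_y": 0.75, "model_z": 0.60},
    "task_b": {"model_x": 0.65, "model_y": 0.70, "model_z": 0.55},
    "task_c": {"model_x": 0.90, "model_y": 0.85, "model_z": 0.50},
}

leaderboard = borda_rank(task_scores)
print(leaderboard)  # [('model_x', 5), ('model_y', 4), ('model_z', 0)]
```

Only the per-task ordering matters here, not the size of the score gaps, which is why a model can lead the overall ranking without having the best score on every task.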

	### ViDoRe V3 (NDCG@10)


	### ViDoRe V1 (NDCG@5)


	---

	## 💻 Usage

	The processor exposes three primary methods for encoding inputs and computing retrieval scores.

	#### `process_images(images, max_length=None)`

	Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

	| Parameter    | Type                    | Description                                                         |
	| ------------ | ----------------------- | ------------------------------------------------------------------- |
	| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
	| `max_length` | `int` or `None`         | Optional maximum sequence length (default: `None`).                 |

	```python
	batch = processor.process_images(images=pil_images)
	batch = {k: v.to(device) for k, v in batch.items()}
	embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
	```

	---

	#### `process_queries(texts, max_length=None)`

	Encodes a batch of text queries into model-ready tensors.

	| Parameter    | Type            | Description                                         |
	| ------------ | --------------- | --------------------------------------------------- |
	| `texts`      | `List[str]`     | Natural-language query strings.                     |
	| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

	```python
	batch = processor.process_queries(texts=["What is the revenue for Q3?"])
	batch = {k: v.to(device) for k, v in batch.items()}
	embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
	```

	---

	#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

	Computes ColBERT-style MaxSim late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.


	| Parameter    | Type                       | Description                                                            |
	| ------------ | -------------------------- | ---------------------------------------------------------------------- |
	| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
	| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
	| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
	| `device`     | `str` or `torch.device`    | Device used for scoring (default: `None`).                             |


	Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

	```python
	scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
	# scores[i, j] is the relevance of document j to query i
	best_doc_per_query = scores.argmax(dim=1)
	```
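
For intuition, the MaxSim computation described in this section can be written in a few lines of NumPy. This is an illustrative sketch for a single query and a single passage, not the processor's implementation; `maxsim_score` is a hypothetical helper and the tiny 2-dimensional embeddings are made up.

```python
import numpy as np


def maxsim_score(query: np.ndarray, passage: np.ndarray) -> float:
    """Late-interaction score for one (query, passage) pair.

    query:   (seq_len_q, embed_dim) token embeddings
    passage: (seq_len_p, embed_dim) token embeddings
    """
    similarities = query @ passage.T                  # dot product for every token pair
    best_per_query_token = similarities.max(axis=1)   # max over passage tokens
    return float(best_per_query_token.sum())          # sum over query tokens


query = np.array([[1.0, 0.0], [0.0, 1.0]])
passage = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]])

score = maxsim_score(query, passage)
print(score)  # approximately 1.8 (best matches: 1.0 and 0.8)
```

The first query token matches the first passage token exactly (1.0), and the second query token matches the second passage token best (0.8), so the pair scores about 1.8.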

	### Prerequisites

	We strongly recommend installing `flash-attn`. If it is not installed, set `attn_implementation="sdpa"` when loading the model.

	Currently we only support `torch==2.8.0`. For newer PyTorch versions, build FlashAttention manually; otherwise throughput may be low. Note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

	```bash
	pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
	pip install transformers pillow requests
	pip install flash-attn --no-build-isolation
	```
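
One possible way to choose between the two attention backends at runtime is to check whether the `flash_attn` package is importable. The helper name `pick_attn_implementation` is hypothetical, and the check only confirms that the package is installed, not that it works on the current GPU.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the `attn_implementation` argument of `AutoModel.from_pretrained`.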

	### Inference Code

	```python
	import torch
	from transformers import AutoModel, AutoProcessor
	from PIL import Image, UnidentifiedImageError
	import requests
	from io import BytesIO

	# Configuration
	MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
	DTYPE = torch.bfloat16
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

	# Load Model & Processor
	processor = AutoProcessor.from_pretrained(
	    MODEL_ID,
	    trust_remote_code=True,
	)

	model = AutoModel.from_pretrained(
	    MODEL_ID,
	    dtype=DTYPE,
	    attn_implementation="flash_attention_2",
	    trust_remote_code=True,
	    device_map=DEVICE,
	).eval()

	# Sample Data
	queries = [
	    "Retrieve the city of Singapore",
	    "Retrieve the city of Beijing",
	    "Retrieve the city of London",
	]
	docs = [
	    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
	    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
	    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
	]

	def load_image(url: str) -> Image.Image:
	    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
	    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
	        resp = requests.get(url, headers=headers, timeout=10)
	        if resp.status_code == 403:
	            continue
	        resp.raise_for_status()
	        try:
	            return Image.open(BytesIO(resp.content)).convert("RGB")
	        except UnidentifiedImageError as e:
	            raise RuntimeError(f"Failed to decode image from {url}") from e
	    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

	# Helper Functions
	def encode_queries(texts, batch_size=8):
	    outputs = []
	    for start in range(0, len(texts), batch_size):
	        batch = processor.process_queries(texts=texts[start : start + batch_size])
	        batch = {k: v.to(DEVICE) for k, v in batch.items()}
	        with torch.inference_mode():
	            embeddings = model(**batch)
	            vecs = embeddings.to(torch.bfloat16).cpu()
	        outputs.extend(vecs)
	    return outputs

	def encode_docs(urls, batch_size=4):
	    pil_images = [load_image(url) for url in urls]
	    outputs = []
	    for start in range(0, len(pil_images), batch_size):
	        batch_imgs = pil_images[start : start + batch_size]
	        features = processor.process_images(images=batch_imgs)
	        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
	        with torch.inference_mode():
	            embeddings = model(**features)
	            vecs = embeddings.to(torch.bfloat16).cpu()
	        outputs.extend(vecs)
	    return outputs

	# Execution
	query_embeddings = encode_queries(queries)
	doc_embeddings = encode_docs(docs)

	# MaxSim Scoring
	scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
	print(scores)
	```

	---

	## ⚖️ Strengths & Limitations

	### Strengths

	- Performance: State-of-the-art retrieval performance on the ViDoRe V1 and V3 benchmarks for multimodal document retrieval.
	- Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
	- End-to-end Retrieval: Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
	- Multilingualism: Strong performance on non-English document inputs.

	### Limitations

	- Storage Cost: The index stores one 640-dim vector per token, so it is still larger than single-vector baselines despite the smaller token dimension.
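
A rough back-of-the-envelope calculation shows where the storage cost comes from. The 640-dim width and `bfloat16` precision come from this model card; the token count per page and the single-vector baseline width are hypothetical values chosen only for illustration, and index overhead is ignored.

```python
EMBED_DIM = 640        # per-token dimension from this model card
BYTES_PER_VALUE = 2    # bfloat16


def index_bytes(num_vectors: int, dim: int, bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Raw storage for num_vectors embeddings of width dim (no index overhead)."""
    return num_vectors * dim * bytes_per_value


tokens_per_page = 1000      # hypothetical; depends on image resolution
single_vector_dim = 2560    # hypothetical single-vector baseline width

multi_vector = index_bytes(tokens_per_page, EMBED_DIM)
single_vector = index_bytes(1, single_vector_dim)

print(multi_vector)                   # 1280000 bytes per page
print(single_vector)                  # 5120 bytes per page
print(multi_vector // single_vector)  # 250
```

Under these assumed numbers a page costs about 250 times more to store than with a single vector. The real ratio depends on how many visual tokens each page produces.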

	### License & Data

	[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

	## 📚 Citation

	If you use this model, please cite:

	```bibtex
	@misc{webAI-ColVec1,
	title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
	author={webAI},
	year={2026},
	url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
	}
	```