Commit feb9ad6 (parent: 3ec3a13): update README, usage implementation

Files changed:
- README.md (+139)
- adapter_config.json (+51)
- adapter_model.safetensors (+3)

README.md (ADDED)
---
base_model: Qwen/Qwen3.5-2B-Base
library_name: peft
license: apache-2.0
language:
- en
tags:
- peft
- transformers
- document-retrieval
- multi-vector-embedding
- colpali
- matryoshka
- feature-extraction
---

# ColQwen3.5-2B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval adapter, fine-tuned on top of [Qwen/Qwen3.5-2B-Base](https://huggingface.co/Qwen/Qwen3.5-2B-Base).

**2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning**

## Description

Inspired by [ColPali](https://arxiv.org/abs/2407.01449), this model encodes each document page image into a sequence of contextualized patch embeddings and uses **late-interaction MaxSim** scoring for retrieval.
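Late-interaction MaxSim is simple to state: for each query token embedding, take the maximum dot product over all document patch embeddings, then sum over the query tokens. A minimal pure-Python sketch with toy 2-dim vectors (illustrative only, not the model's actual embeddings):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction MaxSim: for each query token embedding, take the
    max dot product over all document patch embeddings, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# toy 2-dim embeddings: 2 query tokens vs. 2 document patches
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.7, 0.7]]
print(maxsim_score(query, doc))  # 1.0 + 0.7 = 1.7
```

Because every query token independently picks its best-matching patch, the score rewards documents that cover all parts of the query rather than matching a single pooled vector.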
Trained with **Matryoshka Representation Learning**, so embeddings can be truncated to any of `[128, 256, 512, 1024, 2048]` dimensions without retraining, enabling flexible accuracy/speed tradeoffs.
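In practice, Matryoshka truncation is just slicing off the leading components and L2-renormalizing so dot-product scores stay comparable. A sketch, assuming unit-normalized embeddings (the helper name is hypothetical, not part of this repo's API):

```python
def truncate_embedding(vec, dim):
    """Keep the first `dim` components of a Matryoshka-trained embedding,
    then L2-renormalize so dot-product / MaxSim scores remain comparable."""
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a full 2048-dim embedding
short = truncate_embedding(full, 2)  # e.g. serve at 128 of 2048 dims
print(short)  # both components become 1/sqrt(2)
```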
## Evaluations

All numbers are **NDCG@5**.

### ViDoRe v1

Evaluated on the [ViDoRe v1 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark) (a single relevant document per query).

| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| **Average** | **0.8688** | **0.8736** | **0.8750** | **0.8799** | **0.8815** |
| 46 |
+
|
| 47 |
+
### ViDoRe v2
|
| 48 |
+
Evaluated on the [ViDoRe v2 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark-v2) (BEIR format, multi-relevant graded qrels — harder than v1).
|
| 49 |
+
|
| 50 |
+
> v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score ≥ 1 = relevant).
|
| 51 |
+
|
| 52 |
+
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|
| 53 |
+
|---|---|---|---|---|---|---|---|
|
| 54 |
+
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
|
| 55 |
+
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
|
| 56 |
+
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
|
| 57 |
+
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
|
| 58 |
+
| **Average** | | | **0.5121** | **0.5273** | **0.5334** | **0.5457** | **0.5498** |
|
| 59 |
+
|
### Combined Average (v1 + v2 macro)

| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| **Overall avg** | **0.6905** | **0.7005** | **0.7042** | **0.7128** | **0.7157** |
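The overall row is the unweighted (macro) mean of the two benchmark averages, i.e. both suites count equally regardless of how many datasets each contains. Reproducing it from the per-suite averages:

```python
# per-dim NDCG@5 averages from the two tables above
v1 = {128: 0.8688, 256: 0.8736, 512: 0.8750, 1024: 0.8799, 2048: 0.8815}
v2 = {128: 0.5121, 256: 0.5273, 512: 0.5334, 1024: 0.5457, 2048: 0.5498}

# unweighted (macro) mean over the two benchmark suites
overall = {d: (v1[d] + v2[d]) / 2 for d in v1}
print(overall[128])  # ~0.6905, the reported value after rounding
```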
### Comparison with the 0.8B variant

| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| **ColQwen3.5-2B** | **2B** | **0.8799** | **0.5457** |
| Δ | | +0.0174 | **+0.0651** |

> The 2B model shows its largest gains on v2 (harder, multi-relevant), consistent with larger models being more robust in harder retrieval settings.
## Limitations

- **Training data:** Fine-tuned on [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) for **1 epoch** on a single A100-80GB. The set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out of distribution. No hard negatives were used.
- **Language:** Training data is predominantly English; performance on non-English documents is expected to degrade.
- **LoRA adapter:** Must be loaded on top of the base `Qwen/Qwen3.5-2B-Base` weights.
- **Matryoshka tradeoff:** Truncating to 128 dims costs ~1.3 NDCG@5 points vs 2048 dims on v1 (0.8688 vs 0.8815) and ~3.8 points on v2 (0.5121 vs 0.5498).
## Usage

### Requirements

```text
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```

### Example

```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]

documents = [
    {"image": "/path/to/document_page.png"},
]

qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
## Training Details

| Setting | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B-Base |
| **Training data** | [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) (~118K pairs) |
| **Epochs** | 1 |
| **Batch size** | 8 per device × 4 grad accum = 32 effective |
| **Learning rate** | 5e-5 (cosine schedule, 2.5% warmup) |
| **Optimizer** | paged_adamw_8bit |
| **LoRA rank** | r=32, α=32 |
| **LoRA targets** | All linear layers (attention + MLP + DeltaNet) |
| **Loss** | Matryoshka MaxSim (dims 128, 256, 512, 1024, 2048; equal weights) |
| **Precision** | bfloat16 |
| **Hardware** | 1× NVIDIA A100-SXM4-80GB |
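The Matryoshka MaxSim loss row can be read as: compute an in-batch contrastive loss from the MaxSim scores at each truncation dimension, then average the per-dim losses with equal weights. A hedged pure-Python sketch; the function names and toy scores are illustrative, not the actual training code:

```python
import math

def contrastive_loss(scores, target):
    """In-batch softmax cross-entropy over one query's MaxSim scores
    against every document in the batch; `target` is the positive's index."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def matryoshka_loss(scores_per_dim, target):
    """Equal-weight average of the contrastive loss computed at each
    Matryoshka truncation dimension (e.g. 128 ... 2048)."""
    losses = [contrastive_loss(s, target) for s in scores_per_dim]
    return sum(losses) / len(losses)

# toy: MaxSim scores at two truncation dims, 3 docs in batch, positive at index 0
loss = matryoshka_loss([[5.0, 1.0, 0.5], [4.0, 1.5, 0.2]], target=0)
```

Because every truncation dimension contributes equally to the loss, the leading components of each embedding are pushed to carry most of the retrieval signal, which is what makes post-hoc truncation cheap.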
## License

Apache 2.0 (inherited from the base model).
adapter_config.json (ADDED)
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "/data2/cmdir/home/test01/longvnu/stable_diff/models/Qwen/Qwen3.5-2B-Base",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "in_proj_a",
    "out_proj",
    "in_proj_b",
    "down_proj",
    "gate_proj",
    "k_proj",
    "v_proj",
    "in_proj_qkv",
    "in_proj_z",
    "o_proj",
    "q_proj",
    "up_proj"
  ],
  "target_parameters": null,
  "task_type": "FEATURE_EXTRACTION",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors (ADDED)
version https://git-lfs.github.com/spec/v1
oid sha256:2c95cbe1a9432a163bade15405fe97756881b7cb5b50d3d6d46b82b9e8b08411
size 134609728