Commit feb9ad6 (parent: 3ec3a13): update README, usage implementation

Files changed:
- README.md (+139)
- adapter_config.json (+51)
- adapter_model.safetensors (+3)

README.md (ADDED)
---
base_model: Qwen/Qwen3.5-2B-Base
library_name: peft
license: apache-2.0
language:
- en
tags:
- peft
- transformers
- document-retrieval
- multi-vector-embedding
- colpali
- matryoshka
- feature-extraction
---

# ColQwen3.5-2B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval adapter, fine-tuned on top of [Qwen/Qwen3.5-2B-Base](https://huggingface.co/Qwen/Qwen3.5-2B-Base).

**2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning**

## Description

Inspired by [ColPali](https://arxiv.org/abs/2407.01449), this model encodes each document page image into a sequence of contextualized patch embeddings and uses **late-interaction MaxSim** scoring for retrieval.
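Late-interaction MaxSim is simple to state: for each query token embedding, take the maximum dot product over all document patch embeddings, then sum over the query tokens. A minimal pure-Python sketch with toy 2-dim vectors (illustrative only, not the model's actual embeddings):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction MaxSim: for each query token embedding, take the
    max dot product over all document patch embeddings, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# toy 2-dim embeddings: 2 query tokens vs. 2 document patches
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.7, 0.7]]
print(maxsim_score(query, doc))  # 1.0 + 0.7 = 1.7
```

Because every query token independently picks its best-matching patch, the score rewards documents that cover all parts of the query rather than matching a single pooled vector.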
Trained with **Matryoshka Representation Learning**, so embeddings can be truncated to any of `[128, 256, 512, 1024, 2048]` dimensions without retraining, enabling flexible accuracy/speed tradeoffs.
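In practice, Matryoshka truncation is just slicing off the leading components and L2-renormalizing so dot-product scores stay comparable. A sketch, assuming unit-normalized embeddings (the helper name is hypothetical, not part of this repo's API):

```python
def truncate_embedding(vec, dim):
    """Keep the first `dim` components of a Matryoshka-trained embedding,
    then L2-renormalize so dot-product / MaxSim scores remain comparable."""
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a full 2048-dim embedding
short = truncate_embedding(full, 2)  # e.g. serve at 128 of 2048 dims
print(short)  # both components become 1/sqrt(2)
```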
## Evaluations

All numbers are **NDCG@5**.

### ViDoRe v1

Evaluated on the [ViDoRe v1 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark) (a single relevant document per query).

| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| **Average** | **0.8688** | **0.8736** | **0.8750** | **0.8799** | **0.8815** |
| 46 |
+
|
| 47 |
+
### ViDoRe v2
|
| 48 |
+
Evaluated on the [ViDoRe v2 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark-v2) (BEIR format, multi-relevant graded qrels — harder than v1).
|
| 49 |
+
|
| 50 |
+
> v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score ≥ 1 = relevant).
|
| 51 |
+
|
| 52 |
+
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|
| 53 |
+
|---|---|---|---|---|---|---|---|
|
| 54 |
+
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
|
| 55 |
+
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
|
| 56 |
+
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
|
| 57 |
+
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
|
| 58 |
+
| **Average** | | | **0.5121** | **0.5273** | **0.5334** | **0.5457** | **0.5498** |
|
| 59 |
+
|
### Combined Average (v1 + v2 macro)

| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| **Overall avg** | **0.6905** | **0.7005** | **0.7042** | **0.7128** | **0.7157** |
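The overall row is the unweighted (macro) mean of the two benchmark averages, i.e. both suites count equally regardless of how many datasets each contains. Reproducing it from the per-suite averages:

```python
# per-dim NDCG@5 averages from the two tables above
v1 = {128: 0.8688, 256: 0.8736, 512: 0.8750, 1024: 0.8799, 2048: 0.8815}
v2 = {128: 0.5121, 256: 0.5273, 512: 0.5334, 1024: 0.5457, 2048: 0.5498}

# unweighted (macro) mean over the two benchmark suites
overall = {d: (v1[d] + v2[d]) / 2 for d in v1}
print(overall[128])  # ~0.6905, the reported value after rounding
```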
### Comparison with the 0.8B variant

| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| **ColQwen3.5-2B** | **2B** | **0.8799** | **0.5457** |
| Δ | | +0.0174 | **+0.0651** |

> The 2B model shows its largest gains on v2 (harder, multi-relevant), consistent with larger models being more robust in harder retrieval settings.
## Limitations

- **Training data:** Fine-tuned on [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) for **1 epoch** on a single A100-80GB. The set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out of distribution. No hard negatives were used.
- **Language:** Training data is predominantly English; performance on non-English documents is expected to degrade.
- **LoRA adapter:** Must be loaded on top of the base `Qwen/Qwen3.5-2B-Base` weights.
- **Matryoshka tradeoff:** Truncating to 128 dims costs ~1.3 NDCG@5 points vs 2048 dims on v1 (0.8688 vs 0.8815) and ~3.8 points on v2 (0.5121 vs 0.5498).
## Usage

### Requirements

```text
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```

### Example

```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]

documents = [
    {"image": "/path/to/document_page.png"},
]

qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
## Training Details

| Setting | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B-Base |
| **Training data** | [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) (~118K pairs) |
| **Epochs** | 1 |
| **Batch size** | 8 per device × 4 grad accum = 32 effective |
| **Learning rate** | 5e-5 (cosine schedule, 2.5% warmup) |
| **Optimizer** | paged_adamw_8bit |
| **LoRA rank** | r=32, α=32 |
| **LoRA targets** | All linear layers (attention + MLP + DeltaNet) |
| **Loss** | Matryoshka MaxSim (dims 128, 256, 512, 1024, 2048; equal weights) |
| **Precision** | bfloat16 |
| **Hardware** | 1× NVIDIA A100-SXM4-80GB |
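The Matryoshka MaxSim loss row can be read as: compute an in-batch contrastive loss from the MaxSim scores at each truncation dimension, then average the per-dim losses with equal weights. A hedged pure-Python sketch; the function names and toy scores are illustrative, not the actual training code:

```python
import math

def contrastive_loss(scores, target):
    """In-batch softmax cross-entropy over one query's MaxSim scores
    against every document in the batch; `target` is the positive's index."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def matryoshka_loss(scores_per_dim, target):
    """Equal-weight average of the contrastive loss computed at each
    Matryoshka truncation dimension (e.g. 128 ... 2048)."""
    losses = [contrastive_loss(s, target) for s in scores_per_dim]
    return sum(losses) / len(losses)

# toy: MaxSim scores at two truncation dims, 3 docs in batch, positive at index 0
loss = matryoshka_loss([[5.0, 1.0, 0.5], [4.0, 1.5, 0.2]], target=0)
```

Because every truncation dimension contributes equally to the loss, the leading components of each embedding are pushed to carry most of the retrieval signal, which is what makes post-hoc truncation cheap.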
## License

Apache 2.0 (inherited from the base model).
adapter_config.json (ADDED)
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "/data2/cmdir/home/test01/longvnu/stable_diff/models/Qwen/Qwen3.5-2B-Base",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "in_proj_a",
    "out_proj",
    "in_proj_b",
    "down_proj",
    "gate_proj",
    "k_proj",
    "v_proj",
    "in_proj_qkv",
    "in_proj_z",
    "o_proj",
    "q_proj",
    "up_proj"
  ],
  "target_parameters": null,
  "task_type": "FEATURE_EXTRACTION",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors (ADDED)
version https://git-lfs.github.com/spec/v1
oid sha256:2c95cbe1a9432a163bade15405fe97756881b7cb5b50d3d6d46b82b9e8b08411
size 134609728