leo-vnuuet committed
Commit feb9ad6 · 1 Parent(s): 3ec3a13

update README, usage implementation

Files changed (3)
  1. README.md +139 -0
  2. adapter_config.json +51 -0
  3. adapter_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,139 @@
---
base_model: Qwen/Qwen3.5-2B-Base
library_name: peft
license: apache-2.0
language:
- en
tags:
- peft
- transformers
- document-retrieval
- multi-vector-embedding
- colpali
- matryoshka
- feature-extraction
---

# ColQwen3.5-2B-Embedding (LoRA Adapter)

A ColBERT-style multi-vector document retrieval adapter fine-tuned on top of [Qwen/Qwen3.5-2B-Base](https://huggingface.co/Qwen/Qwen3.5-2B-Base).

**2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning**

## Description

Inspired by [ColPali](https://arxiv.org/abs/2407.01449), this model encodes document page images into a sequence of contextualized patch embeddings and scores query-document pairs with **late-interaction MaxSim** retrieval.

Trained with **Matryoshka Representation Learning**, so embeddings can be truncated to any of `[128, 256, 512, 1024, 2048]` dimensions without retraining, enabling flexible accuracy/speed tradeoffs.
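Late-interaction MaxSim sums, over query tokens, the maximum similarity to any document patch embedding. A minimal sketch of that scoring step, assuming L2-normalized token embeddings and boolean padding masks (the function and argument names are illustrative, not this repo's API):

```python
import torch

def maxsim_score(q_emb, d_emb, q_mask, d_mask):
    """Late-interaction MaxSim.

    q_emb: (num_queries, q_len, dim)  L2-normalized query token embeddings
    d_emb: (num_docs, d_len, dim)     L2-normalized document patch embeddings
    q_mask, d_mask: padding masks of shape (num_queries, q_len) / (num_docs, d_len)
    Returns scores of shape (num_queries, num_docs).
    """
    # All pairwise token similarities: (num_queries, num_docs, q_len, d_len)
    sim = torch.einsum("qld,nkd->qnlk", q_emb, d_emb)
    # Exclude padded document positions from the max
    sim = sim.masked_fill(~d_mask.bool()[None, :, None, :], float("-inf"))
    # Max over document patches, then sum over non-padded query tokens
    best = sim.max(dim=-1).values
    best = best.masked_fill(~q_mask.bool()[:, None, :], 0.0)
    return best.sum(dim=-1)
```

Since each normalized token similarity is bounded by 1, a query with `q_len` tokens scores at most `q_len` against a perfectly matching page.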
## Evaluations

All numbers are **NDCG@5**.

### ViDoRe v1

Evaluated on the [ViDoRe v1 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark) (single relevant document per query).

| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| **Average** | **0.8688** | **0.8736** | **0.8750** | **0.8799** | **0.8815** |

### ViDoRe v2

Evaluated on the [ViDoRe v2 benchmark](https://huggingface.co/collections/vidore/vidore-benchmark-v2) (BEIR format with graded, multi-relevant qrels; harder than v1).

> v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 documents), and relevance is graded (score ≥ 1 = relevant).

| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
| **Average** | | | **0.5121** | **0.5273** | **0.5334** | **0.5457** | **0.5498** |

### Combined Average (v1 + v2 macro)

| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| **Overall avg** | **0.6905** | **0.7005** | **0.7042** | **0.7128** | **0.7157** |

### Comparison with the 0.8B variant

| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| **ColQwen3.5-2B** | **2B** | **0.8799** | **0.5457** |
| Δ | | +0.0174 | **+0.0651** |

> The 2B model shows its largest gains on v2 (the harder, multi-relevant setting), consistent with larger models being more robust to harder retrieval tasks.

## Limitations

- **Training data:** Fine-tuned on [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) for **1 epoch** on a single A100-80GB. The set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out of distribution. No hard negatives were used.
- **Language:** Predominantly English training data; performance on non-English documents is expected to degrade.
- **LoRA adapter:** Must be loaded on top of the base `Qwen/Qwen3.5-2B-Base` weights.
- **Matryoshka tradeoff:** Truncating to 128 dims costs ~1.3 NDCG@5 points vs. 2048 dims on v1 (0.8688 vs. 0.8815) and ~3.8 points on v2 (0.5121 vs. 0.5498).
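The Matryoshka truncation behind this tradeoff is just slicing off the leading components and re-normalizing. A minimal sketch (names are illustrative, not this repo's API):

```python
import torch
import torch.nn.functional as F

def truncate_embeddings(emb, dim):
    # Keep the leading `dim` components of each token vector, then
    # L2-renormalize so MaxSim dot products stay cosine similarities.
    return F.normalize(emb[..., :dim], p=2, dim=-1)

full = F.normalize(torch.randn(1, 10, 2048), dim=-1)  # (batch, tokens, 2048)
small = truncate_embeddings(full, 128)                # (batch, tokens, 128)
```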
## Usage

### Requirements

```text
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```

### Example

```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128,  # any of 128 / 256 / 512 / 1024 / 2048
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]

documents = [
    {"image": "/path/to/document_page.png"},
]

# Encode queries and document pages into multi-vector embeddings.
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# Late-interaction MaxSim scores; shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
## Training Details

| Setting | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B-Base |
| **Training data** | [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) (~118K pairs) |
| **Epochs** | 1 |
| **Batch size** | 8 per device × 4 gradient accumulation = 32 effective |
| **Learning rate** | 5e-5 (cosine schedule, 2.5% warmup) |
| **Optimizer** | paged_adamw_8bit |
| **LoRA rank** | r=32, α=32 |
| **LoRA targets** | All linear layers (attention + MLP + DeltaNet) |
| **Loss** | Matryoshka MaxSim (dims 128, 256, 512, 1024, 2048; equal weights) |
| **Precision** | bfloat16 |
| **Hardware** | 1× NVIDIA A100-SXM4-80GB |
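One common way to realize a Matryoshka MaxSim objective with equal weights is to compute an in-batch contrastive MaxSim loss at each truncation and average. A hedged sketch under those assumptions (function names and shapes are illustrative, not the actual training code):

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [128, 256, 512, 1024, 2048]

def matryoshka_maxsim_loss(q_emb, d_emb, q_mask, d_mask):
    # q_emb: (B, Lq, 2048), d_emb: (B, Ld, 2048); positives on the diagonal.
    labels = torch.arange(q_emb.size(0))
    losses = []
    for dim in MATRYOSHKA_DIMS:
        q = F.normalize(q_emb[..., :dim], dim=-1)   # truncate + renormalize
        d = F.normalize(d_emb[..., :dim], dim=-1)
        sim = torch.einsum("qld,nkd->qnlk", q, d)   # token-level similarities
        sim = sim.masked_fill(~d_mask.bool()[None, :, None, :], float("-inf"))
        best = sim.max(dim=-1).values
        best = best.masked_fill(~q_mask.bool()[:, None, :], 0.0)
        scores = best.sum(dim=-1)                   # MaxSim matrix: (B, B)
        losses.append(F.cross_entropy(scores, labels))  # in-batch contrastive
    return torch.stack(losses).mean()               # equal weights over dims
```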
## License

Apache 2.0 (inherits from the base model).
adapter_config.json ADDED
@@ -0,0 +1,51 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "/data2/cmdir/home/test01/longvnu/stable_diff/models/Qwen/Qwen3.5-2B-Base",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "in_proj_a",
    "out_proj",
    "in_proj_b",
    "down_proj",
    "gate_proj",
    "k_proj",
    "v_proj",
    "in_proj_qkv",
    "in_proj_z",
    "o_proj",
    "q_proj",
    "up_proj"
  ],
  "target_parameters": null,
  "task_type": "FEATURE_EXTRACTION",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2c95cbe1a9432a163bade15405fe97756881b7cb5b50d3d6d46b82b9e8b08411
size 134609728