---
license: mit
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- Metric-AI/rag_docmatix_100k
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
- Metric-AI/tabfquad_train_set
language:
- en
- fr
- es
- it
- de
base_model:
- Metric-AI/ColQwenStella-base-2b
- Qwen/Qwen2-VL-2B
- NovaSearch/stella_en_1.5B_v5
tags:
- vidore
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
library_name: peft
pipeline_tag: visual-document-retrieval
---
# ColQwenStella-2b-multilingual: Multilingual Visual Retriever based on the combination of the Qwen2 vision encoder and the stella_en_1.5B_v5 model

## Ranked #1 among models ≤ 2B parameters and #8 overall on the Vidore benchmark (as of February 11, 2025). The reported scores on the [Vidore Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard) correspond to checkpoint-1800.

### This is the base version, trained on 4xA100 80GB GPUs with per_device_batch_size=128 for 5 epochs.

The ColQwenStella-2b-multilingual architecture combines the vision component of the Qwen2 model with stella_en_1.5B_v5 as its embedding model. Training follows the [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) recipe.

## Data
- **Synthetic data**: Selected and preprocessed from the `openbmb/VisRAG-Ret-Train-Synthetic-data` dataset.
- **In-domain VQA dataset**: Drawn from `openbmb/VisRAG-Ret-Train-In-domain-data`.
- **Docmatix dataset**: Extracted from the `Metric-AI/rag_docmatix_100k` dataset.
- **Colpali dataset**: Taken from `vidore/colpali_train_set`.
- **Multilingual dataset**: Taken from `llamaindex/vdr-multilingual-train` (see the loading sketch below).
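
To peek at any of these sources, here is a minimal sketch using the 🤗 `datasets` library (the `"train"` split name is an assumption; check each dataset card for the exact splits and schema):

```python
from datasets import load_dataset

# Stream one of the training sources to inspect its schema without
# downloading the full dataset. The "train" split name is an assumption.
ds = load_dataset("vidore/colpali_train_set", split="train", streaming=True)
print(next(iter(ds)).keys())
```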

## Model Training

### Parameters
We train the model with low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685)),
using `alpha=128` and `r=128` on the transformer layers of the language model and the `mlp` layers of `vision_model.merger`,
as well as on the final randomly initialized projection layer, with the `adamw` optimizer.
We train on a 4xA100 GPU setup with distributed data parallelism (via accelerate), a learning rate of 5e-4 with cosine decay and 100 warmup steps, and a per-device batch size of 128, in `bfloat16` format.
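
A minimal `peft` sketch of such an adapter configuration is below; the `target_modules` regex is an assumption for illustration, not the exact training script:

```python
from peft import LoraConfig

# Sketch of the reported LoRA hyperparameters (r=128, alpha=128).
# The target_modules regex is an assumption; the actual run may
# select the language-model and vision-merger layers differently.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=r".*(language_model.*(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|vision_model\.merger\.mlp).*",
)
```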

## Installation

```bash
pip install "transformers>=4.46.3"
```

## Usage

```python
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "Metric-AI/ColQwenStella-2b-multilingual",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained("Metric-AI/ColQwenStella-2b-multilingual", trust_remote_code=True)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction similarity matrix of shape (num_queries, num_images)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
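
`scores` is a tensor of shape `(num_queries, num_images)`; for example, `scores.argmax(dim=1)` returns the best-matching image index for each query. Following the ColPali recipe, `score_multi_vector` computes a MaxSim-style late-interaction score; a rough, unbatched sketch of that computation (assuming unpadded, equal-length embedding tensors) is:

```python
# Rough MaxSim sketch: for each query token, take its best-matching
# image-patch token, then sum those maxima over the query tokens.
# Assumes unpadded (batch, seq_len, dim) tensors; the real
# score_multi_vector also handles variable lengths and batching.
sim = torch.einsum("qnd,pmd->qpnm", query_embeddings, image_embeddings)
maxsim = sim.max(dim=-1).values.sum(dim=-1)  # (num_queries, num_images)
```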

## License

The adapters attached to the model are under the MIT license.

- **Developed by:** [Metric AI Research Lab](https://metric.am/)