Btoks Qwen3VL 2B Instruct
Btoks is a multimodal embedding model fine-tuned from
Qwen/Qwen3-VL-2B-Instruct with the Bottleneck Tokens method.
- Model name on MMEB-V2 leaderboard: Btoks
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Paper: https://arxiv.org/abs/2604.11095
- Model repository: https://huggingface.co/siyrus/Btoks-Qwen3VL-2B-Instruct
MMEB-V2
Local validation with the public MMEB-V2 leaderboard scripts gives:
| Overall | Image-Overall | Video-Overall | Visdoc-Overall |
|---|---|---|---|
| 68.29 | 71.55 | 49.12 | 77.77 |
These numbers are not expected to match the paper tables one-to-one. After the paper experiments, we fixed a small number of data and evaluation bugs and retrained with a larger data mixture and at a larger scale.
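The Overall column is not the plain average of the three modality columns (that would be about 66.15). A minimal sketch of how the four numbers relate, assuming Overall is the mean over all individual tasks and that MMEB-V2 has 36 image, 18 video, and 24 visual-document tasks; these task counts are an assumption taken from the benchmark's public description, not something this card states:

```python
# Per-modality scores copied from the table above.
scores = {"image": 71.55, "video": 49.12, "visdoc": 77.77}

# ASSUMPTION: number of MMEB-V2 tasks per modality (not stated in this card).
task_counts = {"image": 36, "video": 18, "visdoc": 24}


def weighted_overall(scores, counts):
    """Average the modality scores, weighting each by its task count."""
    total_tasks = sum(counts.values())
    return sum(scores[name] * counts[name] for name in scores) / total_tasks


overall = weighted_overall(scores, task_counts)
print(round(overall, 2))  # 68.29
```

Under these assumed counts the weighted mean reproduces the reported 68.29, which suggests the video score pulls the overall number down less than a simple three-way average would imply.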
Training Summary
The model was trained with a compact BToks/SIEVE setup on top of Qwen3-VL:
- BToks / SIEVE tokens: 4
- LoRA rank / alpha / dropout: 16 / 32 / 0.05
- DoRA: enabled
- Generation-loss weight: 0.2
- Training steps: 5000
- Exported checkpoint: step 4500
- Contrastive temperature: 0.02
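To make the temperature and generation-loss weight concrete, here is a minimal pure-Python sketch of how the two hyperparameters could enter a training objective. It assumes a standard in-batch InfoNCE contrastive loss on L2-normalized embeddings and a simple weighted sum with the generation loss; the card does not spell out the exact loss, so the function names and the form of the combination are illustrative only, not the released training code:

```python
import math

TEMPERATURE = 0.02      # contrastive temperature from the list above
GEN_LOSS_WEIGHT = 0.2   # generation-loss weight from the list above


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, targets, temperature=TEMPERATURE):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i],
    and every other target in the batch acts as a negative."""
    queries = [l2_normalize(q) for q in queries]
    targets = [l2_normalize(t) for t in targets]
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def combined_loss(contrastive, generation, weight=GEN_LOSS_WEIGHT):
    """ASSUMPTION: total loss = contrastive + weight * generation."""
    return contrastive + weight * generation
```

With a temperature of 0.02, a cosine-similarity gap of 1.0 between the positive and a negative becomes a logit gap of 50, so the loss concentrates almost entirely on hard negatives whose similarity is close to the positive's.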
The main training data sources include:
- Image classification, VQA, retrieval, and grounding data, including ImageNet, N24News, VOC2007, SUN397, OK-VQA, A-OKVQA, DocVQA, ChartQA, Visual7W, GQA, TextVQA, VizWiz, VisDial, CIRR, VisualNews, MSCOCO, WebQA, FashionIQ, Wiki-SS-NQ, OVEN, EDIS, INFOSEEK, Fashion200K, and RefCOCO.
- Visual document retrieval data from the ColPali / ViDoRe family and VisRAG in-domain training data.
- Video classification, retrieval, QA, and moment-retrieval data, including Kinetics-700, Something-Something V2, HMDB51, UCF101, MSR-VTT, MSVD, DiDeMo, YouCook2, ActivityNet Captions, VATEX, QVHighlights, Charades-STA, NExTQA, and VideoChat2-IT.
Some source names overlap with MMEB-V2 benchmark dataset names because they come from the same public dataset families. The MMEB-V2 benchmark records were removed from the admitted training splits where overlap was possible, so the leaderboard evaluation does not use leaked benchmark samples for training.
Loading
This repository stores a merged inference checkpoint for the VLM2Emb/Btoks embedding wrapper. It is not a plain Qwen3-VL causal language model.
The public inference/evaluation code is being prepared and is expected to be released in about one week. Loading examples will be added once that code release is ready.