---
license: other
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.5-0.8B
datasets:
- yuandaxia/FashionMV
---
# ProCIR — Multi-View Product-Level Composed Image Retrieval

[[Paper (arXiv)]](https://arxiv.org/abs/2604.10297) | [[Code (GitHub)]](https://github.com/yuandaxia2001/FashionMV) | [[Dataset]](https://huggingface.co/datasets/yuandaxia/FashionMV)
## Model Description

**ProCIR** (0.8B) is a multi-view composed image retrieval model trained on the [FashionMV](https://huggingface.co/datasets/yuandaxia/FashionMV) dataset and built on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B). It adopts a perception-reasoning decoupled dialogue architecture and leverages image-text alignment to inject product knowledge, enabling effective multi-view product-level CIR.
## Performance

| Dataset | R@5 | R@10 |
|---------|-----|------|
| DeepFashion | 89.2 | 94.9 |
| Fashion200K | 77.6 | 86.6 |
| FashionGen-val | 75.0 | 85.3 |
| **Average** | **80.6** | **88.9** |
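For reference, Recall@K (R@K) counts a query as a hit when its ground-truth gallery item appears among the K most similar candidates. A minimal sketch of this metric, using a toy similarity matrix (the function name `recall_at_k` and the example scores are illustrative, not from the ProCIR codebase):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, targets, k: int) -> float:
    """Fraction of queries whose ground-truth gallery index
    appears among the top-k most similar candidates."""
    # Rank gallery items per query by descending similarity.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries scored against 4 gallery items.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # query 0: target 0 ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # query 1: target 1 ranked 2nd
    [0.1, 0.2, 0.3, 0.4],  # query 2: target 0 ranked last
])
targets = [0, 1, 0]
print(recall_at_k(sim, targets, 1))  # 1 of 3 queries hit
print(recall_at_k(sim, targets, 2))  # 2 of 3 queries hit
```

In practice the similarity matrix comes from comparing composed-query embeddings against gallery-image embeddings; the reported numbers above use the evaluation code in the GitHub repository.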
## Usage

See our [GitHub repository](https://github.com/yuandaxia2001/FashionMV) for evaluation code and data preparation instructions.
```python
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

processor = AutoProcessor.from_pretrained("yuandaxia/ProCIR")
model = Qwen3_5ForConditionalGeneration.from_pretrained("yuandaxia/ProCIR", torch_dtype="bfloat16")
```
## Citation

```bibtex
@article{yuan2026fashionmv,
  title={FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data},
  author={Yuan, Peng and Mei, Bingyin and Zhang, Hui},
  journal={arXiv preprint arXiv:2604.10297},
  year={2026}
}
```
## License

Model weights are released under the same license as the base model ([Qwen3.5](https://huggingface.co/Qwen/Qwen3.5-0.8B)).