---
license: other
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.5-0.8B
datasets:
- yuandaxia/FashionMV
---
# ProCIR — Multi-View Product-Level Composed Image Retrieval

[[Paper (arXiv)]](https://arxiv.org/abs/2604.10297) | [[Code (GitHub)]](https://github.com/yuandaxia2001/FashionMV) | [[Dataset]](https://huggingface.co/datasets/yuandaxia/FashionMV)
## Model Description

**ProCIR** (0.8B) is a multi-view composed image retrieval model trained on the [FashionMV](https://huggingface.co/datasets/yuandaxia/FashionMV) dataset and built on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B). It adopts a perception-reasoning decoupled dialogue architecture and leverages image-text alignment to inject product knowledge, enabling effective multi-view product-level CIR.
## Performance

| Dataset | R@5 | R@10 |
|---------|-----|------|
| DeepFashion | 89.2 | 94.9 |
| Fashion200K | 77.6 | 86.6 |
| FashionGen-val | 75.0 | 85.3 |
| **Average** | **80.6** | **88.9** |
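For reference, Recall@K (R@K) counts a query as a hit when its ground-truth gallery item appears among the K most similar candidates. A minimal sketch of this metric, using a toy similarity matrix (the function name `recall_at_k` and the example scores are illustrative, not from the ProCIR codebase):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, targets, k: int) -> float:
    """Fraction of queries whose ground-truth gallery index
    appears among the top-k most similar candidates."""
    # Rank gallery items per query by descending similarity.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries scored against 4 gallery items.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # query 0: target 0 ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # query 1: target 1 ranked 2nd
    [0.1, 0.2, 0.3, 0.4],  # query 2: target 0 ranked last
])
targets = [0, 1, 0]
print(recall_at_k(sim, targets, 1))  # 1 of 3 queries hit
print(recall_at_k(sim, targets, 2))  # 2 of 3 queries hit
```

In practice the similarity matrix comes from comparing composed-query embeddings against gallery-image embeddings; the reported numbers above use the evaluation code in the GitHub repository.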
## Usage

See our [GitHub repository](https://github.com/yuandaxia2001/FashionMV) for evaluation code and data preparation instructions.
```python
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

processor = AutoProcessor.from_pretrained("yuandaxia/ProCIR")
model = Qwen3_5ForConditionalGeneration.from_pretrained("yuandaxia/ProCIR", torch_dtype="bfloat16")
```
## Citation

```bibtex
@article{yuan2026fashionmv,
  title={FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data},
  author={Yuan, Peng and Mei, Bingyin and Zhang, Hui},
  journal={arXiv preprint arXiv:2604.10297},
  year={2026}
}
```
## License

Model weights are released under the same license as the base model ([Qwen3.5](https://huggingface.co/Qwen/Qwen3.5-0.8B)).