--- license: mit base_model: - Qwen/Qwen2.5-VL-3B-Instruct --- ## Model Overview This model is introduced in the paper: **PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images** https://arxiv.org/abs/2509.25185 PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding. ## Intended Use The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning. ## Inference The reference inference implementation is provided in the PixelCraft repository. The grounding and inference logic can be found at: https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py Please refer to this script for: - Model loading - Input preprocessing - Grounding and inference execution - Output formats Users are expected to follow the provided implementation when running inference with this model. ## Citation If you find this work helpful in your research, please cite our paper: ```bibtex @article{zhang2025pixelcraft, title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images}, author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui}, journal={arXiv preprint arXiv:2509.25185}, year={2025} } ```