---
license: mit
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---
## Model Overview

This model is introduced in the paper:

**PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images**  
https://arxiv.org/abs/2509.25185

PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.

## Intended Use

The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.

## Inference

The reference inference implementation is provided in the PixelCraft repository.

The grounding and inference logic can be found at:

https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py

Please refer to this script for:
- Model loading
- Input preprocessing
- Grounding and inference execution
- Output formats

Users are expected to follow the provided implementation when running inference with this model.

## Citation

If you find this work helpful in your research, please cite our paper:

```bibtex
@article{zhang2025pixelcraft,
  title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
  author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2509.25185},
  year={2025}
}
```