File size: 1,464 Bytes
3246894 c1b5caa d24275b c1b5caa 3246894 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | ---
license: mit
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---
## Model Overview
This model is introduced in the paper:
**PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images**
https://arxiv.org/abs/2509.25185
PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.
## Intended Use
The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.
## Inference
The reference inference implementation is provided in the PixelCraft repository.
The grounding and inference logic can be found at:
https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py
Please refer to this script for:
- Model loading
- Input preprocessing
- Grounding and inference execution
- Output formats
Users are expected to follow the provided implementation when running inference with this model.
## Citation
If you find this work helpful in your research, please cite our paper:
```bibtex
@article{zhang2025pixelcraft,
title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
journal={arXiv preprint arXiv:2509.25185},
year={2025}
}
``` |