PixelCraft-3B / README.md
zss01's picture
Update README.md
d24275b verified
---
license: mit
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---
## Model Overview
This model is introduced in the paper:
**PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images**
https://arxiv.org/abs/2509.25185
PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.
## Intended Use
The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.
## Inference
The reference inference implementation is provided in the PixelCraft repository.
The grounding and inference logic can be found at:
https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py
Please refer to this script for:
- Model loading
- Input preprocessing
- Grounding and inference execution
- Output formats
Users are expected to follow the provided implementation when running inference with this model.
## Citation
If you find this work helpful in your research, please cite our paper:
```bibtex
@article{zhang2025pixelcraft,
title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
journal={arXiv preprint arXiv:2509.25185},
year={2025}
}
```