| | --- |
| | license: mit |
| | base_model: |
| | - Qwen/Qwen2.5-VL-3B-Instruct |
| | --- |
| | ## Model Overview |
| |
|
| | This model is introduced in the paper: |
| |
|
| | **PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images** |
| | https://arxiv.org/abs/2509.25185 |
| |
|
| | PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding. |
| |
|
| | ## Intended Use |
| |
|
| | The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning. |
| |
|
| | ## Inference |
| |
|
| | The reference inference implementation is provided in the PixelCraft repository. |
| |
|
| | The grounding and inference logic can be found at: |
| |
|
| | https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py |
| |
|
| | Please refer to this script for: |
| | - Model loading |
| | - Input preprocessing |
| | - Grounding and inference execution |
| | - Output formats |
| |
|
| | Users are expected to follow the provided implementation when running inference with this model. |
| |
|
| | ## Citation |
| |
|
| | If you find this work helpful in your research, please cite our paper: |
| |
|
| | ```bibtex |
| | @article{zhang2025pixelcraft, |
| | title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images}, |
| | author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui}, |
| | journal={arXiv preprint arXiv:2509.25185}, |
| | year={2025} |
| | } |
| | ``` |