zss01
/

PixelCraft-3B

Model card Files Files and versions

PixelCraft-3B / README.md

zss01's picture

Update README.md

d24275b verified about 2 months ago

|

history blame contribute delete

1.46 kB

	---
	license: mit
	base_model:
	- Qwen/Qwen2.5-VL-3B-Instruct
	---
	## Model Overview

	This model is introduced in the paper:

	PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
	https://arxiv.org/abs/2509.25185

	PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.

	## Intended Use

	The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.

	## Inference

	The reference inference implementation is provided in the PixelCraft repository.

	The grounding and inference logic can be found at:

	https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py

	Please refer to this script for:
	- Model loading
	- Input preprocessing
	- Grounding and inference execution
	- Output formats

	Users are expected to follow the provided implementation when running inference with this model.

	## Citation

	If you find this work helpful in your research, please cite our paper:

	```bibtex
	@article{zhang2025pixelcraft,
	title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
	author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
	journal={arXiv preprint arXiv:2509.25185},
	year={2025}
	}
	```