---
license: apache-2.0
base_model:
  - Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - ocr
  - vtr
  - vision-language
  - multimodal
---

# TextPecker-8B-Qwen3VL

TextPecker is a structural-anomaly-perception model designed to enhance Visual Text Rendering (VTR). It addresses a key bottleneck: standard MLLMs and OCR models fail to perceive structural anomalies such as distortion, blurriness, and misalignment in generated text. TextPecker acts as a plug-and-play evaluator and as a reward signal for RL-based optimization (e.g., with Flow-GRPO), enabling the generation of structurally faithful visual text.

This checkpoint is built upon the Qwen3-VL-8B-Instruct architecture and was trained using ms-swift.

## Model Details

## Uses

TextPecker can be used to evaluate the structural quality and semantic consistency of text in text-to-image generation or editing outputs. It is particularly useful for:

- **Structural Anomaly Quantification:** identifying distortion, blurriness, and misalignment in rendered text.
- **Reward Modeling:** providing reward signals for Reinforcement Learning (RL) to improve text rendering in generators such as Flux or SD3.5.
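As a sketch of the reward-modeling use, the evaluator's textual verdict can be mapped to a scalar reward for RL fine-tuning. The score convention and parsing below are illustrative assumptions, not TextPecker's official output protocol:

```python
import re


def score_to_reward(model_output: str, max_score: float = 10.0) -> float:
    """Parse a numeric quality score from the evaluator's text output and
    normalize it to a [0, 1] reward for RL (e.g., Flow-GRPO).

    Assumes the model emits a number such as 'Score: 7.5' somewhere in its
    response -- an illustrative convention, not the official format."""
    match = re.search(r"\d+(?:\.\d+)?", model_output)
    if match is None:
        return 0.0  # unparseable output -> zero reward
    score = min(max(float(match.group(0)), 0.0), max_score)  # clamp to range
    return score / max_score
```

A generator's RL loop would then call this on each evaluator response and feed the normalized value back as the per-sample reward.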

To use this model, please follow the official deployment and testing instructions.
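For orientation, a minimal inference sketch using the `transformers` image-text-to-text API is shown below. The Hub repo id, prompt wording, and rating scale are assumptions for illustration; the official instructions define the exact evaluation protocol:

```python
# Hedged sketch: querying TextPecker as a text-rendering evaluator.
# The repo id, prompt wording, and 0-10 scale are assumptions, not the
# model's documented interface.


def build_messages(image_path: str, expected_text: str) -> list:
    """Build a chat-format request asking the evaluator to rate a rendering."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text",
             "text": (f"The image should render the text '{expected_text}'. "
                      "Rate its structural quality (distortion, blurriness, "
                      "misalignment) on a 0-10 scale.")},
        ],
    }]


def evaluate_rendering(image_path: str, expected_text: str) -> str:
    """Run one evaluation pass; needs a GPU-capable environment."""
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "CIawevy/TextPecker-8B-Qwen3VL"  # assumed Hub repo id
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto")
    inputs = processor.apply_chat_template(
        build_messages(image_path, expected_text),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the prompt.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
```

`AutoModelForImageTextToText` matches the card's `image-text-to-text` pipeline tag; the heavy model load is kept inside `evaluate_rendering` so message construction can be reused independently.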

## Citation

If you find TextPecker useful in your research or work, please cite the paper:

@article{zhu2026TextPecker,
  title   = {TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering},
  author  = {Zhu, Hanshen and Liu, Yuliang and Wu, Xuecheng and Wang, An-Lan and Feng, Hao and Yang, Dingkang and Feng, Chao and Huang, Can and Tang, Jingqun and Bai, Xiang},
  journal = {arXiv preprint arXiv:2602.20903},
  year    = {2026}
}