metadata
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- image-text-to-text
- transformers
- qwen3-vl
ReCAP-8B
ReCAP-8B is a vision-language model fine-tuned from
Qwen/Qwen3-VL-8B-Thinking, designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.
This model is introduced in โCAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Trainingโ.
๐ Overview
ReCAP-8B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.
It operates end-to-end:
- Input: raw screenshots
- Output: reasoning + executable GUI actions (click, type, drag)
โจ Key Features
- Unified agent: Handles both CAPTCHA and general GUI tasks
- Reasoning-action modeling: Learns both decisions and execution
- Self-correction: Improves robustness by learning from failures
- Efficient interaction: Generates multiple actions per step
๐ง Capabilities
Supports diverse CAPTCHA types:
- Text / OCR
- Icon selection & matching
- Image grid reasoning
- Slider / drag tasks
- Multi-step interaction challenges
Core skills:
- Visual understanding
- Spatial reasoning
- Continuous control
- Multi-step planning
๐ Performance
- ~71.9% success rate on synthetic CAPTCHA benchmark
- Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
- Maintains competitive performance on general GUI benchmarks
๐ Ethical Considerations
This model is released for research purposes only.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.