ReCAP-8B / README.md
yuxi5's picture
Upload model from sft8
49fa637 verified
metadata
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
tags:
  - vision-language-model
  - image-text-to-text
  - transformers
  - qwen3-vl

ReCAP-8B

ReCAP-8B is a vision-language model fine-tuned from
Qwen/Qwen3-VL-8B-Thinking, designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.

This model is introduced in โ€œCAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Trainingโ€.


๐Ÿš€ Overview

ReCAP-8B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.

It operates end-to-end:

  • Input: raw screenshots
  • Output: reasoning + executable GUI actions (click, type, drag)

โœจ Key Features

  • Unified agent: Handles both CAPTCHA and general GUI tasks
  • Reasoning-action modeling: Learns both decisions and execution
  • Self-correction: Improves robustness by learning from failures
  • Efficient interaction: Generates multiple actions per step

๐Ÿง  Capabilities

Supports diverse CAPTCHA types:

  • Text / OCR
  • Icon selection & matching
  • Image grid reasoning
  • Slider / drag tasks
  • Multi-step interaction challenges

Core skills:

  • Visual understanding
  • Spatial reasoning
  • Continuous control
  • Multi-step planning

๐Ÿ“Š Performance

  • ~71.9% success rate on synthetic CAPTCHA benchmark
  • Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
  • Maintains competitive performance on general GUI benchmarks

๐Ÿ”’ Ethical Considerations

This model is released for research purposes only.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.