ReCAP-8B / README.md

Upload model from sft8

49fa637 verified 8 days ago

1.92 kB

license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
tags:
  - vision-language-model
  - image-text-to-text
  - transformers
  - qwen3-vl

ReCAP-8B

ReCAP-8B is a vision-language model fine-tuned from
Qwen/Qwen3-VL-8B-Thinking, designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.

This model is introduced in “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”.

🚀 Overview

ReCAP-8B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.

It operates end-to-end:

Input: raw screenshots
Output: reasoning + executable GUI actions (click, type, drag)

✨ Key Features

Unified agent: Handles both CAPTCHA and general GUI tasks
Reasoning-action modeling: Learns both decisions and execution
Self-correction: Improves robustness by learning from failures
Efficient interaction: Generates multiple actions per step

🧠 Capabilities

Supports diverse CAPTCHA types:

Text / OCR
Icon selection & matching
Image grid reasoning
Slider / drag tasks
Multi-step interaction challenges

Core skills:

Visual understanding
Spatial reasoning
Continuous control
Multi-step planning

📊 Performance

~71.9% success rate on synthetic CAPTCHA benchmark
Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
Maintains competitive performance on general GUI benchmarks

🔒 Ethical Considerations

This model is released for research purposes only.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.