ReCAP-8B / README.md
yuxi5's picture
Upload model from sft8
49fa637 verified
---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- image-text-to-text
- transformers
- qwen3-vl
---
# ReCAP-8B
ReCAP-8B is a vision-language model fine-tuned from
[Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking), designed to enable **robust CAPTCHA solving within native GUI agents** while preserving general GUI interaction capabilities.
This model is introduced in *“CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”*.
---
## 🚀 Overview
ReCAP-8B extends a general-purpose GUI agent with **CAPTCHA-solving ability** by learning from structured **reasoning-action trajectories**.
It operates end-to-end:
- Input: raw screenshots
- Output: reasoning + executable GUI actions (click, type, drag)
---
## ✨ Key Features
- **Unified agent**: Handles both CAPTCHA and general GUI tasks
- **Reasoning-action modeling**: Learns both decisions and execution
- **Self-correction**: Improves robustness by learning from failures
- **Efficient interaction**: Generates multiple actions per step
---
## 🧠 Capabilities
Supports diverse CAPTCHA types:
- Text / OCR
- Icon selection & matching
- Image grid reasoning
- Slider / drag tasks
- Multi-step interaction challenges
Core skills:
- Visual understanding
- Spatial reasoning
- Continuous control
- Multi-step planning
---
## 📊 Performance
- ~71.9% success rate on synthetic CAPTCHA benchmark
- Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
- Maintains competitive performance on general GUI benchmarks
---
## 🔒 Ethical Considerations
This model is released for **research purposes only**.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.