| --- |
| license: apache-2.0 |
| library_name: transformers |
| base_model: Qwen/Qwen3-VL-8B-Thinking |
| pipeline_tag: image-text-to-text |
| tags: |
| - vision-language-model |
| - image-text-to-text |
| - transformers |
| - qwen3-vl |
| --- |
| |
| # ReCAP-8B |
|
|
| ReCAP-8B is a vision-language model fine-tuned from |
| [Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking), designed to enable **robust CAPTCHA solving within native GUI agents** while preserving general GUI interaction capabilities. |
|
|
| This model is introduced in *“CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”*. |
|
|
| --- |
|
|
| ## 🚀 Overview |
|
|
| ReCAP-8B extends a general-purpose GUI agent with **CAPTCHA-solving ability** by learning from structured **reasoning-action trajectories**. |
|
|
| It operates end-to-end: |
| - Input: raw screenshots |
| - Output: reasoning + executable GUI actions (click, type, drag) |
|
|
| --- |
|
|
| ## ✨ Key Features |
|
|
| - **Unified agent**: Handles both CAPTCHA and general GUI tasks |
| - **Reasoning-action modeling**: Learns both decisions and execution |
| - **Self-correction**: Improves robustness by learning from failures |
| - **Efficient interaction**: Generates multiple actions per step |
|
|
| --- |
|
|
| ## 🧠 Capabilities |
|
|
| Supports diverse CAPTCHA types: |
| - Text / OCR |
| - Icon selection & matching |
| - Image grid reasoning |
| - Slider / drag tasks |
| - Multi-step interaction challenges |
|
|
| Core skills: |
| - Visual understanding |
| - Spatial reasoning |
| - Continuous control |
| - Multi-step planning |
|
|
| --- |
|
|
| ## 📊 Performance |
|
|
| - ~71.9% success rate on synthetic CAPTCHA benchmark |
| - Strong improvements on interaction-heavy tasks (e.g., slider, image grid) |
| - Maintains competitive performance on general GUI benchmarks |
|
|
| --- |
|
|
| ## 🔒 Ethical Considerations |
|
|
| This model is released for **research purposes only**. |
| It is intended to study and improve the robustness of human-verification systems, not to bypass them. |
|
|