ReCAP-Agent
/

ReCAP-8B

Image-Text-to-Text

vision-language-model

Model card Files Files and versions

ReCAP-8B / README.md

yuxi5's picture

Upload model from sft8

49fa637 verified 9 days ago

|

history blame contribute delete

1.92 kB

	---
	license: apache-2.0
	library_name: transformers
	base_model: Qwen/Qwen3-VL-8B-Thinking
	pipeline_tag: image-text-to-text
	tags:
	- vision-language-model
	- image-text-to-text
	- transformers
	- qwen3-vl
	---

	# ReCAP-8B

	ReCAP-8B is a vision-language model fine-tuned from
	[Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking), designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.

	This model is introduced in “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”.

	---

	## 🚀 Overview

	ReCAP-8B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.

	It operates end-to-end:
	- Input: raw screenshots
	- Output: reasoning + executable GUI actions (click, type, drag)

	---

	## ✨ Key Features

	- Unified agent: Handles both CAPTCHA and general GUI tasks
	- Reasoning-action modeling: Learns both decisions and execution
	- Self-correction: Improves robustness by learning from failures
	- Efficient interaction: Generates multiple actions per step

	---

	## 🧠 Capabilities

	Supports diverse CAPTCHA types:
	- Text / OCR
	- Icon selection & matching
	- Image grid reasoning
	- Slider / drag tasks
	- Multi-step interaction challenges

	Core skills:
	- Visual understanding
	- Spatial reasoning
	- Continuous control
	- Multi-step planning

	---

	## 📊 Performance

	- ~71.9% success rate on synthetic CAPTCHA benchmark
	- Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
	- Maintains competitive performance on general GUI benchmarks

	---

	## 🔒 Ethical Considerations

	This model is released for research purposes only.
	It is intended to study and improve the robustness of human-verification systems, not to bypass them.