File size: 1,215 Bytes
504b397
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
# Image CAPTCHA Usage

## Task type

- `ImageToTextTask`

## Request

```json
{
  "clientKey": "your-client-key",
  "task": {
    "type": "ImageToTextTask",
    "body": "<base64-encoded-image>"
  }
}
```

## Implementation notes

The image solver is implemented in `src/services/recognition.py` and is inspired by Argus-style structured multimodal annotation.

Current behavior:

- image input is resized to **1440×900**
- the model is prompted to classify the captcha into structured action types
- the normalized coordinate space starts at `(0, 0)` in the top-left corner

Supported response styles in the prompt:

- `click`
- `slide`
- `drag_match`

## Result shape

The current API returns the structured model output serialized as a string in `solution.text`.

Example:

```json
{
  "errorId": 0,
  "status": "ready",
  "solution": {
    "text": "{\"captcha_type\":\"slide\",\"drag_distance\":270}"
  }
}
```

## Backend compatibility

The multimodal path is designed for **OpenAI-compatible** APIs. This makes it suitable for hosted or self-hosted backends as long as they expose compatible image-capable chat completion behavior.

Accuracy depends heavily on the selected model and provider implementation.