File size: 1,215 Bytes
504b397 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | # Image CAPTCHA Usage
## Task type
- `ImageToTextTask`
## Request
```json
{
"clientKey": "your-client-key",
"task": {
"type": "ImageToTextTask",
"body": "<base64-encoded-image>"
}
}
```
## Implementation notes
The image solver is implemented in `src/services/recognition.py` and is inspired by Argus-style structured multimodal annotation.
Current behavior:
- image input is resized to **1440×900**
- the model is prompted to classify the captcha into structured action types
- the normalized coordinate space starts at `(0, 0)` in the top-left corner
Supported response styles in the prompt:
- `click`
- `slide`
- `drag_match`
## Result shape
The current API returns the structured model output serialized as a string in `solution.text`.
Example:
```json
{
"errorId": 0,
"status": "ready",
"solution": {
"text": "{\"captcha_type\":\"slide\",\"drag_distance\":270}"
}
}
```
## Backend compatibility
The multimodal path is designed for **OpenAI-compatible** APIs. This makes it suitable for hosted or self-hosted backends as long as they expose compatible image-capable chat completion behavior.
Accuracy depends heavily on the selected model and provider implementation.
|