# UI Element Detection API
Complete server-based solution for detecting and locating all UI elements in screenshots using OmniParser and template matching.
## Features
- ✅ **Automatic UI Detection** - Uses OmniParser to detect all UI elements (buttons, text, icons, etc.)
- ✅ **Precise Coordinates** - Returns pixel-perfect coordinates for each element
- ✅ **Multiple Export Formats** - JSON, CSV, and visualization PNG
- ✅ **Fast Processing** - ~15 seconds per screenshot on CPU
- ✅ **Server-Side Storage** - Cropped images stored on server, not sent to clients
- ✅ **Multiple Endpoints** - Flexible request/response options
## Start the Server
```bash
cd /workspaces/omoi-v2
python ui_element_api_server.py --port 8001
```
The server starts at `http://127.0.0.1:8001`.
## API Endpoints
### 1. Health Check
```bash
GET /health
```
**Response:**
```json
{"status": "ok", "service": "UI Element Detection API"}
```
### 2. Analyze Image (Full Response)
```bash
POST /analyze
Content-Type: multipart/form-data
file: <PNG image file>
```
**Response:**
```json
{
  "status": "success",
  "processing_time_seconds": 15.4,
  "timing": {
    "omniparser_seconds": 9.88,
    "template_matching_seconds": 5.48
  },
  "image_info": {
    "filename": "Screenshot.png",
    "size": {"width": 1365, "height": 767}
  },
  "analysis": {
    "total_elements_detected": 120,
    "elements": [
      {
        "template_id": "crop_0000",
        "template_file": "crop_0000.png",
        "confidence": 1.0,
        "bbox": {
          "x1": 71, "y1": 13, "x2": 161, "y2": 29,
          "width": 90, "height": 16
        },
        "center": {"x": 116, "y": 21},
        "bbox_ratio": {
          "x1": 0.052, "y1": 0.017, "x2": 0.118, "y2": 0.038
        }
      },
      // ... 119 more elements
    ]
  },
  "exports": {
    "csv_data": "Element_ID,Template_File,Confidence,X1,Y1,...\n",
    "visualization_png_base64": "iVBORw0KGgoAAAANSUhEUgAAA..."
  }
}
```
### 3. Analyze Image (Structured Response)
```bash
POST /analyze_batch
Content-Type: multipart/form-data
file: <PNG image file>
```
**Response:**
```json
{
  "metadata": {
    "filename": "Screenshot.png",
    "image_size": {"width": 1365, "height": 767},
    "total_elements_detected": 120,
    "templates_loaded": 120
  },
  "coordinates_json": {
    "source_image": "Screenshot.png",
    "image_size": {"width": 1365, "height": 767},
    "total_elements": 120,
    "elements": [...]
  },
  "csv_data": "Element_ID,Template_File,...\n",
  "visualization_png_base64": "iVBORw0KGgo..."
}
```
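The two endpoints nest the element list differently: `/analyze` returns it under `analysis.elements`, while `/analyze_batch` returns it under `coordinates_json.elements`. The following is a minimal sketch of a helper that accepts either shape. `extract_elements` is a hypothetical name, not part of the client library, and the sample payloads are trimmed versions of the responses above.

```python
from typing import Any


def extract_elements(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the element list from an /analyze or /analyze_batch response."""
    if "analysis" in response:            # /analyze shape
        return response["analysis"]["elements"]
    if "coordinates_json" in response:    # /analyze_batch shape
        return response["coordinates_json"]["elements"]
    raise ValueError("unrecognized response shape")


# Trimmed sample payloads based on the documented responses (illustrative only).
full = {
    "status": "success",
    "analysis": {
        "total_elements_detected": 1,
        "elements": [{"template_id": "crop_0000"}],
    },
}
structured = {
    "metadata": {"total_elements_detected": 1},
    "coordinates_json": {
        "total_elements": 1,
        "elements": [{"template_id": "crop_0000"}],
    },
}

print(extract_elements(full) == extract_elements(structured))
```

Normalizing the shape once keeps the rest of a client independent of which endpoint it called.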
## Usage Examples
### Python Client
```python
from ui_element_client import UIElementDetectionClient
# Initialize client
client = UIElementDetectionClient(api_url="http://127.0.0.1:8001")
# Check API health
status = client.health_check()
print(status)
# Analyze image and get all elements
result = client.analyze_image("screenshot.png")
print(f"Found {result['analysis']['total_elements_detected']} UI elements")
# Get specific element
element = client.get_element_by_id("screenshot.png", "crop_0031")
print(f"Element at: ({element['center']['x']}, {element['center']['y']})")
# Find elements in a region (top 100 pixels)
elements = client.find_elements_in_region("screenshot.png", 0, 0, 1365, 100)
print(f"Found {len(elements)} elements in top region")
```
### Using curl
#### Analyze image and save outputs
```bash
curl -X POST -F "file=@screenshot.png" http://127.0.0.1:8001/analyze > response.json
# Extract CSV data
python -c "import json; d=json.load(open('response.json')); print(d['exports']['csv_data'])" > coordinates.csv
# Extract visualization (base64 decode)
python -c "
import json, base64
d = json.load(open('response.json'))
with open('visualization.png', 'wb') as f:
    f.write(base64.b64decode(d['exports']['visualization_png_base64']))
"
```
### JavaScript/Node.js
```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');
async function analyzeImage(imagePath) {
  const formData = new FormData();
  formData.append('file', fs.createReadStream(imagePath));
  const response = await axios.post(
    'http://127.0.0.1:8001/analyze',
    formData,
    { headers: formData.getHeaders() }
  );
  const data = response.data;
  console.log(`Found ${data.analysis.total_elements_detected} UI elements`);
  // Save CSV
  fs.writeFileSync('coordinates.csv', data.exports.csv_data);
  // Save visualization
  const vizBuffer = Buffer.from(data.exports.visualization_png_base64, 'base64');
  fs.writeFileSync('visualization.png', vizBuffer);
  return data;
}
analyzeImage('screenshot.png').catch(console.error);
```
## Response Data Structure
Each UI element contains:
```json
{
  "template_id": "crop_0031",        // Element identifier
  "template_file": "crop_0031.png",  // Source template file
  "confidence": 1.0,                 // Matching confidence (0-1)
  "bbox": {
    "x1": 587,                       // Top-left X
    "y1": 393,                       // Top-left Y
    "x2": 763,                       // Bottom-right X
    "y2": 441,                       // Bottom-right Y
    "width": 176,                    // Element width
    "height": 48                     // Element height
  },
  "center": {
    "x": 675,                        // Center X (for clicking)
    "y": 417                         // Center Y (for clicking)
  },
  "bbox_ratio": {
    "x1": 0.430,                     // Normalized X1 (0-1)
    "y1": 0.512,                     // Normalized Y1 (0-1)
    "x2": 0.559,                     // Normalized X2 (0-1)
    "y2": 0.575                      // Normalized Y2 (0-1)
  }
}
```
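The derived fields appear to follow directly from the pixel box and the image size. The sketch below reproduces the values shown above for `crop_0031` on a 1365x767 screenshot. The integer midpoint and three-decimal rounding are inferred from the sample numbers, not confirmed from the server code, and `describe_box` is a hypothetical helper.

```python
def describe_box(x1, y1, x2, y2, image_width, image_height):
    """Derive width/height, center, and normalized ratios from a pixel box."""
    return {
        "bbox": {
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "width": x2 - x1, "height": y2 - y1,
        },
        # Midpoint of the box, in pixels.
        "center": {"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
        # Pixel coordinates divided by the image dimensions (assumed rounding).
        "bbox_ratio": {
            "x1": round(x1 / image_width, 3),
            "y1": round(y1 / image_height, 3),
            "x2": round(x2 / image_width, 3),
            "y2": round(y2 / image_height, 3),
        },
    }


element = describe_box(587, 393, 763, 441, image_width=1365, image_height=767)
print(element["center"], element["bbox_ratio"])
```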
## Export Formats
### JSON
Complete structured data with all coordinates, confidence scores, and metadata.
### CSV
Spreadsheet-friendly format with columns:
- Element_ID
- Template_File
- Confidence
- X1, Y1, X2, Y2 (pixel coordinates)
- Width, Height
- Center_X, Center_Y
- Ratio_X1, Ratio_Y1, Ratio_X2, Ratio_Y2
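A minimal sketch of reading `csv_data` with the standard library. The header follows the column list above, but the exact column order and the sample row (built from the `crop_0031` example) are assumptions for illustration; `click_points` is a hypothetical helper.

```python
import csv
import io

# Assumed header order, following the column list above; the row is illustrative.
csv_data = (
    "Element_ID,Template_File,Confidence,X1,Y1,X2,Y2,Width,Height,"
    "Center_X,Center_Y,Ratio_X1,Ratio_Y1,Ratio_X2,Ratio_Y2\n"
    "crop_0031,crop_0031.png,1.0,587,393,763,441,176,48,"
    "675,417,0.430,0.512,0.559,0.575\n"
)


def click_points(csv_text):
    """Map each Element_ID to its (Center_X, Center_Y) pixel position."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Element_ID"]: (int(row["Center_X"]), int(row["Center_Y"]))
        for row in reader
    }


points = click_points(csv_data)
print(points)
```

Reading by column name rather than position keeps the code working if the real column order differs from the one assumed here.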
### Visualization PNG
High-resolution image with:
- Green bounding boxes around each element
- Red center point marker
- Element ID and confidence label for each box
## Server-Side File Storage
The server maintains a temporary cropped images directory:
```
/tmp/omoi_cropped_images/
├── crop_0000.png
├── crop_0001.png
├── crop_0002.png
└── ... (120+ images)
```
These files are:
- ✅ Used for template matching
- ✅ Kept on server for reference
- ❌ NOT sent to clients
- ❌ Cleared on server restart
## Performance
Typical performance on CPU:
- OmniParser detection: ~10 seconds
- Template matching: ~5 seconds
- Total: ~15 seconds per screenshot
## Architecture
```
Client Request (PNG)
    ↓
[API Server]
    1. Receives PNG
    2. Runs OmniParser
       ├─ Detects UI elements
       └─ Saves cropped images (server-side only)
    3. Template matches crops back to original
    4. Generates coordinates
    5. Creates visualization
    6. Exports to JSON/CSV
    ↓
Client Response (JSON, CSV, PNG)
    - Coordinates metadata
    - CSV data
    - Visualization image
    (NO cropped images to client)
```
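Step 3 can be illustrated without any imaging library. The sketch below slides a small crop over a grayscale grid and keeps the offset with the lowest sum of squared differences. It is a toy stand-in for the idea, not the server's implementation (which this document does not show), and `locate_crop` is a hypothetical name. Because each crop is cut from the same image it is matched against, an exact match exists, which is consistent with the `confidence: 1.0` values in the sample responses.

```python
def locate_crop(image, crop):
    """Return (x, y, ssd) for the best match of `crop` inside `image`.

    Both arguments are 2D lists of grayscale values (rows of pixels).
    """
    img_h, img_w = len(image), len(image[0])
    crop_h, crop_w = len(crop), len(crop[0])
    best = None
    for y in range(img_h - crop_h + 1):
        for x in range(img_w - crop_w + 1):
            # Sum of squared differences between the crop and this window.
            ssd = sum(
                (image[y + dy][x + dx] - crop[dy][dx]) ** 2
                for dy in range(crop_h)
                for dx in range(crop_w)
            )
            if best is None or ssd < best[2]:
                best = (x, y, ssd)
    return best


image = [
    [0, 0, 0, 0, 0],
    [0, 9, 7, 0, 0],
    [0, 5, 3, 0, 0],
    [0, 0, 0, 0, 0],
]
# Cut the crop out of the image itself, as the pipeline does.
crop = [row[1:3] for row in image[1:3]]

x, y, ssd = locate_crop(image, crop)
print(f"top-left=({x}, {y}), ssd={ssd}")
```

The returned top-left offset plus the crop's own width and height give the `bbox` fields described earlier.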
## Coordinate Systems
### Absolute Coordinates
Pixel coordinates in the original image:
- `bbox.x1, bbox.y1`: Top-left corner
- `bbox.x2, bbox.y2`: Bottom-right corner
- `center.x, center.y`: Center point (use for mouse clicks)
### Normalized Coordinates
0-1 scale for responsive designs:
- `bbox_ratio.x1, bbox_ratio.y1`: Top-left (normalized)
- `bbox_ratio.x2, bbox_ratio.y2`: Bottom-right (normalized)
- Useful for scaling to different screen sizes
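As a sketch of the scaling use case, the function below maps a `bbox_ratio` onto a target resolution. `ratio_to_pixels` is a hypothetical helper. It assumes the target screen shows the same layout at a different size, and the result is only as precise as the three-decimal ratios.

```python
def ratio_to_pixels(bbox_ratio, width, height):
    """Scale a normalized box to pixel coordinates for a width x height target."""
    x1 = round(bbox_ratio["x1"] * width)
    y1 = round(bbox_ratio["y1"] * height)
    x2 = round(bbox_ratio["x2"] * width)
    y2 = round(bbox_ratio["y2"] * height)
    return {
        "x1": x1, "y1": y1, "x2": x2, "y2": y2,
        "center": {"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
    }


# Ratios from the crop_0031 example in this document.
ratio = {"x1": 0.430, "y1": 0.512, "x2": 0.559, "y2": 0.575}

original = ratio_to_pixels(ratio, 1365, 767)   # the source screenshot size
scaled = ratio_to_pixels(ratio, 2730, 1534)    # same layout at twice the size
print(original)
print(scaled)
```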
## Tips
1. **Clicking Elements**: Use `center.x` and `center.y` for mouse position
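2. **Validation**: In the sample responses every element has `confidence: 1.0` (a perfect match), which is expected because each crop is matched back against the screenshot it was cut from
3. **Filtering**: Use `bbox_ratio` to filter elements by relative position, independent of the screenshot's resolution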
4. **Region Queries**: Client library supports finding elements in bounding boxes
5. **Batch Processing**: Queue multiple images for analysis
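Tips 1 and 4 combine naturally: pick the elements whose box lies inside a region, then use their centers as click targets. The sketch below filters an already-fetched element list locally. `elements_in_region` is a hypothetical function and is not claimed to match how `find_elements_in_region` works in the client library (for example, whether it requires full containment or only overlap).

```python
def elements_in_region(elements, x1, y1, x2, y2):
    """Keep elements whose bounding box lies fully inside the region."""
    return [
        e for e in elements
        if e["bbox"]["x1"] >= x1 and e["bbox"]["y1"] >= y1
        and e["bbox"]["x2"] <= x2 and e["bbox"]["y2"] <= y2
    ]


# Two elements taken from the examples in this document.
elements = [
    {"template_id": "crop_0000",
     "bbox": {"x1": 71, "y1": 13, "x2": 161, "y2": 29},
     "center": {"x": 116, "y": 21}},
    {"template_id": "crop_0031",
     "bbox": {"x1": 587, "y1": 393, "x2": 763, "y2": 441},
     "center": {"x": 675, "y": 417}},
]

# Top 100 pixels of a 1365-pixel-wide screenshot.
top_strip = elements_in_region(elements, 0, 0, 1365, 100)
click_targets = [(e["center"]["x"], e["center"]["y"]) for e in top_strip]
print(click_targets)
```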
## Troubleshooting
- **"OmniParser not initialized"** - The server failed to load its models; check the server logs
- **"Failed to decode image"** - Ensure you're uploading valid PNG/JPG files
- **"Cropped images directory not found"** - OmniParser detection failed; check the input image
- **Timeout** - Processing large images takes time; increase the request timeout
## Files
- `ui_element_api_server.py` - Main API server
- `ui_element_client.py` - Python client library
- `ui_element_locator.py` - Template matching utility
- `ui_element_analyzer.py` - Analysis and export utilities
---
**Status**: ✅ Production Ready
**Last Updated**: April 17, 2026