UI Element Detection API
Complete server-based solution for detecting and locating all UI elements in screenshots using OmniParser and template matching.
Features
- Automatic UI Detection - Uses OmniParser to detect all UI elements (buttons, text, icons, etc.)
- Precise Coordinates - Returns pixel-perfect coordinates for each element
- Multiple Export Formats - JSON, CSV, and visualization PNG
- Fast Processing - ~15 seconds per screenshot on CPU
- Server-Side Storage - Cropped images stored on server, not sent to clients
- Multiple Endpoints - Flexible request/response options
Start the Server
cd /workspaces/omoi-v2
python ui_element_api_server.py --port 8001
The server will start at http://127.0.0.1:8001
API Endpoints
1. Health Check
GET /health
Response:
{"status": "ok", "service": "UI Element Detection API"}
2. Analyze Image (Full Response)
POST /analyze
Content-Type: multipart/form-data
file: <PNG image file>
Response:
{
"status": "success",
"processing_time_seconds": 15.4,
"timing": {
"omniparser_seconds": 9.88,
"template_matching_seconds": 5.48
},
"image_info": {
"filename": "Screenshot.png",
"size": {"width": 1365, "height": 767}
},
"analysis": {
"total_elements_detected": 120,
"elements": [
{
"template_id": "crop_0000",
"template_file": "crop_0000.png",
"confidence": 1.0,
"bbox": {
"x1": 71, "y1": 13, "x2": 161, "y2": 29,
"width": 90, "height": 16
},
"center": {"x": 116, "y": 21},
"bbox_ratio": {
"x1": 0.052, "y1": 0.017, "x2": 0.118, "y2": 0.038
}
},
// ... 119 more elements
]
},
"exports": {
"csv_data": "Element_ID,Template_File,Confidence,X1,Y1,...\n",
"visualization_png_base64": "iVBORw0KGgoAAAANSUhEUgAAA..."
}
}
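A common next step is turning this response into click targets. The sketch below is one way to do that, assuming the response has already been parsed into a Python dict; `click_points` and its `min_confidence` parameter are hypothetical names for illustration, not part of the server or client library.

```python
from typing import Any, Dict, List, Tuple


def click_points(
    response: Dict[str, Any], min_confidence: float = 0.9
) -> List[Tuple[str, int, int]]:
    """Return (template_id, x, y) for each element at or above min_confidence."""
    # Hypothetical helper: reads only the fields shown in the /analyze response.
    elements = response.get("analysis", {}).get("elements", [])
    points = []
    for element in elements:
        if element.get("confidence", 0.0) >= min_confidence:
            center = element["center"]
            points.append((element["template_id"], center["x"], center["y"]))
    return points


# Trimmed sample shaped like the /analyze response, using its first element.
sample_response = {
    "status": "success",
    "analysis": {
        "total_elements_detected": 1,
        "elements": [
            {
                "template_id": "crop_0000",
                "template_file": "crop_0000.png",
                "confidence": 1.0,
                "bbox": {"x1": 71, "y1": 13, "x2": 161, "y2": 29,
                         "width": 90, "height": 16},
                "center": {"x": 116, "y": 21},
            }
        ],
    },
}

print(click_points(sample_response))
```

The default threshold of 0.9 is arbitrary; pick whatever cut-off suits your automation.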
3. Analyze Image (Structured Response)
POST /analyze_batch
Content-Type: multipart/form-data
file: <PNG image file>
Response:
{
"metadata": {
"filename": "Screenshot.png",
"image_size": {"width": 1365, "height": 767},
"total_elements_detected": 120,
"templates_loaded": 120
},
"coordinates_json": {
"source_image": "Screenshot.png",
"image_size": {"width": 1365, "height": 767},
"total_elements": 120,
"elements": [...]
},
"csv_data": "Element_ID,Template_File,...\n",
"visualization_png_base64": "iVBORw0KGgo..."
}
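The two endpoints place the exports differently: /analyze nests them under "exports", while /analyze_batch keeps csv_data and visualization_png_base64 at the top level. A minimal sketch of a helper that accepts either shape follows; `extract_exports` is a hypothetical name, and the payloads are stand-ins rather than real server output.

```python
import base64
from typing import Any, Dict, Tuple


def extract_exports(response: Dict[str, Any]) -> Tuple[str, bytes]:
    """Return (csv_text, png_bytes) from an /analyze or /analyze_batch response."""
    # /analyze nests the exports; /analyze_batch has them at the top level.
    container = response.get("exports", response)
    csv_text = container["csv_data"]
    png_bytes = base64.b64decode(container["visualization_png_base64"])
    return csv_text, png_bytes


# Stand-in payloads: only the 8-byte PNG signature, not a real visualization.
fake_png_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
batch_response = {
    "csv_data": "Element_ID,Template_File\n",
    "visualization_png_base64": fake_png_b64,
}
full_response = {"exports": dict(batch_response)}

csv_text, png_bytes = extract_exports(batch_response)
print(len(csv_text), len(png_bytes))
```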
Usage Examples
Python Client
from ui_element_client import UIElementDetectionClient
# Initialize client
client = UIElementDetectionClient(api_url="http://127.0.0.1:8001")
# Check API health
status = client.health_check()
print(status)
# Analyze image and get all elements
result = client.analyze_image("screenshot.png")
print(f"Found {result['analysis']['total_elements_detected']} UI elements")
# Get specific element
element = client.get_element_by_id("screenshot.png", "crop_0031")
print(f"Element at: ({element['center']['x']}, {element['center']['y']})")
# Find elements in a region (top 100 pixels)
elements = client.find_elements_in_region("screenshot.png", 0, 0, 1365, 100)
print(f"Found {len(elements)} elements in top region")
Using curl
# Analyze image and save outputs
curl -X POST -F "file=@screenshot.png" http://127.0.0.1:8001/analyze > response.json
# Extract CSV data
python -c "import json; d=json.load(open('response.json')); print(d['exports']['csv_data'])" > coordinates.csv
# Extract visualization (base64 decode)
python -c "
import json, base64
d = json.load(open('response.json'))
with open('visualization.png', 'wb') as f:
    f.write(base64.b64decode(d['exports']['visualization_png_base64']))
"
JavaScript/Node.js
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');
async function analyzeImage(imagePath) {
  const formData = new FormData();
  formData.append('file', fs.createReadStream(imagePath));
  const response = await axios.post(
    'http://127.0.0.1:8001/analyze',
    formData,
    { headers: formData.getHeaders() }
  );
  const data = response.data;
  console.log(`Found ${data.analysis.total_elements_detected} UI elements`);
  // Save CSV
  fs.writeFileSync('coordinates.csv', data.exports.csv_data);
  // Save visualization
  const vizBuffer = Buffer.from(data.exports.visualization_png_base64, 'base64');
  fs.writeFileSync('visualization.png', vizBuffer);
  return data;
}
analyzeImage('screenshot.png').catch(console.error);
Response Data Structure
Each UI element contains:
{
"template_id": "crop_0031", // Element identifier
"template_file": "crop_0031.png", // Source template file
"confidence": 1.0, // Matching confidence (0-1)
"bbox": {
"x1": 587, // Top-left X
"y1": 393, // Top-left Y
"x2": 763, // Bottom-right X
"y2": 441, // Bottom-right Y
"width": 176, // Element width
"height": 48 // Element height
},
"center": {
"x": 675, // Center X (for clicking)
"y": 417 // Center Y (for clicking)
},
"bbox_ratio": {
"x1": 0.430, // Normalized X1 (0-1)
"y1": 0.512, // Normalized Y1 (0-1)
"x2": 0.559, // Normalized X2 (0-1)
"y2": 0.575 // Normalized Y2 (0-1)
}
}
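The width, height, center, and bbox_ratio fields all follow from the four corner coordinates and the image size. The sketch below reproduces the values of crop_0031 for a 1365x767 screenshot. The integer-division center and the three-decimal rounding are assumptions inferred from the sample values, not confirmed server behavior, and `derive_fields` is a hypothetical name.

```python
from typing import Any, Dict


def derive_fields(
    x1: int, y1: int, x2: int, y2: int, image_width: int, image_height: int
) -> Dict[str, Any]:
    """Rebuild the derived element fields from the corners and the image size."""
    return {
        "bbox": {
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "width": x2 - x1,
            "height": y2 - y1,
        },
        # Assumption: the center is the integer midpoint of the box.
        "center": {"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
        # Assumption: ratios are pixel / image size, rounded to 3 decimals.
        "bbox_ratio": {
            "x1": round(x1 / image_width, 3),
            "y1": round(y1 / image_height, 3),
            "x2": round(x2 / image_width, 3),
            "y2": round(y2 / image_height, 3),
        },
    }


element = derive_fields(587, 393, 763, 441, image_width=1365, image_height=767)
print(element["center"], element["bbox_ratio"])
```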
Export Formats
JSON
Complete structured data with all coordinates, confidence scores, and metadata.
CSV
Spreadsheet-friendly format with columns:
- Element_ID
- Template_File
- Confidence
- X1, Y1, X2, Y2 (pixel coordinates)
- Width, Height
- Center_X, Center_Y
- Ratio_X1, Ratio_Y1, Ratio_X2, Ratio_Y2
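The CSV export can be read back with Python's standard csv module. This sketch assumes the header row uses exactly the column names listed above, in that order, and that Element_ID holds values such as crop_0031; the sample row is hand-built from the crop_0031 element, not captured from the server.

```python
import csv
import io
from typing import Any, Dict, List

# Assumed header: the column names listed above, in the listed order.
CSV_HEADER = (
    "Element_ID,Template_File,Confidence,X1,Y1,X2,Y2,Width,Height,"
    "Center_X,Center_Y,Ratio_X1,Ratio_Y1,Ratio_X2,Ratio_Y2"
)


def parse_csv_export(csv_text: str) -> List[Dict[str, Any]]:
    """Parse csv_data into dicts with typed confidence, center, and size."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "element_id": row["Element_ID"],
            "confidence": float(row["Confidence"]),
            # Assumes pixel columns are written as plain integers.
            "center": (int(row["Center_X"]), int(row["Center_Y"])),
            "size": (int(row["Width"]), int(row["Height"])),
        })
    return rows


# Hand-built sample row based on the crop_0031 element.
sample_csv = (
    CSV_HEADER + "\n"
    "crop_0031,crop_0031.png,1.0,587,393,763,441,176,48,675,417,"
    "0.430,0.512,0.559,0.575\n"
)
rows = parse_csv_export(sample_csv)
print(rows[0]["element_id"], rows[0]["center"])
```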
Visualization PNG
High-resolution image with:
- Green bounding boxes around each element
- Red center point marker
- Element ID and confidence label for each box
Server-Side File Storage
The server maintains a temporary cropped images directory:
/tmp/omoi_cropped_images/
├── crop_0000.png
├── crop_0001.png
├── crop_0002.png
└── ... (120+ images)
These files are:
- Used for template matching
- Kept on server for reference
- NOT sent to clients
- Cleared on server restart
Performance
Typical performance on CPU:
- OmniParser detection: ~10 seconds
- Template matching: ~5 seconds
- Total: ~15 seconds per screenshot
Architecture
Client Request (PNG)
        ↓
[API Server]
  1. Receives PNG
  2. Runs OmniParser
     ├─ Detects UI elements
     └─ Saves cropped images (server-side only)
  3. Template matches crops back to original
  4. Generates coordinates
  5. Creates visualization
  6. Exports to JSON/CSV
        ↓
Client Response (JSON, CSV, PNG)
  - Coordinates metadata
  - CSV data
  - Visualization image
  (NO cropped images to client)
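Step 3 can be illustrated with a toy example. This document does not say which matching method the server uses, so the sketch below is not the server's implementation. It is a pure-Python sliding-window match on a tiny grayscale grid, scored as 1 / (1 + sum of squared differences). It shows why a crop cut from the same image can be located exactly, with a score of 1.0. All names here are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]  # rows of grayscale pixel values


def crop(image: Grid, x1: int, y1: int, x2: int, y2: int) -> Grid:
    """Cut the rectangle [x1, x2) x [y1, y2) out of the image."""
    return [row[x1:x2] for row in image[y1:y2]]


def match_template(image: Grid, template: Grid) -> Tuple[int, int, float]:
    """Slide the template over the image; return (x, y, score) of the best fit.

    score = 1 / (1 + sum of squared differences), so 1.0 means pixel-identical.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best = (0, 0, -1.0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            ssd = sum(
                (image[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(th)
                for dx in range(tw)
            )
            score = 1.0 / (1.0 + ssd)
            if score > best[2]:
                best = (x, y, score)
    return best


# Toy 6x5 "screenshot" with two distinct elements.
screenshot = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0, 0],
    [0, 7, 6, 0, 3, 4],
    [0, 0, 0, 0, 5, 2],
    [0, 0, 0, 0, 0, 0],
]
template = crop(screenshot, 4, 2, 6, 4)  # the element at x=4..5, y=2..3
x, y, score = match_template(screenshot, template)
print(x, y, score)
```

A real server would more likely rely on an optimized routine such as OpenCV's matchTemplate; the brute-force loop here is only meant to make the idea concrete.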
Coordinate Systems
Absolute Coordinates
Pixel coordinates in the original image:
- bbox.x1, bbox.y1: Top-left corner
- bbox.x2, bbox.y2: Bottom-right corner
- center.x, center.y: Center point (use for mouse clicks)
Normalized Coordinates
0-1 scale for responsive designs:
- bbox_ratio.x1, bbox_ratio.y1: Top-left (normalized)
- bbox_ratio.x2, bbox_ratio.y2: Bottom-right (normalized)
- Useful for scaling to different screen sizes
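As a sketch of that scaling, the hypothetical helper below maps a bbox_ratio onto a screen of a different size. It assumes the target screen shows the same layout at a different resolution. Because the ratios in the sample responses carry three decimals, expect an error of about one pixel on large screens.

```python
from typing import Dict


def ratio_to_pixels(
    bbox_ratio: Dict[str, float], screen_width: int, screen_height: int
) -> Dict[str, int]:
    """Scale a normalized box to pixel coordinates on the target screen."""
    x1 = round(bbox_ratio["x1"] * screen_width)
    y1 = round(bbox_ratio["y1"] * screen_height)
    x2 = round(bbox_ratio["x2"] * screen_width)
    y2 = round(bbox_ratio["y2"] * screen_height)
    return {
        "x1": x1, "y1": y1, "x2": x2, "y2": y2,
        # Integer midpoint, usable as a click position on the target screen.
        "center_x": (x1 + x2) // 2,
        "center_y": (y1 + y2) // 2,
    }


# bbox_ratio of crop_0031, projected onto a 1920x1080 screen.
ratio = {"x1": 0.430, "y1": 0.512, "x2": 0.559, "y2": 0.575}
scaled = ratio_to_pixels(ratio, screen_width=1920, screen_height=1080)
print(scaled)
```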
Tips
- Clicking Elements: Use center.x and center.y for mouse position
- Validation: All elements have confidence: 1.0 (perfect match)
- Filtering: Use bbox_ratio for responsive element filtering
- Region Queries: Client library supports finding elements in bounding boxes
- Batch Processing: Queue multiple images for analysis
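Region queries can also be run directly on the elements list of a response. How the client library's find_elements_in_region decides membership (center point, full containment, or overlap) is not documented here, so the sketch below picks one option: an element counts as inside when its center falls within the rectangle. `elements_in_region` is a hypothetical stand-alone function, not the client method.

```python
from typing import Any, Dict, List


def elements_in_region(
    elements: List[Dict[str, Any]], x1: int, y1: int, x2: int, y2: int
) -> List[Dict[str, Any]]:
    """Return the elements whose center lies inside the rectangle (inclusive)."""
    return [
        element
        for element in elements
        if x1 <= element["center"]["x"] <= x2
        and y1 <= element["center"]["y"] <= y2
    ]


# Two elements taken from the sample responses in this document.
sample_elements = [
    {"template_id": "crop_0000", "center": {"x": 116, "y": 21}},
    {"template_id": "crop_0031", "center": {"x": 675, "y": 417}},
]

# Top 100 pixels of a 1365-pixel-wide screenshot.
top_strip = elements_in_region(sample_elements, 0, 0, 1365, 100)
print([element["template_id"] for element in top_strip])
```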
Troubleshooting
- "OmniParser not initialized" - The server failed to load its models. Check the server logs.
- "Failed to decode image" - Make sure you are uploading a valid PNG or JPG file.
- "Cropped images directory not found" - OmniParser detection failed. Check the input image.
- Timeout - Large images take longer to process. Increase the client's request timeout.
Files
- ui_element_api_server.py - Main API server
- ui_element_client.py - Python client library
- ui_element_locator.py - Template matching utility
- ui_element_analyzer.py - Analysis and export utilities
Status: Production Ready
Last Updated: April 17, 2026