# UI Element Detection API

Complete server-based solution for detecting and locating all UI elements in screenshots using OmniParser and template matching.

## Features

- ✅ **Automatic UI Detection** - Uses OmniParser to detect all UI elements (buttons, text, icons, etc.)
- ✅ **Precise Coordinates** - Returns pixel-perfect coordinates for each element
- ✅ **Multiple Export Formats** - JSON, CSV, and visualization PNG
- ✅ **Fast Processing** - ~15 seconds per screenshot on CPU
- ✅ **Server-Side Storage** - Cropped images stored on server, not sent to clients
- ✅ **Multiple Endpoints** - Flexible request/response options

## Start the Server

```bash
cd /workspaces/omoi-v2
python ui_element_api_server.py --port 8001
```

Server will start at `http://127.0.0.1:8001`

## API Endpoints

### 1. Health Check

```bash
GET /health
```

**Response:**

```json
{"status": "ok", "service": "UI Element Detection API"}
```

### 2. Analyze Image (Full Response)

```bash
POST /analyze
Content-Type: multipart/form-data

file:
```

**Response:**

```json
{
  "status": "success",
  "processing_time_seconds": 15.4,
  "timing": {
    "omniparser_seconds": 9.88,
    "template_matching_seconds": 5.48
  },
  "image_info": {
    "filename": "Screenshot.png",
    "size": {"width": 1365, "height": 767}
  },
  "analysis": {
    "total_elements_detected": 120,
    "elements": [
      {
        "template_id": "crop_0000",
        "template_file": "crop_0000.png",
        "confidence": 1.0,
        "bbox": {
          "x1": 71, "y1": 13, "x2": 161, "y2": 29,
          "width": 90, "height": 16
        },
        "center": {"x": 116, "y": 21},
        "bbox_ratio": {
          "x1": 0.052, "y1": 0.017, "x2": 0.118, "y2": 0.038
        }
      },
      // ... 119 more elements
    ]
  },
  "exports": {
    "csv_data": "Element_ID,Template_File,Confidence,X1,Y1,...\n",
    "visualization_png_base64": "iVBORw0KGgoAAAANSUhEUgAAA..."
  }
}
```

### 3. Analyze Image (Structured Response)

```bash
POST /analyze_batch
Content-Type: multipart/form-data

file:
```

**Response:**

```json
{
  "metadata": {
    "filename": "Screenshot.png",
    "image_size": {"width": 1365, "height": 767},
    "total_elements_detected": 120,
    "templates_loaded": 120
  },
  "coordinates_json": {
    "source_image": "Screenshot.png",
    "image_size": {"width": 1365, "height": 767},
    "total_elements": 120,
    "elements": [...]
  },
  "csv_data": "Element_ID,Template_File,...\n",
  "visualization_png_base64": "iVBORw0KGgo..."
}
```

## Usage Examples

### Python Client

```python
from ui_element_client import UIElementDetectionClient

# Initialize client
client = UIElementDetectionClient(api_url="http://127.0.0.1:8001")

# Check API health
status = client.health_check()
print(status)

# Analyze image and get all elements
result = client.analyze_image("screenshot.png")
print(f"Found {result['analysis']['total_elements_detected']} UI elements")

# Get specific element
element = client.get_element_by_id("screenshot.png", "crop_0031")
print(f"Element at: ({element['center']['x']}, {element['center']['y']})")

# Find elements in a region (top 100 pixels)
elements = client.find_elements_in_region("screenshot.png", 0, 0, 1365, 100)
print(f"Found {len(elements)} elements in top region")
```

### Using curl

#### Analyze image and save outputs

```bash
curl -X POST -F "file=@screenshot.png" http://127.0.0.1:8001/analyze > response.json

# Extract CSV data
python -c "import json; d=json.load(open('response.json')); print(d['exports']['csv_data'])" > coordinates.csv

# Extract visualization (base64 decode)
python -c "
import json, base64
d = json.load(open('response.json'))
with open('visualization.png', 'wb') as f:
    f.write(base64.b64decode(d['exports']['visualization_png_base64']))
"
```

### JavaScript/Node.js

```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function analyzeImage(imagePath) {
  const formData = new FormData();
  formData.append('file', fs.createReadStream(imagePath));

  const response = await axios.post(
    'http://127.0.0.1:8001/analyze',
    formData,
    { headers: formData.getHeaders() }
  );

  const data = response.data;
  console.log(`Found ${data.analysis.total_elements_detected} UI elements`);

  // Save CSV
  fs.writeFileSync('coordinates.csv', data.exports.csv_data);

  // Save visualization
  const vizBuffer = Buffer.from(data.exports.visualization_png_base64, 'base64');
  fs.writeFileSync('visualization.png', vizBuffer);

  return data;
}

analyzeImage('screenshot.png').catch(console.error);
```

## Response Data Structure

Each UI element contains:

```json
{
  "template_id": "crop_0031",        // Element identifier
  "template_file": "crop_0031.png",  // Source template file
  "confidence": 1.0,                 // Matching confidence (0-1)
  "bbox": {
    "x1": 587,      // Top-left X
    "y1": 393,      // Top-left Y
    "x2": 763,      // Bottom-right X
    "y2": 441,      // Bottom-right Y
    "width": 176,   // Element width
    "height": 48    // Element height
  },
  "center": {
    "x": 675,       // Center X (for clicking)
    "y": 417        // Center Y (for clicking)
  },
  "bbox_ratio": {
    "x1": 0.430,    // Normalized X1 (0-1)
    "y1": 0.512,    // Normalized Y1 (0-1)
    "x2": 0.559,    // Normalized X2 (0-1)
    "y2": 0.575     // Normalized Y2 (0-1)
  }
}
```

## Export Formats

### JSON

Complete structured data with all coordinates, confidence scores, and metadata.

### CSV

Spreadsheet-friendly format with columns:

- Element_ID
- Template_File
- Confidence
- X1, Y1, X2, Y2 (pixel coordinates)
- Width, Height
- Center_X, Center_Y
- Ratio_X1, Ratio_Y1, Ratio_X2, Ratio_Y2

### Visualization PNG

High-resolution image with:

- Green bounding boxes around each element
- Red center point marker
- Element ID and confidence label for each box

## Server-Side File Storage

The server maintains a temporary cropped images directory:

```
/tmp/omoi_cropped_images/
├── crop_0000.png
├── crop_0001.png
├── crop_0002.png
└── ... (120+ images)
```

These files are:

- ✅ Used for template matching
- ✅ Kept on server for reference
- ❌ NOT sent to clients
- ❌ Cleared on server restart

## Performance

Typical performance on CPU:

- OmniParser detection: ~10 seconds
- Template matching: ~5 seconds
- Total: ~15 seconds per screenshot

## Architecture

```
Client Request (PNG)
    ↓
[API Server]
    1. Receives PNG
    2. Runs OmniParser
       ├─ Detects UI elements
       └─ Saves cropped images (server-side only)
    3. Template matches crops back to original
    4. Generates coordinates
    5. Creates visualization
    6. Exports to JSON/CSV
    ↓
Client Response (JSON, CSV, PNG)
    - Coordinates metadata
    - CSV data
    - Visualization image
    (NO cropped images to client)
```

## Coordinate Systems

### Absolute Coordinates

Pixel coordinates in the original image:

- `bbox.x1, bbox.y1`: Top-left corner
- `bbox.x2, bbox.y2`: Bottom-right corner
- `center.x, center.y`: Center point (use for mouse clicks)

### Normalized Coordinates

0-1 scale for responsive designs:

- `bbox_ratio.x1, bbox_ratio.y1`: Top-left (normalized)
- `bbox_ratio.x2, bbox_ratio.y2`: Bottom-right (normalized)
- Useful for scaling to different screen sizes

## Tips

1. **Clicking Elements**: Use `center.x` and `center.y` for mouse position
2. **Validation**: All elements have `confidence: 1.0` (perfect match)
3. **Filtering**: Use `bbox_ratio` for responsive element filtering
4. **Region Queries**: Client library supports finding elements in bounding boxes
5. **Batch Processing**: Queue multiple images for analysis

## Troubleshooting

- **"OmniParser not initialized"** - Server failed to load models, check logs
- **"Failed to decode image"** - Ensure you're uploading valid PNG/JPG files
- **"Cropped images directory not found"** - OmniParser detection failed, check input image
- **Timeout** - Processing large images takes time, increase request timeout

## Files

- `ui_element_api_server.py` - Main API server
- `ui_element_client.py` - Python client library
- `ui_element_locator.py` - Template matching utility
- `ui_element_analyzer.py` - Analysis and export utilities

---

**Status**: ✅ Production Ready

**Last Updated**: April 17, 2026
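
## Appendix: Working with Coordinates Locally

The Coordinate Systems and Tips sections describe scaling `bbox_ratio` to other screen sizes and filtering elements by region. The sketch below shows one way this could be done on an already-downloaded response, without calling the server again. The helper names `ratio_to_pixels` and `elements_in_region` are hypothetical and are not part of `ui_element_client.py`. The region filter assumes center-point containment, which may differ from how the client's `find_elements_in_region` decides membership.

```python
from typing import Any, Dict, List

# Element dicts follow the shape shown in "Response Data Structure".
Element = Dict[str, Any]


def ratio_to_pixels(bbox_ratio: Dict[str, float], width: int, height: int) -> Dict[str, int]:
    """Scale a normalized bbox_ratio (0-1) to pixel coordinates for a target size."""
    return {
        "x1": round(bbox_ratio["x1"] * width),
        "y1": round(bbox_ratio["y1"] * height),
        "x2": round(bbox_ratio["x2"] * width),
        "y2": round(bbox_ratio["y2"] * height),
    }


def elements_in_region(elements: List[Element], x1: int, y1: int, x2: int, y2: int) -> List[Element]:
    """Keep elements whose center point lies inside the region (edges inclusive).

    Assumption: center-point containment; the real client may use a different rule.
    """
    return [
        e for e in elements
        if x1 <= e["center"]["x"] <= x2 and y1 <= e["center"]["y"] <= y2
    ]


# Sample element copied from the "Response Data Structure" section.
sample: Element = {
    "template_id": "crop_0031",
    "bbox": {"x1": 587, "y1": 393, "x2": 763, "y2": 441, "width": 176, "height": 48},
    "center": {"x": 675, "y": 417},
    "bbox_ratio": {"x1": 0.430, "y1": 0.512, "x2": 0.559, "y2": 0.575},
}

# Rescale the element to a hypothetical 1920x1080 target and compute a click point.
scaled = ratio_to_pixels(sample["bbox_ratio"], 1920, 1080)
click_point = ((scaled["x1"] + scaled["x2"]) // 2, (scaled["y1"] + scaled["y2"]) // 2)
print(scaled, click_point)
```

Because `bbox_ratio` values in the response are rounded to three decimals, rescaled coordinates can be off by a pixel or so; for clicking, the recomputed center is usually sufficient.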