| # UI Element Detection API | |
| A complete server-based solution for detecting and locating all UI elements in screenshots, using OmniParser for detection and template matching for localization. | |
| ## Features | |
| ✅ **Automatic UI Detection** - Uses OmniParser to detect all UI elements (buttons, text, icons, etc.) | |
| ✅ **Precise Coordinates** - Returns pixel-perfect coordinates for each element | |
| ✅ **Multiple Export Formats** - JSON, CSV, and visualization PNG | |
| ✅ **Fast Processing** - ~15 seconds per screenshot on CPU | |
| ✅ **Server-Side Storage** - Cropped images stored on server, not sent to clients | |
| ✅ **Multiple Endpoints** - Flexible request/response options | |
| ## Start the Server | |
| ```bash | |
| cd /workspaces/omoi-v2 | |
| python ui_element_api_server.py --port 8001 | |
| ``` | |
| The server starts at `http://127.0.0.1:8001`. | |
| ## API Endpoints | |
| ### 1. Health Check | |
| ```bash | |
| GET /health | |
| ``` | |
| **Response:** | |
| ```json | |
| {"status": "ok", "service": "UI Element Detection API"} | |
| ``` | |
| ### 2. Analyze Image (Full Response) | |
| ```bash | |
| POST /analyze | |
| Content-Type: multipart/form-data | |
| file: <PNG image file> | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "status": "success", | |
| "processing_time_seconds": 15.4, | |
| "timing": { | |
| "omniparser_seconds": 9.88, | |
| "template_matching_seconds": 5.48 | |
| }, | |
| "image_info": { | |
| "filename": "Screenshot.png", | |
| "size": {"width": 1365, "height": 767} | |
| }, | |
| "analysis": { | |
| "total_elements_detected": 120, | |
| "elements": [ | |
| { | |
| "template_id": "crop_0000", | |
| "template_file": "crop_0000.png", | |
| "confidence": 1.0, | |
| "bbox": { | |
| "x1": 71, "y1": 13, "x2": 161, "y2": 29, | |
| "width": 90, "height": 16 | |
| }, | |
| "center": {"x": 116, "y": 21}, | |
| "bbox_ratio": { | |
| "x1": 0.052, "y1": 0.017, "x2": 0.118, "y2": 0.038 | |
| } | |
| }, | |
| // ... 119 more elements | |
| ] | |
| }, | |
| "exports": { | |
| "csv_data": "Element_ID,Template_File,Confidence,X1,Y1,...\n", | |
| "visualization_png_base64": "iVBORw0KGgoAAAANSUhEUgAAA..." | |
| } | |
| } | |
| ``` | |
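For clients that cannot use the bundled Python client, the upload above can be assembled with the standard library alone. The following is a minimal sketch, not the project's own code: `build_multipart` and `analyze` are hypothetical helper names, and the 120-second default timeout is a guess based on the ~15 second processing time. The only details taken from this document are the `/analyze` path, the `file` form field, and the default address `http://127.0.0.1:8001`.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns (body, value for the Content-Type request header).
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def analyze(path: str, api_url: str = "http://127.0.0.1:8001",
            timeout: float = 120.0) -> dict:
    """POST an image to /analyze and return the decoded JSON response."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("file", os.path.basename(path), f.read())
    request = urllib.request.Request(
        f"{api_url}/analyze",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # The timeout is generous because processing takes several seconds per image.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `analyze("screenshot.png")["analysis"]["total_elements_detected"]` would give the element count from the response shown above.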
| ### 3. Analyze Image (Structured Response) | |
| ```bash | |
| POST /analyze_batch | |
| Content-Type: multipart/form-data | |
| file: <PNG image file> | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "metadata": { | |
| "filename": "Screenshot.png", | |
| "image_size": {"width": 1365, "height": 767}, | |
| "total_elements_detected": 120, | |
| "templates_loaded": 120 | |
| }, | |
| "coordinates_json": { | |
| "source_image": "Screenshot.png", | |
| "image_size": {"width": 1365, "height": 767}, | |
| "total_elements": 120, | |
| "elements": [...] | |
| }, | |
| "csv_data": "Element_ID,Template_File,...\n", | |
| "visualization_png_base64": "iVBORw0KGgo..." | |
| } | |
| ``` | |
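Note that this response keeps `csv_data` and `visualization_png_base64` at the top level rather than under `exports`. As a rough sketch of how a client might write the three parts to disk, assuming the response has already been decoded into a dict (`save_exports` and the output file names are hypothetical, chosen to mirror the curl example in this document):

```python
import base64
import json
from pathlib import Path


def save_exports(response: dict, out_dir: str) -> list[Path]:
    """Write the JSON, CSV, and PNG parts of an /analyze_batch response to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / "coordinates.json"
    json_path.write_text(
        json.dumps(response["coordinates_json"], indent=2), encoding="utf-8"
    )

    csv_path = out / "coordinates.csv"
    # newline="" keeps the server's line endings untouched on every platform.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        f.write(response["csv_data"])

    png_path = out / "visualization.png"
    png_path.write_bytes(base64.b64decode(response["visualization_png_base64"]))

    return [json_path, csv_path, png_path]
```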
| ## Usage Examples | |
| ### Python Client | |
| ```python | |
| from ui_element_client import UIElementDetectionClient | |
| # Initialize client | |
| client = UIElementDetectionClient(api_url="http://127.0.0.1:8001") | |
| # Check API health | |
| status = client.health_check() | |
| print(status) | |
| # Analyze image and get all elements | |
| result = client.analyze_image("screenshot.png") | |
| print(f"Found {result['analysis']['total_elements_detected']} UI elements") | |
| # Get specific element | |
| element = client.get_element_by_id("screenshot.png", "crop_0031") | |
| print(f"Element at: ({element['center']['x']}, {element['center']['y']})") | |
| # Find elements in a region (top 100 pixels) | |
| elements = client.find_elements_in_region("screenshot.png", 0, 0, 1365, 100) | |
| print(f"Found {len(elements)} elements in top region") | |
| ``` | |
| ### Using curl | |
| #### Analyze image and save outputs | |
| ```bash | |
| curl -X POST -F "file=@screenshot.png" http://127.0.0.1:8001/analyze > response.json | |
| # Extract CSV data | |
| python -c "import json; d=json.load(open('response.json')); print(d['exports']['csv_data'])" > coordinates.csv | |
| # Extract visualization (base64 decode) | |
| python -c " | |
| import json, base64 | |
| d = json.load(open('response.json')) | |
| with open('visualization.png', 'wb') as f: | |
|     f.write(base64.b64decode(d['exports']['visualization_png_base64'])) | |
| " | |
| ``` | |
| ### JavaScript/Node.js | |
| ```javascript | |
| const FormData = require('form-data'); | |
| const fs = require('fs'); | |
| const axios = require('axios'); | |
| async function analyzeImage(imagePath) { | |
| const formData = new FormData(); | |
| formData.append('file', fs.createReadStream(imagePath)); | |
| const response = await axios.post( | |
| 'http://127.0.0.1:8001/analyze', | |
| formData, | |
| { headers: formData.getHeaders() } | |
| ); | |
| const data = response.data; | |
| console.log(`Found ${data.analysis.total_elements_detected} UI elements`); | |
| // Save CSV | |
| fs.writeFileSync('coordinates.csv', data.exports.csv_data); | |
| // Save visualization | |
| const vizBuffer = Buffer.from(data.exports.visualization_png_base64, 'base64'); | |
| fs.writeFileSync('visualization.png', vizBuffer); | |
| return data; | |
| } | |
| analyzeImage('screenshot.png').catch(console.error); | |
| ``` | |
| ## Response Data Structure | |
| Each UI element contains: | |
| ```json | |
| { | |
| "template_id": "crop_0031", // Element identifier | |
| "template_file": "crop_0031.png", // Source template file | |
| "confidence": 1.0, // Matching confidence (0-1) | |
| "bbox": { | |
| "x1": 587, // Top-left X | |
| "y1": 393, // Top-left Y | |
| "x2": 763, // Bottom-right X | |
| "y2": 441, // Bottom-right Y | |
| "width": 176, // Element width | |
| "height": 48 // Element height | |
| }, | |
| "center": { | |
| "x": 675, // Center X (for clicking) | |
| "y": 417 // Center Y (for clicking) | |
| }, | |
| "bbox_ratio": { | |
| "x1": 0.430, // Normalized X1 (0-1) | |
| "y1": 0.512, // Normalized Y1 (0-1) | |
| "x2": 0.559, // Normalized X2 (0-1) | |
| "y2": 0.575 // Normalized Y2 (0-1) | |
| } | |
| } | |
| ``` | |
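The derived fields can be recomputed from `bbox` and the image size, which is handy for sanity-checking a response. The sketch below assumes integer-division centers and ratios rounded to three decimals; those rules reproduce the sample values in this document, but the server's actual rounding is not documented here, and `derive_fields` is a hypothetical helper name.

```python
def derive_fields(bbox: dict, image_width: int, image_height: int) -> dict:
    """Recompute width, height, center, and bbox_ratio from corner coordinates."""
    x1, y1, x2, y2 = bbox["x1"], bbox["y1"], bbox["x2"], bbox["y2"]
    return {
        "width": x2 - x1,
        "height": y2 - y1,
        # Assumed rule: midpoint with integer division.
        "center": {"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
        # Assumed rule: pixel coordinate divided by image size, 3 decimals.
        "bbox_ratio": {
            "x1": round(x1 / image_width, 3),
            "y1": round(y1 / image_height, 3),
            "x2": round(x2 / image_width, 3),
            "y2": round(y2 / image_height, 3),
        },
    }


# The crop_0031 example above, on the 1365x767 sample screenshot.
fields = derive_fields({"x1": 587, "y1": 393, "x2": 763, "y2": 441}, 1365, 767)
print(fields["center"])      # {'x': 675, 'y': 417}
print(fields["bbox_ratio"])  # {'x1': 0.43, 'y1': 0.512, 'x2': 0.559, 'y2': 0.575}
```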
| ## Export Formats | |
| ### JSON | |
| Complete structured data with all coordinates, confidence scores, and metadata. | |
| ### CSV | |
| Spreadsheet-friendly format with columns: | |
| - Element_ID | |
| - Template_File | |
| - Confidence | |
| - X1, Y1, X2, Y2 (pixel coordinates) | |
| - Width, Height | |
| - Center_X, Center_Y | |
| - Ratio_X1, Ratio_Y1, Ratio_X2, Ratio_Y2 | |
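The CSV text can be loaded with Python's `csv` module. This is a small sketch that assumes the header uses exactly the column names listed above; `parse_csv_data` is a hypothetical helper, and the sample row is built from the `crop_0031` values in this document rather than copied from real server output.

```python
import csv
import io

INT_COLUMNS = ("X1", "Y1", "X2", "Y2", "Width", "Height", "Center_X", "Center_Y")
FLOAT_COLUMNS = ("Confidence", "Ratio_X1", "Ratio_Y1", "Ratio_X2", "Ratio_Y2")


def parse_csv_data(csv_data: str) -> list[dict]:
    """Turn the csv_data string into a list of dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_data)):
        row = dict(raw)
        for name in INT_COLUMNS:
            if name in row:
                # int(float(...)) also tolerates values written as "587.0".
                row[name] = int(float(row[name]))
        for name in FLOAT_COLUMNS:
            if name in row:
                row[name] = float(row[name])
        rows.append(row)
    return rows


sample = (
    "Element_ID,Template_File,Confidence,X1,Y1,X2,Y2,Width,Height,"
    "Center_X,Center_Y,Ratio_X1,Ratio_Y1,Ratio_X2,Ratio_Y2\n"
    "crop_0031,crop_0031.png,1.0,587,393,763,441,176,48,675,417,0.430,0.512,0.559,0.575\n"
)
rows = parse_csv_data(sample)
print(rows[0]["Element_ID"], rows[0]["Center_X"], rows[0]["Center_Y"])  # crop_0031 675 417
```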
| ### Visualization PNG | |
| High-resolution image with: | |
| - Green bounding boxes around each element | |
| - Red center point marker | |
| - Element ID and confidence label for each box | |
| ## Server-Side File Storage | |
| The server maintains a temporary cropped images directory: | |
| ``` | |
| /tmp/omoi_cropped_images/ | |
| ├── crop_0000.png | |
| ├── crop_0001.png | |
| ├── crop_0002.png | |
| └── ... (120+ images) | |
| ``` | |
| These files are: | |
| - Used for template matching | |
| - Kept on server for reference | |
| - NOT sent to clients | |
| - Cleared on server restart | |
| ## Performance | |
| Typical performance on CPU: | |
| - OmniParser detection: ~10 seconds | |
| - Template matching: ~5 seconds | |
| - Total: ~15 seconds per screenshot | |
| ## Architecture | |
| ``` | |
| Client Request (PNG) | |
| ↓ | |
| [API Server] | |
|   1. Receives PNG | |
|   2. Runs OmniParser | |
|      ├─ Detects UI elements | |
|      └─ Saves cropped images (server-side only) | |
|   3. Template matches crops back to original | |
|   4. Generates coordinates | |
|   5. Creates visualization | |
|   6. Exports to JSON/CSV | |
| ↓ | |
| Client Response (JSON, CSV, PNG) | |
| - Coordinates metadata | |
| - CSV data | |
| - Visualization image | |
| (NO cropped images to client) | |
| ``` | |
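Step 3 of the diagram can be illustrated with a toy sliding-window search. This is a deliberately simplified stand-in, not the server's implementation: it compares pixels for exact equality on small nested lists, whereas a real server would more likely call an image library's template-matching routine. `match_template` is a hypothetical name. The sketch does show why a crop taken from the same screenshot matches with confidence 1.0.

```python
def match_template(image: list[list[int]], template: list[list[int]]) -> dict:
    """Find where template best fits inside image (both are 2D lists of pixel values).

    The score is the fraction of pixels that are identical at a given offset.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_score, best_x, best_y = 0.0, 0, 0
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            same = sum(
                image[y + dy][x + dx] == template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            score = same / (th * tw)
            if score > best_score:
                best_score, best_x, best_y = score, x, y
    return {
        "confidence": best_score,
        "bbox": {
            "x1": best_x, "y1": best_y,
            "x2": best_x + tw, "y2": best_y + th,
            "width": tw, "height": th,
        },
    }


# A 5x4 "screenshot" with unique pixel values, and a 2x2 crop taken from it.
image = [[row * 10 + col for col in range(5)] for row in range(4)]
crop = [row[2:4] for row in image[1:3]]
print(match_template(image, crop))
```

Because the crop is cut from the image being searched, an exact match exists at the crop's original position.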
| ## Coordinate Systems | |
| ### Absolute Coordinates | |
| Pixel coordinates in the original image: | |
| - `bbox.x1, bbox.y1`: Top-left corner | |
| - `bbox.x2, bbox.y2`: Bottom-right corner | |
| - `center.x, center.y`: Center point (use for mouse clicks) | |
| ### Normalized Coordinates | |
| 0-1 scale for responsive designs: | |
| - `bbox_ratio.x1, bbox_ratio.y1`: Top-left (normalized) | |
| - `bbox_ratio.x2, bbox_ratio.y2`: Bottom-right (normalized) | |
| - Useful for scaling to different screen sizes | |
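To make the scaling concrete, here is a minimal sketch that maps `bbox_ratio` back to pixels on a screen of any size. `ratio_to_pixels` is a hypothetical helper, and it assumes the target screen shows the same layout as the analyzed screenshot, only scaled.

```python
def ratio_to_pixels(bbox_ratio: dict, screen_width: int, screen_height: int) -> dict:
    """Convert normalized (0-1) corners to pixel corners and a click point."""
    x1 = bbox_ratio["x1"] * screen_width
    y1 = bbox_ratio["y1"] * screen_height
    x2 = bbox_ratio["x2"] * screen_width
    y2 = bbox_ratio["y2"] * screen_height
    return {
        "x1": round(x1), "y1": round(y1), "x2": round(x2), "y2": round(y2),
        "center": {"x": round((x1 + x2) / 2), "y": round((y1 + y2) / 2)},
    }


ratio = {"x1": 0.430, "y1": 0.512, "x2": 0.559, "y2": 0.575}
# At the original 1365x767 size this recovers the absolute coordinates.
print(ratio_to_pixels(ratio, 1365, 767))
# At twice the resolution the click point scales accordingly.
print(ratio_to_pixels(ratio, 2730, 1534)["center"])  # {'x': 1350, 'y': 834}
```

Since the ratios are rounded to three decimals, results on much larger screens can be off by a pixel or two.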
| ## Tips | |
| 1. **Clicking Elements**: Use `center.x` and `center.y` for mouse position | |
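| 2. **Validation**: Because the templates are cropped from the same screenshot, elements normally report `confidence: 1.0` (perfect match); a lower value is worth inspecting | |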
| 3. **Filtering**: Use `bbox_ratio` for responsive element filtering | |
| 4. **Region Queries**: Client library supports finding elements in bounding boxes | |
| 5. **Batch Processing**: Queue multiple images for analysis | |
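Tips 2 and 4 can also be applied locally to an `elements` list that has already been fetched, which avoids re-uploading the image for each query. The sketch below is an assumption about what "in a region" means: it keeps elements whose center falls inside the box, which may differ from how the client library's `find_elements_in_region` decides. `elements_in_region` is a hypothetical helper.

```python
def elements_in_region(elements: list[dict], x1: int, y1: int, x2: int, y2: int,
                       min_confidence: float = 0.0) -> list[dict]:
    """Return elements whose center lies inside the box and meets the confidence floor."""
    return [
        e for e in elements
        if x1 <= e["center"]["x"] <= x2
        and y1 <= e["center"]["y"] <= y2
        and e["confidence"] >= min_confidence
    ]


# Two elements shaped like the samples in this document.
elements = [
    {"template_id": "crop_0000", "confidence": 1.0, "center": {"x": 116, "y": 21}},
    {"template_id": "crop_0031", "confidence": 1.0, "center": {"x": 675, "y": 417}},
]
top_strip = elements_in_region(elements, 0, 0, 1365, 100)
print([e["template_id"] for e in top_strip])  # ['crop_0000']
```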
| ## Troubleshooting | |
| **"OmniParser not initialized"** - Server failed to load models, check logs | |
| **"Failed to decode image"** - Ensure you're uploading valid PNG/JPG files | |
| **"Cropped images directory not found"** - OmniParser detection failed, check input image | |
| **Timeout** - Processing large images takes time, increase request timeout | |
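A quick client-side check can catch some "Failed to decode image" cases before the upload. The sketch below only inspects the leading signature bytes of PNG and JPEG files, so it will not detect a file that is truncated or corrupted further in; `sniff_image_type` is a hypothetical helper name.

```python
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature
JPEG_SIGNATURE = b"\xff\xd8\xff"      # start-of-image marker plus next marker prefix


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return "png" or "jpeg" based on the leading bytes, or None if neither matches."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


print(sniff_image_type(PNG_SIGNATURE + b"rest of file"))  # png
print(sniff_image_type(b"GIF89a"))                        # None
```

In practice the first few bytes read from the file on disk are enough for this check.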
| ## Files | |
| - `ui_element_api_server.py` - Main API server | |
| - `ui_element_client.py` - Python client library | |
| - `ui_element_locator.py` - Template matching utility | |
| - `ui_element_analyzer.py` - Analysis and export utilities | |
| --- | |
| **Status**: ✅ Production Ready | |
| **Last Updated**: April 17, 2026 | |