---
title: CU1-Xtended
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 4.36.0
app_file: app.py
pinned: false
---

# CU-1 UI Element Detector

Detect and classify UI elements in screenshots using a multi-model AI pipeline.

## 🏗️ Architecture

CU-1 uses a **service-oriented architecture** with clear separation of concerns:

```
┌─────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                    │
├─────────────────────────────────────────────────────────┤
│   app_api.py                │   app_ui.py               │
│   API Server Entry          │   Gradio UI Entry         │
└──────────────┬──────────────┴──────────────┬────────────┘
               │                             │
               │ HTTP/REST                   │
               │ (requests library)          │
┌──────────────▼──────────────┐ ┌────────────▼────────────┐
│          API LAYER          │ │         UI LAYER        │
├─────────────────────────────┤ ├─────────────────────────┤
│ api/endpoints.py            │ │ ui/gradio_interface.py  │
│ - Thin HTTP layer           │ │ - Gradio web interface  │
│ - Request validation        │ │ - Calls API via HTTP    │
│ - No business logic         │ │ - Displays results      │
└──────────────┬──────────────┘ └─────────────────────────┘
               │
               │ Direct import
┌──────────────▼──────────────────────────────────────────┐
│                     DETECTION LAYER                     │
│                    (Business Logic)                     │
├─────────────────────────────────────────────────────────┤
│ detection/service.py           │ Main detection service │
│ detection/ocr_handler.py       │ OCR-only processing    │
│ detection/response_builder.py  │ Response formatting    │
└─────────────────────────────────────────────────────────┘
```

### Multi-Model Pipeline

CU-1 combines four AI models in one pipeline:

1. **RF-DETR (Detection Transformer)**
   - Detects generic "UI elements" as a **single class**
   - Provides bounding boxes and confidence scores
   - Does **not** distinguish between button, input, text, etc.

2. **CLIP (OpenAI)**
   - **Optional** multi-class classification
   - Takes RF-DETR detections and classifies them into **6 types**:
     * `button` - Buttons, FABs, chips, switches
     * `input` - Text fields, search bars
     * `text` - Labels, titles, paragraphs
     * `image` - Images, icons, avatars
     * `list_item` - List items, cards, tiles
     * `navigation` - Navigation bars, tabs, menus

3. **EasyOCR**
   - Extracts text content from detected regions
   - Runs a global OCR merge to catch text outside detection boxes

4. **BLIP (Salesforce)**
   - **Optional** visual description generation
   - Describes icons and images when no text is present

## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd CU1X

# Install dependencies
pip install -r requirements.txt
```

### Running the Application

> 📖 **NEW:** Architecture unified! All modes now use the API layer for consistency.
> See [START.md](START.md) for a detailed guide.
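Whichever launch option you choose below, the Gradio UI can only serve requests once the API server is reachable. The following readiness-poll sketch is illustrative only — the `wait_for_api` helper is not part of the CU-1 codebase, and the probe URL is an assumption — but it shows the kind of check the one-command launcher needs before starting the UI:

```python
import time
import urllib.request
from urllib.error import URLError


def wait_for_api(url: str, timeout: float = 30.0,
                 interval: float = 0.5, probe=None) -> bool:
    """Poll `url` until it answers or `timeout` seconds elapse.

    `probe` may be any callable returning True once the service is up;
    by default it issues an HTTP GET via urllib.
    """
    if probe is None:
        def probe():
            try:
                with urllib.request.urlopen(url, timeout=2):
                    return True
            except (URLError, OSError):
                return False

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():       # service answered: ready
            return True
        time.sleep(interval)
    return False          # gave up after `timeout` seconds


# Hypothetical usage before launching the UI:
# if wait_for_api("http://localhost:8000/docs"):
#     launch_gradio_ui()
```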
**Option 1: One-Command Launch (Recommended for Testing)**

Automatically starts both the API server and the Gradio UI:

```bash
python app.py
```

**What happens:**

1. ✅ Starts the API server in the background (port 8000)
2. ✅ Waits for the API to be ready
3. ✅ Starts the Gradio UI (port 7860)
4. ✅ Handles clean shutdown with Ctrl+C

**Access:**
- Gradio UI: http://localhost:7860
- API Docs: http://localhost:8000/docs

---

**Option 2: Manual Launch (2 Terminals)**

For more control and debugging:

```bash
# Terminal 1: Start the API server
python app_api.py

# Terminal 2: Start the Gradio UI
python app_ui.py
```

**Access:**
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Gradio UI: http://localhost:7860

---

**Option 3: API Only**

For API-only usage (scripts, integrations):

```bash
python app_api.py
```

Then use the REST API programmatically (see the examples below).

## 📡 API Usage

### Python Example

```python
import requests

# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")
```

### cURL Example

```bash
curl -X POST "http://localhost:8000/detect" \
  -F "image=@screenshot.png" \
  -F "confidence_threshold=0.35" \
  -F "enable_clip=true" \
  -F "enable_ocr=true"
```

### Response Format

```json
{
  "success": true,
  "detections": [
    {
      "box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
      "confidence": 0.79,
      "class_id": 0,
      "class_name": "button",
      "text": "Submit",
      "description": ""
    }
  ],
  "total_detections": 1,
  "image_size": {"width": 1080, "height": 1920},
  "parameters": {
    "confidence_threshold": 0.35,
    "enable_clip": true,
    "enable_ocr": true,
    "enable_blip": false
  },
  "type_distribution": {"button": 1},
  "annotated_image": {
    "mime": "image/png",
    "base64": "iVBORw0KGgoAAAANSU..."
  }
}
```

## 🐍 Python Library Usage

You can also use CU-1 as a Python library:

```python
from detection.service import DetectionService

# Initialize the detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze an image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")
```

## 🎯 Detection Modes

### 1. Full Detection Mode (Default)

Uses RF-DETR to detect elements, optionally classifies them with CLIP, and extracts text with OCR.

```python
data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}
```

### 2. OCR-Only Mode

Bypasses RF-DETR and runs OCR directly across the entire image.

```python
data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}
```

### 3. Visual Description Mode

Generates descriptions for icons using BLIP.
```python
data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}
```

## 📁 Project Structure

```
CU1X/
├── app_api.py              # API server entry point
├── app_ui.py               # Gradio UI entry point
├── detection/              # Business logic layer
│   ├── __init__.py
│   ├── service.py          # Main DetectionService
│   ├── ocr_handler.py      # OCR-only processing
│   └── response_builder.py # Response formatting
├── api/                    # HTTP layer (thin)
│   ├── __init__.py
│   └── endpoints.py        # FastAPI endpoints
├── ui/                     # UI layer
│   ├── __init__.py
│   └── gradio_interface.py # Gradio interface (API client)
├── rfdetr/                 # RF-DETR implementation
├── model.pth               # Trained model weights
├── requirements.txt        # Python dependencies
└── README.md
```

## ⚙️ Configuration

### Environment Variables

**API Server:**
- No configuration needed (runs on port 8000)

**Gradio UI:**
- `CU1_API_URL`: API endpoint (default: `http://localhost:8000`)
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `GRADIO_SHARE`: Enable Gradio sharing (default: `false`)

Example:

```bash
export CU1_API_URL=http://your-api-server:8000
python app_ui.py
```

## 🔍 Detection Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `confidence_threshold` | float | 0.35 | Detection confidence (0.1-0.9) |
| `enable_clip` | bool | false | Classify element types |
| `enable_ocr` | bool | true | Extract text content |
| `enable_blip` | bool | false | Generate visual descriptions |
| `blip_scope` | str | "icons" | "icons" or "all" |
| `ocr_only` | bool | false | Skip detection, OCR only |

## 🐛 Bug Fixes in This Version

### 1. Fixed RF-DETR Single-Class Confusion

**Issue:** The code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (a single class).
**Fix:**
- Removed the unused `base_class_ids` variable
- Added clear documentation explaining that RF-DETR is single-class
- CLIP provides the multi-class classification (6 types)

### 2. Fixed OCR-Only Validation Logic

**Issue:** The API incorrectly rejected `enable_ocr=true` when `ocr_only=true`.

**Fix:**

```python
# OLD (WRONG): rejected OCR even in OCR-only mode
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT): only CLIP and BLIP are incompatible with ocr_only
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)
```

## 🏆 Key Architecture Principles

1. **Separation of Concerns**: Detection logic, API layer, and UI layer are completely isolated
2. **No Business Logic in the API**: `api/endpoints.py` only handles HTTP and delegates to the `detection/` module
3. **Service-Oriented**: The Gradio UI is a client of the API (HTTP calls), not a direct importer
4. **Single Source of Truth**: All detection logic lives in the `detection/` module
5. **Testability**: Each layer can be tested independently

## 🚦 Performance

Detection performance depends on the enabled features:

| Mode | Time | Use Case |
|------|------|----------|
| RF-DETR only | ~25-35s | Just bounding boxes |
| RF-DETR + OCR | ~30-40s | Text extraction |
| RF-DETR + CLIP + OCR | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis |

*Times are approximate and depend on image size and hardware (CPU vs GPU).*

## 🤗 Deploying to Hugging Face Spaces

### 🚀 Quick Deploy (2 Commands)

**Option 1: Automated Scripts (Recommended)**

```bash
# 1. Check that everything is ready
./check_hf_space.sh

# 2. Deploy automatically
./deploy_hf_space.sh
```

**Option 2: Manual**

1. **Create a new Space** on Hugging Face
   - Choose "Gradio" as the SDK
   - Select hardware (CPU or GPU)

2. **Clone and push:**

   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/CU1-X
   cd CU1-X

   # Copy files from your project
   git lfs install
   git lfs track "*.pth"
   git add .
   git commit -m "Initial deployment"
   git push origin main
   ```

3. **Space will auto-deploy**
   - The first run takes 5-10 minutes (model download)

### 📚 Documentation

- **[QUICK_DEPLOY.md](QUICK_DEPLOY.md)** - Ultra-quick guide (2 commands)
- **[DEPLOYMENT.md](DEPLOYMENT.md)** - Complete, detailed guide
- **[API_USAGE.md](API_USAGE.md)** - How to use the API from outside

### Unified Architecture

**NEW:** `app.py` now uses the same unified API architecture everywhere:

1. ✅ Starts the API server in a subprocess
2. ✅ Starts a Gradio UI that connects to the API
3. ✅ Same code path as local development
4. ✅ Consistent behavior across all environments

**Benefits:**
- Single code path to maintain (no special HF Spaces mode)
- Same API layer everywhere (easier debugging)
- Can scale to separate API/UI servers if needed

### 🔌 Accessing HF Space via API

Once deployed, your HF Space automatically exposes an API:

```python
# Install the Gradio client first:
#   pip install gradio_client
from gradio_client import Client

client = Client("YOUR_USERNAME/cu1-detector")
result = client.predict(
    "screenshot.png", 0.35, 2, True, True, False, False, "Only image & button"
)
annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")
```

**See:**
- `examples/simple_hf_api_example.py` - Quick start
- `examples/huggingface_api_usage.py` - Full examples (batch, async, etc.)
- [DEPLOYMENT.md](DEPLOYMENT.md) - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)

## 📄 License

See the LICENSE file for details.

## 🙏 Acknowledgments

- **RF-DETR**: Roboflow
- **CLIP**: OpenAI
- **BLIP**: Salesforce
- **EasyOCR**: JaidedAI

---

**Questions or issues?** Please open an issue on GitHub.
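As a final worked example, the `annotated_image` field in the Response Format section above carries the annotated screenshot as base64-encoded PNG bytes. A small sketch of decoding it to a file (the `save_annotated_image` helper and the output path are illustrative, not part of the CU-1 codebase):

```python
import base64


def save_annotated_image(response_json: dict, path: str = "annotated.png") -> str:
    """Decode the base64 PNG in `annotated_image` and write it to `path`."""
    b64 = response_json["annotated_image"]["base64"]
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64))   # base64 string -> raw PNG bytes
    return path
```

Pass it the parsed JSON from `/detect` (e.g. `save_annotated_image(response.json())`), then open the written PNG in any viewer.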