---
title: CU1-X UI Element Detector
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
python_version: 3.12
models:
  - Roboflow/RF-DETR
  - openai/clip-vit-base-patch32
  - Salesforce/blip-image-captioning-base
tags:
  - computer-vision
  - object-detection
  - ui-elements
  - ocr
  - transformers
license: mit
---
# CU-1 UI Element Detector

Detect and classify UI elements in screenshots using a multi-model AI pipeline.

## 🏗️ Architecture

CU-1 uses a **service-oriented architecture** with clear separation of concerns:
```
┌───────────────────────────────────────────────────────────────┐
│                       APPLICATION LAYER                       │
├───────────────────────────────────────────────────────────────┤
│        app_api.py         │           app_ui.py               │
│     API Server Entry      │        Gradio UI Entry            │
└─────────────┬─────────────┴──────────────┬────────────────────┘
              │                            │
              │                            │ HTTP/REST
              │                            │ (requests library)
              │                            │
┌─────────────▼─────────────┐  ┌───────────▼───────────────────┐
│         API LAYER         │  │           UI LAYER            │
├───────────────────────────┤  ├───────────────────────────────┤
│  api/endpoints.py         │  │  ui/gradio_interface.py       │
│  - Thin HTTP layer        │  │  - Gradio web interface       │
│  - Request validation     │  │  - Calls API via HTTP         │
│  - No business logic      │  │  - Displays results           │
└─────────────┬─────────────┘  └───────────────────────────────┘
              │
              │ Direct import
              │
┌─────────────▼─────────────────────────────────────────────────┐
│                        DETECTION LAYER                        │
│                        (Business Logic)                       │
├───────────────────────────────────────────────────────────────┤
│  detection/service.py           → Main detection service      │
│  detection/ocr_handler.py       → OCR-only processing         │
│  detection/response_builder.py  → Response formatting         │
└───────────────────────────────────────────────────────────────┘
```
### Multi-Model Pipeline

CU-1 combines four AI models in a single pipeline:

1. **RF-DETR (Detection Transformer)**
   - Detects generic "UI elements" as a **single class**
   - Provides bounding boxes and confidence scores
   - Does NOT distinguish between button, input, text, etc.

2. **CLIP (OpenAI)**
   - **Optional** multi-class classification
   - Takes RF-DETR detections and classifies them into **6 types**:
     * `button` - Buttons, FABs, chips, switches
     * `input` - Text fields, search bars
     * `text` - Labels, titles, paragraphs
     * `image` - Images, icons, avatars
     * `list_item` - List items, cards, tiles
     * `navigation` - Navigation bars, tabs, menus

3. **EasyOCR**
   - Extracts text content from detected regions
   - Runs a global OCR merge to catch text outside detection boxes

4. **BLIP (Salesforce)**
   - **Optional** visual description generation
   - Describes icons and images when no text is present
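The hand-off between the four stages can be sketched as follows. This is purely illustrative: the data structures and function names here are assumptions, not the actual `detection/service.py` API.

```python
from dataclasses import dataclass

# Illustrative stand-in for a single RF-DETR detection (hypothetical shape).
@dataclass
class Detection:
    box: tuple              # (x1, y1, x2, y2) from RF-DETR
    confidence: float       # RF-DETR score for the single "UI element" class
    class_name: str = "ui_element"
    text: str = ""
    description: str = ""

CLIP_LABELS = ["button", "input", "text", "image", "list_item", "navigation"]

def run_pipeline(image, detections, classify, read_text, describe,
                 enable_clip=True, enable_ocr=True, enable_blip=False):
    """Sketch of the multi-model hand-off: detect -> classify -> OCR -> BLIP."""
    results = []
    for det in detections:                   # 1. RF-DETR output (single class)
        if enable_clip:                      # 2. CLIP refines into 6 types
            det.class_name = classify(image, det.box, CLIP_LABELS)
        if enable_ocr:                       # 3. EasyOCR reads text in the region
            det.text = read_text(image, det.box)
        if enable_blip and not det.text:     # 4. BLIP describes text-free icons
            det.description = describe(image, det.box)
        results.append(det)
    return results
```

Note how BLIP only runs on detections with no OCR text, matching the "when no text is present" rule above.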
## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd CU1X

# Install dependencies
pip install -r requirements.txt
```

### Running the Application

> 🎉 **NEW:** Architecture unified! All modes now use the API layer for consistency.
> See [START.md](START.md) for a detailed guide.

**Option 1: One-Command Launch (Recommended for Testing)**

Automatically starts both the API server and the Gradio UI:

```bash
python app.py
```

**What happens:**

1. ✅ Starts the API server in the background (port 8000)
2. ✅ Waits for the API to be ready
3. ✅ Starts the Gradio UI (port 7860)
4. ✅ Handles clean shutdown with Ctrl+C

**Access:**
- Gradio UI: http://localhost:7860
- API Docs: http://localhost:8000/docs
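The readiness wait in step 2 can be approximated with a simple polling loop. This is a sketch under assumptions: the actual `app.py` may poll a different URL or use a different mechanism.

```python
import time
import urllib.error
import urllib.request

def wait_for_api(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False

# Hypothetical usage before launching the UI:
# assert wait_for_api("http://localhost:8000/docs"), "API never became ready"
```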
---

**Option 2: Manual Launch (2 Terminals)**

For more control and debugging:

```bash
# Terminal 1: Start API server
python app_api.py

# Terminal 2: Start Gradio UI
python app_ui.py
```

**Access:**
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Gradio UI: http://localhost:7860

---

**Option 3: API Only**

For API-only usage (scripts, integrations):

```bash
python app_api.py
```

Then use the REST API programmatically (see the examples below).
## 📡 API Usage

### Python Example

```python
import requests

# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")
```
### cURL Example

```bash
curl -X POST "http://localhost:8000/detect" \
  -F "image=@screenshot.png" \
  -F "confidence_threshold=0.35" \
  -F "enable_clip=true" \
  -F "enable_ocr=true"
```
### Response Format

```json
{
  "success": true,
  "detections": [
    {
      "box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
      "confidence": 0.79,
      "class_id": 0,
      "class_name": "button",
      "text": "Submit",
      "description": ""
    }
  ],
  "total_detections": 1,
  "image_size": {"width": 1080, "height": 1920},
  "parameters": {
    "confidence_threshold": 0.35,
    "enable_clip": true,
    "enable_ocr": true,
    "enable_blip": false
  },
  "type_distribution": {"button": 1},
  "annotated_image": {
    "mime": "image/png",
    "base64": "iVBORw0KGgoAAAANSU..."
  }
}
```
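The `annotated_image` field carries the rendered detections as a base64-encoded PNG, so saving it to disk takes only a few lines (the helper name here is illustrative):

```python
import base64

def save_annotated_image(results, path="annotated.png"):
    """Decode the base64 PNG from the /detect response and write it to disk."""
    raw = base64.b64decode(results["annotated_image"]["base64"])
    with open(path, "wb") as f:
        f.write(raw)
    return path

# Hypothetical usage with the response from the Python example above:
# save_annotated_image(response.json(), "screenshot_annotated.png")
```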
## 📚 Python Library Usage

You can also use CU-1 as a Python library:

```python
from detection.service import DetectionService

# Initialize the detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze an image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")
```
## 🎯 Detection Modes

### 1. Full Detection Mode (Default)

Uses RF-DETR to detect elements, optionally classifies them with CLIP, and extracts text with OCR.

```python
data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}
```

### 2. OCR-Only Mode

Bypasses RF-DETR and runs OCR directly across the entire image.

```python
data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}
```

### 3. Visual Description Mode

Generates descriptions for icons using BLIP.

```python
data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}
```
## 📁 Project Structure

```
CU1X/
├── app.py                  # One-command launcher (API + UI)
├── app_api.py              # API server entry point
├── app_ui.py               # Gradio UI entry point
├── detection/              # Business logic layer
│   ├── __init__.py
│   ├── service.py          # Main DetectionService
│   ├── ocr_handler.py      # OCR-only processing
│   └── response_builder.py # Response formatting
├── api/                    # HTTP layer (thin)
│   ├── __init__.py
│   └── endpoints.py        # FastAPI endpoints
├── ui/                     # UI layer
│   ├── __init__.py
│   └── gradio_interface.py # Gradio interface (API client)
├── rfdetr/                 # RF-DETR implementation
├── model.pth               # Trained model weights
├── requirements.txt        # Python dependencies
└── README.md
```
## ⚙️ Configuration

### Environment Variables

**API Server:**
- No configuration needed (runs on port 8000)

**Gradio UI:**
- `CU1_API_URL`: API endpoint (default: `http://localhost:8000`)
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `GRADIO_SHARE`: Enable Gradio sharing (default: `false`)

Example:

```bash
export CU1_API_URL=http://your-api-server:8000
python app_ui.py
```
## 📊 Detection Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `confidence_threshold` | float | 0.35 | Detection confidence (0.1-0.9) |
| `enable_clip` | bool | false | Classify element types |
| `enable_ocr` | bool | true | Extract text content |
| `enable_blip` | bool | false | Generate visual descriptions |
| `blip_scope` | str | "icons" | "icons" or "all" |
| `ocr_only` | bool | false | Skip detection, OCR only |
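A client-side pre-check can catch invalid combinations before the request is sent. The helper below is a sketch mirroring the table and the `ocr_only` rule described in the bug-fix section; it is not part of the CU-1 codebase.

```python
def validate_params(confidence_threshold=0.35, enable_clip=False, enable_ocr=True,
                    enable_blip=False, blip_scope="icons", ocr_only=False):
    """Client-side sanity check mirroring the parameter table above (illustrative)."""
    if not 0.1 <= confidence_threshold <= 0.9:
        raise ValueError("confidence_threshold must be between 0.1 and 0.9")
    if blip_scope not in ("icons", "all"):
        raise ValueError('blip_scope must be "icons" or "all"')
    # ocr_only bypasses detection, so CLIP/BLIP are invalid -- but OCR itself is fine
    if ocr_only and (enable_clip or enable_blip):
        raise ValueError("ocr_only cannot be combined with enable_clip or enable_blip")
    return {
        "confidence_threshold": confidence_threshold,
        "enable_clip": enable_clip,
        "enable_ocr": enable_ocr,
        "enable_blip": enable_blip,
        "blip_scope": blip_scope,
        "ocr_only": ocr_only,
    }

# Hypothetical usage:
# requests.post(url, files={"image": f}, data=validate_params(enable_clip=True))
```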
## 🐛 Bug Fixes in This Version

### 1. Fixed RF-DETR Single-Class Confusion

**Issue:** The code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (a single class).

**Fix:**
- Removed the unused `base_class_ids` variable
- Added clear documentation explaining that RF-DETR is single-class
- CLIP provides the multi-class classification (6 types)

### 2. Fixed OCR-Only Validation Logic

**Issue:** The API incorrectly rejected `enable_ocr=true` when `ocr_only=true`.

**Fix:**

```python
# OLD (WRONG): rejected OCR itself in OCR-only mode
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT): only CLIP and BLIP are incompatible with ocr_only
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)
```
## 🔑 Key Architecture Principles

1. **Separation of Concerns**: Detection logic, API layer, and UI layer are completely isolated
2. **No Business Logic in the API**: `api/endpoints.py` only handles HTTP and delegates to the `detection/` module
3. **Service-Oriented**: The Gradio UI is a client of the API (HTTP calls), not a direct import
4. **Single Source of Truth**: All detection logic lives in the `detection/` module
5. **Testability**: Each layer can be tested independently
## 🚦 Performance

Detection performance depends on enabled features:

| Mode | Time | Use Case |
|------|------|----------|
| RF-DETR only | ~25-35s | Just bounding boxes |
| RF-DETR + OCR | ~30-40s | Text extraction |
| RF-DETR + CLIP + OCR | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis |

*Times are approximate and depend on image size and hardware (CPU vs GPU).*
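To check these numbers on your own hardware, a minimal timing wrapper around a `/detect` call is enough (the wrapper below is a generic sketch, not part of CU-1):

```python
import time

def timed(fn, *args, **kwargs):
    """Run `fn` once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage (assumes the API server from the examples above is running):
# resp, secs = timed(requests.post, "http://localhost:8000/detect",
#                    files={"image": f}, data={"enable_ocr": True})
# print(f"detect took {secs:.1f}s")
```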
## 🤗 Deploying to Hugging Face Spaces

### 🚀 Quick Deploy (2 Commands)

**Option 1: Automated Scripts (Recommended)**

```bash
# 1. Check that everything is ready
./check_hf_space.sh

# 2. Deploy automatically
./deploy_hf_space.sh
```

**Option 2: Manual**

1. **Create a new Space** on Hugging Face
   - Choose "Gradio" as the SDK
   - Select hardware (CPU or GPU)

2. **Clone and push:**

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/CU1-X
cd CU1-X
# Copy files from your project
git lfs install
git lfs track "*.pth"
git add .
git commit -m "Initial deployment"
git push origin main
```

3. **The Space will auto-deploy** - the first run takes 5-10 minutes (model download)

### 📚 Documentation

- **[QUICK_DEPLOY.md](QUICK_DEPLOY.md)** - Ultra-quick guide (2 commands)
- **[DEPLOYMENT.md](DEPLOYMENT.md)** - Complete, detailed guide
- **[API_USAGE.md](API_USAGE.md)** - How to use the API from outside
### Unified Architecture

**NEW:** `app.py` now uses the same unified API architecture everywhere:

1. ✅ Starts the API server in a subprocess
2. ✅ Starts a Gradio UI that connects to the API
3. ✅ Same code path as local development
4. ✅ Consistent behavior across all environments

**Benefits:**
- Single code path to maintain (no special HF Spaces mode)
- Same API layer everywhere (easier debugging)
- Can scale to separate API/UI servers if needed
### 🌐 Accessing the HF Space via API

Once deployed, your HF Space automatically exposes an API:

```python
# First install the client: pip install gradio_client
from gradio_client import Client

client = Client("YOUR_USERNAME/cu1-detector")
result = client.predict(
    "screenshot.png", 0.35, 2, True, True, False, False, "Only image & button"
)
annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")
```

**See:**
- `examples/simple_hf_api_example.py` - Quick start
- `examples/huggingface_api_usage.py` - Full examples (batch, async, etc.)
- [DEPLOYMENT.md](DEPLOYMENT.md) - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)
## 📄 License

See the LICENSE file for details.

## 🙏 Acknowledgments

- **RF-DETR**: Roboflow
- **CLIP**: OpenAI
- **BLIP**: Salesforce
- **EasyOCR**: JaidedAI

---

**Questions or issues?** Please open an issue on GitHub.