---
title: CU1-Xtended
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 4.36.0
app_file: app.py
pinned: false
---

CU-1 UI Element Detector

Detect and classify UI elements in screenshots using a multi-model AI pipeline.

πŸ—οΈ Architecture

CU-1 uses a service-oriented architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  app_api.py          │  app_ui.py                           │
│  API Server Entry    │  Gradio UI Entry                     │
└─────────────┬────────┴───────────┬──────────────────────────┘
              │                   │
              │                   │ HTTP/REST
              │                   │ (requests library)
              │                   │
┌─────────────▼───────┐  ┌────────▼──────────────────────────┐
│   API LAYER         │  │   UI LAYER                        │
├─────────────────────┤  ├───────────────────────────────────┤
│  api/endpoints.py   │  │  ui/gradio_interface.py           │
│  - Thin HTTP layer  │  │  - Gradio web interface           │
│ - Request validation│  │  - Calls API via HTTP             │
│  - No business logic│  │  - Displays results               │
└─────────────┬───────┘  └───────────────────────────────────┘
              │
              │ Direct import
              │
┌─────────────▼──────────────────────────────────────────────┐
│                  DETECTION LAYER                            │
│                  (Business Logic)                           │
├─────────────────────────────────────────────────────────────┤
│  detection/service.py          │  Main detection service    │
│  detection/ocr_handler.py      │  OCR-only processing       │
│  detection/response_builder.py │  Response formatting       │
└─────────────────────────────────────────────────────────────┘

Multi-Model Pipeline

CU-1 chains four AI models in a single pipeline:

  1. RF-DETR (Detection Transformer)

    • Detects generic "UI elements" as a SINGLE CLASS
    • Provides bounding boxes and confidence scores
    • Does NOT distinguish between button, input, text, etc.
  2. CLIP (OpenAI)

    • OPTIONAL multi-class classification
    • Takes RF-DETR detections and classifies them into 6 types:
      • button - Buttons, FABs, chips, switches
      • input - Text fields, search bars
      • text - Labels, titles, paragraphs
      • image - Images, icons, avatars
      • list_item - List items, cards, tiles
      • navigation - Navigation bars, tabs, menus
  3. EasyOCR

    • Extracts text content from detected regions
    • Runs global OCR merge to catch text outside detection boxes
  4. BLIP (Salesforce)

    • OPTIONAL visual description generation
    • Describes icons and images when text is not present

🚀 Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd CU1X

# Install dependencies
pip install -r requirements.txt

Running the Application

📖 NEW: Architecture unified! All modes now use the API layer for consistency. See START.md for a detailed guide.

Option 1: One-Command Launch (Recommended for Testing)

Automatically starts both API server and Gradio UI:

python app.py

What happens:

  1. ✅ Starts API server in background (port 8000)
  2. ✅ Waits for API to be ready
  3. ✅ Starts Gradio UI (port 7860)
  4. ✅ Handles clean shutdown with Ctrl+C
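A rough sketch of what this launcher does. This is illustrative only; app.py's actual implementation may differ:

```python
import subprocess
import sys
import time
import urllib.request

def wait_for(url: str, timeout: float = 60.0) -> bool:
    """Poll a URL until it responds, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except OSError:
            time.sleep(0.5)
    return False

def main() -> None:
    # 1. Start the API server in the background (port 8000)
    api = subprocess.Popen([sys.executable, "app_api.py"])
    try:
        # 2. Wait until the API answers before starting the UI
        if not wait_for("http://localhost:8000"):
            raise RuntimeError("API server did not come up in time")
        # 3. Run the Gradio UI in the foreground (port 7860)
        subprocess.run([sys.executable, "app_ui.py"])
    finally:
        # 4. Ctrl+C lands here: shut the API server down cleanly
        api.terminate()
        api.wait()

if __name__ == "__main__":
    main()
```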

Access:

  • Gradio UI: http://localhost:7860
  • API server: http://localhost:8000

Option 2: Manual Launch (2 Terminals)

For more control and debugging:

# Terminal 1: Start API server
python app_api.py

# Terminal 2: Start Gradio UI
python app_ui.py

Access:

  • Gradio UI: http://localhost:7860
  • API server: http://localhost:8000

Option 3: API Only

For API-only usage (scripts, integrations):

python app_api.py

Then use the REST API programmatically (see examples below).

📡 API Usage

Python Example

import requests

# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")

cURL Example

curl -X POST "http://localhost:8000/detect" \
  -F "image=@screenshot.png" \
  -F "confidence_threshold=0.35" \
  -F "enable_clip=true" \
  -F "enable_ocr=true"

Response Format

{
  "success": true,
  "detections": [
    {
      "box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
      "confidence": 0.79,
      "class_id": 0,
      "class_name": "button",
      "text": "Submit",
      "description": ""
    }
  ],
  "total_detections": 1,
  "image_size": {"width": 1080, "height": 1920},
  "parameters": {
    "confidence_threshold": 0.35,
    "enable_clip": true,
    "enable_ocr": true,
    "enable_blip": false
  },
  "type_distribution": {"button": 5, "text": 12},
  "annotated_image": {
    "mime": "image/png",
    "base64": "iVBORw0KGgoAAAANSU..."
  }
}
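The annotated_image field is base64-encoded. A small helper to write it back out as a PNG, assuming the response shape documented above:

```python
import base64

def save_annotated_image(results: dict, path: str) -> int:
    """Decode the base64 annotated image from a /detect response and
    write it to `path`. Returns the number of bytes written."""
    img = results.get("annotated_image") or {}
    raw = base64.b64decode(img.get("base64", ""))
    with open(path, "wb") as f:
        return f.write(raw)

# Usage (after results = response.json()):
# save_annotated_image(results, "annotated.png")
```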

🐍 Python Library Usage

You can also use CU-1 as a Python library:

from detection.service import DetectionService

# Initialize detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")

🎯 Detection Modes

1. Full Detection Mode (Default)

Uses RF-DETR to detect elements, optionally classifies with CLIP, extracts text with OCR.

data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}

2. OCR-Only Mode

Bypasses RF-DETR and runs OCR directly across the entire image.

data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}

3. Visual Description Mode

Generates descriptions for icons using BLIP.

data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}

πŸ“ Project Structure

CU1X/
├── app_api.py              # API server entry point
├── app_ui.py               # Gradio UI entry point
├── detection/              # Business logic layer
│   ├── __init__.py
│   ├── service.py          # Main DetectionService
│   ├── ocr_handler.py      # OCR-only processing
│   └── response_builder.py # Response formatting
├── api/                    # HTTP layer (thin)
│   ├── __init__.py
│   └── endpoints.py        # FastAPI endpoints
├── ui/                     # UI layer
│   ├── __init__.py
│   └── gradio_interface.py # Gradio interface (API client)
├── rfdetr/                 # RF-DETR implementation
├── model.pth               # Trained model weights
├── requirements.txt        # Python dependencies
└── README.md

βš™οΈ Configuration

Environment Variables

API Server:

  • No configuration needed (runs on port 8000)

Gradio UI:

  • CU1-X_API_URL: API endpoint (default: http://localhost:8000)
  • GRADIO_SERVER_NAME: Server host (default: 0.0.0.0)
  • GRADIO_SERVER_PORT: Server port (default: 7860)
  • GRADIO_SHARE: Enable Gradio sharing (default: false)

Example:

export CU1_API_URL=http://your-api-server:8000
python app_ui.py

πŸ” Detection Parameters

| Parameter            | Type  | Default | Description                    |
|----------------------|-------|---------|--------------------------------|
| confidence_threshold | float | 0.35    | Detection confidence (0.1-0.9) |
| enable_clip          | bool  | false   | Classify element types         |
| enable_ocr           | bool  | true    | Extract text content           |
| enable_blip          | bool  | false   | Generate visual descriptions   |
| blip_scope           | str   | "icons" | "icons" or "all"               |
| ocr_only             | bool  | false   | Skip detection, OCR only       |

πŸ› Bug Fixes in This Version

1. Fixed RF-DETR Single-Class Confusion

Issue: Code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (single class).

Fix:

  • Removed unused base_class_ids variable
  • Added clear documentation explaining RF-DETR is single-class
  • CLIP provides the multi-class classification (6 types)

2. Fixed OCR-Only Validation Logic

Issue: API incorrectly rejected enable_ocr=true when ocr_only=true.

Fix:

# OLD (WRONG):
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT):
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)

πŸ† Key Architecture Principles

  1. Separation of Concerns: Detection logic, API layer, and UI layer are completely isolated
  2. No Business Logic in API: api/endpoints.py only handles HTTP, delegates to detection/ module
  3. Service-Oriented: Gradio UI is a client of the API (HTTP calls), not direct imports
  4. Single Source of Truth: All detection logic in detection/ module
  5. Testability: Each layer can be tested independently
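In code, the second principle looks roughly like this. This is a plain-Python sketch with stand-in names, not the actual api/endpoints.py:

```python
# Sketch: the HTTP layer only validates and delegates; all detection
# logic lives in the service. Names here are illustrative stand-ins.

class DetectionService:
    """Stand-in for detection/service.py: owns the business logic."""

    def analyze(self, image_path: str, confidence_threshold: float = 0.35) -> dict:
        # The real pipeline (RF-DETR, CLIP, OCR, BLIP) would run here.
        return {"detections": [], "total_detections": 0}

def detect_endpoint(image_path: str, confidence_threshold: float = 0.35) -> dict:
    """Stand-in for the /detect handler: validate input, delegate, return."""
    if not 0.1 <= confidence_threshold <= 0.9:
        raise ValueError("confidence_threshold must be between 0.1 and 0.9")
    return DetectionService().analyze(image_path, confidence_threshold)
```

Because the handler holds no logic of its own, the service can be unit-tested without an HTTP server, and the handler with a mocked service.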

🚦 Performance

Detection performance depends on enabled features:

| Mode                        | Time    | Use Case                   |
|-----------------------------|---------|----------------------------|
| RF-DETR only                | ~25-35s | Just bounding boxes        |
| RF-DETR + OCR               | ~30-40s | Text extraction            |
| RF-DETR + CLIP + OCR        | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis          |

Times are approximate and depend on image size and hardware (CPU vs GPU).

🤗 Deploying to Hugging Face Spaces

🚀 Quick Deploy (2 Commands)

Option 1: Automated Scripts (Recommended)

# 1. Check that everything is ready
./check_hf_space.sh

# 2. Deploy automatically
./deploy_hf_space.sh

Option 2: Manual

  1. Create a new Space on Hugging Face

    • Choose "Gradio" as SDK
    • Select hardware (CPU or GPU)
  2. Clone and push:

    git clone https://huggingface.co/spaces/YOUR_USERNAME/CU1-X
    cd CU1-X
    # Copy files from your project
    git lfs install
    git lfs track "*.pth"
    git add .
    git commit -m "Initial deployment"
    git push origin main
    
  3. Space will auto-deploy - First run takes 5-10 minutes (model download)

📚 Documentation

Unified Architecture

NEW: app.py now uses the same unified API architecture everywhere:

  1. ✅ Starts API server in subprocess
  2. ✅ Starts Gradio UI that connects to API
  3. ✅ Same code path as local development
  4. ✅ Consistent behavior across all environments

Benefits:

  • Single code path to maintain (no special HF Spaces mode)
  • Same API layer everywhere (easier debugging)
  • Can scale to separate API/UI servers if needed

🔌 Accessing HF Space via API

Once deployed, your HF Space automatically exposes an API:

# Install Gradio client
pip install gradio_client

# Use your Space
from gradio_client import Client

client = Client("YOUR_USERNAME/cu1-detector")
result = client.predict("screenshot.png", 0.35, 2, True, True, False, False, "Only image & button")

annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")

See:

  • examples/simple_hf_api_example.py - Quick start
  • examples/huggingface_api_usage.py - Full examples (batch, async, etc.)
  • DEPLOYMENT.md - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)

πŸ“ License

See LICENSE file for details.

πŸ™ Acknowledgments

  • RF-DETR: Roboflow
  • CLIP: OpenAI
  • BLIP: Salesforce
  • EasyOCR: JaidedAI

Questions or issues? Please open an issue on GitHub.