# CU-1 UI Element Detector

Detect and classify UI elements in screenshots using a multi-model AI pipeline.

## 🏗️ Architecture

CU-1 uses a **service-oriented architecture** with clear separation of concerns:

```
┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  app_api.py          │  app_ui.py                           │
│  API Server Entry    │  Gradio UI Entry                     │
└─────────────┬────────┴──────────┬──────────────────────────┘
              │                   │
              │                   │ HTTP/REST
              │                   │ (requests library)
              │                   │
┌─────────────▼───────┐  ┌────────▼─────────────────────────┐
│   API LAYER         │  │   UI LAYER                        │
├─────────────────────┤  ├───────────────────────────────────┤
│  api/endpoints.py   │  │  ui/gradio_interface.py           │
│  - Thin HTTP layer  │  │  - Gradio web interface           │
│ - Request validation│  │  - Calls API via HTTP             │
│  - No business logic│  │  - Displays results               │
└─────────────┬───────┘  └───────────────────────────────────┘
              │
              │ Direct import
              │
┌─────────────▼──────────────────────────────────────────────┐
│                  DETECTION LAYER                            │
│                  (Business Logic)                           │
├─────────────────────────────────────────────────────────────┤
│  detection/service.py       │  Main detection service       │
│  detection/ocr_handler.py   │  OCR-only processing          │
│  detection/response_builder.py │ Response formatting        │
└─────────────────────────────────────────────────────────────┘
```

### Multi-Model Pipeline

CU-1 chains four models into a staged pipeline. Only the detector is always required; the other stages can be switched on or off per request:

1. **RF-DETR (Detection Transformer)**
   - Detects generic "UI elements" as a **SINGLE CLASS**
   - Provides bounding boxes and confidence scores
   - Does NOT distinguish between button, input, text, etc.

2. **CLIP (OpenAI)**
   - **OPTIONAL** multi-class classification
   - Takes RF-DETR detections and classifies them into **6 types**:
     * `button` - Buttons, FABs, chips, switches
     * `input` - Text fields, search bars
     * `text` - Labels, titles, paragraphs
     * `image` - Images, icons, avatars
     * `list_item` - List items, cards, tiles
     * `navigation` - Navigation bars, tabs, menus

3. **EasyOCR**
   - Extracts text content from detected regions
   - Runs global OCR merge to catch text outside detection boxes

4. **BLIP (Salesforce)**
   - **OPTIONAL** visual description generation
   - Describes icons and images when text is not present
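The ordering of the four stages can be sketched as follows. This is a minimal illustration, not the code in `detection/service.py`: the `Element` dataclass and the stage callables (`detect`, `classify`, `read_text`, `describe`) are hypothetical stand-ins for RF-DETR, CLIP, EasyOCR, and BLIP, and running classification before OCR is an assumption. The global OCR merge step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class Element:
    box: Box
    confidence: float
    class_name: str = "ui_element"  # the detector only knows one generic class
    text: str = ""
    description: str = ""


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Tuple[Box, float]]],
    classify: Optional[Callable[[object, Box], str]] = None,
    read_text: Optional[Callable[[object, Box], str]] = None,
    describe: Optional[Callable[[object, Box], str]] = None,
    confidence_threshold: float = 0.35,
) -> List[Element]:
    """Run the mandatory detector, then whichever optional stages were supplied."""
    # Stage 1: single-class detection, filtered by confidence.
    elements = [
        Element(box, conf)
        for box, conf in detect(image)
        if conf >= confidence_threshold
    ]
    for element in elements:
        # Stage 2 (optional): refine the generic class into a UI type.
        if classify is not None:
            element.class_name = classify(image, element.box)
        # Stage 3 (optional): read text inside the detected region.
        if read_text is not None:
            element.text = read_text(image, element.box)
        # Stage 4 (optional): describe the region only when no text was found.
        if describe is not None and not element.text:
            element.description = describe(image, element.box)
    return elements
```

Passing `None` for a stage mirrors turning off the corresponding `enable_*` flag, which is why the detector is the only required callable in this sketch.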

## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd CU1X

# Install dependencies
pip install -r requirements.txt
```

### Running the Application

> 📖 **NEW:** Architecture unified! All modes now use the API layer for consistency.
> See [START.md](START.md) for a detailed guide.

**Option 1: One-Command Launch (Recommended for Testing)**

Automatically starts both API server and Gradio UI:

```bash
python app.py
```

**What happens:**
1. ✅ Starts API server in background (port 8000)
2. ✅ Waits for API to be ready
3. ✅ Starts Gradio UI (port 7860)
4. ✅ Handles clean shutdown with Ctrl+C

**Access:**
- Gradio UI: http://localhost:7860
- API Docs: http://localhost:8000/docs
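The "waits for API to be ready" step is typically a polling loop. The sketch below shows one way to write it with the standard library only; `wait_for_api` and `http_probe` are hypothetical names, and probing `/docs` is an assumption based on the URLs listed above, not a description of what `app.py` actually does.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_for_api(
    url: str,
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

A launcher could call `wait_for_api("http://localhost:8000/docs")` and start the UI only when it returns `True`. The `probe` parameter exists so the loop can be tested without a running server.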

---

**Option 2: Manual Launch (2 Terminals)**

For more control and debugging:

```bash
# Terminal 1: Start API server
python app_api.py

# Terminal 2: Start Gradio UI
python app_ui.py
```

**Access:**
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Gradio UI: http://localhost:7860

---

**Option 3: API Only**

For API-only usage (scripts, integrations):

```bash
python app_api.py
```

Then use the REST API programmatically (see examples below).

## 📡 API Usage

### Python Example

```python
import requests

# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")
```

### cURL Example

```bash
curl -X POST "http://localhost:8000/detect" \
  -F "image=@screenshot.png" \
  -F "confidence_threshold=0.35" \
  -F "enable_clip=true" \
  -F "enable_ocr=true"
```

### Response Format

```json
{
  "success": true,
  "detections": [
    {
      "box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
      "confidence": 0.79,
      "class_id": 0,
      "class_name": "button",
      "text": "Submit",
      "description": ""
    }
  ],
  "total_detections": 1,
  "image_size": {"width": 1080, "height": 1920},
  "parameters": {
    "confidence_threshold": 0.35,
    "enable_clip": true,
    "enable_ocr": true,
    "enable_blip": false
  },
  "type_distribution": {"button": 1},
  "annotated_image": {
    "mime": "image/png",
    "base64": "iVBORw0KGgoAAAANSU..."
  }
}
```
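Two fields of this response usually need a little post-processing: `annotated_image.base64` has to be decoded before it can be written to disk, and a per-type count can be recomputed from `detections` if needed. The helpers below are a minimal sketch; their names are hypothetical and they assume only the field layout shown above.

```python
import base64
from collections import Counter
from typing import Dict


def decode_annotated_image(result: dict) -> bytes:
    """Return the raw image bytes stored in result['annotated_image']['base64']."""
    return base64.b64decode(result["annotated_image"]["base64"])


def count_types(result: dict) -> Dict[str, int]:
    """Count detections per class_name, in the same shape as type_distribution."""
    return dict(Counter(d["class_name"] for d in result["detections"]))
```

For example, `Path("annotated.png").write_bytes(decode_annotated_image(results))` (with `from pathlib import Path`) saves the annotated screenshot next to your script.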

## 🐍 Python Library Usage

You can also use CU-1 as a Python library:

```python
from detection.service import DetectionService

# Initialize detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")
```
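In test automation, the next step is often to locate an element by its text and compute a point to interact with. The helpers below are a hypothetical sketch built only on the result dictionary shown above; they are not part of CU-1.

```python
from typing import List, Tuple


def find_by_text(results: dict, needle: str) -> List[dict]:
    """Return detections whose OCR text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [
        d for d in results["detections"]
        if needle in (d.get("text") or "").lower()
    ]


def box_center(box: dict) -> Tuple[int, int]:
    """Return the integer center point of an {x1, y1, x2, y2} box."""
    return ((box["x1"] + box["x2"]) // 2, (box["y1"] + box["y2"]) // 2)
```

With the sample response from the API section, `box_center(find_by_text(results, "submit")[0]["box"])` gives the center of the "Submit" button.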

## 🎯 Detection Modes

### 1. Full Detection Mode (Default)

Uses RF-DETR to detect elements, optionally classifies them with CLIP, and extracts text with OCR.

```python
data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}
```

### 2. OCR-Only Mode

Bypasses RF-DETR and runs OCR directly across the entire image.

```python
data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}
```

### 3. Visual Description Mode

Generates descriptions for icons using BLIP.

```python
data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}
```

## 📁 Project Structure

```
CU1X/
├── app.py                  # Unified entry point (API + UI)
├── app_api.py              # API server entry point
├── app_ui.py               # Gradio UI entry point
├── detection/              # Business logic layer
│   ├── __init__.py
│   ├── service.py          # Main DetectionService
│   ├── ocr_handler.py      # OCR-only processing
│   └── response_builder.py # Response formatting
├── api/                    # HTTP layer (thin)
│   ├── __init__.py
│   └── endpoints.py        # FastAPI endpoints
├── ui/                     # UI layer
│   ├── __init__.py
│   └── gradio_interface.py # Gradio interface (API client)
├── rfdetr/                 # RF-DETR implementation
├── model.pth               # Trained model weights
├── requirements.txt        # Python dependencies
└── README.md
```

## ⚙️ Configuration

### Environment Variables

**API Server:**
- No configuration needed (runs on port 8000)

**Gradio UI:**
- `CU1_API_URL`: API endpoint (default: `http://localhost:8000`)
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `GRADIO_SHARE`: Enable Gradio sharing (default: `false`)

Example:
```bash
export CU1_API_URL=http://your-api-server:8000
python app_ui.py
```
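On the Python side, these variables can be read once at startup and turned into typed settings. The sketch below is an assumption about how that could look, not the code in `app_ui.py`: `UiConfig` and `load_ui_config` are hypothetical names, it uses the `CU1_API_URL` spelling from the `export` example, and the set of strings accepted as "true" for `GRADIO_SHARE` is a guess.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UiConfig:
    api_url: str
    server_name: str
    server_port: int
    share: bool


def load_ui_config(env: Mapping[str, str] = os.environ) -> UiConfig:
    """Build the UI settings from environment variables, using the documented defaults."""
    return UiConfig(
        # Strip a trailing slash so that api_url + "/detect" never contains "//".
        api_url=env.get("CU1_API_URL", "http://localhost:8000").rstrip("/"),
        server_name=env.get("GRADIO_SERVER_NAME", "0.0.0.0"),
        server_port=int(env.get("GRADIO_SERVER_PORT", "7860")),
        share=env.get("GRADIO_SHARE", "false").strip().lower() in ("1", "true", "yes"),
    )
```

Accepting a mapping instead of reading `os.environ` directly keeps the function easy to test with a plain dictionary.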

## 🔍 Detection Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `confidence_threshold` | float | 0.35 | Detection confidence (0.1-0.9) |
| `enable_clip` | bool | false | Classify element types |
| `enable_ocr` | bool | true | Extract text content |
| `enable_blip` | bool | false | Generate visual descriptions |
| `blip_scope` | str | "icons" | "icons" or "all" |
| `ocr_only` | bool | false | Skip detection, OCR only |
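A client can check these constraints before sending a request. The helper below is a hypothetical sketch, not part of the project: it applies the defaults and ranges from the table, rejects `ocr_only` combined with CLIP or BLIP, and serializes booleans as lowercase strings the way the cURL example does.

```python
from typing import Dict


def build_detect_params(
    confidence_threshold: float = 0.35,
    enable_clip: bool = False,
    enable_ocr: bool = True,
    enable_blip: bool = False,
    blip_scope: str = "icons",
    ocr_only: bool = False,
) -> Dict[str, str]:
    """Validate the options client-side and return them as form fields."""
    if not 0.1 <= confidence_threshold <= 0.9:
        raise ValueError("confidence_threshold must be within 0.1-0.9")
    if blip_scope not in ("icons", "all"):
        raise ValueError('blip_scope must be "icons" or "all"')
    if ocr_only and (enable_clip or enable_blip):
        raise ValueError("ocr_only cannot be combined with enable_clip or enable_blip")

    flags = {
        "enable_clip": enable_clip,
        "enable_ocr": enable_ocr,
        "enable_blip": enable_blip,
        "ocr_only": ocr_only,
    }
    params = {
        "confidence_threshold": str(confidence_threshold),
        "blip_scope": blip_scope,
    }
    # Form fields are strings; use "true"/"false" as in the cURL example.
    params.update({name: str(value).lower() for name, value in flags.items()})
    return params
```

The result can be passed straight to `requests.post(..., data=build_detect_params(enable_clip=True))`.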

## 🐛 Bug Fixes in This Version

### 1. Fixed RF-DETR Single-Class Confusion

**Issue:** Code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (single class).

**Fix:** 
- Removed unused `base_class_ids` variable
- Added clear documentation explaining RF-DETR is single-class
- CLIP provides the multi-class classification (6 types)

### 2. Fixed OCR-Only Validation Logic

**Issue:** API incorrectly rejected `enable_ocr=true` when `ocr_only=true`.

**Fix:**
```python
# OLD (WRONG):
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT):
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)
```

## 🏆 Key Architecture Principles

1. **Separation of Concerns**: Detection logic, the API layer, and the UI layer live in separate modules, and dependencies point one way only (UI → API → detection)
2. **No Business Logic in API**: `api/endpoints.py` only handles HTTP, delegates to `detection/` module
3. **Service-Oriented**: Gradio UI is a client of the API (HTTP calls), not direct imports
4. **Single Source of Truth**: All detection logic in `detection/` module
5. **Testability**: Each layer can be tested independently

## 🚦 Performance

Detection performance depends on enabled features:

| Mode | Time | Use Case |
|------|------|----------|
| RF-DETR only | ~25-35s | Just bounding boxes |
| RF-DETR + OCR | ~30-40s | Text extraction |
| RF-DETR + CLIP + OCR | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis |

*Times are approximate and depend on image size and hardware (CPU vs GPU).*

## 🤗 Deploying to Hugging Face Spaces

### Quick Deploy

1. **Create a new Space** on Hugging Face
   - Choose "Gradio" as SDK
   - Select hardware (CPU or GPU)

2. **Upload these files:**
   ```
   app.py              # Unified entry point (API + UI)
   app_api.py          # API server (launched by app.py)
   requirements.txt    # Dependencies
   detection/          # Detection modules
   api/                # API endpoints
   ui/                 # UI components
   model.pth           # Model weights
   README.md           # Documentation
   ```

3. **Space will auto-deploy** - First run takes 5-10 minutes (model download)

### Unified Architecture

**NEW:** `app.py` now uses the same unified API architecture everywhere:

1. ✅ Starts API server in subprocess
2. ✅ Starts Gradio UI that connects to API
3. ✅ Same code path as local development
4. ✅ Consistent behavior across all environments

**Benefits:**
- Single code path to maintain (no special HF Spaces mode)
- Same API layer everywhere (easier debugging)
- Can scale to separate API/UI servers if needed
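Running the API server in a subprocess and shutting it down cleanly can be sketched with a small context manager. This is an illustration under assumptions, not the contents of `app.py`: `background_process` and `python_cmd` are hypothetical helpers, and the 10-second grace period is arbitrary.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator, List


def python_cmd(script: str) -> List[str]:
    """Build a command that runs `script` with the current interpreter."""
    return [sys.executable, script]


@contextmanager
def background_process(
    cmd: List[str], stop_timeout_s: float = 10.0
) -> Iterator[subprocess.Popen]:
    """Start `cmd` in the background and stop it when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Runs on normal exit and on exceptions such as KeyboardInterrupt (Ctrl+C).
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=stop_timeout_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # Force-stop a process that ignored terminate()
                proc.wait()
```

A launcher would wrap the UI start-up in `with background_process(python_cmd("app_api.py")):` so that the API server never outlives the UI process.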

### 🔌 Accessing HF Space via API

Once deployed, your HF Space automatically exposes an API:

```python
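# Install the Gradio client first (shell command, not Python):
#   pip install gradio_client

# Use your Space
from gradio_client import Client

client = Client("YOUR_USERNAME/cu1-detector")
# Arguments are positional and must match the input order of the Space's Gradio interface
result = client.predict("screenshot.png", 0.35, 2, True, True, False, False, "Only image & button")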

annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")
```

**See:**
- `examples/simple_hf_api_example.py` - Quick start
- `examples/huggingface_api_usage.py` - Full examples (batch, async, etc.)
- [DEPLOYMENT.md](DEPLOYMENT.md) - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)

## 📝 License

See LICENSE file for details.

## 🙏 Acknowledgments

- **RF-DETR**: Roboflow
- **CLIP**: OpenAI
- **BLIP**: Salesforce
- **EasyOCR**: JaidedAI

---

**Questions or issues?** Please open an issue on GitHub.