# CU-1 UI Element Detector
Detect and classify UI elements in screenshots using a multi-model AI pipeline.
## πŸ—οΈ Architecture
CU-1 uses a **service-oriented architecture** with clear separation of concerns:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       APPLICATION LAYER                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        app_api.py        β”‚             app_ui.py              β”‚
β”‚     API Server Entry     β”‚          Gradio UI Entry           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚                            β”‚
             β”‚                            β”‚ HTTP/REST
             β”‚                            β”‚ (requests library)
             β”‚                            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        API LAYER        β”‚    β”‚            UI LAYER             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ api/endpoints.py        β”‚    β”‚ ui/gradio_interface.py          β”‚
β”‚ - Thin HTTP layer       β”‚    β”‚ - Gradio web interface          β”‚
β”‚ - Request validation    β”‚    β”‚ - Calls API via HTTP            β”‚
β”‚ - No business logic     β”‚    β”‚ - Displays results              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
             β”‚ Direct import
             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        DETECTION LAYER                        β”‚
β”‚                       (Business Logic)                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ detection/service.py            β”‚ Main detection service      β”‚
β”‚ detection/ocr_handler.py        β”‚ OCR-only processing         β”‚
β”‚ detection/response_builder.py   β”‚ Response formatting         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Multi-Model Pipeline
CU-1 chains four AI models into a single pipeline:
1. **RF-DETR (Detection Transformer)**
- Detects generic "UI elements" as a **SINGLE CLASS**
- Provides bounding boxes and confidence scores
- Does NOT distinguish between button, input, text, etc.
2. **CLIP (OpenAI)**
- **OPTIONAL** multi-class classification
- Takes RF-DETR detections and classifies them into **6 types**:
* `button` - Buttons, FABs, chips, switches
* `input` - Text fields, search bars
* `text` - Labels, titles, paragraphs
* `image` - Images, icons, avatars
* `list_item` - List items, cards, tiles
* `navigation` - Navigation bars, tabs, menus
3. **EasyOCR**
- Extracts text content from detected regions
- Runs a global OCR pass and merges the results to catch text outside detection boxes
4. **BLIP (Salesforce)**
- **OPTIONAL** visual description generation
- Describes icons and images when text is not present
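Because RF-DETR alone reports a single generic class, the easiest way to see what CLIP adds is to call the REST API (described below) with `enable_clip` toggled. A minimal sketch, assuming the server from the Quick Start is running on port 8000; the exact labels in the output comments are illustrative:

```python
import requests

def class_names(enable_clip: bool) -> set:
    # POST the same screenshot twice to compare single-class vs CLIP output
    with open("screenshot.png", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/detect",
            files={"image": f},
            data={"enable_clip": enable_clip, "enable_ocr": False},
        )
    resp.raise_for_status()
    return {d["class_name"] for d in resp.json()["detections"]}

print(class_names(False))  # one generic class from RF-DETR
print(class_names(True))   # up to 6 CLIP types: button, input, text, ...
```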
## πŸš€ Quick Start
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd CU1X
# Install dependencies
pip install -r requirements.txt
```
### Running the Application
> πŸ“– **NEW:** Architecture unified! All modes now use the API layer for consistency.
> See [START.md](START.md) for detailed guide.
**Option 1: One-Command Launch (Recommended for Testing)**
Automatically starts both API server and Gradio UI:
```bash
python app.py
```
**What happens:**
1. βœ… Starts API server in background (port 8000)
2. βœ… Waits for API to be ready
3. βœ… Starts Gradio UI (port 7860)
4. βœ… Handles clean shutdown with Ctrl+C
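A minimal sketch of that launch sequence (not the literal `app.py` source; the readiness probe against `/docs` and the timeout are illustrative):

```python
import subprocess, sys, time
import requests

# 1. Start the API server in the background
api = subprocess.Popen([sys.executable, "app_api.py"])
try:
    # 2. Poll until the API answers (FastAPI serves /docs by default)
    for _ in range(60):
        try:
            requests.get("http://localhost:8000/docs", timeout=1)
            break
        except requests.RequestException:
            time.sleep(1)
    # 3. Run the Gradio UI in the foreground (blocks until Ctrl+C)
    subprocess.run([sys.executable, "app_ui.py"])
finally:
    # 4. Clean shutdown of the background API server
    api.terminate()
```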
**Access:**
- Gradio UI: http://localhost:7860
- API Docs: http://localhost:8000/docs
---
**Option 2: Manual Launch (2 Terminals)**
For more control and debugging:
```bash
# Terminal 1: Start API server
python app_api.py
# Terminal 2: Start Gradio UI
python app_ui.py
```
**Access:**
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Gradio UI: http://localhost:7860
---
**Option 3: API Only**
For API-only usage (scripts, integrations):
```bash
python app_api.py
```
Then use the REST API programmatically (see examples below).
## πŸ“‘ API Usage
### Python Example
```python
import requests
# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")
```
### cURL Example
```bash
curl -X POST "http://localhost:8000/detect" \
-F "image=@screenshot.png" \
-F "confidence_threshold=0.35" \
-F "enable_clip=true" \
-F "enable_ocr=true"
```
### Response Format
```json
{
"success": true,
"detections": [
{
"box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
"confidence": 0.79,
"class_id": 0,
"class_name": "button",
"text": "Submit",
"description": ""
}
],
"total_detections": 1,
"image_size": {"width": 1080, "height": 1920},
"parameters": {
"confidence_threshold": 0.35,
"enable_clip": true,
"enable_ocr": true,
"enable_blip": false
},
"type_distribution": {"button": 5, "text": 12},
"annotated_image": {
"mime": "image/png",
"base64": "iVBORw0KGgoAAAANSU..."
}
}
```
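The `annotated_image` field is a base64-encoded PNG; decoding it back to a file is a one-liner (a small sketch, reusing the `results` dict from the Python example above):

```python
import base64

# Write the server-rendered annotation overlay to disk
with open("annotated.png", "wb") as f:
    f.write(base64.b64decode(results["annotated_image"]["base64"]))
```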
## 🐍 Python Library Usage
You can also use CU-1 as a Python library:
```python
from detection.service import DetectionService
# Initialize detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")
```
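If you need the same per-type counts the API returns as `type_distribution`, they are easy to recompute from the detections list (a sketch using the structures shown above):

```python
from collections import Counter

# Count detections per element type, mirroring `type_distribution`
type_counts = Counter(d["class_name"] for d in results["detections"])
print(dict(type_counts))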
## 🎯 Detection Modes
### 1. Full Detection Mode (Default)
Uses RF-DETR to detect elements, optionally classifies with CLIP, extracts text with OCR.
```python
data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}
```
### 2. OCR-Only Mode
Bypasses RF-DETR and runs OCR directly across the entire image.
```python
data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}
```
### 3. Visual Description Mode
Generates descriptions for icons using BLIP.
```python
data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}
```
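All three modes go through the same endpoint, so a single helper can dispatch between them. A minimal sketch against the REST API; `API_URL` and the helper name are illustrative:

```python
import requests

API_URL = "http://localhost:8000"  # adjust for your deployment

def detect(image_path: str, **params) -> dict:
    """POST an image to /detect with the given mode parameters."""
    with open(image_path, "rb") as f:
        resp = requests.post(f"{API_URL}/detect", files={"image": f}, data=params)
    resp.raise_for_status()
    return resp.json()

# 1. Full detection (default)
full = detect("screenshot.png", confidence_threshold=0.35,
              enable_clip=True, enable_ocr=True)

# 2. OCR-only (CLIP and BLIP must stay disabled)
ocr = detect("screenshot.png", ocr_only=True,
             enable_clip=False, enable_blip=False)

# 3. Visual descriptions for icons
blip = detect("screenshot.png", enable_clip=True, enable_ocr=True,
              enable_blip=True, blip_scope="icons")
```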
## πŸ“ Project Structure
```
CU1X/
β”œβ”€β”€ app.py                  # Unified entry point (starts API + UI)
β”œβ”€β”€ app_api.py              # API server entry point
β”œβ”€β”€ app_ui.py               # Gradio UI entry point
β”œβ”€β”€ detection/              # Business logic layer
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ service.py          # Main DetectionService
β”‚   β”œβ”€β”€ ocr_handler.py      # OCR-only processing
β”‚   └── response_builder.py # Response formatting
β”œβ”€β”€ api/                    # HTTP layer (thin)
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── endpoints.py        # FastAPI endpoints
β”œβ”€β”€ ui/                     # UI layer
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── gradio_interface.py # Gradio interface (API client)
β”œβ”€β”€ rfdetr/                 # RF-DETR implementation
β”œβ”€β”€ model.pth               # Trained model weights
β”œβ”€β”€ requirements.txt        # Python dependencies
└── README.md
```
## βš™οΈ Configuration
### Environment Variables
**API Server:**
- No configuration needed (runs on port 8000)
**Gradio UI:**
- `CU1_API_URL`: API endpoint (default: `http://localhost:8000`)
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `GRADIO_SHARE`: Enable Gradio sharing (default: `false`)
Example:
```bash
export CU1_API_URL=http://your-api-server:8000
python app_ui.py
```
## πŸ” Detection Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `confidence_threshold` | float | 0.35 | Detection confidence (0.1-0.9) |
| `enable_clip` | bool | false | Classify element types |
| `enable_ocr` | bool | true | Extract text content |
| `enable_blip` | bool | false | Generate visual descriptions |
| `blip_scope` | str | "icons" | "icons" or "all" |
| `ocr_only` | bool | false | Skip detection, OCR only |
## πŸ› Bug Fixes in This Version
### 1. Fixed RF-DETR Single-Class Confusion
**Issue:** Code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (single class).
**Fix:**
- Removed unused `base_class_ids` variable
- Added clear documentation explaining RF-DETR is single-class
- CLIP provides the multi-class classification (6 types)
### 2. Fixed OCR-Only Validation Logic
**Issue:** API incorrectly rejected `enable_ocr=true` when `ocr_only=true`.
**Fix:**
```python
# OLD (WRONG):
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT):
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)
```
## πŸ† Key Architecture Principles
1. **Separation of Concerns**: Detection logic, API layer, and UI layer are completely isolated
2. **No Business Logic in API**: `api/endpoints.py` only handles HTTP, delegates to `detection/` module
3. **Service-Oriented**: Gradio UI is a client of the API (HTTP calls), not direct imports
4. **Single Source of Truth**: All detection logic in `detection/` module
5. **Testability**: Each layer can be tested independently
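Point 5 in practice: because `api/endpoints.py` is a thin layer, it can be exercised without a live server. A sketch using FastAPI's `TestClient`, assuming `api/endpoints.py` exposes its FastAPI instance as `app` (check the module for the actual name):

```python
from fastapi.testclient import TestClient

from api.endpoints import app  # assumption: the FastAPI instance is named `app`

client = TestClient(app)

def test_api_is_up():
    # FastAPI serves its OpenAPI schema by default; a cheap smoke test
    assert client.get("/openapi.json").status_code == 200

def test_ocr_only_rejects_clip():
    # Exercises the validation rule from the "Bug Fixes" section above:
    # ocr_only is incompatible with enable_clip / enable_blip
    with open("screenshot.png", "rb") as f:
        resp = client.post("/detect", files={"image": f},
                           data={"ocr_only": True, "enable_clip": True})
    assert resp.status_code >= 400
```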
## 🚦 Performance
Detection performance depends on enabled features:
| Mode | Time | Use Case |
|------|------|----------|
| RF-DETR only | ~25-35s | Just bounding boxes |
| RF-DETR + OCR | ~30-40s | Text extraction |
| RF-DETR + CLIP + OCR | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis |
*Times are approximate and depend on image size and hardware (CPU vs GPU).*
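These numbers are easy to reproduce on your own hardware with a stopwatch around the request (a sketch; reuses the running API from the Quick Start):

```python
import time
import requests

start = time.perf_counter()
with open("screenshot.png", "rb") as f:
    requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={"enable_clip": True, "enable_ocr": True},
    )
print(f"RF-DETR + CLIP + OCR took {time.perf_counter() - start:.1f}s")
```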
## πŸ€— Deploying to Hugging Face Spaces
### Quick Deploy
1. **Create a new Space** on Hugging Face
- Choose "Gradio" as SDK
- Select hardware (CPU or GPU)
2. **Upload these files:**
```bash
app.py # Unified entry point (API + UI)
app_api.py # API server (launched by app.py)
requirements.txt # Dependencies
detection/ # Detection modules
api/ # API endpoints
ui/ # UI components
model.pth # Model weights
README.md # Documentation
```
3. **Space will auto-deploy** - First run takes 5-10 minutes (model download)
### Unified Architecture
**NEW:** `app.py` now uses the same unified API architecture everywhere:
1. βœ… Starts API server in subprocess
2. βœ… Starts Gradio UI that connects to API
3. βœ… Same code path as local development
4. βœ… Consistent behavior across all environments
**Benefits:**
- Single code path to maintain (no special HF Spaces mode)
- Same API layer everywhere (easier debugging)
- Can scale to separate API/UI servers if needed
### πŸ”Œ Accessing HF Space via API
Once deployed, your HF Space automatically exposes an API:
```python
# Install the client first: pip install gradio_client
from gradio_client import Client

# Use your Space
client = Client("YOUR_USERNAME/cu1-detector")
result = client.predict("screenshot.png", 0.35, 2, True, True, False, False, "Only image & button")

annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")
```
**See:**
- `examples/simple_hf_api_example.py` - Quick start
- `examples/huggingface_api_usage.py` - Full examples (batch, async, etc.)
- [DEPLOYMENT.md](DEPLOYMENT.md) - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)
## πŸ“ License
See LICENSE file for details.
## πŸ™ Acknowledgments
- **RF-DETR**: Roboflow
- **CLIP**: OpenAI
- **BLIP**: Salesforce
- **EasyOCR**: JaidedAI
---
**Questions or issues?** Please open an issue on GitHub.