# CU-1 UI Element Detector
Detect and classify UI elements in screenshots using a multi-model AI pipeline.
## πŸ—οΈ Architecture
CU-1 uses a **service-oriented architecture** with clear separation of concerns:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       APPLICATION LAYER                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        app_api.py        β”‚             app_ui.py              β”‚
β”‚     API Server Entry     β”‚          Gradio UI Entry           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚                            β”‚
             β”‚                            β”‚ HTTP/REST
             β”‚                            β”‚ (requests library)
             β”‚                            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        API LAYER        β”‚    β”‚            UI LAYER             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ api/endpoints.py        β”‚    β”‚ ui/gradio_interface.py          β”‚
β”‚ - Thin HTTP layer       β”‚    β”‚ - Gradio web interface          β”‚
β”‚ - Request validation    β”‚    β”‚ - Calls API via HTTP            β”‚
β”‚ - No business logic     β”‚    β”‚ - Displays results              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
             β”‚ Direct import
             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        DETECTION LAYER                        β”‚
β”‚                       (Business Logic)                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ detection/service.py            β”‚ Main detection service      β”‚
β”‚ detection/ocr_handler.py        β”‚ OCR-only processing         β”‚
β”‚ detection/response_builder.py   β”‚ Response formatting         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Multi-Model Pipeline
CU-1 chains four AI models into a single pipeline:
1. **RF-DETR (Detection Transformer)**
- Detects generic "UI elements" as a **SINGLE CLASS**
- Provides bounding boxes and confidence scores
- Does NOT distinguish between button, input, text, etc.
2. **CLIP (OpenAI)**
- **OPTIONAL** multi-class classification
- Takes RF-DETR detections and classifies them into **6 types**:
* `button` - Buttons, FABs, chips, switches
* `input` - Text fields, search bars
* `text` - Labels, titles, paragraphs
* `image` - Images, icons, avatars
* `list_item` - List items, cards, tiles
* `navigation` - Navigation bars, tabs, menus
3. **EasyOCR**
- Extracts text content from detected regions
- Runs a global OCR pass and merges the results to catch text outside detection boxes
4. **BLIP (Salesforce)**
- **OPTIONAL** visual description generation
- Describes icons and images when text is not present
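Because RF-DETR alone reports a single generic class, the easiest way to see what CLIP adds is to call the REST API (described below) with `enable_clip` toggled. A minimal sketch, assuming the server from the Quick Start is running on port 8000; the exact labels in the output comments are illustrative:

```python
import requests

def class_names(enable_clip: bool) -> set:
    # POST the same screenshot twice to compare single-class vs CLIP output
    with open("screenshot.png", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/detect",
            files={"image": f},
            data={"enable_clip": enable_clip, "enable_ocr": False},
        )
    resp.raise_for_status()
    return {d["class_name"] for d in resp.json()["detections"]}

print(class_names(False))  # one generic class from RF-DETR
print(class_names(True))   # up to 6 CLIP types: button, input, text, ...
```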
## πŸš€ Quick Start
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd CU1X
# Install dependencies
pip install -r requirements.txt
```
### Running the Application
> πŸ“– **NEW:** Architecture unified! All modes now use the API layer for consistency.
> See [START.md](START.md) for detailed guide.
**Option 1: One-Command Launch (Recommended for Testing)**
Automatically starts both API server and Gradio UI:
```bash
python app.py
```
**What happens:**
1. βœ… Starts API server in background (port 8000)
2. βœ… Waits for API to be ready
3. βœ… Starts Gradio UI (port 7860)
4. βœ… Handles clean shutdown with Ctrl+C
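A minimal sketch of that launch sequence (not the literal `app.py` source; the readiness probe against `/docs` and the timeout are illustrative):

```python
import subprocess, sys, time
import requests

# 1. Start the API server in the background
api = subprocess.Popen([sys.executable, "app_api.py"])
try:
    # 2. Poll until the API answers (FastAPI serves /docs by default)
    for _ in range(60):
        try:
            requests.get("http://localhost:8000/docs", timeout=1)
            break
        except requests.RequestException:
            time.sleep(1)
    # 3. Run the Gradio UI in the foreground (blocks until Ctrl+C)
    subprocess.run([sys.executable, "app_ui.py"])
finally:
    # 4. Clean shutdown of the background API server
    api.terminate()
```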
**Access:**
- Gradio UI: http://localhost:7860
- API Docs: http://localhost:8000/docs
---
**Option 2: Manual Launch (2 Terminals)**
For more control and debugging:
```bash
# Terminal 1: Start API server
python app_api.py
# Terminal 2: Start Gradio UI
python app_ui.py
```
**Access:**
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Gradio UI: http://localhost:7860
---
**Option 3: API Only**
For API-only usage (scripts, integrations):
```bash
python app_api.py
```
Then use the REST API programmatically (see examples below).
## πŸ“‘ API Usage
### Python Example
```python
import requests
# Detect UI elements
with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={
            "confidence_threshold": 0.35,
            "enable_clip": True,
            "enable_ocr": True,
            "enable_blip": False
        }
    )

results = response.json()
print(f"Found {results['total_detections']} elements")

for detection in results['detections']:
    print(f"- {detection['class_name']}: {detection.get('text', 'N/A')}")
```
### cURL Example
```bash
curl -X POST "http://localhost:8000/detect" \
-F "image=@screenshot.png" \
-F "confidence_threshold=0.35" \
-F "enable_clip=true" \
-F "enable_ocr=true"
```
### Response Format
```json
{
"success": true,
"detections": [
{
"box": {"x1": 50, "y1": 100, "x2": 200, "y2": 150},
"confidence": 0.79,
"class_id": 0,
"class_name": "button",
"text": "Submit",
"description": ""
}
],
"total_detections": 1,
"image_size": {"width": 1080, "height": 1920},
"parameters": {
"confidence_threshold": 0.35,
"enable_clip": true,
"enable_ocr": true,
"enable_blip": false
},
"type_distribution": {"button": 5, "text": 12},
"annotated_image": {
"mime": "image/png",
"base64": "iVBORw0KGgoAAAANSU..."
}
}
```
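The `annotated_image` field is a base64-encoded PNG; decoding it back to a file is a one-liner (a small sketch, reusing the `results` dict from the Python example above):

```python
import base64

# Write the server-rendered annotation overlay to disk
with open("annotated.png", "wb") as f:
    f.write(base64.b64decode(results["annotated_image"]["base64"]))
```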
## 🐍 Python Library Usage
You can also use CU-1 as a Python library:
```python
from detection.service import DetectionService
# Initialize detector
detector = DetectionService(
    enable_clip=True,
    enable_ocr=True,
    enable_blip=False
)

# Analyze image
results = detector.analyze(
    "screenshot.png",
    confidence_threshold=0.35,
    use_clip=True,
    use_blip=False
)

# Access detections
for detection in results['detections']:
    box = detection['box']
    print(f"{detection['class_name']}: {detection['text']}")
    print(f"  Location: ({box['x1']}, {box['y1']}) to ({box['x2']}, {box['y2']})")
```
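If you need the same per-type counts the API returns as `type_distribution`, they are easy to recompute from the detections list (a sketch using the structures shown above):

```python
from collections import Counter

# Count detections per element type, mirroring `type_distribution`
type_counts = Counter(d["class_name"] for d in results["detections"])
print(dict(type_counts))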
## 🎯 Detection Modes
### 1. Full Detection Mode (Default)
Uses RF-DETR to detect elements, optionally classifies with CLIP, extracts text with OCR.
```python
data = {
    "confidence_threshold": 0.35,
    "enable_clip": True,   # Classify element types
    "enable_ocr": True,    # Extract text
    "enable_blip": False
}
```
### 2. OCR-Only Mode
Bypasses RF-DETR and runs OCR directly across the entire image.
```python
data = {
    "ocr_only": True,
    "enable_clip": False,  # Must be false
    "enable_blip": False   # Must be false
}
```
### 3. Visual Description Mode
Generates descriptions for icons using BLIP.
```python
data = {
    "enable_clip": True,
    "enable_ocr": True,
    "enable_blip": True,
    "blip_scope": "icons"  # or "all"
}
```
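All three modes go through the same endpoint, so a single helper can dispatch between them. A minimal sketch against the REST API; `API_URL` and the helper name are illustrative:

```python
import requests

API_URL = "http://localhost:8000"  # adjust for your deployment

def detect(image_path: str, **params) -> dict:
    """POST an image to /detect with the given mode parameters."""
    with open(image_path, "rb") as f:
        resp = requests.post(f"{API_URL}/detect", files={"image": f}, data=params)
    resp.raise_for_status()
    return resp.json()

# 1. Full detection (default)
full = detect("screenshot.png", confidence_threshold=0.35,
              enable_clip=True, enable_ocr=True)

# 2. OCR-only (CLIP and BLIP must stay disabled)
ocr = detect("screenshot.png", ocr_only=True,
             enable_clip=False, enable_blip=False)

# 3. Visual descriptions for icons
blip = detect("screenshot.png", enable_clip=True, enable_ocr=True,
              enable_blip=True, blip_scope="icons")
```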
## πŸ“ Project Structure
```
CU1X/
β”œβ”€β”€ app.py                  # Unified entry point (starts API + UI)
β”œβ”€β”€ app_api.py              # API server entry point
β”œβ”€β”€ app_ui.py               # Gradio UI entry point
β”œβ”€β”€ detection/              # Business logic layer
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ service.py          # Main DetectionService
β”‚   β”œβ”€β”€ ocr_handler.py      # OCR-only processing
β”‚   └── response_builder.py # Response formatting
β”œβ”€β”€ api/                    # HTTP layer (thin)
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── endpoints.py        # FastAPI endpoints
β”œβ”€β”€ ui/                     # UI layer
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── gradio_interface.py # Gradio interface (API client)
β”œβ”€β”€ rfdetr/                 # RF-DETR implementation
β”œβ”€β”€ model.pth               # Trained model weights
β”œβ”€β”€ requirements.txt        # Python dependencies
└── README.md
```
## βš™οΈ Configuration
### Environment Variables
**API Server:**
- No configuration needed (runs on port 8000)
**Gradio UI:**
- `CU1_API_URL`: API endpoint (default: `http://localhost:8000`)
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `GRADIO_SHARE`: Enable Gradio sharing (default: `false`)
Example:
```bash
export CU1_API_URL=http://your-api-server:8000
python app_ui.py
```
## πŸ” Detection Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `confidence_threshold` | float | 0.35 | Detection confidence (0.1-0.9) |
| `enable_clip` | bool | false | Classify element types |
| `enable_ocr` | bool | true | Extract text content |
| `enable_blip` | bool | false | Generate visual descriptions |
| `blip_scope` | str | "icons" | "icons" or "all" |
| `ocr_only` | bool | false | Skip detection, OCR only |
## πŸ› Bug Fixes in This Version
### 1. Fixed RF-DETR Single-Class Confusion
**Issue:** Code suggested RF-DETR did multi-class detection, but it only detects generic "UI elements" (single class).
**Fix:**
- Removed unused `base_class_ids` variable
- Added clear documentation explaining RF-DETR is single-class
- CLIP provides the multi-class classification (6 types)
### 2. Fixed OCR-Only Validation Logic
**Issue:** API incorrectly rejected `enable_ocr=true` when `ocr_only=true`.
**Fix:**
```python
# OLD (WRONG):
if ocr_only and (enable_clip or enable_blip or enable_ocr):
    raise HTTPException(...)

# NEW (CORRECT):
if ocr_only and (enable_clip or enable_blip):
    raise HTTPException(...)
```
## πŸ† Key Architecture Principles
1. **Separation of Concerns**: Detection logic, API layer, and UI layer are completely isolated
2. **No Business Logic in API**: `api/endpoints.py` only handles HTTP, delegates to `detection/` module
3. **Service-Oriented**: Gradio UI is a client of the API (HTTP calls), not direct imports
4. **Single Source of Truth**: All detection logic in `detection/` module
5. **Testability**: Each layer can be tested independently
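Point 5 in practice: because `api/endpoints.py` is a thin layer, it can be exercised without a live server. A sketch using FastAPI's `TestClient`, assuming `api/endpoints.py` exposes its FastAPI instance as `app` (check the module for the actual name):

```python
from fastapi.testclient import TestClient

from api.endpoints import app  # assumption: the FastAPI instance is named `app`

client = TestClient(app)

def test_api_is_up():
    # FastAPI serves its OpenAPI schema by default; a cheap smoke test
    assert client.get("/openapi.json").status_code == 200

def test_ocr_only_rejects_clip():
    # Exercises the validation rule from the "Bug Fixes" section above:
    # ocr_only is incompatible with enable_clip / enable_blip
    with open("screenshot.png", "rb") as f:
        resp = client.post("/detect", files={"image": f},
                           data={"ocr_only": True, "enable_clip": True})
    assert resp.status_code >= 400
```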
## 🚦 Performance
Detection performance depends on enabled features:
| Mode | Time | Use Case |
|------|------|----------|
| RF-DETR only | ~25-35s | Just bounding boxes |
| RF-DETR + OCR | ~30-40s | Text extraction |
| RF-DETR + CLIP + OCR | ~50-60s | Full classification + text |
| RF-DETR + CLIP + OCR + BLIP | ~70-90s | Complete analysis |
*Times are approximate and depend on image size and hardware (CPU vs GPU).*
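These numbers are easy to reproduce on your own hardware with a stopwatch around the request (a sketch; reuses the running API from the Quick Start):

```python
import time
import requests

start = time.perf_counter()
with open("screenshot.png", "rb") as f:
    requests.post(
        "http://localhost:8000/detect",
        files={"image": f},
        data={"enable_clip": True, "enable_ocr": True},
    )
print(f"RF-DETR + CLIP + OCR took {time.perf_counter() - start:.1f}s")
```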
## πŸ€— Deploying to Hugging Face Spaces
### Quick Deploy
1. **Create a new Space** on Hugging Face
- Choose "Gradio" as SDK
- Select hardware (CPU or GPU)
2. **Upload these files:**
```bash
app.py # Unified entry point (API + UI)
app_api.py # API server (launched by app.py)
requirements.txt # Dependencies
detection/ # Detection modules
api/ # API endpoints
ui/ # UI components
model.pth # Model weights
README.md # Documentation
```
3. **Space will auto-deploy** - First run takes 5-10 minutes (model download)
### Unified Architecture
**NEW:** `app.py` now uses the same unified API architecture everywhere:
1. βœ… Starts API server in subprocess
2. βœ… Starts Gradio UI that connects to API
3. βœ… Same code path as local development
4. βœ… Consistent behavior across all environments
**Benefits:**
- Single code path to maintain (no special HF Spaces mode)
- Same API layer everywhere (easier debugging)
- Can scale to separate API/UI servers if needed
### πŸ”Œ Accessing HF Space via API
Once deployed, your HF Space automatically exposes an API:
```python
# Install the client first: pip install gradio_client
from gradio_client import Client

# Use your Space
client = Client("YOUR_USERNAME/cu1-detector")
result = client.predict("screenshot.png", 0.35, 2, True, True, False, False, "Only image & button")

annotated_image, summary, detections = result
print(f"Found {detections['total_detections']} elements!")
```
**See:**
- `examples/simple_hf_api_example.py` - Quick start
- `examples/huggingface_api_usage.py` - Full examples (batch, async, etc.)
- [DEPLOYMENT.md](DEPLOYMENT.md) - Complete deployment guide (Docker, AWS, GCP, Azure, etc.)
## πŸ“ License
See LICENSE file for details.
## πŸ™ Acknowledgments
- **RF-DETR**: Roboflow
- **CLIP**: OpenAI
- **BLIP**: Salesforce
- **EasyOCR**: JaidedAI
---
**Questions or issues?** Please open an issue on GitHub.