Spaces: OsamaBinLikhon/computer-using-agent

Build error


OsamaBinLikhon committed on Dec 13, 2025

Commit ed0a3fa (verified) · 1 parent: 1c5ecfa

Clean deployment: Computer-Using Agent


Files changed (5)

.gitignore +3 -0
Dockerfile +69 -0
README.md +142 -8
computer_agent.py +487 -0
requirements.txt +21 -0

.gitignore ADDED

	@@ -0,0 +1,3 @@

+*.log
+__pycache__/
+*.pyc

Dockerfile ADDED

	@@ -0,0 +1,69 @@

+FROM huggingface/transformers-pytorch-gpu:latest
+# Install system dependencies for GUI and browser automation
+RUN apt-get update && apt-get install -y \
+    # GUI and display libraries
+    libgtk-3-0 \
+    libx11-6 \
+    libxext6 \
+    libxrender1 \
+    libxtst6 \
+    libxrandr2 \
+    libasound2 \
+    libpangocairo-1.0-0 \
+    libatk1.0-0 \
+    libatk-bridge2.0-0 \
+    libcups2 \
+    libdrm2 \
+    libxkbcommon0 \
+    libxcomposite1 \
+    libxdamage1 \
+    libgbm1 \
+    libxss1 \
+    # Browser dependencies
+    wget \
+    gnupg \
+    unzip \
+    curl \
+    # Development tools
+    build-essential \
+    python3-dev \
+    && rm -rf /var/lib/apt/lists/*
+# Set environment variables for GUI
+ENV DISPLAY=:99
+ENV QT_X11_NO_MITSHM=1
+ENV XDG_RUNTIME_DIR=/tmp/runtime-root
+ENV PYTHONPATH=/workspace
+# Create necessary directories
+RUN mkdir -p /workspace /tmp/runtime-root
+# Set working directory
+WORKDIR /workspace
+# Copy requirements first for better caching
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+# Install Playwright browsers
+RUN playwright install chromium
+# Copy application files
+COPY . .
+# Expose port for Gradio
+EXPOSE 7860
+# Set environment variable for Gradio
+ENV GRADIO_SERVER_PORT=7860
+ENV GRADIO_SERVER_NAME=0.0.0.0
+# Health check
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/ || exit 1
+# Run the application
+CMD ["python", "computer_agent.py"]
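
Review note on the `HEALTHCHECK` above: it polls `http://localhost:7860/` with `curl -f`, which fails on HTTP error statuses, using three retries after a 5-second start period. The same readiness check can be sketched in Python for local testing. This is a minimal sketch under stated assumptions: `wait_until_ready` and its parameters are hypothetical names, not part of the commit, and the defaults only loosely mirror the Dockerfile values.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, retries: int = 3, interval: float = 1.0,
                     timeout: float = 10.0) -> bool:
    """Return True as soon as `url` answers with a 2xx/3xx status, else False.

    Hypothetical helper: roughly what `curl -f` plus `--retries` does.
    """
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if 200 <= response.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status:
            # treat all of them as "not ready yet".
            pass
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

Calling `wait_until_ready("http://localhost:7860/")` after starting the container would report whether the Gradio server came up. If the app needs longer than 5 seconds to start, a larger `--start-period` may be worth considering.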

README.md CHANGED

@@ -1,10 +1,144 @@
----
-title: Computer Using Agent
-emoji: 👀
-colorFrom: blue
-colorTo: green
-sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Computer-Using Agent
+🤖 **AI-powered browser automation system similar to OpenAI's Operator**
+This Hugging Face Space provides a comprehensive computer-using agent that can interact with web browsers, take screenshots, perform actions, and automate various tasks through a user-friendly Gradio interface.
+## Features
+### 🌐 Browser Automation
+- **Web Navigation**: Navigate to any URL with intelligent loading detection
+- **Screenshot Capture**: Take high-quality screenshots of web pages
+- **Element Interaction**: Click on elements, type text, and interact with forms
+- **Page Analysis**: Extract content, links, forms, and page structure
+### 🎯 Advanced Controls
+- **CSS Selector Support**: Target specific elements using CSS selectors
+- **Scrolling**: Navigate up and down pages with customizable scroll amounts
+- **Content Extraction**: Get page text, HTML, and structural information
+- **Action History**: Track all actions performed by the agent
+### 🔧 Technical Features
+- **Headless Browser**: Runs efficiently in server environments
+- **Multi-tab Support**: Handle multiple browser contexts
+- **Error Handling**: Robust error recovery and logging
+- **Real-time Status**: Monitor agent status and performance
+## 🚀 Usage
+### Basic Navigation
+1. Click "Initialize Browser" to start the browser
+2. Enter a URL in the URL field
+3. Click "Navigate" to visit the page
+4. Use "Take Screenshot" to capture the current page
+### Element Interaction
+1. Use browser dev tools to find CSS selectors
+2. Enter the selector in the "CSS Selector" field
+3. Click "Click Element" to interact with the element
+4. Use "Type Text" to input text into form fields
+### Page Content Analysis
+1. Navigate to any web page
+2. Click "Get Page Content" to extract:
+   - Page title and text content
+   - Links and navigation elements
+   - Form structures and inputs
+   - Page HTML structure
+## 🛠️ API Integration
+The agent can be integrated with various AI models from Hugging Face:
+```python
+from huggingface_hub import hf_hub_download
+# Download the model weights; hf_hub_download returns the local path of the cached file
+model_path = hf_hub_download(repo_id="microsoft/DialoGPT-medium", filename="pytorch_model.bin")
+```
+### Supported Model Types
+- **Language Models**: For natural language processing
+- **Vision Models**: For image analysis and understanding
+- **Multimodal Models**: For combined text and image processing
+## 🏗️ Architecture
+### Core Components
+- **ComputerUsingAgent**: Main agent class managing browser operations
+- **Gradio Interface**: User-friendly web interface
+- **Playwright Integration**: Browser automation engine
+- **State Management**: Track agent status and actions
+### Browser Configuration
+- **Chromium**: Primary browser engine
+- **Headless Mode**: Server-optimized operation
+- **Custom User Agent**: Enhanced compatibility
+- **Web Security Disabled**: Chromium is launched with `--no-sandbox` and `--disable-web-security` for automation purposes
+## 🔧 Configuration
+### Environment Variables
+- `GRADIO_SERVER_PORT`: Port for Gradio interface (default: 7860)
+- `GRADIO_SERVER_NAME`: Server host (default: 0.0.0.0)
+- `DISPLAY`: Display for GUI operations
+### Browser Settings
+- **Viewport**: 1280x720 (configurable)
+- **User Agent**: Custom Windows Chrome user agent
+- **Security**: Disabled for automation compatibility
+## 📋 Requirements
+### System Dependencies
+- Python 3.11+ (required by the pinned `numpy==2.3.5`)
+- Chromium browser
+- X11 display libraries
+- System libraries for GUI support
+### Python Dependencies
+- `gradio==6.1.0`: Web interface framework
+- `playwright==1.52.0`: Browser automation
+- `opencv-python==4.11.0.86`: Image processing
+- `pillow==12.0.0`: Image handling
+- `pyautogui==0.9.54`: GUI automation
+## 🚨 Important Notes
+### Security Considerations
+- Browser security features are disabled for automation
+- Only use in trusted environments
+- Monitor for malicious content when browsing
+### Usage Guidelines
+- Respect website terms of service
+- Implement rate limiting for production use
+- Add CAPTCHA handling for automated interactions
+- Monitor resource usage for large-scale operations
+## 🔮 Future Enhancements
+### Planned Features
+- **Multi-modal AI Integration**: Combine with vision models
+- **Computer Vision**: Advanced element detection
+- **Task Planning**: Automated workflow execution
+- **API Integration**: Connect with external services
+- **Mobile Support**: Touch and mobile interaction
+### AI Model Integration
+- **GPT Models**: For natural language task understanding
+- **CLIP**: For image-based element recognition
+- **YOLO**: For object detection and interaction
+- **BLIP**: For advanced image captioning
+## 📞 Support
+For issues and feature requests, please create an issue in the repository or contact the development team.
+## 📄 License
+This project is licensed under the MIT License - see the LICENSE file for details.
 ---
+**Built with ❤️ using Hugging Face Spaces, Gradio, and Playwright**
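
Review note on the "Basic Navigation" steps: the URL field is passed to `navigate_to_url`, which only prepends `https://` when the scheme is missing. A slightly stricter normalization step could look like the sketch below. It is a minimal sketch, not the commit's behavior: `normalize_url` and `ALLOWED_SCHEMES` are hypothetical names, and rejecting non-HTTP schemes is an assumption about what the agent should allow.

```python
from urllib.parse import urlsplit

# Assumption: the agent should only browse http(s) pages.
ALLOWED_SCHEMES = ("http", "https")


def normalize_url(raw: str) -> str:
    """Trim the input, default to https://, and reject other schemes."""
    url = raw.strip()
    if not url:
        raise ValueError("URL is required")
    if "://" not in url:
        # Same default the commit applies: assume https when no scheme is given.
        url = "https://" + url
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Unsupported scheme: {parts.scheme}")
    if not parts.netloc:
        raise ValueError("URL has no host")
    return url


print(normalize_url("example.com"))  # prints https://example.com
```

With this in place, an input such as `file:///etc/passwd` raises `ValueError` instead of being handed to the browser.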

computer_agent.py ADDED

	@@ -0,0 +1,487 @@

+import asyncio
+import json
+import base64
+import io
+import os
+import time
+import threading
+from typing import Dict, List, Optional, Any
+from dataclasses import dataclass
+from pathlib import Path
+import logging
+import gradio as gr
+import cv2
+import numpy as np
+from PIL import Image
+from playwright.async_api import async_playwright, Browser, BrowserContext, Page
+import requests
+from huggingface_hub import hf_hub_download, login
+# Optional imports for GUI automation
+PYAUTOGUI_AVAILABLE = False
+try:
+    # Set DISPLAY before importing pyautogui
+    if 'DISPLAY' not in os.environ:
+        os.environ['DISPLAY'] = ':99'
+    import pyautogui
+    PYAUTOGUI_AVAILABLE = True
+except ImportError:
+    print("Warning: pyautogui not available, GUI automation disabled")
+except Exception as e:
+    print(f"Warning: pyautogui import failed: {e}, GUI automation disabled")
+    PYAUTOGUI_AVAILABLE = False
+# Setup logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+@dataclass
+class AgentState:
+    """State management for the computer agent"""
+    browser: Optional[Browser] = None
+    context: Optional[BrowserContext] = None
+    page: Optional[Page] = None
+    is_running: bool = False
+    screenshot_count: int = 0
+    action_history: Optional[List[str]] = None
+    def __post_init__(self):
+        if self.action_history is None:
+            self.action_history = []
+class ComputerUsingAgent:
+    """Computer-Using Agent similar to OpenAI's Operator"""
+    def __init__(self):
+        self.state = AgentState()
+        self.setup_logging()
+    def setup_logging(self):
+        """Setup logging configuration"""
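+        # force=True replaces the handlers installed by the module-level basicConfig call;
+        # without it this second call is a no-op and agent.log is never written
+        logging.basicConfig(
+            level=logging.INFO,
+            format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+            handlers=[
+                logging.FileHandler('agent.log'),
+                logging.StreamHandler()
+            ],
+            force=True,
+        )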
+    async def initialize_browser(self, headless: bool = True, viewport_width: int = 1280, viewport_height: int = 720):
+        """Initialize browser with specified settings"""
+        try:
+            logger.info("Initializing browser...")
+            playwright = await async_playwright().start()
+            # Launch browser with enhanced settings
+            self.state.browser = await playwright.chromium.launch(
+                headless=headless,
+                args=[
+                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-web-security",
+                    "--disable-features=VizDisplayCompositor",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--disable-background-timer-throttling",
+                    "--disable-popup-blocking",
+                    "--disable-backgrounding-occluded-windows",
+                    "--disable-renderer-backgrounding",
+                    "--disable-window-activation",
+                    "--disable-focus-on-load",
+                    "--no-first-run",
+                    "--no-default-browser-check",
+                    "--window-position=0,0",
+                ]
+            )
+            # Create a fresh (non-persistent) browser context
+            self.state.context = await self.state.browser.new_context(
+                viewport={'width': viewport_width, 'height': viewport_height},
+                user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
+            )
+            # Create a new page
+            self.state.page = await self.state.context.new_page()
+            self.state.is_running = True
+            logger.info("Browser initialized successfully")
+            return True
+        except Exception as e:
+            logger.error(f"Failed to initialize browser: {str(e)}")
+            return False
+    async def navigate_to_url(self, url: str) -> Dict[str, Any]:
+        """Navigate to a URL and return status"""
+        if not self.state.page:
+            return {"success": False, "message": "Browser not initialized"}
+        try:
+            # Add protocol if missing
+            if not url.startswith(('http://', 'https://')):
+                url = 'https://' + url
+            await self.state.page.goto(url, wait_until='networkidle', timeout=30000)
+            await self.state.page.wait_for_timeout(2000)  # Wait for page to fully load
+            # Get page title and URL
+            title = await self.state.page.title()
+            current_url = self.state.page.url
+            self.state.action_history.append(f"Navigated to: {url}")
+            return {
+                "success": True,
+                "message": f"Successfully navigated to {url}",
+                "title": title,
+                "current_url": current_url
+            }
+        except Exception as e:
+            logger.error(f"Failed to navigate to {url}: {str(e)}")
+            return {"success": False, "message": f"Failed to navigate: {str(e)}"}
+    async def take_screenshot(self) -> str:
+        """Take a screenshot and return base64 encoded image"""
+        if not self.state.page:
+            return ""
+        try:
+            # Take screenshot
+            screenshot_bytes = await self.state.page.screenshot(type='png')
+            # Convert to base64
+            base64_image = base64.b64encode(screenshot_bytes).decode('utf-8')
+            self.state.screenshot_count += 1
+            self.state.action_history.append(f"Screenshot taken (Total: {self.state.screenshot_count})")
+            return base64_image
+        except Exception as e:
+            logger.error(f"Failed to take screenshot: {str(e)}")
+            return ""
+    async def click_element(self, selector: str) -> Dict[str, Any]:
+        """Click on an element using CSS selector"""
+        if not self.state.page:
+            return {"success": False, "message": "Browser not initialized"}
+        try:
+            # Wait for element and click
+            await self.state.page.wait_for_selector(selector, timeout=10000)
+            await self.state.page.click(selector)
+            self.state.action_history.append(f"Clicked element: {selector}")
+            return {"success": True, "message": f"Successfully clicked element: {selector}"}
+        except Exception as e:
+            logger.error(f"Failed to click element {selector}: {str(e)}")
+            return {"success": False, "message": f"Failed to click element: {str(e)}"}
+    async def type_text(self, selector: str, text: str) -> Dict[str, Any]:
+        """Type text into an input field"""
+        if not self.state.page:
+            return {"success": False, "message": "Browser not initialized"}
+        try:
+            # Wait for element, clear it, and type
+            await self.state.page.wait_for_selector(selector, timeout=10000)
+            await self.state.page.click(selector)  # Focus the element
+            await self.state.page.keyboard.press('Control+a')  # Select all
+            await self.state.page.keyboard.type(text)
+            self.state.action_history.append(f"Typed text into {selector}: {text[:50]}...")
+            return {"success": True, "message": f"Successfully typed text into {selector}"}
+        except Exception as e:
+            logger.error(f"Failed to type text into {selector}: {str(e)}")
+            return {"success": False, "message": f"Failed to type text: {str(e)}"}
+    async def scroll_page(self, direction: str = "down", amount: int = 500) -> Dict[str, Any]:
+        """Scroll the page"""
+        if not self.state.page:
+            return {"success": False, "message": "Browser not initialized"}
+        try:
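+            direction = direction.lower()
+            if direction not in ("down", "up"):
+                return {"success": False, "message": f"Unknown scroll direction: {direction}"}
+            # Pass the offset as an argument rather than formatting it into the script
+            delta = amount if direction == "down" else -amount
+            await self.state.page.evaluate("(dy) => window.scrollBy(0, dy)", delta)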
+            self.state.action_history.append(f"Scrolled {direction} by {amount}px")
+            return {"success": True, "message": f"Successfully scrolled {direction}"}
+        except Exception as e:
+            logger.error(f"Failed to scroll: {str(e)}")
+            return {"success": False, "message": f"Failed to scroll: {str(e)}"}
+    async def get_page_content(self) -> Dict[str, Any]:
+        """Get page content including text and structure"""
+        if not self.state.page:
+            return {"success": False, "message": "Browser not initialized"}
+        try:
+            # Get page title
+            title = await self.state.page.title()
+            # Get page text content
+            text_content = await self.state.page.evaluate("document.body.innerText")
+            # Get page HTML (truncated to the first 5000 characters to limit the payload)
+            html_content = (await self.state.page.content())[:5000]
+            # Get links
+            links = await self.state.page.evaluate("""
+                Array.from(document.querySelectorAll('a')).map(a => ({
+                    href: a.href,
+                    text: a.textContent.trim(),
+                    title: a.title
+                })).slice(0, 20)
+            """)
+            # Get form elements
+            forms = await self.state.page.evaluate("""
+                Array.from(document.querySelectorAll('form')).map(form => ({
+                    action: form.action,
+                    method: form.method,
+                    inputs: Array.from(form.querySelectorAll('input, textarea, select')).map(input => ({
+                        type: input.type,
+                        name: input.name,
+                        placeholder: input.placeholder,
+                        required: input.required
+                    }))
+                }))
+            """)
+            self.state.action_history.append("Extracted page content")
+            return {
+                "success": True,
+                "title": title,
+                "text_content": text_content[:2000],  # Limit text content
+                "html_content": html_content,
+                "links": links,
+                "forms": forms
+            }
+        except Exception as e:
+            logger.error(f"Failed to get page content: {str(e)}")
+            return {"success": False, "message": f"Failed to get page content: {str(e)}"}
+    async def close_browser(self):
+        """Close browser and cleanup"""
+        try:
+            if self.state.page:
+                await self.state.page.close()
+            if self.state.context:
+                await self.state.context.close()
+            if self.state.browser:
+                await self.state.browser.close()
+            self.state.is_running = False
+            logger.info("Browser closed successfully")
+        except Exception as e:
+            logger.error(f"Error closing browser: {str(e)}")
+    def get_status(self) -> Dict[str, Any]:
+        """Get current agent status"""
+        return {
+            "is_running": self.state.is_running,
+            "browser_initialized": self.state.browser is not None,
+            "page_loaded": self.state.page is not None,
+            "screenshot_count": self.state.screenshot_count,
+            "action_history": self.state.action_history[-10:],  # Last 10 actions
+            "current_url": self.state.page.url if self.state.page else "None"
+        }
+# Global agent instance
+agent = ComputerUsingAgent()
+def process_action(action_type: str, **kwargs):
+    """Process agent actions"""
+    try:
+        if action_type == "initialize":
+            headless = kwargs.get("headless", True)
+            result = asyncio.run(agent.initialize_browser(headless=headless))
+            return "Browser initialized successfully" if result else "Failed to initialize browser"
+        elif action_type == "navigate":
+            url = kwargs.get("url", "")
+            if not url:
+                return "URL is required"
+            result = asyncio.run(agent.navigate_to_url(url))
+            return result["message"]
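+        elif action_type == "screenshot":
+            image_base64 = asyncio.run(agent.take_screenshot())
+            if image_base64:
+                # gr.Image expects an array, PIL image, or file path rather than a base64 string
+                image = Image.open(io.BytesIO(base64.b64decode(image_base64)))
+                return "Screenshot taken successfully", image
+            else:
+                # Return two values so the result matches the two Gradio outputs
+                return "Failed to take screenshot", None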
+        elif action_type == "click":
+            selector = kwargs.get("selector", "")
+            if not selector:
+                return "CSS selector is required"
+            result = asyncio.run(agent.click_element(selector))
+            return result["message"]
+        elif action_type == "type":
+            selector = kwargs.get("selector", "")
+            text = kwargs.get("text", "")
+            if not selector or not text:
+                return "Selector and text are required"
+            result = asyncio.run(agent.type_text(selector, text))
+            return result["message"]
+        elif action_type == "scroll":
+            direction = kwargs.get("direction", "down")
+            amount = kwargs.get("amount", 500)
+            result = asyncio.run(agent.scroll_page(direction, amount))
+            return result["message"]
+        elif action_type == "content":
+            result = asyncio.run(agent.get_page_content())
+            if result["success"]:
+                return f"Page: {result['title']}\n\nContent: {result['text_content'][:500]}..."
+            else:
+                return result["message"]
+        elif action_type == "status":
+            status = agent.get_status()
+            return json.dumps(status, indent=2)
+        elif action_type == "close":
+            asyncio.run(agent.close_browser())
+            return "Browser closed successfully"
+        else:
+            return f"Unknown action: {action_type}"
+    except Exception as e:
+        logger.error(f"Error processing action {action_type}: {str(e)}")
+        return f"Error: {str(e)}"
+def gradio_interface():
+    """Create Gradio interface for the computer agent"""
+    with gr.Blocks(title="Computer-Using Agent", theme=gr.themes.Soft()) as interface:
+        gr.Markdown("# Computer-Using Agent")
+        gr.Markdown("🤖 **AI-powered browser automation similar to OpenAI's Operator**")
+        with gr.Tab("Controls"):
+            with gr.Row():
+                initialize_btn = gr.Button("Initialize Browser", variant="primary")
+                close_btn = gr.Button("Close Browser", variant="secondary")
+                status_btn = gr.Button("Get Status")
+            status_display = gr.Textbox(label="Status", lines=5)
+            with gr.Row():
+                url_input = gr.Textbox(label="URL", placeholder="https://example.com")
+                navigate_btn = gr.Button("Navigate", variant="primary")
+            navigation_status = gr.Textbox(label="Navigation Status")
+        with gr.Tab("Screenshot & Content"):
+            with gr.Row():
+                screenshot_btn = gr.Button("Take Screenshot", variant="primary")
+                content_btn = gr.Button("Get Page Content", variant="secondary")
+            screenshot_output = gr.Image(label="Current Screenshot")
+            content_output = gr.Textbox(label="Page Content", lines=10)
+        with gr.Tab("Interaction"):
+            with gr.Row():
+                selector_input = gr.Textbox(label="CSS Selector", placeholder="#button, .class, element")
+                click_btn = gr.Button("Click Element", variant="primary")
+            with gr.Row():
+                text_input = gr.Textbox(label="Text to Type", placeholder="Enter text here...")
+                type_btn = gr.Button("Type Text", variant="primary")
+            with gr.Row():
+                scroll_direction = gr.Dropdown(["down", "up"], value="down", label="Scroll Direction")
+                scroll_amount = gr.Number(value=500, label="Scroll Amount")
+                scroll_btn = gr.Button("Scroll Page", variant="secondary")
+            interaction_status = gr.Textbox(label="Interaction Status", lines=3)
+        with gr.Tab("Advanced"):
+            action_history = gr.Textbox(label="Action History", lines=8)
+            refresh_history_btn = gr.Button("Refresh History")
+        # Event handlers
+        initialize_btn.click(
+            fn=lambda: process_action("initialize"),
+            outputs=status_display
+        )
+        close_btn.click(
+            fn=lambda: process_action("close"),
+            outputs=status_display
+        )
+        status_btn.click(
+            fn=lambda: process_action("status"),
+            outputs=status_display
+        )
+        navigate_btn.click(
+            fn=lambda url: process_action("navigate", url=url),
+            inputs=url_input,
+            outputs=navigation_status
+        )
+        screenshot_btn.click(
+            fn=lambda: process_action("screenshot"),
+            outputs=[interaction_status, screenshot_output]
+        )
+        content_btn.click(
+            fn=lambda: process_action("content"),
+            outputs=content_output
+        )
+        click_btn.click(
+            fn=lambda selector: process_action("click", selector=selector),
+            inputs=selector_input,
+            outputs=interaction_status
+        )
+        type_btn.click(
+            fn=lambda selector, text: process_action("type", selector=selector, text=text),
+            inputs=[selector_input, text_input],
+            outputs=interaction_status
+        )
+        scroll_btn.click(
+            fn=lambda direction, amount: process_action("scroll", direction=direction, amount=int(amount)),
+            inputs=[scroll_direction, scroll_amount],
+            outputs=interaction_status
+        )
+        refresh_history_btn.click(
+            fn=lambda: process_action("status"),
+            outputs=action_history
+        )
+    return interface
+if __name__ == "__main__":
+    # Create and launch Gradio interface
+    interface = gradio_interface()
+    interface.launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False,
+        debug=True
+    )
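
Review note on `process_action`: every action wraps its coroutine in a separate `asyncio.run(...)` call, and each call creates a new event loop and closes it on return. Playwright's async objects are bound to the loop they were created on, so a browser started by the "initialize" action may not be usable from later actions. One common alternative is to keep a single loop alive in a background thread and submit coroutines to it. The sketch below illustrates that pattern only: `LoopRunner` and `current_loop_id` are hypothetical names that do not appear in the commit, and `current_loop_id` stands in for a real Playwright call.

```python
import asyncio
import threading


class LoopRunner:
    """Hypothetical helper: owns one event loop running in a daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=60):
        """Schedule a coroutine on the shared loop and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self):
        """Stop the loop, wait for the thread to exit, then close the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def current_loop_id():
    # Stand-in for a Playwright call; reports which loop executed it.
    return id(asyncio.get_running_loop())


runner = LoopRunner()
first = runner.run(current_loop_id())
second = runner.run(current_loop_id())
print(first == second)  # both coroutines ran on the same loop
runner.stop()
```

Under this approach, each `asyncio.run(agent.some_method(...))` in `process_action` would become `runner.run(agent.some_method(...))`, so the browser, context, and page all live on one loop for the lifetime of the app.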

requirements.txt ADDED

	@@ -0,0 +1,21 @@

+# Computer-Using Agent Dependencies
+gradio==6.1.0
+playwright==1.52.0
+opencv-python==4.11.0.86
+pillow==12.0.0
+pyautogui==0.9.54
+numpy==2.3.5
+huggingface-hub==1.2.3
+pydantic==2.12.4
+python-multipart==0.0.20
+# Browser automation dependencies
+python3-xlib==0.15
+pyperclip==1.11.0
+pyrect==0.2.0
+pyscreeze==1.0.1
+# Additional utilities
+requests==2.31.0
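+# asyncio ships with the Python standard library; the PyPI package of that name is an obsolete backport and should not be installed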
+aiofiles==24.1.0