OsamaBinLikhon committed on
Commit ede4344 · verified · 1 Parent(s): 13bcdd9

Enhancement: Add VNC desktop environment integration

  1. README.md +145 -120
README.md CHANGED
@@ -1,155 +1,180 @@
  ---
- title: "Computer-Using Agent"
- emoji: "🤖"
- colorFrom: "blue"
- colorTo: "indigo"
  sdk: "docker"
  sdk_version: "3.12.0"
- app_file: "computer_agent.py"
  pinned: false
  ---

- # Computer-Using Agent

- 🤖 **AI-powered browser automation system similar to OpenAI's Operator**

- This Hugging Face Space provides a comprehensive computer-using agent that can interact with web browsers, take screenshots, perform actions, and automate various tasks through a user-friendly Gradio interface.

- ## Features

- ### 🌐 Browser Automation
  - **Web Navigation**: Navigate to any URL with intelligent loading detection
  - **Screenshot Capture**: Take high-quality screenshots of web pages
  - **Element Interaction**: Click on elements, type text, and interact with forms
  - **Page Analysis**: Extract content, links, forms, and page structure
 
- ### 🎯 Advanced Controls
  - **CSS Selector Support**: Target specific elements using CSS selectors
  - **Scrolling**: Navigate up and down pages with customizable scroll amounts
  - **Content Extraction**: Get page text, HTML, and structural information
  - **Action History**: Track all actions performed by the agent
 
- ### 🔧 Technical Features
- - **Headless Browser**: Runs efficiently in server environments
- - **Multi-tab Support**: Handle multiple browser contexts
- - **Error Handling**: Robust error recovery and logging
- - **Real-time Status**: Monitor agent status and performance
-
  ## 🚀 Usage
 
- ### Basic Navigation
- 1. Click "Initialize Browser" to start the browser
- 2. Enter a URL in the URL field
- 3. Click "Navigate" to visit the page
- 4. Use "Take Screenshot" to capture the current page
-
- ### Element Interaction
- 1. Use browser dev tools to find CSS selectors
- 2. Enter the selector in the "CSS Selector" field
- 3. Click "Click Element" to interact with the element
- 4. Use "Type Text" to input text into form fields
-
- ### Page Content Analysis
- 1. Navigate to any web page
- 2. Click "Get Page Content" to extract:
-    - Page title and text content
-    - Links and navigation elements
-    - Form structures and inputs
-    - Page HTML structure
-
- ## 🛠️ API Integration
-
- The agent can be integrated with various AI models from Hugging Face:
-
- ```python
- from huggingface_hub import hf_hub_download
-
- # Load models for enhanced capabilities
- model = hf_hub_download(repo_id="microsoft/DialoGPT-medium", filename="pytorch_model.bin")
- ```
-
- ### Supported Model Types
- - **Language Models**: For natural language processing
- - **Vision Models**: For image analysis and understanding
- - **Multimodal Models**: For combined text and image processing
 
  ## 🏗️ Architecture

- ### Core Components
- - **ComputerUsingAgent**: Main agent class managing browser operations
- - **Gradio Interface**: User-friendly web interface
- - **Playwright Integration**: Browser automation engine
- - **State Management**: Track agent status and actions
-
- ### Browser Configuration
- - **Chromium**: Primary browser engine
- - **Headless Mode**: Server-optimized operation
- - **Custom User Agent**: Enhanced compatibility
- - **Security Disabled**: For automation purposes
-
- ## 🔧 Configuration
-
- ### Environment Variables
- - `GRADIO_SERVER_PORT`: Port for Gradio interface (default: 7860)
- - `GRADIO_SERVER_NAME`: Server host (default: 0.0.0.0)
- - `DISPLAY`: Display for GUI operations
-
- ### Browser Settings
- - **Viewport**: 1280x720 (configurable)
- - **User Agent**: Custom Windows Chrome user agent
- - **Security**: Disabled for automation compatibility
-
- ## 📋 Requirements
-
- ### System Dependencies
- - Python 3.8+
- - Chromium browser
- - X11 display libraries
- - System libraries for GUI support
-
- ### Python Dependencies
- - `gradio==6.1.0`: Web interface framework
- - `playwright==1.52.0`: Browser automation
- - `opencv-python==4.11.0.86`: Image processing
- - `pillow==12.0.0`: Image handling
- - `pyautogui==0.9.54`: GUI automation
 
  ## 🚨 Important Notes

- ### Security Considerations
- - Browser security features are disabled for automation
- - Only use in trusted environments
- - Monitor for malicious content when browsing
-
- ### Usage Guidelines
- - Respect website terms of service
- - Implement rate limiting for production use
- - Add CAPTCHA handling for automated interactions
- - Monitor resource usage for large-scale operations
-
- ## 🔮 Future Enhancements

- ### Planned Features
- - **Multi-modal AI Integration**: Combine with vision models
- - **Computer Vision**: Advanced element detection
- - **Task Planning**: Automated workflow execution
- - **API Integration**: Connect with external services
- - **Mobile Support**: Touch and mobile interaction
 
- ### AI Model Integration
- - **GPT Models**: For natural language task understanding
- - **CLIP**: For image-based element recognition
- - **YOLO**: For object detection and interaction
- - **BLIP**: For advanced image captioning

- ## 📞 Support

- For issues and feature requests, please create an issue in the repository or contact the development team.
-
- ## 📄 License
-
- This project is licensed under the MIT License - see the LICENSE file for details.

  ---

- **Built with ❤️ using Hugging Face Spaces, Gradio, and Playwright**
  ---
+ title: "Enhanced Computer-Using Agent with VNC"
+ emoji: "🖥️"
+ colorFrom: "green"
+ colorTo: "blue"
  sdk: "docker"
  sdk_version: "3.12.0"
+ app_file: "computer_agent_vnc.py"
  pinned: false
  ---

+ # 🖥️ Enhanced Computer-Using Agent with VNC

+ 🤖 **AI-powered browser automation with full desktop environment access**

+ This enhanced Hugging Face Space provides a comprehensive computer-using agent that combines browser automation with a full VNC-accessible desktop environment, similar to OpenAI's Operator but with enhanced GUI capabilities.

+ ## ✨ New Features

+ ### 🌐 Enhanced Browser Automation
  - **Web Navigation**: Navigate to any URL with intelligent loading detection
  - **Screenshot Capture**: Take high-quality screenshots of web pages
  - **Element Interaction**: Click on elements, type text, and interact with forms
  - **Page Analysis**: Extract content, links, forms, and page structure
 
+ ### 🖥️ VNC Desktop Environment
+ - **Full GUI Access**: Complete XFCE4 desktop environment accessible via web
+ - **VNC Integration**: Direct VNC access through browser interface
+ - **Desktop Applications**: Run any Linux GUI applications
+ - **Web-based VNC**: Access desktop through noVNC web client
+
+ ### 🔧 Advanced Controls
+ - **Dual Interface**: Browser automation + full desktop environment
  - **CSS Selector Support**: Target specific elements using CSS selectors
  - **Scrolling**: Navigate up and down pages with customizable scroll amounts
  - **Content Extraction**: Get page text, HTML, and structural information
  - **Action History**: Track all actions performed by the agent

  ## 🚀 Usage
 
+ ### Browser Automation Tab
+ 1. Click "Initialize Browser" to start the browser automation
+ 2. Enter a URL and click "Navigate" to visit the page
+ 3. Use "Take Screenshot" to capture the current page
+ 4. Monitor status and action history
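
Reviewer sketch: the Browser Automation Tab steps above (initialize, navigate, take a screenshot) map roughly onto the following Playwright calls. This is a minimal illustration, not the code in `computer_agent_vnc.py`, which this diff does not show. The names `normalize_url` and `capture_page`, the output path, and the 1280x720 viewport (taken from the earlier README's browser settings) are assumptions.

```python
def normalize_url(url: str) -> str:
    """Prepend https:// when the user typed a bare host name (hypothetical helper)."""
    url = url.strip()
    if "://" not in url:
        return "https://" + url
    return url


def capture_page(url: str, out_path: str = "page.png", headless: bool = True) -> str:
    """Open a page in Chromium, save a screenshot, and return the page title."""
    # Imported lazily so normalize_url stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Step 1: "Initialize Browser"
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page(viewport={"width": 1280, "height": 720})
            # Step 2: "Navigate" (wait for the load event, 30 s limit)
            page.goto(normalize_url(url), wait_until="load", timeout=30_000)
            # Step 3: "Take Screenshot"
            page.screenshot(path=out_path, full_page=True)
            return page.title()
        finally:
            browser.close()
```

Calling `capture_page("example.com")` would need Playwright and its Chromium build installed (`pip install playwright` followed by `playwright install chromium`). Passing `headless=False` inside the VNC desktop should make the browser window visible there, assuming `DISPLAY` points at the VNC display.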
46
+
47
+ ### VNC Desktop Tab
48
+ 1. Click "Check VNC Status" to verify desktop environment
49
+ 2. Click "Open VNC Viewer" to access full desktop in new tab
50
+ 3. Use the desktop environment for any GUI applications
51
+ 4. **VNC Access Details:**
52
+ - **Port**: 5901
53
+ - **Password**: computer-agent
54
+ - **Web Interface**: Available through the VNC tab
55
+
56
+ ### System Info Tab
57
+ 1. Get detailed system information
58
+ 2. Monitor agent status and capabilities
59
+ 3. View feature availability
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60
 
61
  ## ๐Ÿ—๏ธ Architecture
62
 
+ ### Desktop Environment
+ - **GUI Framework**: XFCE4 (lightweight desktop environment)
+ - **VNC Server**: TigerVNC standalone server
+ - **Web Bridge**: noVNC + websockify for web access
+ - **Display Resolution**: 1920x1080 (configurable)
+
+ ### Browser Integration
+ - **Automation Engine**: Playwright with Chromium
+ - **Screenshot Capability**: Real-time page capture
+ - **Element Interaction**: Advanced DOM manipulation
+ - **Headless/Headed**: Configurable browser mode
+
+ ### Web Interface
+ - **Framework**: Gradio 4.21.0 with enhanced features
+ - **Tabs**: Browser automation, VNC desktop, system info
+ - **Real-time Updates**: Live status monitoring
+ - **Action History**: Complete interaction logging
+
+ ## 🛠️ Technical Specifications
+
+ ### Hardware Requirements
+ - **CPU**: 2 vCPU (included in CPU basic tier)
+ - **RAM**: 16 GB (adequate for desktop environment)
+ - **Storage**: Standard Hugging Face Space allocation
+
+ ### Software Stack
+ - **Base**: Ubuntu 22.04 LTS
+ - **Python**: 3.10 with optimized dependencies
+ - **Desktop**: XFCE4 + X11
+ - **VNC**: TigerVNC + noVNC
+ - **Browser**: Chromium with Playwright automation
+
+ ### Network Configuration
+ - **Web Interface**: Port 7860 (Gradio)
+ - **VNC Server**: Port 5901 (TigerVNC)
+ - **Web Bridge**: Port 5901 (websockify)
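
Reviewer sketch: by VNC convention, display `:N` listens on TCP port 5900 + N, so port 5901 suggests the server runs on display `:1` (an inference, since the diff does not show the start-up script). The snippet below derives the port from a display number and probes the listed ports from inside the container. The list above gives 5901 for both the VNC server and the web bridge, so that port is probed once. The function names and the `127.0.0.1` host are assumptions.

```python
import socket

VNC_BASE_PORT = 5900  # VNC convention: display :N listens on TCP port 5900 + N


def vnc_port(display: int) -> int:
    """Return the TCP port a VNC server uses for the given X display number."""
    if display < 0:
        raise ValueError("display number must be non-negative")
    return VNC_BASE_PORT + display


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports as documented in the README; display :1 is inferred from 5901.
    services = {"Gradio": 7860, "TigerVNC (display :1)": vnc_port(1)}
    for name, port in services.items():
        state = "open" if is_port_open("127.0.0.1", port) else "closed"
        print(f"{name}: port {port} is {state}")
```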
+
+ ## 🔒 Security Considerations
+
+ ### VNC Security
+ - **Password Protection**: VNC server requires authentication
+ - **Local Access**: VNC accessible only within the Space environment
+ - **No External Access**: Desktop environment isolated to container
+
+ ### Browser Security
+ - **Headless Mode**: Browser runs without visible interface
+ - **Security Disabled**: For automation compatibility (same-origin policy relaxed)
+ - **Sandboxed**: Browser runs in containerized environment
+
+ ## 🎯 Use Cases
+
+ ### Enhanced Automation
+ - **GUI Testing**: Test applications requiring desktop environment
+ - **Visual Regression**: Compare screenshots with desktop applications
+ - **Multi-app Workflows**: Coordinate between browser and desktop apps
+ - **Development**: Develop and test GUI applications
+
+ ### Research & Development
+ - **AI Research**: Run AI models with GUI interfaces
+ - **Data Analysis**: Use desktop tools for data visualization
+ - **Prototyping**: Rapid GUI application development
+ - **Education**: Interactive learning environments
+
+ ## 🔮 Advanced Features
+
+ ### VNC Integration Benefits
+ - **Full Desktop**: Access to complete Linux desktop environment
+ - **GUI Applications**: Run any X11-based applications
+ - **File Management**: Native file explorer and management tools
+ - **Development Tools**: IDEs, debuggers, and development utilities
+
+ ### Browser Automation Enhanced
+ - **Visual Testing**: Compare automated browser actions with desktop
+ - **Complex Workflows**: Combine browser automation with desktop apps
+ - **Screenshots**: Capture both browser and desktop content
+ - **Monitoring**: Real-time view of all automated activities
+
+ ## 📋 System Requirements
+
+ ### For Users
+ - **Web Browser**: Any modern browser with JavaScript enabled
+ - **Network**: Stable internet connection for Space access
+ - **VNC Viewer**: Built-in web VNC client (no installation required)
+
+ ### For Development
+ - **Docker**: For local testing and development
+ - **Linux**: Ubuntu 22.04 or compatible distribution
+ - **Python 3.10+**: For running enhanced agent locally

  ## 🚨 Important Notes
 
+ ### Performance Considerations
+ - **Resource Usage**: Desktop environment uses additional memory
+ - **Startup Time**: VNC server adds ~10-15 seconds to startup
+ - **Network**: VNC traffic uses bandwidth for remote desktop access

+ ### Best Practices
+ - **Close VNC**: Close VNC viewer when not in use to save resources
+ - **Monitor Usage**: Check Space logs for resource consumption
+ - **Test Locally**: Develop and test locally before deploying

+ ## 🔧 Troubleshooting

+ ### VNC Issues
+ - **Connection Failed**: Check VNC status in the interface
+ - **Black Screen**: Wait 30 seconds for desktop to fully initialize
+ - **Slow Performance**: Normal for remote desktop over web
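
Reviewer sketch: the "wait 30 seconds" advice above could be automated by polling the VNC port until it accepts connections. This is an assumption about how one might script the wait, not something the commit implements. An open port only shows that the server is listening, so XFCE4 may still need a moment to draw the desktop. The name `wait_for_port` and its defaults are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet: give up after the deadline, otherwise retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 5901 is the VNC port documented in the README.
    if wait_for_port("127.0.0.1", 5901):
        print("VNC server is accepting connections")
    else:
        print("VNC server did not come up within 30 seconds")
```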
 
+ ### Browser Automation
+ - **Elements Not Found**: Ensure page has fully loaded
+ - **Screenshots Fail**: Check browser initialization status
+ - **Navigation Timeout**: Verify URL accessibility

  ---

+ **Experience the future of web automation with full desktop capabilities! 🚀✨**
+
+ Built with ❤️ using Hugging Face Spaces, Gradio, Playwright, TigerVNC, and XFCE4