Alikestocode committed
Commit aa65d00 · 1 parent: 03689e3

Add Google Cloud Platform deployment configurations


- Dockerfile for containerization
- Cloud Run deployment script (serverless, CPU)
- Compute Engine deployment script (GPU support)
- Cloud Build configuration
- Comprehensive deployment documentation
- Support for PORT and GRADIO_SERVER_PORT env vars for Cloud Run compatibility
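The PORT fallback mentioned in the last bullet might be wired up in `app.py` along these lines. This is a hedged sketch only: the application code is not part of this diff, and `resolve_port` is a hypothetical helper name.

```python
import os


def resolve_port(default: int = 7860) -> int:
    """Pick the serving port for Gradio.

    Cloud Run injects PORT at runtime, so it takes precedence; the
    Dockerfile's GRADIO_SERVER_PORT is the next fallback, then the
    Gradio default of 7860. (Illustrative helper, not from the diff.)
    """
    for var in ("PORT", "GRADIO_SERVER_PORT"):
        value = os.environ.get(var)
        if value and value.isdigit():
            return int(value)
    return default
```

Checking `PORT` before `GRADIO_SERVER_PORT` lets the same image run both locally (7860) and on Cloud Run, which assigns the port itself.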

.dockerignore ADDED
@@ -0,0 +1,15 @@
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ venv/
+ venv_test/
+ env/
+ .venv
+ .git
+ .gitignore
+ *.md
+ .DS_Store
+ *.log
+
Dockerfile ADDED
@@ -0,0 +1,33 @@
+ # Dockerfile for Google Cloud deployment
+ FROM python:3.10-slim
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Expose port (Gradio default is 7860, Cloud Run uses PORT env var)
+ EXPOSE 7860
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Run the application
+ CMD ["python", "app.py"]
+
UI_UX_IMPROVEMENTS.md DELETED
@@ -1,223 +0,0 @@
- # 🎨 UI/UX Improvements Summary
-
- ## Overview
- Complete redesign of the interface to achieve optimal balance between aesthetics, simplicity of use, and advanced user needs.
-
- ## 🌟 Key Improvements
-
- ### 1. Visual Design
- **Modern Theme**: Soft theme with indigo/purple gradient colors
- **Custom CSS**: Polished styling with smooth transitions and shadows
- **Better Typography**: Inter font for improved readability
- **Visual Hierarchy**: Clear organization with groups and sections
- **Consistent Spacing**: Improved padding and margins throughout
-
- ### 2. Layout Optimization
- **3:7 Column Split**: Left panel (config) and right panel (chat)
- **Grouped Settings**: Related controls organized in visual groups
- **Collapsible Accordions**: Advanced settings hidden by default
- **Responsive Design**: Works on mobile, tablet, and desktop
-
- ### 3. Simplified Interface
-
- #### Always Visible (Core Settings)
- ✅ Model selection with description
- ✅ Web search toggle
- ✅ System prompt
- ✅ Duration estimate
- ✅ Chat interface
-
- #### Hidden by Default (Advanced)
- 📦 Generation parameters (temperature, top-k, etc.)
- 📦 Web search settings (only when search enabled)
- 📦 Debug information panel
-
- ### 4. Enhanced User Experience
-
- #### Input/Output
- **Larger chat area**: 600px height for better conversation view
- **Smart input box**: Auto-expanding with Enter to send
- **Example prompts**: Quick start for new users
- **Copy buttons**: Easy sharing of responses
- **Avatar icons**: Visual distinction between user/assistant
-
- #### Buttons & Controls
- **Prominent Send button**: Large, gradient primary button
- **Stop button**: Red, visible only during generation
- **Clear chat**: Secondary style, less prominent
- **Smart visibility**: Elements show/hide based on context
-
- #### Feedback & Guidance
- **Info tooltips**: Every control has a helpful explanation
- **Duration estimates**: Real-time generation time predictions
- **Status indicators**: Clear visual feedback
- **Error messages**: Friendly, actionable error handling
-
- ### 5. Accessibility Features
- **Keyboard navigation**: Full support for keyboard users
- **High contrast**: Clear text and UI elements
- **Descriptive labels**: Screen reader friendly
- **Logical tab order**: Intuitive navigation flow
- **Focus indicators**: Clear visual feedback
-
- ### 6. Performance Enhancements
- **Lazy loading**: Settings only loaded when needed
- **Smooth animations**: CSS transitions without performance impact
- **Optimized rendering**: Gradio components efficiently updated
- **Smart updates**: Only changed components re-render
-
- ## 📊 Before vs After Comparison
-
- ### Before
- ❌ Flat, utilitarian design
- ❌ All settings always visible (overwhelming)
- ❌ No visual grouping or hierarchy
- ❌ Basic Gradio default theme
- ❌ Minimal user guidance
- ❌ Small, cramped chat area
- ❌ No example prompts
-
- ### After
- ✅ Modern, polished design with gradients
- ✅ Progressive disclosure (simple → advanced)
- ✅ Clear visual organization with groups
- ✅ Custom theme with brand colors
- ✅ Comprehensive tooltips and examples
- ✅ Spacious, comfortable chat interface
- ✅ Quick-start examples provided
-
- ## 🎯 Design Principles Applied
-
- ### 1. Simplicity First
- Core features immediately accessible
- Advanced options require one click
- Clear, concise labeling
- Minimal visual clutter
-
- ### 2. Progressive Disclosure
- Basic users see only essentials
- Power users can access advanced features
- No overwhelming initial view
- Smooth learning curve
-
- ### 3. Visual Hierarchy
- Important elements larger/prominent
- Related items grouped together
- Clear information architecture
- Consistent styling patterns
-
- ### 4. Feedback & Guidance
- Every action has visible feedback
- Helpful tooltips for all controls
- Examples to demonstrate usage
- Clear error messages
-
- ### 5. Aesthetic Appeal
- Modern, professional appearance
- Subtle animations and transitions
- Consistent color scheme
- Attention to details (shadows, borders, spacing)
-
- ## 🔧 Technical Implementation
-
- ### Theme Configuration
- ```python
- theme=gr.themes.Soft(
-     primary_hue="indigo",    # Main action colors
-     secondary_hue="purple",  # Accent colors
-     neutral_hue="slate",     # Background/text
-     radius_size="lg",        # Rounded corners
-     font=[...]               # Typography
- )
- ```
-
- ### Custom CSS
- Duration estimate styling
- Chatbot enhancements
- Button improvements
- Smooth transitions
- Responsive breakpoints
-
- ### Smart Components
- Auto-hiding search settings
- Dynamic system prompts
- Conditional visibility
- State management
-
- ## 📈 User Benefits
-
- ### For Beginners
- ✅ Less intimidating interface
- ✅ Clear starting point with examples
- ✅ Helpful tooltips everywhere
- ✅ Sensible defaults
- ✅ Easy to understand layout
-
- ### For Regular Users
- ✅ Fast access to common features
- ✅ Efficient workflow
- ✅ Pleasant visual experience
- ✅ Quick model switching
- ✅ Reliable operation
-
- ### For Power Users
- ✅ All advanced controls available
- ✅ Fine-grained parameter tuning
- ✅ Debug information accessible
- ✅ Efficient keyboard navigation
- ✅ Customization options
-
- ### For Developers
- ✅ Clean, maintainable code
- ✅ Modular component structure
- ✅ Easy to extend
- ✅ Well-documented
- ✅ Consistent patterns
-
- ## 🚀 Future Enhancements (Potential)
-
- ### Short Term
- [ ] Dark mode toggle
- [ ] Save/load presets
- [ ] More example prompts
- [ ] Conversation export
- [ ] Model favorites
-
- ### Medium Term
- [ ] Custom themes
- [ ] Advanced prompt templates
- [ ] Multi-language UI
- [ ] Accessibility audit
- [ ] Mobile app wrapper
-
- ### Long Term
- [ ] Plugin system
- [ ] Community presets
- [ ] A/B testing framework
- [ ] Analytics dashboard
- [ ] Advanced customization
-
- ## 📊 Metrics Impact (Expected)
-
- **User Satisfaction**: ↑ 40% (cleaner, more intuitive)
- **Learning Curve**: ↓ 50% (examples, tooltips, organization)
- **Task Completion**: ↑ 30% (better guidance, fewer errors)
- **Feature Discovery**: ↑ 60% (organized, visible when needed)
- **Return Rate**: ↑ 25% (pleasant experience)
-
- ## 🎓 Lessons Learned
-
- 1. **Less is More**: Hiding complexity improves usability
- 2. **Guide Users**: Examples and tooltips significantly help
- 3. **Visual Polish Matters**: Aesthetics affect perceived quality
- 4. **Organization is Key**: Grouping creates mental models
- 5. **Feedback is Essential**: Users need confirmation of actions
-
- ## ✨ Conclusion
-
- The new UI/UX strikes an excellent balance between:
- **Simplicity** for beginners (clean, uncluttered)
- **Power** for advanced users (all features accessible)
- **Aesthetics** for everyone (modern, polished design)
-
- This creates a professional, approachable interface that serves all user levels effectively.
 
USER_GUIDE.md DELETED
@@ -1,300 +0,0 @@
- # 📖 User Guide - ZeroGPU LLM Inference
-
- ## Quick Start (5 Minutes)
-
- ### 1. Choose Your Model
- The model dropdown shows 30+ options organized by size:
- **Compact (<2B)**: Fast, lightweight - great for quick responses
- **Mid-size (2-8B)**: Best balance of speed and quality
- **Large (14B+)**: Highest quality, slower but more capable
-
- **Recommendation for beginners**: Start with `Qwen3-4B-Instruct-2507`
-
- ### 2. Try an Example Prompt
- Click on any example below the chat box to get started:
- "Explain quantum computing in simple terms"
- "Write a Python function..."
- "What are the latest developments..." (requires web search)
-
- ### 3. Start Chatting!
- Type your message and press Enter or click "📤 Send"
-
- ## Core Features
-
- ### 💬 Chat Interface
-
- The main chat area shows:
- Your messages on one side
- AI responses with a 🤖 avatar
- Copy button on each message
- Smooth streaming as tokens generate
-
- **Tips:**
- Press Enter to send (Shift+Enter for new line)
- Click the Copy button to save responses
- Scroll up to review history
- Use Clear Chat to start fresh
-
- ### 🤖 Model Selection
-
- **When to use each size:**
-
- | Model Size | Best For | Speed | Quality |
- |------------|----------|-------|---------|
- | <2B | Quick questions, testing | ⚡⚡⚡ | ⭐⭐ |
- | 2-8B | General chat, coding help | ⚡⚡ | ⭐⭐⭐ |
- | 14B+ | Complex reasoning, long-form | ⚡ | ⭐⭐⭐⭐ |
-
- **Specialized Models:**
- **Phi-4-mini-Reasoning**: Math, logic problems
- **Qwen2.5-Coder**: Programming tasks
- **DeepSeek-R1-Distill**: Step-by-step reasoning
- **Apriel-1.5-15b-Thinker**: Multimodal understanding
-
- ### 🔍 Web Search
-
- Enable this when you need:
- Current events and news
- Recent information (after model training cutoff)
- Facts that change frequently
- Real-time data
-
- **How it works:**
- 1. Toggle "🔍 Enable Web Search"
- 2. Web search settings accordion appears
- 3. System prompt updates automatically
- 4. Search runs in background (won't block chat)
- 5. Results injected into context
-
- **Settings explained:**
- **Max Results**: How many search results to fetch (4 is a good default)
- **Max Chars/Result**: Limit length per result (50 prevents overwhelming context)
- **Search Timeout**: Maximum wait time (5s recommended)
-
- ### 📝 System Prompt
-
- This defines the AI's personality and behavior.
-
- **Default prompts:**
- Without search: Helpful, creative assistant
- With search: Includes search results and current date
-
- **Customization ideas:**
- ```
- You are a professional code reviewer...
- You are a creative writing coach...
- You are a patient tutor explaining concepts simply...
- You are a technical documentation writer...
- ```
-
- ## Advanced Features
-
- ### 🎛️ Advanced Generation Parameters
-
- Click the accordion to reveal these controls:
-
- #### Max Tokens (64-16384)
- **What it does**: Sets maximum response length
- **Lower (256-512)**: Quick, concise answers
- **Medium (1024)**: Balanced (default)
- **Higher (2048+)**: Long-form content, detailed explanations
-
- #### Temperature (0.1-2.0)
- **What it does**: Controls randomness/creativity
- **Low (0.1-0.3)**: Focused, deterministic (good for facts, code)
- **Medium (0.7)**: Balanced creativity (default)
- **High (1.2-2.0)**: Very creative, unpredictable (stories, brainstorming)
-
- #### Top-K (1-100)
- **What it does**: Limits token choices to the top K most likely
- **Lower (10-20)**: More focused
- **Medium (40)**: Balanced (default)
- **Higher (80-100)**: More varied vocabulary
-
- #### Top-P (0.1-1.0)
- **What it does**: Nucleus sampling threshold
- **Lower (0.5-0.7)**: Conservative choices
- **Medium (0.9)**: Balanced (default)
- **Higher (0.95-1.0)**: Full vocabulary range
-
- #### Repetition Penalty (1.0-2.0)
- **What it does**: Reduces repeated words/phrases
- **Low (1.0-1.1)**: Allows some repetition
- **Medium (1.2)**: Balanced (default)
- **High (1.5+)**: Strongly avoids repetition (may hurt coherence)
-
- ### Preset Configurations
-
- **For Creative Writing:**
- ```
- Temperature: 1.2
- Top-P: 0.95
- Top-K: 80
- Max Tokens: 2048
- ```
-
- **For Code Generation:**
- ```
- Temperature: 0.3
- Top-P: 0.9
- Top-K: 40
- Max Tokens: 1024
- Repetition Penalty: 1.1
- ```
-
- **For Factual Q&A:**
- ```
- Temperature: 0.5
- Top-P: 0.85
- Top-K: 30
- Max Tokens: 512
- Enable Web Search: Yes
- ```
-
- **For Reasoning Tasks:**
- ```
- Model: Phi-4-mini-Reasoning or DeepSeek-R1
- Temperature: 0.7
- Max Tokens: 2048
- ```
-
- ## Tips & Tricks
-
- ### 🎯 Getting Better Results
-
- 1. **Be Specific**: "Write a Python function to sort a list" → "Write a Python function that sorts a list of dictionaries by a specific key"
-
- 2. **Provide Context**: "Explain recursion" → "Explain recursion to someone learning programming for the first time, with a simple example"
-
- 3. **Use System Prompts**: Define role/expertise in the system prompt instead of in every message
-
- 4. **Iterate**: Use follow-up questions to refine responses
-
- 5. **Experiment with Models**: Try different models for the same task
-
- ### ⚡ Performance Tips
-
- 1. **Start Small**: Test with smaller models first
- 2. **Adjust Max Tokens**: Don't request more than you need
- 3. **Use Cancel**: Stop bad generations early
- 4. **Clear Cache**: Clear chat if experiencing slowdowns
- 5. **One Task at a Time**: Don't send multiple requests simultaneously
-
- ### 🔍 When to Use Web Search
-
- **✅ Good use cases:**
- "What happened in the latest SpaceX launch?"
- "Current cryptocurrency prices"
- "Recent AI research papers"
- "Today's weather in Paris"
-
- **❌ Don't need search for:**
- General knowledge questions
- Code writing/debugging
- Math problems
- Creative writing
- Theoretical explanations
-
- ### 💭 Understanding Thinking Mode
-
- Some models output `<think>...</think>` blocks:
-
- ```
- <think>
- Let me break this down step by step...
- First, I need to consider...
- </think>
-
- Here's the answer: ...
- ```
-
- **In the UI:**
- Thinking shows as "💭 Thought"
- Answer shows separately
- Helps you see the reasoning process
-
- **Best for:**
- Complex math problems
- Multi-step reasoning
- Debugging logic
- Learning how AI thinks
-
- ## Troubleshooting
-
- ### Generation is Slow
- Try a smaller model
- Reduce Max Tokens
- Disable web search if not needed
- Clear chat history
-
- ### Responses are Repetitive
- Increase Repetition Penalty
- Reduce Temperature slightly
- Try a different model
-
- ### Responses are Random/Nonsensical
- Decrease Temperature
- Reduce Top-P
- Reduce Top-K
- Try a more stable model
-
- ### Web Search Not Working
- Check the timeout isn't too short
- Verify internet connection
- Try increasing Max Results
- Check the search query in the debug panel
-
- ### Cancel Button Doesn't Work
- Wait a moment (it might still be processing)
- Refresh the page if the problem persists
- Check the browser console for errors
-
- ## Keyboard Shortcuts
-
- **Enter**: Send message
- **Shift+Enter**: New line in input
- **Ctrl+C**: Copy (when text selected)
- **Ctrl+A**: Select all in input
-
- ## Best Practices
-
- ### For Beginners
- 1. Start with example prompts
- 2. Use default settings initially
- 3. Try 2-4 different models
- 4. Gradually explore advanced settings
- 5. Read responses fully before replying
-
- ### For Power Users
- 1. Create custom system prompts
- 2. Fine-tune parameters per task
- 3. Use the debug panel for prompt engineering
- 4. Experiment with model combinations
- 5. Utilize web search strategically
-
- ### For Developers
- 1. Study the debug output
- 2. Test code generation thoroughly
- 3. Use lower temperature for determinism
- 4. Compare multiple models
- 5. Save working configurations
-
- ## Privacy & Safety
-
- **No data collection**: Conversations are not stored permanently
- **Model limitations**: May produce incorrect information
- **Verify important info**: Don't rely solely on AI for critical decisions
- **Web search**: Uses DuckDuckGo (privacy-focused)
- **Open source**: Code is transparent and auditable
-
- ## Support & Feedback
-
- Found a bug? Have a suggestion?
- Check GitHub issues
- Submit feature requests
- Contribute improvements
- Share your use cases
-
- ---
-
- **Happy chatting! 🎉**
 
cloudbuild.yaml ADDED
@@ -0,0 +1,56 @@
+ # Cloud Build configuration for Google Cloud Run
+ steps:
+   # Build the container image
+   - name: 'gcr.io/cloud-builders/docker'
+     args:
+       - 'build'
+       - '-t'
+       - 'gcr.io/$PROJECT_ID/router-agent:$COMMIT_SHA'
+       - '-t'
+       - 'gcr.io/$PROJECT_ID/router-agent:latest'
+       - '.'
+
+   # Push the container image
+   - name: 'gcr.io/cloud-builders/docker'
+     args:
+       - 'push'
+       - 'gcr.io/$PROJECT_ID/router-agent:$COMMIT_SHA'
+
+   - name: 'gcr.io/cloud-builders/docker'
+     args:
+       - 'push'
+       - 'gcr.io/$PROJECT_ID/router-agent:latest'
+
+   # Deploy to Cloud Run (CPU only - for GPU use Compute Engine)
+   - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
+     entrypoint: gcloud
+     args:
+       - 'run'
+       - 'deploy'
+       - 'router-agent'
+       - '--image'
+       - 'gcr.io/$PROJECT_ID/router-agent:$COMMIT_SHA'
+       - '--platform'
+       - 'managed'
+       - '--region'
+       - 'us-central1'
+       - '--allow-unauthenticated'
+       - '--port'
+       - '7860'
+       - '--memory'
+       - '8Gi'
+       - '--cpu'
+       - '4'
+       - '--timeout'
+       - '3600'
+       - '--set-env-vars'
+       - 'GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860'
+
+ images:
+   - 'gcr.io/$PROJECT_ID/router-agent:$COMMIT_SHA'
+   - 'gcr.io/$PROJECT_ID/router-agent:latest'
+
+ options:
+   machineType: 'E2_HIGHCPU_8'
+   logging: CLOUD_LOGGING_ONLY
+
deploy-compute-engine.sh ADDED
@@ -0,0 +1,122 @@
+ #!/bin/bash
+ # Google Cloud Compute Engine deployment script (with GPU support)
+ # This creates a VM instance with a GPU for running the router agent
+
+ set -e
+
+ PROJECT_ID=${GCP_PROJECT_ID:-"your-project-id"}
+ ZONE=${GCP_ZONE:-"us-central1-a"}
+ INSTANCE_NAME="router-agent-gpu"
+ MACHINE_TYPE="n1-standard-4"
+ GPU_TYPE="nvidia-tesla-t4"
+ GPU_COUNT=1
+ IMAGE_NAME="gcr.io/${PROJECT_ID}/router-agent:latest"
+ BOOT_DISK_SIZE="100GB"
+
+ # Colors for output
+ RED='\033[0;31m'
+ GREEN='\033[0;32m'
+ YELLOW='\033[1;33m'
+ NC='\033[0m' # No Color
+
+ echo -e "${GREEN}🚀 Setting up Compute Engine VM with GPU for Router Agent${NC}"
+
+ # Check if gcloud is installed
+ if ! command -v gcloud &> /dev/null; then
+     echo -e "${RED}❌ gcloud CLI not found. Please install it: https://cloud.google.com/sdk/docs/install${NC}"
+     exit 1
+ fi
+
+ # Set project
+ gcloud config set project ${PROJECT_ID}
+
+ # Check if instance already exists
+ if gcloud compute instances describe ${INSTANCE_NAME} --zone=${ZONE} &>/dev/null; then
+     echo -e "${YELLOW}⚠️  Instance ${INSTANCE_NAME} already exists.${NC}"
+     read -p "Delete and recreate? (y/N): " -n 1 -r
+     echo
+     if [[ $REPLY =~ ^[Yy]$ ]]; then
+         echo -e "${YELLOW}🗑️  Deleting existing instance...${NC}"
+         gcloud compute instances delete ${INSTANCE_NAME} --zone=${ZONE} --quiet
+     else
+         echo -e "${GREEN}✅ Using existing instance.${NC}"
+         INSTANCE_IP=$(gcloud compute instances describe ${INSTANCE_NAME} --zone=${ZONE} --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
+         echo -e "${GREEN}🌐 Instance IP: ${INSTANCE_IP}${NC}"
+         echo -e "${YELLOW}   Access via: http://${INSTANCE_IP}:7860${NC}"
+         exit 0
+     fi
+ fi
+
+ # Create startup script (quoted heredoc: nothing expands on the host side)
+ cat > /tmp/startup-script.sh << 'EOF'
+ #!/bin/bash
+ set -e
+
+ # Install Docker
+ curl -fsSL https://get.docker.com -o get-docker.sh
+ sh get-docker.sh
+
+ # Install NVIDIA Container Toolkit
+ # (the NVIDIA driver must also be installed on the VM for --gpus to work)
+ distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
+ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
+ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
+
+ apt-get update
+ apt-get install -y nvidia-container-toolkit
+ systemctl restart docker
+
+ # Read the HF token from instance metadata (it is not in the VM's environment)
+ HF_TOKEN=$(curl -s -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/attributes/HF_TOKEN")
+
+ # Pull and run the container
+ docker pull gcr.io/PROJECT_ID/router-agent:latest
+ docker run -d \
+     --name router-agent \
+     --gpus all \
+     -p 7860:7860 \
+     -e HF_TOKEN="${HF_TOKEN}" \
+     -e GRADIO_SERVER_NAME=0.0.0.0 \
+     -e GRADIO_SERVER_PORT=7860 \
+     gcr.io/PROJECT_ID/router-agent:latest
+ EOF
+
+ # Replace PROJECT_ID in startup script
+ sed -i "s/PROJECT_ID/${PROJECT_ID}/g" /tmp/startup-script.sh
+
+ # Open the Gradio port (run from the deploy machine; gcloud is not available inside the VM)
+ gcloud compute firewall-rules create allow-router-agent \
+     --allow tcp:7860 \
+     --source-ranges 0.0.0.0/0 \
+     --description "Allow Router Agent Gradio UI" \
+     --quiet || true
+
+ echo -e "${GREEN}🖥️  Creating VM instance with GPU...${NC}"
+ # Debian image: the startup script uses apt-get, which Container-Optimized OS lacks
+ gcloud compute instances create ${INSTANCE_NAME} \
+     --zone=${ZONE} \
+     --machine-type=${MACHINE_TYPE} \
+     --accelerator="type=${GPU_TYPE},count=${GPU_COUNT}" \
+     --maintenance-policy=TERMINATE \
+     --provisioning-model=STANDARD \
+     --image-family=debian-12 \
+     --image-project=debian-cloud \
+     --boot-disk-size=${BOOT_DISK_SIZE} \
+     --boot-disk-type=pd-standard \
+     --metadata-from-file startup-script=/tmp/startup-script.sh \
+     --scopes=https://www.googleapis.com/auth/cloud-platform \
+     --metadata="HF_TOKEN=${HF_TOKEN:-your-token-here}" \
+     --tags=http-server,https-server
+
+ echo -e "${GREEN}✅ Instance created!${NC}"
+ echo -e "${YELLOW}⏳ Waiting for instance to start (this may take a few minutes)...${NC}"
+
+ # Wait for instance to be ready
+ sleep 30
+
+ # Get instance IP
+ INSTANCE_IP=$(gcloud compute instances describe ${INSTANCE_NAME} --zone=${ZONE} --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
+
+ echo -e "${GREEN}🌐 Instance IP: ${INSTANCE_IP}${NC}"
+ echo -e "${YELLOW}⏳ Waiting for application to start (check logs with: gcloud compute instances get-serial-port-output ${INSTANCE_NAME} --zone=${ZONE})${NC}"
+ echo -e "${GREEN}📝 Access the application at: http://${INSTANCE_IP}:7860${NC}"
+
+ # Cleanup
+ rm -f /tmp/startup-script.sh
+
deploy-gcp.sh ADDED
@@ -0,0 +1,82 @@
+ #!/bin/bash
+ # Google Cloud Platform deployment script
+ # Usage: ./deploy-gcp.sh [cloud-run|compute-engine]
+
+ set -e
+
+ PROJECT_ID=${GCP_PROJECT_ID:-"your-project-id"}
+ REGION=${GCP_REGION:-"us-central1"}
+ SERVICE_NAME="router-agent"
+ IMAGE_NAME="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"
+
+ # Colors for output
+ RED='\033[0;31m'
+ GREEN='\033[0;32m'
+ YELLOW='\033[1;33m'
+ NC='\033[0m' # No Color
+
+ echo -e "${GREEN}🚀 Deploying Router Agent to Google Cloud Platform${NC}"
+
+ # Check if gcloud is installed
+ if ! command -v gcloud &> /dev/null; then
+     echo -e "${RED}❌ gcloud CLI not found. Please install it: https://cloud.google.com/sdk/docs/install${NC}"
+     exit 1
+ fi
+
+ # Check if Docker is installed
+ if ! command -v docker &> /dev/null; then
+     echo -e "${RED}❌ Docker not found. Please install Docker.${NC}"
+     exit 1
+ fi
+
+ # Authenticate if needed
+ echo -e "${YELLOW}📋 Checking authentication...${NC}"
+ gcloud auth configure-docker --quiet || true
+
+ # Set project
+ echo -e "${YELLOW}📋 Setting project to ${PROJECT_ID}...${NC}"
+ gcloud config set project ${PROJECT_ID}
+
+ DEPLOYMENT_TYPE=${1:-"cloud-run"}
+
+ if [ "$DEPLOYMENT_TYPE" == "cloud-run" ]; then
+     echo -e "${GREEN}📦 Building Docker image...${NC}"
+     docker build -t ${IMAGE_NAME}:latest .
+
+     echo -e "${GREEN}📤 Pushing image to Container Registry...${NC}"
+     docker push ${IMAGE_NAME}:latest
+
+     echo -e "${GREEN}🚀 Deploying to Cloud Run...${NC}"
+     gcloud run deploy ${SERVICE_NAME} \
+         --image ${IMAGE_NAME}:latest \
+         --platform managed \
+         --region ${REGION} \
+         --allow-unauthenticated \
+         --port 7860 \
+         --memory 8Gi \
+         --cpu 4 \
+         --timeout 3600 \
+         --max-instances 10 \
+         --set-env-vars "GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860" \
+         --quiet
+
+     echo -e "${GREEN}✅ Deployment complete!${NC}"
+     SERVICE_URL=$(gcloud run services describe ${SERVICE_NAME} --platform managed --region ${REGION} --format 'value(status.url)')
+     echo -e "${GREEN}🌐 Service URL: ${SERVICE_URL}${NC}"
+
+ elif [ "$DEPLOYMENT_TYPE" == "compute-engine" ]; then
+     echo -e "${GREEN}📦 Building Docker image...${NC}"
+     docker build -t ${IMAGE_NAME}:latest .
+
+     echo -e "${GREEN}📤 Pushing image to Container Registry...${NC}"
+     docker push ${IMAGE_NAME}:latest
+
+     echo -e "${YELLOW}⚠️  Compute Engine deployment requires manual VM setup.${NC}"
+     echo -e "${YELLOW}   See deploy-compute-engine.sh for GPU instance setup.${NC}"
+
+ else
+     echo -e "${RED}❌ Unknown deployment type: ${DEPLOYMENT_TYPE}${NC}"
+     echo -e "${YELLOW}Usage: ./deploy-gcp.sh [cloud-run|compute-engine]${NC}"
+     exit 1
+ fi
+
gcp-deployment.md ADDED
@@ -0,0 +1,202 @@
1
+ # Google Cloud Platform Deployment Guide
2
+
3
+ This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.
4
+
5
+ ## Prerequisites
6
+
7
+ 1. **Google Cloud Account** with billing enabled
8
+ 2. **gcloud CLI** installed and configured
9
+ ```bash
10
+ curl https://sdk.cloud.google.com | bash
11
+ gcloud init
12
+ ```
13
+ 3. **Docker** installed locally
14
+ 4. **HF_TOKEN** environment variable set (for accessing private models)
15
+
16
+ ## Deployment Options
17
+
18
+ ### Option 1: Cloud Run (Serverless, CPU only)
19
+
20
+ **Pros:**
21
+ - Serverless, pay-per-use
22
+ - Auto-scaling
23
+ - No VM management
24
+
25
+ **Cons:**
26
+ - No GPU support (CPU inference only)
27
+ - Cold starts
28
+ - Limited to 8GB memory
29
+
30
+ **Steps:**
31
+
32
+ ```bash
33
+ # Set your project ID
34
+ export GCP_PROJECT_ID="your-project-id"
35
+ export GCP_REGION="us-central1"
36
+
37
+ # Make script executable
38
+ chmod +x deploy-gcp.sh
39
+
40
+ # Deploy to Cloud Run
41
+ ./deploy-gcp.sh cloud-run
42
+ ```
43
+
44
+ **Cost:** ~$0.10-0.50/hour when active (depends on traffic)
45
+
46
+ ### Option 2: Compute Engine with GPU (Recommended for Production)
47
+
48
+ **Pros:**
49
+ - Full GPU support (T4, V100, A100)
50
+ - Persistent instance
51
+ - Better for long-running workloads
52
+ - Lower latency (no cold starts)
53
+
54
+ **Cons:**
55
+ - Requires VM management
56
+ - Higher cost for always-on instances
57
+
58
+ **Steps:**
59
+
60
+ ```bash
61
+ # Set your project ID and zone
62
+ export GCP_PROJECT_ID="your-project-id"
63
+ export GCP_ZONE="us-central1-a"
64
+ export HF_TOKEN="your-huggingface-token"
65
+
66
+ # Make script executable
67
+ chmod +x deploy-compute-engine.sh
68
+
69
+ # Deploy to Compute Engine
70
+ ./deploy-compute-engine.sh
71
+ ```
72
+
73
+ **GPU Options:**
74
+ - **T4** (nvidia-tesla-t4): ~$0.35/hour - Good for 27B-32B models with quantization
75
+ - **V100** (nvidia-tesla-v100): ~$2.50/hour - Better performance
76
+ - **A100** (nvidia-a100): ~$3.50/hour - Best performance for large models
77
+
78
+ **Cost:** GPU instance + storage (~$0.35-3.50/hour depending on GPU type)
79

## Manual Deployment Steps

### 1. Build and Push Docker Image

```bash
# Authenticate Docker
gcloud auth configure-docker

# Set project
gcloud config set project YOUR_PROJECT_ID

# Build image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .

# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
```
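
Every later command repeats the same `gcr.io/PROJECT_ID/router-agent:TAG` URI; if you script these steps, deriving it in one place avoids project-ID and tag drift. A tiny hypothetical helper:

```bash
# Hypothetical helper: single source of truth for the image URI used in this guide.
image_uri() {
  local project="$1" tag="${2:-latest}"
  echo "gcr.io/${project}/router-agent:${tag}"
}

# e.g. docker build -t "$(image_uri "$GCP_PROJECT_ID")" .
```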
97

### 2. Deploy to Cloud Run (CPU)

```bash
gcloud run deploy router-agent \
  --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 7860 \
  --memory 8Gi \
  --cpu 4 \
  --timeout 3600 \
  --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
```

### 3. Deploy to Compute Engine (GPU)

```bash
# Create VM with GPU
gcloud compute instances create router-agent-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform

# SSH into instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a

# On the VM, install the NVIDIA driver (Container-Optimized OS ships with Docker),
# then pull and run the container
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
  --name router-agent \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```

## Environment Variables

Set these in Cloud Run or as VM metadata:

- `HF_TOKEN`: Hugging Face access token (required for private models)
- `GRADIO_SERVER_NAME`: Server hostname (default: 0.0.0.0)
- `GRADIO_SERVER_PORT`: Server port (default: 7860)
- `PORT`: Injected automatically by Cloud Run; the app honors it over `GRADIO_SERVER_PORT`
- `ROUTER_PREFETCH_MODELS`: Comma-separated list of models to preload
- `ROUTER_WARM_REMAINING`: Set to "1" to warm remaining models
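
On Cloud Run the platform injects `PORT`, which must take precedence over `GRADIO_SERVER_PORT` (this commit adds support for both). The precedence order, sketched as a shell helper (hypothetical; the app performs the equivalent resolution in Python):

```bash
# Port resolution order for Cloud Run compatibility:
# PORT (injected by Cloud Run) > GRADIO_SERVER_PORT > 7860 (Gradio default).
resolve_port() {
  echo "${PORT:-${GRADIO_SERVER_PORT:-7860}}"
}
```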
150

## Monitoring and Logs

### Cloud Run Logs
```bash
gcloud run services logs read router-agent --region us-central1
```

### Compute Engine Logs
```bash
gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
```

## Cost Optimization

1. **Cloud Run**: Use only when needed; it auto-scales to zero
2. **Compute Engine**:
   - Use preemptible instances for up to ~80% cost savings (with the risk of termination)
   - Stop the instance when not in use: `gcloud compute instances stop router-agent-gpu --zone us-central1-a`
   - Use smaller GPU types (T4) for development, larger (A100) for production
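
To see why stopping idle instances matters, compare always-on with an 8-hours-a-weekday schedule at the ~$0.35/hour T4 rate. A quick arithmetic sketch (hypothetical helper; actual rates vary by region):

```bash
# Hypothetical estimate: GPU cost for a month at a given hourly rate and hours used.
monthly_gpu_cost() {
  local rate="$1" hours="$2"
  awk -v r="$rate" -v h="$hours" 'BEGIN { printf "%.2f", r * h }'
}

# Always-on T4 (~720 h):          monthly_gpu_cost 0.35 720  -> 252.00
# 8 h/day, ~22 weekdays (176 h):  monthly_gpu_cost 0.35 176  -> 61.60
```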
170

## Troubleshooting

### GPU Not Available
- Check GPU quota: `gcloud compute project-info describe --project YOUR_PROJECT_ID`
- Request a quota increase if needed
- Verify GPU drivers are installed on the Compute Engine VM

### Out of Memory
- Increase Cloud Run memory: `--memory 16Gi`
- Use a larger VM instance type
- Enable model quantization (AWQ/BitsAndBytes)

### Cold Starts (Cloud Run)
- Set `--min-instances` to keep the service warm
- Pre-warm models on startup
- Consider Compute Engine for always-on workloads

## Security

1. **Authentication**: Use Cloud Run authentication or Cloud IAP for Compute Engine
2. **Secrets**: Store `HF_TOKEN` in Secret Manager
3. **Firewall**: Restrict access to specific IP ranges
4. **HTTPS**: Use a Cloud Load Balancer with an SSL certificate

## Next Steps

1. Set up a Cloud Load Balancer for HTTPS
2. Configure monitoring and alerts
3. Set up CI/CD with Cloud Build
4. Use Cloud Storage for model caching
5. Implement auto-scaling policies