# AntiAtropos AWS Operations Guide

Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.

**Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.**

---

## Table of Contents

1. [Replica Strategy & Caps](#1-replica-strategy--caps)
2. [Autoscaling Configuration](#2-autoscaling-configuration)
3. [Cost Guardrails](#3-cost-guardrails)
4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough)
5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws)
6. [Day-2 Operations](#6-day-2-operations)
7. [Teardown & Cost Recovery](#7-teardown--cost-recovery)

---

## 1. Replica Strategy & Caps

### What Runs Where

| Component | Where | Scaled By | Cost Impact |
|---|---|---|---|
| **AntiAtropos FastAPI server** | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
| **Workload pods** (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | **HIGH** — this is where costs spiral |
| **Prometheus Agent** | EKS (monitoring ns) | Static (1 pod) | Low |
| **AMP** | AWS managed | Serverless | Pay per metric sample ingested, plus storage and queries |
| **AMG** | AWS managed | Serverless | Pay per editor |

### Workload Pod Replicas — Where Costs Spiral

The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap.

The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the **global** ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20 — with 5 deployments, that's **100 pods** worst case. **Set it to 6.**

**Recommended caps by deployment:**

| Deployment | Min | Max Replicas | Reasoning |
|---|---|---|---|
| `payments` (node-0, VIP) | 2 | 6 | VIP node — needs redundancy, 6 is plenty for the traffic model |
| `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high |
| `catalog` (node-2) | 1 | 5 | Same |
| `cart` (node-3) | 1 | 4 | Non-critical, sheddable |
| `auth` (node-4) | 1 | 4 | Non-critical, sheddable |

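**Total worst case with these caps: 24 workload pods.** The global `ANTIATROPOS_MAX_REPLICAS=6` does not enforce the lower per-deployment caps; on its own it allows up to 5 x 6 = 30 pods.

At ~0.25 vCPU / 256 MB per workload pod (nginx containers), 24 pods need ~6 vCPU and ~6 GB RAM. A t3.medium provides 2 vCPU and 4 GiB, so if those figures are scheduled as resource requests, the worst case needs 3 nodes for CPU alone, and the 4-node `maxSize` leaves room for system pods.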
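
The worst-case arithmetic above can be checked with a short sketch. It assumes the ~0.25 vCPU / 256 MB figures are scheduled as resource requests and that a t3.medium offers 2 vCPU and 4 GiB. System reservations and per-node pod limits are ignored, so treat the node counts as a lower bound rather than a sizing guarantee.

```python
import math

# Per-deployment replica caps from the table above.
MAX_REPLICAS = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}

# Assumed per-pod footprint (the ~0.25 vCPU / 256 MB figure), treated as requests.
POD_CPU = 0.25
POD_MEM_MIB = 256

# t3.medium capacity: 2 vCPU, 4 GiB. System reservations are ignored here.
NODE_CPU = 2.0
NODE_MEM_MIB = 4096


def worst_case(caps: dict[str, int]) -> dict[str, float]:
    """Total pods, CPU, memory and minimum node count if every cap is reached."""
    pods = sum(caps.values())
    cpu = pods * POD_CPU
    mem_mib = pods * POD_MEM_MIB
    nodes = max(math.ceil(cpu / NODE_CPU), math.ceil(mem_mib / NODE_MEM_MIB))
    return {"pods": pods, "vcpu": cpu, "mem_gib": mem_mib / 1024, "nodes": nodes}


if __name__ == "__main__":
    print(worst_case(MAX_REPLICAS))
    # Global cap only: every deployment may reach 6 replicas.
    print(worst_case({name: 6 for name in MAX_REPLICAS}))
```

Under these assumptions CPU is the binding resource: 24 pods need three nodes, and a uniform cap of 6 on all five deployments (30 pods) needs four.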

### How the Cap Works

The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it:

```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```

This is enforced in code (`kubernetes_executor.py` line 115):
```python
desired = min(self.max_replicas, current + delta)
```

**Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.**
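
A minimal sketch of the clamping behaviour, not the actual `KubernetesExecutor` code. The upper bound mirrors the `min(self.max_replicas, current + delta)` line quoted above. The lower bound, the helper names `clamp_replicas` and `bounds_from_env`, and the minimum default of `1` are assumptions made for illustration.

```python
import os


def clamp_replicas(current: int, delta: int, min_replicas: int, max_replicas: int) -> int:
    """Return the replica count to request, bounded to [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, current + delta))


def bounds_from_env() -> tuple[int, int]:
    # 20 is the default cited in this guide; the minimum default of 1 is an assumption.
    max_replicas = int(os.environ.get("ANTIATROPOS_MAX_REPLICAS", "20"))
    min_replicas = int(os.environ.get("ANTIATROPOS_MIN_REPLICAS", "1"))
    return min_replicas, max_replicas


if __name__ == "__main__":
    os.environ["ANTIATROPOS_MAX_REPLICAS"] = "6"
    low, high = bounds_from_env()
    # A step of 3 from 4 replicas is cut off at the cap.
    print(clamp_replicas(current=4, delta=3, min_replicas=low, max_replicas=high))
    # Already at the cap: the request leaves the count unchanged.
    print(clamp_replicas(current=6, delta=3, min_replicas=low, max_replicas=high))
```

With the cap at 6, a `SCALE_UP` from 6 replicas returns 6, which corresponds to the "replicas unchanged" acknowledgement shown above.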

---

## 2. Autoscaling Configuration

### EKS Node Autoscaling

The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml
```

**The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.**

```
4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
```

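With spot instances at roughly 70% off, this drops to ~$36/month max. Both figures cover worker nodes only; the EKS control plane (~$73/month) is billed on top.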
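
The node-cost ceiling can be reproduced with a few lines of arithmetic. This sketch uses the hourly rate quoted above and the control-plane figure from Section 7. The 730-hour month and the flat 70% spot discount are assumptions; real spot prices vary.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours
T3_MEDIUM_ON_DEMAND = 0.0416  # USD per hour, the rate quoted above
EKS_CONTROL_PLANE_MONTHLY = 73.0  # USD, the figure quoted in Section 7


def monthly_node_cost(nodes: int, hourly_rate: float = T3_MEDIUM_ON_DEMAND,
                      spot_discount: float = 0.0) -> float:
    """Monthly cost of the worker nodes alone, in USD."""
    return nodes * hourly_rate * (1.0 - spot_discount) * HOURS_PER_MONTH


if __name__ == "__main__":
    on_demand = monthly_node_cost(4)
    spot = monthly_node_cost(4, spot_discount=0.70)
    print(f"4 nodes on-demand: ${on_demand:.2f}/month")
    print(f"4 nodes spot (70% off): ${spot:.2f}/month")
    print(f"on-demand + control plane: ${on_demand + EKS_CONTROL_PLANE_MONTHLY:.2f}/month")
```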

### What Happens When the Agent Scales Workloads

1. Agent on HF Spaces sends `SCALE_UP` action
2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via EKS API server
3. Kubernetes scheduler tries to place the new pod
4. If no node has capacity -> pod is `Pending`
5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`)
6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule)

**This is why `maxSize` in the node group is your ultimate cost ceiling.**
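
The steps above can be modelled as a toy placement function. This is a simplified sketch, not how the scheduler or Cluster Autoscaler is implemented. `pods_per_node` and `min_nodes` are hypothetical parameters; 8 pods per node corresponds to 2 vCPU divided by 0.25 vCPU per pod, with no allowance for system pods.

```python
import math
from dataclasses import dataclass


@dataclass
class Placement:
    nodes: int
    running: int
    pending: int


def place_pods(requested_pods: int, pods_per_node: int, min_nodes: int, max_nodes: int) -> Placement:
    """Grow the node count to fit the pods, but never past max_nodes."""
    wanted_nodes = math.ceil(requested_pods / pods_per_node)
    nodes = max(min_nodes, min(max_nodes, wanted_nodes))
    running = min(requested_pods, nodes * pods_per_node)
    return Placement(nodes=nodes, running=running, pending=requested_pods - running)


if __name__ == "__main__":
    # Within the ceiling: the node group grows and every pod schedules.
    print(place_pods(24, pods_per_node=8, min_nodes=1, max_nodes=4))
    # Past the ceiling: the scale request is accepted but the extra pods stay Pending.
    print(place_pods(40, pods_per_node=8, min_nodes=1, max_nodes=4))
```

The second call shows the failure mode from step 6: the node count stops at `max_nodes` and the remaining pods are counted as pending instead of adding cost.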

---

## 3. Cost Guardrails

### Monthly Cost Caps by Tier

| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
|---|---|---|---|
| **Dev/Testing** | 2 | 10 (2/deployment) | ~$80 |
| **Training** | 3 | 15 (3/deployment) | ~$130 |
| **Benchmark Suite** | 4 | 24 (~5/deployment) | ~$160 |
| **Unlimited (danger)** | inf | 100 (20/deployment) | $500+ |

### AWS Budgets — Get Alerts Before You Overspend

```bash
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "AntiAtropos-Monthly",
    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "TagKeyValue": ["user:Project$AntiAtropos"]
    },
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }'

# Alert at 50%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'

# Alert at 80%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
```

### Cost-Saving Checklist

- [ ] Use **spot instances** for node groups (60-70% cheaper, OK for training)
- [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway
- [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`)
- [ ] Set AWS Budget alert at $150/month
- [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
- [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos --region ap-south-1`
- [ ] AMP free tier covers a monthly allowance of ingested samples plus 10 GB of metric storage (AMP bills ingestion per sample, not per GB)
- [ ] AMG free tier is 1 editor for 30 days — cancel if not needed

---

## 4. Step-by-Step Deployment Walkthrough

### Before You Start

You need:
- AWS account with billing alerts enabled
- AWS CLI v2 installed and configured (`aws configure`)
- eksctl, kubectl, helm installed
- About 20-30 minutes

### Step 1: Create the EKS Cluster (15 min)

```bash
eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes
```

### Step 2: Deploy Sample Workloads (1 min)

```bash
kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre
```

### Step 3: Create AMP Workspace (1 min)

```bash
aws amp create-workspace --alias antiatropos-metrics --region ap-south-1

# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text
```

### Step 4: Set Up IRSA (2 min)

```bash
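# Prometheus agent needs to write to AMP.
# IRSA needs the cluster's IAM OIDC provider; if eksctl-cluster.yaml does not enable it, run first:
#   eksctl utils associate-iam-oidc-provider --cluster antiatropos --region ap-south-1 --approve
eksctl create iamserviceaccount \
  --cluster antiatropos \
  --region ap-south-1 \
  --namespace monitoring \
  --name prometheus-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve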
```

### Step 5: Install Prometheus Agent (2 min)

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace WORKSPACE_ID with the ID from Step 3.
# The prometheus-community/prometheus chart takes remote-write targets under server.remoteWrite.
helm install prometheus-agent prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  -f deploy/aws/prometheus-agent-values.yaml \
  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
```

### Step 6: Set Up AMG (5 min)

```bash
# Create IAM role for AMG
aws iam create-role \
  --role-name AntiAtroposGrafanaRole \
  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json

aws iam attach-role-policy \
  --role-name AntiAtroposGrafanaRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess

# Create workspace
aws grafana create-workspace \
  --workspace-name antiatropos-dashboards \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS \
  --region ap-south-1
```

Then in the AMG web UI:
1. Sign in with AWS SSO
2. Configuration -> Data Sources -> Add AMP workspace
3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/`
4. Select AMP data source when importing

### Step 7: Install Cluster Autoscaler (2 min)

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml
```

### Step 8: Generate Kubeconfig for HF Spaces (1 min)

```bash
./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
```

### Step 9: Configure HF Spaces

See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below.

---

## 5. Configuring HF Spaces to Connect to AWS

### Secrets (HF Space Settings -> Repository secrets)

| Secret | Value |
|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key |
| `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` |

To encode the kubeconfig:
```bash
# -w 0 is the GNU coreutils flag that disables line wrapping, so the secret is a single line
base64 -w 0 < deploy/aws/kubeconfig-antiatropos.yaml
```

### Environment Variables (HF Space Settings -> Variables)

| Variable | Value |
|---|---|
| `ANTIATROPOS_ENV_MODE` | `live` |
| `ANTIATROPOS_STRICT_REAL` | `false` |
| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
| `KUBECONFIG` | `/app/kubeconfig.yaml` |
| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
| `ANTIATROPOS_DEPLOYMENT_PREFIX` | `` (empty) |
| `ANTIATROPOS_MIN_REPLICAS` | `1` |
| `ANTIATROPOS_MAX_REPLICAS` | `6` |
| `ANTIATROPOS_SCALE_STEP` | `3` |
| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
| `ANTIATROPOS_WORKLOAD_MAP` | See below |

### Workload Map Value

```json
{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
```
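
A sketch of how a value like this could be parsed and used to resolve a simulator node ID to a Kubernetes target. It is not the project's actual loader. The names `Target`, `load_workload_map` and `resolve` are hypothetical, and so is the fallback for unmapped nodes (deployment prefix plus node ID in the default namespace).

```python
import json
import os
from typing import NamedTuple


class Target(NamedTuple):
    deployment: str
    namespace: str


def load_workload_map(raw: str) -> dict[str, Target]:
    """Parse the JSON value of ANTIATROPOS_WORKLOAD_MAP into node-id -> Target."""
    parsed = json.loads(raw) if raw.strip() else {}
    return {node: Target(entry["deployment"], entry["namespace"]) for node, entry in parsed.items()}


def resolve(node_id: str, workload_map: dict[str, Target],
            prefix: str = "", namespace: str = "prod-sre") -> Target:
    # Hypothetical fallback: unmapped nodes use prefix + node id in the default namespace.
    return workload_map.get(node_id, Target(f"{prefix}{node_id}", namespace))


if __name__ == "__main__":
    default = '{"node-0": {"deployment": "payments", "namespace": "prod-sre"}}'
    mapping = load_workload_map(os.environ.get("ANTIATROPOS_WORKLOAD_MAP", default))
    print(resolve("node-0", mapping))
    print(resolve("node-9", mapping))
```

Parsing the map once at startup surfaces malformed JSON or a missing `deployment` / `namespace` key immediately, rather than on the first scaling action.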

### Entrypoint Modification

Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret:

```bash
# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
    printf '%s' "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
    chmod 600 /app/kubeconfig.yaml  # the file holds cluster credentials
    export KUBECONFIG=/app/kubeconfig.yaml
fi
```
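
Before uploading the secret, it can help to confirm that the encoded value round-trips. The sketch below is an optional pre-flight check, not part of the project. `decode_kubeconfig` is a hypothetical helper, and the section check is a plain text match on the usual kubeconfig top-level keys, not a YAML parse.

```python
import base64
import binascii


def decode_kubeconfig(secret_value: str) -> str:
    """Decode a base64 kubeconfig secret and run a light sanity check on it."""
    try:
        # validate=True rejects stray characters such as embedded line breaks.
        text = base64.b64decode(secret_value.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError) as exc:
        raise ValueError("KUBECONFIG_CONTENT is not valid base64-encoded text") from exc
    missing = [key for key in ("clusters:", "users:", "contexts:") if key not in text]
    if missing:
        raise ValueError(f"decoded kubeconfig is missing sections: {missing}")
    return text


if __name__ == "__main__":
    sample = "apiVersion: v1\nclusters: []\ncontexts: []\nusers: []\n"
    encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")
    print(decode_kubeconfig(encoded) == sample)
```

A value that was wrapped across lines or truncated during copy and paste fails here with a `ValueError` instead of producing a broken `/app/kubeconfig.yaml` at container start.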

### Verifying the Connection

After deploying, check from HF Spaces that the server can reach AWS:

1. Check the HF Space logs for `antiatropos_step` events
2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS)
3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP)
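4. With `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock silently, so an error-free log alone does not prove the Space is reaching AWS; confirm against the cluster itself, for example with `kubectl get deployments -n prod-sre`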

---

## 6. Day-2 Operations

### Scaling Workloads Manually

```bash
# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4

# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all

# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1
```

### Pausing Everything (Without Deleting)

```bash
# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all

# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).
```

### Monitoring Agent Behavior

Watch what the SRE agent is doing in real-time:

```bash
# Check current replica counts per deployment (add -w to keep watching)
kubectl get deployments -n prod-sre

# List HorizontalPodAutoscalers, if any are defined
kubectl get hpa -A

# Check node pressure (requires metrics-server)
kubectl top nodes
```

### Checking Current Spend

```bash
# Current month cost by service.
# End is exclusive and must be later than Start, so this fails on the 1st of the month.
aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

### Regenerating Kubeconfig

If the EKS cluster is recreated or credentials expire:

```bash
./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
```

---

## 7. Teardown & Cost Recovery

### Partial Teardown (Keep Cluster, Stop Workloads)

```bash
kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes
```

### Full Teardown (Stop All Charges)

```bash
# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml

# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring

# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1

# Delete AMG workspace
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
aws grafana delete-workspace --workspace-id $AMG_WS_ID --region ap-south-1

# Delete IAM role for Grafana (only AmazonPrometheusQueryAccess was attached in Step 6)
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole

# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1

# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1
```

Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space.

---

## Quick Reference Card

| Task | Command |
|---|---|
| Deploy AWS infra | `./deploy/aws/deploy.sh` |
| Check workloads | `kubectl get pods -n prod-sre` |
| Check monitoring | `kubectl get pods -n monitoring` |
| Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` |
| Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` |
| Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` |
| Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` |
| Nuke the cluster (AMP, AMG and IAM are removed separately, see Section 7) | `eksctl delete cluster --name antiatropos --region ap-south-1` |