# OrgState: Load Testing

Locust-based perf harness (Stage 93). Prove the platform handles realistic concurrent load before committing to SLAs with paying customers.

## V1.1 SLA targets

Single-replica deploy (one API process, one scheduler, SQLite or Postgres). For multi-replica scale, numbers go up but the per-instance ratios should hold.

| Metric | Target | Notes |
|---|---|---|
| p50 read latency | < 100ms | dashboard browsing feels instant |
| p95 read latency | < 500ms | dashboard tolerable on slow networks |
| p50 write latency | < 500ms | observation ingestion responsive |
| p95 write latency | < 2000ms | batches of 5 rows |
| Error rate | < 1% | excluding 400/422 "bad data" cases |
| Throughput | ≥ 100 RPS | sustained, mixed read/write |

These are V1.1 floors, generous because we run a single replica with zero DB tuning. Tighten as scale grows.

## Install

The load tool is an **optional** dependency, not pulled in by `requirements-runtime.txt`.

```bash
pip install -r requirements-load.txt
```

## Run

### Headless smoke (CI-friendly)

```bash
./load/smoke.sh
```

Runs locust against `http://localhost:8080` for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise.
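
How `smoke.sh` computes its exit code is not shown here. As a minimal sketch, assuming the percentiles, error rate, and RPS have already been extracted from a run, the SLA table above could be turned into a pass/fail gate along these lines. The names `SLA_TARGETS`, `sla_violations`, and `exit_code` are hypothetical and not part of the harness.

```python
# Hypothetical SLA gate; thresholds copied from the V1.1 table (latencies in ms).
SLA_TARGETS = {
    "read_p50_ms": 100,
    "read_p95_ms": 500,
    "write_p50_ms": 500,
    "write_p95_ms": 2000,
}
MAX_ERROR_RATE = 0.01  # error rate must stay below 1% (400/422 excluded upstream)
MIN_RPS = 100          # sustained throughput floor


def sla_violations(measured: dict) -> list:
    """Return a list of breached targets; an empty list means the run passed."""
    problems = []
    # Latency targets are strict upper bounds ("< N ms").
    for key, limit in SLA_TARGETS.items():
        if measured[key] >= limit:
            problems.append(f"{key}={measured[key]} (target < {limit})")
    if measured["error_rate"] >= MAX_ERROR_RATE:
        problems.append(f"error_rate={measured['error_rate']} (target < {MAX_ERROR_RATE})")
    # Throughput is a floor (">= N RPS").
    if measured["rps"] < MIN_RPS:
        problems.append(f"rps={measured['rps']} (target >= {MIN_RPS})")
    return problems


def exit_code(measured: dict) -> int:
    """Map the check onto the 0/1 convention described above."""
    return 1 if sla_violations(measured) else 0
```

A caller would build `measured` from whatever the run reports and pass `exit_code(measured)` to `sys.exit`.
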

`smoke.sh` assumes you started the API yourself:

```bash
# terminal 1
ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
  python -m uvicorn infra.api.app:app --port 8080

# terminal 2: bootstrap the tenant + key the locustfile uses
python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
  --mint-operator > /tmp/acme.json
export LOCUST_TENANT_ID=acme
export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \
  /tmp/acme.json)

# terminal 2: run the smoke
./load/smoke.sh
```

### Interactive (Web UI)

```bash
locust -f load/locustfile.py --host http://localhost:8080
# open http://localhost:8089, then set user count + spawn rate + run time
```

### Full perf run (2 minutes, 50 users)

```bash
locust -f load/locustfile.py \
  --host https://api.orgstate.example \
  --users 50 --spawn-rate 5 --run-time 2m \
  --headless --csv perf-2026-05-18
```

Outputs `perf-2026-05-18_stats.csv` (per-endpoint p50/p95/p99/RPS) and `perf-2026-05-18_failures.csv`.

## Configuration (env vars)

| Var | Required? | Default | Why |
|---|---|---|---|
| `LOCUST_TENANT_ID` | recommended | `acme` | which tenant to hammer |
| `LOCUST_API_KEY` | yes | unset | bearer token (operator role) |
| `LOCUST_ENTITY_TYPE` | no | `warehouse` | for ingestion/run tasks |
| `LOCUST_VERTICAL` | no | `logistics` | for run trigger |

## Scenario mix

The locustfile uses **weighted task selection** to approximate real customer traffic:

* **70% reads** (`health`, `tenant`, `runs`, `usage`, `webhooks`): dashboard browsing.
* **25% writes** (`POST /observations`): data feed ingestion.
* **5% trigger** (`POST /observations/run`): pipeline runs.

Adjust per-customer by editing the `@task(N)` weights in `locustfile.py`.

## Interpreting results

What to look for in the headless CSV or the Web UI:

1. **Failures column should be < 1%** of total requests. Anything higher means the platform is shedding load; investigate before tightening SLAs.
2. **p95 columns** must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
3. **RPS** should hit the throughput floor sustainably (not just at peak). A drop-off after warm-up usually points to GC pressure or DB contention.
4. **Spike test** (`--users 200 --spawn-rate 50 --run-time 30s`): RPS should plateau, not crash. p95 may briefly spike during ramp.

## What this DOESN'T cover (yet)

* **Multi-tenant interleaving**: the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
* **Long-tail latency** (p99.9): single-machine locust can't reliably measure beyond p99. Use distributed mode (`--master` + `--worker`) for production-scale runs.
* **Sustained 24h runs**: the harness is designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.

## Day-2 ops

When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):

1. **DB index drift**: run `EXPLAIN QUERY PLAN` (SQLite) or `EXPLAIN` (Postgres) on the queries behind the slowest endpoint. Both backends tend to miss indexes on aggregated reports.
2. **`/metrics` scrape interval too aggressive**: a typical Prometheus interval of 15s is fine; 1s slows everyone.
3. **Webhook deliveries blocking**: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via `infra webhook deliveries list`.
4. **Audit log too big**: Stage 91 retention purge keeps it bounded; set `ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90` and run the purge nightly.

See `RUNBOOK.md` § 7 for the full incident triage decision tree.
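
Returning to the sustained-throughput check under "Interpreting results": the difference between hitting the floor at peak and holding it can be made concrete with a small helper. This is a sketch under assumptions, not part of the harness. It takes a list of RPS samples at a fixed interval (however you obtain them), discards a warm-up window, and compares the early and late halves of the remainder. The function name, the `warmup` length, and the 20% `max_drop` threshold are all illustrative choices.

```python
from statistics import mean


def throughput_report(samples, floor=100.0, warmup=10, max_drop=0.2):
    """Summarise whether RPS is sustained after warm-up.

    samples  -- RPS values at a fixed interval, oldest first
    floor    -- throughput floor from the SLA table
    warmup   -- number of leading samples to ignore (assumed ramp-up)
    max_drop -- relative early-to-late decline treated as a drop-off
    """
    steady = samples[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two samples after the warm-up window")

    mid = len(steady) // 2
    early_mean = mean(steady[:mid])
    late_mean = mean(steady[mid:])

    # Relative decline from the early half to the late half of the run.
    drop = 0.0 if early_mean == 0 else 1 - late_mean / early_mean

    return {
        "early_rps": early_mean,
        "late_rps": late_mean,
        # Judge the floor on the late half, so a strong start cannot mask decay.
        "meets_floor": late_mean >= floor,
        "drop_off": drop > max_drop,
    }
```

A run whose overall average clears 100 RPS can still fail here if the second half decays, which is the pattern the GC-pressure and DB-contention note describes.
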