| # OrgState: Load Testing |
|
|
| Locust-based perf harness (Stage 93). Prove the platform handles realistic concurrent load before committing to SLAs with paying customers. |
|
|
| ## V1.1 SLA targets |
|
|
| Single-replica deploy (one API process, one scheduler, SQLite or Postgres). With multiple replicas, absolute throughput should rise, but the per-instance figures should stay roughly the same. |
|
|
| | Metric | Target | Notes | |
| |---|---|---| |
| | p50 read latency | < 100ms | dashboard browsing feels instant | |
| | p95 read latency | < 500ms | dashboard tolerable on slow networks | |
| | p50 write latency | < 500ms | observation ingestion responsive | |
| | p95 write latency | < 2000ms | batches of 5 rows | |
| | Error rate | < 1% | excluding 400/422 "bad data" cases | |
| | Throughput | ≥ 100 RPS | sustained, mixed read/write | |
|
|
| These are V1.1 floors, generous because we run a single replica with zero DB tuning. Tighten them as scale grows. |
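
The table can also be written down as data, so a script can check a run against it. The following is a minimal sketch, not the logic `smoke.sh` actually uses. The dictionary keys and the `sla_violations` helper are hypothetical names chosen for illustration. Only the numeric targets come from the table.

```python
# Targets copied from the SLA table. Latencies are in milliseconds.
# Key names are illustrative assumptions, not part of the harness.
SLA_TARGETS = {
    "p50_read_ms": 100,
    "p95_read_ms": 500,
    "p50_write_ms": 500,
    "p95_write_ms": 2000,
    "error_rate": 0.01,  # 1%, excluding 400/422 "bad data" responses
    "min_rps": 100,
}


def sla_violations(measured: dict) -> list:
    """Return human-readable violations; an empty list means every target is met."""
    violations = []
    for key, limit in SLA_TARGETS.items():
        value = measured[key]
        if key == "min_rps":
            ok = value >= limit  # throughput is a floor
        else:
            ok = value < limit   # latency and error rate are strict ceilings
        if not ok:
            violations.append(f"{key}: measured {value}, target {limit}")
    return violations
```

A caller would fill `measured` from a run's results and exit non-zero when the returned list is non-empty.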
|
|
| ## Install |
|
|
| The load tool is an **optional** dependency; `requirements-runtime.txt` does not pull it in. |
|
|
| ```bash |
| pip install -r requirements-load.txt |
| ``` |
|
|
| ## Run |
|
|
| ### Headless smoke (CI-friendly) |
|
|
| ```bash |
| ./load/smoke.sh |
| ``` |
|
|
| Runs locust against `http://localhost:8080` for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself: |
|
|
| ```bash |
| # terminal 1 |
| ORGSTATE_DB_PATH=/tmp/load.sqlite3 \ |
| python -m uvicorn infra.api.app:app --port 8080 |
| |
| # terminal 2: bootstrap the tenant and the key the locustfile uses |
| python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \ |
| --mint-operator > /tmp/acme.json |
| export LOCUST_TENANT_ID=acme |
| export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \ |
| /tmp/acme.json) |
| |
| # terminal 2: run the smoke |
| ./load/smoke.sh |
| ``` |
|
|
| ### Interactive (Web UI) |
|
|
| ```bash |
| locust -f load/locustfile.py --host http://localhost:8080 |
| # open http://localhost:8089, then set user count, spawn rate, and run time |
| ``` |
|
|
| ### Full perf run (2 minutes, 50 users) |
|
|
| ```bash |
| locust -f load/locustfile.py \ |
| --host https://api.orgstate.example \ |
| --users 50 --spawn-rate 5 --run-time 2m \ |
| --headless --csv perf-2026-05-18 |
| ``` |
|
|
| Outputs `perf-2026-05-18_stats.csv` (per-endpoint p50/p95/p99/RPS) and `perf-2026-05-18_failures.csv`. |
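
For post-processing, the stats file can be read with the standard `csv` module. This is a sketch under assumptions: the column names (`Type`, `Name`, `Request Count`, `Failure Count`, `50%`, `95%`, `99%`, `Requests/s`) match what recent Locust versions write, but check them against the header of your own file. The `summarize_stats` helper and the sample rows are made up for illustration and are not real measurements.

```python
import csv
import io


def summarize_stats(csv_text: str) -> dict:
    """Map 'METHOD name' to p50/p95/p99 (ms), RPS, and failure ratio."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        requests = int(row["Request Count"])
        failures = int(row["Failure Count"])
        # The aggregated row has an empty Type, so its key is just "Aggregated".
        key = f'{row["Type"]} {row["Name"]}'.strip()
        summary[key] = {
            "p50_ms": float(row["50%"]),
            "p95_ms": float(row["95%"]),
            "p99_ms": float(row["99%"]),
            "rps": float(row["Requests/s"]),
            "failure_ratio": failures / requests if requests else 0.0,
        }
    return summary


# Made-up rows for illustration only; endpoint paths are assumptions.
SAMPLE = """Type,Name,Request Count,Failure Count,50%,95%,99%,Requests/s
GET,/health,1000,0,12,40,85,8.3
POST,/observations,400,2,180,900,1500,3.3
,Aggregated,1400,2,20,600,1400,11.6
"""

if __name__ == "__main__":
    for endpoint, stats in summarize_stats(SAMPLE).items():
        print(endpoint, stats)
```

For a real run, pass the contents of `perf-2026-05-18_stats.csv` instead of `SAMPLE`. Rows whose percentile cells are not numeric would need extra handling that this sketch omits.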
|
|
| ## Configuration (env vars) |
|
|
| | Var | Required? | Default | Why | |
| |---|---|---|---| |
| | `LOCUST_TENANT_ID` | recommended | `acme` | which tenant to hammer | |
| | `LOCUST_API_KEY` | yes | unset | bearer token (operator role) | |
| | `LOCUST_ENTITY_TYPE` | no | `warehouse` | for ingestion/run tasks | |
| | `LOCUST_VERTICAL` | no | `logistics` | for run trigger | |
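
A locustfile could resolve these variables roughly as follows. This is a sketch, not the actual contents of `load/locustfile.py`. The `LoadConfig` class and `load_config` function are hypothetical names. The variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadConfig:
    tenant_id: str
    api_key: str
    entity_type: str
    vertical: str


def load_config(env=None) -> LoadConfig:
    """Read the LOCUST_* settings, applying the defaults from the table."""
    env = os.environ if env is None else env
    api_key = env.get("LOCUST_API_KEY")
    if not api_key:
        # The only required variable: fail early rather than send unauthenticated load.
        raise RuntimeError("LOCUST_API_KEY must be set (operator-role bearer token)")
    return LoadConfig(
        tenant_id=env.get("LOCUST_TENANT_ID", "acme"),
        api_key=api_key,
        entity_type=env.get("LOCUST_ENTITY_TYPE", "warehouse"),
        vertical=env.get("LOCUST_VERTICAL", "logistics"),
    )
```

Accepting a mapping as the `env` argument keeps the function testable without touching the real process environment.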
|
|
| ## Scenario mix |
|
|
| The locustfile uses **weighted task selection** to approximate real customer traffic: |
|
|
| * **70% reads** (`health`, `tenant`, `runs`, `usage`, `webhooks`): dashboard browsing. |
| * **25% writes** (`POST /observations`): data feed ingestion. |
| * **5% trigger** (`POST /observations/run`): pipeline runs. |
|
|
| Adjust per-customer by editing the `@task(N)` weights in `locustfile.py`. |
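
Locust picks the next task at random in proportion to its `@task(N)` weight, so each task's share of traffic is its weight divided by the sum of all weights. The sketch below shows that arithmetic. The specific weights are hypothetical (five read tasks at 14 each, chosen so the totals match the 70/25/5 split above) and may not match the real `locustfile.py`.

```python
def task_shares(weights: dict) -> dict:
    """Convert @task(N) weights into percentage shares of total traffic."""
    total = sum(weights.values())
    return {name: 100 * weight / total for name, weight in weights.items()}


# Hypothetical weights; the task names are illustrative.
WEIGHTS = {
    "health": 14,
    "tenant": 14,
    "runs": 14,
    "usage": 14,
    "webhooks": 14,
    "post_observations": 25,
    "trigger_run": 5,
}

READ_TASKS = ("health", "tenant", "runs", "usage", "webhooks")

if __name__ == "__main__":
    shares = task_shares(WEIGHTS)
    print("reads:", sum(shares[t] for t in READ_TASKS))
    print("writes:", shares["post_observations"])
    print("trigger:", shares["trigger_run"])

    # An ingestion-heavy customer: doubling the write weight shifts every share,
    # because the other weights are now divided by a larger total.
    heavy = task_shares(dict(WEIGHTS, post_observations=50))
    print("writes (ingestion-heavy):", heavy["post_observations"])
```

The point of the second calculation is that weights are relative: raising one weight lowers every other task's percentage, even though their weights are unchanged.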
|
|
| ## Interpreting results |
|
|
| What to look for in the headless CSV or the Web UI: |
|
|
| 1. **Failures column should be < 1%** of total requests. Anything higher means the platform is shedding load; investigate before tightening SLAs. |
| 2. **p95 columns** must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish. |
| 3. **RPS** should hit the throughput floor sustainably (not just at peak). A drop-off after warm-up usually points to GC pressure or DB contention. |
| 4. **Spike test** (`--users 200 --spawn-rate 50 --run-time 30s`): RPS should plateau, not crash. p95 may briefly spike during ramp-up. |
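
The "sustained, not just peak" check in item 3 can be approximated from a series of per-interval RPS samples. This is a rough sketch. The `sustained_rps_ok` helper, the 5-sample warm-up window, and the 20% drop threshold are all assumptions for illustration, not values the harness defines.

```python
from statistics import mean


def sustained_rps_ok(rps_samples, floor=100.0, warmup=5, max_drop=0.2):
    """Check that RPS holds the floor and does not sag after warm-up.

    rps_samples: throughput readings taken at a fixed interval.
    warmup:      number of leading samples to discard (ramp-up period).
    max_drop:    tolerated relative drop from the first to the second half.
    """
    steady = rps_samples[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two post-warm-up samples")
    half = len(steady) // 2
    early, late = mean(steady[:half]), mean(steady[half:])
    dropped = late < early * (1 - max_drop)
    return mean(steady) >= floor and not dropped
```

A run that averages above the floor but loses more than `max_drop` of its early throughput still fails, which is the drop-off pattern described in item 3.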
|
|
| ## What this DOESN'T cover (yet) |
|
|
| * **Multi-tenant interleaving**: the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate. |
| * **Long-tail latency** (p99.9): single-machine locust can't reliably measure beyond p99. Use distributed mode (`--master` + `--worker`) for production-scale runs. |
| * **Sustained 24h runs**: the harness is designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows. |
|
|
| ## Day-2 ops |
|
|
| When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood): |
|
|
| 1. **DB index drift**: run `EXPLAIN QUERY PLAN` (SQLite) or `EXPLAIN ANALYZE` (Postgres) on the queries behind the slowest endpoint. Both databases tend to miss indexes on aggregated reports. |
| 2. **`/metrics` scrape interval too aggressive**: a 15s Prometheus scrape interval is fine; 1s slows everyone down. |
| 3. **Webhook deliveries blocking**: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via `infra webhook deliveries list`. |
| 4. **Audit log too big**: Stage 91 retention purge keeps it bounded; set `ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90` and run the purge nightly. |
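
For the SQLite case in item 1, the plan check can be scripted with the standard `sqlite3` module. The sketch below uses an in-memory database with a hypothetical `observations` table and index name; the real schema and queries will differ. It shows how the plan changes from a full scan to an index search once a matching index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # columns: id, parent, notused, detail


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (id INTEGER PRIMARY KEY, tenant_id TEXT, value REAL)"
)

# An aggregated report filtered by tenant, similar in shape to a dashboard query.
sql = "SELECT COUNT(*), AVG(value) FROM observations WHERE tenant_id = ?"

before = query_plan(conn, sql, ("acme",))  # no index yet: expect a SCAN
conn.execute("CREATE INDEX idx_observations_tenant ON observations (tenant_id)")
after = query_plan(conn, sql, ("acme",))   # index available: expect a SEARCH

if __name__ == "__main__":
    print("before:", before)
    print("after: ", after)
```

A plan line that starts with `SCAN` on a large table is the usual sign of a missing index. The exact wording of the detail text varies between SQLite versions.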
|
|
| See `RUNBOOK.md` § 7 for the full incident triage decision tree. |
|
|