OrgState: Load Testing
Locust-based perf harness (Stage 93). Prove the platform handles realistic concurrent load before committing to SLAs with paying customers.
V1.1 SLA targets
Single-replica deploy (one API process, one scheduler, SQLite or Postgres). For multi-replica scale, numbers go up but the per-instance ratios should hold.
| Metric | Target | Notes |
|---|---|---|
| p50 read latency | < 100ms | dashboard browsing feels instant |
| p95 read latency | < 500ms | dashboard tolerable on slow networks |
| p50 write latency | < 500ms | observation ingestion responsive |
| p95 write latency | < 2 000ms | batches of 5 rows |
| Error rate | < 1% | excluding 400/422 "bad data" cases |
| Throughput | ≥ 100 RPS | sustained, mixed read/write |
These are V1.1 floors, generous because we run a single replica with zero DB tuning. Tighten as scale grows.
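The targets in the table above can be written down as a small pass/fail check. This is a minimal sketch, not the logic smoke.sh actually uses: `RunSummary` and `sla_violations` are hypothetical names, and the caller is assumed to have already computed the six headline numbers for a run.

```python
from dataclasses import dataclass

# V1.1 targets copied from the SLA table.
P50_READ_MS = 100
P95_READ_MS = 500
P50_WRITE_MS = 500
P95_WRITE_MS = 2000
MAX_ERROR_RATE = 0.01  # 1%, excluding 400/422 responses
MIN_RPS = 100


@dataclass
class RunSummary:
    """Hypothetical container for one load run's headline numbers."""
    p50_read_ms: float
    p95_read_ms: float
    p50_write_ms: float
    p95_write_ms: float
    error_rate: float  # failures / total requests, as a fraction
    rps: float         # sustained requests per second


def sla_violations(run: RunSummary) -> list[str]:
    """Return the names of breached targets; an empty list means pass."""
    checks = [
        ("p50 read latency", run.p50_read_ms < P50_READ_MS),
        ("p95 read latency", run.p95_read_ms < P95_READ_MS),
        ("p50 write latency", run.p50_write_ms < P50_WRITE_MS),
        ("p95 write latency", run.p95_write_ms < P95_WRITE_MS),
        ("error rate", run.error_rate < MAX_ERROR_RATE),
        ("throughput", run.rps >= MIN_RPS),
    ]
    return [name for name, ok in checks if not ok]
```

A CI wrapper could exit non-zero whenever `sla_violations(...)` returns a non-empty list.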
Install
The load tool is an optional dependency, not pulled in by requirements-runtime.txt.
pip install -r requirements-load.txt
Run
Headless smoke (CI-friendly)
./load/smoke.sh
Runs locust against http://localhost:8080 for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself:
# terminal 1
ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
python -m uvicorn infra.api.app:app --port 8080
# terminal 2: bootstrap the tenant + key the locustfile uses
python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
--mint-operator > /tmp/acme.json
export LOCUST_TENANT_ID=acme
export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \
/tmp/acme.json)
# terminal 2: run the smoke
./load/smoke.sh
Interactive (Web UI)
locust -f load/locustfile.py --host http://localhost:8080
# open http://localhost:8089, then set user count, spawn rate, and run time
Full perf run (2 minutes, 50 users)
locust -f load/locustfile.py \
--host https://api.orgstate.example \
--users 50 --spawn-rate 5 --run-time 2m \
--headless --csv perf-2026-05-18
Outputs perf-2026-05-18_stats.csv (per-endpoint p50/p95/p99/RPS) and perf-2026-05-18_failures.csv.
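The stats CSV can be summarized with the standard library alone. The sketch below assumes the column names `Name`, `Request Count`, `Failure Count`, `95%`, and `Requests/s`, which match what recent Locust releases write as far as I know; check the header row of your own file before relying on it. `summarize_stats` is a hypothetical helper, the sample rows are made up for illustration, and the sketch assumes every percentile cell is numeric.

```python
import csv
import io


def summarize_stats(csv_text: str) -> dict[str, dict[str, float]]:
    """Map each row name to its p95 (ms), failure ratio, and RPS."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        requests = int(row["Request Count"])
        failures = int(row["Failure Count"])
        summary[row["Name"]] = {
            "p95_ms": float(row["95%"]),
            # Guard against rows that recorded no requests at all.
            "failure_ratio": failures / requests if requests else 0.0,
            "rps": float(row["Requests/s"]),
        }
    return summary


# Made-up sample rows for illustration only, not real measurements.
SAMPLE = """Type,Name,Request Count,Failure Count,95%,Requests/s
GET,/health,1000,0,40,50.0
POST,/observations,400,4,900,20.0
,Aggregated,1400,4,600,70.0
"""

if __name__ == "__main__":
    for name, stats in summarize_stats(SAMPLE).items():
        print(name, stats)
```

For a real run, pass the file contents instead of `SAMPLE`, for example `summarize_stats(pathlib.Path("perf-2026-05-18_stats.csv").read_text())`.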
Configuration (env vars)
| Var | Required? | Default | Why |
|---|---|---|---|
| LOCUST_TENANT_ID | recommended | acme | which tenant to hammer |
| LOCUST_API_KEY | yes | unset | bearer token (operator role) |
| LOCUST_ENTITY_TYPE | no | warehouse | for ingestion/run tasks |
| LOCUST_VERTICAL | no | logistics | for run trigger |
Scenario mix
The locustfile uses weighted task selection to approximate real customer traffic:
- 70% reads (health, tenant, runs, usage, webhooks): dashboard browsing.
- 25% writes (POST /observations): data feed ingestion.
- 5% trigger (POST /observations/run): pipeline runs.
Adjust per-customer by editing the @task(N) weights in locustfile.py.
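Locust picks the next task with probability proportional to its weight, so a set of @task(N) weights translates directly into a traffic share. The sketch below does that arithmetic; `task_shares` is a hypothetical helper, and the weights shown are an assumed split that reproduces the 70/25/5 mix (five read tasks at 14 each), not necessarily the values in locustfile.py.

```python
def task_shares(weights: dict[str, int]) -> dict[str, float]:
    """Convert @task(N) weights into each task's share of executions, in percent."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: 100 * weight / total for name, weight in weights.items()}


# Hypothetical weights: the five read tasks are assumed to split the
# 70% read budget equally; the real locustfile may weight them differently.
WEIGHTS = {
    "health": 14,
    "tenant": 14,
    "runs": 14,
    "usage": 14,
    "webhooks": 14,
    "post_observations": 25,
    "post_observations_run": 5,
}

if __name__ == "__main__":
    for name, share in task_shares(WEIGHTS).items():
        print(f"{name}: {share:.1f}%")
```

Running this after editing the weights is a quick way to confirm the mix still matches the customer profile you intend to simulate. Note that shares are per task execution; a task that issues several requests contributes more than its share of raw requests.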
Interpreting results
What to look for in the headless CSV or Web UI:
- Failures column should be < 1% of total requests. Anything higher means the platform is shedding load; investigate before tightening SLAs.
- p95 columns must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
- RPS should hit the throughput floor sustainably (not just peak). A drop-off after warm-up means GC pressure or DB contention.
- Spike test (--users 200 --spawn-rate 50 --run-time 30s): RPS should plateau, not crash. p95 may briefly spike during ramp.
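The "sustained, not just peak" check can be made concrete by comparing RPS early and late in the steady phase. This is a rough sketch under stated assumptions: the input is a list of per-interval RPS samples you have collected yourself, `sustained_rps` and `drops_off` are hypothetical names, and the 10% tolerance is an arbitrary starting point rather than a project threshold.

```python
from statistics import mean


def sustained_rps(samples: list[float], warmup: int) -> float:
    """Mean RPS after discarding the first `warmup` samples (the ramp)."""
    steady = samples[warmup:]
    if not steady:
        raise ValueError("no samples left after warm-up")
    return mean(steady)


def drops_off(samples: list[float], warmup: int, tolerance: float = 0.10) -> bool:
    """True if the late steady phase is more than `tolerance` below the early one."""
    steady = samples[warmup:]
    half = len(steady) // 2
    if half == 0:
        raise ValueError("need at least two steady-state samples")
    early = mean(steady[:half])
    late = mean(steady[half:])
    return late < early * (1 - tolerance)


if __name__ == "__main__":
    # Made-up samples: two ramp intervals, then four steady intervals.
    stable = [20.0, 60.0, 110.0, 112.0, 108.0, 110.0]
    print(sustained_rps(stable, warmup=2), drops_off(stable, warmup=2))
```

A run passes the throughput target only if `sustained_rps` clears the floor and `drops_off` is false; a high peak followed by decay should count as a failure.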
What this DOESN'T cover (yet)
- Multi-tenant interleaving: the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
- Long-tail latency (p99.9): single-machine locust can't reliably measure beyond p99. Use distributed mode (--master + --worker) for production-scale runs.
- Sustained 24h runs: designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.
Day-2 ops
When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):
- DB index drift: run EXPLAIN QUERY PLAN (SQLite) or EXPLAIN ANALYZE (Postgres) on the queries behind the slowest endpoint. SQLite + Postgres tend to miss indexes on aggregated reports.
- /metrics scrape interval too aggressive: the common 15s Prometheus interval is fine; 1s slows everyone.
- Webhook deliveries blocking: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via infra webhook deliveries list.
- Audit log too big: Stage 91 retention purge keeps it bounded; set ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90 and run nightly.
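For the index-drift check on a SQLite deploy, the query plan can be inspected from Python with the standard sqlite3 module. This is a minimal sketch: `query_plan` and `full_scans` are hypothetical helpers, the `observations` table and its columns are invented for the demo and are not the real OrgState schema, and the "starts with SCAN" test is a heuristic over SQLite's plan text. Postgres needs its own EXPLAIN and is not covered here.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[str]:
    """Return the 'detail' text of each EXPLAIN QUERY PLAN row for one statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the plan detail


def full_scans(plan: list[str]) -> list[str]:
    """Heuristic: SQLite labels a scan step 'SCAN ...' and an indexed lookup 'SEARCH ...'."""
    return [step for step in plan if step.startswith("SCAN")]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical table for the demo, not the real OrgState schema.
    conn.execute(
        "CREATE TABLE observations (tenant_id TEXT, entity_type TEXT, value REAL)"
    )
    sql = "SELECT value FROM observations WHERE tenant_id = ?"

    print(query_plan(conn, sql, ("acme",)))  # no index yet: expect a scan step

    conn.execute("CREATE INDEX idx_obs_tenant ON observations (tenant_id)")
    print(query_plan(conn, sql, ("acme",)))  # index exists: expect a search step
```

Pointing `sqlite3.connect` at a copy of the load database (for example /tmp/load.sqlite3) and feeding it the slow endpoint's actual query shows whether an index is being used.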
See RUNBOOK.md § 7 for the full incident triage decision tree.