# OrgState: Load Testing
Locust-based perf harness (Stage 93). Prove the platform handles realistic concurrent load before committing to SLAs with paying customers.
## V1.1 SLA targets
Single-replica deploy (one API process, one scheduler, SQLite or Postgres). With multiple replicas, aggregate numbers go up but the per-instance ratios should hold.
| Metric | Target | Notes |
|---|---|---|
| p50 read latency | < 100ms | dashboard browsing feels instant |
| p95 read latency | < 500ms | dashboard tolerable on slow networks |
| p50 write latency | < 500ms | observation ingestion responsive |
| p95 write latency | < 2000ms | batches of 5 rows |
| Error rate | < 1% | excluding 400/422 "bad data" cases |
| Throughput | ≥ 100 RPS | sustained, mixed read/write |
These are V1.1 floors: generous because we run a single replica + zero DB tuning. Tighten as scale grows.
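As a rough illustration, the targets in the table could be encoded as a pass/fail gate along these lines. This is a sketch, not the harness's actual code: `SlaTargets`, `RunSummary`, and `check_sla` are hypothetical names, and only the threshold values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    """V1.1 targets copied from the table above (latencies in ms)."""
    p50_read_ms: float = 100
    p95_read_ms: float = 500
    p50_write_ms: float = 500
    p95_write_ms: float = 2000
    max_error_rate: float = 0.01  # 1%, excluding 400/422 responses
    min_rps: float = 100


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one load run; field names are illustrative."""
    p50_read_ms: float
    p95_read_ms: float
    p50_write_ms: float
    p95_write_ms: float
    error_rate: float  # fraction of failed requests, 400/422 already excluded
    rps: float


def check_sla(run, targets=SlaTargets()):
    """Return a list of violations; an empty list means the run passed."""
    violations = []
    # Latency targets are strict upper bounds ("< N ms" in the table).
    for name in ("p50_read_ms", "p95_read_ms", "p50_write_ms", "p95_write_ms"):
        actual, limit = getattr(run, name), getattr(targets, name)
        if actual >= limit:
            violations.append(f"{name}: {actual}ms is not below {limit}ms")
    if run.error_rate >= targets.max_error_rate:
        violations.append(f"error_rate: {run.error_rate:.2%} is not below "
                          f"{targets.max_error_rate:.0%}")
    # Throughput is a floor, so the comparison flips.
    if run.rps < targets.min_rps:
        violations.append(f"rps: {run.rps} is below {targets.min_rps}")
    return violations
```

A gate like this maps naturally onto an exit code: exit 0 when the list is empty, 1 otherwise.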
## Install
The load tool is an **optional** dep, not pulled by `requirements-runtime.txt`.
```bash
pip install -r requirements-load.txt
```
## Run
### Headless smoke (CI-friendly)
```bash
./load/smoke.sh
```
Runs locust against `http://localhost:8080` for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself:
```bash
# terminal 1
ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
  python -m uvicorn infra.api.app:app --port 8080
# terminal 2: bootstrap the tenant + key the locustfile uses
python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
  --mint-operator > /tmp/acme.json
export LOCUST_TENANT_ID=acme
export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \
  /tmp/acme.json)
# terminal 2: run the smoke
./load/smoke.sh
```
### Interactive (Web UI)
```bash
locust -f load/locustfile.py --host http://localhost:8080
# open http://localhost:8089, then set user count + spawn rate + run time
```
### Full perf run (2 minutes, 50 users)
```bash
locust -f load/locustfile.py \
  --host https://api.orgstate.example \
  --users 50 --spawn-rate 5 --run-time 2m \
  --headless --csv perf-2026-05-18
```
Outputs `perf-2026-05-18_stats.csv` (per-endpoint p50/p95/p99/RPS) and `perf-2026-05-18_failures.csv`.
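The stats file can be summarised with a few lines of Python. The sketch below assumes the column names used by recent Locust releases (`Name`, `Request Count`, `Failure Count`, `50%`, `95%`, `Requests/s`) and a summary row named `Aggregated`; check the header row of your own CSV before relying on it. `parse_aggregated` is a hypothetical helper, not part of the harness.

```python
import csv


def parse_aggregated(lines):
    """Extract headline numbers from the 'Aggregated' row of a Locust stats CSV.

    `lines` is any iterable of CSV text lines, e.g. an open file handle.
    Column names are an assumption about the Locust CSV layout.
    """
    for row in csv.DictReader(lines):
        if row["Name"] == "Aggregated":
            requests = int(row["Request Count"])
            failures = int(row["Failure Count"])
            return {
                "requests": requests,
                # Guard against an empty run to avoid dividing by zero.
                "error_rate": failures / requests if requests else 0.0,
                "p50_ms": float(row["50%"]),
                "p95_ms": float(row["95%"]),
                "rps": float(row["Requests/s"]),
            }
    raise ValueError("no 'Aggregated' row found")
```

Typical usage would be `with open("perf-2026-05-18_stats.csv", newline="") as fh: print(parse_aggregated(fh))`. The aggregated row mixes reads and writes, so comparing against the separate read and write targets means filtering the per-endpoint rows instead.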
## Configuration (env vars)
| Var | Required? | Default | Why |
|---|---|---|---|
| `LOCUST_TENANT_ID` | recommended | `acme` | which tenant to hammer |
| `LOCUST_API_KEY` | yes | unset | bearer token (operator role) |
| `LOCUST_ENTITY_TYPE` | no | `warehouse` | for ingestion/run tasks |
| `LOCUST_VERTICAL` | no | `logistics` | for run trigger |
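A locustfile would typically read these variables once at import time. The following is a minimal sketch of that pattern using the defaults from the table; `load_config` and the dictionary keys are hypothetical, and the real `locustfile.py` may structure this differently.

```python
import os


def load_config(env=None):
    """Read the LOCUST_* variables, applying the defaults from the table above."""
    env = os.environ if env is None else env
    api_key = env.get("LOCUST_API_KEY")
    if not api_key:
        # The only variable without a default: fail fast instead of sending
        # unauthenticated requests for the whole run.
        raise RuntimeError("LOCUST_API_KEY is required (operator-role bearer token)")
    return {
        "tenant_id": env.get("LOCUST_TENANT_ID", "acme"),
        "api_key": api_key,
        "entity_type": env.get("LOCUST_ENTITY_TYPE", "warehouse"),
        "vertical": env.get("LOCUST_VERTICAL", "logistics"),
    }
```

Passing a plain dict as `env` makes the defaults easy to check without touching the real environment.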
## Scenario mix
The locustfile uses **weighted task selection** to approximate real customer traffic:
* **70% reads** (`health`, `tenant`, `runs`, `usage`, `webhooks`): dashboard browsing.
* **25% writes** (`POST /observations`): data feed ingestion.
* **5% trigger** (`POST /observations/run`): pipeline runs.
Adjust per-customer by editing the `@task(N)` weights in `locustfile.py`.
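Locust picks each task with probability proportional to its weight, so a task's share of traffic is its weight divided by the sum of all weights. The sketch below reproduces that arithmetic in plain Python. The group-level weights are an assumption chosen to match the 70/25/5 mix; the actual `@task(N)` values in `locustfile.py` may be split differently across endpoints.

```python
import random
from collections import Counter

# Assumed group-level weights matching the 70/25/5 mix described above.
WEIGHTS = {"reads": 70, "writes": 25, "trigger": 5}


def expected_share(weights):
    """Fraction of requests each task gets: its weight over the sum of weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def simulate(weights, n, seed=0):
    """Draw n tasks by weight to see the mix emerge empirically."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    return Counter(picks)
```

For example, raising the `writes` weight to 50 while leaving the others unchanged would move its expected share to 50 / 125 = 40%, which is a quick way to sanity-check a per-customer adjustment before running it.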
## Interpreting results
What to look for in the headless CSV or the Web UI:
1. **Failures column should be < 1%** of total requests. Anything higher means the platform is shedding load; investigate before tightening SLAs.
2. **p95 columns** must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
3. **RPS** should hit the throughput floor sustainably (not just peak). A drop-off after warm-up means GC pressure or DB contention.
4. **Spike test** (`--users 200 --spawn-rate 50 --run-time 30s`): RPS should plateau, not crash. p95 may briefly spike during ramp.
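The "sustained, not just peak" check in item 3 can be made concrete by comparing throughput early and late in the steady phase. This is a sketch under stated assumptions: `samples` is a list of per-interval RPS readings collected however you prefer, `warmup` is the number of ramp-up samples to discard, and both function names are hypothetical.

```python
from statistics import mean


def sustained_rps(samples, warmup=10):
    """Mean RPS after discarding the first `warmup` samples (the ramp-up)."""
    steady = samples[warmup:]
    if not steady:
        raise ValueError("not enough samples after warm-up")
    return mean(steady)


def dropoff_ratio(samples, warmup=10):
    """Mean RPS of the last third of the steady phase over the first third.

    A value near 1.0 means throughput held; well below 1.0 suggests a drop-off.
    """
    steady = samples[warmup:]
    third = len(steady) // 3
    if third == 0:
        raise ValueError("not enough samples after warm-up")
    early = mean(steady[:third])
    if early == 0:
        raise ValueError("no throughput in the early steady phase")
    return mean(steady[-third:]) / early
```

What counts as an acceptable ratio is a judgment call for each deploy; the README only sets the floor for the sustained figure.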
## What this DOESN'T cover (yet)
* **Multi-tenant interleaving**: the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
* **Long-tail latency** (p99.9): single-machine locust can't reliably measure beyond p99. Use distributed mode (`--master` + `--worker`) for production-scale runs.
* **Sustained 24h runs**: designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.
## Day-2 ops
When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):
1. **DB index drift**: run `EXPLAIN QUERY PLAN` (SQLite) or `EXPLAIN` (Postgres) on the queries behind the slowest endpoint. SQLite + Postgres tend to miss indexes on aggregated reports.
2. **`/metrics` scrape interval too aggressive**: a 15s Prometheus scrape interval is fine; 1s slows everyone.
3. **Webhook deliveries blocking**: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via `infra webhook deliveries list`.
4. **Audit log too big**: Stage 91 retention purge keeps it bounded; set `ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90` and run nightly.
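For the first suspect on a SQLite deploy, the query plan can be inspected directly from Python. The sketch below uses an in-memory database and a hypothetical `observations` table, since the real OrgState schema is not shown in this README; the point is only to show how a plan changes once an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the detail strings from EXPLAIN QUERY PLAN for one statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column is the readable detail


conn = sqlite3.connect(":memory:")
# Hypothetical table and index names, for illustration only.
conn.execute(
    "CREATE TABLE observations (id INTEGER PRIMARY KEY, tenant_id TEXT, value REAL)"
)
sql = "SELECT * FROM observations WHERE tenant_id = ?"

before = query_plan(conn, sql, ("acme",))  # expect a full table scan
conn.execute("CREATE INDEX idx_observations_tenant ON observations (tenant_id)")
after = query_plan(conn, sql, ("acme",))   # expect a search using the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, but a `SCAN` on a large table behind a slow endpoint is the usual sign of a missing index.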
See `RUNBOOK.md` § 7 for the full incident triage decision tree.