# OrgState — Load Testing

Locust-based perf harness (Stage 93). It exists to show that the platform handles realistic concurrent load before we commit to SLAs with paying customers.

## V1.1 SLA targets

Targets assume a single-replica deploy (one API process, one scheduler, SQLite or Postgres). With multiple replicas, aggregate throughput should rise while the per-replica figures stay roughly the same.

| Metric | Target | Notes |
|---|---|---|
| p50 read latency | < 100ms | dashboard browsing feels instant |
| p95 read latency | < 500ms | dashboard tolerable on slow networks |
| p50 write latency | < 500ms | observation ingestion responsive |
| p95 write latency | < 2000ms | batches of 5 rows |
| Error rate | < 1% | excluding 400/422 "bad data" cases |
| Throughput | ≥ 100 RPS | sustained, mixed read/write |

These are V1.1 minimum targets, deliberately generous because we run a single replica with zero DB tuning. Tighten them as scale grows.

## Install

The load tool is an **optional** dep — not pulled by `requirements-runtime.txt`.

```bash
pip install -r requirements-load.txt
```

## Run

### Headless smoke (CI-friendly)

```bash
./load/smoke.sh
```

Runs locust against `http://localhost:8080` for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself:

```bash
# terminal 1
ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
    python -m uvicorn infra.api.app:app --port 8080

# terminal 2 — bootstrap the tenant + key the locustfile uses
python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
    --mint-operator > /tmp/acme.json
export LOCUST_TENANT_ID=acme
export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \
    /tmp/acme.json)

# terminal 2 — run the smoke
./load/smoke.sh
```
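How `smoke.sh` derives its exit status is not shown here. One way a headless Locust run can fail the process on an SLA breach is Locust's `quitting` event together with `environment.process_exit_code`. The sketch below is an assumption about how that could be wired, not the contents of `load/locustfile.py`; `sla_exit_code` and the two constants are hypothetical names that mirror the error-rate and throughput rows of the SLA table. The latency rows differ for reads and writes, so they need a per-endpoint check rather than the aggregate used here.

```python
"""Sketch: turn aggregate Locust stats into a process exit code."""

try:
    from locust import events
except ImportError:  # keeps the helper importable where Locust is not installed
    events = None

MAX_FAIL_RATIO = 0.01  # "Error rate < 1%" from the SLA table
MIN_RPS = 100.0        # "Throughput >= 100 RPS" from the SLA table


def sla_exit_code(fail_ratio: float, total_rps: float) -> int:
    """Return 0 when both aggregate targets are met, 1 otherwise."""
    if fail_ratio >= MAX_FAIL_RATIO:
        return 1
    if total_rps < MIN_RPS:
        return 1
    return 0


if events is not None:

    @events.quitting.add_listener
    def _set_exit_code(environment, **kwargs):
        # fail_ratio counts every request Locust recorded as failed, so
        # 400/422 responses are included unless the tasks mark them as success.
        total = environment.stats.total
        environment.process_exit_code = sla_exit_code(
            total.fail_ratio, total.total_rps
        )
```

With a hook like this in the locustfile, a wrapper script only needs to propagate Locust's own exit status.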

### Interactive (Web UI)

```bash
locust -f load/locustfile.py --host http://localhost:8080
# open http://localhost:8089 — set user count + spawn rate + run time
```

### Full perf run (2 minutes, 50 users)

```bash
locust -f load/locustfile.py \
    --host https://api.orgstate.example \
    --users 50 --spawn-rate 5 --run-time 2m \
    --headless --csv perf-2026-05-18
```

Outputs `perf-2026-05-18_stats.csv` (per-endpoint p50/p95/p99/RPS) and `perf-2026-05-18_failures.csv`.
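The stats file can be checked against the read and write p95 targets without opening a spreadsheet. The following is a minimal sketch, assuming the column names recent Locust releases write (`Type`, `Name`, `95%`) and assuming that `GET` rows are reads and `POST` rows are writes; `p95_breaches` is a hypothetical helper, not part of the harness.

```python
"""Sketch: flag endpoints whose p95 exceeds the read or write target."""
import csv

# Reads vs writes, taken from the SLA table (milliseconds).
P95_TARGET_MS = {"GET": 500.0, "POST": 2000.0}


def p95_breaches(stats_csv) -> list:
    """Return (method, name, p95_ms) for each row at or above its target.

    `stats_csv` is any open text file holding a Locust `_stats.csv`.
    """
    breaches = []
    for row in csv.DictReader(stats_csv):
        target = P95_TARGET_MS.get(row["Type"])
        if target is None:  # skips rows with another Type, e.g. the aggregate
            continue
        try:
            p95 = float(row["95%"])
        except ValueError:  # non-numeric cell, e.g. a row with no requests
            continue
        if p95 >= target:
            breaches.append((row["Type"], row["Name"], p95))
    return breaches
```

Call it as `p95_breaches(open("perf-2026-05-18_stats.csv", newline=""))`; an empty list means every endpoint stayed under its target.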

## Configuration (env vars)

| Var | Required? | Default | Why |
|---|---|---|---|
| `LOCUST_TENANT_ID` | recommended | `acme` | which tenant to hammer |
| `LOCUST_API_KEY` | yes | unset | bearer token (operator role) |
| `LOCUST_ENTITY_TYPE` | no | `warehouse` | for ingestion/run tasks |
| `LOCUST_VERTICAL` | no | `logistics` | for run trigger |
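As an illustration of the table above, a locustfile could resolve these variables once at import time. This is a sketch under the stated defaults, not the actual code in `load/locustfile.py`; `load_settings` and `DEFAULTS` are hypothetical names.

```python
"""Sketch: resolve harness settings from environment variables."""
import os

# Defaults copied from the configuration table.
DEFAULTS = {
    "LOCUST_TENANT_ID": "acme",
    "LOCUST_ENTITY_TYPE": "warehouse",
    "LOCUST_VERTICAL": "logistics",
}


def load_settings(env=None) -> dict:
    """Return all four settings, failing fast when the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("LOCUST_API_KEY")
    if not api_key:
        raise RuntimeError(
            "LOCUST_API_KEY is required (operator-role bearer token)"
        )
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["LOCUST_API_KEY"] = api_key
    return settings
```

Failing at start-up on a missing key avoids a run where every request returns 401 and the failure rate says nothing about load.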

## Scenario mix

The locustfile uses **weighted task selection** to approximate real customer traffic:

* **70% reads** (`health`, `tenant`, `runs`, `usage`, `webhooks`) — dashboard browsing.
* **25% writes** (`POST /observations`) — data feed ingestion.
* **5% trigger** (`POST /observations/run`) — pipeline runs.

Adjust per-customer by editing the `@task(N)` weights in `locustfile.py`.
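Locust treats `@task(N)` weights as relative ratios, not percentages: a task's share of traffic is its weight divided by the sum of all weights in the user class. The sketch below illustrates the consequence with the three groups above; `traffic_shares` is a hypothetical helper, and the `ingest_heavy` weights are made up for the example.

```python
"""Sketch: how relative task weights translate into traffic shares."""


def traffic_shares(weights: dict) -> dict:
    """Normalise relative weights into fractions that sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


baseline = traffic_shares({"reads": 70, "writes": 25, "trigger": 5})

# Raising only the write weight also shrinks the read and trigger shares.
ingest_heavy = traffic_shares({"reads": 70, "writes": 50, "trigger": 5})

print(baseline)
print(ingest_heavy)
```

When tuning the mix for a customer, re-check all shares after changing any single weight.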

## Interpreting results

What to look for in the headless CSV or the Web UI:

1. **Failures column should be < 1%** of total requests. Anything higher means the platform is shedding load — investigate before tightening SLAs.
2. **p95 columns** must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
3. **RPS** should hold at or above the throughput floor for the whole run, not just at peak. A drop-off after warm-up usually points to GC pressure or DB contention.
4. **Spike test** (`--users 200 --spawn-rate 50 --run-time 30s`) — RPS should plateau, not crash. p95 may briefly spike during ramp.
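The sustained-versus-peak distinction in item 3 can be checked numerically from a series of RPS samples, for example the aggregated `Requests/s` values that Locust records over time when `--csv` is set. This is a rough sketch; the function names, the warm-up length, and the 0.8 tolerance are assumptions to adjust per run.

```python
"""Sketch: detect a throughput drop-off after warm-up."""
from statistics import mean


def sustained_rps(samples, warmup=5, tail=10):
    """Mean RPS over the last `tail` samples, ignoring the first `warmup`."""
    steady = samples[warmup:]
    if len(steady) < tail:
        raise ValueError("not enough samples after warm-up")
    return mean(steady[-tail:])


def has_dropoff(samples, warmup=5, tail=10, tolerance=0.8):
    """True when the tail mean falls below `tolerance` times the post-warm-up peak."""
    tail_mean = sustained_rps(samples, warmup, tail)
    peak = max(samples[warmup:])
    return tail_mean < tolerance * peak
```

Compare `sustained_rps(...)` against the 100 RPS floor, and treat `has_dropoff(...)` returning `True` as the cue to look at GC and DB contention.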

## What this DOESN'T cover (yet)

* **Multi-tenant interleaving** — the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
* **Long-tail latency** (p99.9) — single-machine locust can't reliably measure beyond p99. Use distributed mode (`--master` + `--worker`) for production-scale runs.
* **Sustained 24h runs** — designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.

## Day-2 ops

When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):

1. **DB index drift**: run `EXPLAIN QUERY PLAN` (SQLite) or `EXPLAIN ANALYZE` (Postgres) on the queries behind the slowest endpoint. On both backends, aggregated-report queries are the ones most likely to be missing an index.
2. **`/metrics` scrape interval too aggressive**: a 15s Prometheus scrape interval is fine; 1s slows everyone.
3. **Webhook deliveries blocking**: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via `infra webhook deliveries list`.
4. **Audit log too big**: Stage 91 retention purge keeps it bounded; set `ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90` and run the purge nightly.
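For the index check in item 1, the sketch below shows what a missing index looks like in SQLite's `EXPLAIN QUERY PLAN` output and how the plan changes once the index exists. The `observations` table, its columns, and the index name are hypothetical; the real OrgState schema is not shown in this document.

```python
"""Sketch: compare SQLite query plans before and after adding an index."""
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # columns: id, parent, notused, detail


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for whatever the slow endpoint queries.
conn.execute(
    "CREATE TABLE observations ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, entity_type TEXT, value REAL)"
)
query = "SELECT * FROM observations WHERE tenant_id = ?"

before = plan(conn, query, ("acme",))
conn.execute("CREATE INDEX idx_obs_tenant ON observations (tenant_id)")
after = plan(conn, query, ("acme",))

print(before)  # expected to report a full-table SCAN
print(after)   # expected to report a SEARCH that uses idx_obs_tenant
```

The exact wording of the plan varies between SQLite versions, so look for `SCAN` versus `SEARCH ... USING INDEX` rather than matching the full string.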

See `RUNBOOK.md` § 7 for the full incident triage decision tree.