
OrgState: Load Testing

Locust-based perf harness (Stage 93). Prove the platform handles realistic concurrent load before committing to SLAs with paying customers.

V1.1 SLA targets

Targets assume a single-replica deploy (one API process, one scheduler, SQLite or Postgres). With more replicas, aggregate throughput should rise while per-instance figures stay roughly the same.

| Metric | Target | Notes |
|---|---|---|
| p50 read latency | < 100 ms | dashboard browsing feels instant |
| p95 read latency | < 500 ms | dashboard tolerable on slow networks |
| p50 write latency | < 500 ms | observation ingestion responsive |
| p95 write latency | < 2000 ms | batches of 5 rows |
| Error rate | < 1% | excluding 400/422 "bad data" cases |
| Throughput | ≥ 100 RPS | sustained, mixed read/write |

These are V1.1 floors, deliberately generous because we run a single replica with zero DB tuning. Tighten them as scale grows.
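To make the targets checkable by a script, they can be written down as data. The following is a minimal sketch, not the harness's actual implementation: the `SLA_TARGETS` mapping, the metric key names, and the `sla_violations` helper are all hypothetical. Only the threshold values come from the table above.

```python
import operator

# Thresholds copied from the V1.1 SLA table. The key names are illustrative
# assumptions, not identifiers taken from the OrgState code base.
SLA_TARGETS = {
    "p50_read_ms": ("<", 100),
    "p95_read_ms": ("<", 500),
    "p50_write_ms": ("<", 500),
    "p95_write_ms": ("<", 2000),
    "error_rate": ("<", 0.01),   # fraction of requests, 400/422 excluded
    "rps": (">=", 100),          # sustained, mixed read/write
}

_OPS = {"<": operator.lt, ">=": operator.ge}


def sla_violations(measured: dict) -> list:
    """Return one human-readable message per target that `measured` misses."""
    violations = []
    for metric, (op, target) in SLA_TARGETS.items():
        value = measured[metric]
        if not _OPS[op](value, target):
            violations.append(f"{metric}: measured {value}, target {op} {target}")
    return violations
```

A wrapper script could exit with status 1 whenever `sla_violations(...)` returns a non-empty list, which would mirror the pass/fail contract described for the smoke run.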

Install

The load tool is an optional dependency; it is not pulled in by requirements-runtime.txt.

pip install -r requirements-load.txt

Run

Headless smoke (CI-friendly)

./load/smoke.sh

Runs locust against http://localhost:8080 for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself:

# terminal 1
ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
    python -m uvicorn infra.api.app:app --port 8080

# terminal 2: bootstrap the tenant + key the locustfile uses
python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
    --mint-operator > /tmp/acme.json
export LOCUST_TENANT_ID=acme
export LOCUST_API_KEY=$(jq -r '.keys[] | select(.role=="operator").raw_key' \
    /tmp/acme.json)

# terminal 2: run the smoke
./load/smoke.sh

Interactive (Web UI)

locust -f load/locustfile.py --host http://localhost:8080
# open http://localhost:8089, then set user count + spawn rate + run time

Full perf run (2 minutes, 50 users)

locust -f load/locustfile.py \
    --host https://api.orgstate.example \
    --users 50 --spawn-rate 5 --run-time 2m \
    --headless --csv perf-2026-05-18

Outputs perf-2026-05-18_stats.csv (per-endpoint p50/p95/p99/RPS) and perf-2026-05-18_failures.csv.
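The stats file can be summarized with a few lines of standard-library Python. This is a sketch under assumptions: the column names (`Name`, `Request Count`, `Failure Count`, `Requests/s`, `50%`, `95%`, `99%`) match what recent Locust releases write, so check them against the header of your own file. The `summarize_stats` helper is hypothetical and not part of the repo.

```python
import csv
import io


def summarize_stats(csv_text: str) -> dict:
    """Map each endpoint name to its p50/p95/p99 (ms), RPS, and failure ratio.

    Rows with zero requests are skipped, because their percentile cells
    may not hold numbers.
    """
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        requests = int(row["Request Count"])
        if requests == 0:
            continue
        failures = int(row["Failure Count"])
        summary[row["Name"]] = {
            "p50_ms": float(row["50%"]),
            "p95_ms": float(row["95%"]),
            "p99_ms": float(row["99%"]),
            "rps": float(row["Requests/s"]),
            "failure_ratio": failures / requests,
        }
    return summary


# Usage (assuming the file from the run above exists):
#   with open("perf-2026-05-18_stats.csv", newline="") as fh:
#       stats = summarize_stats(fh.read())
```

The returned dictionary is keyed by the `Name` column, so the total row that Locust appends shows up alongside the per-endpoint rows.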

Configuration (env vars)

| Var | Required? | Default | Why |
|---|---|---|---|
| LOCUST_TENANT_ID | recommended | acme | which tenant to hammer |
| LOCUST_API_KEY | yes | unset | bearer token (operator role) |
| LOCUST_ENTITY_TYPE | no | warehouse | for ingestion/run tasks |
| LOCUST_VERTICAL | no | logistics | for run trigger |
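For illustration, here is one way a locustfile could resolve these variables. It is a sketch, not the actual code in load/locustfile.py: the `load_locust_config` function and the dictionary keys are hypothetical, while the variable names and defaults come from the table above.

```python
import os


def load_locust_config(env=None) -> dict:
    """Resolve the load-test settings, failing fast when the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("LOCUST_API_KEY")
    if not api_key:
        # The only required variable: an operator-role bearer token.
        raise RuntimeError("LOCUST_API_KEY is required (operator-role bearer token)")
    return {
        "tenant_id": env.get("LOCUST_TENANT_ID", "acme"),
        "api_key": api_key,
        "entity_type": env.get("LOCUST_ENTITY_TYPE", "warehouse"),
        "vertical": env.get("LOCUST_VERTICAL", "logistics"),
    }
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.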

Scenario mix

The locustfile uses weighted task selection to approximate real customer traffic:

  • 70% reads (health, tenant, runs, usage, webhooks): dashboard browsing.
  • 25% writes (POST /observations): data feed ingestion.
  • 5% trigger (POST /observations/run): pipeline runs.

Adjust per-customer by editing the @task(N) weights in locustfile.py.
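Locust picks the next task at random with probability proportional to its weight, so only the ratio between weights matters. The sketch below shows that arithmetic with the standard library alone. The weights 14:5:1 are an assumption chosen to reproduce the 70/25/5 split; the real values in locustfile.py may differ, and the read share there is spread across several endpoints.

```python
import random
from collections import Counter

# Hypothetical weights: any integers in a 14:5:1 ratio give 70% / 25% / 5%.
TASK_WEIGHTS = {"read": 14, "write": 5, "trigger": 1}


def task_shares(weights: dict) -> dict:
    """Convert integer task weights into the fraction of traffic each gets."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def simulate(weights: dict, picks: int, seed: int = 0) -> Counter:
    """Draw `picks` tasks with weighted random selection and count them."""
    rng = random.Random(seed)
    names = list(weights)
    drawn = rng.choices(names, weights=[weights[n] for n in names], k=picks)
    return Counter(drawn)
```

Doubling every weight leaves the mix unchanged. To shift a customer profile toward ingestion, raise the write weight relative to the others.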

Interpreting results

What to look for in the headless CSV or the Web UI:

  1. Failures column should be < 1% of total requests. Anything higher means the platform is shedding load; investigate before tightening SLAs.
  2. p95 columns must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
  3. RPS should hit the throughput floor sustainably (not just peak). A drop-off after warm-up means GC pressure or DB contention.
  4. Spike test (--users 200 --spawn-rate 50 --run-time 30s): RPS should plateau, not crash. p95 may briefly spike during ramp.
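The "sustained, not just peak" check in item 3 can be made concrete by comparing throughput early and late in the run. The following sketch assumes you already have a list of per-interval RPS samples, for example taken from the `_stats_history.csv` file that `--csv` also writes. The helper names, the 5-sample warm-up, and the 10% tolerance are illustrative choices, not values from the harness.

```python
from statistics import mean


def sustained_rps(samples: list, warmup: int = 5) -> float:
    """Average RPS after discarding the first `warmup` samples."""
    steady = samples[warmup:]
    if not steady:
        raise ValueError("not enough samples after warm-up")
    return mean(steady)


def has_dropoff(samples: list, warmup: int = 5, tolerance: float = 0.10) -> bool:
    """True if the later half of the steady state runs more than
    `tolerance` below the earlier half."""
    steady = samples[warmup:]
    half = len(steady) // 2
    if half == 0:
        return False
    early = mean(steady[:half])
    late = mean(steady[half:])
    return late < early * (1 - tolerance)
```

A run would pass this reading of item 3 when `sustained_rps(...)` is at or above the throughput floor and `has_dropoff(...)` is false.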

What this DOESN'T cover (yet)

  • Multi-tenant interleaving: the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
  • Long-tail latency (p99.9): single-machine locust can't reliably measure beyond p99. Use distributed mode (--master + --worker) for production-scale runs.
  • Sustained 24h runs: the harness is designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.

Day-2 ops

When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):

  1. DB index drift: run EXPLAIN QUERY PLAN (SQLite) or EXPLAIN ANALYZE (Postgres) on the query behind the slowest endpoint. Both backends tend to miss indexes on aggregated reports.
  2. /metrics scrape interval too aggressive: the commonly used 15s Prometheus interval is fine; 1s slows everyone down.
  3. Webhook deliveries blocking: Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via infra webhook deliveries list.
  4. Audit log too big: Stage 91 retention purge keeps it bounded; set ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90 and run the purge nightly.
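For the first suspect, the SQLite check can be scripted with the standard library. This is a sketch against a throwaway in-memory database: the `observations` table, its columns, and the index name are hypothetical stand-ins, not the real OrgState schema. Point the same helper at a copy of your own database and the actual slow query.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the 'detail' text of each EXPLAIN QUERY PLAN row for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the plan detail


# Hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, entity_type TEXT, value REAL)"
)

sql = "SELECT * FROM observations WHERE tenant_id = ?"

# Without an index on tenant_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, ("acme",))

conn.execute("CREATE INDEX idx_observations_tenant ON observations (tenant_id)")

# With the index in place, the plan should name it in a SEARCH step.
plan_after = query_plan(conn, sql, ("acme",))

print("before:", plan_before)
print("after: ", plan_after)
```

A plan line that starts with SCAN on a large table is the usual sign of a missing index. The exact wording of the plan text varies between SQLite versions, so match on keywords rather than on whole strings.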

See RUNBOOK.md § 7 for the full incident triage decision tree.