	# OrgState — Load Testing

	Locust-based perf harness (Stage 93). Its job is to prove the platform handles realistic concurrent load before we commit to SLAs with paying customers.

	## V1.1 SLA targets

	Targets assume a single-replica deploy (one API process, one scheduler, SQLite or Postgres). With multiple replicas, total throughput should rise, while per-instance latency and error figures should stay roughly the same.

	| Metric | Target | Notes |
	|---|---|---|
	| p50 read latency | < 100ms | dashboard browsing feels instant |
	| p95 read latency | < 500ms | dashboard tolerable on slow networks |
	| p50 write latency | < 500ms | observation ingestion responsive |
	| p95 write latency | < 2000ms | batches of 5 rows |
	| Error rate | < 1% | excluding 400/422 "bad data" cases |
	| Throughput | ≥ 100 RPS | sustained, mixed read/write |

	These are V1.1 floors — generous because we run a single replica + zero DB tuning. Tighten as scale grows.
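
The table above can be encoded as data so that a run is checked the same way every time. This is a minimal sketch, not the logic `smoke.sh` actually uses: the names `SLA_TARGETS`, `MIN_RPS`, and `sla_violations` are hypothetical, as is the shape of the `measured` dict.

```python
# Hypothetical encoding of the V1.1 targets; all keys are illustrative names.
SLA_TARGETS = {
    "read_p50_ms": 100,
    "read_p95_ms": 500,
    "write_p50_ms": 500,
    "write_p95_ms": 2000,
    "error_rate": 0.01,  # 1%, excluding 400/422 "bad data" responses
}
MIN_RPS = 100  # sustained, mixed read/write


def sla_violations(measured):
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    for name, limit in SLA_TARGETS.items():
        # Targets are strict upper bounds ("< limit"), so equality fails.
        if measured[name] >= limit:
            violations.append(f"{name}: {measured[name]} (target < {limit})")
    if measured["rps"] < MIN_RPS:
        violations.append(f"rps: {measured['rps']} (target >= {MIN_RPS})")
    return violations
```

A CI wrapper could exit non-zero whenever the returned list is non-empty, which matches the exit-code behaviour described for the smoke run.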

	## Install

	The load tool is an optional dep — not pulled by `requirements-runtime.txt`.

	```bash
	pip install -r requirements-load.txt
	```

	## Run

	### Headless smoke (CI-friendly)

	```bash
	./load/smoke.sh
	```

	Runs locust against `http://localhost:8080` for 30s with 20 virtual users. Exits 0 if SLAs are met, 1 otherwise. Assumes you started the API yourself:

	```bash
	# terminal 1
	ORGSTATE_DB_PATH=/tmp/load.sqlite3 \
	python -m uvicorn infra.api.app:app --port 8080

	# terminal 2 — bootstrap the tenant + key the locustfile uses
	python -m infra --db /tmp/load.sqlite3 onboard acme "ACME" \
	--mint-operator > /tmp/acme.json
	export LOCUST_TENANT_ID=acme
	export LOCUST_API_KEY="$(jq -r '.keys[] | select(.role=="operator").raw_key' \
	/tmp/acme.json)"

	# terminal 2 — run the smoke
	./load/smoke.sh
	```

	### Interactive (Web UI)

	```bash
	locust -f load/locustfile.py --host http://localhost:8080
	# open http://localhost:8089 — set user count + spawn rate + run time
	```

	### Full perf run (2 minutes, 50 users)

	```bash
	locust -f load/locustfile.py \
	--host https://api.orgstate.example \
	--users 50 --spawn-rate 5 --run-time 2m \
	--headless --csv perf-2026-05-18
	```

	Outputs `perf-2026-05-18_stats.csv` (per-endpoint p50/p95/p99/RPS) and `perf-2026-05-18_failures.csv`.
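
To pull the per-endpoint numbers out of the stats file without a spreadsheet, something like the following may help. It is a sketch that assumes the column headers `Name`, `Request Count`, `Failure Count`, `50%`, `95%`, and `Requests/s`; check them against the header row your Locust version writes. `summarize_stats` is a hypothetical helper, not part of this repo.

```python
import csv


def summarize_stats(lines):
    """Map endpoint name -> summary, given an open `_stats.csv` file object.

    Column names are assumptions about Locust's CSV layout; adjust as needed.
    """
    summary = {}
    for row in csv.DictReader(lines):
        requests = int(row["Request Count"])
        if requests == 0:
            # Rows without traffic may carry non-numeric percentiles; skip them.
            continue
        summary[row["Name"]] = {
            "p50_ms": float(row["50%"]),
            "p95_ms": float(row["95%"]),
            "rps": float(row["Requests/s"]),
            "error_rate": int(row["Failure Count"]) / requests,
        }
    return summary


# Usage (file name taken from the run above):
# with open("perf-2026-05-18_stats.csv", newline="") as f:
#     for name, stats in summarize_stats(f).items():
#         print(name, stats)
```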

	## Configuration (env vars)

	| Var | Required? | Default | Why |
	|---|---|---|---|
	| `LOCUST_TENANT_ID` | recommended | `acme` | which tenant to hammer |
	| `LOCUST_API_KEY` | yes | unset | bearer token (operator role) |
	| `LOCUST_ENTITY_TYPE` | no | `warehouse` | for ingestion/run tasks |
	| `LOCUST_VERTICAL` | no | `logistics` | for run trigger |
	## Scenario mix

	The locustfile uses weighted task selection to approximate real customer traffic:

	* 70% reads (`health`, `tenant`, `runs`, `usage`, `webhooks`) — dashboard browsing.
	* 25% writes (`POST /observations`) — data feed ingestion.
	* 5% trigger (`POST /observations/run`) — pipeline runs.

	Adjust per-customer by editing the `@task(N)` weights in `locustfile.py`.
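
Locust picks tasks with probability proportional to their `@task(N)` weight, so it helps to check what share a set of weights produces before editing. The weights and task names below are hypothetical values chosen to reproduce the 70/25/5 split; the actual numbers in `locustfile.py` may differ.

```python
# Hypothetical weights: five read tasks at 14 each, one write, one trigger.
TASK_WEIGHTS = {
    "health": 14,
    "tenant": 14,
    "runs": 14,
    "usage": 14,
    "webhooks": 14,
    "post_observations": 25,
    "post_observations_run": 5,
}
READ_TASKS = ("health", "tenant", "runs", "usage", "webhooks")


def group_share(weights, names):
    """Fraction of all task picks that fall on the given task names."""
    total = sum(weights.values())
    return sum(weights[name] for name in names) / total


if __name__ == "__main__":
    print("reads  ", group_share(TASK_WEIGHTS, READ_TASKS))
    print("writes ", group_share(TASK_WEIGHTS, ["post_observations"]))
    print("trigger", group_share(TASK_WEIGHTS, ["post_observations_run"]))
```

Only the ratios matter: doubling every weight leaves the mix unchanged.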

	## Interpreting results

	What to look for in the headless CSV or the Web UI:

	1. Failures column should be < 1% of total requests. Anything higher means the platform is shedding load — investigate before tightening SLAs.
	2. p95 columns must stay under the targets above. If reads exceed 500ms, the dashboard will feel sluggish.
	3. RPS should hit the throughput floor sustainably (not just peak). A drop-off after warm-up means GC pressure or DB contention.
	4. Spike test (`--users 200 --spawn-rate 50 --run-time 30s`) — RPS should plateau, not crash. p95 may briefly spike during ramp.
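
Point 3 distinguishes sustained throughput from peak throughput. One way to make that concrete, given a list of per-interval RPS samples (for example, read from the history CSV that `--csv` also produces), is sketched below. `rps_report` is a hypothetical helper, and the 10% drop-off threshold is an arbitrary assumption rather than a project rule.

```python
from statistics import mean


def rps_report(samples, warmup_samples, floor=100.0):
    """Summarize throughput after discarding the first `warmup_samples` readings."""
    steady = samples[warmup_samples:]
    if not steady:
        raise ValueError("no samples left after warm-up")
    half = len(steady) // 2
    first = steady[:half] or steady  # fall back when only one sample remains
    second = steady[half:]
    return {
        "peak": max(samples),
        "sustained": min(steady),  # worst steady-state interval
        "meets_floor": min(steady) >= floor,
        # Flag a drop-off when the later half runs over 10% below the earlier half.
        "drop_off": mean(second) < 0.9 * mean(first),
    }
```

A run can show a peak above the floor and still fail `meets_floor`, which is exactly the case the checklist warns about.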

	## What this DOESN'T cover (yet)

	* Multi-tenant interleaving — the locustfile hammers ONE tenant. Real fairness testing needs N parallel users per tenant. V1.2 candidate.
	* Long-tail latency (p99.9) — single-machine locust can't reliably measure beyond p99. Use distributed mode (`--master` + `--worker`) for production-scale runs.
	* Sustained 24h runs — designed for short bursts. Soak testing is a separate exercise; rotate the API process if memory grows.

	## Day-2 ops

	When p95 starts creeping past target on a healthy deploy, the usual suspects (in order of likelihood):

	1. DB index drift: run `EXPLAIN QUERY PLAN` (SQLite) or `EXPLAIN ANALYZE` (Postgres) on the queries behind the slowest endpoint. Both engines tend to miss indexes on aggregated reports.
	2. `/metrics` scrape interval too aggressive: a 15s interval (a common Prometheus setting) is fine; 1s slows everyone.
	3. Webhook deliveries blocking — Stage 76 audit log + Stage 77 auto-disable should keep this bounded, but verify via `infra webhook deliveries list`.
	4. Audit log too big — Stage 91 retention purge keeps it bounded; set `ORGSTATE_RETENTION_AUDIT_LOGS_DAYS=90` and run nightly.
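
For the first suspect, the index check on SQLite can be scripted with the standard library. The sketch below uses an in-memory database and a made-up `runs` table and `idx_runs_tenant` index; neither is taken from the OrgState schema. On Postgres the equivalent would go through `EXPLAIN` with your own driver.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the plan description


conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, tenant_id TEXT, status TEXT)")
sql = "SELECT id, status FROM runs WHERE tenant_id = ?"

before = query_plan(conn, sql, ("acme",))  # expect a full table scan
conn.execute("CREATE INDEX idx_runs_tenant ON runs (tenant_id)")
after = query_plan(conn, sql, ("acme",))   # expect a search using the index

print(before)
print(after)
```

A plan that says `SCAN` for a query filtered by tenant is the signal to look for; once the index exists the plan should name it.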

	See `RUNBOOK.md` § 7 for the full incident triage decision tree.