# Security Notes
This is a **reference starter**, not a production system. Use these notes to harden it before deploying.
## What is already safe-by-default
- **Tenant isolation**: the `tenant_id` column scopes each row to one tenant. The search query enforces `tenant_id = X OR visibility = public`, where `X` is the caller's tenant.
- **Public vs restricted**: `restricted` items never leak to other tenants unless explicitly requested.
- **Trust classes**: retrieval can be gated by `max_rank` so low-trust content is excluded.
- **Retrieval logging**: every search is logged with filters and result IDs for audit.
- **No private credentials**: the repo contains no tokens, hostnames, or internal paths.
- **Synthetic data only**: all seed data is public-safe and fabricated.
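
The tenant isolation rule above is simple enough to mirror in a small pure function, which can help when writing regression tests against the search query. This is a minimal sketch, not code from the starter: the `Item` class, the `is_visible` and `filter_visible` names, and the literal visibility values `"public"` and `"restricted"` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a stored row; field names are assumptions."""
    id: int
    tenant_id: str
    visibility: str  # assumed values: "public" or "restricted"


def is_visible(item: Item, caller_tenant: str) -> bool:
    # Mirrors the documented predicate: tenant_id = X OR visibility = public.
    return item.tenant_id == caller_tenant or item.visibility == "public"


def filter_visible(items: Iterable[Item], caller_tenant: str) -> List[Item]:
    # Keep only the rows the caller's tenant is allowed to see.
    return [item for item in items if is_visible(item, caller_tenant)]


items = [
    Item(1, "tenant_a", "restricted"),
    Item(2, "tenant_b", "restricted"),
    Item(3, "tenant_b", "public"),
]

# tenant_a sees its own restricted row plus tenant_b's public row.
print([item.id for item in filter_visible(items, "tenant_a")])  # [1, 3]
```

Comparing the output of a function like this with the rows the real query returns for the same synthetic data is one way to catch an accidental change to the filter.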
## What you must add for production
1. **Authentication / authorization**
   - Add OAuth2, API keys, or mutual TLS.
   - Bind `tenant_id` to the authenticated user; never accept it from the request body.
   - Enable Postgres Row-Level Security (RLS) and tie policies to application-level user IDs.
2. **Input validation**
   - Limit `content` size (e.g. 100 KB) to prevent storage abuse.
   - Validate `metadata` JSON against an allow-list and reject unexpected keys.
   - Rate-limit writes per tenant.
3. **Network security**
   - Do not expose Postgres port `5432` to the internet.
   - Run the API and database in a private network (e.g. a VPC), and put the API behind a reverse proxy if it must be reachable from outside.
   - Use TLS for all client↔API and API↔Postgres connections.
4. **Secrets management**
   - Rotate `POSTGRES_PASSWORD` before deploying and store it in a secrets manager (e.g. HashiCorp Vault, AWS Secrets Manager).
   - Never commit `.env` files with real passwords.
5. **Observability**
   - Alert on abnormal retrieval patterns (e.g. tenant A querying tenant B's data).
   - Monitor `retrieval_logs` for signs of probing or data exfiltration.
6. **Backup and encryption**
   - Encrypt Postgres volumes at rest.
   - Schedule automated backups and test restores.
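
The first two items above (binding `tenant_id` to the caller and validating input) can be sketched as a single check that runs before any write reaches Postgres. This is an illustration under stated assumptions, not the starter's API: `validate_write`, `ValidationError`, and the contents of `ALLOWED_METADATA_KEYS` are hypothetical, and `authenticated_tenant` is assumed to come from whatever auth layer you add.

```python
from typing import Any

MAX_CONTENT_BYTES = 100 * 1024  # the 100 KB example limit from the notes
# Hypothetical allow-list; replace with the keys your application really uses.
ALLOWED_METADATA_KEYS = frozenset({"source", "tags", "language"})


class ValidationError(ValueError):
    """Raised when a write request fails validation."""


def validate_write(body: dict[str, Any], authenticated_tenant: str) -> dict[str, Any]:
    # The tenant comes from the authenticated session, never from the client payload.
    if "tenant_id" in body:
        raise ValidationError("tenant_id must not be supplied in the request body")

    content = body.get("content")
    if not isinstance(content, str) or not content:
        raise ValidationError("content must be a non-empty string")
    # Measure encoded bytes, since that is what ends up in storage.
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        raise ValidationError("content exceeds the size limit")

    metadata = body.get("metadata", {})
    if not isinstance(metadata, dict):
        raise ValidationError("metadata must be a JSON object")
    unexpected = set(metadata) - ALLOWED_METADATA_KEYS
    if unexpected:
        raise ValidationError(f"unexpected metadata keys: {sorted(unexpected)}")

    # Return the record to insert, with the tenant taken from the auth context.
    return {
        "tenant_id": authenticated_tenant,
        "content": content,
        "metadata": metadata,
    }


record = validate_write(
    {"content": "hello", "metadata": {"source": "demo"}},
    authenticated_tenant="tenant_a",
)
print(record["tenant_id"])  # tenant_a
```

Rejecting a client-supplied `tenant_id` outright, rather than silently overwriting it, makes probing attempts visible in logs. Application-level checks like this complement RLS; they do not replace it.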
## Known limitations
- No authentication layer is included (by design, to keep the starter runnable).
- Placeholder embeddings are not semantically meaningful; swap in a real model before any serious use.
- HNSW index parameters (`m=16`, `ef_construction=64`) are starter defaults; tune for your data size and recall requirements.
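
When tuning the HNSW parameters, it can help to generate the index statement from validated values instead of editing SQL by hand. The sketch below builds pgvector's `CREATE INDEX ... USING hnsw` statement and the `hnsw.ef_search` session setting as strings. The table name `items`, the column name `embedding`, and the `vector_cosine_ops` operator class are assumptions, not taken from the starter's schema. The numeric bounds reflect pgvector's documented limits as best understood here; verify them against the pgvector version you run.

```python
def hnsw_index_sql(
    table: str,
    column: str,
    m: int = 16,
    ef_construction: int = 64,
    opclass: str = "vector_cosine_ops",  # assumption: cosine distance
) -> str:
    """Build a pgvector HNSW index statement from validated parameters."""
    # Conservative guard: only plain identifiers, since these are interpolated.
    for ident in (table, column, opclass):
        if not ident.isidentifier():
            raise ValueError(f"not a plain identifier: {ident!r}")
    # Bounds assumed from pgvector's documentation; check your version.
    if not 2 <= m <= 100:
        raise ValueError("m must be between 2 and 100")
    if not 4 <= ef_construction <= 1000:
        raise ValueError("ef_construction must be between 4 and 1000")
    if ef_construction < 2 * m:
        raise ValueError("ef_construction must be at least 2 * m")
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int) -> str:
    """Build the session setting that trades query speed for recall."""
    if not 1 <= ef_search <= 1000:
        raise ValueError("ef_search must be between 1 and 1000")
    return f"SET hnsw.ef_search = {ef_search};"


# The starter defaults from the notes above.
print(hnsw_index_sql("items", "embedding"))
# CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
print(ef_search_sql(100))
# SET hnsw.ef_search = 100;
```

Higher `m` and `ef_construction` generally improve recall at the cost of build time and index size, while `hnsw.ef_search` can be raised per session without rebuilding the index, so it is usually the cheaper knob to try first.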