Deploying rag-psych to Fly.io
Architecture: Fly.io for the API container, Neon for Postgres + pgvector. Both have free tiers that cover the demo. The API auto-stops when idle so the bill stays at $0 between visits.
Critical: never run --sources dsm5 against the remote DB. The DSM chunks are licensed for local personal use only and must stay in your laptop's pgdata volume. The scripts/seed_remote.sh helper rejects any attempt to ingest DSM remotely.
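A guard of this kind could be sketched as follows. This is an illustration only, not the contents of scripts/seed_remote.sh: the function names is_local_db and check_sources are hypothetical, and the sketch assumes "remote" means any connection string whose host is not localhost or 127.0.0.1.

```shell
# Hypothetical guard: refuse to ingest the dsm5 source into a non-local DB.

# Succeeds when the connection string clearly points at this machine.
# Anything it cannot recognise is treated as remote (fails closed).
is_local_db() {
  case "$1" in
    *@localhost[:/]*|*@127.0.0.1[:/]*|*//localhost[:/]*|*//127.0.0.1[:/]*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_sources DATABASE_URL SOURCES
# SOURCES is a comma- or space-separated list of source names.
# Returns 1 when dsm5 is requested against a non-local database.
check_sources() {
  db_url="$1"
  for src in $(printf '%s' "$2" | tr ',' ' '); do
    if [ "$src" = "dsm5" ] && ! is_local_db "$db_url"; then
      echo "refusing to ingest dsm5 into a non-local database" >&2
      return 1
    fi
  done
  return 0
}

# Example (not run here):
#   check_sources "$DATABASE_URL" "$SOURCES" || exit 1
```

Checking the host rather than trusting a flag means a mistyped DATABASE_URL cannot slip DSM chunks into Neon by accident.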
1. Provision Postgres on Neon (free)
- Sign up at console.neon.tech (GitHub auth, free).
- Create a new project. Region: pick one near your Fly region (we'll use ord below; Neon's aws-us-east-2 is closest).
- In the project dashboard, go to Settings → Extensions and enable vector. (One toggle. Neon ships pgvector pre-installed; you just activate it.)
- Copy the Connection string from the dashboard. It looks like:
  postgresql://USER:PASSWORD@ep-xyz-12345.us-east-2.aws.neon.tech/rag_psych?sslmode=require
- Apply our schema. From your laptop:
  psql 'postgresql://USER:PASSWORD@ep-xyz-12345.us-east-2.aws.neon.tech/rag_psych?sslmode=require' \
    -f ingest/schema.sql
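Before or after applying the schema, it can help to confirm that the connection string requests TLS and that the vector extension is actually active. The sketch below assumes psql is installed locally; has_sslmode_require and verify_pgvector are hypothetical helper names, while pg_extension is the standard PostgreSQL catalog table.

```shell
# Succeeds when the string is a Postgres URL that asks for sslmode=require.
has_sslmode_require() {
  case "$1" in
    postgresql://*sslmode=require*|postgres://*sslmode=require*) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_pgvector DATABASE_URL
# Prints the installed pgvector version, or nothing if the extension is off.
# Talks to the database over the network, so it only runs when called.
verify_pgvector() {
  has_sslmode_require "$1" || {
    echo "connection string lacks sslmode=require" >&2
    return 1
  }
  psql "$1" -At -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
}

# Example (not run here):
#   verify_pgvector 'postgresql://USER:PASSWORD@ep-xyz-12345.us-east-2.aws.neon.tech/rag_psych?sslmode=require'
```

If the query prints nothing, the extension is not enabled in that database; in stock PostgreSQL the usual remedy is running CREATE EXTENSION IF NOT EXISTS vector; through psql.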
2. Install Fly's CLI and authenticate
brew install flyctl # macOS; see fly.io/docs/hands-on/install-flyctl/ for others
fly auth signup # or: fly auth login (browser)
Free trial credit covers the demo for the first month. After that the runtime cost is dominated by uptime, which scale-to-zero pushes near zero.
3. Launch the Fly app (no deploy yet)
From the repo root:
fly launch --no-deploy --copy-config --name rag-psych-<your-suffix>
When prompted:
- Region: pick the same one your Neon DB is closest to (e.g. ord)
- Postgres: No (we're using Neon, not Fly Postgres)
- Redis: No
- Settings: keep the existing fly.toml (the --copy-config flag preserves it)
Open the generated fly.toml and update the app = "..." line to match
the unique name Fly assigned (the launcher overwrites it).
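To double-check which app name ended up in fly.toml without opening an editor, something like the following would do. It is a minimal sketch: fly_app_name is a hypothetical helper, and it assumes the name is written on a top-level line of the form app = "..." with double quotes.

```shell
# fly_app_name [PATH]
# Prints the value of the first top-level `app = "..."` line.
# PATH defaults to fly.toml in the current directory.
fly_app_name() {
  sed -n 's/^app[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "${1:-fly.toml}" | head -n 1
}

# Example (not run here): print the URL the app will be served from.
#   echo "https://$(fly_app_name).fly.dev"
```

The printed name should be the same string used in the CORS_ORIGIN secret and in the fly.dev URLs in later steps.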
4. Set secrets
These never appear in fly.toml or the image. They live encrypted in
Fly's secret store and are injected as env vars at runtime.
fly secrets set \
DATABASE_URL='postgresql://USER:PASSWORD@ep-xyz.us-east-2.aws.neon.tech/rag_psych?sslmode=require' \
ANTHROPIC_API_KEY='sk-ant-api03-...' \
ANTHROPIC_MODEL='claude-haiku-4-5' \
NCBI_EMAIL='you@example.com' \
ICD_CLIENT_ID='...' \
ICD_CLIENT_SECRET='...' \
EVAL_PASSWORD="$(python3 -c 'import secrets; print(secrets.token_urlsafe(16))')" \
CORS_ORIGIN='https://rag-psych-<your-suffix>.fly.dev'
⚠️ Rotate your local .env keys before deploying. The ANTHROPIC_API_KEY and EVAL_PASSWORD currently in .env should be regenerated for production; assume the old ones are compromised (anything that touches a chat with an LLM is). Console: console.anthropic.com/settings/keys.
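If the secret values are exported in your shell before you run fly secrets set, a small pre-flight check can catch a forgotten one. The sketch below is an assumption about workflow, not part of the repo: missing_secrets is a hypothetical helper, and the list of names is copied from the command in this step.

```shell
# Names taken from the `fly secrets set` command in this step.
REQUIRED_SECRETS="DATABASE_URL ANTHROPIC_API_KEY ANTHROPIC_MODEL NCBI_EMAIL ICD_CLIENT_ID ICD_CLIENT_SECRET EVAL_PASSWORD CORS_ORIGIN"

# Prints the name of every required variable that is unset or empty,
# one per line. Prints nothing when all of them have a value.
missing_secrets() {
  for name in $REQUIRED_SECRETS; do
    # Indirect lookup; safe because the names come from the fixed list above.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Example (not run here):
#   missing="$(missing_secrets)"
#   [ -z "$missing" ] || { echo "unset: $missing" >&2; exit 1; }
```

The check only looks at whether a value exists; it cannot tell a rotated key from a stale one.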
5. Deploy
fly deploy
First push uploads ~3.5 GB (models baked into the image; see
api/Dockerfile). Subsequent deploys only push the layers that changed,
so code edits redeploy in seconds.
When it finishes:
fly status # check machine health
fly logs # follow startup logs
open https://rag-psych-<your-suffix>.fly.dev/health
A {"status":"ok"} response means the API is up. The first /query
will pay a 5–10 s cold-start while the embedder + reranker load.
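Because a machine waking from auto-stop may not answer on the first try, a short polling loop is more forgiving than a single request. This is a minimal sketch assuming curl is installed and that /health returns the compact body {"status":"ok"} shown above; wait_for_health and its default attempt count and delay are illustrative choices, not values from the repo.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until the response body contains "status":"ok".
# Returns 0 on success, 1 if every attempt fails.
wait_for_health() {
  url="$1"
  attempts="${2:-12}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes HTTP errors count as failures; errors are swallowed so the
    # loop keeps going while the machine is still starting.
    body="$(curl -fsS --max-time 15 "$url" 2>/dev/null || true)"
    case "$body" in
      *'"status":"ok"'*)
        echo "healthy after $i attempt(s)"
        return 0
        ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Example (not run here):
#   wait_for_health "https://rag-psych-<your-suffix>.fly.dev/health"
```

The string match is deliberately strict; if the API ever formats its JSON with spaces, the pattern would need loosening.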
6. Seed the remote database
The Neon DB is empty after step 1. Run ingest from your laptop against it:
DATABASE_URL='postgresql://USER:PASSWORD@ep-xyz.us-east-2.aws.neon.tech/rag_psych?sslmode=require' \
./scripts/seed_remote.sh
This takes 10–15 minutes wall time, mostly the PubMed efetch loop on
the first run. The cached JSON files in data/cache/ mean re-runs are
near-instant.
After it finishes, hit the deployed UI:
open https://rag-psych-<your-suffix>.fly.dev/ui
7. Production hardening (recommended before sharing publicly)
| Item | How |
|---|---|
| Anthropic spend cap | Drop to $5/week at console.anthropic.com → Settings → Limits before sharing the URL |
| /eval password | Confirm EVAL_PASSWORD is random + long. Never the local-dev value. |
| Rate limit | Already 30/min/IP. If you see abuse, drop to 10/min in api/main.py's @limiter.limit decorator and fly deploy |
| CORS | Only the Fly subdomain in production. If you remove the localhost dev origin from the Fly secret, dev still works locally because your .env has its own value |
| Healthcheck grace | fly.toml already gives 30 s for cold-start. If you see flapping, bump to 60 s |
| Auto-stop | Already on (auto_stop_machines = "stop"). Verify with fly status; idle machines should report stopped |
8. Day-2 ops cheatsheet
fly status # which machines are running
fly logs # tail combined logs
fly logs --instance <id> # one machine
fly ssh console # shell into a running instance
fly secrets list # what's set (values not shown)
fly secrets unset SOME_KEY # remove
fly scale memory 4096 # bump RAM if rerank latency is bad
fly scale count 2 # add a second machine for redundancy
fly apps destroy rag-psych-<...> # nuke everything (careful)
9. Things you'll feel in production that you don't on localhost
- First request is slow (5–10 s) when the machine just woke from auto-stop. Subsequent requests in the same minute are fast.
- Latency floor is higher because every request crosses the public internet (Fly → Neon) instead of localhost. Expect ~2× the times shown in eval/results/*.json.
- Anthropic costs scale with usage. A naive demo URL hit by 100 curious visitors at 5 queries each = ~$2 of Haiku tokens. Spend cap protects you.
- No DSM in answers. If a query that worked locally suddenly returns the refusal string in production, you're hitting the DSM-shaped hole in the public corpus; that's expected.
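The cost estimate in the list above can be turned into a per-query figure and compared against the $5/week spend cap from step 7. The sketch below only redoes that arithmetic with awk; the ~$2 total is the document's own rough estimate, and the function names are hypothetical.

```shell
# cost_per_query VISITORS QUERIES_EACH TOTAL_USD
# Prints the average cost of one query in dollars.
cost_per_query() {
  awk -v v="$1" -v q="$2" -v usd="$3" 'BEGIN { printf "%.4f\n", usd / (v * q) }'
}

# queries_under_cap CAP_USD VISITORS QUERIES_EACH TOTAL_USD
# Prints how many queries fit under the cap at that average cost.
queries_under_cap() {
  awk -v cap="$1" -v v="$2" -v q="$3" -v usd="$4" 'BEGIN { printf "%.0f\n", cap * v * q / usd }'
}

cost_per_query 100 5 2        # prints 0.0040
queries_under_cap 5 100 5 2   # prints 1250
```

At roughly $0.004 per query, a $5/week cap allows about 1,250 queries per week before Anthropic starts rejecting requests, assuming query sizes stay close to the estimate.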
Why not Fly Postgres?
Fly's managed Postgres is fine but you'd have to install pgvector via a custom Postgres image and manage it yourself, then pay $5/mo for the smallest instance. Neon's free tier is functionally equivalent for our scale (~30K chunks) and has zero setup beyond toggling the extension.
If you want everything inside Fly's network later (lower latency,
fewer external dependencies), swap DATABASE_URL to a Fly Postgres
endpoint and re-run step 6. The application code doesn't care.