hasari-api / docs /DEPLOY_GUIDE.md
erdoganpeker's picture
v0.3.0 β€” multimodal vehicle damage MVP
e327f0d
# Deploy Guide β€” Render.com
End-to-end walkthrough for deploying HasarΔ° to Render.com using `render.yaml`. Covers prerequisites, infra setup, environment configuration, the first deploy, smoke tests, monitoring, rollback, and cost.
> Target audience: anyone with shell access and a Render account, no prior Render experience required.
---
## Prerequisites
Before you start, you need:
| Item | Why | How to get it |
|---|---|---|
| **Render account** | Hosts the API + worker | [render.com](https://render.com) β€” free tier ok for the web service; Postgres/Redis are paid |
| **GitHub access to the repo** | Render builds from git | `arac-hasar-v2` repo permissions |
| **AWS S3 bucket** (or compatible) | Image storage (uploads + visualizations) | Or use Cloudflare R2 / Backblaze B2 β€” anything S3-compatible |
| **AWS IAM access key** with `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject` on the bucket | Backend uploads/signs URLs | AWS console β†’ IAM |
| **Custom domain** (optional) | Branded URL | Any registrar, point CNAME at Render |
| **A strong `JWT_SECRET_KEY`** | Signs auth tokens | `openssl rand -base64 48` |
| **Sentry DSN** (optional) | Error tracking | [sentry.io](https://sentry.io) |
**Time estimate**: 45–90 minutes for the first deploy. Subsequent deploys are git-push fast.
---
## Step 1 β€” Provision infrastructure
The `render.yaml` at the repo root declares all services. Render reads it on first connect and creates everything in one go.
### 1a. Create the Postgres database
In the Render dashboard:
1. **New +** β†’ **PostgreSQL**
2. **Name**: `hasari-db`
3. **Database**: `hasari`
4. **User**: `hasari`
5. **Region**: pick the one nearest your users (Frankfurt for EU/TR, Oregon for US)
6. **Plan**: **Starter** ($7/month) is sufficient for the pilot. Free tier expires after 90 days β€” do not use it for production.
7. **Create database**.
Copy the **Internal Database URL** (format: `postgres://hasari:…@…/hasari`). You'll paste it as `DATABASE_URL` in step 2.
### 1b. Create the Redis instance
1. **New +** β†’ **Redis**
2. **Name**: `hasari-redis`
3. **Region**: same as Postgres
4. **Plan**: **Starter** ($10/month). Free tier has no persistence; do not use for production.
5. **Maxmemory policy**: `allkeys-lru`
6. **Create Redis**.
Copy the **Internal Redis URL** β€” you'll use it as `REDIS_URL`.
### 1c. Create the S3 bucket
In the AWS console:
1. S3 β†’ **Create bucket** β†’ `hasari-uploads-prod` (or your name), region matching the API for low latency
2. **Block all public access** β€” yes, keep all four boxes checked. The backend serves presigned URLs; no public listing.
3. **Versioning**: disabled (uploads are immutable; no need for revisions)
4. **Server-side encryption**: SSE-S3 (default) is fine
5. After creation: **Permissions** β†’ **CORS** β†’ add:
```json
[
{
"AllowedHeaders": ["*"],
"AllowedMethods": ["GET", "PUT", "POST"],
"AllowedOrigins": ["https://hasari.app", "https://hasari-api.onrender.com"],
"ExposeHeaders": ["ETag", "x-amz-request-id"],
"MaxAgeSeconds": 3000
}
]
```
Replace the origins with your actual web app URL.
6. IAM β†’ create an IAM user `hasari-backend`, attach an inline policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::hasari-uploads-prod/*"
},
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::hasari-uploads-prod"
}
]
}
```
Create an access key for this user β€” save both halves now, you cannot retrieve the secret again.
---
## Step 2 β€” Configure the backend service
### 2a. Connect the repo
1. Render dashboard β†’ **New +** β†’ **Blueprint**
2. Connect your GitHub account, pick `arac-hasar-v2`
3. Render detects `render.yaml` and shows the services it will create:
- `hasari-api` β€” web service (FastAPI)
- `hasari-worker` β€” background worker (Celery)
4. Click **Apply** to create both.
### 2b. Set environment variables
For **each** of `hasari-api` and `hasari-worker`, go to **Environment** and add:
#### Required
| Name | Example | Description | Security note |
|---|---|---|---|
| `ENVIRONMENT` | `production` | Disables dev auth fallback | Never set to `dev` here |
| `DATABASE_URL` | `postgres://…` | From step 1a | Internal URL only; no external traffic |
| `REDIS_URL` | `redis://…` | From step 1b | Internal URL only |
| `JWT_SECRET_KEY` | (32+ char random string) | Signs JWTs | Generate fresh: `openssl rand -base64 48`. Rotating invalidates all existing sessions. |
| `S3_BUCKET` | `hasari-uploads-prod` | Bucket name | β€” |
| `S3_REGION` | `eu-central-1` | AWS region | β€” |
| `S3_ACCESS_KEY` | `AKIA…` | From step 1c | Use a dedicated IAM user, not your root key |
| `S3_SECRET_KEY` | `…` | From step 1c | Mark this var as "secret" in Render UI |
| `S3_ENDPOINT_URL` | (blank for AWS) | Set only for R2/MinIO/B2 | β€” |
| `CORS_ORIGINS` | `https://hasari.app,https://www.hasari.app` | Comma-separated allowed web origins | Never use `*` in production |
#### Recommended
| Name | Default | Description |
|---|---|---|
| `ACCESS_TOKEN_MINUTES` | `30` | Short access TTL β€” keeps damage from a stolen token bounded |
| `REFRESH_TOKEN_DAYS` | `7` | Refresh TTL β€” balance UX vs. risk |
| `MAX_IMAGE_SIZE_MB` | `12` | Per-image upload limit |
| `MAX_IMAGES_SYNC` | `5` | Sync mode cap |
| `MAX_IMAGES_ASYNC` | `20` | Async mode cap |
| `SENTRY_DSN` | (blank) | Enable Sentry error tracking |
| `LOG_LEVEL` | `INFO` | `DEBUG` for troubleshooting; keep `INFO` in prod |
#### ML service
| Name | Default | Description |
|---|---|---|
| `ML_MODEL_DIR` | `/app/models` | Path to YOLO `.pt` weight files inside the container |
| `ML_DEVICE` | `cpu` | `cuda` requires a GPU instance (Render does not offer GPU β€” keep CPU on Render and offload heavy ML to a separate GPU host or external service for production) |
> **GPU note**: Render does not currently offer GPU instances. For the pilot, the backend runs YOLO on CPU β€” slower (~5–10Γ— CPU vs. GPU). For production loads above ~50 inspections/hour, host the ML service separately on a GPU VPS (Hetzner, RunPod, etc.) and point `ML_SERVICE_URL` at it. Architecture diagram in [README.md](../README.md#architecture).
### 2c. Build & start commands
If `render.yaml` is missing these, set them manually:
**hasari-api** (web service):
- Build: `pip install -r services/backend/requirements.txt`
- Start: `cd services/backend && alembic upgrade head && uvicorn main:app --host 0.0.0.0 --port $PORT`
**hasari-worker** (background worker):
- Build: same
- Start: `cd services/backend && celery -A worker worker --loglevel=info --concurrency=2`
---
## Step 3 β€” First deploy
1. Both services auto-deploy on push to `main`. Trigger the first deploy:
- Go to **hasari-api** β†’ **Manual Deploy** β†’ **Deploy latest commit**.
2. Watch the build log. Expected duration: 8–15 minutes (downloads PyTorch, YOLO weights).
3. The first start runs `alembic upgrade head` β€” DB schema is created.
4. Once "Your service is live" appears, hit the health endpoint:
```bash
curl https://hasari-api.onrender.com/health
```
Expected:
```json
{"status":"ok","ml_loaded":true,"timestamp":"2026-05-15T...","version":"0.1.0"}
```
If `ml_loaded` is `false`, check the start log for "ML pipeline init failed" β€” usually means model weights are missing from the container. Run a one-off SSH session and run `python -c "from ml_service import ml_pipeline; print(ml_pipeline.is_loaded())"`.
---
## Step 4 β€” Create the admin user
The first admin must be created out-of-band β€” there is no admin-registration UI. SSH into the API container via the Render dashboard (**Shell** tab) and run:
```bash
cd services/backend
python -c "
from database import init_db
from auth import _repo
from security import hash_password
init_db()
user = _repo.create(
email='admin@yourcompany.com',
password_hash=hash_password('CHANGE_ME_now_strong_password'),
full_name='Admin',
)
# Promote
import psycopg
with psycopg.connect('$DATABASE_URL') as conn:
conn.execute('UPDATE users SET role=%s WHERE id=%s', ('admin', user['id']))
conn.commit()
print('Admin user created:', user['email'])
"
```
Sign in immediately at `https://hasari.app/login` and rotate the password through the UI.
---
## Step 5 β€” Smoke test checklist
Before announcing the deploy is "done", run through this list. Each item should pass on the first try.
- [ ] `GET /health` returns 200 with `ml_loaded: true`
- [ ] `GET /api/v1/version` returns expected git SHA and build time
- [ ] `POST /auth/register` with a new email returns 201 + token pair
- [ ] `POST /auth/login` with that email returns 200 + new token pair
- [ ] `GET /auth/me` with the access token returns the user
- [ ] `POST /auth/refresh` with the refresh token returns a new token pair
- [ ] `POST /api/v1/inspect/sync` with a 1MB JPG returns 200 with parts/damages JSON within 15 seconds
- [ ] `GET /api/v1/inspect` returns the inspection in the list
- [ ] `GET /api/v1/inspect/{id}/visualization/annotated` redirects to a presigned S3 URL that returns a PNG
- [ ] `DELETE /api/v1/inspect/{id}` removes it (subsequent GET returns 404)
- [ ] Web app loads at custom domain, language defaults to TR
- [ ] Sign in via web app, complete one inspection end-to-end
- [ ] Open Render logs β€” no `ERROR` or `CRITICAL` entries in the past hour
- [ ] Postgres connection count < 20 (visible in Render β†’ hasari-db β†’ Metrics)
If any item fails, **do not** announce the launch. See [Troubleshooting](#troubleshooting).
---
## Monitoring & log access
### Logs
- **Render dashboard**: hasari-api β†’ **Logs** tab β€” live tail.
- **CLI**: `render logs --service hasari-api --tail` (install [render-cli](https://render.com/docs/cli)).
- **Structured JSON**: every log line is `{"time":..., "level":..., "logger":..., "msg":...}` β€” pipe to `jq` for filtering.
### Metrics
- **Render built-in**: CPU, memory, response time, throughput visible per service in the dashboard.
- **Prometheus**: scrape `https://hasari-api.onrender.com/metrics` (requires `Authorization: Bearer <admin token>`). See `observability/` for a Grafana dashboard JSON to import.
### Alerts
Configure in Render β†’ service β†’ **Notifications**:
- **Deploy failed** β†’ Slack/email
- **Service crashed** β†’ on-call rotation
- **Disk usage > 80%** β†’ Slack
For app-level alerts (error rate > 1%, p95 latency > 3s), set up Sentry alerts on the `SENTRY_DSN` project.
---
## Rolling back a bad deploy
If the latest deploy is broken:
1. Render dashboard β†’ **hasari-api** β†’ **Events** tab.
2. Find the last known-good deploy (green checkmark).
3. Click **Rollback to this deploy**.
4. Confirm. Render redeploys the previous Docker image β€” takes ~30 seconds.
For database migrations that cannot be rolled back automatically:
```bash
# In the Render shell:
cd services/backend
alembic downgrade -1
```
**Important**: never `alembic downgrade` a migration that dropped a column with live data β€” you will lose data. Pre-launch, test every migration's `downgrade()` against a copy of production data.
---
## Cost estimate (monthly, pilot scale)
| Item | Plan | Cost |
|---|---|---|
| Render web service (`hasari-api`) | Starter (512 MB) | $7 |
| Render background worker (`hasari-worker`) | Starter (512 MB) | $7 |
| Render Postgres | Starter | $7 |
| Render Redis | Starter | $10 |
| AWS S3 (10 GB storage, 100k req/month) | Pay-as-you-go | ~$1 |
| AWS data transfer (out) | Pay-as-you-go | ~$2 |
| Custom domain | (you own it) | $0 |
| Sentry (free tier) | Developer | $0 |
| **Total** | | **~$34/month** |
Scaling beyond ~500 inspections/day will require:
- Larger Render plans (Standard: $25/service)
- Moving ML to a GPU VPS (Hetzner GPU: $80/month)
- S3 storage growth: $0.023/GB/month
---
## Troubleshooting
### `ml_loaded: false` at startup
**Cause**: model weights missing or wrong path.
**Fix**: ensure `services/ml/yolo11m-seg.pt`, `yolo11s-seg.pt`, `yolo11n-cls.pt` are committed to the repo or downloaded in the build step. Check `ML_MODEL_DIR` env var.
### 503 "Is kuyrugu su an kullanilamiyor"
**Cause**: Celery worker can't reach Redis.
**Fix**: confirm `REDIS_URL` is set on **hasari-worker** (not just api). Restart worker.
### Postgres connection limit exceeded
**Cause**: too many open connections β€” usually a long-running query or leaked sessions.
**Fix**: check Render β†’ Postgres β†’ Metrics β†’ "Active connections". Restart API service to drop them. Add `pool_pre_ping=True` and `pool_recycle=300` to SQLAlchemy engine config.
### S3 403 on upload
**Cause**: IAM policy doesn't grant `s3:PutObject`, or bucket name typo, or wrong region.
**Fix**: run `aws s3 ls s3://your-bucket-name --region eu-central-1` from anywhere with the same credentials to verify.
### Web app CORS error
**Cause**: `CORS_ORIGINS` doesn't include the actual origin (don't forget `https://`, no trailing slash).
**Fix**: update env var, redeploy API.
### Health check passing but inspections always fail with 500
**Cause**: usually an unhandled exception in the ML pipeline (corrupt model, OOM, missing class).
**Fix**: enable `LOG_LEVEL=DEBUG`, reproduce, read the traceback in logs. If it's OOM, upgrade to Standard plan or move ML off-box.
### Migration locked / `alembic upgrade head` hangs
**Cause**: a previous migration left a lock in the `alembic_version` table.
**Fix**: in psql, `DELETE FROM alembic_locks;` (table name varies β€” check your alembic config) or set `LOCK_TIMEOUT` and retry.
---
## Post-launch monitoring (first 48 hours)
Watch these metrics every 4 hours for the first two days:
- **Error rate** (Sentry): target < 0.5% of requests
- **API latency p95** (Render metrics): target < 1.5 s for non-inspection endpoints
- **Inspection latency p95**: target < 12 s end-to-end for 4-photo batches
- **Database active connections**: target < 50% of pool max
- **Redis memory**: target < 80% of plan limit
- **Failed inspection rate**: target < 2% of jobs reaching `failed`
If any metric exceeds target for 30+ minutes, treat as a P1 incident.
---
## Related docs
- [API_GUIDE.md](API_GUIDE.md) β€” REST contract for smoke tests
- [AUTH_FLOW.md](AUTH_FLOW.md) β€” token lifecycle (env vars relevant)
- [LAUNCH_CHECKLIST.md](LAUNCH_CHECKLIST.md) β€” pre-go-live sign-off gates
- [OBSERVABILITY_SETUP.md](OBSERVABILITY_SETUP.md) β€” Prometheus + Grafana wiring