	# Monitoring & Incident Runbook

	## BullBoard Dashboard

	BullBoard is live at `/v1/internal/queues` on the API service.

	Access requires either:
	- an `Authorization: Bearer <ADMIN_API_KEY>` header, OR
	- a valid JWT with the `ORG_ADMIN` or `SUPER_ADMIN` role

	Both queues are visible: `whatsapp-queue` and `notification-queue`.

	What to look for:
	- Failed tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
	- Delayed tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
	- Active tab: jobs that have been running too long. Anything active for more than 2 minutes is likely hung.
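The 2-minute rule for the Active tab can also be checked in code. The following is a minimal sketch, assuming each active job exposes a `processedOn` timestamp in epoch milliseconds (as Bull/BullMQ jobs do); `ActiveJobInfo` and `findLikelyHung` are hypothetical names, not part of this codebase.

```typescript
// Hypothetical shape: only the fields this sketch needs.
interface ActiveJobInfo {
  id: string;
  jobName: string;
  processedOn: number; // epoch ms at which a worker picked the job up
}

// The 2-minute rule of thumb from the runbook.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

// Returns the active jobs that have been running longer than the threshold.
function findLikelyHung(
  jobs: ActiveJobInfo[],
  nowMs: number,
  thresholdMs: number = HUNG_THRESHOLD_MS
): ActiveJobInfo[] {
  return jobs.filter((job) => nowMs - job.processedOn > thresholdMs);
}

// Example with made-up jobs: one started 30 s ago, one 5 min ago.
const nowMs = Date.parse("2024-01-01T12:00:00Z");
const sampleJobs: ActiveJobInfo[] = [
  { id: "1", jobName: "send-message", processedOn: nowMs - 30 * 1000 },
  { id: "2", jobName: "generate-feedback", processedOn: nowMs - 5 * 60 * 1000 },
];
const hungIds = findLikelyHung(sampleJobs, nowMs).map((job) => job.id);
console.log(hungIds);
```

Only the job that has been active for 5 minutes is flagged; the threshold is a parameter so it can be tuned per queue.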

	---

	## Railway Log Alerts

	Configure these alert rules in the Railway dashboard under Logs → Alerts:

	| Pattern | Severity | Action |
	|---------|----------|--------|
	| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
	| `[WORKER] Job.*failed` | High | Notify team channel |
	| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
	| `[REDIS] Worker connection error` | Critical | Page on-call |
	| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
	| `Daily Limit Reached` | Low | Review weekly |
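To sanity-check the alert rules against sample log lines before entering them in Railway, the table can be mirrored as data. This is a sketch only: `ALERT_RULES` and `classifyLogLine` are hypothetical names, the patterns are written as JavaScript regular expressions with the brackets escaped, and Railway's own matching syntax may differ.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
  action: string;
}

// Mirrors the alert table; order matters because the first match wins.
const ALERT_RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium", action: "Log review next morning" },
  { pattern: /Daily Limit Reached/, severity: "Low", action: "Review weekly" },
];

// Returns the first rule whose pattern matches the line, or undefined.
function classifyLogLine(line: string): AlertRule | undefined {
  return ALERT_RULES.find((rule) => rule.pattern.test(line));
}

const redisHit = classifyLogLine("[REDIS] Worker connection error: ECONNRESET");
const limitHit = classifyLogLine(
  "[WORKER] Skipping job send-message for Org X: Daily Limit Reached"
);
const noHit = classifyLogLine("[API] Server listening");
console.log(redisHit && redisHit.severity, limitHit && limitHit.severity, noHit);
```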

	---

	## Manual Re-queue Procedure

	When a job fails permanently and needs to be retried:

	### Via BullBoard UI
	1. Navigate to `/v1/internal/queues`
	2. Select the queue (`whatsapp-queue`)
	3. Open the Failed tab
	4. Click Retry on the job, or Retry All for bulk recovery

	### Via Redis CLI (emergency)
	```bash
	# Connect to Redis
	redis-cli -u "$REDIS_URL"

	# List failed job IDs for whatsapp-queue
	# (Bull/BullMQ keeps failed jobs in a sorted set, so use ZRANGE; LRANGE returns a WRONGTYPE error)
	ZRANGE bull:whatsapp-queue:failed 0 -1

	# Move a specific job back to waiting
	# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
	```

	### Via API (programmatic)
	```bash
	# Trigger a manual broadcast retry
	curl -X POST https://api.xamle.studio/v1/internal/ping \
	  -H "Authorization: Bearer $ADMIN_API_KEY"
	```

	---

	## Common Incidents

	### All exercise feedback silently ignored
	Symptom: Users send exercise answers but receive no feedback.
	Cause: A `generate-feedback` job is enqueued but no handler is registered for it.
	Fix: Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.
	Check: `grep "generate-feedback" logs`; the output should include `[WORKER] Processing job: generate-feedback`.
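The failure mode is easier to see with a toy dispatcher. This is an illustrative sketch, not the worker's actual code: `jobHandlers` and `dispatchJob` are hypothetical, and only the two log messages are taken from this runbook.

```typescript
type JobHandler = (payload: unknown) => string;

// Hypothetical registry keyed by job name.
const jobHandlers = new Map<string, JobHandler>();
jobHandlers.set("send-message", () => "sent");
// If this registration is missing, every generate-feedback job is dropped.
jobHandlers.set("generate-feedback", () => "feedback generated");

// Looks up the handler by job name; logs and returns undefined if none exists.
function dispatchJob(
  jobName: string,
  payload: unknown,
  log: (msg: string) => void = console.log
): string | undefined {
  const handler = jobHandlers.get(jobName);
  if (!handler) {
    log(`[WORKER] No handler found for job name: ${jobName}`);
    return undefined;
  }
  log(`[WORKER] Processing job: ${jobName}`);
  return handler(payload);
}

const workerLogs: string[] = [];
const collect = (msg: string): void => {
  workerLogs.push(msg);
};
const handled = dispatchJob("generate-feedback", {}, collect);
const dropped = dispatchJob("unknown-job", {}, collect);
console.log(handled, dropped, workerLogs);
```

A job with no registered handler produces no result, only a log line, which is why the `No handler found` pattern is worth a Critical alert.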

	### Daily message limit hit for an org
	Symptom: `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` appears in logs.
	Check: `UsageService.getDailyUsage(orgId)` returns a count at or near the limit.
	Fix: Increase the limit in `Organization.dailyMessageLimit`, or wait until midnight UTC for the reset.
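When deciding between raising the limit and waiting, it helps to know how long remains until the midnight UTC reset. A minimal sketch with hypothetical helpers follows; the 90% "near limit" ratio is an assumption for illustration, not a value from the codebase.

```typescript
// Milliseconds from `now` until the next 00:00 UTC.
function msUntilMidnightUtc(now: Date): number {
  // Date.UTC normalizes day overflow, so "day + 1" also works on the last day of a month.
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1
  );
  return nextMidnight - now.getTime();
}

// True when usage has reached the given share of the daily limit (assumed default: 90%).
function isNearDailyLimit(used: number, limit: number, ratio: number = 0.9): boolean {
  return limit > 0 && used / limit >= ratio;
}

const lateEvening = new Date("2024-01-31T23:00:00Z");
const minutesLeft = msUntilMidnightUtc(lateEvening) / 60000;
console.log(minutesLeft, isNearDailyLimit(950, 1000));
```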

	### WhatsApp message delivered but DB not updated
	Symptom: The user received the message, but `UserProgress.exerciseStatus` is still `PENDING`.
	Cause: Atomicity rule violation. `sendWhatsApp` was called before `prisma.update`, and the update then failed or never ran.
	Fix: Always update the DB first. Review any recent changes to handler files.
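The DB-first ordering can be sketched as follows. The dependencies are injected so the example runs without Prisma or WhatsApp; `ExerciseDeps`, `completeExercise`, and the `"COMPLETED"` status value are hypothetical stand-ins, not names from the codebase.

```typescript
// Hypothetical dependencies standing in for prisma.update and sendWhatsApp.
interface ExerciseDeps {
  updateProgress: (userId: string, status: string) => Promise<void>;
  sendWhatsApp: (userId: string, text: string) => Promise<void>;
}

// Persist first, then send: if the DB write throws, no message goes out,
// so the user never sees feedback that the database does not know about.
async function completeExercise(
  deps: ExerciseDeps,
  userId: string,
  text: string
): Promise<void> {
  await deps.updateProgress(userId, "COMPLETED"); // assumed status value
  await deps.sendWhatsApp(userId, text);
}

// Demo with in-memory fakes that record the order of calls.
const callOrder: string[] = [];
const demoDeps: ExerciseDeps = {
  updateProgress: async () => {
    callOrder.push("db");
  },
  sendWhatsApp: async () => {
    callOrder.push("send");
  },
};
const demoRun = completeExercise(demoDeps, "user-1", "Nice work!");
demoRun.then(() => console.log(callOrder));
```

With this ordering a failed send leaves the DB ahead of the user, which a retry can repair; the reverse order produces the symptom described above.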

	### Redis connection lost
	Symptom: `[REDIS] Worker connection error` repeating in logs. All jobs stall.
	Fix: Railway auto-restarts the worker. If the error persists, check the Redis service health in the Railway dashboard and rotate `REDIS_URL` if needed.