# Monitoring & Incident Runbook

## BullBoard Dashboard

BullBoard is live at `/v1/internal/queues` on the API service.

**Access:** Requires either:
- `Authorization: Bearer ` header, OR
- Valid JWT with `ORG_ADMIN` or `SUPER_ADMIN` role

Both queues are visible: `whatsapp-queue` and `notification-queue`.

**What to look for:**
- **Failed** tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- **Delayed** tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
- **Active** tab: jobs that have been running too long (more than 2 minutes usually means the job is hung).

---

## Railway Log Alerts

Configure these alert rules in the Railway dashboard under **Logs → Alerts**:

| Pattern | Severity | Action |
|---------|----------|--------|
| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |

---

## Manual Re-queue Procedure

When a job fails permanently and needs to be retried:

### Via BullBoard UI

1. Navigate to `/v1/internal/queues`
2. Select the queue (`whatsapp-queue`)
3. Go to the **Failed** tab
4. Click **Retry** on the job, or **Retry All** for bulk recovery

### Via Redis CLI (emergency)

```bash
# Connect to Redis
redis-cli -u "$REDIS_URL"

# List failed job IDs for whatsapp-queue (run at the redis-cli prompt).
# Bull/BullMQ keep failed jobs in a sorted set, so use ZRANGE;
# LRANGE returns a WRONGTYPE error on this key.
ZRANGE bull:whatsapp-queue:failed 0 -1

# Move a specific job back to waiting
# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
```

### Via API (programmatic)

```bash
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
  -H "Authorization: Bearer $ADMIN_API_KEY"
```

---

## Common Incidents

### All exercise feedback silently ignored

**Symptom:** Users send exercise answers but receive no feedback.
**Cause:** The `generate-feedback` job is enqueued but no handler is registered for it.
**Fix:** Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.
**Check:** `grep "generate-feedback" logs` should show `[WORKER] Processing job: generate-feedback`.

### Daily message limit hit for an org

**Symptom:** `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` in logs.
**Check:** `UsageService.getDailyUsage(orgId)` returns a count at or near the limit.
**Fix:** Increase the limit in `Organization.dailyMessageLimit`, or wait until midnight UTC for the reset.

### WhatsApp message delivered but DB not updated

**Symptom:** The user received the message, but `UserProgress.exerciseStatus` is still `PENDING`.
**Cause:** Atomicity rule violation: `sendWhatsApp` was called before `prisma.update`.
**Fix:** Always update the DB first. Review any recent changes to handler files.

### Redis connection lost

**Symptom:** `[REDIS] Worker connection error` repeats in the logs and all jobs stall.
**Fix:** Railway auto-restarts the worker. If the error persists, check Redis service health in the Railway dashboard and rotate `REDIS_URL` if needed.
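
### Quick log triage (sketch)

When working through any of the incidents above from saved logs, counting how often each alert pattern appears can show which failure mode dominates. The following is a minimal sketch, not an established project script. It assumes the worker logs have been exported to a local file (`worker.log` in the usage note is a hypothetical name) and that log lines contain the literal strings from the alert table. `triage_logs` is an illustrative helper name.

```bash
# Count lines matching each runbook alert pattern in a saved log file.
# Assumption: the log file is a plain-text export; its name is up to you.
triage_logs() (
  log_file=$1
  [ -r "$log_file" ] || { echo "cannot read log file: $log_file" >&2; exit 1; }

  # Literal patterns: grep -F matches "[" and "]" as plain characters.
  while IFS= read -r pattern; do
    # grep -c prints the number of matching lines (0 if none).
    count=$(grep -F -c -- "$pattern" "$log_file" || true)
    printf '%6d  %s\n' "$count" "$pattern"
  done <<'EOF'
[WORKER] No handler found for job name
[FEEDBACK] Feedback generation failed
[REDIS] Worker connection error
[BRIDGE] Could not resolve organizationId
Daily Limit Reached
EOF

  # The one wildcard rule from the table; brackets are escaped for grep -E
  # so they are not read as a character class.
  count=$(grep -E -c -- '\[WORKER\] Job.*failed' "$log_file" || true)
  printf '%6d  %s\n' "$count" '[WORKER] Job.*failed'
)
```

Run it as `triage_logs worker.log`. Each output line is a count followed by the pattern, so a non-zero count next to `[WORKER] No handler found for job name` points at the missing-handler incident, and a high count next to `[REDIS] Worker connection error` points at the Redis incident.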