CognxSafeTrack
feat: backlog P0-P3 - toast system, payments, tenant isolation, feedback handler, i18n parity
6dd9bad

# Monitoring & Incident Runbook

## BullBoard Dashboard

BullBoard is live at `/v1/internal/queues` on the API service.

**Access:** Requires either:

- `Authorization: Bearer <ADMIN_API_KEY>` header, OR
- A valid JWT with the `ORG_ADMIN` or `SUPER_ADMIN` role
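
The two access paths can be sketched as a single check. This is a minimal illustration, not the service's actual middleware: `canAccessQueues`, `AccessRequest`, and the `MEMBER` role are hypothetical names, and the JWT is assumed to have been verified before its roles reach this function.

```typescript
// Role names ORG_ADMIN and SUPER_ADMIN come from the runbook; MEMBER is a
// hypothetical non-admin role added for illustration.
type Role = "ORG_ADMIN" | "SUPER_ADMIN" | "MEMBER";

interface AccessRequest {
  authorizationHeader?: string; // raw value of the Authorization header
  jwtRoles?: Role[];            // roles taken from an already-verified JWT
}

const ALLOWED_ROLES: ReadonlySet<Role> = new Set<Role>(["ORG_ADMIN", "SUPER_ADMIN"]);

function canAccessQueues(req: AccessRequest, adminApiKey: string): boolean {
  // Path 1: the static admin key sent as a Bearer token.
  // An empty configured key never grants access.
  if (adminApiKey.length > 0 && req.authorizationHeader === `Bearer ${adminApiKey}`) {
    return true;
  }
  // Path 2: a verified JWT that carries one of the admin roles.
  return (req.jwtRoles ?? []).some((role) => ALLOWED_ROLES.has(role));
}
```

A production check would likely compare the key in constant time; plain equality is used here only to keep the sketch short.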

Both queues are visible: `whatsapp-queue` and `notification-queue`.

**What to look for:**

- **Failed** tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- **Delayed** tab: jobs stuck waiting, usually because of a Redis issue or backoff after failures.
- **Active** tab: jobs that have been running for more than 2 minutes are likely hung.
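
The 2-minute rule for the **Active** tab can also be applied in code. The sketch below assumes the queues run on BullMQ, whose jobs expose a `processedOn` timestamp in epoch milliseconds; `ActiveJob` and `findLikelyHung` are hypothetical names.

```typescript
// Minimal shape of an active job. `processedOn` mirrors the BullMQ field of
// the same name (epoch milliseconds at which processing started).
interface ActiveJob {
  id: string;
  name: string;
  processedOn?: number;
}

// The runbook's rule of thumb: active for more than 2 minutes = likely hung.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

function findLikelyHung(jobs: ActiveJob[], nowMs: number): ActiveJob[] {
  // Jobs without a start timestamp are skipped rather than guessed at.
  return jobs.filter(
    (job) => job.processedOn !== undefined && nowMs - job.processedOn > HUNG_THRESHOLD_MS,
  );
}
```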

---

## Railway Log Alerts

Configure these alert rules in the Railway dashboard under **Logs → Alerts**:

| Pattern | Severity | Action |
|---------|----------|--------|
| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |
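
To check the rules against sample log lines before configuring them, the table can be mirrored in a small classifier. This sketch assumes the patterns are interpreted as regular expressions (the table does not say so), which means the literal brackets must be escaped: unescaped, `[WORKER]` is a character class that matches a single letter. `AlertRule`, `RULES`, and `classify` are hypothetical names.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
}

// Same rows, same order as the table above; brackets escaped to match literally.
const RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium" },
  { pattern: /Daily Limit Reached/, severity: "Low" },
];

// Returns the severity of the first matching rule, or undefined if none match.
function classify(line: string): Severity | undefined {
  return RULES.find((rule) => rule.pattern.test(line))?.severity;
}
```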

---

## Manual Re-queue Procedure

When a job fails permanently and needs to be retried:

### Via BullBoard UI

1. Navigate to `/v1/internal/queues`
2. Select the queue (`whatsapp-queue`)
3. Go to the **Failed** tab
4. Click **Retry** on the job, or **Retry All** for bulk recovery

### Via Redis CLI (emergency)
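
```bash
# Connect to Redis (quote the URL so special characters survive the shell)
redis-cli -u "$REDIS_URL"

# List failed job IDs for whatsapp-queue
# (assumes the BullMQ key layout, where failed jobs are kept in a sorted set,
#  so ZRANGE is needed; LRANGE would return a WRONGTYPE error)
ZRANGE bull:whatsapp-queue:failed 0 -1

# Move a specific job back to waiting
# (BullBoard UI is preferred; use CLI only when dashboard is unreachable)
```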
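
Moving a job back to waiting by hand-editing Redis keys is error-prone, so a script that goes through the queue library is usually safer when the dashboard is down. The sketch below assumes the worker uses BullMQ; to my understanding its `Queue.getFailed(start, end)` and `Job.retry(state)` methods fit the two structural interfaces declared here, so a real `Queue` instance could be passed in. `FailedJobSource`, `RetryableJob`, and `retryFailed` are hypothetical names.

```typescript
// Structural stand-ins for the parts of the queue library this sketch needs.
interface RetryableJob {
  id?: string;
  retry(state?: "failed" | "completed"): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retries up to `limit` failed jobs and returns the IDs that were re-queued.
async function retryFailed(queue: FailedJobSource, limit = 100): Promise<string[]> {
  const jobs = await queue.getFailed(0, limit - 1); // end index is inclusive
  const retried: string[] = [];
  for (const job of jobs) {
    // Sequential on purpose: a burst of retries should not flood the worker.
    await job.retry("failed");
    if (job.id !== undefined) retried.push(job.id);
  }
  return retried;
}
```

Keeping the function behind an interface also makes it testable with an in-memory fake, without a Redis connection.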

### Via API (programmatic)

```bash
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
  -H "Authorization: Bearer $ADMIN_API_KEY"
```

---

## Common Incidents

### All exercise feedback silently ignored

**Symptom:** Users send exercise answers but receive no feedback.

**Cause:** The `generate-feedback` job is enqueued, but no handler is registered for it.

**Fix:** Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.

**Check:** `grep "generate-feedback" logs` should show `[WORKER] Processing job: generate-feedback`.
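
One way to keep this failure from being silent is to make dispatch throw when a job name has no handler, so the job is marked failed instead of being dropped. The registry below is a hypothetical sketch; the real wiring in `apps/whatsapp-worker/src/index.ts` is not shown in this runbook and may look different.

```typescript
type Handler = (payload: unknown) => string;

class HandlerRegistry {
  private readonly handlers = new Map<string, Handler>();

  register(jobName: string, handler: Handler): void {
    this.handlers.set(jobName, handler);
  }

  dispatch(jobName: string, payload: unknown): string {
    const handler = this.handlers.get(jobName);
    if (handler === undefined) {
      // Throw instead of returning quietly: the error text matches the
      // Critical alert pattern, and the queue can record the job as failed.
      throw new Error(`[WORKER] No handler found for job name: ${jobName}`);
    }
    return handler(payload);
  }
}

// Usage: a stand-in for FeedbackHandler registered under the job name.
const registry = new HandlerRegistry();
registry.register("generate-feedback", () => "feedback generated");
```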

### Daily message limit hit for an org

**Symptom:** `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` in logs.

**Check:** `UsageService.getDailyUsage(orgId)` returns a count at or near the limit.

**Fix:** Increase the limit in `Organization.dailyMessageLimit`, or wait for the reset at midnight UTC.
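
When deciding between raising the limit and waiting, it helps to know how long the wait would be. The helpers below are an illustrative sketch: `msUntilUtcMidnight` and `remainingMessages` are hypothetical names, and the midnight UTC reset is taken from the fix described above.

```typescript
// Milliseconds from `now` until the next 00:00:00 UTC.
function msUntilUtcMidnight(now: Date): number {
  // Date.UTC normalizes day overflow, so "day + 1" also works on the last
  // day of a month or year.
  const nextMidnight = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return nextMidnight - now.getTime();
}

// Messages an org can still send today; never negative.
function remainingMessages(usedToday: number, dailyLimit: number): number {
  return Math.max(0, dailyLimit - usedToday);
}
```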

### WhatsApp message delivered but DB not updated

**Symptom:** The user received the message, but `UserProgress.exerciseStatus` is still `PENDING`.

**Cause:** Atomicity rule violation: `sendWhatsApp` was called before `prisma.update`.

**Fix:** Always update the DB first. Review any recent changes to handler files.
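
The DB-first ordering can be sketched with injected dependencies, which also makes the order easy to assert in a unit test. `Deps`, `completeExercise`, and the `COMPLETED` status are hypothetical; `updateProgress` stands in for the Prisma update and `sendWhatsApp` for the real sender.

```typescript
interface Deps {
  // Stand-in for the Prisma update on UserProgress.
  updateProgress(userId: string, status: "COMPLETED"): Promise<void>;
  // Stand-in for the WhatsApp send call.
  sendWhatsApp(userId: string, text: string): Promise<void>;
}

async function completeExercise(deps: Deps, userId: string, text: string): Promise<void> {
  // 1. Persist first: if the write throws, nothing has been sent yet and
  //    the job can be retried without the user seeing a duplicate.
  await deps.updateProgress(userId, "COMPLETED");
  // 2. Send only after the write has succeeded.
  await deps.sendWhatsApp(userId, text);
}
```

The trade-off is that a failed send leaves the DB ahead of the user, so this ordering relies on the job's retries to deliver the message eventually.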

### Redis connection lost

**Symptom:** `[REDIS] Worker connection error` repeats in the logs and all jobs stall.

**Fix:** Railway auto-restarts the worker. If the error persists, check Redis service health in the Railway dashboard and rotate `REDIS_URL` if needed.