# Monitoring & Incident Runbook
## BullBoard Dashboard
BullBoard is live at `/v1/internal/queues` on the API service.
**Access:** Requires either:
- `Authorization: Bearer <ADMIN_API_KEY>` header, OR
- Valid JWT token with `ORG_ADMIN` or `SUPER_ADMIN` role
Both queues are visible: `whatsapp-queue` and `notification-queue`.
**What to look for:**
- **Failed** tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- **Delayed** tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
- **Active** tab: jobs that have been running too long (> 2 min = likely hung).
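
The 2-minute rule of thumb for the **Active** tab can also be applied in a script. The following is a minimal sketch, not the project's actual code: the `ActiveJob` shape and `findHungJobs` are hypothetical names, and `processedOn` is assumed to be the epoch-millisecond timestamp at which a worker picked the job up.

```typescript
// Hypothetical minimal view of an active job, for illustration only.
interface ActiveJob {
  id: string;
  name: string;
  processedOn: number; // epoch ms when a worker started the job (assumption)
}

// 2 minutes, matching the "likely hung" threshold in this runbook.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

// Returns the active jobs that have been running longer than the threshold.
function findHungJobs(
  active: ActiveJob[],
  now: number,
  thresholdMs: number = HUNG_THRESHOLD_MS,
): ActiveJob[] {
  return active.filter((job) => now - job.processedOn > thresholdMs);
}
```

Passing `now` in explicitly keeps the check deterministic, which makes it easier to test than reading the clock inside the function.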
---
## Railway Log Alerts
Configure these alert rules in the Railway dashboard under **Logs → Alerts**:
| Pattern | Severity | Action |
|---------|----------|--------|
| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |
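
To sanity-check the rules against sample log lines before configuring them, the table can be expressed as data. This is a sketch under assumptions: `RULES` and `classifyLogLine` are hypothetical names, and the patterns are treated as regular expressions, so the literal square brackets are escaped. Whether Railway interprets patterns as regex or plain substrings is not stated here and should be verified.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
}

// Brackets are escaped so "[WORKER]" is matched literally
// rather than being parsed as a regex character class.
const RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium" },
  { pattern: /Daily Limit Reached/, severity: "Low" },
];

// Returns the severity of the first matching rule, or null if none match.
function classifyLogLine(line: string): Severity | null {
  for (const rule of RULES) {
    if (rule.pattern.test(line)) return rule.severity;
  }
  return null;
}
```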
---
## Manual Re-queue Procedure
When a job fails permanently and needs to be retried:
### Via BullBoard UI
1. Navigate to `/v1/internal/queues`
2. Select the queue (`whatsapp-queue`)
3. Go to **Failed** tab
4. Click **Retry** on the job, or **Retry All** for bulk recovery
### Via Redis CLI (emergency)
```bash
# Connect to Redis
redis-cli -u "$REDIS_URL"
# List failed job IDs for whatsapp-queue
# (the failed set is a sorted set, so use ZRANGE rather than LRANGE)
ZRANGE bull:whatsapp-queue:failed 0 -1
# Move a specific job back to waiting
# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
```
### Via API (programmatic)
```bash
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
-H "Authorization: Bearer $ADMIN_API_KEY"
```
---
## Common Incidents
### All exercise feedback silently ignored
**Symptom:** Users send exercise answers, receive no feedback.
**Cause:** `generate-feedback` job enqueued but no handler registered.
**Fix:** Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.
**Check:** `grep "generate-feedback" logs` should show `[WORKER] Processing job: generate-feedback`.
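
The failure mode above is easier to catch when dispatch fails loudly on an unknown job name. The sketch below is illustrative only: `HandlerRegistry` and `JobHandler` are hypothetical names, not the worker's real code, and handlers are synchronous here for brevity.

```typescript
// Hypothetical handler signature; real handlers are likely async.
type JobHandler = (data: unknown) => string;

class HandlerRegistry {
  private readonly handlers = new Map<string, JobHandler>();

  register(jobName: string, handler: JobHandler): void {
    this.handlers.set(jobName, handler);
  }

  // Throws on an unregistered job name instead of returning silently,
  // so the job shows up as failed rather than being dropped.
  dispatch(jobName: string, data: unknown): string {
    const handler = this.handlers.get(jobName);
    if (handler === undefined) {
      throw new Error(`[WORKER] No handler found for job name: ${jobName}`);
    }
    return handler(data);
  }
}
```

With this shape, a missing registration for `generate-feedback` produces an error carrying the same message the Critical alert rule watches for.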
### Daily message limit hit for an org
**Symptom:** `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` in logs.
**Check:** `UsageService.getDailyUsage(orgId)` returns count near limit.
**Fix:** Increase limit in `Organization.dailyMessageLimit`, or wait until midnight UTC for reset.
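
When deciding between raising the limit and waiting, it helps to know how long remains until the midnight UTC reset. A minimal sketch, assuming hypothetical helper names (`msUntilUtcMidnight`, `isNearLimit`) and an arbitrary 90% "near limit" ratio that is not defined anywhere in this runbook:

```typescript
// Milliseconds from `now` until the next 00:00:00 UTC.
function msUntilUtcMidnight(now: Date): number {
  // Date.UTC normalizes day overflow (e.g. Jan 32 becomes Feb 1).
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1,
  );
  return nextMidnight - now.getTime();
}

// True when usage has reached the given fraction of the daily limit.
// The 0.9 default is an illustrative assumption.
function isNearLimit(used: number, limit: number, ratio: number = 0.9): boolean {
  return limit > 0 && used >= limit * ratio;
}
```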
### WhatsApp message delivered but DB not updated
**Symptom:** User received message, but `UserProgress.exerciseStatus` still `PENDING`.
**Cause:** Atomicity rule violation: `sendWhatsApp` called before `prisma.update`.
**Fix:** Always update DB first. Review any recent changes to handler files.
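
The DB-first ordering can be sketched as follows. This is not the actual handler code: `ExerciseDeps` and `completeExercise` are hypothetical names, and the two injected functions stand in for the real `prisma.update` call and `sendWhatsApp`.

```typescript
// Injected dependencies, so the ordering can be checked without a real DB or API.
interface ExerciseDeps {
  updateProgress: (userId: string) => Promise<void>; // stand-in for the prisma update
  sendWhatsApp: (userId: string, text: string) => Promise<void>;
}

// Persists the status change first; the message is sent only if that succeeds.
async function completeExercise(
  deps: ExerciseDeps,
  userId: string,
  text: string,
): Promise<void> {
  await deps.updateProgress(userId); // a failure here throws before any message goes out
  await deps.sendWhatsApp(userId, text);
}
```

Because the update is awaited before the send, a failed write rejects the promise and the user never receives a message that the database does not reflect.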
### Redis connection lost
**Symptom:** `[REDIS] Worker connection error` repeating in logs. All jobs stall.
**Fix:** Railway auto-restarts the worker. If persistent, check Redis service health in Railway dashboard and rotate `REDIS_URL` if needed.