# Monitoring & Incident Runbook
## BullBoard Dashboard
BullBoard is live at `/v1/internal/queues` on the API service.
**Access:** Requires either:
- `Authorization: Bearer <ADMIN_API_KEY>` header, OR
- Valid JWT token with `ORG_ADMIN` or `SUPER_ADMIN` role
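
The either/or rule above can be sketched as a small predicate. This is an illustration only, not the project's actual middleware: `canAccessQueues`, the `AccessRequest` shape, and the `MEMBER` role are hypothetical, and JWT verification is assumed to have happened upstream. A production check would also compare the key in constant time.

```typescript
// Hypothetical sketch of the BullBoard access rule; not the project's middleware.
type Role = "ORG_ADMIN" | "SUPER_ADMIN" | "MEMBER"; // MEMBER is a made-up non-admin role

interface AccessRequest {
  authorization?: string; // raw Authorization header, if any
  jwtRole?: Role;         // role taken from an already-verified JWT (assumption)
}

const ADMIN_ROLES: ReadonlySet<Role> = new Set<Role>(["ORG_ADMIN", "SUPER_ADMIN"]);

function canAccessQueues(req: AccessRequest, adminApiKey: string): boolean {
  // Path 1: the static admin API key sent as a bearer token.
  // An empty configured key never grants access.
  if (adminApiKey !== "" && req.authorization === `Bearer ${adminApiKey}`) {
    return true;
  }
  // Path 2: a verified JWT carrying one of the two admin roles.
  return req.jwtRole !== undefined && ADMIN_ROLES.has(req.jwtRole);
}
```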
Both queues are visible: `whatsapp-queue` and `notification-queue`.
**What to look for:**
- **Failed** tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- **Delayed** tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
- **Active** tab: jobs that have been running too long (> 2 min = likely hung).
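
The 2-minute rule for the **Active** tab can be expressed as a simple filter. This is a minimal sketch: `ActiveJob` and `findLikelyHung` are hypothetical names, and `processedOn` is assumed to be the epoch-millisecond timestamp at which a worker picked up the job.

```typescript
// Threshold from the runbook: an active job running longer than 2 minutes is likely hung.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

// Minimal shape of an active job (hypothetical; only the fields this check needs).
interface ActiveJob {
  id: string;
  processedOn: number; // epoch ms when processing started (assumption)
}

// Returns the IDs of active jobs that have been running strictly longer than the threshold.
function findLikelyHung(jobs: ActiveJob[], nowMs: number): string[] {
  return jobs
    .filter((job) => nowMs - job.processedOn > HUNG_THRESHOLD_MS)
    .map((job) => job.id);
}
```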
---
## Railway Log Alerts
Configure these alert rules in the Railway dashboard under **Logs → Alerts**:
| Pattern | Severity | Action |
|---------|----------|--------|
| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |
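
Before saving the rules, it can help to check the patterns against sample log lines locally. The sketch below mirrors the table and assumes regex semantics, since `Job.*failed` reads like a regular expression; how Railway actually interprets patterns is not stated here. `ALERT_RULES` and `classifyLogLine` are hypothetical names.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
}

// Rules mirror the alert table, in the same order; literal square brackets are escaped.
const ALERT_RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium" },
  { pattern: /Daily Limit Reached/, severity: "Low" },
];

// Returns the severity of the first matching rule, or null when no rule matches.
function classifyLogLine(line: string): Severity | null {
  for (const rule of ALERT_RULES) {
    if (rule.pattern.test(line)) return rule.severity;
  }
  return null;
}
```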
---
## Manual Re-queue Procedure
When a job fails permanently and needs to be retried:
### Via BullBoard UI
1. Navigate to `/v1/internal/queues`
2. Select the queue (`whatsapp-queue`)
3. Go to **Failed** tab
4. Click **Retry** on the job, or **Retry All** for bulk recovery
### Via Redis CLI (emergency)
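```bash
# Connect to Redis (opens an interactive session)
redis-cli -u "$REDIS_URL"
# List failed job IDs for whatsapp-queue.
# Bull and BullMQ store failed jobs in a sorted set, so use ZRANGE; LRANGE fails with WRONGTYPE.
ZRANGE bull:whatsapp-queue:failed 0 -1
# Move a specific job back to waiting
# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
```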
### Via API (programmatic)
```bash
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
-H "Authorization: Bearer $ADMIN_API_KEY"
```
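
For scripted bulk recovery, the equivalent of **Retry All** can be sketched against the queue library directly. The interfaces below are structural stand-ins so the sketch runs without dependencies; BullMQ's `Queue.getFailed()` and `Job.retry()` have compatible shapes, but whether this project uses BullMQ or another Bull variant is an assumption. `retryFailedJobs` and `RetrySummary` are hypothetical names.

```typescript
// Structural stand-ins for the queue library's types (assumption: BullMQ-compatible).
interface FailedJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<FailedJob[]>;
}

interface RetrySummary {
  retried: string[]; // IDs successfully moved back for processing
  errors: string[];  // IDs whose retry call threw
}

// Retries up to `limit` failed jobs one at a time and reports the outcome per job.
async function retryFailedJobs(queue: FailedJobSource, limit = 100): Promise<RetrySummary> {
  const summary: RetrySummary = { retried: [], errors: [] };
  const jobs = await queue.getFailed(0, limit - 1);
  for (const job of jobs) {
    const id = job.id ?? "unknown";
    try {
      await job.retry();
      summary.retried.push(id);
    } catch {
      // Keep going: one job that cannot be retried should not block the rest.
      summary.errors.push(id);
    }
  }
  return summary;
}
```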
---
## Common Incidents
### All exercise feedback silently ignored
**Symptom:** Users send exercise answers, receive no feedback.
**Cause:** `generate-feedback` job enqueued but no handler registered.
**Fix:** Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.
**Check:** `grep "generate-feedback" logs` β€” should show `[WORKER] Processing job: generate-feedback`.
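
The failure mode is easier to see with a name-keyed registry. The sketch below is hypothetical (`registerHandler`, `dispatch`, and the exact suffix after "job name" are assumptions; the real worker's registration API may differ), but it shows why an unregistered job name produces the Critical log line instead of feedback.

```typescript
type JobHandler = (data: unknown) => void;

// Hypothetical registry keyed by job name.
const handlers = new Map<string, JobHandler>();

function registerHandler(name: string, handler: JobHandler): void {
  handlers.set(name, handler);
}

// Dispatches a job by name and returns the log line the worker would emit.
function dispatch(jobName: string, data: unknown): string {
  const handler = handlers.get(jobName);
  if (handler === undefined) {
    // Without this branch the job would be dropped with no trace at all.
    return `[WORKER] No handler found for job name: ${jobName}`;
  }
  handler(data);
  return `[WORKER] Processing job: ${jobName}`;
}
```

Calling `dispatch("generate-feedback", payload)` before `registerHandler("generate-feedback", ...)` returns the "No handler found" line; after registration it returns the "Processing job" line that the check above greps for.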
### Daily message limit hit for an org
**Symptom:** `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` in logs.
**Check:** `UsageService.getDailyUsage(orgId)` returns count near limit.
**Fix:** Increase limit in `Organization.dailyMessageLimit`, or wait until midnight UTC for reset.
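
When deciding whether to raise the limit or wait, it helps to know how long remains until the midnight UTC reset. A minimal sketch, with `msUntilUtcMidnight` as a hypothetical helper name:

```typescript
// Milliseconds from `now` until the next 00:00:00 UTC, when the daily counter resets.
function msUntilUtcMidnight(now: Date): number {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1, // Date.UTC normalizes day overflow into the next month or year
  );
  return nextMidnight - now.getTime();
}
```

For example, at 23:00 UTC the function returns 3,600,000 ms (one hour), and exactly at midnight it returns a full day, 86,400,000 ms.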
### WhatsApp message delivered but DB not updated
**Symptom:** User received message, but `UserProgress.exerciseStatus` still `PENDING`.
**Cause:** Atomicity rule violation: `sendWhatsApp` called before `prisma.update`.
**Fix:** Always update DB first. Review any recent changes to handler files.
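
The DB-first ordering can be sketched with injected dependencies. The names here are hypothetical stand-ins: `updateProgress` represents the Prisma update, `sendWhatsApp` represents the send call, and the `COMPLETED` status is an assumed value not taken from the schema.

```typescript
// Injected dependencies keep the sketch runnable without Prisma or a WhatsApp client.
interface FeedbackDeps {
  updateProgress(userId: string, status: "COMPLETED"): Promise<void>;
  sendWhatsApp(userId: string, text: string): Promise<void>;
}

async function completeExercise(deps: FeedbackDeps, userId: string, text: string): Promise<void> {
  // 1. Persist first: if this throws, nothing has been sent yet.
  await deps.updateProgress(userId, "COMPLETED");
  // 2. Send only after the write succeeded, so a delivered message always has a matching row.
  await deps.sendWhatsApp(userId, text);
}
```

The trade-off is that a failed send leaves the DB ahead of what the user has seen, which a retry of the send step can repair; the reverse order leaves a delivered message with a stale `PENDING` status, which is the incident described above.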
### Redis connection lost
**Symptom:** `[REDIS] Worker connection error` repeating in logs. All jobs stall.
**Fix:** Railway auto-restarts the worker. If persistent, check Redis service health in Railway dashboard and rotate `REDIS_URL` if needed.