# Monitoring & Incident Runbook

## BullBoard Dashboard

BullBoard is live at `/v1/internal/queues` on the API service.

**Access:** Requires either:
- `Authorization: Bearer <ADMIN_API_KEY>` header, OR
- A valid JWT with the `ORG_ADMIN` or `SUPER_ADMIN` role

Both queues are visible: `whatsapp-queue` and `notification-queue`.
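
The access rule above is an either/or check, which can be sketched as a small predicate. This is illustrative only: `canAccessQueues`, `AccessRequest`, the `MEMBER` role, and the assumption that the JWT has already been verified and decoded into a role are all hypothetical and not taken from the API service's code.

```typescript
// Hypothetical sketch of the dashboard access rule; all names are illustrative.
type Role = "ORG_ADMIN" | "SUPER_ADMIN" | "MEMBER"; // "MEMBER" is a made-up non-admin role

interface AccessRequest {
  bearerToken?: string; // value after "Bearer " in the Authorization header
  jwtRole?: Role;       // role claim from an already-verified JWT, if any
}

const DASHBOARD_ROLES: ReadonlySet<Role> = new Set<Role>(["ORG_ADMIN", "SUPER_ADMIN"]);

function canAccessQueues(req: AccessRequest, adminApiKey: string): boolean {
  // Path 1: the admin API key was sent as a bearer token.
  // (Production code should use a constant-time comparison.)
  if (adminApiKey !== "" && req.bearerToken === adminApiKey) return true;
  // Path 2: a verified JWT carries one of the two admin roles.
  return req.jwtRole !== undefined && DASHBOARD_ROLES.has(req.jwtRole);
}
```

The empty-key guard is a design choice in this sketch: an unset `ADMIN_API_KEY` should never let an empty bearer token through.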

**What to look for:**
- **Failed** tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- **Delayed** tab: jobs scheduled to run later. A growing backlog usually means retry backoff after failures or a Redis issue.
- **Active** tab: jobs that have been running for more than 2 minutes are likely hung.
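
The 2-minute rule for the **Active** tab can also be applied in code to a list of active jobs. The sketch below is a minimal illustration: `ActiveJobSnapshot` and `findLikelyHungJobs` are hypothetical names, and the `processedOn` field is assumed to hold the start-of-processing timestamp in milliseconds, as Bull/BullMQ job objects do.

```typescript
// Hypothetical snapshot of an active job. `processedOn` is assumed to be the
// time processing started, in milliseconds since the epoch.
interface ActiveJobSnapshot {
  id: string;
  name: string;
  processedOn?: number;
}

const HUNG_THRESHOLD_MS = 2 * 60 * 1000; // the runbook's 2-minute rule of thumb

function findLikelyHungJobs(
  jobs: readonly ActiveJobSnapshot[],
  nowMs: number,
  thresholdMs: number = HUNG_THRESHOLD_MS,
): ActiveJobSnapshot[] {
  // Jobs without a start timestamp are skipped rather than guessed at.
  return jobs.filter(
    (job) => job.processedOn !== undefined && nowMs - job.processedOn > thresholdMs,
  );
}
```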

---

## Railway Log Alerts

Configure these alert rules in the Railway dashboard under **Logs → Alerts**:

| Pattern | Severity | Action |
|---------|----------|--------|
| `[WORKER] No handler found for job name` | Critical | Page on-call — silent job drop |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |
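
To sanity-check the patterns against sample log lines before configuring them, the table can be mirrored as a small classifier. This is a local sketch only: `ALERT_RULES` and `classifyLogLine` are hypothetical names, the patterns are treated as JavaScript regular expressions (so literal brackets are escaped), and it does not model how Railway itself evaluates alert patterns.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
}

// Patterns copied from the table above; "[" and "]" are escaped because an
// unescaped "[WORKER]" would be a character class in a regular expression.
const ALERT_RULES: readonly AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium" },
  { pattern: /Daily Limit Reached/, severity: "Low" },
];

// Returns the severity of the first matching rule, or undefined if none match.
function classifyLogLine(line: string): Severity | undefined {
  return ALERT_RULES.find((rule) => rule.pattern.test(line))?.severity;
}
```

Rules are listed most severe first, so a line that happens to match two patterns is reported at the higher severity.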

---

## Manual Re-queue Procedure

When a job fails permanently and needs to be retried:

### Via BullBoard UI
1. Navigate to `/v1/internal/queues`
2. Select the queue (`whatsapp-queue`)
3. Go to **Failed** tab
4. Click **Retry** on the job, or **Retry All** for bulk recovery

### Via Redis CLI (emergency)
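```bash
# Connect to Redis (quote the URL so special characters survive the shell)
redis-cli -u "$REDIS_URL"

# List failed job IDs for whatsapp-queue
# (Bull/BullMQ store failed jobs in a sorted set, so use ZRANGE; LRANGE returns WRONGTYPE)
ZRANGE bull:whatsapp-queue:failed 0 -1

# Move a specific job back to waiting
# (BullBoard UI is preferred. Use the CLI only when the dashboard is unreachable.)
```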
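
If the dashboard is unreachable but application code can still reach Redis, retrying through the queue library is generally safer than editing keys by hand. The sketch below assumes the queues are BullMQ queues, which this runbook does not state. `FailedJobSource`, `RetryableJob`, and `retryAllFailed` are hypothetical names, shaped so that a BullMQ `Queue` (via `getFailed()`) and `Job` (via `retry()`) would satisfy them.

```typescript
// Structural stand-ins for the parts of a queue and job this sketch needs.
// They are modelled on BullMQ's Queue.getFailed() and Job.retry(); the
// interfaces themselves are hypothetical.
interface RetryableJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(): Promise<RetryableJob[]>;
}

interface RetryReport {
  retried: string[];
  errors: string[];
}

async function retryAllFailed(queue: FailedJobSource): Promise<RetryReport> {
  const report: RetryReport = { retried: [], errors: [] };
  for (const job of await queue.getFailed()) {
    const id = job.id ?? "(unknown id)";
    try {
      await job.retry(); // asks the library to move the job back to waiting
      report.retried.push(id);
    } catch {
      report.errors.push(id); // keep going; one bad job should not block the rest
    }
  }
  return report;
}
```

Returning a report instead of throwing on the first error keeps a bulk recovery going and leaves a list of job IDs to inspect afterwards.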

### Via API (programmatic)
```bash
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
  -H "Authorization: Bearer $ADMIN_API_KEY"
```

---

## Common Incidents

### All exercise feedback silently ignored
**Symptom:** Users send exercise answers, receive no feedback.  
**Cause:** `generate-feedback` job enqueued but no handler registered.  
**Fix:** Ensure `FeedbackHandler` is registered in `apps/whatsapp-worker/src/index.ts`.  
**Check:** `grep "generate-feedback" logs` — should show `[WORKER] Processing job: generate-feedback`.
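
One way to keep this incident from being silent is to make dispatch fail loudly when no handler is registered. The sketch below is a minimal illustration and not the worker's actual code: `HandlerRegistry` and `JobHandler` are hypothetical, and only the log message prefix is taken from the alert table.

```typescript
type JobHandler = (payload: unknown) => string;

class HandlerRegistry {
  private readonly handlers = new Map<string, JobHandler>();

  register(jobName: string, handler: JobHandler): void {
    this.handlers.set(jobName, handler);
  }

  // Throws instead of returning quietly, so a job with no handler fails
  // visibly rather than being dropped.
  dispatch(jobName: string, payload: unknown): string {
    const handler = this.handlers.get(jobName);
    if (handler === undefined) {
      throw new Error(`[WORKER] No handler found for job name: ${jobName}`);
    }
    return handler(payload);
  }
}
```

With this shape, a missing `generate-feedback` registration produces an error on the first job rather than a run of unanswered exercises.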

### Daily message limit hit for an org
**Symptom:** `[WORKER] Skipping job send-message for Org X: Daily Limit Reached` in logs.  
**Check:** `UsageService.getDailyUsage(orgId)` returns count near limit.  
**Fix:** Increase limit in `Organization.dailyMessageLimit`, or wait until midnight UTC for reset.
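
When deciding whether to raise the limit or wait, it helps to know how long remains until the reset. The helpers below are a sketch: `msUntilUtcMidnight` and `isDailyLimitReached` are hypothetical names, and treating usage equal to the limit as "reached" is an assumption about how the worker compares the two.

```typescript
// Milliseconds from `now` until the next midnight UTC, when the count resets.
function msUntilUtcMidnight(now: Date): number {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1, // Date.UTC rolls day overflow into the next month
  );
  return nextMidnight - now.getTime();
}

// Assumption: the limit counts as reached once usage equals it.
function isDailyLimitReached(dailyUsage: number, dailyMessageLimit: number): boolean {
  return dailyUsage >= dailyMessageLimit;
}
```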

### WhatsApp message delivered but DB not updated
**Symptom:** User received message, but `UserProgress.exerciseStatus` still `PENDING`.  
**Cause:** Atomicity rule violation: `sendWhatsApp` was called before `prisma.update`, so the message went out although the update never completed.  
**Fix:** Always update the DB first, then send. Review any recent changes to handler files.
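
The ordering rule can be made explicit in a handler by awaiting the database write before sending. The sketch below injects both operations so it stays self-contained. `ProgressDeps`, `recordThenNotify`, and the `COMPLETED` status are hypothetical; in the real handlers, `updateProgress` would presumably wrap the relevant `prisma.update` call.

```typescript
interface ProgressDeps {
  // Hypothetical wrapper around the DB write (e.g. a prisma.update call).
  updateProgress(userId: string, status: "COMPLETED"): Promise<void>;
  sendWhatsApp(userId: string, text: string): Promise<void>;
}

async function recordThenNotify(
  deps: ProgressDeps,
  userId: string,
  text: string,
): Promise<void> {
  // 1. Persist first. If this throws, nothing is sent and the job can retry.
  await deps.updateProgress(userId, "COMPLETED");
  // 2. Only then send the message.
  await deps.sendWhatsApp(userId, text);
}
```

With this ordering, a failure leaves at worst a recorded update with no message, never a delivered message with no record.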

### Redis connection lost
**Symptom:** `[REDIS] Worker connection error` repeating in logs. All jobs stall.  
**Fix:** Railway auto-restarts the worker. If persistent, check Redis service health in Railway dashboard and rotate `REDIS_URL` if needed.