
Monitoring & Incident Runbook

BullBoard Dashboard

BullBoard is live at /v1/internal/queues on the API service.

Access: Requires either:

  • Authorization: Bearer <ADMIN_API_KEY> header, OR
  • Valid JWT token with ORG_ADMIN or SUPER_ADMIN role
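
The two access paths can be sketched as a single predicate. This is a minimal illustration, not the API service's actual middleware: `canAccessQueues`, `AccessRequest`, and the `jwtRole` field are hypothetical names, and the JWT is assumed to have been verified before this check runs.

```typescript
// Roles allowed to view the dashboard, as listed above.
const DASHBOARD_ROLES: string[] = ["ORG_ADMIN", "SUPER_ADMIN"];

// Hypothetical request shape; the real middleware may differ.
interface AccessRequest {
  authorizationHeader?: string; // raw value of the Authorization header
  jwtRole?: string; // role claim taken from an already-verified JWT
}

function canAccessQueues(req: AccessRequest, adminApiKey: string): boolean {
  // Path 1: the static admin key sent as a bearer token.
  // An empty configured key never grants access.
  if (adminApiKey.length > 0 && req.authorizationHeader === "Bearer " + adminApiKey) {
    return true;
  }
  // Path 2: a verified JWT carrying one of the admin roles.
  return req.jwtRole !== undefined && DASHBOARD_ROLES.indexOf(req.jwtRole) !== -1;
}
```

Rejecting an empty `adminApiKey` guards against a missing environment variable turning `Bearer ` into a valid credential.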

Both queues are visible: whatsapp-queue and notification-queue.

What to look for:

  • Failed tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
  • Delayed tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
  • Active tab: jobs currently being processed. A job that has been active for more than 2 minutes is likely hung.
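
The 2-minute rule for the Active tab can also be applied in a script. The sketch below assumes each active job exposes a `processedOn` timestamp in epoch milliseconds (BullMQ jobs carry a field with that name). `ActiveJob` and `findHungJobs` are hypothetical names used only for illustration.

```typescript
// Minimal stand-in for an active job record.
interface ActiveJob {
  id: string;
  processedOn?: number; // epoch ms when a worker picked the job up
}

// Threshold from the runbook: active for more than 2 minutes.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

// Returns the IDs of jobs that have been active longer than the threshold.
function findHungJobs(active: ActiveJob[], nowMs: number): string[] {
  return active
    .filter((job) => job.processedOn !== undefined && nowMs - job.processedOn > HUNG_THRESHOLD_MS)
    .map((job) => job.id);
}
```

Jobs without a `processedOn` value are skipped rather than flagged, since their start time is unknown.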

Railway Log Alerts

Configure these alert rules in the Railway dashboard under Logs → Alerts:

| Pattern | Severity | Action |
|---|---|---|
| `[WORKER] No handler found for job name` | Critical | Page on-call (silent job drop) |
| `[WORKER] Job.*failed` | High | Notify team channel |
| `[FEEDBACK] Feedback generation failed` | High | Notify team channel |
| `[REDIS] Worker connection error` | Critical | Page on-call |
| `[BRIDGE] Could not resolve organizationId` | Medium | Log review next morning |
| `Daily Limit Reached` | Low | Review weekly |
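
To dry-run these rules against sample log lines before configuring them, the table can be expressed as data. This is a sketch only: it treats each pattern as a JavaScript regular expression (so the literal brackets are escaped), which may not match how Railway interprets patterns. `ALERT_RULES` and `classifyLogLine` are hypothetical names.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
  action: string;
}

// Rules from the table above, ordered so Critical rules are checked first.
const ALERT_RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium", action: "Log review next morning" },
  { pattern: /Daily Limit Reached/, severity: "Low", action: "Review weekly" },
];

// Returns the first matching rule, or undefined when no rule applies.
function classifyLogLine(line: string): AlertRule | undefined {
  for (const rule of ALERT_RULES) {
    if (rule.pattern.test(line)) {
      return rule;
    }
  }
  return undefined;
}
```

For example, `classifyLogLine("[REDIS] Worker connection error: ECONNRESET")` returns the Critical rule, while an unrelated line returns `undefined`.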

Manual Re-queue Procedure

When a job fails permanently and needs to be retried:

Via BullBoard UI

  1. Navigate to /v1/internal/queues
  2. Select the queue (whatsapp-queue)
  3. Go to Failed tab
  4. Click Retry on the job, or Retry All for bulk recovery

Via Redis CLI (emergency)

# Connect to Redis
redis-cli -u $REDIS_URL

# List failed job IDs for whatsapp-queue (the failed set is stored as a sorted set, hence ZRANGE)
ZRANGE bull:whatsapp-queue:failed 0 -1

# Move a specific job back to waiting
# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
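
When the dashboard is unreachable, a small script is usually safer than moving keys by hand in Redis. The sketch below assumes the queues are BullMQ queues, whose `Queue.getFailed(start, end)` and `Job.retry()` methods match the interfaces declared here. The runbook does not state which queue library is in use, so treat the interfaces and the `retryFailedJobs` helper as hypothetical.

```typescript
// Structural stand-ins for the queue client (assumed BullMQ-like shape).
interface RetryableJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retries up to `limit` failed jobs one by one and reports which ones
// were re-queued and which ones could not be retried.
async function retryFailedJobs(
  queue: FailedJobSource,
  limit: number = 100
): Promise<{ retried: string[]; errors: string[] }> {
  const jobs = await queue.getFailed(0, limit - 1);
  const retried: string[] = [];
  const errors: string[] = [];
  for (const job of jobs) {
    const id = job.id !== undefined ? job.id : "(no id)";
    try {
      await job.retry();
      retried.push(id);
    } catch (err) {
      // Keep going: one bad job should not block the rest of the recovery.
      errors.push(id);
    }
  }
  return { retried, errors };
}
```

Under that assumption, a real `Queue` instance for whatsapp-queue could be passed in directly, since it satisfies `FailedJobSource` structurally.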

Via API (programmatic)

# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
  -H "Authorization: Bearer $ADMIN_API_KEY"

Common Incidents

All exercise feedback silently ignored

Symptom: Users send exercise answers, receive no feedback.
Cause: generate-feedback job enqueued but no handler registered.
Fix: Ensure FeedbackHandler is registered in apps/whatsapp-worker/src/index.ts.
Check: grep the worker logs for "generate-feedback". A healthy worker logs [WORKER] Processing job: generate-feedback.
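
The failure mode is easiest to see in a name-to-handler registry. The sketch below is not the worker's actual code: `registerHandler` and `dispatch` are hypothetical, and the text after "job name" in the first log line is an assumption. It shows why an unregistered job name produces only a log line and no feedback.

```typescript
type JobHandler = (data: unknown) => void;

// Hypothetical registry keyed by job name.
const handlers: Record<string, JobHandler> = {};

function registerHandler(jobName: string, handler: JobHandler): void {
  handlers[jobName] = handler;
}

// Runs the handler for a job and returns true, or logs and returns false
// when nothing is registered under that name (the silent drop).
function dispatch(jobName: string, data: unknown, log: (line: string) => void): boolean {
  const handler: JobHandler | undefined = Object.prototype.hasOwnProperty.call(handlers, jobName)
    ? handlers[jobName]
    : undefined;
  if (handler === undefined) {
    log("[WORKER] No handler found for job name " + jobName);
    return false;
  }
  log("[WORKER] Processing job: " + jobName);
  handler(data);
  return true;
}
```

In this sketch the job is consumed either way, which is why the missing-handler log line is the only signal and is alerted on as Critical.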

Daily message limit hit for an org

Symptom: [WORKER] Skipping job send-message for Org X: Daily Limit Reached in logs.
Check: UsageService.getDailyUsage(orgId) returns a count at or near the org's limit.
Fix: Increase limit in Organization.dailyMessageLimit, or wait until midnight UTC for reset.
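
To estimate how long an org stays blocked, the midnight UTC reset can be computed directly. The sketch below assumes the limit counts as reached when the daily count is greater than or equal to `dailyMessageLimit`; the exact comparison used by the worker is not stated here. `isDailyLimitReached` and `nextResetUtc` are hypothetical helper names.

```typescript
// Assumption: the limit is reached when the count equals or exceeds it.
function isDailyLimitReached(sentToday: number, dailyMessageLimit: number): boolean {
  return sentToday >= dailyMessageLimit;
}

// Returns the next midnight UTC strictly after `now`.
// Date.UTC normalizes day overflow, so month and year rollovers are handled.
function nextResetUtc(now: Date): Date {
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1));
}

// Milliseconds left until the counter resets.
function msUntilReset(now: Date): number {
  return nextResetUtc(now).getTime() - now.getTime();
}
```

For an incident at 23:30 UTC, `msUntilReset` gives 30 minutes, which is often short enough to wait rather than raise the limit.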

WhatsApp message delivered but DB not updated

Symptom: User received message, but UserProgress.exerciseStatus still PENDING.
Cause: Atomicity rule violation. sendWhatsApp was called before prisma.update.
Fix: Always update DB first. Review any recent changes to handler files.
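
The ordering rule can be sketched with injected dependencies so it is easy to review and test. This is an illustration only: `HandlerDeps`, `completeExercise`, and the `"COMPLETED"` status value are hypothetical, and `updateProgress` stands in for the real prisma.update call.

```typescript
// Hypothetical dependencies; the real handlers call Prisma and the
// WhatsApp client directly.
interface HandlerDeps {
  updateProgress(userId: string, status: string): Promise<void>;
  sendWhatsApp(userId: string, text: string): Promise<void>;
}

async function completeExercise(deps: HandlerDeps, userId: string, feedback: string): Promise<void> {
  // 1. Persist the new state first. If this throws, nothing is sent,
  //    so the user never sees a message the database does not know about.
  await deps.updateProgress(userId, "COMPLETED"); // assumed status value
  // 2. Only then send the message. A failure here leaves the DB ahead of
  //    the user, which a retry of the send can repair.
  await deps.sendWhatsApp(userId, feedback);
}
```

The trade-off is deliberate: a missed message can be re-sent, whereas a delivered message with a stale PENDING row is the incident described above.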

Redis connection lost

Symptom: [REDIS] Worker connection error repeating in logs. All jobs stall.
Fix: Railway auto-restarts the worker. If persistent, check Redis service health in Railway dashboard and rotate REDIS_URL if needed.