Monitoring & Incident Runbook
BullBoard Dashboard
BullBoard is live at /v1/internal/queues on the API service.
Access requires either:
- An Authorization: Bearer <ADMIN_API_KEY> header, or
- A valid JWT token with the ORG_ADMIN or SUPER_ADMIN role
Both queues are visible: whatsapp-queue and notification-queue.
What to look for:
- Failed tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
- Delayed tab: jobs stuck waiting (usually a Redis issue or backoff after failures).
- Active tab: jobs that have been running too long (> 2 min = likely hung).
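The 2-minute rule of thumb for the Active tab can also be checked in code. The following is a minimal sketch, assuming each active job exposes a processedOn timestamp in epoch milliseconds (BullMQ and Bull jobs carry a field of that name). The ActiveJob interface, HUNG_THRESHOLD_MS, and findHungJobs are hypothetical names used for illustration, not part of the codebase.

```typescript
// Hypothetical shape: only the fields this check needs.
interface ActiveJob {
  id: string;
  name: string;
  processedOn: number; // epoch ms when a worker picked up the job
}

// The "> 2 min = likely hung" threshold from the list above.
const HUNG_THRESHOLD_MS = 2 * 60 * 1000;

// Returns the active jobs that have been running longer than the threshold.
function findHungJobs(
  jobs: ActiveJob[],
  now: number,
  thresholdMs: number = HUNG_THRESHOLD_MS,
): ActiveJob[] {
  return jobs.filter((job) => now - job.processedOn > thresholdMs);
}
```

Passing now explicitly, rather than calling Date.now() inside the function, keeps the check deterministic and easy to test.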
Railway Log Alerts
Configure these alert rules in the Railway dashboard under Logs → Alerts:
| Pattern | Severity | Action |
|---|---|---|
| [WORKER] No handler found for job name | Critical | Page on-call (silent job drop) |
| [WORKER] Job.*failed | High | Notify team channel |
| [FEEDBACK] Feedback generation failed | High | Notify team channel |
| [REDIS] Worker connection error | Critical | Page on-call |
| [BRIDGE] Could not resolve organizationId | Medium | Log review next morning |
| Daily Limit Reached | Low | Review weekly |
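The alert rules above can be expressed as a small lookup, which is handy for testing patterns against sample log lines before configuring them. This is a sketch only: whether Railway treats alert patterns as regular expressions or as plain substrings is not stated here, so the code assumes regex semantics and escapes the square brackets accordingly. AlertRule, ALERT_RULES, and classifyLogLine are hypothetical names.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
  action: string;
}

// Mirrors the table above. Brackets are escaped because an unescaped
// [WORKER] would be read as a character class in a regular expression.
const ALERT_RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium", action: "Log review next morning" },
  { pattern: /Daily Limit Reached/, severity: "Low", action: "Review weekly" },
];

// Returns the first rule whose pattern matches the log line, if any.
function classifyLogLine(line: string): AlertRule | undefined {
  return ALERT_RULES.find((rule) => rule.pattern.test(line));
}
```

Rules are checked in table order, so a line matching several patterns is reported under the first one listed.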
Manual Re-queue Procedure
When a job fails permanently and needs to be retried:
Via BullBoard UI
- Navigate to /v1/internal/queues
- Select the queue (whatsapp-queue)
- Go to the Failed tab
- Click Retry on the job, or Retry All for bulk recovery
Via Redis CLI (emergency)
# Connect to Redis
redis-cli -u $REDIS_URL
# List failed job IDs for whatsapp-queue
# (Bull/BullMQ store the failed set as a sorted set, so use ZRANGE; LRANGE returns WRONGTYPE)
ZRANGE bull:whatsapp-queue:failed 0 -1
# Move a specific job back to waiting
# (BullBoard UI is preferred; use the CLI only when the dashboard is unreachable)
Via API (programmatic)
# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
-H "Authorization: Bearer $ADMIN_API_KEY"
Common Incidents
All exercise feedback silently ignored
Symptom: Users send exercise answers, receive no feedback.
Cause: generate-feedback job enqueued but no handler registered.
Fix: Ensure FeedbackHandler is registered in apps/whatsapp-worker/src/index.ts.
Check: grep the logs for "generate-feedback"; a healthy worker shows [WORKER] Processing job: generate-feedback.
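One way to keep a missing handler from becoming a silent drop is to make the lookup throw, so the job is marked failed and shows up in the Failed tab. The sketch below assumes a simple name-to-handler map. HandlerRegistry, JobHandler, and the exact error text are hypothetical and do not reflect the actual wiring in apps/whatsapp-worker/src/index.ts.

```typescript
type JobHandler = (data: unknown) => Promise<void>;

// Hypothetical registry keyed by job name.
class HandlerRegistry {
  private readonly handlers = new Map<string, JobHandler>();

  register(jobName: string, handler: JobHandler): void {
    this.handlers.set(jobName, handler);
  }

  // Throws instead of returning undefined, so an unregistered job name
  // surfaces as a failure rather than being skipped.
  resolve(jobName: string): JobHandler {
    const handler = this.handlers.get(jobName);
    if (!handler) {
      throw new Error(`[WORKER] No handler found for job name: ${jobName}`);
    }
    return handler;
  }

  async dispatch(jobName: string, data: unknown): Promise<void> {
    await this.resolve(jobName)(data);
  }
}
```

With this shape, forgetting to call register("generate-feedback", ...) produces an error on the first job instead of a quiet gap in user feedback.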
Daily message limit hit for an org
Symptom: [WORKER] Skipping job send-message for Org X: Daily Limit Reached in logs.
Check: UsageService.getDailyUsage(orgId) returns a count at or near the limit.
Fix: Increase limit in Organization.dailyMessageLimit, or wait until midnight UTC for reset.
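When deciding between raising the limit and waiting, it helps to know how much quota is left and how long until the midnight UTC reset. The helpers below are a minimal sketch; msUntilUtcMidnight and remainingQuota are hypothetical names, and the used and limit arguments stand in for the values from UsageService.getDailyUsage(orgId) and Organization.dailyMessageLimit.

```typescript
// Milliseconds from `now` until the next 00:00 UTC, when the daily counter resets.
function msUntilUtcMidnight(now: Date): number {
  // Date.UTC normalizes day overflow, so "day + 1" rolls over month and year ends.
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1,
  );
  return nextMidnight - now.getTime();
}

// Messages the org can still send today; never negative.
function remainingQuota(used: number, limit: number): number {
  return Math.max(0, limit - used);
}
```

For example, at 22:30 UTC the reset is 90 minutes away, which may be short enough to wait out rather than changing the limit.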
WhatsApp message delivered but DB not updated
Symptom: User received the message, but UserProgress.exerciseStatus is still PENDING.
Cause: Atomicity rule violation: sendWhatsApp was called before prisma.update.
Fix: Always update DB first. Review any recent changes to handler files.
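The ordering rule can be sketched with injected dependencies so it is easy to test. This is an illustration under stated assumptions: ExerciseDeps, completeExercise, and the COMPLETED status value are hypothetical, and updateProgress stands in for the real prisma.update call.

```typescript
// Hypothetical dependencies; the real handlers call prisma and the WhatsApp client.
interface ExerciseDeps {
  updateProgress: (userId: string, status: string) => Promise<void>;
  sendWhatsApp: (userId: string, text: string) => Promise<void>;
}

// Assumed status value for illustration; the runbook only names PENDING.
const DONE_STATUS = "COMPLETED";

// Persists the new status first, then sends. If the DB write throws,
// the message is never sent, so a delivered message always has a matching record.
async function completeExercise(
  deps: ExerciseDeps,
  userId: string,
  feedback: string,
): Promise<void> {
  await deps.updateProgress(userId, DONE_STATUS);
  await deps.sendWhatsApp(userId, feedback);
}
```

If the send step fails after the update, the job still throws and can be retried, which is the failure mode this ordering prefers over a delivered message with a stale record.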
Redis connection lost
Symptom: [REDIS] Worker connection error repeating in logs. All jobs stall.
Fix: Railway auto-restarts the worker. If persistent, check Redis service health in Railway dashboard and rotate REDIS_URL if needed.