Monitoring & Incident Runbook

BullBoard Dashboard

BullBoard is live at /v1/internal/queues on the API service.

Access requires either:

An Authorization: Bearer <ADMIN_API_KEY> header, or
A valid JWT with the ORG_ADMIN or SUPER_ADMIN role
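
The either/or rule can be written as a small predicate. This is only a sketch of the rule as stated, not the API service's actual middleware: `canAccessQueueDashboard` and `AccessRequest` are hypothetical names, and the JWT is assumed to have been verified already, with its role extracted.

```typescript
// Hypothetical request shape; the real service's types are not shown in this runbook.
interface AccessRequest {
  authorizationHeader?: string; // raw value of the Authorization header
  jwtRole?: string;             // role taken from an already-verified JWT, if any
}

const DASHBOARD_ROLES = new Set<string>(["ORG_ADMIN", "SUPER_ADMIN"]);

function canAccessQueueDashboard(req: AccessRequest, adminApiKey: string): boolean {
  // An unset (empty) admin key must never match a header like "Bearer ".
  const bearerOk =
    adminApiKey.length > 0 && req.authorizationHeader === `Bearer ${adminApiKey}`;
  const roleOk = req.jwtRole !== undefined && DASHBOARD_ROLES.has(req.jwtRole);
  return bearerOk || roleOk;
}
```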

Both queues are visible: whatsapp-queue and notification-queue.

What to look for:

Failed tab: jobs that exhausted all retry attempts. Check the error message and stack trace.
Delayed tab: jobs waiting to run later. A growing backlog here usually points to a Redis issue or to backoff after repeated failures.
Active tab: jobs currently being processed. A job that has been active for more than 2 minutes is likely hung.
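
The 2-minute rule for the Active tab can also be checked in code. The sketch below assumes each job exposes a `processedOn` timestamp in milliseconds, as Bull and BullMQ jobs do. `ActiveJobSnapshot` and `findLikelyHungJobs` are hypothetical names.

```typescript
// Minimal snapshot of an active job; only the fields this check needs.
interface ActiveJobSnapshot {
  id: string;
  name: string;
  processedOn?: number; // ms since epoch when a worker picked the job up
}

const HUNG_THRESHOLD_MS = 2 * 60 * 1000; // the runbook's 2-minute rule of thumb

function findLikelyHungJobs(active: ActiveJobSnapshot[], nowMs: number): ActiveJobSnapshot[] {
  // Jobs without a start timestamp are skipped rather than guessed at.
  return active.filter(
    (job) => job.processedOn !== undefined && nowMs - job.processedOn > HUNG_THRESHOLD_MS,
  );
}
```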

Railway Log Alerts

Configure these alert rules in the Railway dashboard under Logs → Alerts:

Pattern	Severity	Action
`[WORKER] No handler found for job name`	Critical	Page on-call — silent job drop
`[WORKER] Job.*failed`	High	Notify team channel
`[FEEDBACK] Feedback generation failed`	High	Notify team channel
`[REDIS] Worker connection error`	Critical	Page on-call
`[BRIDGE] Could not resolve organizationId`	Medium	Log review next morning
`Daily Limit Reached`	Low	Review weekly
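
Before adding these rules in the dashboard, it can help to try the patterns against sample log lines locally. The sketch below treats each pattern as a JavaScript regular expression, which is an assumption: Railway's own matching syntax is not described here. The brackets are escaped because an unescaped `[WORKER]` would be read as a character class. `classifyLogLine` and `ALERT_RULES` are hypothetical names.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRule {
  pattern: RegExp;
  severity: Severity;
  action: string;
}

// Same rows as the table above, in the same order; the first match wins.
const ALERT_RULES: AlertRule[] = [
  { pattern: /\[WORKER\] No handler found for job name/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[WORKER\] Job.*failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[FEEDBACK\] Feedback generation failed/, severity: "High", action: "Notify team channel" },
  { pattern: /\[REDIS\] Worker connection error/, severity: "Critical", action: "Page on-call" },
  { pattern: /\[BRIDGE\] Could not resolve organizationId/, severity: "Medium", action: "Log review next morning" },
  { pattern: /Daily Limit Reached/, severity: "Low", action: "Review weekly" },
];

function classifyLogLine(line: string): AlertRule | undefined {
  return ALERT_RULES.find((rule) => rule.pattern.test(line));
}
```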

Manual Re-queue Procedure

When a job fails permanently and needs to be retried:

Via BullBoard UI

Navigate to /v1/internal/queues
Select the queue (whatsapp-queue)
Go to Failed tab
Click Retry on the job, or Retry All for bulk recovery

Via Redis CLI (emergency)

# Connect to Redis
redis-cli -u "$REDIS_URL"

# List failed job IDs for whatsapp-queue
# (the failed set is a sorted set, so use ZRANGE; LRANGE returns a WRONGTYPE error)
ZRANGE bull:whatsapp-queue:failed 0 -1

# Move a specific job back to waiting
# (BullBoard UI is preferred; use CLI only when dashboard is unreachable)
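
If the dashboard is unreachable but a Node process can still reach Redis, a short script can do what the Retry button does. The sketch below is written against a small hypothetical interface (`FailedJobSource`, `RetryableJob`) so that it runs without Redis. It assumes the queues use BullMQ, whose `Queue.getFailed` and `Job.retry` methods have this shape.

```typescript
// Structural subset of a BullMQ-style job and queue (assumed, see lead-in).
interface RetryableJob {
  id?: string;
  name: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retries up to `limit` failed jobs, optionally only those with a given job name.
// Returns the IDs of the jobs that were moved back for processing.
async function retryFailedJobs(
  queue: FailedJobSource,
  jobName?: string,
  limit = 100,
): Promise<string[]> {
  const failed = await queue.getFailed(0, limit - 1);
  const retried: string[] = [];
  for (const job of failed) {
    if (jobName !== undefined && job.name !== jobName) continue;
    await job.retry(); // one job at a time, so an error stops the run early
    retried.push(job.id ?? "(no id)");
  }
  return retried;
}
```

Under the BullMQ assumption, the real queue object for whatsapp-queue would be passed in place of the fake used for testing.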

Via API (programmatic)

# Trigger a manual broadcast retry
curl -X POST https://api.xamle.studio/v1/internal/ping \
  -H "Authorization: Bearer $ADMIN_API_KEY"

Common Incidents

All exercise feedback silently ignored

Symptom: Users send exercise answers, receive no feedback.
Cause: generate-feedback job enqueued but no handler registered.
Fix: Ensure FeedbackHandler is registered in apps/whatsapp-worker/src/index.ts.
Check: grep the worker logs for "generate-feedback". A healthy worker shows [WORKER] Processing job: generate-feedback; a missing handler shows [WORKER] No handler found for job name instead.
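
A startup check is one way to catch a missing handler before any job is dropped. The sketch below is a hypothetical registry, not the code in apps/whatsapp-worker/src/index.ts: `HandlerRegistry`, `missing`, and `dispatch` are illustrative names. Dispatching an unknown job name throws rather than returning quietly.

```typescript
type JobHandler = (data: unknown) => string;

class HandlerRegistry {
  private readonly handlers = new Map<string, JobHandler>();

  register(jobName: string, handler: JobHandler): void {
    this.handlers.set(jobName, handler);
  }

  // Job names that are expected to be enqueued but have no handler registered.
  missing(expectedJobNames: string[]): string[] {
    return expectedJobNames.filter((name) => !this.handlers.has(name));
  }

  dispatch(jobName: string, data: unknown): string {
    const handler = this.handlers.get(jobName);
    if (!handler) {
      // Throwing surfaces the problem as a failed job instead of a silent drop.
      throw new Error(`[WORKER] No handler found for job name: ${jobName}`);
    }
    return handler(data);
  }
}
```

Calling `missing(["send-message", "generate-feedback"])` when the worker boots, and refusing to start if the result is non-empty, would turn this incident into a deploy-time error.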

Daily message limit hit for an org

Symptom: [WORKER] Skipping job send-message for Org X: Daily Limit Reached in logs.
Check: UsageService.getDailyUsage(orgId) returns a count at or near the limit.
Fix: Increase the limit in Organization.dailyMessageLimit, or wait until midnight UTC for the reset.
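
To decide between raising the limit and waiting, it helps to know how long remains until the reset. The sketch below computes the time to the next midnight UTC. `hasReachedDailyLimit` and `msUntilUtcMidnight` are hypothetical helpers, and the `>=` comparison is an assumption about how the worker applies the limit.

```typescript
// Assumed rule: an org is blocked once usage reaches the limit.
function hasReachedDailyLimit(usedToday: number, dailyMessageLimit: number): boolean {
  return usedToday >= dailyMessageLimit;
}

// Milliseconds from `now` until the next 00:00 UTC.
function msUntilUtcMidnight(now: Date): number {
  // Date.UTC normalizes day overflow, so "day + 1" also works on the last day of a month.
  const nextMidnight = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return nextMidnight - now.getTime();
}
```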

WhatsApp message delivered but DB not updated

Symptom: User received the message, but UserProgress.exerciseStatus is still PENDING.
Cause: Atomicity rule violation: sendWhatsApp was called before prisma.update, so a failed update leaves the DB behind what the user has already seen.
Fix: Always update the DB first, then send. Review any recent changes to handler files.
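
The ordering rule can be made explicit in a handler. In the sketch below the two dependencies are injected so the order can be tested without a database or WhatsApp. `DeliveryDeps`, `persistProgress`, and `deliverExercise` are hypothetical names, with `persistProgress` standing in for the prisma.update call.

```typescript
interface DeliveryDeps {
  // Stands in for the prisma.update call that records the new exercise status.
  persistProgress(userId: string): Promise<void>;
  // Stands in for sendWhatsApp.
  sendWhatsApp(userId: string, body: string): Promise<void>;
}

async function deliverExercise(deps: DeliveryDeps, userId: string, body: string): Promise<void> {
  // 1. Write first. If this throws, nothing has been sent to the user yet.
  await deps.persistProgress(userId);
  // 2. Send only after the write has succeeded.
  await deps.sendWhatsApp(userId, body);
}
```

When reviewing handler changes, any `sendWhatsApp` call that appears above its matching DB write breaks this order.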

Redis connection lost

Symptom: [REDIS] Worker connection error repeating in logs. All jobs stall.
Fix: Railway auto-restarts the worker. If the error persists, check Redis service health in the Railway dashboard and rotate REDIS_URL if needed.