Set up consecutive failure alerting
epic-assignment-follow-up-reminders-cron-infrastructure-task-005 — Configure alerting logic that fires when the cron trigger fails on two or more consecutive executions. Integrate with Supabase monitoring or an external alerting channel (e.g., email or Slack webhook) to notify the operations team. Track consecutive failure count in the execution log table and reset the counter on successful runs. Include the error message and last successful run timestamp in alert payloads.
Acceptance Criteria
Technical Requirements
Execution Context
Tier 20 - 2 tasks
Can start after Tier 19 completes
Implementation Notes
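Read the previous run's `consecutive_failure_count` with `SELECT consecutive_failure_count FROM cron_execution_logs ORDER BY started_at DESC LIMIT 1 OFFSET 1`. The `OFFSET 1` skips the newest row, so this query assumes the current run's row has already been inserted; if the row is written only at the end of the run, drop the offset. Treat a missing previous row as a count of 0. Increment by 1 on failure, set to 0 on success. Dispatch the alert only when the new count equals exactly 2 (the threshold), so a prolonged outage produces a single notification rather than one per failed run. Use Supabase Vault `vault.decrypted_secrets` to retrieve the webhook URL at runtime.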
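The counter transition described above can be sketched as two small pure functions. This is a minimal illustration, not the project's implementation: the names `nextFailureCount`, `shouldAlert`, and `ALERT_THRESHOLD` are hypothetical, and the previous count is assumed to have been read from `cron_execution_logs` beforehand.

```typescript
// Threshold from the task description: alert on the second consecutive failure.
const ALERT_THRESHOLD = 2;

// Hypothetical helper: derives the new counter from the previous row's value.
// `previousCount` is null when cron_execution_logs has no earlier run.
function nextFailureCount(previousCount: number | null, runSucceeded: boolean): number {
  if (runSucceeded) return 0; // any success resets the streak
  return (previousCount ?? 0) + 1; // a failure extends it
}

// Alert only on the transition to the threshold, so a long outage
// yields one notification rather than one per failed run.
function shouldAlert(newCount: number): boolean {
  return newCount === ALERT_THRESHOLD;
}
```

Keeping these functions free of database and network calls makes the "exactly once at count 2" rule straightforward to unit test.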
Implement the alert dispatch as a separate async function that is called with `await` but wrapped in its own try/catch, so that a failed alert delivery does not alter the recorded cron execution status. Structure the Slack message using Block Kit for readability: include a header, an error details section, and a 'Last successful run' context block. If using email, use a transactional email provider already approved by the project, and confirm with the team before integrating a new service.
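One possible shape for the isolated dispatch and the Block Kit message is sketched below. The payload fields follow the names used in this task; `PostJson`, `buildSlackMessage`, and `dispatchAlert` are hypothetical names, the header wording is illustrative, and the transport is injected as a function (in an Edge Function it would typically wrap `fetch`) so the sketch stays self-contained.

```typescript
interface AlertPayload {
  error_message: string;
  last_success_at: string | null; // ISO timestamp; null if no run has ever succeeded
  consecutive_failure_count: number;
}

// Hypothetical transport type: posts a JSON body to a URL.
type PostJson = (url: string, body: unknown) => Promise<void>;

// Builds a Block Kit message with header, section, and context blocks.
function buildSlackMessage(p: AlertPayload): { blocks: object[] } {
  return {
    blocks: [
      { type: "header", text: { type: "plain_text", text: "Reminder cron failing" } },
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `*Consecutive failures:* ${p.consecutive_failure_count}\n*Error:* ${p.error_message}`,
        },
      },
      {
        type: "context",
        elements: [
          { type: "mrkdwn", text: `Last successful run: ${p.last_success_at ?? "never"}` },
        ],
      },
    ],
  };
}

// Returns true if the alert was delivered. Never throws, so a failed
// delivery cannot change the status recorded for the cron run itself.
async function dispatchAlert(post: PostJson, webhookUrl: string, p: AlertPayload): Promise<boolean> {
  try {
    await post(webhookUrl, buildSlackMessage(p));
    return true;
  } catch (err) {
    console.error("Alert delivery failed:", err);
    return false;
  }
}
```

Returning a boolean instead of rethrowing lets the caller log the delivery outcome separately from the run status.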
Testing Requirements
Unit tests: mock the alerting channel call and assert it is invoked exactly once when consecutive_failure_count transitions to 2, and NOT invoked for count 1 or count 3+. Assert alert payload structure matches specification (error_message, last_success_at, consecutive_failure_count).
Integration tests: simulate two consecutive failed cron runs in a test Supabase environment and verify the alert channel mock received exactly one call with correct payload.
Test the reset path: after two failures followed by a success, assert consecutive_failure_count is 0 in the new log row.
Test the timeout path: simulate a slow webhook endpoint and verify the cron function still completes within acceptable time.
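One way to keep a slow webhook from stalling the run is to bound the alert call with a timeout. The task does not prescribe a mechanism, so the following is only a sketch under that assumption; `withTimeout` is a hypothetical helper, and the timeout value would need to be chosen by the team.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// Note: this stops waiting but does not cancel the underlying request.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer either way so it does not keep the runtime alive.
    clearTimeout(timer);
  }
}
```

In a test, the slow endpoint can be simulated with a promise that resolves after a delay longer than the timeout, then asserting that the wrapped call rejects and the cron function still returns.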
If the daily cron job takes longer than 24 hours to complete (due to a large dataset or a slow query), a second instance will start while the first is still running, causing duplicate reminder dispatch for assignments processed twice.
Mitigation & Contingency
Mitigation: Implement an advisory lock that prevents a second run from starting if the first is still active. Monitor run duration via the execution log table and alert if any run exceeds 30 minutes. The 10,000-assignment load test should verify the run completes in under 5 minutes.
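The advisory-lock mitigation could look roughly like the sketch below, which uses PostgreSQL's `pg_try_advisory_lock` and `pg_advisory_unlock`. The `SqlClient` interface, the `runExclusive` helper, and the lock key value are assumptions made for illustration; the project's real database client will differ in detail.

```typescript
// Hypothetical minimal database client interface.
interface SqlClient {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Arbitrary application-chosen key identifying the reminder cron job (assumption).
const REMINDER_CRON_LOCK_KEY = 424242;

// Runs `job` only if no other session holds the lock; returns null when skipped.
async function runExclusive<T>(db: SqlClient, job: () => Promise<T>): Promise<T | null> {
  const res = await db.query("SELECT pg_try_advisory_lock($1) AS locked", [REMINDER_CRON_LOCK_KEY]);
  if (res.rows[0]?.locked !== true) return null; // another run is still active
  try {
    return await job();
  } finally {
    // Session-level locks must be released on the same connection that took them.
    await db.query("SELECT pg_advisory_unlock($1)", [REMINDER_CRON_LOCK_KEY]);
  }
}
```

Because session-level advisory locks belong to a database connection, this pattern assumes the lock and unlock calls share one connection; behind a transaction-mode connection pooler, a transaction-scoped lock (`pg_try_advisory_xact_lock`) may be the safer choice.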
Contingency: If a double-run occurs, the idempotency guard in ReminderDispatchService prevents duplicate notifications from being sent. The execution log identifies the overlap and allows the ops team to investigate the root cause.
If the activity registration hook that resets last_contact_date is implemented incorrectly or not triggered for all activity types (e.g., proxy registrations, bulk registrations), peer mentors will continue receiving reminders even after logging contact, damaging user trust.
Mitigation & Contingency
Mitigation: Audit all code paths that create activity records (direct registration, proxy registration, bulk registration, coordinator proxy) and ensure each path calls the assignment contact update. Write integration tests for each registration path asserting that last_contact_date is updated.
Contingency: Provide an authenticated admin endpoint that allows manual correction of last_contact_date for a specific assignment, enabling ops to resolve individual cases while the bug is fixed and deployed.