Set up consecutive failure alerting

epic-assignment-follow-up-reminders-cron-infrastructure-task-005 — Configure alerting logic that fires when the cron trigger fails on two or more consecutive executions. Integrate with Supabase monitoring or an external alerting channel (e.g., email or Slack webhook) to notify the operations team. Track consecutive failure count in the execution log table and reset the counter on successful runs. Include the error message and last successful run timestamp in alert payloads.

High priority, medium complexity, infrastructure, pending, infrastructure specialist, Tier 20

Acceptance Criteria

After each failed cron run, `consecutive_failure_count` in the latest `cron_execution_logs` row is incremented relative to the previous run's count

After a successful run, the `consecutive_failure_count` in the new log row is reset to 0

When `consecutive_failure_count` reaches 2 (two consecutive failures), an alert is dispatched to the configured alerting channel

The alert payload includes: error message from the most recent failure, timestamp of the last successful run (queried from `cron_execution_logs`), and consecutive failure count

The alert is sent only on the exact transition to `consecutive_failure_count == 2` and is NOT re-sent on each subsequent consecutive failure after the threshold (to prevent alert storms). Optional: re-alert every N additional failures (N configurable)

Alert channel configuration (webhook URL or email) is stored in Supabase Vault or environment variables — not hardcoded in source

A test mode flag allows triggering a test alert without requiring actual consecutive failures

Receiving the alert requires no authentication — the alert is pushed to the configured channel by the Edge Function
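
The counter and threshold criteria above can be sketched as a small pure function. This is a minimal sketch, not the project's implementation: the names `nextFailureState`, `FailureState`, `threshold`, and `realertEvery` are hypothetical, and the optional re-alert is assumed to fire every `realertEvery` failures counted from the threshold.

```typescript
// Hypothetical result shape: the value to store in the new log row,
// plus whether this run should dispatch an alert.
interface FailureState {
  consecutiveFailureCount: number;
  shouldAlert: boolean;
}

// Computes the new counter from the previous run's counter.
// threshold = 2 follows the acceptance criteria; realertEvery = 0
// disables the optional "re-alert every N additional failures" behavior.
function nextFailureState(
  previousCount: number,
  runSucceeded: boolean,
  threshold = 2,
  realertEvery = 0,
): FailureState {
  if (runSucceeded) {
    // A successful run always resets the counter and never alerts.
    return { consecutiveFailureCount: 0, shouldAlert: false };
  }
  const count = previousCount + 1;
  // Alert only on the exact transition to the threshold...
  const atThreshold = count === threshold;
  // ...or, if enabled, on every N-th failure past the threshold.
  const atRealert =
    realertEvery > 0 &&
    count > threshold &&
    (count - threshold) % realertEvery === 0;
  return { consecutiveFailureCount: count, shouldAlert: atThreshold || atRealert };
}
```

Keeping this logic free of I/O means the threshold behavior can be unit tested without a database or a webhook.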

Technical Requirements

frameworks

Supabase Edge Functions (Deno/TypeScript)

Supabase Vault (for secret storage)

apis

Slack Incoming Webhooks API or SMTP email API (e.g., Resend, Postmark)

Supabase service_role key

data models

cron_execution_logs

performance requirements

Alert dispatch must be non-blocking — use async fire-and-forget after the log update; do not delay cron function completion

Alert HTTP call must have a timeout of 5 seconds to prevent hanging on unresponsive webhook endpoints
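
One way to enforce the 5-second limit is an `AbortController` that cancels the request when a timer fires. The sketch below is illustrative only: `postWithTimeout` and `PostFn` are hypothetical names, and the HTTP function is injected (the global `fetch` fits the `PostFn` shape) so the helper can be tested without a network.

```typescript
// Minimal shape of the fetch-like function this sketch needs (an assumption
// made for testability; the global `fetch` is compatible with it).
type PostFn = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
    signal: AbortSignal;
  },
) => Promise<{ ok: boolean; status: number }>;

// Hypothetical helper: POST a JSON payload and give up after `timeoutMs`.
// Resolves to true on a 2xx response and to false on a timeout, a network
// error, or a non-2xx response. It never rejects.
async function postWithTimeout(
  post: PostFn,
  url: string,
  payload: unknown,
  timeoutMs = 5_000,
): Promise<boolean> {
  const controller = new AbortController();
  // Abort the in-flight request once the timeout elapses.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await post(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return res.ok;
  } catch (_err) {
    // Aborted or failed request: report failure instead of throwing.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

The timeout only takes effect if the injected function honors the abort signal, as `fetch` does.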

security requirements

Webhook URL and any API keys must be stored in Supabase Vault, never in source code or cron_execution_logs rows

Alert payloads must not include PII — include only system identifiers and error codes

The alerting path must not throw unhandled exceptions that could mask the original cron failure

Execution Context

Execution Tier

Tier 20 - 2 tasks

Can start after Tier 19 completes

Implementation Notes

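Query for the previous run's `consecutive_failure_count` using: `SELECT consecutive_failure_count FROM cron_execution_logs ORDER BY started_at DESC LIMIT 1 OFFSET 1`. The `OFFSET 1` skips the newest row, so this query assumes the current run's log row has already been inserted; if the row is written only at the end of the run, drop the `OFFSET`. If no previous row exists, treat the previous count as 0. Increment by 1 on failure, set to 0 on success. Only dispatch the alert when the new count equals exactly 2 (the threshold). Use the Supabase Vault view `vault.decrypted_secrets` to retrieve the webhook URL at runtime.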

Implement the alert dispatch as a separate async function called with `await` but wrapped in its own try/catch so that an alert delivery failure does not pollute the cron execution status. Because the call is awaited, the 5-second HTTP timeout is what keeps it from delaying cron completion. Structure the Slack message using Block Kit for readability: include a header, an error details section, and a 'Last successful run' context block. If using email, use a transactional email provider already approved by the project, and confirm with the team before integrating a new service.
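
A minimal sketch of the Block Kit message and the guarded dispatch, under stated assumptions: `AlertInfo`, `buildSlackAlert`, and `dispatchAlertSafely` are hypothetical names, the header wording is a placeholder, and `send` stands in for whatever function actually posts to the configured channel. The error message is assumed to be sanitized (no PII) before it reaches this code.

```typescript
// Hypothetical input for the alert; field meanings follow the payload
// specification (error message, last successful run, failure count).
interface AlertInfo {
  errorMessage: string;
  lastSuccessAt: string | null; // ISO timestamp, or null if none is logged
  consecutiveFailureCount: number;
}

// Builds a Slack message with a header block, a section block for the
// error details, and a context block for the last successful run.
function buildSlackAlert(info: AlertInfo) {
  const summary = `Cron failed ${info.consecutiveFailureCount} times in a row`;
  return {
    text: summary, // plain-text fallback for notifications
    blocks: [
      {
        type: "header",
        text: { type: "plain_text", text: "Assignment reminder cron is failing" },
      },
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `*Consecutive failures:* ${info.consecutiveFailureCount}\n*Error:* ${info.errorMessage}`,
        },
      },
      {
        type: "context",
        elements: [
          {
            type: "mrkdwn",
            text: `Last successful run: ${info.lastSuccessAt ?? "none recorded"}`,
          },
        ],
      },
    ],
  };
}

// Awaits the send but swallows any error, so an alert delivery failure
// cannot mask or replace the original cron failure.
async function dispatchAlertSafely(
  send: (payload: unknown) => Promise<void>,
  info: AlertInfo,
): Promise<boolean> {
  try {
    await send(buildSlackAlert(info));
    return true;
  } catch (err) {
    console.error("alert dispatch failed:", err);
    return false;
  }
}
```

Returning a boolean rather than throwing lets the caller record the delivery outcome separately from the cron execution status.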

Testing Requirements

Unit tests: mock the alerting channel call and assert it is invoked exactly once when consecutive_failure_count transitions to 2, and NOT invoked for count 1 or count 3+. Assert that the alert payload structure matches the specification (error_message, last_success_at, consecutive_failure_count).

Integration tests: simulate two consecutive failed cron runs in a test Supabase environment and verify the alert channel mock received exactly one call with the correct payload.

Test the reset path: after two failures followed by a success, assert consecutive_failure_count is 0 in the new log row.

Test the timeout path: simulate a slow webhook endpoint and verify the cron function still completes within acceptable time.
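
The threshold and reset scenarios could be exercised with an in-memory stand-in before running against a test Supabase environment. This is a sketch under assumptions: `makeHarness`, `recordRun`, `LogRow`, and `Alert` are hypothetical names, the array replaces `cron_execution_logs`, the `sent` array plays the role of the mocked alert channel, and the timestamps are sample values.

```typescript
type LogRow = {
  startedAt: string;
  succeeded: boolean;
  errorMessage: string | null;
  consecutiveFailureCount: number;
};

type Alert = {
  error_message: string;
  last_success_at: string | null;
  consecutive_failure_count: number;
};

// In-memory stand-in for the log table plus a mock alert channel.
function makeHarness(threshold = 2) {
  const rows: LogRow[] = [];
  const sent: Alert[] = [];

  // Records one run; a null errorMessage means the run succeeded.
  function recordRun(startedAt: string, errorMessage: string | null): LogRow {
    const last = rows[rows.length - 1];
    const previous = last ? last.consecutiveFailureCount : 0;
    const row: LogRow = {
      startedAt,
      succeeded: errorMessage === null,
      errorMessage,
      consecutiveFailureCount: errorMessage === null ? 0 : previous + 1,
    };
    // Alert only on the exact transition to the threshold.
    if (errorMessage !== null && row.consecutiveFailureCount === threshold) {
      const lastSuccess = [...rows].reverse().find((r) => r.succeeded);
      sent.push({
        error_message: errorMessage,
        last_success_at: lastSuccess ? lastSuccess.startedAt : null,
        consecutive_failure_count: row.consecutiveFailureCount,
      });
    }
    rows.push(row);
    return row;
  }

  return { rows, sent, recordRun };
}

// Scenario: one success, three consecutive failures, then a success.
const harness = makeHarness();
harness.recordRun("2024-01-01T00:00:00Z", null);
harness.recordRun("2024-01-02T00:00:00Z", "E_TIMEOUT");
harness.recordRun("2024-01-03T00:00:00Z", "E_TIMEOUT");
harness.recordRun("2024-01-04T00:00:00Z", "E_TIMEOUT");
const afterRecovery = harness.recordRun("2024-01-05T00:00:00Z", null);
```

After this scenario, `harness.sent` holds a single alert, raised on the second failure, and `afterRecovery.consecutiveFailureCount` is 0.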

Component

Assignment Reminder Cron Trigger

infrastructure medium

Dependencies (1)

Add structured execution logging to the cron trigger that captures start time, completion time, total duration, number of assignments evaluated, number of reminders dispatched, and any errors encountered. Store logs in a dedicated cron_execution_logs table in Supabase to support monitoring, auditing, and debugging of the reminder pipeline. (epic-assignment-follow-up-reminders-cron-infrastructure-task-002)

Epic Risks (2)

High impact, low probability, technical

If the daily cron job takes longer than 24 hours to complete (due to a large dataset or a slow query), a second instance will start while the first is still running, causing duplicate reminder dispatch for assignments processed twice.

Mitigation & Contingency

Mitigation: Implement an advisory lock that prevents a second run from starting if the first is still active. Monitor run duration via the execution log table and alert if any run exceeds 30 minutes. The 10,000-assignment load test should verify the run completes in under 5 minutes.

Contingency: If a double-run occurs, the idempotency guard in ReminderDispatchService prevents duplicate notifications from being sent. The execution log identifies the overlap and allows the ops team to investigate the root cause.

High impact, medium probability, integration

If the activity registration hook that resets last_contact_date is implemented incorrectly or not triggered for all activity types (e.g., proxy registrations, bulk registrations), peer mentors will continue receiving reminders even after logging contact, damaging user trust.

Mitigation & Contingency

Mitigation: Audit all code paths that create activity records (direct registration, proxy registration, bulk registration, coordinator proxy) and ensure each path calls the assignment contact update. Write integration tests for each registration path asserting that last_contact_date is updated.

Contingency: Provide an authenticated admin endpoint that allows manual correction of last_contact_date for a specific assignment, enabling ops to resolve individual cases while the bug is fixed and deployed.
