Priority: high | Complexity: medium | Area: infrastructure | Status: pending | Role: infrastructure specialist | Tier 20

Acceptance Criteria

After each failed cron run, `consecutive_failure_count` in the new `cron_execution_logs` row equals the previous run's count plus 1
After a successful run, `consecutive_failure_count` in the new log row is reset to 0
When `consecutive_failure_count` reaches 2 (two consecutive failures), an alert is dispatched to the configured alerting channel
The alert payload includes the error message from the most recent failure, the timestamp of the last successful run (queried from `cron_execution_logs`), and the consecutive failure count
The alert is NOT re-sent on each subsequent consecutive failure after the threshold; it fires only on the exact transition to count == 2, to prevent alert storms. Optional: re-alert every N additional failures (configurable)
Alert channel configuration (webhook URL or email) is stored in Supabase Vault or environment variables, not hardcoded in source
A test mode flag allows triggering a test alert without requiring actual consecutive failures
Receiving the alert requires no authentication; the Edge Function pushes the alert to the configured channel
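
The counting and threshold criteria above can be sketched as two pure functions. This is a minimal sketch, not the project's implementation: `AlertPolicy`, `nextFailureCount`, `shouldAlert`, and the `testMode` parameter are hypothetical names, and only the threshold value of 2 comes from the criteria.

```typescript
// Hypothetical names throughout; only the threshold of 2 comes from the criteria.
interface AlertPolicy {
  threshold: number; // alert on the exact transition to this count
  realertEvery?: number; // optional: re-alert every N additional failures
}

// Count to store in the new log row, given the previous run's count.
function nextFailureCount(previousCount: number, runFailed: boolean): number {
  return runFailed ? previousCount + 1 : 0;
}

// Decide whether the new count should trigger an alert.
function shouldAlert(
  newCount: number,
  policy: AlertPolicy,
  testMode: boolean = false,
): boolean {
  if (testMode) return true; // test flag bypasses the failure requirement
  if (newCount === policy.threshold) return true; // exact transition only
  const n = policy.realertEvery;
  if (n !== undefined && n > 0 && newCount > policy.threshold) {
    // Optional re-alert: fire again every n failures past the threshold.
    return (newCount - policy.threshold) % n === 0;
  }
  return false;
}
```

Keeping the decision free of I/O makes the "exactly once at count 2" rule easy to unit test without a database or a webhook.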

Technical Requirements

frameworks
Supabase Edge Functions (Deno/TypeScript)
Supabase Vault (for secret storage)
apis
Slack Incoming Webhooks API or SMTP email API (e.g., Resend, Postmark)
Supabase service_role key
data models
cron_execution_logs
performance requirements
Alert dispatch must be non-blocking — use async fire-and-forget after the log update; do not delay cron function completion
Alert HTTP call must have a timeout of 5 seconds to prevent hanging on unresponsive webhook endpoints
security requirements
Webhook URL and any API keys must be stored in Supabase Vault (or in environment variables, as the acceptance criteria allow), never in source code or `cron_execution_logs` rows
Alert payloads must not include PII; include only system identifiers and error codes, and limit the error message carried in the payload to the same
The alerting path must not throw unhandled exceptions that could mask the original cron failure
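
The 5-second timeout and the "never throw" requirement can be combined in one dispatch helper. The following is a sketch under assumptions: `dispatchAlert`, `FetchLike`, and `DispatchResult` are hypothetical names, and the fetch implementation is injected so the helper can be exercised without a real endpoint.

```typescript
// Hypothetical types; FetchLike is a narrowed view of a fetch-style function.
interface WebhookRequest {
  method: "POST";
  headers: Record<string, string>;
  body: string;
  signal: AbortSignal;
}
interface WebhookResponse {
  ok: boolean;
  status: number;
}
type FetchLike = (url: string, init: WebhookRequest) => Promise<WebhookResponse>;

interface DispatchResult {
  delivered: boolean;
  reason?: string;
}

// Sends the alert with a hard timeout and never throws to the caller.
async function dispatchAlert(
  webhookUrl: string,
  payload: unknown,
  fetchImpl: FetchLike,
  timeoutMs: number = 5000,
): Promise<DispatchResult> {
  const controller = new AbortController();
  // Abort the request if the endpoint has not answered within timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return res.ok
      ? { delivered: true }
      : { delivered: false, reason: `HTTP ${res.status}` };
  } catch (err) {
    // Swallow every error so the original cron failure is not masked.
    // The webhook URL is deliberately left out of the returned reason.
    const reason = err instanceof Error ? err.message : String(err);
    return { delivered: false, reason };
  } finally {
    clearTimeout(timer);
  }
}
```

In the Edge Function, the runtime's global `fetch` would presumably be passed as `fetchImpl`; a test can pass a fake that hangs or throws instead.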

Execution Context

Execution Tier
Tier 20

Tier 20 - 2 tasks

Can start after Tier 19 completes

Implementation Notes

Query for the previous run's `consecutive_failure_count` using: `SELECT consecutive_failure_count FROM cron_execution_logs ORDER BY started_at DESC LIMIT 1 OFFSET 1`. The `OFFSET 1` skips the newest row, so the query assumes the current run's row has already been inserted; if it returns no row (the first run), treat the previous count as 0. Increment by 1 on failure, set to 0 on success. Only dispatch the alert when the new count equals exactly 2 (the threshold). Use Supabase Vault `vault.decrypted_secrets` to retrieve the webhook URL at runtime.
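
To make the lookup semantics concrete, here is an in-memory sketch that mirrors `ORDER BY started_at DESC LIMIT 1 OFFSET 1` over an array of rows. It assumes the current run's row is already present and is the newest one. `CronLogRow` (including its `id` and `status` fields), `sortNewestFirst`, and `finalizeCurrentRun` are hypothetical names; the real implementation would run the SQL against `cron_execution_logs` instead.

```typescript
// Hypothetical row shape; only started_at and consecutive_failure_count
// are named in the task description.
interface CronLogRow {
  id: number;
  started_at: string; // ISO 8601 timestamp
  status: "running" | "success" | "failed";
  consecutive_failure_count: number;
}

// Equivalent of ORDER BY started_at DESC (does not mutate the input).
function sortNewestFirst(rows: CronLogRow[]): CronLogRow[] {
  return [...rows].sort(
    (a, b) => Date.parse(b.started_at) - Date.parse(a.started_at),
  );
}

// Returns the current (newest) row with its final status and count,
// or undefined when there are no rows at all.
function finalizeCurrentRun(
  rows: CronLogRow[],
  failed: boolean,
): CronLogRow | undefined {
  // Index 0 is the current run; index 1 is what OFFSET 1 selects.
  const [current, previous] = sortNewestFirst(rows);
  if (current === undefined) return undefined;
  // No previous row (first run ever) counts as zero prior failures.
  const previousCount = previous?.consecutive_failure_count ?? 0;
  return {
    ...current,
    status: failed ? "failed" : "success",
    consecutive_failure_count: failed ? previousCount + 1 : 0,
  };
}
```

Note that the lookup is only correct if no other run inserts a row between the current run's insert and this query, which is one more reason to prevent overlapping runs.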

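Implement the alert dispatch as a separate async function wrapped in its own try/catch so that an alert delivery failure does not pollute the cron execution status. Calling it with `await` keeps the Edge Function alive until delivery finishes, but it can delay completion by up to the 5-second timeout; start the dispatch only after the log update, and if completion must not wait at all, hand the promise to the runtime's background-task mechanism instead of awaiting it. Structure the Slack message using Block Kit for readability: include a header, an error details section, and a 'Last successful run' context block. If using email, use a transactional email provider already approved by the project; confirm with the team before integrating a new service.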
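
A possible shape for the Block Kit message described above is sketched below. The payload field names (`error_message`, `last_success_at`, `consecutive_failure_count`) follow the testing requirements; `buildSlackMessage`, the header wording, and the `"none recorded"` fallback are assumptions made for illustration.

```typescript
// Payload fields as named in the testing requirements.
interface AlertPayload {
  error_message: string;
  last_success_at: string | null; // null when no successful run is on record
  consecutive_failure_count: number;
}

// Builds a Slack Incoming Webhook body: header, error section, context block.
function buildSlackMessage(p: AlertPayload) {
  return {
    // Plain-text fallback shown in notifications.
    text: `Cron failure alert: ${p.consecutive_failure_count} consecutive failures`,
    blocks: [
      {
        type: "header",
        text: { type: "plain_text", text: "Cron job is failing" },
      },
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text:
            `*Error:* ${p.error_message}\n` +
            `*Consecutive failures:* ${p.consecutive_failure_count}`,
        },
      },
      {
        type: "context",
        elements: [
          {
            type: "mrkdwn",
            text: `Last successful run: ${p.last_success_at ?? "none recorded"}`,
          },
        ],
      },
    ],
  };
}
```

The returned object would be serialized with `JSON.stringify` and posted to the webhook URL retrieved at runtime.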

Testing Requirements

Unit tests: mock the alerting channel call and assert it is invoked exactly once when `consecutive_failure_count` transitions to 2, and NOT invoked for count 1 or count 3+. Assert alert payload structure matches specification (`error_message`, `last_success_at`, `consecutive_failure_count`).

Integration tests: simulate two consecutive failed cron runs in a test Supabase environment and verify the alert channel mock received exactly one call with correct payload.

Test the reset path: after two failures followed by a success, assert `consecutive_failure_count` is 0 in the new log row.
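
The "invoked exactly once" assertion can be illustrated with a hand-rolled spy. This is only a sketch: `createSpy` and `replayRuns` are hypothetical, `replayRuns` stands in for the real function under test, and an actual suite would more likely use the mocking utilities of the project's test framework.

```typescript
interface AlertPayload {
  error_message: string;
  last_success_at: string | null;
  consecutive_failure_count: number;
}

interface RunOutcome {
  ok: boolean;
  at: string; // run timestamp
  error?: string;
}

// Minimal spy that records every argument it is called with.
function createSpy<T>() {
  const calls: T[] = [];
  return {
    calls,
    fn: (arg: T): void => {
      calls.push(arg);
    },
  };
}

// Stand-in for the code under test: replays run outcomes in order and
// alerts only on the transition to two consecutive failures.
function replayRuns(
  outcomes: RunOutcome[],
  alert: (payload: AlertPayload) => void,
): number {
  let count = 0;
  let lastSuccessAt: string | null = null;
  for (const run of outcomes) {
    if (run.ok) {
      count = 0;
      lastSuccessAt = run.at;
      continue;
    }
    count += 1;
    if (count === 2) {
      alert({
        error_message: run.error ?? "unknown",
        last_success_at: lastSuccessAt,
        consecutive_failure_count: count,
      });
    }
  }
  return count; // count that the newest log row would hold
}

// Example: one success followed by three failures.
const spy = createSpy<AlertPayload>();
const finalCount = replayRuns(
  [
    { ok: true, at: "2024-01-01T00:00:00Z" },
    { ok: false, at: "2024-01-02T00:00:00Z", error: "E1" },
    { ok: false, at: "2024-01-03T00:00:00Z", error: "E2" },
    { ok: false, at: "2024-01-04T00:00:00Z", error: "E3" },
  ],
  spy.fn,
);
```

After this sequence a test would check that `spy.calls` holds a single payload carrying the second failure's error and the first run's timestamp.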

Test the timeout path: simulate a slow webhook endpoint and verify the cron function still completes within acceptable time.

Component
Assignment Reminder Cron Trigger
infrastructure medium
Epic Risks (2)
High impact, low probability, technical

If the daily cron job takes longer than 24 hours to complete (due to a large dataset or a slow query), a second instance will start while the first is still running, causing duplicate reminder dispatch for assignments processed twice.

Mitigation & Contingency

Mitigation: Implement an advisory lock that prevents a second run from starting if the first is still active. Monitor run duration via the execution log table and alert if any run exceeds 30 minutes. The 10,000-assignment load test should verify the run completes in under 5 minutes.
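
The lock-and-skip behaviour described in the mitigation can be sketched as a wrapper around the job. `LockClient`, `runExclusive`, and `exceedsDurationLimit` are hypothetical names. In Postgres the two lock methods would typically map to `pg_try_advisory_lock` and `pg_advisory_unlock`; those are session-level, so this assumes both calls go through the same database session, which is worth verifying if a connection pooler sits in between.

```typescript
// Hypothetical abstraction over the database's advisory lock calls.
interface LockClient {
  tryLock(key: number): Promise<boolean>; // true when the lock was acquired
  unlock(key: number): Promise<void>;
}

type ExclusiveResult<T> = { ran: true; result: T } | { ran: false };

// Runs the job only if no other run currently holds the lock.
async function runExclusive<T>(
  lock: LockClient,
  key: number,
  job: () => Promise<T>,
): Promise<ExclusiveResult<T>> {
  if (!(await lock.tryLock(key))) {
    return { ran: false }; // another run is still active; skip this one
  }
  try {
    return { ran: true, result: await job() };
  } finally {
    await lock.unlock(key); // always release, even when the job throws
  }
}

// Duration check against the 30-minute alerting limit from the mitigation.
function exceedsDurationLimit(
  startedAt: Date,
  finishedAt: Date,
  limitMinutes: number = 30,
): boolean {
  return finishedAt.getTime() - startedAt.getTime() > limitMinutes * 60000;
}
```

A skipped run would still be worth recording in the execution log so that overlaps remain visible to the ops team.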

Contingency: If a double-run occurs, the idempotency guard in ReminderDispatchService prevents duplicate notifications from being sent. The execution log identifies the overlap and allows the ops team to investigate the root cause.

High impact, medium probability, integration

If the activity registration hook that resets last_contact_date is implemented incorrectly or not triggered for all activity types (e.g., proxy registrations, bulk registrations), peer mentors will continue receiving reminders even after logging contact, damaging user trust.

Mitigation & Contingency

Mitigation: Audit all code paths that create activity records (direct registration, proxy registration, bulk registration, coordinator proxy) and ensure each path calls the assignment contact update. Write integration tests for each registration path asserting that last_contact_date is updated.

Contingency: Provide an authenticated admin endpoint that allows manual correction of last_contact_date for a specific assignment, enabling ops to resolve individual cases while the bug is fixed and deployed.