Implement edge function error handling and execution logging

epic-certificate-expiry-notifications-orchestration-services-task-016 — Add structured error handling to the expiry check edge function: catch downstream service failures, log execution summaries (mentor counts per tier, notification counts dispatched, suppressions applied, errors encountered) to a persistent execution log table, and emit an alert if the function exits with a non-zero error count. Ensure partial failures do not abort the entire run.

Priority: high | Complexity: medium | Area: backend | Status: pending | Role: backend specialist | Tier 4

Acceptance Criteria

A database table `expiry_check_execution_log` is created via a migration with columns: `id` (uuid PK), `executed_at` (timestamptz), `lapsed_count` (int), `seven_day_count` (int), `thirty_day_count` (int), `sixty_day_count` (int), `notifications_dispatched` (int), `suppressions_applied` (int), `error_count` (int), `errors` (jsonb), `duration_ms` (int)

After each cron run the edge function attempts to insert exactly one row into `expiry_check_execution_log` regardless of success or partial failure. The insert is attempted even when all downstream calls fail; a failure of the insert itself is caught and logged rather than crashing the function

If `error_count > 0` in the execution log row, the function also invokes a lightweight alert mechanism (e.g. Supabase Edge Function `send-ops-alert` or a structured `console.error` that triggers a Supabase log alert rule)

A single downstream service failure (e.g. FCM call rejected for one mentor) increments `error_count` and appends a structured `{ tier, mentorId, error }` object to the `errors` jsonb array without aborting remaining processing

The function continues processing all tiers even when a complete tier dispatch fails — tier-level failures are isolated from each other

The execution log table has a retention policy: rows older than 90 days are purged, either by a pg_cron job or by partitioning the table on `executed_at` and dropping old partitions, to prevent unbounded growth

RLS is enabled on `expiry_check_execution_log` with no policies granting access to `anon` or `authenticated`, so only the `service_role` key (which bypasses RLS) can read it. No mobile client can query execution logs

Execution duration (wall-clock ms from function start to log write) is recorded in `duration_ms` for performance trend monitoring
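The partial-failure criteria above can be sketched as a loop that guards each downstream call individually. This is a minimal illustration, not the function's actual code: the `Tier` names, the `Dispatch` signature, and the `errorCode` helper are assumptions introduced here, and only the `{ tier, mentorId, error }` shape comes from the criteria.

```typescript
// Hypothetical tier names, mirroring the *_count columns of the log table.
type Tier = "lapsed" | "seven_day" | "thirty_day" | "sixty_day";

// Error shape taken from the acceptance criteria.
interface RunError {
  tier: Tier;
  mentorId: string;
  error: string;
}

interface RunSummary {
  notificationsDispatched: number;
  errors: RunError[];
}

// Stand-in for the downstream orchestrator call (assumed signature).
type Dispatch = (tier: Tier, mentorId: string) => Promise<void>;

// Reduce an unknown thrown value to a short code so no free text is stored.
function errorCode(err: unknown): string {
  return err instanceof Error ? err.name : "UNKNOWN";
}

async function processTiers(
  partitions: Record<Tier, string[]>,
  dispatch: Dispatch,
): Promise<RunSummary> {
  const summary: RunSummary = { notificationsDispatched: 0, errors: [] };

  for (const tier of Object.keys(partitions) as Tier[]) {
    for (const mentorId of partitions[tier]) {
      try {
        await dispatch(tier, mentorId);
        summary.notificationsDispatched += 1;
      } catch (err) {
        // Record the failure in memory and keep going: one rejected call
        // must not stop the remaining mentors or the remaining tiers.
        summary.errors.push({ tier, mentorId, error: errorCode(err) });
      }
    }
  }
  return summary;
}
```

Because every call is guarded, a tier in which all calls fail simply contributes one error entry per mentor, and the loop moves on to the next tier.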

Technical Requirements

frameworks

Deno (Supabase Edge Function runtime)

TypeScript (Supabase Edge Functions)

apis

Supabase service-role client for writing to `expiry_check_execution_log`

Structured `console.error` logging, captured in the Supabase function logs (feeds into Supabase log alerts)

data models

certification (source data — counts derived from tier partitioning)

expiry_check_execution_log (new audit table — see acceptance criteria for schema)

performance requirements

Log write to `expiry_check_execution_log` must be a single INSERT — no SELECT-then-INSERT pattern

Error accumulation uses an in-memory array during the run; no database writes per error event (only final batch write)

Log table query for monitoring must be indexable by `executed_at DESC` — add index in migration

security requirements

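RLS policy: `expiry_check_execution_log` SELECT and INSERT are available to `service_role` only. Enable RLS and define no permissive policies for `authenticated` or `anon`; `service_role` bypasses RLS in Supabase, so it needs no policy of its own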

The `errors` jsonb column must not store PII: mentor names and emails must not appear; only UUIDs and error codes

Alert mechanism must not include PII in alert payloads — summary counts and error codes only
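The two PII requirements above could be approached as follows. This is a hedged sketch: `sanitiseMessage`, `buildAlertPayload`, the email pattern, and the 200-character cap are all assumptions made for illustration. A regex can redact email addresses but cannot reliably detect names, so relying on fixed error codes rather than free text remains the safer option.

```typescript
// Assumed error shape (tier, mentor UUID, code, sanitised message).
interface RunError {
  tier: string;
  mentorId: string;
  code: string;
  message: string;
}

// Simple email pattern; an illustration, not an exhaustive PII detector.
const EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

// Redact emails and cap the length before a message is stored.
function sanitiseMessage(message: string, maxLength = 200): string {
  return message.replace(EMAIL, "[redacted-email]").slice(0, maxLength);
}

// Build an alert payload from counts and codes only: no messages, no mentor IDs.
function buildAlertPayload(executionId: string, errors: RunError[]) {
  const codes: Record<string, number> = {};
  for (const e of errors) {
    codes[e.code] = (codes[e.code] ?? 0) + 1;
  }
  return {
    alert: "expiry-check-errors",
    executionId,
    errorCount: errors.length,
    codes,
  };
}
```

Keeping the alert payload to aggregate counts means that it stays safe even if a message slips through sanitisation.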

Execution Context

Execution Tier

Tier 4

Tier 4 - 323 tasks

Can start after Tier 3 completes


Implementation Notes

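Use the try/catch/finally pattern: initialise an `executionContext` object at function start, populate it as stages complete, and write the log row in a `finally` block so that the INSERT is attempted regardless of errors. Wrap the INSERT itself in its own try/catch so that a rejected write is `console.error`-logged instead of escaping the `finally` block.

Use `performance.now()` (available in Deno) to measure `duration_ms`. It returns a fractional millisecond value, so round it (e.g. with `Math.round`) before writing to the `int` column.

Keep the error accumulator as an array of typed objects `Array<{ tier: string; mentorId: string; code: string; message: string }>`. This splits the `error` field named in the acceptance criteria into `code` and `message`; the `message` field must be sanitised to strip any PII before recording.

For the alert mechanism, a `console.error(JSON.stringify({ alert: 'expiry-check-errors', errorCount, executionId }))` line is sufficient initially. One option is to generate `executionId` in the function (e.g. `crypto.randomUUID()`) and use it as the row `id`, so that an alert can be matched to its log row. Supabase log alerting can be configured in the dashboard to trigger on `error` level log lines from this function.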
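A minimal sketch of the pattern described in these notes, under stated assumptions: `runWithExecutionLog`, `Work`, and `InsertLog` are hypothetical names, the log write is injected as a function so the example runs without a Supabase client, and the `"run"` tier with an empty `mentorId` is an invented marker for failures that escape per-mentor handling.

```typescript
interface RunError {
  tier: string;
  mentorId: string;
  code: string;
  message: string;
}

// Mirrors the columns listed in the acceptance criteria.
interface ExecutionLogRow {
  id: string;
  executed_at: string;
  lapsed_count: number;
  seven_day_count: number;
  thirty_day_count: number;
  sixty_day_count: number;
  notifications_dispatched: number;
  suppressions_applied: number;
  error_count: number;
  errors: RunError[];
  duration_ms: number;
}

type Work = (ctx: ExecutionLogRow) => Promise<void>;
type InsertLog = (row: ExecutionLogRow) => Promise<void>;

async function runWithExecutionLog(
  executionId: string, // e.g. crypto.randomUUID() in the edge function
  work: Work,
  insertLog: InsertLog,
): Promise<ExecutionLogRow> {
  const startedAt = performance.now();
  const ctx: ExecutionLogRow = {
    id: executionId,
    executed_at: new Date().toISOString(),
    lapsed_count: 0,
    seven_day_count: 0,
    thirty_day_count: 0,
    sixty_day_count: 0,
    notifications_dispatched: 0,
    suppressions_applied: 0,
    error_count: 0,
    errors: [],
    duration_ms: 0,
  };

  try {
    await work(ctx); // stages fill in counts and push errors as they go
  } catch (_err) {
    // Anything that escapes the stages is recorded as a run-level error.
    ctx.errors.push({ tier: "run", mentorId: "", code: "UNHANDLED", message: "unhandled failure" });
  } finally {
    ctx.error_count = ctx.errors.length;
    ctx.duration_ms = Math.round(performance.now() - startedAt);
    try {
      await insertLog(ctx); // single INSERT, best effort
    } catch (_logErr) {
      console.error(JSON.stringify({ alert: "expiry-check-log-write-failed", executionId }));
    }
    if (ctx.error_count > 0) {
      console.error(JSON.stringify({ alert: "expiry-check-errors", errorCount: ctx.error_count, executionId }));
    }
  }
  return ctx;
}
```

In the edge function, `insertLog` would wrap the service-role client's insert into `expiry_check_execution_log`; injecting it also makes the log-write-failure scenario easy to exercise with a mock.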

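The 90-day retention purge can be implemented as a second `cron.schedule` entry in the same migration file that runs monthly. Note that a monthly schedule lets rows persist for up to about four months before removal; schedule the purge daily if the 90-day limit must be strict.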

Testing Requirements

Unit and integration tests using the Deno test runner with a mocked Supabase client. Required scenarios:

(1) A fully successful run produces a log row with `error_count = 0` and accurate tier counts.

(2) One downstream failure increments `error_count` to 1 and appends a structured error object to the `errors` array without aborting other tier processing.

(3) A complete tier failure (all mentors in a tier fail) is recorded in the log but does not prevent other tiers from processing.

(4) A log write failure (Supabase INSERT rejected) is caught and `console.error`-logged without crashing the function (best-effort logging).

(5) The `duration_ms` field is a non-negative integer (a very fast mocked run can round to 0).

Additionally, write a manual monitoring test: after a staging cron run, query `expiry_check_execution_log` and confirm the row was inserted with accurate counts.

Component

Certificate Expiry Check Edge Function

infrastructure medium

Dependencies (1)

epic-certificate-expiry-notifications-orchestration-services-task-014: Create the Supabase Edge Function that runs on the daily cron schedule. The function queries the certification expiry repository for all mentors with certificates expiring within 60 days or already lapsed, partitions results by threshold tier, and passes each partition to the downstream services: the orchestrator for notification dispatch and the visibility suppressor for lapsed mentors.

Epic Risks (4)

Impact: high | Probability: medium | Category: technical

If the daily edge function runs more than once in a 24-hour window due to a Supabase scheduling anomaly or manual re-trigger, the orchestrator could dispatch duplicate push notifications to the same mentor and coordinator for the same threshold, eroding user trust.

Mitigation & Contingency

Mitigation: Implement idempotency at the notification record level using a unique constraint on (mentor_id, threshold_days, certification_id). A separate existence check before dispatch can race when two runs overlap, so the orchestrator should insert the record with a database-level upsert using ON CONFLICT DO NOTHING and dispatch only when a row was actually inserted.
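The idempotency idea in this mitigation can be illustrated with a claim-then-dispatch sketch. Everything named here is hypothetical: `NotificationStore` stands in for the database insert with ON CONFLICT DO NOTHING, and `InMemoryStore` simulates the unique constraint so the example runs on its own. It inserts first and sends only when the insert created a new record, which is one way to avoid a window between checking and dispatching.

```typescript
interface NotificationKey {
  mentorId: string;
  thresholdDays: number;
  certificationId: string;
}

// Hypothetical port: resolves true only when a new record was created.
interface NotificationStore {
  insertIfAbsent(key: NotificationKey): Promise<boolean>;
}

// In-memory stand-in for a table with a unique constraint on the three fields.
class InMemoryStore implements NotificationStore {
  private seen = new Set<string>();

  async insertIfAbsent(key: NotificationKey): Promise<boolean> {
    const id = `${key.mentorId}:${key.thresholdDays}:${key.certificationId}`;
    if (this.seen.has(id)) return false; // conflict: do nothing
    this.seen.add(id);
    return true;
  }
}

type Send = (key: NotificationKey) => Promise<void>;

// Claim the notification first; send only if the claim succeeded.
async function dispatchOnce(
  store: NotificationStore,
  key: NotificationKey,
  send: Send,
): Promise<"sent" | "duplicate"> {
  const claimed = await store.insertIfAbsent(key);
  if (!claimed) return "duplicate";
  await send(key);
  return "sent";
}
```

One trade-off of this ordering: if `send` fails after the record exists, a re-run will treat the notification as already handled, so a real implementation would need a status column or a retry path for that case.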

Contingency: If duplicate notifications are reported in production, add a rate-limiting guard in the edge function that skips dispatch when a notification for the same mentor and threshold was created within the last 20 hours, and add an alerting rule to Supabase logs for duplicate dispatch attempts.

Impact: medium | Probability: medium | Category: scope

The mentor visibility suppressor relies on the daily edge function to detect expiry and update suppression_status. A mentor whose certificate expires at midnight may remain visible for up to 24 hours if the cron runs at a fixed time, violating HLF's requirement that expired mentors disappear promptly.

Mitigation & Contingency

Mitigation: Schedule the edge function to run at 00:05 UTC to minimise lag after midnight transitions. Additionally, the RLS policy can include a direct date comparison (certification_expiry_date < now()) as a secondary predicate that does not rely on suppression_status, providing real-time enforcement at the database level.

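Contingency: If the cron lag is unacceptable after launch, implement a database trigger on the certifications table that fires on UPDATE of expiry_date and calls the suppressor immediately. This reduces lag to near zero for renewals and manual date changes; a certificate that simply passes its expiry date produces no UPDATE, so that case still depends on the cron run or on the date predicate in the RLS policy.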

Impact: medium | Probability: low | Category: integration

The orchestrator needs to resolve the coordinator assigned to a specific peer mentor to dispatch coordinator-side notifications. If the assignment relationship is not normalised or is missing for some mentors, coordinator notifications will silently fail.

Mitigation & Contingency

Mitigation: Query the coordinator assignment from the existing assignments or user_roles table before dispatch. Log a structured warning (missing_coordinator_assignment: mentor_id) when no coordinator is found. Add a data quality check in the edge function that reports mentors without coordinators.
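The data quality check mentioned in this mitigation might look like the sketch below. The function name and the `Map` of mentor ID to coordinator ID are assumptions; in practice the assignments would come from the assignments or user_roles table. The warning carries only the mentor UUID, in line with the PII rules.

```typescript
// Returns the mentors with no coordinator and logs one structured warning each.
// `assignments` maps mentor ID to coordinator ID (hypothetical in-memory view).
function reportMentorsWithoutCoordinator(
  mentorIds: string[],
  assignments: Map<string, string>,
): string[] {
  const missing = mentorIds.filter((id) => !assignments.has(id));
  for (const mentorId of missing) {
    console.warn(
      JSON.stringify({ event: "missing_coordinator_assignment", mentor_id: mentorId }),
    );
  }
  return missing;
}
```

The returned list can be added to the run summary so that the gap shows up in monitoring rather than failing silently.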

Contingency: If coordinator assignments are missing at scale, fall back to notifying the chapter-level admin role for the mentor's chapter, and surface a data quality report to the admin dashboard showing mentors without assigned coordinators.

Impact: medium | Probability: low | Category: dependency

The course enrollment prompt service generates deep-link URLs targeting the course administration feature. If the course administration feature changes its deep-link schema or the Dynamics portal URL structure changes, enrollment prompts will navigate to broken destinations.

Mitigation & Contingency

Mitigation: Define the deep-link contract between the certificate expiry feature and the course administration feature as a shared constant in a cross-feature navigation config. Version the deep-link schema and validate the generated URL format in unit tests.

Contingency: If the deep-link breaks in production, the course enrollment prompt service should gracefully fall back to opening the course administration feature root screen with a query parameter indicating the notification context, allowing the user to manually locate the correct course.
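A versioned deep-link contract with a graceful fallback could be sketched as follows. The URL scheme, path layout, version number, and `context` query parameter are all invented for illustration; the real contract would be whatever the course administration feature defines.

```typescript
// Hypothetical shared constants for the cross-feature navigation config.
const COURSE_ADMIN_ROOT = "app://course-admin";
const DEEP_LINK_SCHEMA_VERSION = 1;

// Expected format for the current schema version (kept next to the builder
// so a unit test can validate generated links against it).
const COURSE_LINK_PATTERN = /^app:\/\/course-admin\/v1\/courses\/[^/?#]+$/;

function buildCourseLink(courseId: string): string {
  return `${COURSE_ADMIN_ROOT}/v${DEEP_LINK_SCHEMA_VERSION}/courses/${encodeURIComponent(courseId)}`;
}

// Fall back to the feature root with a context marker when no valid
// course link can be produced.
function resolveCourseLink(courseId: string | null): string {
  if (courseId) {
    const link = buildCourseLink(courseId);
    if (COURSE_LINK_PATTERN.test(link)) return link;
  }
  return `${COURSE_ADMIN_ROOT}?context=certificate-expiry`;
}
```

Validating against the pattern at build time means that a schema change which is not mirrored in the constants is caught by a failing test instead of a broken link in production.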
