Implement nightly scheduler trigger for expiry checker
epic-peer-mentor-pause-management-automated-expiry-task-002 — Wire the CertificationExpiryChecker into the nightly job scheduler (394-nightly-scheduler). Configure the cron trigger to invoke the expiry checker each night at a defined off-peak time, implement graceful error handling so a scheduler failure does not silently skip the run, and add a run-lock mechanism to prevent concurrent executions.
Acceptance Criteria
Technical Requirements
Execution Context
Tier 1 - 540 tasks
Can start after Tier 0 completes
Implementation Notes
Prefer Supabase Edge Functions with a cron trigger over pg_cron for easier deployment and TypeScript/Dart interop. The run-lock should be implemented as a Postgres advisory lock (`pg_try_advisory_lock`) or a dedicated table row with `SELECT FOR UPDATE SKIP LOCKED` — advisory locks are simpler but lost on disconnect; a table-based lock survives restarts. Use a UUID v4 as the run_id so logs are globally unique and traceable across retries. The stale-lock override logic should check `started_at < NOW() - INTERVAL '2 hours'` and DELETE the old lock before inserting a new one, wrapped in a single transaction.
For the alerting mechanism, a Supabase Database Webhook on the scheduler_run_log table filtered to `status = 'FAILED'` calling a notification endpoint is the least-infrastructure approach.
Testing Requirements
Integration tests: (1) invoke scheduler twice in rapid succession and assert only one run proceeds (lock test), (2) simulate an exception in the expiry checker and assert the run is marked FAILED in the log table and no lock remains, (3) simulate a stale lock (insert a lock record with started_at = now() - 3 hours) and assert the new run overrides it. Unit tests for the lock acquisition/release logic using a mocked Supabase client. Manual smoke test in staging: trigger the cron at a modified time, confirm run log entry appears within 1 minute.
The nightly expiry checker may run multiple times due to scheduler retries or infrastructure issues, causing duplicate auto-transitions and duplicate coordinator notifications that erode trust in the notification system.
Mitigation & Contingency
Mitigation: Implement idempotency via a unique constraint on (mentor_id, threshold_day, certification_expiry_date) in the cert_expiry_reminders table. Auto-transitions should be wrapped in a Postgres RPC that checks current status before applying, making repeated invocations safe.
Contingency: Add a compensation query in the reconciliation log that detects duplicate log entries for the same certification period and alerts the operations team for manual review within 24 hours.
The HLF Dynamics portal API may have eventual-consistency behaviour or rate limits that cause website listing updates to lag behind status changes, leaving expired mentors visible on the public website for an unacceptable window.
Mitigation & Contingency
Mitigation: Design the sync service to be triggered immediately on status transitions (event-driven via database webhook) in addition to the nightly batch run. Implement a reconciliation job that verifies sync state against app state and re-triggers any divergent records.
Contingency: If real-time sync cannot be guaranteed, implement a manual 'force sync' action in the coordinator dashboard so coordinators can trigger an immediate re-sync for urgent cases. Document the expected sync lag in coordinator onboarding materials.
Stakeholder requests to extend the expiry checker to handle additional certification types, grace periods, or organisation-specific threshold configurations may significantly increase scope beyond what is designed here, delaying delivery.
Mitigation & Contingency
Mitigation: Parameterise threshold day values (30, 14, 7) via configuration repository rather than hard-coding them, enabling per-organisation customisation without code changes. Document that grace period logic and additional cert types are out of scope for this epic and require a dedicated follow-up.
Contingency: Deliver the feature with hard-coded HLF-standard thresholds first and introduce the configuration repository as a follow-up task in the next sprint, using a feature flag to enable per-org threshold overrides.
Dynamics portal API credentials stored as environment secrets in Supabase Edge Function configuration may be rotated or invalidated by HLF IT without notice, causing silent sync failures that go undetected for multiple days.
Mitigation & Contingency
Mitigation: Implement credential health-check calls on each scheduler run and emit an immediate alert on auth failure rather than only alerting after N consecutive failures. Document the credential rotation procedure with HLF IT and establish a rotation notification protocol.
Contingency: Maintain a break-glass manual sync script accessible to HLF administrators that can re-execute the Dynamics sync with newly provided credentials while the automated system is restored.