epic-peer-mentor-pause-management-automated-expiry-task-002 - Implementation Task | Likepersonsapp

critical priority, medium complexity, infrastructure, pending, infrastructure specialist, Tier 1

Acceptance Criteria

A Supabase Edge Function or pg_cron job is configured to trigger CertificationExpiryChecker every night at 02:00 UTC in production (staging runs at 03:00 UTC)

The cron schedule is stored in version-controlled configuration (not set manually in Supabase dashboard) and documented

A run-lock record is inserted into a `scheduler_run_locks` table (or equivalent) at the start of each run and deleted on completion; a concurrent invocation finding an existing lock exits immediately with a logged warning

Run-lock records older than 2 hours are automatically considered stale and overridden (prevents indefinite lock after a crash)

If the expiry checker throws an unhandled exception, the scheduler catches it, logs a structured error entry with run ID and stack trace, and marks the run as FAILED in the audit/log table — it does NOT silently swallow the error

A missed or failed run triggers a Supabase alert (email or webhook) to the on-call contact within 5 minutes

Scheduler configuration is environment-aware: staging runs at 03:00 UTC, production at 02:00 UTC

Manual trigger endpoint exists for testing and incident recovery (authenticated, admin-only)

End-to-end test confirms that starting two concurrent invocations results in only one completing the full run
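The failure-handling criterion above (catch, log with run ID and stack trace, mark the run FAILED, release the lock) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `runWithAudit`, `RunLogStore`, and the `RUNNING`/`SUCCEEDED` status values are hypothetical names, and the job is synchronous for brevity where a real Edge Function would await an async checker.

```typescript
// Status values other than FAILED are assumptions made for this sketch.
type RunStatus = "RUNNING" | "SUCCEEDED" | "FAILED";

interface RunLogEntry {
  run_id: string;
  job_name: string;
  started_at: Date;
  finished_at: Date | null;
  status: RunStatus;
  error_message: string | null;
}

// Hypothetical persistence boundary; a real version would write to scheduler_run_log.
interface RunLogStore {
  save(entry: RunLogEntry): void;
}

function runWithAudit(
  runId: string,
  jobName: string,
  job: () => void,
  store: RunLogStore,
  releaseLock: () => void,
): RunLogEntry {
  const entry: RunLogEntry = {
    run_id: runId,
    job_name: jobName,
    started_at: new Date(),
    finished_at: null,
    status: "RUNNING",
    error_message: null,
  };
  store.save({ ...entry });
  try {
    job();
    entry.status = "SUCCEEDED";
  } catch (err) {
    // Normalise non-Error throwables so a message and stack are always available.
    const e = err instanceof Error ? err : new Error(String(err));
    entry.status = "FAILED";
    entry.error_message = e.message;
    // Structured log line carrying the run ID and stack trace.
    console.error(JSON.stringify({
      level: "error",
      run_id: runId,
      job_name: jobName,
      message: e.message,
      stack: e.stack ?? null,
    }));
  } finally {
    entry.finished_at = new Date();
    store.save({ ...entry });
    // The lock is released on success and on failure alike.
    releaseLock();
  }
  return entry;
}
```

The error is recorded rather than rethrown, so the scheduler process ends cleanly while the FAILED row remains available to drive alerting.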

Technical Requirements

frameworks

TypeScript on Deno (Supabase Edge Function runtime) or Dart (separate Dart backend service)

Supabase pg_cron extension or Supabase Edge Function cron trigger

Supabase (scheduler_run_locks table)
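One way to keep the environment-aware schedule in version control is a small configuration module that whichever cron mechanism is chosen reads at deploy time. The sketch below assumes the scheduler evaluates cron expressions in UTC; the names `CRON_SCHEDULES` and `cronScheduleFor` are hypothetical.

```typescript
type Environment = "staging" | "production";

// Standard five-field cron expressions: minute hour day-of-month month day-of-week.
const CRON_SCHEDULES: Record<Environment, string> = {
  production: "0 2 * * *", // 02:00 UTC nightly
  staging: "0 3 * * *", // 03:00 UTC nightly
};

// Resolve the schedule for a deployment. Unknown environment names fail loudly
// instead of silently falling back to a default time.
function cronScheduleFor(env: string): string {
  if (env !== "staging" && env !== "production") {
    throw new Error(`Unknown environment: ${env}`);
  }
  return CRON_SCHEDULES[env];
}
```

Only the schedule lives here; keys and other secrets stay in the platform's secret store, in line with the security requirements.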

apis

Supabase Edge Functions invocation API

pg_cron (if using Postgres-native scheduling)

Supabase Realtime or webhook for failure alerting

data models

scheduler_run_locks (run_id, job_name, started_at, status)

scheduler_run_log (run_id, job_name, started_at, finished_at, status, error_message)
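As an illustration, the two tables and the lock rules from the acceptance criteria might map to types and a decision function like the ones below. Column types are assumptions (timestamps as ISO-8601 strings, as they would arrive in JSON), and `decideLockAction` is a hypothetical helper that covers only the decision; atomicity has to come from the database.

```typescript
interface SchedulerRunLock {
  run_id: string;
  job_name: string;
  started_at: string; // ISO-8601 timestamp (assumed representation)
  status: string;
}

interface SchedulerRunLogEntry {
  run_id: string;
  job_name: string;
  started_at: string;
  finished_at: string | null;
  status: string;
  error_message: string | null;
}

// Locks older than 2 hours are treated as stale.
const STALE_AFTER_MS = 2 * 60 * 60 * 1000;

type LockAction = "acquire" | "override_stale" | "exit_with_warning";

// Decide what an invocation should do given the lock row it found (if any).
function decideLockAction(existing: SchedulerRunLock | null, now: Date): LockAction {
  if (existing === null) return "acquire";
  const ageMs = now.getTime() - Date.parse(existing.started_at);
  return ageMs > STALE_AFTER_MS ? "override_stale" : "exit_with_warning";
}
```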

performance requirements

Scheduler overhead (lock acquisition + release) must not exceed 500ms

Lock check must use a SELECT FOR UPDATE SKIP LOCKED pattern to be concurrency-safe at the database level

security requirements

Manual trigger endpoint requires admin JWT or service-role key — reject unauthenticated requests with 401

scheduler_run_locks and scheduler_run_log tables must have RLS policies allowing only service-role writes

Cron configuration must not be stored in client-side code or committed .env files
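The authentication rule for the manual trigger could be sketched as a pure function, which keeps it easy to unit test. The names below are hypothetical, `verifyJwt` stands in for whatever JWT verification the project uses, and returning 403 for a valid but non-admin token is an assumption that goes beyond the stated 401 requirement.

```typescript
type JwtRole = "admin" | "non_admin" | "invalid";

interface TriggerAuthResult {
  status: 200 | 401 | 403;
  reason: string;
}

function authorizeManualTrigger(
  authorizationHeader: string | null,
  serviceRoleKey: string,
  verifyJwt: (token: string) => JwtRole,
): TriggerAuthResult {
  // Expect "Authorization: Bearer <token>"; anything else is unauthenticated.
  const match = authorizationHeader ? /^Bearer\s+(\S+)$/i.exec(authorizationHeader) : null;
  const token = match ? match[1] : undefined;
  if (!token) {
    return { status: 401, reason: "missing or malformed bearer token" };
  }
  // Plain string comparison for brevity; production code may prefer a
  // constant-time comparison for the service-role key.
  if (token === serviceRoleKey) {
    return { status: 200, reason: "service-role key" };
  }
  const role = verifyJwt(token);
  if (role === "admin") return { status: 200, reason: "admin JWT" };
  if (role === "non_admin") return { status: 403, reason: "authenticated but not admin" };
  return { status: 401, reason: "invalid token" };
}
```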

Execution Context

Execution Tier

Tier 1

Tier 1 - 540 tasks

Can start after Tier 0 completes

Implementation Notes

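Prefer Supabase Edge Functions with a cron trigger over pg_cron for easier deployment and TypeScript/Dart interop. Implement the run-lock either as a Postgres advisory lock (`pg_try_advisory_lock`) or as a dedicated table row guarded by `SELECT FOR UPDATE SKIP LOCKED`. Advisory locks are simpler, but a session-level advisory lock is released when the connection closes; a lock row persists across restarts, which is why the stale-lock rule is needed. Note that the row lock taken by `SELECT FOR UPDATE` is held only until the transaction ends, so the persisted row, not the row lock, is what marks a run as in progress for its full duration. Use a UUID v4 as the run_id so that each run, including retries, has a globally unique ID that can be traced through the logs. The stale-lock override should check `started_at < NOW() - INTERVAL '2 hours'`, DELETE the old lock, and INSERT the new one, all within a single transaction.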
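A sketch of the table-based variant follows. It swaps in a different technique from the one named above: instead of `SELECT FOR UPDATE SKIP LOCKED`, it relies on an assumed unique constraint on `scheduler_run_locks(job_name)` and `INSERT ... ON CONFLICT DO NOTHING`, so that at most one invocation's insert returns a row. The helper only builds the statements; running them inside one transaction with the project's Postgres client is left to the caller, and the function name and the `RUNNING` status value are hypothetical.

```typescript
interface SqlStatement {
  text: string;
  params: string[];
}

// Returns the statements to run, in order, inside ONE transaction.
// Assumes: UNIQUE (job_name) on scheduler_run_locks (not stated in the task).
function buildAcquireLockStatements(jobName: string, runId: string): SqlStatement[] {
  return [
    {
      // Step 1: clear a stale lock left behind by a crashed run.
      text:
        "DELETE FROM scheduler_run_locks " +
        "WHERE job_name = $1 AND started_at < NOW() - INTERVAL '2 hours'",
      params: [jobName],
    },
    {
      // Step 2: try to take the lock. If another run already holds it, the
      // unique constraint makes this a no-op and no row is returned.
      text:
        "INSERT INTO scheduler_run_locks (run_id, job_name, started_at, status) " +
        "VALUES ($1, $2, NOW(), 'RUNNING') " +
        "ON CONFLICT (job_name) DO NOTHING " +
        "RETURNING run_id",
      params: [runId, jobName],
    },
  ];
}
```

The caller would treat the lock as acquired only if the second statement returns a row, and otherwise log a warning and exit. The `runId` argument is expected to be a freshly generated UUID v4.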

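For the alerting mechanism, the least-infrastructure approach is a Supabase Database Webhook on the scheduler_run_log table that calls a notification endpoint when a row has `status = 'FAILED'`. A missed run writes no row, so this webhook alone cannot detect it; covering the missed-run criterion needs a separate check, for example one that verifies a log entry exists for the expected run window.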
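If the webhook itself cannot filter on column values, the receiving endpoint can do the filtering. The sketch below assumes the webhook payload has the shape `{ type, table, schema, record, old_record }`, which should be verified against the current Supabase documentation; `failureAlertFor` and the message format are hypothetical.

```typescript
interface RunLogRecord {
  run_id: string;
  job_name: string;
  status: string;
  error_message: string | null;
}

// Assumed payload shape for a database webhook event.
interface WebhookPayload {
  type: "INSERT" | "UPDATE" | "DELETE";
  table: string;
  schema: string;
  record: RunLogRecord | null;
  old_record: RunLogRecord | null;
}

// Returns the alert text to send, or null when the event is not a new failure.
function failureAlertFor(payload: WebhookPayload): string | null {
  if (payload.table !== "scheduler_run_log") return null;
  if (payload.type === "DELETE" || payload.record === null) return null;
  if (payload.record.status !== "FAILED") return null;
  // Skip rows that were already FAILED before this update to avoid duplicate alerts.
  if (payload.old_record !== null && payload.old_record.status === "FAILED") return null;
  const detail = payload.record.error_message ?? "no error message recorded";
  return `Scheduler run ${payload.record.run_id} (${payload.record.job_name}) FAILED: ${detail}`;
}
```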

Testing Requirements

Integration tests:

(1) Invoke the scheduler twice in rapid succession and assert that only one run proceeds (lock test).

(2) Simulate an exception in the expiry checker and assert that the run is marked FAILED in the log table and that no lock remains.

(3) Simulate a stale lock (insert a lock record with `started_at = NOW() - INTERVAL '3 hours'`) and assert that the new run overrides it.

Unit tests: cover the lock acquisition/release logic using a mocked Supabase client.

Manual smoke test in staging: trigger the cron at a modified time and confirm that a run log entry appears within 1 minute.

Component

Certification Expiry Checker Service

service high

Dependencies (1)

epic-peer-mentor-pause-management-automated-expiry-task-001: Design and implement the data access layer for the CertificationExpiryChecker service. Create query methods on the certification status repository to fetch all certifications expiring within 30 days, identify those expiring today, and retrieve mentor status records. Ensure queries are optimised for nightly batch execution and include indexes on expiry date fields.

Epic Risks (4)

High impact, medium probability, technical

The nightly expiry checker may run multiple times due to scheduler retries or infrastructure issues, causing duplicate auto-transitions and duplicate coordinator notifications that erode trust in the notification system.

Mitigation & Contingency

Mitigation: Implement idempotency via a unique constraint on (mentor_id, threshold_day, certification_expiry_date) in the cert_expiry_reminders table. Auto-transitions should be wrapped in a Postgres RPC that checks current status before applying, making repeated invocations safe.

Contingency: Add a compensation query in the reconciliation log that detects duplicate log entries for the same certification period and alerts the operations team for manual review within 24 hours.
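The idempotency rule in the mitigation can be illustrated in memory: a reminder is recorded only the first time its (mentor_id, threshold_day, certification_expiry_date) key is seen, which is the behaviour the unique constraint would enforce in the database. `ReminderLedger` is a hypothetical stand-in for the cert_expiry_reminders table.

```typescript
interface ReminderKey {
  mentor_id: string;
  threshold_day: number;
  certification_expiry_date: string; // e.g. "2025-03-31"
}

class ReminderLedger {
  private readonly seen: Record<string, true> = {};

  // Returns true if the reminder is new and should be sent,
  // false if an identical reminder was already recorded.
  recordIfNew(key: ReminderKey): boolean {
    // Serialising as an array keeps the composite key unambiguous.
    const composite = JSON.stringify([
      key.mentor_id,
      key.threshold_day,
      key.certification_expiry_date,
    ]);
    if (Object.prototype.hasOwnProperty.call(this.seen, composite)) return false;
    this.seen[composite] = true;
    return true;
  }
}
```

A repeated scheduler run then produces no second notification for the same mentor, threshold, and expiry date, while a renewed certification (new expiry date) starts a fresh sequence.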

High impact, medium probability, integration

The HLF Dynamics portal API may have eventual-consistency behaviour or rate limits that cause website listing updates to lag behind status changes, leaving expired mentors visible on the public website for an unacceptable window.

Mitigation & Contingency

Mitigation: Design the sync service to be triggered immediately on status transitions (event-driven via database webhook) in addition to the nightly batch run. Implement a reconciliation job that verifies sync state against app state and re-triggers any divergent records.

Contingency: If real-time sync cannot be guaranteed, implement a manual 'force sync' action in the coordinator dashboard so coordinators can trigger an immediate re-sync for urgent cases. Document the expected sync lag in coordinator onboarding materials.
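The reconciliation step in the mitigation amounts to comparing app state with what the portal reports and collecting the records that differ. A minimal sketch, with hypothetical names and illustrative status values:

```typescript
// Maps mentor ID to status. The status strings used here are illustrative only.
type StatusByMentor = Record<string, string>;

// Returns the IDs of mentors whose portal status is missing or differs from
// the app status, sorted for stable output. Mentors that exist only on the
// portal side are not reported by this sketch.
function findDivergentMentors(appStatus: StatusByMentor, portalStatus: StatusByMentor): string[] {
  return Object.keys(appStatus)
    .filter((mentorId) => portalStatus[mentorId] !== appStatus[mentorId])
    .sort();
}
```

The returned IDs are the records for which the sync would be re-triggered.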

Medium impact, medium probability, scope

Stakeholder requests to extend the expiry checker to handle additional certification types, grace periods, or organisation-specific threshold configurations may significantly increase scope beyond what is designed here, delaying delivery.

Mitigation & Contingency

Mitigation: Parameterise threshold day values (30, 14, 7) via configuration repository rather than hard-coding them, enabling per-organisation customisation without code changes. Document that grace period logic and additional cert types are out of scope for this epic and require a dedicated follow-up.

Contingency: Deliver the feature with hard-coded HLF-standard thresholds first and introduce the configuration repository as a follow-up task in the next sprint, using a feature flag to enable per-org threshold overrides.
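Parameterised thresholds can be sketched as a function that takes the threshold list as an argument and defaults to the values named above. Dates are assumed to be `YYYY-MM-DD` strings interpreted in UTC; the function names are hypothetical.

```typescript
const DEFAULT_THRESHOLD_DAYS: number[] = [30, 14, 7];

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days from `today` to `expiryDate`, both YYYY-MM-DD, evaluated in UTC.
function daysUntil(expiryDate: string, today: string): number {
  const ms = Date.parse(expiryDate + "T00:00:00Z") - Date.parse(today + "T00:00:00Z");
  return Math.round(ms / MS_PER_DAY);
}

// Returns the threshold that falls due today, or null if none does.
// An organisation-specific list can be passed in place of the default.
function thresholdDueToday(
  expiryDate: string,
  today: string,
  thresholds: number[] = DEFAULT_THRESHOLD_DAYS,
): number | null {
  const remaining = daysUntil(expiryDate, today);
  return thresholds.indexOf(remaining) >= 0 ? remaining : null;
}
```

For example, `thresholdDueToday("2025-03-31", "2025-03-01")` yields 30, and passing `[60]` as the third argument lets an organisation use a 60-day reminder without a code change.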

High impact, low probability, security

Dynamics portal API credentials stored as environment secrets in Supabase Edge Function configuration may be rotated or invalidated by HLF IT without notice, causing silent sync failures that go undetected for multiple days.

Mitigation & Contingency

Mitigation: Implement credential health-check calls on each scheduler run and emit an immediate alert on auth failure rather than only alerting after N consecutive failures. Document the credential rotation procedure with HLF IT and establish a rotation notification protocol.

Contingency: Maintain a break-glass manual sync script accessible to HLF administrators that can re-execute the Dynamics sync with newly provided credentials while the automated system is restored.
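The alerting rule in the mitigation (alert at once on an auth failure, but tolerate a few transient failures) could be expressed as below. Treating HTTP 401 and 403 as auth failures and using 3 as the consecutive-failure threshold are assumptions; the source leaves N unspecified.

```typescript
type CredentialHealth = "ok" | "auth_failure" | "unavailable";

// Classify the HTTP status returned by a lightweight health-check call.
function classifyHealthCheck(httpStatus: number): CredentialHealth {
  if (httpStatus === 401 || httpStatus === 403) return "auth_failure";
  if (httpStatus >= 200 && httpStatus < 300) return "ok";
  return "unavailable";
}

// Auth failures alert immediately; other failures alert only after
// `failureThreshold` consecutive occurrences (default is an assumption).
function shouldAlertImmediately(
  health: CredentialHealth,
  consecutiveFailures: number,
  failureThreshold: number = 3,
): boolean {
  if (health === "auth_failure") return true;
  return health === "unavailable" && consecutiveFailures >= failureThreshold;
}
```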

Implement nightly scheduler trigger for expiry checker
