epic-peer-mentor-pause-management-automated-expiry-task-008 - Implementation Task | Likepersonsapp

high priority, medium complexity, backend, pending, backend specialist, Tier 4

Acceptance Criteria

Retry configuration (max_attempts, base_delay_seconds, max_delay_seconds) is externalized as Edge Function environment variables with safe defaults (max_attempts=3, base_delay_seconds=2, max_delay_seconds=30)

Exponential backoff with jitter is applied between retry attempts: delay = min(base * 2^attempt + jitter, max_delay)

Only retryable errors (5xx, network timeout) trigger a retry; non-retryable 4xx errors move the operation directly to the dead-letter queue without retry

Retry attempt count is incremented on the sync_queue row after each failed attempt

After max_attempts is exhausted, the sync_queue row status is updated to 'dead_lettered' and a copy is inserted into a sync_dead_letter_queue table

On dead-lettering, a failure alert notification is dispatched to the HLF organisation coordinator(s) via the existing push notification channel (FCM)

Failure alert includes: mentor name (or ID), operation type (remove/restore), failure reason, and a deep link to the relevant admin screen

Dead-lettered operations do not block the queue processor from processing other pending operations (per-mentor isolation, not global lock)

A coordinator can manually re-queue a dead-lettered operation from the admin interface (sets status back to 'pending', resets attempt count)

Monitoring: a Supabase scheduled function or pg_cron job alerts if more than 5 dead-lettered operations accumulate within 1 hour
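The retry criteria above (externalized configuration, backoff with jitter, and error classification) can be sketched as three small pure functions. This is a minimal sketch under stated assumptions, not the project's implementation: the upper-case environment keys, the zero-based attempt index, the jitter range of [0, base), and the SyncFailure shape are all hypothetical. In an Edge Function the env argument could be supplied from Deno.env.toObject().

```typescript
// Retry settings named after the acceptance criteria.
interface RetryConfig {
  maxAttempts: number;
  baseDelaySeconds: number;
  maxDelaySeconds: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive number, falling back to the safe default when the
// variable is missing, empty, non-numeric, or not positive.
function readPositive(env: Env, key: string, fallback: number): number {
  const parsed = Number(env[key]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Env key names are assumptions; the defaults (3, 2, 30) come from the task.
function loadRetryConfig(env: Env): RetryConfig {
  return {
    maxAttempts: Math.max(1, Math.floor(readPositive(env, "MAX_ATTEMPTS", 3))),
    baseDelaySeconds: readPositive(env, "BASE_DELAY_SECONDS", 2),
    maxDelaySeconds: readPositive(env, "MAX_DELAY_SECONDS", 30),
  };
}

// delay = min(base * 2^attempt + jitter, max_delay), in seconds.
// Assumptions: attempt is zero-based; jitter is uniform in [0, base).
function backoffDelaySeconds(
  attempt: number,
  cfg: RetryConfig,
  random: () => number = Math.random,
): number {
  const jitter = random() * cfg.baseDelaySeconds;
  return Math.min(cfg.baseDelaySeconds * 2 ** attempt + jitter, cfg.maxDelaySeconds);
}

// Hypothetical failure shape: either an HTTP status or a network timeout.
type SyncFailure = { kind: "http"; status: number } | { kind: "timeout" };

// 5xx responses and timeouts are retryable; everything else (including 4xx) is not.
function isRetryable(failure: SyncFailure): boolean {
  if (failure.kind === "timeout") return true;
  return failure.status >= 500 && failure.status <= 599;
}
```

Passing the random source as a parameter keeps the delay deterministic in tests while defaulting to Math.random in production.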

Technical Requirements

frameworks

Supabase Edge Functions (Deno)

Firebase Cloud Messaging (FCM) for coordinator push alerts

apis

FCM API v1 (server-side dispatch via Edge Function)

Supabase PostgreSQL for the sync_dead_letter_queue table

HLFWebsiteSyncService (task-006)

data models

assignment

performance requirements

Dead-letter queue insert must complete within 200ms of final retry failure

Push notification dispatch must not block the queue processor — fire-and-forget with its own error handling

Queue processor must process at least 10 sync operations per minute under normal load

security requirements

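FCM credentials stored in Edge Function environment secrets only, never in the mobile app binary (the FCM v1 API authenticates with a service-account OAuth 2.0 access token rather than a legacy server key)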

Failure alert push notification payload must not include PII — use mentor UUID only; full name resolved client-side from local data

Dead-letter queue table subject to RLS: only service role can write; coordinator role can read for their own organisation

Execution Context

Execution Tier

Tier 4 - 323 tasks

Can start after Tier 3 completes


Implementation Notes

Implement the queue processor as a stateless Edge Function invoked by pg_cron every minute. Use SELECT ... FOR UPDATE SKIP LOCKED to dequeue safely; this prevents two concurrent invocations from processing the same row. The dead-letter queue is a separate table (not just a status flag) to enable independent querying, alerting, and manual re-queue without touching the main sync_queue.
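Once a batch of rows has been dequeued, the per-row bookkeeping can be kept in a pure function so that one failing mentor never blocks the rest of the batch. The following is a rough sketch under assumptions: the SyncRow fields, the Outcome shape, the 'done' status, and the choice to treat unexpected exceptions as retryable are all hypothetical. Only 'pending' and 'dead_lettered' are named in the task.

```typescript
type Status = "pending" | "done" | "dead_lettered";

// Hypothetical in-memory view of a sync_queue row.
interface SyncRow {
  id: string;
  mentorId: string;
  operation: "remove" | "restore";
  attempts: number;
  status: Status;
}

type Outcome =
  | { ok: true }
  | { ok: false; retryable: boolean; reason: string };

interface Transition {
  row: SyncRow;        // the row as it should be written back
  deadLetter: boolean; // true when a copy belongs in the dead-letter table
}

// Decide the next row state after one sync attempt.
function applyOutcome(row: SyncRow, outcome: Outcome, maxAttempts: number): Transition {
  if (outcome.ok) {
    return { row: { ...row, status: "done" }, deadLetter: false };
  }
  const attempts = row.attempts + 1; // incremented after every failed attempt
  const exhausted = !outcome.retryable || attempts >= maxAttempts;
  return {
    row: { ...row, attempts, status: exhausted ? "dead_lettered" : "pending" },
    deadLetter: exhausted,
  };
}

// Process rows one by one; a failure (even a thrown exception) in one row
// is converted to an outcome and never stops the loop.
async function processBatch(
  rows: SyncRow[],
  sync: (row: SyncRow) => Promise<Outcome>,
  maxAttempts: number,
): Promise<Transition[]> {
  const transitions: Transition[] = [];
  for (const row of rows) {
    let outcome: Outcome;
    try {
      outcome = await sync(row);
    } catch (err) {
      // Assumption: unexpected exceptions are treated as retryable.
      outcome = { ok: false, retryable: true, reason: String(err) };
    }
    transitions.push(applyOutcome(row, outcome, maxAttempts));
  }
  return transitions;
}
```

Keeping applyOutcome free of I/O means the status and attempt-count rules can be unit tested without a database.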

For the coordinator push alert, keep the FCM payload minimal: { type: 'hlf_sync_failure', mentor_id: uuid, operation: 'remove'|'restore' } — the Flutter app fetches human-readable details on open. Log all retry attempts and final outcome to the audit log table (covered by task-009) so the dead-letter alert is traceable.
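A minimal sketch of the alert payload and its fire-and-forget dispatch follows. It assumes the FCM HTTP v1 message envelope ({ message: { token, data } }, where data values are strings). The function names, the UUID check, and the device-token parameter are hypothetical, and the actual HTTP call is left to the caller.

```typescript
type SyncOperation = "remove" | "restore";

// Data payload exactly as described in the implementation notes.
interface FailureAlertData {
  type: "hlf_sync_failure";
  mentor_id: string;
  operation: SyncOperation;
}

// FCM HTTP v1 request body for a single device token.
interface FcmMessage {
  message: { token: string; data: FailureAlertData };
}

const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Build the minimal, PII-free payload. Rejecting anything that is not a UUID
// guards against a mentor name slipping into the notification.
function buildFailureAlert(
  deviceToken: string,
  mentorId: string,
  operation: SyncOperation,
): FcmMessage {
  if (!UUID_PATTERN.test(mentorId)) {
    throw new Error("mentor_id must be a UUID; names must not be sent");
  }
  return {
    message: {
      token: deviceToken,
      data: { type: "hlf_sync_failure", mentor_id: mentorId, operation },
    },
  };
}

// Start the send without awaiting it, so the queue processor is not blocked.
// Both synchronous throws and rejected promises are routed to onError.
function fireAndForget(
  send: () => Promise<void>,
  onError: (err: unknown) => void,
): void {
  void Promise.resolve().then(send).catch(onError);
}
```

A caller would wrap its own HTTP request in the send callback and log failures in onError, for example to the audit log.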

Testing Requirements

Unit tests: mock HLFWebsiteSyncService to throw retryable (5xx) and non-retryable (4xx) errors. Assert correct retry counts, correct backoff delay intervals (using mock timers), and correct dead-letter queue insertion on exhaustion. Assert that non-retryable errors skip retry entirely.

Integration tests: seed a sync_queue entry and simulate persistent 503 responses from the mocked Dynamics API. Assert that a sync_dead_letter_queue row is created after max_attempts and that FCM dispatch was called with the correct payload.

Per-mentor isolation: one dead-lettered operation must not prevent processing of a second mentor's pending operation.
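One way to make the retry-count and backoff assertions deterministic is to inject the sleep function instead of patching global timers. The sketch below is illustrative only: runWithRetry, failingService, and the thrown { status } error shape are hypothetical stand-ins for the real service. Jitter is omitted so that delays are exact, and it assumes retries happen within a single invocation.

```typescript
type Sleep = (seconds: number) => Promise<void>;

interface RetryResult {
  attempts: number;      // number of calls actually made
  deadLettered: boolean; // true when the operation gave up
  delays: number[];      // backoff delays requested between attempts
}

// Extract an HTTP-like status from an unknown thrown value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Retry loop with injected sleep; defaults mirror the task's safe defaults.
async function runWithRetry(
  call: () => Promise<void>,
  sleep: Sleep,
  maxAttempts = 3,
  baseDelay = 2,
  maxDelay = 30,
): Promise<RetryResult> {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await call();
      return { attempts: attempt + 1, deadLettered: false, delays };
    } catch (err) {
      const status = statusOf(err);
      // Errors without a status (e.g. timeouts) and 5xx are retryable.
      const retryable = status === undefined || status >= 500;
      if (!retryable) {
        return { attempts: attempt + 1, deadLettered: true, delays };
      }
      if (attempt < maxAttempts - 1) {
        const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
        delays.push(delay);
        await sleep(delay);
      }
    }
  }
  return { attempts: maxAttempts, deadLettered: true, delays };
}

// Fake service that always fails with the given status and counts its calls.
function failingService(status: number) {
  let calls = 0;
  return {
    call: async (): Promise<void> => {
      calls++;
      throw { status };
    },
    callCount: () => calls,
  };
}
```

With a no-op sleep, a persistent 503 should yield three calls and recorded delays of 2 and 4 seconds, while a 404 should stop after a single call with no delays.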

Component

HLF Website Sync Service

service high

Dependencies (1)

epic-peer-mentor-pause-management-automated-expiry-task-007: Wire HLFWebsiteSyncService to listen for mentor status change events (paused, expired_cert, reinstated). On each relevant transition, enqueue a sync operation to add or remove the mentor from the public chapter website listing. Implement an event-driven trigger using the existing status repository change hooks so sync happens automatically without manual invocation.

Epic Risks (4)

High impact, medium probability, technical

The nightly expiry checker may run multiple times due to scheduler retries or infrastructure issues, causing duplicate auto-transitions and duplicate coordinator notifications that erode trust in the notification system.

Mitigation & Contingency

Mitigation: Implement idempotency via a unique constraint on (mentor_id, threshold_day, certification_expiry_date) in the cert_expiry_reminders table. Auto-transitions should be wrapped in a Postgres RPC that checks current status before applying, making repeated invocations safe.

Contingency: Add a compensation query in the reconciliation log that detects duplicate log entries for the same certification period and alerts the operations team for manual review within 24 hours.

High impact, medium probability, integration

The HLF Dynamics portal API may have eventual-consistency behaviour or rate limits that cause website listing updates to lag behind status changes, leaving expired mentors visible on the public website for an unacceptable window.

Mitigation & Contingency

Mitigation: Design the sync service to be triggered immediately on status transitions (event-driven via database webhook) in addition to the nightly batch run. Implement a reconciliation job that verifies sync state against app state and re-triggers any divergent records.

Contingency: If real-time sync cannot be guaranteed, implement a manual 'force sync' action in the coordinator dashboard so coordinators can trigger an immediate re-sync for urgent cases. Document the expected sync lag in coordinator onboarding materials.

Medium impact, medium probability, scope

Stakeholder requests to extend the expiry checker to handle additional certification types, grace periods, or organisation-specific threshold configurations may significantly increase scope beyond what is designed here, delaying delivery.

Mitigation & Contingency

Mitigation: Parameterise threshold day values (30, 14, 7) via configuration repository rather than hard-coding them, enabling per-organisation customisation without code changes. Document that grace period logic and additional cert types are out of scope for this epic and require a dedicated follow-up.

Contingency: Deliver the feature with hard-coded HLF-standard thresholds first and introduce the configuration repository as a follow-up task in the next sprint, using a feature flag to enable per-org threshold overrides.

High impact, low probability, security

Dynamics portal API credentials stored as environment secrets in Supabase Edge Function configuration may be rotated or invalidated by HLF IT without notice, causing silent sync failures that go undetected for multiple days.

Mitigation & Contingency

Mitigation: Implement credential health-check calls on each scheduler run and emit an immediate alert on auth failure rather than only alerting after N consecutive failures. Document the credential rotation procedure with HLF IT and establish a rotation notification protocol.

Contingency: Maintain a break-glass manual sync script accessible to HLF administrators that can re-execute the Dynamics sync with newly provided credentials while the automated system is restored.


Implement retry logic and failure alerting for HLF sync
