Health Monitor scheduled polling and alerting

epic-external-system-integration-configuration-backend-infrastructure-task-015 — Set up the scheduled polling mechanism that invokes the health monitor on a configurable interval (default every 15 minutes) using a pg_cron job or Supabase scheduled function. When a status transitions from healthy to degraded or unreachable, trigger a notification via the coordinator notification service so admins are alerted proactively. Include suppression logic to avoid repeated notifications for the same ongoing failure.

high priority medium complexity infrastructure pending infrastructure specialist Tier 4

Acceptance Criteria

A pg_cron job (or Supabase scheduled Edge Function) is configured to invoke the health monitor Edge Function every 15 minutes; the interval is stored as a configurable value in the integration configuration table

The scheduler iterates over all organisations with at least one enabled integration and invokes runHealthChecks(orgId) for each

Status transition detection: before writing a new health status, the scheduler reads the previous status for each (org_id, integration_type); if the new status is 'degraded' or 'unreachable' AND the previous status was 'healthy', a notification event is emitted

The notification event triggers the coordinator notification service to send a push notification to all org admins of the affected organisation with message content including integration name and new status

Suppression logic: a notification_suppression_until timestamp is stored per (org_id, integration_type); a new degradation notification is not sent if the current time is before notification_suppression_until; default suppression window is 60 minutes

When a status recovers from degraded/unreachable to healthy, a recovery notification is sent and the suppression timestamp is cleared

A manual trigger API (POST /health-monitor/run?orgId=) allows admins to force an immediate health check outside the schedule

Scheduler run results are logged to a scheduler_run_log table (run_at, orgs_checked, checks_run, notifications_sent, errors)

If the scheduler invocation itself fails, the error is logged and the next scheduled run is not blocked

An integration test confirms that a simulated status transition from healthy to degraded results in exactly one notification per suppression window

Technical Requirements

frameworks

Supabase Edge Functions (Deno)

Supabase PostgreSQL 15

Firebase Cloud Messaging (FCM) API v1

apis

Supabase Edge Functions (Deno)

Firebase Cloud Messaging (FCM) API v1

Supabase REST API

data models

device_token

performance requirements

Full scheduler run across 10 organisations with 5 integrations each must complete within 90 seconds

Notification dispatch must not block health status writes — fire notifications asynchronously after persistence

Suppression lookup must complete in under 10ms (indexed on org_id, integration_type)

security requirements

FCM server key stored as a Supabase secret — never in mobile binary or exposed via REST API

Notification payload contains only integration name and status — no credentials, no PII, no error details

Manual trigger endpoint requires valid JWT with org_admin or global_admin role

Scheduler function runs with service role credentials, scoped to health monitoring operations only

Suppression timestamps stored server-side only — clients cannot manipulate suppression windows

Execution Context

Execution Tier

Tier 4

Tier 4 - 323 tasks

Can start after Tier 3 completes

Integration Task

Handles integration between different epics or system components. Requires coordination across multiple development streams.

View Full Execution Plan

Implementation Notes

Implement the transition detection as a Postgres function or Edge Function that uses a CTE to compare current vs new status before upserting: WITH prev AS (SELECT status FROM integration_health_status WHERE org_id=$1 AND integration_type=$2) INSERT ... ON CONFLICT ... DO UPDATE ... RETURNING (new.status != prev.status AND new.status != 'healthy') AS should_alert.

This avoids a separate SELECT round-trip. For the suppression timestamp, add a notification_suppression_until column to integration_health_status — it colocates suppression state with health state and is updated atomically in the same upsert. Dispatch FCM notifications via a separate Edge Function (the existing coordinator notification service) rather than inline in the scheduler, so the scheduler is not blocked by notification failures. The pg_cron schedule string '*/15 * * * *' covers the 15-minute default; store it as a text value in integration_config so it can be changed per-deployment without a code deploy.

Log scheduler_run_log rows regardless of success to maintain a complete audit trail for ops teams.

Testing Requirements

Unit tests: (1) transition detection logic — healthy→degraded fires notification; healthy→healthy does not; degraded→degraded does not (suppressed); degraded→healthy fires recovery notification and clears suppression; (2) suppression window enforced — second degradation within 60 minutes does not produce second notification; (3) suppression cleared on recovery. Integration tests: (1) pg_cron job exists and is scheduled at correct interval; (2) manual trigger endpoint rejects unauthenticated callers; (3) scheduler_run_log row inserted after each run; (4) FCM dispatch is called with correct device tokens for affected org admins (mock FCM in test environment). End-to-end test: simulate Xledger health check returning 'unreachable', verify FCM notification dispatched and suppression_until set.

Component

Integration Health Monitor

service medium

Dependencies (2)

Implement the integration_health_status table and repository for storing and querying health check results. The table holds the latest status per (org_id, integration_type) pair plus a rolling 24-hour history for trend display. Implement queries that return current health status for all integrations of an org (used by the admin dashboard) and aggregate health across all orgs (used by global admins). Include RLS policies ensuring org isolation. epic-external-system-integration-configuration-backend-infrastructure-task-014 Implement concrete health check classes for each supported integration target: Xledger (authenticate + list accounts endpoint), Dynamics (OAuth token refresh + GET organisations), Cornerstone (ping auth endpoint), Consio (list endpoint), and Bufdir (schema validation endpoint). Each check uses the Vault credential access layer, performs a lightweight read-only probe, and measures round-trip latency. Checks must complete within a 10-second timeout. epic-external-system-integration-configuration-backend-infrastructure-task-013

Epic Risks (3)

medium impact medium prob technical

Supabase Edge Functions have cold start latency that can cause the first sync invocation after idle periods to fail or timeout when the external API has a short connection window, leading to missed scheduled syncs that go undetected.

Mitigation & Contingency

Mitigation: Configure Edge Function memory and implement a warm-up ping mechanism before heavy sync invocations. Set generous timeout values on the external API calls. Log all cold-start incidents for monitoring.

Contingency: If cold starts cause consistent sync failures, migrate the sync scheduler to a persistent Supabase cron job that pre-warms the function 30 seconds before the scheduled sync time.

high impact low prob technical

The sync scheduler must execute jobs at predictable times for financial reporting accuracy. Drift in cron execution timing (due to Supabase infrastructure delays) could cause syncs to run at wrong times, leading to missing data in accounting exports or duplicate exports across reporting periods.

Mitigation & Contingency

Mitigation: Implement idempotency keys based on integration ID + scheduled period, so re-runs of a delayed sync cannot create duplicate exports. Log actual execution timestamps vs scheduled timestamps and alert on drift exceeding 5 minutes.

Contingency: If scheduler reliability is insufficient, integrate with a dedicated cron service (e.g., pg_cron on Supabase) for millisecond-precise scheduling, replacing the application-level scheduler.

high impact medium prob integration

Aggressive health monitoring ping frequency could trigger rate limiting on external APIs (especially Xledger and Dynamics), causing legitimate export calls to fail after the monitor exhausts the API's request quota.

Mitigation & Contingency

Mitigation: Use lightweight health check endpoints (HEAD requests or vendor-specific ping/status endpoints) rather than data requests. Set health check frequency to once per 15 minutes minimum. Implement exponential backoff after consecutive failures.

Contingency: If rate limiting occurs, disable active health monitoring for the affected integration type and switch to passive health detection (mark unhealthy only when a scheduled sync fails).

Quick Links

All Tasks Execution Plan