Health Monitor connectivity test framework

epic-external-system-integration-configuration-backend-infrastructure-task-012 — Build the core health check framework for the Integration Health Monitor. Define a HealthCheck interface with methods check(orgId, integrationConfig): HealthResult that each adapter-specific check must implement. Implement the test runner that invokes checks in parallel across all configured integrations for an org, collects results, and writes them to the integration_health_status table with timestamp, status (healthy/degraded/unreachable), and latency_ms.

high priority medium complexity backend pending backend specialist Tier 2

Acceptance Criteria

A TypeScript/Deno HealthCheck interface is defined with signature: check(orgId: string, config: IntegrationConfig): Promise<HealthResult>

HealthResult type contains: { status: 'healthy' | 'degraded' | 'unreachable', latency_ms: number, checked_at: string, error?: string }

The test runner function runHealthChecks(orgId) loads all enabled integrations for the org from integration_config, instantiates the correct adapter, and runs all checks in parallel using Promise.allSettled

A Promise.allSettled result where the promise rejects is mapped to status='unreachable' with the rejection message in error field

integration_health_status table exists with columns: id (uuid PK), org_id (uuid NOT NULL), integration_type (text NOT NULL), status (text CHECK IN ('healthy','degraded','unreachable')), latency_ms (integer), checked_at (timestamptz NOT NULL), error_message (text)

After running checks, the test runner upserts one row per integration into integration_health_status using ON CONFLICT (org_id, integration_type) DO UPDATE

A separate health_check_history table stores the last 96 rows per (org_id, integration_type) to support the 24-hour trend view (96 × 15-minute intervals)

The framework enforces a maximum 10-second timeout per individual check; checks exceeding the timeout resolve as status='unreachable' with an appropriate error message

RLS on both tables restricts SELECT to the org's own rows; INSERT/UPDATE via service role only

Unit tests for the test runner cover: parallel execution, timeout enforcement, rejected promise mapping, and upsert correctness

Technical Requirements

frameworks

Supabase Edge Functions (Deno)

Supabase PostgreSQL 15

apis

Supabase Edge Functions (Deno)

Supabase REST API

data models

bufdir_export_audit_log

performance requirements

All checks for a single org must complete within 12 seconds total (10s check timeout + 2s overhead)

Upsert of all health status rows must complete within 500ms

History table pruning query (keep latest 96 per key) must run in under 200ms

security requirements

HealthCheck implementations must never log raw credentials — only integration_type and org_id

Vault credential retrieval happens inside each adapter — the framework passes only non-sensitive IntegrationConfig metadata

RLS on integration_health_status prevents cross-org reads

Service role key used exclusively in Edge Function environment

Execution Context

Execution Tier

Tier 2

Tier 2 - 518 tasks

Can start after Tier 1 completes

Integration Task

Handles integration between different epics or system components. Requires coordination across multiple development streams.

View Full Execution Plan

Implementation Notes

Use Promise.allSettled (not Promise.all) so a single failing check does not abort results collection for other integrations. Implement the 10-second timeout using AbortController and a race between the check promise and a timeout promise: Promise.race([checkPromise, timeoutPromise]). Define IntegrationConfig as a discriminated union type keyed on integration_type to give adapters type-safe access to their specific config fields without casting. The history table pruning can be a Postgres trigger or a post-upsert function call — trigger approach is safer as it runs atomically.

Keep the HealthCheck interface in a shared Deno module importable by all adapter implementations. Design the upsert to update latency_ms even on 'healthy' status so trend charts can show latency over time.

Testing Requirements

Unit tests (Deno test runner) covering: (1) runHealthChecks calls all adapters in parallel (mock adapters with artificial delays, verify all called); (2) a check that takes longer than 10s is cancelled and mapped to 'unreachable'; (3) a check that throws is mapped to 'unreachable' with error; (4) upsert logic correctly updates an existing row; (5) history row is inserted alongside the upsert. Integration test against local Supabase: verify RLS blocks cross-org SELECT. Test history pruning by inserting 100 rows for the same key and confirming only 96 remain after cleanup. Minimum 85% branch coverage on the framework core.

Component

Integration Health Monitor

service medium

Dependencies (1)

Implement the secure credential retrieval layer within the Edge Functions environment. Create a CredentialProvider module that reads integration API keys from Supabase Vault using the service role client, caches credentials per-invocation, and ensures credentials are never logged or returned in response payloads. Include error handling for missing or expired secrets. epic-external-system-integration-configuration-backend-infrastructure-task-002

Epic Risks (3)

medium impact medium prob technical

Supabase Edge Functions have cold start latency that can cause the first sync invocation after idle periods to fail or timeout when the external API has a short connection window, leading to missed scheduled syncs that go undetected.

Mitigation & Contingency

Mitigation: Configure Edge Function memory and implement a warm-up ping mechanism before heavy sync invocations. Set generous timeout values on the external API calls. Log all cold-start incidents for monitoring.

Contingency: If cold starts cause consistent sync failures, migrate the sync scheduler to a persistent Supabase cron job that pre-warms the function 30 seconds before the scheduled sync time.

high impact low prob technical

The sync scheduler must execute jobs at predictable times for financial reporting accuracy. Drift in cron execution timing (due to Supabase infrastructure delays) could cause syncs to run at wrong times, leading to missing data in accounting exports or duplicate exports across reporting periods.

Mitigation & Contingency

Mitigation: Implement idempotency keys based on integration ID + scheduled period, so re-runs of a delayed sync cannot create duplicate exports. Log actual execution timestamps vs scheduled timestamps and alert on drift exceeding 5 minutes.

Contingency: If scheduler reliability is insufficient, integrate with a dedicated cron service (e.g., pg_cron on Supabase) for millisecond-precise scheduling, replacing the application-level scheduler.

high impact medium prob integration

Aggressive health monitoring ping frequency could trigger rate limiting on external APIs (especially Xledger and Dynamics), causing legitimate export calls to fail after the monitor exhausts the API's request quota.

Mitigation & Contingency

Mitigation: Use lightweight health check endpoints (HEAD requests or vendor-specific ping/status endpoints) rather than data requests. Set health check frequency to once per 15 minutes minimum. Implement exponential backoff after consecutive failures.

Contingency: If rate limiting occurs, disable active health monitoring for the affected integration type and switch to passive health detection (mark unhealthy only when a scheduled sync fails).

Quick Links

All Tasks Execution Plan