Edge Function error handling and retry logic

epic-external-system-integration-configuration-backend-infrastructure-task-005 — Implement structured error handling within the Edge Function including classification of transient vs permanent failures, exponential backoff retry logic for network errors (max 3 attempts), timeout enforcement per adapter call, and structured error response payloads that downstream consumers can parse to determine retry eligibility. Log all failures with correlation IDs to the integration_run_logs table.

high priority medium complexity backend pending backend specialist Tier 3

Acceptance Criteria

Transient errors (HTTP 429, 502, 503, 504, network timeout, DNS failure) trigger retry with exponential backoff: 1s → 2s → 4s with ±20% jitter

Permanent errors (HTTP 400, 401, 403, 404, 422, transformation validation error) do not retry and return immediately with isRetryable: false

Maximum 3 retry attempts per adapter invocation; after exhaustion the function returns a structured failure payload

Each adapter call enforces a configurable timeout (default 10 seconds); timeout triggers classify as transient and consume a retry slot

All error responses conform to schema: { correlationId: string, errorClass: 'transient' | 'permanent' | 'timeout', httpStatus: number | null, adapterName: string, attempt: number, message: string, isRetryable: boolean }

Correlation ID is a UUID generated at function invocation start and threaded through all retry attempts and log entries

Every failure (each attempt) is written to integration_run_logs table with: correlation_id, adapter_name, attempt_number, error_class, http_status, error_message, timestamp

Successful response after retry is logged with total_attempts count

Edge Function returns HTTP 200 with structured payload even on permanent failure — callers must inspect isRetryable, not HTTP status, for retry decisions

Retry logic does not retry if the downstream system returned a business error payload (e.g., Xledger validation error in response body) — only retries on infrastructure-level failures

Technical Requirements

frameworks

Supabase Edge Functions (Deno runtime)

Deno AbortController for timeout enforcement

apis

Supabase PostgreSQL — insert to integration_run_logs table

Xledger REST API — adapter target

Microsoft Dynamics 365 REST API — adapter target

data models

bufdir_export_audit_log

performance requirements

Total retry budget (3 attempts + backoff) must not exceed Edge Function 60-second wall-clock limit

Log writes to integration_run_logs must be fire-and-forget (non-blocking) to not add to retry latency

Jitter calculation must not use Math.random() in a tight loop — compute once per retry slot

security requirements

Correlation IDs must be UUIDs (v4), never sequential integers that could expose invocation volume

Error messages logged to integration_run_logs must not contain raw API credentials or authorization headers

HTTP response bodies from failed external calls must be truncated to 500 characters before logging to prevent PII leak from external error responses

Log entries must include organization_id for multi-tenant audit isolation

Execution Context

Execution Tier

Tier 3

Tier 3 - 413 tasks

Can start after Tier 2 completes

Integration Task

Handles integration between different epics or system components. Requires coordination across multiple development streams.

View Full Execution Plan

Implementation Notes

Implement as a withRetry(fn, options) higher-order function in a retry.ts module so retry logic is adapter-agnostic. Use AbortController + Promise.race for timeout enforcement rather than Deno-specific APIs to keep the module portable. Store backoff timing constants (BASE_DELAY_MS, MAX_ATTEMPTS, JITTER_FACTOR) in a config object injectable at test time — avoid hardcoding. For log writes, use Supabase service role client but wrap in try/catch so a log write failure does not mask the original error.

Classify HTTP errors by status code range using a lookup map, not a long if/else chain. For Dynamics, be aware that it can return HTTP 200 with an error body — parse response JSON to detect this case and classify as permanent.

Testing Requirements

Unit tests: (1) transient classification — mock fetch returning 503 triggers 3 retries with correct backoff intervals (use fake timers); (2) permanent classification — 400 response returns immediately, 0 retries; (3) timeout — AbortController fires at 10s, classified as transient; (4) retry exhaustion — after 3 transient failures returns isRetryable: false; (5) correlation ID — same UUID appears in all log entries for a single invocation. Integration tests: deploy to Supabase local dev, trigger function with a mock adapter that alternately fails/succeeds, assert integration_run_logs rows match expected attempt sequence. Test that business-level error from Xledger (HTTP 200 with error body) is NOT retried.

Component

Integration Edge Functions

infrastructure high

Dependencies (1)

Implement the core dispatch routing logic inside the Edge Function that receives export job requests, identifies the target integration type (Xledger, Dynamics, Cornerstone, Consio, Bufdir), loads the corresponding adapter from the REST API Adapter Registry, and executes the export. The router must validate the request payload, resolve field mappings via the Field Mapping Resolver, and return a structured result object. epic-external-system-integration-configuration-backend-infrastructure-task-003

Epic Risks (3)

medium impact medium prob technical

Supabase Edge Functions have cold start latency that can cause the first sync invocation after idle periods to fail or timeout when the external API has a short connection window, leading to missed scheduled syncs that go undetected.

Mitigation & Contingency

Mitigation: Configure Edge Function memory and implement a warm-up ping mechanism before heavy sync invocations. Set generous timeout values on the external API calls. Log all cold-start incidents for monitoring.

Contingency: If cold starts cause consistent sync failures, migrate the sync scheduler to a persistent Supabase cron job that pre-warms the function 30 seconds before the scheduled sync time.

high impact low prob technical

The sync scheduler must execute jobs at predictable times for financial reporting accuracy. Drift in cron execution timing (due to Supabase infrastructure delays) could cause syncs to run at wrong times, leading to missing data in accounting exports or duplicate exports across reporting periods.

Mitigation & Contingency

Mitigation: Implement idempotency keys based on integration ID + scheduled period, so re-runs of a delayed sync cannot create duplicate exports. Log actual execution timestamps vs scheduled timestamps and alert on drift exceeding 5 minutes.

Contingency: If scheduler reliability is insufficient, integrate with a dedicated cron service (e.g., pg_cron on Supabase) for millisecond-precise scheduling, replacing the application-level scheduler.

high impact medium prob integration

Aggressive health monitoring ping frequency could trigger rate limiting on external APIs (especially Xledger and Dynamics), causing legitimate export calls to fail after the monitor exhausts the API's request quota.

Mitigation & Contingency

Mitigation: Use lightweight health check endpoints (HEAD requests or vendor-specific ping/status endpoints) rather than data requests. Set health check frequency to once per 15 minutes minimum. Implement exponential backoff after consecutive failures.

Contingency: If rate limiting occurs, disable active health monitoring for the affected integration type and switch to passive health detection (mark unhealthy only when a scheduled sync fails).

Quick Links

All Tasks Execution Plan