Priority: medium · Complexity: high · Category: infrastructure · Status: pending · Assignee: infrastructure specialist · Tier 5

Acceptance Criteria

Supabase Edge Function `batch-duplicate-detection` is deployed and callable via HTTP POST with a JSON body: `{ "org_id": "uuid" | "all", "since_hours": number, "dry_run": boolean }`
Supabase cron job triggers the function on a configurable schedule (default: daily at 02:00 UTC) via pg_cron or Supabase scheduled functions
When org_id = 'all', processes organizations one at a time in ascending created_at order to avoid timeout
Per-organization processing: fetches all activities in the configured time window, calls `get_cross_hierarchy_activity_candidates()` for each, and inserts new suspected_duplicates rows with detection_method = 'batch'
Skips pairs already present in suspected_duplicates (ON CONFLICT DO NOTHING) regardless of their current review_status
Rate limiting: processes at most 500 activities per org per invocation; if more exist, logs a warning and schedules the remainder for the next run via a cursor stored in a `batch_detection_cursors` table
dry_run = true returns a JSON response with the count of pairs that would be inserted without actually writing to suspected_duplicates
Function returns a structured JSON response: `{ "orgs_processed": number, "pairs_found": number, "pairs_inserted": number, "errors": [] }`
Function respects the `is_active` flag in duplicate_detection_config — skips orgs where is_active = false
Authorization: function requires a valid Supabase service role JWT in the Authorization header — not callable by end users
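The request-body contract above can be sketched as a validation routine. This is an illustrative sketch, not the deployed code; the interface and function names (`BatchRequest`, `validateBatchRequest`) are assumptions:

```typescript
// Shape of the expected POST body, per the acceptance criteria.
interface BatchRequest {
  org_id: string;      // a UUID or the literal string "all"
  since_hours: number; // look-back window in hours
  dry_run: boolean;    // count pairs without writing to suspected_duplicates
}

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns null when the body is valid, or an error message suitable
// for an HTTP 400 response body.
function validateBatchRequest(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return "body must be a JSON object";
  const b = body as Record<string, unknown>;
  if (typeof b.org_id !== "string" ||
      (b.org_id !== "all" && !UUID_RE.test(b.org_id))) {
    return "org_id must be a valid UUID or 'all'";
  }
  if (typeof b.since_hours !== "number" || b.since_hours <= 0) {
    return "since_hours must be a positive number";
  }
  if (typeof b.dry_run !== "boolean") return "dry_run must be a boolean";
  return null;
}
```

Rejecting anything other than a UUID or `'all'` up front also satisfies the HTTP 400 requirement in the security section.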

Technical Requirements

frameworks
Supabase Edge Functions (Deno runtime)
Supabase pg_cron
apis
Supabase service role client (supabase-js)
get_cross_hierarchy_activity_candidates() via RPC
duplicate_detection_config table
suspected_duplicates table
data models
activities
suspected_duplicates
duplicate_detection_config
batch_detection_cursors
performance requirements
Single org processing for 500 activities must complete within the Supabase Edge Function 60-second timeout
Database queries inside the function must use server-side pagination (LIMIT/OFFSET or cursor) to avoid loading all activities into Deno memory
Total memory footprint must stay under 128MB (Supabase Edge Function limit)
security requirements
Function must validate the Authorization header contains the Supabase service role JWT before any database operations
org_id parameter must be validated as a valid UUID or the literal string 'all' — reject all other values with HTTP 400
All database operations use the service role client, not the anon client
Function must not expose raw PostgreSQL error messages in HTTP responses — sanitize error objects before returning
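The error-sanitization requirement can be sketched as below. The function name and the exact fallback messages are illustrative assumptions; the key point is that PostgreSQL error objects (which carry `code`, `details`, and `hint` fields via supabase-js) are reduced to a generic message plus the SQLSTATE code before reaching the HTTP response:

```typescript
// Sketch: strip PostgreSQL details from errors before returning them
// in an HTTP response, so schema or constraint names never leak.
function sanitizeError(err: unknown): string {
  // Postgres errors surfaced through supabase-js carry a SQLSTATE "code"
  // plus details/hint fields; keep only the code.
  if (typeof err === "object" && err !== null && "code" in err) {
    return `database error (code ${(err as { code: string }).code})`;
  }
  // Non-database errors: cap the message length; a real implementation
  // might log the full error server-side and return something generic.
  if (err instanceof Error) return err.message.slice(0, 200);
  return "unknown error";
}
```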

Execution Context

Execution Tier
Tier 5

Tier 5 - 253 tasks

Can start after Tier 4 completes

Implementation Notes

Structure the Edge Function as: (1) JWT validation middleware; (2) parameter parsing and validation; (3) org list resolution (single org, or all active orgs from duplicate_detection_config); (4) per-org processing loop with try/catch that populates the errors array; (5) structured JSON response. For cursor-based pagination, store `{ org_id, last_processed_activity_id, updated_at }` in `batch_detection_cursors`. Initialize the Supabase service role client with `createClient(SUPABASE_URL, SERVICE_ROLE_KEY)` — store both values as Edge Function secrets, never hardcoded. For the pg_cron job, add a migration: `SELECT cron.schedule('batch-duplicate-detection', '0 2 * * *', $$SELECT net.http_post(url := 'https://.supabase.co/functions/v1/batch-duplicate-detection', body := '{"org_id":"all","since_hours":26}') $$)`. Note that this call must also pass the service role JWT in an Authorization header (pg_net's `net.http_post` accepts a `headers` argument), since the function rejects unauthenticated requests.
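The per-org loop with the 500-activity cap and cursor resumption can be sketched as follows. `fetchPage` stands in for a supabase-js query ordered by activity id; `PAGE_SIZE`, `ORG_CAP`, and all type names here are illustrative assumptions, not taken from the codebase:

```typescript
type Activity = { id: string };
type Cursor = { org_id: string; last_processed_activity_id: string | null };

const PAGE_SIZE = 100; // rows per DB round-trip (server-side pagination)
const ORG_CAP = 500;   // hard cap per org per invocation (acceptance criteria)

// Processes up to ORG_CAP activities for one org, resuming from the stored
// cursor. `handle` stands in for the per-activity work (RPC call + insert).
async function processOrg(
  orgId: string,
  cursor: Cursor,
  fetchPage: (afterId: string | null, limit: number) => Promise<Activity[]>,
  handle: (a: Activity) => Promise<void>,
): Promise<{ processed: number; capped: boolean; cursor: Cursor }> {
  let processed = 0;
  let afterId = cursor.last_processed_activity_id;
  while (processed < ORG_CAP) {
    const page = await fetchPage(afterId, Math.min(PAGE_SIZE, ORG_CAP - processed));
    if (page.length === 0) {
      // Window exhausted: clear the cursor so the next run starts fresh.
      return { processed, capped: false,
               cursor: { org_id: orgId, last_processed_activity_id: null } };
    }
    for (const a of page) {
      await handle(a);
      afterId = a.id;
      processed++;
    }
  }
  // Cap reached: persist the cursor so the next invocation resumes here
  // (and log the warning required by the acceptance criteria).
  return { processed, capped: true,
           cursor: { org_id: orgId, last_processed_activity_id: afterId } };
}
```

Keyset pagination (resuming after the last id) rather than OFFSET keeps each round-trip bounded, which also serves the 128MB memory requirement.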

Use `since_hours: 26` (slightly more than 24h) in the cron job to provide overlap and catch any activities that fell in the gap between trigger firings. Document the overlap strategy in a code comment.
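The overlap arithmetic is trivial but worth pinning down; a minimal sketch (the function name is illustrative):

```typescript
// Derive the look-back cutoff from since_hours. With the daily 02:00 UTC
// cron and since_hours = 26, each run re-scans a 2-hour overlap with the
// previous run; ON CONFLICT DO NOTHING makes re-scanning that overlap safe.
function windowStart(sinceHours: number, now: Date = new Date()): Date {
  return new Date(now.getTime() - sinceHours * 3_600_000);
}
```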

Testing Requirements

Unit tests (Deno test runner): (1) mock Supabase client and verify correct RPC calls are made for a single org; (2) dry_run=true returns count without insert calls; (3) invalid org_id UUID returns HTTP 400; (4) missing/invalid Authorization header returns HTTP 401; (5) RPC error for one org is caught and added to errors array without stopping other orgs.

Integration tests against local Supabase CLI: (1) seed 20 cross-hierarchy near-duplicate activities and run the function → verify 10 pairs inserted (unique pairs); (2) run again immediately → 0 new pairs inserted (idempotency); (3) change similarity_threshold in config and run again → verify previously-missed pairs are now flagged with detection_method='batch'; (4) simulate org with 600 activities → verify cursor is written and only 500 are processed.

End-to-end: trigger pg_cron manually and verify the full pipeline completes and sends a Supabase Realtime event to the admin Flutter client.
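The error-isolation behavior exercised by unit test (5) corresponds to implementation step (4); its logic can be sketched as below. `runAll` and its arguments are illustrative stand-ins for the real function, not names from the codebase:

```typescript
// One org failing (e.g. an RPC error) is recorded in the errors array
// and must not abort processing of the remaining orgs.
async function runAll(
  orgIds: string[],
  processOrg: (orgId: string) => Promise<number>, // returns pairs found
): Promise<{ orgs_processed: number; pairs_found: number; errors: string[] }> {
  const result = { orgs_processed: 0, pairs_found: 0, errors: [] as string[] };
  for (const orgId of orgIds) {
    try {
      result.pairs_found += await processOrg(orgId);
      result.orgs_processed++;
    } catch (e) {
      result.errors.push(`${orgId}: ${e instanceof Error ? e.message : "error"}`);
    }
  }
  return result;
}
```

In a Deno test, `processOrg` would be a stub that throws for one org id and returns a fixed pair count for the others, asserting on the returned structure.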

Component
Duplicate Activity Detector
Category: infrastructure · Priority: high
Epic Risks (3)
Impact: medium · Probability: high · Type: technical

Fingerprint-based similarity matching may produce high false-positive rates for common activity types (e.g., weekly group sessions with the same participants), causing alert fatigue among coordinators and undermining trust in the detection system.

Mitigation & Contingency

Mitigation: Start with conservative, high-confidence thresholds (exact peer mentor match + same date + same activity type) before adding looser fuzzy matching. Allow NHF administrators to tune thresholds based on observed false-positive rates. Log all detection decisions for retrospective threshold calibration.

Contingency: Introduce a snooze mechanism allowing coordinators to dismiss false positives for a configurable period. Track dismissal rates per activity type and automatically raise the similarity threshold for activity types with high dismissal rates.

Impact: medium · Probability: medium · Type: technical

A database trigger on the activities insert path adds synchronous overhead to every activity registration. For HLF peer mentors with 380 annual registrations or coordinators doing bulk proxy registration, this could create perceptible latency or lock contention.

Mitigation & Contingency

Mitigation: Implement the trigger as a DEFERRED constraint trigger (fires after the transaction commits) or replace it with a LISTEN/NOTIFY pattern that queues detection work asynchronously via an Edge Function, completely decoupling detection from the registration write path.

Contingency: Disable the synchronous trigger entirely and rely solely on the scheduled Edge Function for batch detection. Accept a detection delay of up to the scheduling interval (e.g., 15 minutes) in exchange for zero impact on registration latency.

Impact: medium · Probability: medium · Type: dependency

The duplicate detection logic must be validated and approved by NHF before go-live, including agreement on threshold values and the review workflow. NHF stakeholder availability for sign-off may delay this epic's release independently of technical readiness.

Mitigation & Contingency

Mitigation: Gate the feature behind the NHF-specific feature flag so technical deployment can proceed independently of business approval. Involve an NHF administrator in threshold calibration sessions during QA, reducing the formal sign-off surface to policy and workflow rather than technical details.

Contingency: Release the detection system in 'silent mode' — flagging duplicates internally without surfacing notifications to coordinators — until NHF approves the workflow. Use the silent period to collect real data on false-positive rates and refine thresholds before activating notifications.