Document cron infrastructure and operations runbook
epic-assignment-follow-up-reminders-cron-infrastructure-task-007 — Write an operations runbook documenting the cron schedule registration, execution log schema, advisory lock mechanism, alerting thresholds, and how to manually trigger or disable the cron job for maintenance. Include troubleshooting steps for common failure scenarios (missed runs, lock contention, ReminderSchedulerService timeouts). Document the activity registration hook and how it integrates with the contact tracking repository to suppress reminders after successful contact.
Acceptance Criteria
Technical Requirements
Execution Context
Tier 22 - 1 task
Can start after Tier 21 completes
Implementation Notes
Structure the runbook with these top-level sections: Overview, Architecture Diagram, Cron Schedule Registration, Execution Log Schema, Advisory Lock, Alerting, Manual Operations (trigger/disable/re-enable), Troubleshooting, and Activity Registration Hook. Include a simple ASCII or Mermaid diagram showing the flow: Cron fires → acquire lock → invoke ReminderSchedulerService → log result → release lock → (on failure) increment counter → (at threshold) dispatch alert. For the troubleshooting section, focus on the most likely real-world scenarios based on what was observed during implementation and testing. Keep SQL examples minimal and targeted — one query to check recent logs, one to manually release a stuck lock, one to verify last_contact_date state.
Link to the integration test file as a reference for expected behavior.
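The flow described above (acquire lock, invoke the scheduler, log the result, release the lock, count failures, alert at a threshold) can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the CronDeps interface, the RunLogEntry fields, and the alertThreshold parameter are all hypothetical names, and resetting the counter on success assumes the threshold counts consecutive failures.

```typescript
// Hypothetical dependency interface: none of these method names come from the
// codebase. In a Postgres-backed setup, tryAcquireLock/releaseLock would
// typically wrap pg_try_advisory_lock / pg_advisory_unlock.
interface CronDeps {
  tryAcquireLock(): Promise<boolean>;
  releaseLock(): Promise<void>;
  runScheduler(): Promise<number>; // stands in for ReminderSchedulerService
  logRun(entry: RunLogEntry): Promise<void>;
  incrementFailureCount(): Promise<number>; // resolves to the new count
  resetFailureCount(): Promise<void>;
  dispatchAlert(message: string): Promise<void>;
}

// Illustrative log record; the real execution log schema may differ.
interface RunLogEntry {
  status: "success" | "failure" | "skipped";
  startedAt: Date;
  finishedAt: Date;
  remindersScheduled: number;
  error?: string;
}

// Invoke the scheduler and turn either outcome into a log entry.
async function attempt(deps: CronDeps, startedAt: Date): Promise<RunLogEntry> {
  try {
    const remindersScheduled = await deps.runScheduler();
    return { status: "success", startedAt, finishedAt: new Date(), remindersScheduled };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { status: "failure", startedAt, finishedAt: new Date(), remindersScheduled: 0, error };
  }
}

// Everything that happens while the lock is held: run, then log.
async function runLocked(deps: CronDeps, startedAt: Date): Promise<RunLogEntry> {
  try {
    const entry = await attempt(deps, startedAt);
    await deps.logRun(entry);
    return entry;
  } finally {
    // Release even if logging throws, so a failed run cannot leave a stuck lock.
    await deps.releaseLock();
  }
}

async function runCronOnce(deps: CronDeps, alertThreshold: number): Promise<RunLogEntry> {
  const startedAt = new Date();

  // A second instance must not start while the first still holds the lock.
  if (!(await deps.tryAcquireLock())) {
    const skipped: RunLogEntry = {
      status: "skipped",
      startedAt,
      finishedAt: new Date(),
      remindersScheduled: 0,
    };
    await deps.logRun(skipped);
    return skipped;
  }

  const entry = await runLocked(deps, startedAt);

  if (entry.status === "failure") {
    const failures = await deps.incrementFailureCount();
    if (failures >= alertThreshold) {
      await deps.dispatchAlert(`Reminder cron failed ${failures} times: ${entry.error}`);
    }
  } else {
    // Assumption: the threshold counts consecutive failures.
    await deps.resetFailureCount();
  }
  return entry;
}
```

Injecting the dependencies keeps the ordering testable without a database, which may also help a reviewer who is following the runbook compare the documented flow against observed log entries.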
Testing Requirements
Documentation review: have at least one team member who was not involved in implementation follow the runbook to manually trigger the cron function and verify the execution log is created correctly. Verify all SQL snippets execute without error against the current database schema. Verify all Supabase Dashboard navigation steps are accurate against the current Supabase UI. No automated tests required for this task — review is the validation mechanism.
If the daily cron job takes longer than 24 hours to complete (due to a large dataset or a slow query), a second instance will start while the first is still running, causing duplicate reminder dispatch for assignments processed twice.
Mitigation & Contingency
Mitigation: Implement an advisory lock that prevents a second run from starting if the first is still active. Monitor run duration via the execution log table and alert if any run exceeds 30 minutes. The 10,000-assignment load test should verify the run completes in under 5 minutes.
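The 30-minute duration check could look something like the sketch below. It assumes the execution log exposes a start and finish timestamp per run; the ExecutionLogRow shape and its column names (started_at, finished_at) are hypothetical and should be replaced with the actual schema.

```typescript
// Hypothetical row shape; the real execution log columns may be named differently.
interface ExecutionLogRow {
  id: number;
  started_at: string;         // ISO 8601 timestamp
  finished_at: string | null; // null while the run is still in progress
}

const MAX_RUN_MS = 30 * 60 * 1000; // alert threshold from the mitigation: 30 minutes

// Returns the runs whose duration exceeds maxMs. Runs that have not finished
// yet are measured against `now`, so a hung run is flagged as well.
function findOverlongRuns(
  rows: ExecutionLogRow[],
  now: Date,
  maxMs: number = MAX_RUN_MS,
): ExecutionLogRow[] {
  return rows.filter((row) => {
    const start = Date.parse(row.started_at);
    const end = row.finished_at === null ? now.getTime() : Date.parse(row.finished_at);
    return end - start > maxMs;
  });
}
```

Measuring unfinished runs against the current time is a design choice: it lets the alert fire while a slow run is still holding the advisory lock, rather than only after it completes.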
Contingency: If a double-run occurs, the idempotency guard in ReminderDispatchService prevents duplicate notifications from being sent. The execution log identifies the overlap and allows the ops team to investigate the root cause.
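The ticket does not say how the idempotency guard in ReminderDispatchService works. One common approach, sketched here under that caveat, is to derive a deterministic key per reminder and refuse to send when the key has already been claimed. The Reminder fields, the key format, and createIdempotentDispatcher are all assumptions made for illustration; in a real system the in-memory set would more plausibly be a table with a unique constraint.

```typescript
// Hypothetical shape of one reminder; field names are illustrative only.
interface Reminder {
  assignmentId: string;
  reminderType: string; // e.g. "follow_up"
  dueDate: string;      // calendar date the reminder is for, e.g. "2024-01-01"
}

// The same assignment, type, and date always produce the same key.
function reminderKey(r: Reminder): string {
  return `${r.assignmentId}:${r.reminderType}:${r.dueDate}`;
}

// Wraps a send function so that each key is dispatched at most once.
function createIdempotentDispatcher(send: (r: Reminder) => void) {
  // Stand-in for persistent storage with a unique constraint on the key.
  const claimed = new Set<string>();

  return function dispatch(r: Reminder): boolean {
    const key = reminderKey(r);
    if (claimed.has(key)) {
      return false; // already handled by this or an overlapping run
    }
    claimed.add(key); // claim before sending so a concurrent run cannot also send
    send(r);
    return true;
  };
}
```

Claiming the key before sending favors "at most once" delivery: if the send itself fails, the reminder is not retried automatically. Whether that trade-off is acceptable depends on how the real service handles send failures.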
If the activity registration hook that resets last_contact_date is implemented incorrectly or is not triggered for all activity types (e.g., proxy registrations, bulk registrations), peer mentors will continue receiving reminders even after logging contact, damaging user trust.
Mitigation & Contingency
Mitigation: Audit all code paths that create activity records (direct registration, proxy registration, bulk registration, coordinator proxy) and ensure each path calls the assignment contact update. Write integration tests for each registration path asserting that last_contact_date is updated.
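One way to make the audit easier is to route every registration path through a single hook, then assert per path that last_contact_date moves. The sketch below uses an in-memory stand-in for the contact tracking repository; the path identifiers, the Activity shape, and the function names are hypothetical, and the rule that the date only moves forward is an assumption.

```typescript
// The four registration paths named in the mitigation, as illustrative identifiers.
const REGISTRATION_PATHS = ["direct", "proxy", "bulk", "coordinator_proxy"] as const;
type RegistrationPath = (typeof REGISTRATION_PATHS)[number];

interface Activity {
  assignmentId: string;
  occurredAt: Date;
  path: RegistrationPath;
}

// In-memory stand-in for the contact tracking repository.
class InMemoryContactRepository {
  private readonly lastContact = new Map<string, Date>();

  updateLastContactDate(assignmentId: string, date: Date): void {
    const current = this.lastContact.get(assignmentId);
    // Assumption: a back-dated activity must not move the date backwards.
    if (current === undefined || date.getTime() > current.getTime()) {
      this.lastContact.set(assignmentId, date);
    }
  }

  getLastContactDate(assignmentId: string): Date | undefined {
    return this.lastContact.get(assignmentId);
  }
}

// Single hook that every registration path is expected to call.
function onActivityRegistered(repo: InMemoryContactRepository, activity: Activity): void {
  repo.updateLastContactDate(activity.assignmentId, activity.occurredAt);
}

// Per-path check in the spirit of the integration tests: returns the paths
// for which last_contact_date was NOT updated.
function pathsMissingContactUpdate(
  register: (repo: InMemoryContactRepository, activity: Activity) => void,
): RegistrationPath[] {
  const occurredAt = new Date("2024-01-01T10:00:00Z");
  return REGISTRATION_PATHS.filter((path) => {
    const repo = new InMemoryContactRepository();
    register(repo, { assignmentId: "assignment-1", occurredAt, path });
    const stored = repo.getLastContactDate("assignment-1");
    return stored === undefined || stored.getTime() !== occurredAt.getTime();
  });
}
```

Passing the registration function in as a parameter lets the same check run against a deliberately broken variant, which is a cheap way to confirm the test would actually catch a path that skips the update.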
Contingency: Provide an authenticated admin endpoint that allows manual correction of last_contact_date for a specific assignment, enabling ops to resolve individual cases while the bug is fixed and deployed.
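The core validation of such a correction endpoint might resemble the sketch below. It is deliberately framework-free and rests on several assumptions: the caller's role has already been verified upstream, "admin" is the role name, a Map stands in for the assignments table, and rejecting future dates is a policy choice rather than a stated requirement. None of these names come from the codebase.

```typescript
interface CorrectionRequest {
  callerRole: string;      // assumed to come from an already-verified auth token
  assignmentId: string;
  lastContactDate: string; // ISO 8601 date or timestamp
}

interface CorrectionResponse {
  status: number;
  message: string;
}

// `assignments` stands in for the assignments table: a key exists for every
// known assignment, and the value is its current last_contact_date (or null).
function handleLastContactCorrection(
  req: CorrectionRequest,
  assignments: Map<string, Date | null>,
  now: Date,
): CorrectionResponse {
  if (req.callerRole !== "admin") {
    return { status: 403, message: "admin role required" };
  }
  if (!assignments.has(req.assignmentId)) {
    return { status: 404, message: "assignment not found" };
  }
  const parsed = Date.parse(req.lastContactDate);
  if (isNaN(parsed)) {
    return { status: 400, message: "lastContactDate is not a valid date" };
  }
  if (parsed > now.getTime()) {
    // Assumption: a contact cannot be recorded in the future.
    return { status: 400, message: "lastContactDate must not be in the future" };
  }
  assignments.set(req.assignmentId, new Date(parsed));
  return { status: 200, message: "last_contact_date updated" };
}
```

Checking authorization before anything else avoids revealing whether an assignment exists to a non-admin caller.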