Operations runbook
Email, webhooks, env verification, signing, and day-two troubleshooting.
Internal notes for running AgentNexusAPI in production. Complement PRODUCTION_READINESS.md.
Production environment pass
Before or after each deploy, confirm the host (Vercel, Netlify, etc.) has the variables listed in .env.example (Production deployment checklist).
Automated check (local): with a filled .env.local (or production-like values), run:
npm run verify:production-env
Treat warnings as blockers for real production unless you explicitly accept the gap (e.g. no approval email until Resend is configured). To fail the script on those too:
VERIFY_PRODUCTION_ENV_STRICT=1 npm run verify:production-env
CI: GitHub Actions does not run this by default (no secrets in PR builds). For a deploy workflow, inject the same secrets the host uses and run verify:production-env before next build, or run it manually after setting env in the dashboard.
Email (Resend) and approvals
Symptom: Approvers never receive mail, or /api/v1/evaluate returns 200 with pending_human but nothing arrives.
Checks:
- RESEND_API_KEY and verified RESEND_FROM_EMAIL in the host environment.
- NEXT_PUBLIC_APP_URL matches the public origin used in magic links (HTTPS in production).
- Logs: JSON lines with event: "approval_email_failed" on POST /api/v1/evaluate, or [evaluate] approval notification email: on stderr.
- Sentry: Exceptions tagged component: "approval_email" when SENTRY_DSN is set.
Impact: Evaluations can sit in pending_human with stages created; there is no automatic retry today.
Mitigation: Fix Resend/domain, then rely on existing approval links in email (if partial send) or re-trigger notifications from the product (future: retry queue).
Optional automation: Schedule a job that finds evaluations stuck in pending_human with a recent stage and no email audit event, then calls your own notification path — not shipped by default.
Slack and Teams (approval routing)
Slack (one-click approve/reject):
- Dashboard → Workspaces → [workspace] → Slack / Teams approval routing → Add to Slack (OAuth). Then set Slack channel ID and invite the app to that channel.
- In the Slack app (api.slack.com): Interactivity & Shortcuts → Request URL https://<your-host>/api/integrations/slack/interactions. OAuth & Permissions → Redirect URL https://<your-host>/api/integrations/slack/oauth/callback. Bot token scopes: chat:write, users:read.email.
- Host environment: SLACK_CLIENT_ID, SLACK_CLIENT_SECRET, SLACK_SIGNING_SECRET, and SLACK_OAUTH_STATE_SECRET (if the dedicated secret is unset, APPROVAL_OTP_SECRET is reused, only for signing OAuth state).
Who may click Approve: The handler loads the Slack user’s profile email via users.info and requires it to match a workspace member email for the stage’s required_role (same pool as Resend approver emails). Use a restricted channel and aligned emails.
Critical OTP tiers: Approve from Slack is blocked until OTP is verified on the hosted approval page; Reject still works from Slack. The Slack message includes an Open approval page button.
Teams: Optional Incoming Webhook URL in the same dashboard section. The host POSTs a MessageCard with Review and approve or reject opening /approve/<token> in the browser (no signed callback from Teams webhooks).
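For orientation, the Teams post can be pictured as a legacy MessageCard with a single OpenUri action. This is only a sketch: the function name, the `summary` text, and the exact card fields are assumptions, and only the button label and the /approve/<token> path come from these notes.

```typescript
// Hypothetical builder for the Teams MessageCard; not the shipped implementation.
function buildApprovalMessageCard(appUrl: string, token: string, summary: string) {
  // The card only opens the hosted approval page; Teams webhooks send no signed callback.
  const approveUrl = new URL(`/approve/${encodeURIComponent(token)}`, appUrl).toString();
  return {
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    summary,
    title: summary,
    potentialAction: [
      {
        "@type": "OpenUri",
        name: "Review and approve or reject",
        targets: [{ os: "default", uri: approveUrl }],
      },
    ],
  };
}
```

The returned object would be sent as the JSON body of a POST to the workspace's Incoming Webhook URL.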
Logs: On POST /v1/evaluate, JSON lines event: "approval_slack_failed" / approval_teams_failed when a chat post fails (email may still succeed).
Webhook delivery failures
Symptom: Integrator sees no callback; dashboard / webhook_deliveries shows failed or blocked.
- blocked: URL failed SSRF / hostname allowlist (assertWebhookUrlSafeForDispatch). If the workspace configured allowed webhook hostnames (dashboard), the URL host must match one pattern (exact or *.suffix.example.com). Fix the URL, allowlist, or environment (production requires safe public HTTPS targets).
- failed: HTTP non-2xx or network/timeout after in-process retries.
Manual replay (MVP):
- Dashboard: On an evaluation’s audit detail page, each failed webhook row has Replay delivery (workspace members only; uses the same server-side replay as below).
- Cron / automation: Set CRON_SECRET in the server environment (long random string; treat like a password).
- POST /api/cron/replay-webhook with header Authorization: Bearer <CRON_SECRET> and body: { "delivery_id": "<uuid from webhook_deliveries>" }
- Only rows with status: failed are replayed. blocked and delivered are rejected.
- The replay re-validates the URL, re-signs with the signing secret resolved at replay time (per-workspace secret if configured, else INTEGRATION_WEBHOOK_HMAC_SECRET), and runs the same retry loop against the existing row. If you rotated the tenant secret after the original attempt, the integrator must verify with the current secret or the signature will not match the original delivery.
Scheduling: Point Netlify Scheduled Functions, GitHub Actions, or another cron at this endpoint when you want periodic retries (still pass delivery_id per known failure).
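A scheduled job could call the replay endpoint along these lines. This is a minimal sketch: `buildReplayRequest`, `replayDelivery`, and `baseUrl` are illustrative names, and the Content-Type header is an assumption (only the Authorization header and the body are specified above).

```typescript
// Hypothetical helper: builds the request for POST /api/cron/replay-webhook.
interface ReplayRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildReplayRequest(baseUrl: string, cronSecret: string, deliveryId: string): ReplayRequest {
  if (!cronSecret) throw new Error("CRON_SECRET is not set");
  return {
    url: new URL("/api/cron/replay-webhook", baseUrl).toString(),
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${cronSecret}`,
        "Content-Type": "application/json", // assumed; the body is JSON
      },
      body: JSON.stringify({ delivery_id: deliveryId }),
    },
  };
}

// Sends one replay and returns the HTTP status. The response body shape is
// not documented in these notes, so it is not parsed here.
async function replayDelivery(baseUrl: string, cronSecret: string, deliveryId: string): Promise<number> {
  const { url, init } = buildReplayRequest(baseUrl, cronSecret, deliveryId);
  const res = await fetch(url, init);
  return res.status;
}
```

The job still has to know which delivery_id values to retry, for example by querying webhook_deliveries for failed rows first.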
Approval deadline enforcement (time-boxed human gates)
Policies may set approval_timeout_seconds on routing stages so the active approval link expires after a wall-clock window. Enforcement is not real-time in the browser; schedule a job that calls the timeout endpoint:
- Set CRON_SECRET (same value as webhook replay automation).
- POST /api/cron/approval-timeouts with header Authorization: Bearer <CRON_SECRET>.
- Optional JSON body: { "limit": 40 } (default 40, max 200 per run).
Run every 1–5 minutes (or tighter if you need stricter SLA). Without this job, stages stay pending past the policy window.
Details: APPROVAL_TIMEOUTS.md.
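The timeout job can be sketched as a single call with a bounded limit. The helper names are hypothetical, the lower bound of 1 is an assumption, and the clamp only mirrors the documented default (40) and maximum (200) on the caller side; the server presumably enforces its own bounds.

```typescript
const DEFAULT_LIMIT = 40; // documented default per run
const MAX_LIMIT = 200;    // documented maximum per run

// Keeps the requested batch size inside the documented range.
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit)) return DEFAULT_LIMIT;
  return Math.min(MAX_LIMIT, Math.max(1, Math.floor(limit)));
}

// Hypothetical caller for POST /api/cron/approval-timeouts; returns the HTTP status.
async function runApprovalTimeouts(baseUrl: string, cronSecret: string, limit?: number): Promise<number> {
  const res = await fetch(new URL("/api/cron/approval-timeouts", baseUrl).toString(), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${cronSecret}`,
      "Content-Type": "application/json", // assumed; the optional body is JSON
    },
    body: JSON.stringify({ limit: clampLimit(limit) }),
  });
  return res.status;
}
```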
Integrator webhooks (hostname allowlist & HMAC signing)
Hostname allowlist (optional) — webhook_allowed_hostnames on workspace_integration_settings, edited under Dashboard → Workspaces → [workspace] → Integrator webhooks. When non-empty, webhook_url on POST /v1/evaluate must use a host that matches one entry (exact hostname or *.example.com for subdomains). Applies again on outbound dispatch and replay. Empty list → no extra restriction beyond global SSRF rules.
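The matching rule can be illustrated as follows. This is a sketch, not the shipped check: `hostMatchesAllowlist` is a hypothetical name, and it assumes case-insensitive comparison and that `*.example.com` matches subdomains only, not the bare `example.com`.

```typescript
// Hypothetical allowlist matcher: exact hostname or "*.suffix" for subdomains.
function hostMatchesAllowlist(host: string, patterns: string[]): boolean {
  // Empty list: no extra restriction beyond the global SSRF rules.
  if (patterns.length === 0) return true;
  const h = host.toLowerCase();
  return patterns.some((raw) => {
    const p = raw.trim().toLowerCase();
    if (p.startsWith("*.")) {
      const suffix = p.slice(1); // ".example.com"
      // Require at least one label in front of the suffix.
      return h.endsWith(suffix) && h.length > suffix.length;
    }
    return h === p;
  });
}
```

For example, with the list `["*.example.com", "hooks.partner.io"]`, `api.example.com` and `hooks.partner.io` pass, while `example.com` and `evilexample.com` do not.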
Signing header (HMAC)
Outbound integrator webhooks (both approval_required and terminal_outcome payloads) may include:
X-AgentNexus-Signature: sha256=<hex> where <hex> is HMAC-SHA256 over the raw JSON body bytes using the signing secret (same format many providers use; integrators should constant-time compare).
Which secret is used
- If the workspace has a non-empty webhook_hmac_secret in workspace_integration_settings (same dashboard section), that value is used.
- Otherwise, if INTEGRATION_WEBHOOK_HMAC_SECRET is set on the host, that env value is used.
- If neither is set, no signature header is sent.
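The precedence above, together with the header format, can be sketched like this. Function names are hypothetical, and treating an empty env value as "not set" is an assumption.

```typescript
import { createHmac } from "node:crypto";

// Resolution order: per-workspace secret, then the env default, else none.
function resolveSigningSecret(
  workspaceSecret: string | null | undefined,
  envSecret: string | undefined,
): string | null {
  if (workspaceSecret) return workspaceSecret; // non-empty per-workspace secret wins
  if (envSecret) return envSecret;             // INTEGRATION_WEBHOOK_HMAC_SECRET
  return null;                                 // neither set: no signature header
}

// Builds the signature header over the raw JSON body, or nothing when unsigned.
function signatureHeaders(rawBody: string, secret: string | null): Record<string, string> {
  if (secret === null) return {};
  const hex = createHmac("sha256", secret).update(rawBody).digest("hex");
  return { "X-AgentNexus-Signature": `sha256=${hex}` };
}
```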
Rotation (tenant / per-workspace)
- In the dashboard, set a new signing secret (overwrites the stored value). Coordinate with the integrator so they start verifying with the new key before or as you save (brief overlap: integrator accepts old+new, or you pause dispatches during cutover).
- To revert to global-only signing: check Remove per-workspace secret and save; new deliveries use INTEGRATION_WEBHOOK_HMAC_SECRET only (if set).
Rotation (global / env)
- Deploy new INTEGRATION_WEBHOOK_HMAC_SECRET; workspaces without a per-workspace secret immediately use the new key.
- Workspaces with a per-workspace secret are unchanged until you clear or update that row.
Verification checklist for integrators
- Header name: X-AgentNexus-Signature, value prefix sha256=.
- Body: use the exact raw request body (no pretty-print or key reorder if your stack re-serializes).
- Secret: the one agreed for that workspace (dashboard) or the platform default (ops / env).
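An integrator-side check following this checklist might look like the sketch below. `verifyAgentNexusSignature` is an illustrative name; how the raw body and header are obtained depends on the integrator's framework and is left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies X-AgentNexus-Signature against the raw request body.
// `rawBody` must be the bytes exactly as received, before any JSON parsing.
function verifyAgentNexusSignature(
  rawBody: Buffer | string,
  header: string | undefined,
  secret: string,
): boolean {
  const prefix = "sha256=";
  if (!header || !header.startsWith(prefix)) return false;
  const received = Buffer.from(header.slice(prefix.length), "hex");
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Comparing decoded bytes with `timingSafeEqual` gives the constant-time comparison recommended above; a plain string `===` on the hex values would not.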
Troubleshooting
- Valid delivery but signature fails: Wrong secret (tenant vs global), body was transformed by a proxy, or replay after secret rotation (signature is computed with the current secret, not the one at original created_at).
- No signature header: Neither per-workspace nor env secret configured; configure at least one if authenticity is required.
Health and database
- Liveness: GET /api/health (no database access).
- DB reachability: GET /api/health?deep=1 returns 503 if the Supabase query fails; check supabase_client_error-style logs on the server.
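An external monitor could interpret the deep check roughly as follows. This is a sketch with hypothetical names; it assumes a healthy response is any 2xx, since only the 503 case is documented here.

```typescript
type DeepHealth = "ok" | "db_unreachable" | "unexpected";

// Maps the status of GET /api/health?deep=1 to a coarse state.
function classifyDeepHealth(status: number): DeepHealth {
  if (status === 503) return "db_unreachable"; // Supabase query failed
  if (status >= 200 && status < 300) return "ok"; // assumed success range
  return "unexpected";
}

// Hypothetical probe; `baseUrl` is the deployment origin.
async function probeDeepHealth(baseUrl: string): Promise<DeepHealth> {
  const res = await fetch(new URL("/api/health?deep=1", baseUrl).toString());
  return classifyDeepHealth(res.status);
}
```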
Rate limits
If users see 429 from Edge middleware (code: rate_limited, no scope), tune RATE_LIMIT_*_PER_MINUTE or use CDN/WAF. If scope: api_key, the Node evaluate/receipt limiter fired; tune RATE_LIMIT_PER_API_KEY_PER_MINUTE or disable with RATE_LIMIT_DISABLED=1 (disables both layers that share Upstash).
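When triaging a reported 429, the layer can be read off the response body. The sketch below assumes the body is JSON with an optional `scope` field, as the notes above imply; the type and function names are illustrative.

```typescript
interface RateLimitBody {
  code?: string;
  scope?: string;
}

type RateLimitLayer = "edge" | "api_key" | null;

// Returns which limiter produced a 429, or null for any other status.
function rateLimitLayer(status: number, body: RateLimitBody): RateLimitLayer {
  if (status !== 429) return null;
  // scope: "api_key" means the Node evaluate/receipt limiter
  // (RATE_LIMIT_PER_API_KEY_PER_MINUTE); no scope means Edge middleware
  // (RATE_LIMIT_*_PER_MINUTE).
  return body.scope === "api_key" ? "api_key" : "edge";
}
```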