An autonomous monitoring agent on AWS that detects incidents across services and generates AI-assisted on-call explanations (severity, likely causes, next actions).
Runs on a schedule, checks real production signals, stores state, suppresses alert spam via cooldowns, and sends email notifications with an AI-generated incident summary when a problem is detected.
Perception → Memory → Decision → Action
Runs without prompts
Gemini generates structured on-call guidance.
Signal-to-action, not chatbot fluff
EventBridge triggers a Lambda function that pulls health and performance signals from multiple sources. Cooldown state is persisted in DynamoDB to prevent repeat notifications for the same incident. Alerts are delivered via SNS (email subscription). When an incident exists, the agent calls Gemini to generate a compact incident explanation and recommended next steps.
EventBridge (Schedule)
→ Lambda (Ops Agent)
→ CloudWatch Metrics (CloudFront / API Gateway / Lambda)
→ HTTP Health Check (Render)
→ DynamoDB (cooldowns)
→ SNS (email alert)
→ Gemini API (incident explainer)
Health endpoint check
Edge stability + traffic
Reliability + latency
Execution health
Detection stays deterministic (rules + thresholds). AI is used only for incident explanation to keep behavior stable and auditable. Secrets are stored in AWS SSM Parameter Store (SecureString) and read at runtime. Alert spam is prevented using per-incident cooldowns persisted in DynamoDB. LLM output was tuned empirically; a 750-token cap was the reliability/cost sweet spot for complete explanations.
SEVERITY: med
LIKELY_CAUSE:
- Misconfigured health check URL (typo)
- Deployment changed or removed health endpoint
IMMEDIATE_ACTIONS:
- Verify configured health check URL
- Check recent deployments
- Curl health endpoint to confirm accessibility
IF_REPEATS_CHECK:
- Review service logs for startup errors
- Confirm service is listening on expected port