AWS • AGENTS • OBSERVABILITY

Personal Cloud Ops Agent

An autonomous monitoring agent on AWS that detects incidents across services and generates AI-assisted on-call explanations (severity, likely causes, next actions).

What It Does

Runs on a schedule, checks real production signals, stores state, suppresses alert spam via cooldowns, and sends email notifications with an AI-generated incident summary when a problem is detected.

Autonomous Agent Loop

Perception → Memory → Decision → Action

Perception: HTTP + CloudWatch metrics
Memory: DynamoDB cooldown state
Decision: deterministic rules + thresholds
Action: SNS alert emails

Runs without prompts

AI Incident Explainer

Gemini generates structured on-call guidance.

Output: Severity + Causes + Actions
Controls: temperature tuned, 750-token cap

Signal-to-action, not chatbot fluff
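
The explainer call can be sketched against the Gemini REST API. The model name, temperature value, and prompt wording below are illustrative assumptions — the project only specifies a tuned temperature and the 750-token cap:

```python
import json
import urllib.request

# Model choice is an assumption; any generateContent-capable model works here.
GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-flash:generateContent"
)

def build_explainer_prompt(incident):
    """Force the fixed SEVERITY / LIKELY_CAUSE / ... structure in the reply."""
    return (
        "You are an on-call assistant. Given this incident, reply using "
        "exactly these headings: SEVERITY, LIKELY_CAUSE, IMMEDIATE_ACTIONS, "
        "IF_REPEATS_CHECK. Be terse.\n\n"
        f"Incident: {json.dumps(incident)}"
    )

def explain_incident(incident, api_key):
    body = {
        "contents": [{"parts": [{"text": build_explainer_prompt(incident)}]}],
        # Low temperature + hard token cap keep output stable and cheap;
        # 0.2 is an assumed value for the tuned temperature.
        "generationConfig": {"temperature": 0.2, "maxOutputTokens": 750},
    }
    req = urllib.request.Request(
        f"{GEMINI_URL}?key={api_key}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["candidates"][0]["content"]["parts"][0]["text"]
```

Pinning the headings in the prompt is what makes the output parse reliably into the alert email rather than free-form chat.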

Architecture

EventBridge triggers a Lambda function that pulls health and performance signals from multiple sources. Cooldown state is persisted in DynamoDB to prevent repeat notifications for the same incident. Alerts are delivered via SNS (email subscription). When an incident exists, the agent calls Gemini to generate a compact incident explanation and recommended next steps.

EventBridge (schedule) → Lambda (ops agent) → CloudWatch Metrics (CloudFront / API Gateway / Lambda) + HTTP health check (Render) → DynamoDB (cooldowns) → Gemini API (incident explainer) → SNS (email alert)
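
One scheduled tick of the agent can be sketched as below. The function names and wiring are hypothetical; in the real agent the four callables map to CloudWatch/HTTP polling, the DynamoDB cooldown check, the Gemini explainer, and SNS publish:

```python
from typing import Callable

def run_agent(
    check_signals: Callable[[], list],   # perception: metric + health checks
    in_cooldown: Callable[[str], bool],  # memory: DynamoDB cooldown lookup
    explain: Callable[[dict], str],      # Gemini incident explainer
    notify: Callable[[str], None],       # action: SNS email publish
) -> int:
    """One scheduled tick. Returns the number of alerts sent."""
    sent = 0
    for incident in check_signals():
        if in_cooldown(incident["id"]):
            continue                     # suppress repeat alerts
        notify(explain(incident))        # AI summary only after detection
        sent += 1
    return sent
```

Keeping the loop body this small is what makes the "detection stays deterministic" rule easy to audit: the LLM is touched only after a rule has already fired.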

Signals Monitored

Render Service

Health endpoint check

Signal: /health HTTP status
Failure: non-2xx / timeout
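
The failure semantics above (non-2xx or timeout both count as down) can be sketched with the stdlib; the URL and timeout are illustrative:

```python
import urllib.error
import urllib.request

def check_health(url, timeout=5.0):
    """True only for a 2xx response; timeouts and connection errors fail."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        return False
```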

CloudFront

Edge stability + traffic

Metric: 5xxErrorRate
Metric: Requests

API Gateway

Reliability + latency

Metric: 5XXError
Metric: Latency

Lambda API

Execution health

Metric: Errors
Metric: Throttles
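
A metric pull for one of these signals can be sketched with boto3's get_metric_statistics; the 15-minute window, 5-minute period, and 5% threshold in the usage note are assumed values, not the project's tuned ones:

```python
from datetime import datetime, timedelta, timezone

def breaches(datapoints, stat, threshold):
    """True if the most recent datapoint exceeds the threshold."""
    if not datapoints:
        return False
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    return latest[stat] > threshold

def cloudfront_5xx_rate(distribution_id):
    """Fetch recent 5xxErrorRate datapoints for a CloudFront distribution."""
    import boto3  # bundled in the AWS Lambda Python runtime

    # CloudFront metrics are global and reported in us-east-1
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/CloudFront",
        MetricName="5xxErrorRate",
        Dimensions=[
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return resp["Datapoints"]
```

An incident then fires on something like `breaches(cloudfront_5xx_rate(dist_id), "Average", 5.0)`, keeping the detection rule a plain numeric comparison.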

Engineering Decisions

Detection stays deterministic (rules + thresholds). AI is used only for incident explanation to keep behavior stable and auditable. Secrets are stored in AWS SSM Parameter Store (SecureString) and read at runtime. Alert spam is prevented using per-incident cooldowns persisted in DynamoDB. LLM output was tuned empirically; a 750-token cap was the reliability/cost sweet spot for complete explanations.
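
The cooldown and secret-handling decisions can be sketched as below. The table schema, attribute names, and one-hour window are assumptions; the conditional write is one standard way to make the cooldown claim atomic so concurrent runs cannot double-alert:

```python
import time

COOLDOWN_SECONDS = 3600  # illustrative; the real window may differ per incident

def cooldown_item(incident_id, now=None):
    """Build the DynamoDB item that marks an incident as recently alerted."""
    now = int(time.time()) if now is None else now
    return {
        "incident_id": {"S": incident_id},
        "expires_at": {"N": str(now + COOLDOWN_SECONDS)},
    }

def try_acquire_cooldown(table, incident_id):
    """Atomically claim the alert slot; False means we alerted recently."""
    import boto3
    from botocore.exceptions import ClientError

    ddb = boto3.client("dynamodb")
    now = int(time.time())
    try:
        ddb.put_item(
            TableName=table,
            Item=cooldown_item(incident_id, now),
            # Write wins only if no row exists or the old cooldown expired
            ConditionExpression=(
                "attribute_not_exists(incident_id) OR expires_at < :now"
            ),
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def read_secret(name):
    """Fetch a SecureString parameter (e.g. the Gemini key) at runtime."""
    import boto3

    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
```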

Sample Incident Output

SEVERITY: med
LIKELY_CAUSE:
- Misconfigured health check URL (typo)
- Deployment changed or removed health endpoint
IMMEDIATE_ACTIONS:
- Verify configured health check URL
- Check recent deployments
- Curl health endpoint to confirm accessibility
IF_REPEATS_CHECK:
- Review service logs for startup errors
- Confirm service is listening on expected port
