Amazon CloudWatch Deep Dive: A Complete Guide to Monitoring / Logs / Metrics Design, Compared with Cloud Monitoring (GCP) and Azure Monitor

Introduction (Key Points Summary)

  • In this article we’ll take Amazon CloudWatch as the main axis and, by comparing it with Google Cloud Monitoring / Cloud Logging and Azure Monitor / Log Analytics, carefully organize the overall design of monitoring, logs, metrics, and alerts.
  • CloudWatch is AWS’s standard monitoring platform, covering metrics collection, log aggregation, alarms, dashboards, tracing (via X-Ray), and event integration in one service.
  • In GCP, Cloud Monitoring + Cloud Logging play the same role, while in Azure it’s Azure Monitor + Log Analytics / Application Insights. In every cloud, the essence is how you combine metrics + logs + events + visualization + notifications.
  • The key design points are:
    1. What you monitor (how you define SLOs/SLIs)
    2. At what granularity you collect metrics and logs
    3. What is automatically alerted and what requires human judgment
    4. Who your dashboards are for and how you build them
    5. Balancing cost and performance (log volume and retention period)
  • This article is written for SREs, infra / platform engineers, web/enterprise app developers, data platform engineers, security/governance staff, corporate IT operations teams, and startup tech leads.
  • By the end, the goal is that you’ll be able to design monitoring platforms on GCP and Azure using almost the same mental model, with CloudWatch as your reference point.

1. What Is Amazon CloudWatch? A Single “Landing Zone” for Monitoring

Amazon CloudWatch is a managed monitoring service that collects metrics (numbers), logs, and events from AWS resources and applications, and provides alarms, visualization, and automated actions on top of them.

Because many services like EC2, RDS, and Lambda automatically push standard metrics to CloudWatch, it’s fair to say you can start from the mindset of “send everything to CloudWatch first, then decide what to do with it.”

Representative features include:

  • Metrics: CPU utilization, disk I/O, SQS queue length, custom metrics, etc., typically collected at 1-minute or 5-minute intervals depending on the service and whether detailed monitoring is enabled.
  • Logs: Collected as CloudWatch Logs groups from app logs, OS logs, ALB / CloudFront logs, and so on.
  • Alarms: Threshold-based alerts built from metrics or log-based metrics, sending notifications via SNS or ops tools.
  • Dashboards: “Status boards” that visualize metrics and alarms in one place.
  • Events / EventBridge integration: Fire Lambda or Step Functions on resource state changes or scheduled events (see the sketch after this list).
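
As a concrete example of that last point, the following minimal sketch (boto3; the rule name and Lambda ARN are hypothetical) creates an EventBridge rule that invokes a notification function whenever any CloudWatch alarm changes state. The target function would additionally need a resource-based permission allowing EventBridge to invoke it.

import json

import boto3

events = boto3.client("events", region_name="ap-northeast-1")

# Rule that matches every CloudWatch alarm state change (OK -> ALARM, ALARM -> OK, ...).
events.put_rule(
    Name="cloudwatch-alarm-state-change",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
    }),
    State="ENABLED",
)

# Hypothetical Lambda function that forwards the alarm details to a chat tool.
events.put_targets(
    Rule="cloudwatch-alarm-state-change",
    Targets=[{
        "Id": "notify-ops-lambda",
        "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:notify-ops",
    }],
)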

In GCP, Cloud Monitoring (formerly Stackdriver Monitoring) handles metrics and alerts, Cloud Logging handles logs; in Azure, Azure Monitor is the metrics platform and Log Analytics / Application Insights cover logs and traces. Conceptually, they fill the same roles.


2. Before Designing Monitoring: Start from SLOs/SLIs (What Do You Want to See?)

Before memorizing monitoring tool features, you first need to decide what you actually want to monitor.

2.1 SLOs (Service Level Objectives) and SLIs (Service Level Indicators)

  • SLI: A metric (indicator) that expresses service quality.
    • Examples:
      • “Per-minute successful request rate”
      • “p95 latency”
      • “Error rate”
      • “Average job completion time”
  • SLO: A target that defines how much deviation you’ll tolerate for an SLI.
    • Examples:
      • “HTTP success rate ≥ 99.9% over 90 days”
      • “p95 latency under 500 ms”

In CloudWatch, you can turn these into metrics and wire them into alarms, connecting your day-to-day operations directly with your SLOs.
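
To make that concrete, here is a minimal sketch (boto3) that computes an error-rate SLI with CloudWatch metric math, using the standard Errors and Invocations metrics of a hypothetical Lambda function named app-api:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def lambda_sum(query_id: str, metric_name: str) -> dict:
    # 5-minute Sum of a standard Lambda metric for the hypothetical function "app-api".
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": "app-api"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    MetricDataQueries=[
        lambda_sum("errors", "Errors"),
        lambda_sum("invocations", "Invocations"),
        # Metric math: error rate in percent, which is our SLI.
        {"Id": "error_rate", "Expression": "100 * errors / invocations", "Label": "ErrorRate(%)"},
    ],
)

for ts, value in zip(resp["MetricDataResults"][0]["Timestamps"],
                     resp["MetricDataResults"][0]["Values"]):
    print(ts, f"{value:.3f}%")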

In GCP Cloud Monitoring and Azure Monitor, you similarly express SLIs as monitoring metrics or KQL queries, then reflect SLOs in alert conditions or SLO views.

2.2 Separate “Infrastructure Metrics” from “User Experience Metrics”

  • Infrastructure side: CPU, approximate memory usage, disk I/O, network bandwidth, number of containers, queue length, etc.
  • Application / user experience side: error rate, latency, revenue / conversions, sign-in success rate, etc.

In CloudWatch, consciously distinguish between infra metrics (provided by AWS) and app-side custom metrics, and focus on alerting on metrics that directly reflect user experience. That’s how you avoid alert fatigue.


3. CloudWatch Metrics: Three Pillars — Standard, Custom, and Log-Based

3.1 Standard Metrics

AWS services automatically send standard metrics to CloudWatch. For example:

  • EC2: CPUUtilization, NetworkIn/Out, StatusCheckFailed, etc.
  • RDS: CPUUtilization, FreeStorageSpace, DatabaseConnections, etc.
  • Lambda: Invocations, Errors, Duration, Throttles, etc.
  • ALB/NLB: RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count, etc.
  • SQS: ApproximateNumberOfMessagesVisible, etc.

These are available at no extra cost (some higher-frequency metrics are billed), so the fastest path is to build your first dashboards around the standard metrics.
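
For example, the following sketch (boto3; the instance ID is hypothetical) lists the standard EC2 metrics available for one instance and then pulls an hour of CPUUtilization data:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# Discover which standard EC2 metrics exist for a hypothetical instance.
for metric in cloudwatch.list_metrics(
    Namespace="AWS/EC2",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
)["Metrics"]:
    print(metric["MetricName"])

# Pull the last hour of CPUUtilization at 5-minute resolution.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])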

3.2 Custom Metrics

App-specific indicators (e.g., number of orders, failed logins, batch processing time) can be sent to CloudWatch using PutMetricData (API/SDK) or the CloudWatch Agent.

  • Namespace example: MyApp/Orders
  • Dimensions example: service=frontend, region=ap-northeast-1

In GCP, you can send custom metrics to Cloud Monitoring; in Azure, you can do the same into Azure Monitor. The idea is identical: express app success/failure/latency numerically and tie it to SLOs.

3.3 Log-Based Metrics

You can create metric filters on CloudWatch Logs to turn log patterns into metrics.

  • Examples:
    • Count log entries representing HTTP 500 errors and create an App/5xxCount metric, then build an error-rate alert.
    • Count “payment failure” logs as a metric and detect abnormal spikes.

GCP Cloud Logging + Cloud Monitoring and Azure Log Analytics also have mechanisms to turn log query results into metrics and alerts.


4. CloudWatch Logs: Aggregation, Search, Filters, Retention

4.1 Log Groups and Log Streams

CloudWatch Logs organizes logs into a hierarchy of log groups → log streams.

  • Log group: Unit like an app or a service (e.g., /aws/lambda/app-api)
  • Log stream: Finer-grained units such as instance IDs, container IDs, or dates

If you decide on log granularity and grouping up front, day-to-day operations get much easier later.
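
As a minimal sketch of this hierarchy (boto3; the group and stream names are hypothetical), the following creates a log group per service and a stream per instance, then writes one structured log event:

import time

import boto3

logs = boto3.client("logs", region_name="ap-northeast-1")

# Hypothetical layout: one group per service, one stream per instance.
log_group = "/myapp/web"
log_stream = "i-0123456789abcdef0"

# Both calls fail if the group / stream already exists, so real code would
# catch ResourceAlreadyExistsException.
logs.create_log_group(logGroupName=log_group)
logs.create_log_stream(logGroupName=log_group, logStreamName=log_stream)

# Timestamps are epoch milliseconds and events must be in chronological order.
logs.put_log_events(
    logGroupName=log_group,
    logStreamName=log_stream,
    logEvents=[{
        "timestamp": int(time.time() * 1000),
        "message": '{"level":"INFO","msg":"user signed in"}',
    }],
)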

4.2 How Logs Are Collected

  • CloudWatch Agent:
    • Collects file logs and OS metrics from EC2 and on-prem environments.
  • Service integrations:
    • Lambda, API Gateway, ECS, EKS, RDS (enhanced logs), ALB, WAF, VPC Flow Logs, etc. can be configured to output directly to CloudWatch Logs.
  • Fluent Bit / Fluentd / OpenTelemetry Collector:
    • Common choices for routing logs from container environments to CloudWatch Logs.

In GCP, you’d use the Logging agent or Ops Agent; in Azure, the Log Analytics agent or Azure Monitor Agent. Same concept.

4.3 Log Retention and Lifecycle

CloudWatch Logs lets you set retention periods per log group.

  • Production app logs: 30–90 days
  • Security / audit logs: 6–24 months (but usually cheaper to export to S3 or equivalent for long-term storage)

For long-term retention, a common pattern is to regularly export logs to S3 and query with Athena. GCP has Logging → Cloud Storage exports; Azure has exports from Log Analytics to storage accounts, playing the same role.
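
A minimal sketch of both steps with boto3 (the log group, bucket name, and prefix are hypothetical; the S3 bucket policy must allow CloudWatch Logs to write into it):

from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="ap-northeast-1")

# Keep app logs for 30 days inside CloudWatch Logs.
logs.put_retention_policy(logGroupName="/aws/lambda/app-api", retentionInDays=30)

# Export the last 30 days to S3 for cheap long-term storage (times are epoch ms).
now = datetime.now(timezone.utc)
logs.create_export_task(
    taskName="app-api-archive",
    logGroupName="/aws/lambda/app-api",
    fromTime=int((now - timedelta(days=30)).timestamp() * 1000),
    to=int(now.timestamp() * 1000),
    destination="my-log-archive-bucket",
    destinationPrefix="cloudwatch/app-api",
)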


5. Alarms and Notifications: How to Choose Thresholds

5.1 CloudWatch Alarm Basics

A CloudWatch alarm enters the ALARM state when a metric breaches a given condition for the configured number of evaluation periods.

  • Sample conditions:
    • CPUUtilization > 80% for 5 consecutive minutes
    • 5xx error rate > 1% in at least 2 of the last 10 one-minute periods
    • App/LoginFailureCount exceeds 50 events in 1 minute

When an alarm triggers, you can notify via SNS (email, Slack, PagerDuty, etc.), or use it as a trigger for Lambda or EC2 Auto Scaling policies.
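
As a concrete sketch, the first sample condition above could be created like this with boto3 (the instance ID and SNS topic ARN are hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# ALARM when average CPU stays above 80% for 5 consecutive 1-minute periods;
# notifications go to an SNS topic that fans out to email / Slack / PagerDuty.
cloudwatch.put_metric_alarm(
    AlarmName="web-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="missing",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:ops-alerts"],
)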

5.2 Preventing Alert Fatigue

  • Only alert on conditions that should change human or system behaviour
    • “CPU above 70%” alone is rarely actionable; focus on user-impacting SLI metrics.
  • Use warning vs critical levels
    • Handle minor issues in next-day reviews; reserve night-time paging for truly critical events.
  • Silence and revisit noisy rules
    • For rules with frequent false positives, pause and review thresholds, time windows, and metric choices.

GCP and Azure have very flexible alert policies as well; the principles are identical.


6. Dashboards and Visualization: Be Clear Who the Screen Is For

6.1 CloudWatch Dashboards

CloudWatch Dashboards allow you to combine multiple metrics, alarms, and widgets into a single view.

  • Example dashboard types:
    • Platform team: global CPU / network / error rates / alarm counts
    • Product team: latency, error rate, order count, revenue for a specific service
    • Executives: a few key business KPIs plus a high-level uptime view
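
A minimal product-team style dashboard could be created like this with boto3 (the ALB identifier is hypothetical; the dashboard body is a JSON document):

import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "ALB requests and 5xx",
                "region": "ap-northeast-1",
                "stat": "Sum",
                "period": 300,
                "metrics": [
                    ["AWS/ApplicationELB", "RequestCount",
                     "LoadBalancer", "app/my-alb/0123456789abcdef"],
                    ["AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count",
                     "LoadBalancer", "app/my-alb/0123456789abcdef"],
                ],
            },
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "p95 target response time",
                "region": "ap-northeast-1",
                "stat": "p95",
                "period": 300,
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime",
                     "LoadBalancer", "app/my-alb/0123456789abcdef"],
                ],
            },
        },
    ]
}

# DashboardBody must be a JSON string, not a dict.
cloudwatch.put_dashboard(
    DashboardName="product-team-overview",
    DashboardBody=json.dumps(dashboard_body),
)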

6.2 Comparing Dashboards Across the Three Clouds

  • CloudWatch Dashboards: Metric- and alarm-centric panels.
  • Cloud Monitoring Dashboards (GCP): Combine metrics, log-based indicators, Uptime Checks, etc.
  • Azure Dashboards / Workbooks: Rich visualizations that mix Azure Monitor metrics, Log Analytics queries, and APM data.

Regardless of the platform, dashboards get better when you define a precise audience and prune to only what that audience needs.


7. Distributed Tracing and Tying It to Logs: X-Ray / Cloud Trace / Application Insights

With microservices, Lambda, and message queues, simple metrics and logs often aren’t enough to see where latency is introduced.

  • AWS: Connect AWS X-Ray with CloudWatch Logs, embed trace IDs into logs, and you can visualize which services a request went through and where it spent time.
  • GCP: Use Cloud Trace / Cloud Profiler in combination with Cloud Logging.
  • Azure: Application Insights traces are integrated with Log Analytics.

CloudWatch itself is not a tracing engine, but by linking logs/metrics with traces via identifiers, you significantly simplify root cause analysis (RCA).
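
As a small sketch of the AWS side, the following Lambda handler (assuming X-Ray active tracing is enabled for the function) pulls the current trace ID from the _X_AMZN_TRACE_ID environment variable and writes it into a structured log line, so log entries in CloudWatch Logs can be joined with traces in X-Ray:

import json
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Inside Lambda, the active X-Ray trace header is exposed via this environment
    # variable (e.g. "Root=1-6789abcd-...;Parent=...;Sampled=1"); keep the Root part.
    trace_header = os.environ.get("_X_AMZN_TRACE_ID", "")
    trace_id = next(
        (p.split("=", 1)[1] for p in trace_header.split(";") if p.startswith("Root=")),
        None,
    )

    # Structured log line carrying the trace ID, so a CloudWatch Logs entry can be
    # looked up against the corresponding X-Ray trace during root cause analysis.
    logger.info(json.dumps({"msg": "order created", "order_id": "o-123", "trace_id": trace_id}))
    return {"statusCode": 200}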


8. Cost Design: Finding the “Right Amount” of Metrics, Logs, and Dashboards

Monitoring platforms are one of those areas where costs creep up the more you use them.

8.1 Main CloudWatch Billing Drivers

  • Metrics (many standard metrics include free tiers; high-resolution metrics may cost extra)
  • Custom metrics
  • Log ingest volume, storage, and retention
  • Number of dashboards
  • Number of alarms (billed per alarm; high-resolution alarms cost more)

8.2 Ways to Control Costs

  1. Don’t create unnecessary custom metrics
    • Use log-based metrics where appropriate; find a balance.
  2. Set retention based on log purpose
    • App debug logs: short-term only
    • Security logs: short retention in CloudWatch; long-term archive in S3, etc.
  3. Reserve high-frequency metrics for where they’re truly needed
    • Don’t publish high-resolution (sub-minute) metrics when 1-minute granularity is enough.
  4. Regularly clean up dashboards and alarms
    • Decommission unused panels and alarm rules.

GCP and Azure are the same: log/metric volume and retention directly drive cost, so it’s vital to decide “how much to keep” from the start.
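
As one small, concrete cost-control step, the following sketch (boto3) lists log groups that still have no retention policy, which usually means “never expire” and silently growing storage:

import boto3

logs = boto3.client("logs", region_name="ap-northeast-1")

# Walk all log groups and flag those with no retention policy set.
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            size_mb = group.get("storedBytes", 0) / (1024 * 1024)
            print(f"{group['logGroupName']}: no retention set, ~{size_mb:.1f} MB stored")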


9. Design Patterns: Three Sample “Monitoring Architectures”

9.1 Small to Mid-Sized Web/API Services

  • Goal: Keep a web/API service developed by one or a few teams running stably.
  • AWS setup:
    • CloudWatch metrics: standard metrics for EC2/LB/RDS/Lambda plus app custom metrics (error rate, latency, etc.)
    • CloudWatch Logs: app logs and ALB access logs
    • Alarms: SLO-based metrics (success rate, latency) and critical infra issues (e.g., disk running out)
    • Dashboards: One or two team-focused boards plus one global view

On GCP you’d do the same with Cloud Monitoring + Logging; on Azure with Azure Monitor + Log Analytics.

9.2 Batch Jobs and Data Pipelines

  • Goal: Make success/failure of night-time batches, ETL, or streaming jobs visible.
  • Setup:
    • Metrics: job counts, failures, processing time, latency
    • Logs: collect step-level logs in CloudWatch Logs and create log-based metrics on failure patterns
    • Alarms: notify when jobs don’t finish within a certain window or failure rates spike

In GCP (Dataflow / Composer) and Azure (Data Factory, etc.), you send metrics/logs into Cloud Monitoring / Azure Monitor the same way.
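
One way to catch the “job never finished” case on AWS is a heartbeat alarm: the job publishes a success metric, and missing data is treated as breaching. A minimal sketch (boto3; the namespace, metric, and SNS topic ARN are hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# The nightly job publishes Sum=1 to MyApp/Batch JobSucceeded when it completes.
# With TreatMissingData="breaching", hours without any datapoint count as failures,
# so the alarm fires only when no success heartbeat arrived in the last 24 hours.
cloudwatch.put_metric_alarm(
    AlarmName="nightly-etl-missing-heartbeat",
    Namespace="MyApp/Batch",
    MetricName="JobSucceeded",
    Dimensions=[{"Name": "job", "Value": "nightly-etl"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=24,
    DatapointsToAlarm=24,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:ops-alerts"],
)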

9.3 Multi-Account / Multi-Cloud Environments

  • Goal: Monitor multiple AWS accounts, or AWS + GCP + Azure together.
  • Architecture idea:
    • Export metrics/logs from each cloud’s CloudWatch / Cloud Monitoring / Azure Monitor into a shared visualization platform (e.g., Grafana, Datadog).
    • Still keep native monitoring on each cloud to take advantage of service-specific features.

CloudWatch, Cloud Monitoring, and Azure Monitor all support export/APIs to feed external systems, making “two-tier” designs straightforward.


10. Practical CloudWatch Configuration Examples

To make this less abstract, here are some simple examples.

10.1 Sending a Custom Metric (Python SDK)

from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def put_order_count(count: int):
    # Publish an app-level custom metric under the MyApp/Orders namespace.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Orders",
        MetricData=[
            {
                "MetricName": "OrderCreated",
                "Dimensions": [
                    {"Name": "service", "Value": "web"},
                    {"Name": "env", "Value": "prod"},
                ],
                # A timezone-aware datetime; if omitted, CloudWatch uses ingestion time.
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(count),
                "Unit": "Count",
            }
        ],
    )

10.2 Metric Filter Example (Count 5xx Errors from Logs)

# Assumes the function writes JSON-structured logs with a numeric "status" field.
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/app-api" \
  --filter-name "5xx-count" \
  --filter-pattern '{ $.status >= 500 }' \
  --metric-transformations \
    metricName=Lambda5xxCount,metricNamespace=MyApp/API,metricValue=1

You can then create an alarm on MyApp/API Lambda5xxCount to fire when 5xx errors spike, making app abnormalities easier to catch quickly.


11. Common Pitfalls and How to Avoid Them

  • Only monitoring infra metrics and missing user experience
    • Fix: add error rate, latency, and business metrics as custom metrics and design alerts around SLOs.
  • Dumping all logs into CloudWatch and suffering from slow searches and high cost
    • Fix: group logs by type, set retention per purpose, and archive long-term logs to S3.
  • So many alerts that no one pays attention anymore
    • Fix:
      • Treat non-SLI alerts as “Warnings” to be reviewed daily/weekly.
      • Temporarily stop noisy rules and revisit their SLOs/thresholds.
  • Dashboards are fragmented between dev and ops teams
    • Fix: create one shared SLO dashboard, then layer team-specific panels on top.

12. Who Gains What? (Benefits by Role)

  • SREs / platform engineers
    • You get a solid pattern for designing metrics, logs, alarms, and dashboards together around CloudWatch. The same pattern scales across clouds, reducing operational burden.
  • Application developers
    • You can see how your code behaves in production via CloudWatch metrics and logs, and move closer to being an engineer who thinks in SLIs.
  • Corporate IT / operations
    • You can mix CloudWatch, Cloud Monitoring, and Azure Monitor and still draw a realistic line of “this is enough monitoring”. It also simplifies audits and incident reports.
  • Security / governance teams
    • Handling of audit logs, access logs, and change histories, and their retention strategies, become clearer, making it easier to close gaps between policies and implementation.
  • Startup tech leads
    • Even with a small team, you can build a “proper” monitoring platform centred on CloudWatch, and you’ll know how to extend the same design to GCP and Azure as you grow.

13. Three Things You Can Do Today

  1. Define just one or two SLOs/SLIs
    • e.g., Success rate 99.9%, p95 latency 500 ms.
  2. Create a single CloudWatch metric that represents that SLI
    • Use a log-based metric or a custom metric — start with just one.
  3. Build an alarm and a simple dashboard for that metric
    • With just this, you move from “looking at metrics randomly” to “tracking goals vs actuals”.

14. Conclusion: CloudWatch Is the Hub Linking Numbers, Logs, and Events

Amazon CloudWatch is more than a metrics collection tool. It centralizes:

  • Metrics (SLIs/SLOs)
  • Logs (context)
  • Alarms (triggers for action)
  • Dashboards (shared visibility)
  • Events (automated actions)

In short, it is the operational hub of your AWS environment.

GCP Cloud Monitoring/Logging and Azure Monitor/Log Analytics serve the same purpose. Once you learn how to design which indicators to collect and how to act on them, your knowledge carries over across clouds.

Together with the services covered in earlier articles (S3, EC2, Lambda, RDS, VPC, CloudFront), adding CloudWatch should let you see the path to a coherent, “visible, protected, and scalable” platform spanning infra → app → frontend → monitoring.

Next time we’ll look at AWS IAM (identity and access management), and, by comparing it with Cloud IAM (GCP) and Azure AD / Entra ID + RBAC, dig into permission design, role design, and how to use identities for a zero-trust future.

By greeden
