Amazon EC2 Deep Dive: Practical Virtual Server Design Without the Confusion—Real-World Comparison with GCP Compute Engine & Azure VM
Introduction (Key Takeaways)
- This article is a practical guide you can use as-is in the field, centered on AWS Amazon EC2 and including a feature-by-feature comparison with Google Compute Engine (GCE) and Azure Virtual Machines (Azure VM).
- The bottom line first: For “classic IaaS” availability and scalability, standardize on EC2 with Auto Scaling Group + ELB + Launch Template. GCE’s strength is custom machine types (fine-grained vCPU/memory), while Azure integrates naturally with Active Directory/Microsoft Entra ID (formerly Azure AD) and virtual networks, fitting well with enterprise governance.
- The crux of design lies in five areas: instance selection, storage design, network/security, scaling/availability, and operations automation.
- Includes ready-to-use CLI samples (side-by-side for EC2/GCE/Azure VM), user data (cloud-init), templates for ASG/scale design, and an operations checklist.
- Intended readers: IT departments migrating to the cloud, web/line-of-business app developers, SRE/infrastructure engineers, security/governance leads, and startup tech leads. It is written for anyone who wants to quickly grasp the big picture, from on-prem VM lift & shift to modern scale-out design.
1. What Is Amazon EC2: From IaaS Basics to “Patterns You Should Use Today”
Amazon Elastic Compute Cloud (EC2) provides virtual servers (IaaS) on demand. You launch instances from AMIs (machine images) and define network boundaries with VPC/subnets/security groups.
Current-generation EC2 instances run on the Nitro System, which lets you design for stable performance, strong isolation, and high-speed networking via the Elastic Network Adapter (ENA). For availability, multi-AZ distribution is the baseline; combining Auto Scaling Groups (ASG) with Elastic Load Balancing (ELB) makes it straightforward to self-heal during failures and to follow demand fluctuations.
In practice, you standardize launch parameters (AMI/type/disk/user data/role) with Launch Templates, manage min/desired/max capacity via the ASG, and use target tracking scaling to auto-adjust toward a goal such as average CPU or request count per target. This is the “default” pattern.
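To make the target tracking idea concrete, here is a minimal sketch of the proportional arithmetic behind it: when the average metric sits above the target, capacity grows roughly by the ratio metric/target, clamped to the group's min and max. This is an illustrative approximation only, not AWS's actual algorithm (which also accounts for warm-up, cooldown, and conservative scale-in). The function name and argument order are hypothetical.

```shell
#!/bin/sh
# Approximate target-tracking arithmetic (illustrative; NOT AWS's internal algorithm).
# Usage: desired_capacity CURRENT METRIC TARGET MIN MAX
#   CURRENT = instances now, METRIC/TARGET = integer percentages, MIN/MAX = ASG bounds
desired_capacity() (
  current=$1 metric=$2 target=$3 min=$4 max=$5
  # ceil(current * metric / target) using integer arithmetic only
  want=$(( (current * metric + target - 1) / target ))
  # Clamp to the ASG's min/max capacity
  [ "$want" -lt "$min" ] && want=$min
  [ "$want" -gt "$max" ] && want=$max
  echo "$want"
)
```

For example, calling desired_capacity 4 75 50 2 6 models 4 instances averaging 75% CPU against a 50% target with bounds 2 to 6, and yields 6.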
2. Representative Use Cases (Design Patterns & Comparison Points)
- Web/API Clusters Prioritizing Availability
  - Scale with ASG + ALB (HTTP/HTTPS) and distribute across multiple AZs. Use EBS for the root volume and keep state external in RDS/ElastiCache/EFS.
  - On GCE, use a Managed Instance Group (MIG) + HTTP(S) Load Balancing; on Azure, use Virtual Machine Scale Sets (VMSS) + Application Gateway/Front Door as equivalent patterns.
- Batch Processing / Job Execution
  - Cut costs with Spot Instances. Ideal for interruption-tolerant workloads (MapReduce, video transcoding).
  - GCE Spot VMs and Azure Spot VMs follow the same idea. Re-execution strategies and checkpoint design are key.
- HPC / Machine Learning
  - Needs high-bandwidth networking, GPUs, and high-speed storage. Use cluster placement groups for low-latency networking and Dedicated Hosts for noisy-neighbor isolation.
  - Rough equivalents: GCP’s A2/A3 families and Azure’s HB/HC/ND/NC. Prioritize network bandwidth, GPU generation, and storage throughput.
- Lift & Shift of Enterprise Servers
  - Start with a single AZ and an ASG (min 1 / max 1) to ensure self-healing. Use Systems Manager (SSM) for OS configuration and patch deployment to simplify ops.
  - GCE offers OS Config; Azure provides Update Management/Automation for similar operations.
3. Instance Selection: A 5-Step Path to Avoid Confusion
Choose the right family and right size to balance performance and cost.
- Select a Family
  - General Purpose: Balanced; fits most web/API workloads.
  - Compute Optimized: CPU-bound processing; app/game servers, etc.
  - Memory Optimized: In-memory DBs/caches/large-scale analytics.
  - Storage Optimized: For OLAP/log processing needing local NVMe.
  - Accelerated (GPU/FPGA, etc.): AI training/inference, GPU rendering.
- Storage Modality
  - Choose EBS (persistent block) or instance store (fast, non-persistent).
  - Use EBS for the root by default. Consider instance store only for temp files/caches.
- Network Performance
  - Pick a size whose ENA enhanced-networking bandwidth and packets-per-second limits cover your requirements.
- Availability Requirements
  - Use a multi-AZ ASG as the baseline; add Dedicated Hosts/Dedicated Instances only when you need stronger noisy-neighbor isolation.
- Pricing Model
  - Validate on On-Demand → optimize steady state with Savings Plans/RI → absorb bursts/batch with Spot.
GCE comparison: Custom machine types allow fine-grained vCPU/memory (GB) specification. Azure comparison: A rich variety of SKUs (sizes) and features like Proximity Placement Groups for latency optimization.
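As a rough aid for the family-selection step, the sketch below maps a workload's memory-per-vCPU ratio to a family category. It assumes the common ratios of about 2 GiB per vCPU for compute optimized, 4 GiB for general purpose, and 8 GiB for memory optimized; individual instance types can deviate, and the function name is hypothetical.

```shell
#!/bin/sh
# Suggest an EC2 family category from the memory-per-vCPU ratio a workload needs.
# Assumed ratios (typical, not guaranteed): ~2 GiB/vCPU compute, ~4 general, ~8 memory.
# Usage: suggest_family VCPUS MEMORY_GIB   (integers)
suggest_family() (
  vcpus=$1 mem_gib=$2
  # Compare mem/vcpu against 2 and 4 without floating point
  if [ "$mem_gib" -le $(( vcpus * 2 )) ]; then
    echo "compute-optimized"
  elif [ "$mem_gib" -le $(( vcpus * 4 )) ]; then
    echo "general-purpose"
  else
    echo "memory-optimized"
  fi
)
```

A workload needing 4 vCPUs and 32 GiB lands in memory-optimized; treat the result as a starting point to validate on On-Demand, not a final answer.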
4. Storage Design: Choosing Among EBS / Instance Store / EFS / FSx
- EBS (Elastic Block Store)
  - Types: General Purpose SSD (gp2/gp3), Provisioned IOPS SSD (io1/io2), Throughput Optimized HDD (st1), Cold HDD (sc1).
  - Best for root/data disks. Manage generations with snapshots. Enable encryption by default.
- Instance Store
  - Physically attached high-speed storage (NVMe on current generations). Non-persistent: data survives a reboot but is lost when the instance is stopped, hibernated, or terminated, or when the underlying host fails.
  - Restrict to temp areas/shuffle/cache.
- EFS (NFS-compatible shared files) / FSx (Windows/Lustre/ONTAP, etc.)
  - Use EFS/FSx when shared access across instances or apps is needed. Choose by throughput/latency requirements.
GCE: Persistent Disk (PD) options include Standard/SSD/Extreme; Local SSD is ultra-low latency but non-persistent.
Azure: Managed Disks (Standard HDD/SSD, Premium/Ultra). Azure Files provides shared storage.
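When choosing between General Purpose SSD generations, the baseline arithmetic helps: gp2 scales at 3 IOPS per GiB (floor 100, ceiling 16,000), while gp3 starts at a flat 3,000 IOPS regardless of size. The helper below is a small sketch of that comparison; the function names are hypothetical.

```shell
#!/bin/sh
# Baseline IOPS of a gp2 volume: 3 IOPS per GiB, floor 100, ceiling 16000.
# Usage: gp2_baseline_iops SIZE_GIB
gp2_baseline_iops() (
  size_gib=$1
  iops=$(( size_gib * 3 ))
  [ "$iops" -lt 100 ] && iops=100
  [ "$iops" -gt 16000 ] && iops=16000
  echo "$iops"
)

# Succeeds (exit 0) when gp3's flat 3000 IOPS baseline exceeds gp2's for this size.
# Usage: gp3_beats_gp2_baseline SIZE_GIB
gp3_beats_gp2_baseline() (
  [ "$(gp2_baseline_iops "$1")" -lt 3000 ]
)
```

For volumes under 1,000 GiB the gp3 baseline is higher, which is one reason small root volumes are often good candidates for gp3.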
5. Networking & Security: The Initial “Four-Pack” to Lock Down First
- VPC/Subnets
  - Use private subnets + NAT/proxy for outbound access. Keep public subnets minimal.
- Security Groups (SG)
  - Stateful virtual firewalls. Enforce least privilege. Network ACLs are supplemental, stateless controls at the subnet level.
- Identity
  - Use IAM roles (instance profiles) to avoid embedding credentials. Enforce IMDSv2.
- Operational Access
  - Use SSM Session Manager for bastionless shell/port forwarding. Retire key distribution and bastion management.
GCE: VPC/Firewall, attach service accounts to instances. OS Login simplifies IAM integration.
Azure: VNet/NSG/ASG, eliminate credentials with Managed Identity. Bastion/Just-In-Time for secure access.
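Enforcing IMDSv2 on an instance that is already running can be sketched as a thin wrapper over the AWS CLI. The wrapper name is hypothetical and the instance ID is a placeholder you supply; for new instances, the same setting can go into the Launch Template's MetadataOptions instead.

```shell
#!/bin/sh
# Require IMDSv2 (session tokens) on an existing instance.
# Usage: require_imdsv2 INSTANCE_ID
require_imdsv2() {
  aws ec2 modify-instance-metadata-options \
    --instance-id "$1" \
    --http-tokens required \
    --http-endpoint enabled
}
```

After this change, metadata requests without a session token are rejected, which is what blunts simple SSRF attempts against the metadata endpoint.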
6. Availability & Scaling: Make ASG + ELB Your “Standard Equipment”
- Auto Scaling Group (ASG)
  - Health checks auto-replace failed instances. Policies include scheduled, target tracking, and step scaling.
- Elastic Load Balancing (ALB/NLB/GWLB)
  - ALB: L7, HTTP/HTTPS with path/host-based routing.
  - NLB: L4 (TCP/UDP/TLS), ultra-low latency, static IPs; strong for high throughput and TCP-level pass-through (including gRPC).
- Multi-AZ / Multi-Region
  - AZ distribution is mandatory. For cross-region, Route 53 health checks and data replication design are key.
GCE: Managed Instance Group + Cloud Load Balancing (global HTTP(S)).
Azure: VMSS + Application Gateway/Load Balancer/Front Door.
7. Automation & Bootstrap: A Repeatable “Ritual of Launch”
- Use Launch Templates + user data (cloud-init) for declarative initialization.
- Bake AMIs (e.g., with Packer) to pre-embed time-consuming setup, reducing boot time.
- Standardize configuration and patching with Systems Manager (State Manager/Run Command/Patch Manager).
GCE: Startup Script/Metadata, Instance Templates.
Azure: VM Extensions (Custom Script), Azure Compute Gallery (formerly Shared Image Gallery).
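As one example of standardizing patching through Systems Manager, the sketch below triggers a patch compliance scan with Run Command and the AWS-RunPatchBaseline document. The wrapper name and the tag-based targeting are assumptions for illustration; in practice you would more likely schedule this through a maintenance window or State Manager association.

```shell
#!/bin/sh
# Trigger a patch compliance scan on instances carrying a given Name tag.
# Operation=Scan only reports missing patches; Operation=Install would apply them.
# Usage: patch_scan NAME_TAG_VALUE
patch_scan() {
  aws ssm send-command \
    --document-name "AWS-RunPatchBaseline" \
    --targets "Key=tag:Name,Values=$1" \
    --parameters "Operation=Scan"
}
```

Running patch_scan web-1 would target the instance tagged in the launch sample, provided it runs the SSM Agent and has an instance profile that allows Systems Manager.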
8. Monitoring, Operations & Traceability
- CloudWatch: Metrics (CPU/network/disk) and Agent for OS metrics/log collection.
- CloudTrail: API activity trails—track instance/network/key operations.
- SSM: Visibility into inventory/vulnerabilities/patching.
- Alerting: Define SLO/SLI first, and prioritize user-experience-linked indicators (e.g., p95 latency) over raw thresholds.
GCE: Cloud Monitoring/Logging, Ops Agent.
Azure: Azure Monitor/Log Analytics, Diagnostic Settings.
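Because the alerting advice leans on p95 latency, it helps to be clear about what that number is. The sketch below computes a nearest-rank p95 from raw samples, for example latencies pulled from an access log; the function name is hypothetical, and monitoring services may use a different interpolation method.

```shell
#!/bin/sh
# Nearest-rank p95 of numeric samples read from stdin (one value per line).
# rank = ceil(0.95 * N), computed with integer arithmetic to avoid float rounding.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }'
}
```

Piping one latency value per line into p95 prints the value that 95% of samples fall at or below, which is the kind of indicator worth alerting on ahead of raw CPU thresholds.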
9. Cost Optimization: Steady, No-Drama Moves
- Rightsizing: Measure CPU/memory/disk/network usage and size down one step when appropriate.
- Combine pricing models:
  - On-Demand: dev/experiments.
  - Savings Plans/Reserved Instances: steady workloads.
  - Spot: bursts/batch.
- Scheduled stops: Auto-stop on nights and weekends.
- Externalize state: Make instances stateless to ease scale-down and replacement.
- Storage optimization: Revisit EBS types, tune provisioned IOPS/throughput on gp3, prune unused snapshots.
GCE: Committed Use Discounts + Spot, Sustained Use Discounts (automatic for long usage).
Azure: Reserved VM Instances + Spot, with similar auto-stop/schedules.
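Scheduled stops for an ASG-managed fleet can be sketched with two scheduled actions: one that scales to zero in the evening and one that restores capacity in the morning. The wrapper name, action names, and cron times are assumptions (recurrence is evaluated in UTC by default; the values below correspond to 22:00 and 08:00 JST on weekdays).

```shell
#!/bin/sh
# Scale an ASG to zero on weekday evenings and restore it on weekday mornings.
# Cron times are UTC: 13:00 Mon-Fri = 22:00 JST, 23:00 Sun-Thu = 08:00 JST Mon-Fri.
# Usage: schedule_office_hours ASG_NAME DAYTIME_MIN DAYTIME_MAX
schedule_office_hours() {
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name "$1" \
    --scheduled-action-name night-stop \
    --recurrence "0 13 * * 1-5" \
    --min-size 0 --max-size 0 --desired-capacity 0
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name "$1" \
    --scheduled-action-name morning-start \
    --recurrence "0 23 * * 0-4" \
    --min-size "$2" --max-size "$3" --desired-capacity "$2"
}
```

This suits dev and staging groups; for production, keep the minimum above zero and let scaling policies handle the quiet hours.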
10. “Cross-Cloud Glossary” for Smoother Ops Conversations
- Launch definition: Launch Template (EC2) / Instance Template (GCE) / VMSS Model + Image (Azure)
- Auto scale: ASG (EC2) / MIG (GCE) / VMSS (Azure)
- L7 load balancer: ALB (EC2) / HTTP(S) LB (GCE) / Application Gateway (Azure)
- Credential delivery: IAM Role (EC2) / Service Account (GCE) / Managed Identity (Azure)
- Bastionless access: SSM Session Manager (EC2) / IAP + OS Login (GCE) / Bastion/Just-in-Time (Azure)
11. Practical Samples (Hands-On Across 3 Clouds)
11.1 EC2 Launch (Minimal, with User Data)
# 1) Security group (example opens only 80; add 22 only if you SSH instead of using SSM)
aws ec2 create-security-group \
  --group-name web-sg --description "web sg" --vpc-id vpc-xxxxxxxx
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx --protocol tcp --port 80 --cidr 0.0.0.0/0
# 2) Key pair (not needed if operating via SSM)
# --query/--output text extract only the private key; the default output is JSON
aws ec2 create-key-pair --key-name web-key \
  --query 'KeyMaterial' --output text > web-key.pem && chmod 400 web-key.pem
# 3) Launch (shortest example without Launch Template)
# The file holds cloud-config YAML despite the .sh name; cloud-init detects it from the first line
cat > user-data.sh <<'EOF'
#cloud-config
packages:
  - nginx
runcmd:
  - systemctl enable nginx
  - systemctl start nginx
EOF
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t3.small \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --iam-instance-profile Name=ec2-app-role \
  --user-data file://user-data.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-1}]'
11.2 GCE (Equivalent “Shortest Deploy”)
# The http-server tag only opens port 80 if a matching firewall rule (e.g., default-allow-http) exists
gcloud compute instances create web-1 \
  --zone=asia-northeast1-b \
  --machine-type=e2-small \
  --metadata=startup-script='#!/bin/bash
sudo apt -y update && sudo apt -y install nginx
sudo systemctl enable nginx && sudo systemctl start nginx' \
  --tags=http-server
11.3 Azure VM (Equivalent “Shortest Deploy”)
az group create -n rg-web -l japaneast
# cloudinit.yaml is almost equivalent to EC2's cloud-config
# The UbuntuLTS alias was removed from recent Azure CLI releases; Ubuntu2204 is a current alias
az vm create -g rg-web -n web-1 \
  --image Ubuntu2204 --size Standard_B1ms \
  --custom-data cloudinit.yaml \
  --admin-username azureuser --generate-ssh-keys
az vm open-port -g rg-web -n web-1 --port 80
11.4 Auto Scaling (EC2: Highlights for Launch Template + ASG + ALB)
# Launch Template
# UserData must be base64-encoded; -w0 (no line wrapping) is a GNU coreutils flag
aws ec2 create-launch-template \
  --launch-template-name web-lt \
  --launch-template-data '{
    "ImageId":"ami-xxxxxxxxxxxxxxxxx",
    "InstanceType":"t3.small",
    "IamInstanceProfile":{"Name":"ec2-app-role"},
    "SecurityGroupIds":["sg-xxxxxxxx"],
    "UserData":"'"$(base64 -w0 user-data.sh)"'"
  }'
# Auto Scaling Group (2 AZs, target tracking CPU 50%)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-lt \
  --min-size 2 --max-size 6 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-a,subnet-b"
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":50.0}'
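The ALB side of this pattern is not shown in the commands above. As a partial sketch, the function below creates a target group with an HTTP health check, attaches it to an existing ASG, and switches the ASG to load balancer health checks. The function name, the web-tg name, the /healthz path, and the 120-second grace period are assumptions; creating the ALB itself and its listener is left out.

```shell
#!/bin/sh
# Create a target group with an HTTP health check and attach it to an existing ASG.
# Usage: attach_alb_target_group VPC_ID ASG_NAME
attach_alb_target_group() {
  tg_arn=$(aws elbv2 create-target-group \
    --name web-tg --protocol HTTP --port 80 \
    --vpc-id "$1" \
    --health-check-path /healthz \
    --query 'TargetGroups[0].TargetGroupArn' --output text) || return 1
  # Register the ASG's instances with the target group automatically
  aws autoscaling attach-load-balancer-target-groups \
    --auto-scaling-group-name "$2" \
    --target-group-arns "$tg_arn"
  # Let the ASG replace instances that fail the load balancer health check
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$2" \
    --health-check-type ELB --health-check-grace-period 120
}
```

With ELB health checks enabled, an instance that boots but never serves traffic is replaced, rather than counted as healthy just because the VM is running.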
12. Security & Operational Governance: Practical “Standard Rules”
- Clear public boundary: Make ALB public; keep EC2 instances private. Avoid exposing instances directly to the internet.
- Zero embedded credentials: Always use IAM Roles/Managed Identity/Service Accounts; never store keys in environment variables or files.
- IMDSv2/metadata protection: Mandatory for SSRF defense. Control metadata access on GCE/Azure as well.
- Patch operations: Use SSM Patch Manager, etc., with regular schedules. For emergencies, apply blue/green in stages.
- Centralize logs: Send app logs to CloudWatch Logs (or equivalents on GCE/Azure). Assume structured logs to improve observability.
13. Availability Design in Practice: 3 Patterns
- Standard Three-Tier Web
  - ALB → ASG (app) → RDS/EFS/ElastiCache. Favor scale-out.
- Stateless API + External Auth
  - ALB + Cognito/IdP for auth; use immutable deployments for the ASG to avoid drift.
- HPC/Low Latency
  - Choose a suitable Placement Group (Cluster/Spread/Partition). For storage, pair instance store with a parallel file system (e.g., FSx for Lustre).
Across GCE/Azure, the fundamentals remain: LB + scaling group + externalized state (DB/Cache).
14. Common Pitfalls & How to Avoid Them
- Going to prod in a single AZ: Full outage on failure. Treat ASG + multi-AZ as the “minimum line.”
- Long-lived bastion servers: A breeding ground for risk. Move to SSM Session Manager/IAP/Bastion JIT.
- Secrets on instances: Keys in env/files. Migrate to Parameter Store/Secrets Manager/Key Vault/Secret Manager.
- Keeping state locally: Blocks scaling. Push shared state to EFS/object storage/external DBs, and keep only instance-specific data on EBS.
- Snapshot sprawl: Define generations/retention/tags and prune regularly.
- Reactive monitoring: Define SLO/SLI/alerts before launch.
15. Design Checklist (Lock These in Week One)
- Goals: Define availability (AZ distribution) / performance (p95) / recovery time / operational boundaries with numbers.
- Instances: Family/size/pricing model, AMI management, user data.
- Storage: EBS type/IOPS/snapshots/encryption, need for shared FS.
- Network: VPC/CIDR/subnets, SG/NACL/routes, DNS/name resolution.
- Security: IAM roles/IMDSv2/key management, SSM for bastionless ops.
- Scaling: ASG capacity & policies, health checks, deployment method (Rolling/Blue-Green/Canary).
- Monitoring: Metrics/log collection/dashboards/alerts, need for tracing.
- Cost: Rightsizing/stop schedules/Spot coverage, Savings Plans/RI policy.
- Change management: IaC (e.g., Terraform/CloudFormation/Bicep/Deployment Manager) and review steps.
- Exit plan: AMI/snapshots/data migration, DNS/certs/monitoring cleanup.
16. Operations Runbook (First-Response Pattern for Incidents)
- Detection: Quickly verify alert details and impact scope (AZ/ASG/targets).
- Check auto-recovery: Confirm ASG replacement in progress and healthy target count on the LB.
- Temporary mitigation: Lower scale-out thresholds and add extra capacity.
- Root cause isolation: Compare recent deploy/AMI/user data/external deps (DB/Cache).
- Permanent fix: Improve health checks, resource limits (ulimit), and add observability points.
- Postmortem: Record learnings in dashboards/runbooks and update tags/naming/policies.
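For the "check auto-recovery" step, a quick count of healthy targets behind the load balancer is often the fastest signal. The sketch below wraps the AWS CLI for that; both function names are hypothetical and the target group ARN is supplied by you.

```shell
#!/bin/sh
# Print the number of healthy targets in a target group.
# Usage: healthy_targets TARGET_GROUP_ARN
healthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "length(TargetHealthDescriptions[?TargetHealth.State=='healthy'])" \
    --output text
}

# Succeed (exit 0) only when at least MIN targets are healthy.
# Usage: enough_healthy TARGET_GROUP_ARN MIN
enough_healthy() {
  [ "$(healthy_targets "$1")" -ge "$2" ]
}
```

During an incident, a failing enough_healthy check against the ASG's minimum size tells you replacement is still in progress, or that new instances are failing health checks as well.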
17. Case Studies (3 Examples)
17.1 Startup API Platform (Tight Deadline)
- Architecture: ALB → ASG (general-purpose t-family) → RDS (managed) → CloudWatch.
- Key points: AMI bake + user data for reproducibility; nightly stops for cost.
- Other clouds: On GCE, MIG + HTTP(S) LB + Cloud SQL; on Azure, VMSS + AppGW + Azure Database.
17.2 Image Processing Batch (Cost-First)
- Architecture: SQS (queue) → Spot instance fleet → results to S3.
- Key points: Interruption handling and retry; store checkpoints in S3.
- Other clouds: Use GCE/Azure Spot VMs + queues (Pub/Sub / Storage Queue) similarly.
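Interruption handling on EC2 Spot usually starts with polling the instance metadata service: the spot/instance-action path returns 404 until an interruption is scheduled, then 200 roughly two minutes before the instance is reclaimed. Below is a minimal sketch using IMDSv2; the function name is hypothetical, and what the worker does on a positive result (checkpoint to S3, stop taking jobs) is up to you.

```shell
#!/bin/sh
# Return 0 if a Spot interruption notice is present for this instance, 1 otherwise.
# Must run on the instance itself; uses IMDSv2 session tokens.
spot_interruption_pending() {
  imds=http://169.254.169.254/latest
  # Obtain a short-lived IMDSv2 session token
  token=$(curl -s -X PUT "$imds/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  # 404 = no interruption scheduled, 200 = notice issued
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "$imds/meta-data/spot/instance-action")
  [ "$code" = "200" ]
}
```

A worker loop can call this every few seconds between work units and, on success, write its checkpoint and stop pulling new messages from the queue.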
17.3 Staged Migration of Core Internal Systems (Safety-First)
- Architecture: ASG (min 1 / max 1) for self-healing → scale to multi-AZ as load grows.
- Key points: Bastionless ops with SSM; define a patch baseline and a fixed patch schedule; store audit logs in a separate account.
- Other clouds: GCE OS Config/Shielded VM, Azure Update Management/Defender for Cloud for comparable posture.
18. Three-Cloud Comparison Summary (From a Practitioner’s View)
- Clarity of design choices: GCE (custom types are straightforward) ≥ EC2 (rich patterns) ≥ Azure (broad options; depends on design skills)
- Enterprise governance/network affinity: Azure > EC2 ≥ GCE
- Maturity of auto-scaling: EC2 (mature ASG + abundant examples) ≥ GCE (simple MIG) ≥ Azure (feature-rich VMSS)
- Operational consistency (bastionless/audit trails): EC2 (SSM/CloudTrail) ≥ Azure (JIT/Bastion/Activity) ≥ GCE (OS Login/IAP/Audit Logs)
- Flexibility in cost optimization: EC2 (Savings Plans/Spot combinations) ≥ GCE (Committed/Sustained/Spot) ≥ Azure (Reserved/Spot)
19. Intended Readers & Concrete Outcomes
- IT departments (on-prem VM migration): Establish like-for-like migration → self-healing → standardized monitoring/patching/audit trails on EC2 using the ASG + SSM pattern.
- Web/line-of-business app developers: Learn stateless API + externalized state, and ship safely with rolling/canary.
- SRE/infrastructure engineers: With VPC/SG/IMDSv2/log aggregation as table stakes, build SLO-based alerting.
- Security/governance leads: Define org-wide guardrails for no secrets on hosts, bastionless access, and audit retention.
- Startup tech leads: Minimal viable setup → patternization → IaC, to build a small-to-start, wide-to-scale platform quickly.
20. Summary: EC2 Stays Steady When You Run It as a “Pattern”
By operating Amazon EC2 with the repeatable ASG + ELB + Launch Template pattern, you can better balance availability, performance, and cost. Center storage on EBS, externalize state, and use SSM for bastionless access. The more closely you follow these principles, the more confidently you can recover from incidents by replacing instances, and the less you need to fear deployments and scaling.
The same design mindset carries across clouds by mapping ASG to MIG/VMSS. Start by putting the minimal setup under IaC, then optimize cost in the order measure → rightsize → apply discounts.
Appendix A: cloud-init (Minimal Web Server)
#cloud-config
package_update: true
packages:
  - nginx
write_files:
  - path: /var/www/html/healthz.html
    content: "ok"
runcmd:
  - systemctl enable nginx
  - systemctl start nginx
  # Write the /healthz location as a snippet, then include it in the default site
  - |
    printf "location /healthz { return 200 'ok'; add_header Content-Type text/plain; }\n" > /etc/nginx/snippets/healthz.conf
  - 'sed -i "/server_name _;/a include snippets/healthz.conf;" /etc/nginx/sites-available/default'
  - systemctl reload nginx
Appendix B: Security Groups (Minimal, ALB-Only Access)
# Create both security groups and capture their IDs
APP_SG=$(aws ec2 create-security-group --group-name app-sg --description "app" \
  --vpc-id vpc-xxx --query GroupId --output text)
ALB_SG=$(aws ec2 create-security-group --group-name alb-sg --description "alb" \
  --vpc-id vpc-xxx --query GroupId --output text)
# App SG: allow 80/TCP only from the ALB's SG
aws ec2 authorize-security-group-ingress --group-id "$APP_SG" \
  --protocol tcp --port 80 --source-group "$ALB_SG"
# ALB SG: allow 80/TCP from the internet
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 80 --cidr 0.0.0.0/0
Appendix C: Simple Weekly Rightsizing Procedure
- Use CloudWatch/Monitoring and the Agent to visualize CPU/memory/disk/network.
- If over two consecutive weeks you see CPU p95 < 40% and memory p95 < 60%, consider downsizing one tier.
- Confirm headroom with a load test, then downsize. Apply rolling updates on the ASG for zero downtime.
- Every three months, review the coverage of discount models (Savings/RI/Committed/Reserved).
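The downsizing rule in this procedure is simple enough to encode, which keeps weekly reviews consistent. The sketch below applies the thresholds stated above (CPU p95 below 40% and memory p95 below 60% for two consecutive weeks); the function name and output strings are hypothetical.

```shell
#!/bin/sh
# Apply the two-week downsizing rule to (cpu_p95, mem_p95) pairs for two weeks.
# Usage: rightsizing_advice CPU_WEEK1 MEM_WEEK1 CPU_WEEK2 MEM_WEEK2   (integer percentages)
rightsizing_advice() (
  # A week qualifies only if BOTH CPU p95 < 40 and memory p95 < 60
  week_ok() { [ "$1" -lt 40 ] && [ "$2" -lt 60 ]; }
  if week_ok "$1" "$2" && week_ok "$3" "$4"; then
    echo "consider-downsize"
  else
    echo "keep"
  fi
)
```

A result of consider-downsize is only a trigger for the load test in the next step, not an instruction to resize immediately.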
Appendix D: Deployment Strategy Templates (Rolling/Blue-Green/Canary)
- Rolling: Increase MaxSurge-equivalent on the ASG and replace in sequence. Lowest risk.
- Blue-Green: Prepare a new ASG separately → switch target groups. Easy rollback.
- Canary: 1 instance → small subset → all. Decide in stages using metrics/logs/traces.
- Common: Clearly document health check URLs/DB migration order/feature flags in the runbook.
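On EC2, one way to realize the rolling strategy is an ASG instance refresh, which replaces instances in batches while keeping a minimum share of the group healthy. The sketch below wraps that call; the function name, the 90% default, and the 120-second warm-up are assumptions to tune for your workload.

```shell
#!/bin/sh
# Start a rolling replacement of all instances in an ASG via instance refresh.
# Usage: rolling_replace ASG_NAME [MIN_HEALTHY_PERCENT]   (default 90)
rolling_replace() {
  aws autoscaling start-instance-refresh \
    --auto-scaling-group-name "$1" \
    --preferences "{\"MinHealthyPercentage\": ${2:-90}, \"InstanceWarmup\": 120}"
}
```

Run it after publishing a new Launch Template version so that replacement instances pick up the new AMI or user data.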
Appendix E: Three Things You Can Do Today
- Create a Launch Template, put cloud-init in user data, and make launches reproducible.
- Ensure self-healing first with an ASG (min 1 / max 1). Add ALB in the next step.
- Enable SSM Agent + Session Manager, retire bastions and key distribution, and set up log collection at the same time.
—With this, your second step with EC2 is rock solid. Next time we’ll cover AWS Lambda, comparing across to Cloud Functions/Azure Functions as well. Stay tuned.
