Scaling the World's Largest Deployment System: From Bare Metal to Cloud Native
The Mission
I'm working on what is likely the world's largest internal deployment system. Not hyperbole—the scale is genuinely staggering.
The numbers:
- 5-6 million deployments per day
- Tens of thousands of distinct services
- Hundreds of thousands of hosts across multiple regions
- Core infrastructure services that power some of the most-used APIs on the planet
For years, this system handled deployments to internal server types: bare metal, custom networking, proprietary infrastructure. In 2025, we undertook a massive expansion: enable deployments to native cloud compute resources—specifically EC2 instances, Auto Scaling Groups, and the full cloud-native stack.
This isn't just adding another deployment target. It's a fundamental shift in how we think about fleets, capacity, and deployment orchestration.
When your deployment system already handles millions of deployments a day to bare metal infrastructure, and you need to expand it to support dynamic, elastic cloud fleets that scale up and down automatically, every design decision matters: at this scale, small inefficiencies multiply into massive problems.
The Challenge: Two Worlds Colliding
The existing deployment system was built for a world where capacity was relatively static. You provisioned servers, they got racked, and they stayed there until they were decommissioned. Fleet membership changed slowly and predictably.
Cloud-native infrastructure is fundamentally different. Instances come and go based on traffic patterns. Fleets are defined by queries, not static lists. The entire mental model shifts from "deploy to these specific hosts" to "deploy to whatever hosts match these criteria right now."
The Old World: Bare Metal
Characteristics:
- Fixed capacity: Servers are provisioned and sit there
- Static fleet membership: You know every server by name
- Predictable networking: Custom topology, well-understood
- Lifecycle control: We own the hardware from power-on to decommission
The deployment model was straightforward: maintain a registry of all hosts, query the registry for hosts matching a service name, push the deployment artifact to those hosts. The host list might have hundreds or thousands of entries, but it changed slowly—maybe a few additions or removals per week.
The New World: Cloud Native
Characteristics:
- Dynamic capacity: Instances come and go based on autoscaling
- Fluid fleet membership: The set of instances changes constantly
- Standard networking: VPCs, security groups, IAM roles
- Shared lifecycle: Cloud provider manages underlying infrastructure
The paradigm shift: you can no longer think of deployments as "push this artifact to this list of hosts." Instead, you're deploying to a constantly-evolving query result. An Auto Scaling Group might have 50 instances right now, 75 instances in an hour, and 30 instances tomorrow. Your deployment system needs to handle all of these scenarios gracefully.
Problem 1: How Do You Deploy to a Moving Target?
In the bare metal world, deployments were deterministic. Start with a list of 1,000 hosts, deploy to all 1,000, you're done. Success is binary: did all hosts get the update?
In the cloud-native world, the question becomes: what does "all hosts" even mean? The fleet membership changes while you're deploying. New instances launch. Old instances terminate. At any given moment, "all hosts" is a snapshot that's already stale.
The Tag-Based Fleet Model
In cloud environments, infrastructure is often described using tags:
{
  "InstanceId": "i-abc123",
  "Tags": [
    {"Key": "Environment", "Value": "production"},
    {"Key": "Service", "Value": "api-gateway"},
    {"Key": "Version", "Value": "v2"},
    {"Key": "Region", "Value": "us-east-1"}
  ]
}
The insight: What if deployments targeted tag queries instead of static instance lists?
Instead of maintaining a static registry of hosts, we could query EC2 for instances matching a tag combination. This is how Auto Scaling Groups already work—instances are ephemeral, but the logical fleet is defined by tags.
The deployment model becomes: "Deploy to all instances where Service=API and Environment=production". The actual instances that match can change over time, but the deployment definition remains stable.
Dynamic Fleet Resolution
# Conceptual model
import boto3

class TagBasedFleet:
    def __init__(self, tag_query):
        self.tag_query = tag_query  # e.g., {"Service": "api-gateway", "Environment": "prod"}
        self.ec2_client = boto3.client('ec2')
        self.cache_ttl = 60  # seconds

    def resolve(self):
        """Resolve the tag query to the current set of instances."""
        # Query the EC2 API for running instances matching the tags
        filters = [
            {'Name': f'tag:{k}', 'Values': [v]}
            for k, v in self.tag_query.items()
        ]
        filters.append({'Name': 'instance-state-name', 'Values': ['running']})
        response = self.ec2_client.describe_instances(Filters=filters)
        # Extract instance IDs from the (potentially paginated) response
        instances = []
        for reservation in response['Reservations']:
            for instance in reservation['Instances']:
                instances.append(instance['InstanceId'])
        # This set changes over time!
        # By the time we finish deploying, the fleet may have changed.
        return instances

    def deploy(self, artifact):
        """Deploy to all matching instances."""
        # Snapshot the fleet at the start of the deployment
        current_instances = self.resolve()
        # But what if instances are added/removed during deployment?
        # - New instances won't have the artifact
        # - Terminated instances will fail deployment
        # - We need a strategy to handle both
        for instance_id in current_instances:
            try:
                self._deploy_to_instance(instance_id, artifact)
            except InstanceTerminatedException:
                # Expected - the instance scaled down during deployment
                continue
The consistency challenge is real: at 5-6 million deployments per day across fleets that can scale by hundreds of instances per minute, this isn't a theoretical problem—it's a daily reality.
The Snapshot Problem
Deployment starts with 100 instances matching the tag query. Mid-deployment, autoscaling adds 20 more. What happens?
Option 1: Deploy to snapshot
- Pro: Consistent deployment target
- Con: New instances don't get the update
Option 2: Continuously resolve
- Pro: New instances get the update
- Con: Deployment never "completes"—it's a continuous process
The approach we took: Hybrid
- Deploy to snapshot - We snapshot the fleet at deployment start and deploy to those instances
- Lifecycle hook integration - New instances that launch during or after deployment get the current version via Auto Scaling lifecycle hooks
- Convergence monitoring - Track which instances are running which versions, alert on divergence
This gives us the best of both worlds: deterministic deployments that complete, plus automated handling of newly-launched instances.
The tradeoff: increased complexity. Now the deployment system needs to integrate deeply with the Auto Scaling lifecycle, not just push artifacts.
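The convergence-monitoring piece of the hybrid can be as simple as comparing each instance's reported version against the fleet's target. A minimal sketch, assuming each instance record carries a running_version field (an illustrative shape, not our actual data model):

from collections import Counter

def check_convergence(fleet_instances, target_version):
    """Summarize which versions a fleet is running and flag divergence."""
    versions = Counter(i['running_version'] for i in fleet_instances)
    diverged = [i['instance_id'] for i in fleet_instances
                if i['running_version'] != target_version]
    return {
        'target_version': target_version,
        'version_counts': dict(versions),      # e.g. {'v2': 118, 'v1': 2}
        'diverged_instances': diverged,        # instances to alert on
        'converged': not diverged,
    }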
Problem 2: Integration with Autoscaling
Auto Scaling Groups (ASGs) are constantly making decisions about capacity. They launch instances when load increases, terminate instances when it decreases. These decisions are made independently of our deployment system.
The challenge: these two systems need to coordinate. If we're deploying to an instance and ASG decides to terminate it, what happens? If ASG launches a new instance mid-deployment, does it get the new version or the old one?
At our scale, with thousands of ASGs and millions of deployments daily, these race conditions aren't edge cases—they're the normal operating mode.
The Autoscaling Lifecycle
Cloud autoscaling groups have a lifecycle:
Scale Up Event → Instance Launch → Initialization → In Service → Running
↓
Scale Down Event ← Instance Draining ← Termination Signal
The Deployment Challenge:
- Don't deploy to instances that are terminating
- Do deploy to instances that are launching
- Handle instances that scale up mid-deployment
- Gracefully handle instances that scale down mid-deployment
The race condition matrix:
| Deployment State | Scale Up Event | Scale Down Event |
|------------------|----------------|------------------|
| Not Started | New instance needs current version | Instance terminated, skip it |
| In Progress | New instance needs current version | Cancel deployment to that instance |
| Completed | New instance needs current version | Normal termination |
Lifecycle Hooks: The Integration Point
EC2 Auto Scaling offers lifecycle hooks—callbacks during scaling events. When an instance is launching or terminating, ASG can pause and notify you, giving your system time to prepare the instance or clean it up.
How it works:
- You create a lifecycle hook on the ASG
- When a scaling event occurs, ASG publishes a message to SNS/SQS
- Your service receives the message and performs actions
- You complete the lifecycle action, allowing ASG to continue
This is the bridge between the deployment system and Auto Scaling. Lifecycle hooks let us ensure every instance—whether launched before, during, or after a deployment—ends up with the correct version.
# Conceptual lifecycle hook integration
class AutoscalingIntegration:
    def __init__(self, deployment_service, asg_client):
        self.deployment_service = deployment_service
        self.asg_client = asg_client                      # boto3 'autoscaling' client
        self.lifecycle_hook_name = 'deployment-hook'      # hook created on the ASG (illustrative name)
        self.max_wait_time = 300  # 5 minutes

    def on_instance_launching(self, instance_id, lifecycle_action_token, asg_name):
        """Called when autoscaling is launching a new instance."""
        try:
            # Wait for the instance to be reachable (SSH/SSM)
            self._wait_for_instance_ready(instance_id, timeout=120)
            # Query what version this fleet should be running
            target_version = self.deployment_service.get_target_version_for_fleet(asg_name)
            # Deploy that version to this instance
            self.deployment_service.deploy_to_instance(instance_id, target_version)
            # Run health checks
            if not self._health_check(instance_id):
                raise HealthCheckFailedException()
            # Success! Complete the lifecycle action
            self.asg_client.complete_lifecycle_action(
                LifecycleHookName=self.lifecycle_hook_name,
                LifecycleActionToken=lifecycle_action_token,
                AutoScalingGroupName=asg_name,
                LifecycleActionResult='CONTINUE'
            )
        except Exception:
            # Deployment failed - abandon this instance
            self.asg_client.complete_lifecycle_action(
                LifecycleHookName=self.lifecycle_hook_name,
                LifecycleActionToken=lifecycle_action_token,
                AutoScalingGroupName=asg_name,
                LifecycleActionResult='ABANDON'  # ASG will terminate it
            )

    def on_instance_terminating(self, instance_id, lifecycle_action_token, asg_name):
        """Called when autoscaling is terminating an instance."""
        # Cancel any in-progress deployment to this instance
        self.deployment_service.cancel_deployment(instance_id)
        # Drain connections (let in-flight requests complete)
        self._drain_connections(instance_id, timeout=60)
        # Complete the lifecycle action - OK to terminate now
        self.asg_client.complete_lifecycle_action(
            LifecycleHookName=self.lifecycle_hook_name,
            LifecycleActionToken=lifecycle_action_token,
            AutoScalingGroupName=asg_name,
            LifecycleActionResult='CONTINUE'
        )
The implementation looks straightforward, but the devil is in the details: timeouts, retries, error handling, and observability at scale.
The Coordination Challenge
Now you have two systems trying to manage instance state:
- Deployment system: "I'm updating this instance"
- Autoscaling: "I'm terminating this instance"
Who wins? How do you coordinate?
The coordination protocol:
We use a distributed state machine tracked in a highly available data store (think DynamoDB). Each instance has a state:
- DEPLOYING - Deployment in progress
- DEPLOYED - Successfully deployed
- TERMINATING - ASG is terminating
- DRAINING - Connections being drained
- ERROR - Deployment failed
State transitions are atomic. When ASG signals termination, we try to transition from DEPLOYING → TERMINATING. If that succeeds, we cancel the deployment. If the deployment already completed (DEPLOYED state), we just drain and terminate normally.
The key insight: both systems need to agree on instance state, and state transitions must be atomic to avoid race conditions.
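A minimal sketch of that atomic transition using a DynamoDB conditional write. The table name and attribute names here are illustrative, not our actual schema:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def try_transition(instance_id: str, from_state: str, to_state: str) -> bool:
    """Atomically move an instance between states; fail if another actor
    (deployment vs. termination) already changed the state."""
    try:
        dynamodb.update_item(
            TableName='instance-deployment-state',      # assumed table name
            Key={'InstanceId': {'S': instance_id}},
            UpdateExpression='SET #s = :to',
            ConditionExpression='#s = :from',           # only succeed if still in from_state
            ExpressionAttributeNames={'#s': 'State'},
            ExpressionAttributeValues={
                ':from': {'S': from_state},
                ':to': {'S': to_state},
            },
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False   # lost the race; the other system owns this instance now
        raise

# e.g. on a termination signal, only cancel the deployment if we win the race:
# if try_transition(instance_id, 'DEPLOYING', 'TERMINATING'):
#     deployment_service.cancel_deployment(instance_id)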
Architecture Evolution
The original architecture was designed for a static world. Adapting it for dynamic, elastic infrastructure required fundamental changes—not just bolting on new features, but rethinking core abstractions.
Original Architecture (Bare Metal)
Deployment Controller
↓
Fixed Server Registry
↓
Server Agents (long-lived connections)
This worked well for years. The registry was authoritative. Servers rarely changed. Deployments were predictable. The system scaled horizontally by sharding based on service name.
At 5-6 million deployments per day, this architecture handles bare metal beautifully. But it doesn't handle dynamic fleets.
New Architecture (Cloud Native)
Deployment Controller
↓
├─→ Fixed Server Registry (legacy)
└─→ Tag Query Resolver → Cloud Provider API
↓
Dynamic Instance Set
↓
Instance Agents (ephemeral connections)
↓
Lifecycle Hook Integration
Key additions:
- Tag Query Resolver: Queries EC2 API to resolve tag-based fleet definitions
- Dynamic Instance Set: The fleet membership changes continuously
- Lifecycle Hook Integration: SQS queues receiving ASG lifecycle events
- Dual-mode support: Both static registry and dynamic tag queries work simultaneously
The system needs to handle both models during the transition period, then continue supporting bare metal even as cloud adoption grows.
The Abstraction Layer
We needed an abstraction that unified both models:
# Conceptual abstraction
from abc import ABC, abstractmethod
from typing import Iterator, List

class Fleet(ABC):
    @abstractmethod
    def get_instances(self) -> List['Instance']:
        """Return the current set of instances in the fleet."""
        ...

    @abstractmethod
    def watch_changes(self) -> Iterator['FleetChangeEvent']:
        """Stream of add/remove events."""
        ...

class StaticFleet(Fleet):
    """Fixed set of instances (bare metal)."""
    def __init__(self, registry_client, service_name):
        self.registry = registry_client
        self.service_name = service_name

    def get_instances(self):
        # Query the static registry
        return self.registry.get_hosts_for_service(self.service_name)

    def watch_changes(self):
        # Bare metal fleets change rarely; poll the registry every few minutes
        return self.registry.poll_for_changes(self.service_name, interval=300)

class TagBasedFleet(Fleet):
    """Dynamic set based on a tag query (cloud)."""
    def __init__(self, ec2_client, tag_query):
        self.ec2 = ec2_client
        self.tag_query = tag_query

    def get_instances(self):
        # Query the EC2 API for instances matching the tags
        return self._resolve_tag_query()

    def watch_changes(self):
        # Subscribe to ASG lifecycle events via SQS:
        # real-time notifications of instance adds/removes
        return self._subscribe_to_lifecycle_events()

class DeploymentOrchestrator:
    def deploy(self, fleet: Fleet, artifact: 'Artifact'):
        """Deploy to any Fleet implementation."""
        # Get the initial snapshot
        instances = fleet.get_instances()
        # Start the deployment
        deployment_id = self._create_deployment(instances, artifact)
        # Watch for fleet changes and handle them
        for event in fleet.watch_changes():
            if event.type == 'INSTANCE_ADDED':
                self._deploy_to_instance(event.instance_id, artifact)
            elif event.type == 'INSTANCE_REMOVED':
                self._cancel_deployment(event.instance_id)
        return deployment_id
This abstraction is critical. It lets us write deployment logic once and have it work for both bare metal and cloud-native fleets. The Deployment Orchestrator doesn't need to know whether it's deploying to a static registry or a dynamic tag query—it just works with the Fleet interface.
Benefits:
- Unified codebase for both infrastructure types
- Easier testing (mock the Fleet interface)
- Future-proof (new fleet types just implement the interface)
- Gradual migration (teams can move from StaticFleet to TagBasedFleet at their own pace)
Deploying to Tags: The Devil Is in the Details
Tag-based deployment sounds elegant in theory. In practice, you hit a dozen edge cases.
Challenge 1: Tag Propagation Latency
You add a tag to an instance. How long until it's visible in the deployment system?
EC2 tags are eventually consistent. When you add or update a tag, it typically shows up in API queries within seconds, but there's no guarantee. During periods of high API load, propagation can take longer.
Our caching strategy:
- Fleet resolution cache: 60-second TTL
- Forced refresh: On deployment start, force a fresh query (bypass cache)
- Invalidation: When we receive lifecycle hooks, invalidate cache for that ASG
The tradeoff: fresher data means more API calls, which means hitting rate limits faster. Stale data means potentially deploying to the wrong instances.
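A minimal sketch of that caching layer, with the 60-second TTL from above and otherwise illustrative names:

import time

class FleetResolutionCache:
    """TTL cache over tag-query resolution, with forced refresh and
    per-fleet invalidation. Interfaces here are illustrative."""

    def __init__(self, resolver, ttl_seconds=60):
        self.resolver = resolver       # callable: tag_query -> list of instance IDs
        self.ttl = ttl_seconds
        self._entries = {}             # cache key -> (timestamp, instances)

    def get(self, tag_query, force_refresh=False):
        key = tuple(sorted(tag_query.items()))
        entry = self._entries.get(key)
        if force_refresh or entry is None or time.time() - entry[0] > self.ttl:
            instances = self.resolver(tag_query)         # fresh EC2 query
            self._entries[key] = (time.time(), instances)
            return instances
        return entry[1]

    def invalidate(self, tag_query):
        """Called when a lifecycle hook signals a change for this fleet."""
        self._entries.pop(tuple(sorted(tag_query.items())), None)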
Challenge 2: Tag Mutation During Deployment
What if someone changes tags on instances mid-deployment?
Example: You start deploying to all instances tagged Version=v1. Mid-deployment, someone updates half the fleet to Version=v2. Should the deployment continue to those instances?
Our solution: snapshot-based versioning. When a deployment starts, we snapshot the resolved fleet and record it with a deployment ID. The deployment operates on that snapshot, regardless of tag changes. This gives us:
- Consistency: Deployment targets don't change mid-flight
- Audit trail: We know exactly which instances were targeted
- Idempotency: Retrying a deployment targets the same instances
Tradeoff: If someone urgently needs to remove an instance from the deployment target, they can't just change tags—they need to cancel the deployment.
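A sketch of what the snapshot record might contain; the field names are illustrative, and the Fleet interface is the abstraction described earlier:

import time
import uuid

def create_deployment_snapshot(fleet, artifact_version):
    """Resolve the fleet once, persist the result with the deployment,
    and operate only on that frozen record afterwards."""
    return {
        'deployment_id': str(uuid.uuid4()),
        'artifact_version': artifact_version,
        'created_at': time.time(),
        # Frozen for the lifetime of the deployment: consistency, audit trail,
        # and idempotent retries all come from this one field.
        'target_instances': sorted(fleet.get_instances()),
    }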
Challenge 3: Tag Query Complexity
Simple tag query: Environment=production
Complex tag query: (Environment=production AND Service=api) OR (Environment=staging AND Tag=canary-enabled)
We support a simple query language that maps to EC2 filter syntax:
- AND/OR logic
- Negation (NOT)
- Wildcard matching
- Multiple values per key
But complex queries are slow—each OR clause potentially requires a separate API call. At scale, a single complex query for a large fleet can take 10+ seconds.
Optimization: pre-compute and cache common queries, expire based on fleet churn rate.
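Roughly how an AND clause maps onto EC2 filters, and why each OR branch costs its own API call. This is a sketch of the mapping only, not our production query compiler (NOT and wildcards are omitted):

def and_clause_to_filters(clause):
    """Translate {"Environment": "production", "Service": "api"} into
    EC2 DescribeInstances filters, restricted to running instances."""
    filters = [{'Name': f'tag:{key}',
                'Values': values if isinstance(values, list) else [values]}
               for key, values in clause.items()]
    filters.append({'Name': 'instance-state-name', 'Values': ['running']})
    return filters

def resolve_or_query(ec2_client, or_clauses):
    """Each OR branch becomes a separate paginated DescribeInstances call;
    results are unioned. This is why complex queries get slow at scale."""
    instance_ids = set()
    for clause in or_clauses:
        paginator = ec2_client.get_paginator('describe_instances')
        for page in paginator.paginate(Filters=and_clause_to_filters(clause)):
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_ids.add(instance['InstanceId'])
    return instance_ids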
Challenge 4: Authorization
In the old model, we controlled server access. In the cloud, IAM roles and permissions are critical.
The authorization model:
- Deployment service IAM role: Has ec2:DescribeInstances and ec2:DescribeTags permissions
- Tag-based access control: Teams can restrict deployments using tag-based IAM policies
- Service-level auth: Teams must be authorized to deploy a specific service, regardless of where it runs
Example policy: "You can deploy ServiceX to any instances tagged with Team=platform, but not to Team=frontend"
This decentralizes control—teams own their infrastructure via tags, and the deployment system respects those boundaries.
Performance at Scale
At 5-6 million deployments per day, performance isn't just a nice-to-have—it's existential. Small inefficiencies compound into massive problems.
Fleet Resolution Latency
Resolving a tag query to instances:
- Bare metal registry lookup: < 10ms (in-memory cache)
- Cloud API query (uncached): 200-800ms depending on fleet size
At scale:
- 1,000 instances: ~250ms (single DescribeInstances call)
- 10,000 instances: ~2 seconds (pagination required, multiple API calls)
- 100,000 instances: ~15-20 seconds (heavy pagination, potential rate limiting)
Caching strategy:
- Hot cache: 60-second TTL for frequently-queried fleets
- Pre-warming: Background jobs refresh caches for large fleets before they're likely to be queried
- Partial caching: Cache instances by tag value, combine cached results for multi-tag queries
This reduced average resolution time from 5 seconds to under 200ms for 90% of queries.
Deployment Velocity
- Peak throughput: 8,000-10,000 deployments per minute
- Average instance deployment time: 30-60 seconds (download artifact, install, health check)
- Parallel deployments per fleet: 100-500 instances simultaneously (configurable based on fleet size and risk tolerance)
The bottleneck shifted from fleet resolution to artifact distribution. We addressed this with:
- Regional artifact caches (S3 with CloudFront)
- Delta deployments (only transfer changed files)
- Peer-to-peer distribution for very large artifacts
API Rate Limiting
EC2 API rate limits are real, especially for DescribeInstances. At our scale, we hit them daily.
Strategies for staying under limits:
- Request coalescing: Batch multiple deployment requests for the same fleet
- Exponential backoff with jitter: When throttled, back off with randomized retry timing
- Regional partitioning: Distribute API calls across regions to use separate quota pools
- Quota monitoring: Track API call rates, alert before hitting limits
We also worked with AWS to increase our rate limits for specific APIs based on our usage patterns.
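The backoff item above, as a minimal sketch. The retryable error codes and parameter values are illustrative, not an exhaustive list:

import random
import time
from botocore.exceptions import ClientError

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a throttled API call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ClientError as e:
            code = e.response['Error']['Code']
            if code not in ('Throttling', 'RequestLimitExceeded'):
                raise                                  # only retry throttling errors
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# e.g. call_with_backoff(lambda: ec2.describe_instances(Filters=filters))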
The Lifecycle Hook Dance
Lifecycle hooks are conceptually simple but operationally complex. Here's what actually happens in production.
Launching Instances
1. Autoscaling decides to scale up
2. Cloud provider launches new instance
3. Instance transitions to "Pending"
4. Lifecycle hook fires: "Instance launching"
5. Deployment system receives notification
6. Wait for instance to be reachable
7. Deploy current version
8. Run health checks
9. Complete lifecycle action
10. Instance transitions to "InService"
Timing details:
- Steps 1-4: < 1 second
- Step 5-6 (wait for reachable): 30-120 seconds (boot time + initialization)
- Step 7 (deploy): 30-60 seconds
- Step 8 (health check): 10-30 seconds
- Total: 70-210 seconds per instance launch
Failure scenarios:
- Instance fails to become reachable (timeout: 5 min) → ABANDON instance
- Deployment fails (artifact corruption, disk full, etc.) → ABANDON instance
- Health check fails → ABANDON instance
- Lifecycle hook timeout (our limit: 10 min) → ASG proceeds anyway, instance may be unhealthy
If we ABANDON an instance, ASG terminates it and launches a replacement. This automatically retries, which usually resolves transient failures.
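One way to implement the "wait for instance to be reachable" step is to poll SSM until the agent on the new instance reports in. A sketch under the assumption that instances run the SSM agent; the production check may differ:

import time
import boto3

def wait_for_instance_ready(instance_id, timeout=120, poll_interval=10):
    """Poll SSM until the instance's agent is Online, or give up after timeout."""
    ssm = boto3.client('ssm')
    deadline = time.time() + timeout
    while time.time() < deadline:
        info = ssm.describe_instance_information(
            Filters=[{'Key': 'InstanceIds', 'Values': [instance_id]}]
        )
        if any(i['PingStatus'] == 'Online' for i in info['InstanceInformationList']):
            return True
        time.sleep(poll_interval)
    return False   # caller will ABANDON the lifecycle action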
Terminating Instances
1. Autoscaling decides to scale down
2. Lifecycle hook fires: "Instance terminating"
3. Deployment system receives notification
4. Cancel any in-progress deployment
5. Begin connection draining
6. Wait for draining to complete (timeout: X minutes)
7. Complete lifecycle action
8. Instance terminates
Edge cases:
- Deployment was already complete → Just drain and terminate normally
- No deployment was happening → Just drain and terminate
- Draining times out (> 60s) → Complete lifecycle action anyway, let ASG terminate
- Multiple termination hooks for same instance (rare but possible) → Idempotent handling, only drain once
Race Conditions
Scenario: Deployment starts, then scale-down event
- Deployment system resolves fleet, gets 100 instances, starts deploying
- 50 instances deployed successfully
- ASG decides to scale down, selects 20 instances for termination (some already deployed, some not)
- Lifecycle hooks fire for those 20 instances
- Deployment system receives termination notifications
- For instances already deployed: drain and complete
- For instances mid-deployment: cancel deployment, drain, complete
- For instances not yet deployed: skip them entirely
The atomic state transitions in DynamoDB ensure we handle this correctly—we can't deploy and terminate simultaneously.
Scenario: Scale-up event during deployment
- Deployment resolves fleet: 100 instances
- Start deploying to those 100
- ASG scales up, adds 20 new instances
- Lifecycle hooks fire for the 20 new instances
- Deployment system receives launch notifications
- Query: what version should these instances run?
- Deploy that version (same one the existing fleet is getting)
- Original deployment completes with 100 instances
- Lifecycle deployments complete with 20 instances
- Net result: All 120 instances running the same version
This is where the hybrid approach shines—we don't need to continuously resolve the fleet during deployment. Lifecycle hooks handle the new instances automatically.
Monitoring and Observability
At this scale, observability isn't optional—it's how you debug distributed systems spanning millions of instances.
Monitoring infrastructure:
- Metrics: CloudWatch + internal time-series database
- Logs: Centralized logging (CloudWatch Logs + internal log aggregation)
- Traces: Distributed tracing for end-to-end deployment lifecycle
- Dashboards: Real-time dashboards for each service team + central ops dashboard
Key Metrics
Deployment Metrics:
- Active deployments
- Deployment velocity (instances/min)
- Success rate
- Rollback rate
Fleet Metrics:
- Fleet size by tag query
- Churn rate (instances added/removed)
- Tag propagation latency
Lifecycle Metrics:
- Lifecycle hook latency
- Hook timeout rate
- Instance launch-to-ready time
Alerting:
- Deployment success rate < 95% → Page on-call
- Lifecycle hook timeout rate > 5% → Alert
- API throttling detected → Alert + auto-scale workers
- Fleet resolution time > 10s → Warning
We also built anomaly detection: if deployment failure rate for a specific service suddenly spikes, automatically alert that service's on-call.
Testing Strategies
Testing a system this complex requires multiple strategies.
Unit Tests
Standard fare—test individual components in isolation:
- Fleet resolution logic (mocked EC2 APIs)
- Deployment state machine transitions
- Tag query parsing and optimization
- Authorization logic
High coverage (> 85%) on core business logic.
Integration Tests
Test deployment orchestration with mock cloud APIs:
- Spin up mock EC2 API endpoints
- Simulate tag queries returning different results over time
- Simulate lifecycle hooks firing during deployments
- Verify correct state transitions
These tests caught race conditions that unit tests missed.
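As an illustration of the approach (not our actual harness), a test against a mocked EC2 API using the open-source moto library might look like this:

import boto3
from moto import mock_aws   # moto's decorator for in-process mocked AWS APIs

@mock_aws
def test_tag_query_resolution_sees_only_matching_instances():
    """Launch fake tagged instances, then verify that the tag-filter query
    used by fleet resolution returns exactly those instances."""
    ec2 = boto3.client('ec2', region_name='us-east-1')
    ec2.run_instances(
        ImageId='ami-12345678', MinCount=2, MaxCount=2,
        TagSpecifications=[{
            'ResourceType': 'instance',
            'Tags': [{'Key': 'Service', 'Value': 'api-gateway'},
                     {'Key': 'Environment', 'Value': 'prod'}],
        }],
    )
    response = ec2.describe_instances(Filters=[
        {'Name': 'tag:Service', 'Values': ['api-gateway']},
        {'Name': 'tag:Environment', 'Values': ['prod']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ])
    instance_ids = [i['InstanceId']
                    for r in response['Reservations'] for i in r['Instances']]
    assert len(instance_ids) == 2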
Chaos Tests
Inject failures and verify graceful degradation:
- EC2 API timeouts (throttling simulation)
- Autoscaling events mid-deployment
- Network partitions between components
- DynamoDB unavailability
- Lifecycle hook message delivery delays
We ran chaos tests in a staging environment weekly, and found real bugs every time.
Shadow Mode
Before going live, we ran in shadow mode:
- Resolve tag queries but don't deploy
- Integrate with lifecycle hooks but don't block
- Collect metrics and validate behavior
Shadow mode learnings:
- Tag queries for some large fleets took 20+ seconds (led to caching strategy)
- Lifecycle hook messages had ~1% delivery delays > 30s (led to retry logic)
- Some tag queries returned inconsistent results across API calls (eventual consistency handling)
- DynamoDB hot partition on deployment state table (led to partition key redesign)
Shadow mode ran for 4 weeks, processing millions of shadow deployments. This caught issues that would have been catastrophic in production.
Rollout Strategy
Given the scale and criticality, rollout was extremely cautious.
Phase 1: Read-Only Integration
- Integrate with cloud APIs
- Resolve tag queries
- No actual deployments
Phase 2: Opt-In Beta
- Select teams deploy to cloud resources
- High-touch support
- Rapid iteration based on feedback
Phase 3: General Availability
- All teams can deploy to cloud
- Documentation and training
- Self-service tooling
Adoption metrics:
- Week 1: 50 teams, 10,000 deployments
- Month 1: 300 teams, 200,000 deployments
- Month 3: 800 teams, 1.5 million deployments
- Month 6: 1,500+ teams, 3 million+ cloud deployments per day
Success rate remained above 98% throughout the rollout.
What Went Wrong
Let's be honest—at this scale, incidents are inevitable. Here are three that taught us important lessons.
Incident 1: The Great Tag Storm
What happened:
An automation script had a bug that caused it to rapidly update tags on 50,000 instances—adding and removing the same tag every few seconds. This created a storm of cache invalidations in our fleet resolution system.
Our tag query cache relied on change notifications from lifecycle hooks. The rapid tag changes caused:
- Cache thrashing (constant invalidation and refresh)
- EC2 API throttling (excessive DescribeInstances calls)
- Deployment delays (couldn't resolve fleets reliably)
Impact: Deployments to affected fleets delayed by 10-30 minutes for 2 hours.
Resolution:
- Rate-limited cache invalidations (max one per fleet per 30 seconds)
- Added tag change rate monitoring/alerting
- Fixed the buggy automation script
Learning: Never assume external systems will behave rationally. Protect yourself with rate limits.
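In practice, the guard is a per-fleet rate limiter on cache invalidations. A minimal sketch with the illustrative 30-second window mentioned above:

import time

class RateLimitedInvalidator:
    """Allow at most one cache invalidation per fleet per window,
    no matter how fast tags churn. Interfaces are illustrative."""

    def __init__(self, invalidate_fn, min_interval_seconds=30):
        self.invalidate_fn = invalidate_fn       # e.g. cache.invalidate
        self.min_interval = min_interval_seconds
        self._last = {}                          # fleet key -> last invalidation time

    def maybe_invalidate(self, fleet_key):
        now = time.time()
        if now - self._last.get(fleet_key, 0.0) < self.min_interval:
            return False    # drop it; the cache TTL will refresh the entry anyway
        self._last[fleet_key] = now
        self.invalidate_fn(fleet_key)
        return True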
Incident 2: Lifecycle Hook Timeout Cascade
What happened:
A deployment artifact for a high-traffic service was corrupted in S3. New instances launched via ASG would:
- Receive lifecycle hook
- Attempt to deploy the artifact
- Artifact download fails
- Retry download (with backoff)
- Eventually timeout the lifecycle hook (10 min)
- ASG abandons the instance and launches a replacement
- Replacement hits the same issue
This created a launch/fail/relaunch loop. Over 30 minutes, ASG launched and abandoned 200+ instances trying to meet capacity.
Impact: Service capacity reduced by 30% for 45 minutes, users experienced elevated latency.
Resolution:
- Detected pattern (rapidly abandoned instances)
- Automatic alert triggered
- On-call identified corrupted artifact
- Rolled back to previous version
- ASG stabilized
Learning: Build circuit breakers. If multiple consecutive instances fail lifecycle hooks for the same reason, pause and alert rather than burning through instances.
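Conceptually, the circuit breaker tracks consecutive lifecycle failures per ASG and trips before the launch/abandon loop burns through capacity. A sketch with illustrative thresholds:

import time

class LaunchCircuitBreaker:
    """Trip after N consecutive failed launches for the same ASG; while open,
    pause abandoning instances and page a human instead."""

    def __init__(self, failure_threshold=5, cooldown_seconds=900):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self._failures = {}      # asg_name -> consecutive failure count
        self._opened_at = {}     # asg_name -> time the breaker tripped

    def record_success(self, asg_name):
        self._failures[asg_name] = 0

    def record_failure(self, asg_name):
        self._failures[asg_name] = self._failures.get(asg_name, 0) + 1
        if self._failures[asg_name] >= self.failure_threshold:
            self._opened_at[asg_name] = time.time()

    def is_open(self, asg_name):
        opened = self._opened_at.get(asg_name)
        if opened is None:
            return False
        if time.time() - opened > self.cooldown:
            # Half-open: allow another attempt after the cooldown
            del self._opened_at[asg_name]
            self._failures[asg_name] = 0
            return False
        return True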
Incident 3: The Deployment That Never Ended
What happened:
A deployment started for a fleet of 1,000 instances. Due to a bug in our snapshot logic, we didn't properly snapshot the fleet—instead, we continuously re-resolved it.
Meanwhile, the fleet was experiencing heavy autoscaling (demand spike). Instances were launching and terminating rapidly. Our deployment system tried to deploy to every instance it saw, including ones that would terminate seconds later.
The deployment ran for 6 hours, deploying to over 3,500 instances (due to churn), before we manually cancelled it.
Impact: Deployment resources exhausted, other deployments queued behind it were delayed.
Resolution:
- Manually cancelled the deployment
- Fixed the snapshot bug
- Added deployment duration alerts (anything > 2 hours triggers investigation)
Learning: Always snapshot your target fleet. Never continuously resolve during deployment unless you explicitly intend to.
What Went Right
Not everything was incidents and firefighting. Several architectural decisions proved invaluable.
1. The Abstraction Layer
The unified Fleet interface allowed us to support both models without duplicating logic.
2. Incremental Rollout
Shadow mode and phased rollout prevented catastrophic failures.
3. Strong Observability
Metrics and logging allowed rapid debugging of complex distributed issues.
Lessons Learned
1. Dynamic Infrastructure Requires Different Thinking
You can't treat cloud instances like bare metal servers with different names.
The mental model shift is profound: you're no longer deploying to a fixed set of hosts. You're deploying to a logical fleet defined by a query, and that query result changes over time. This affects everything—how you think about deployment completion, how you handle failures, how you monitor convergence.
Teams that tried to use cloud instances exactly like bare metal servers struggled. Teams that embraced the dynamic model thrived.
2. Tags Are Powerful But Tricky
Tag-based fleet management is flexible, but consistency is hard.
Best practices we discovered:
- Always snapshot the fleet at deployment start
- Cache aggressively but invalidate intelligently
- Monitor tag propagation latency
- Use lifecycle hooks as the source of truth for fleet changes
- Document your tagging schema and enforce it (consider tag policies)
- Never use tag queries that can return wildly different results second-to-second
3. Lifecycle Integration Is Critical
You can't just deploy to cloud instances—you must integrate with their lifecycle.
Autoscaling groups will make capacity decisions independent of your deployment system. If you don't integrate with lifecycle hooks, you'll constantly be fighting with ASG. Instances will launch without the right version. Instances will terminate mid-deployment causing failures.
The investment in lifecycle hook integration was significant—but it's non-negotiable for production-quality deployments to elastic infrastructure.
4. API Limits Are Real
At scale, cloud provider APIs become a bottleneck.
Strategies that worked for us:
- Cache everything you can
- Use lifecycle hooks for change notifications instead of polling
- Batch API calls when possible
- Monitor your API call rates and set alerts
- Work with your cloud provider to understand and optimize quotas
- Build rate limiting and backoff into every API client
5. Observability Is Non-Negotiable
Debugging distributed systems without metrics is impossible.
The ROI of our observability investment was immediate and ongoing. Every incident was resolved faster because we had:
- Deployment-level traces showing exactly what happened
- Metrics at every component boundary
- Automated anomaly detection
- Dashboards that made patterns obvious
We spent ~20% of development time on observability infrastructure. It paid for itself within the first month of production traffic.
The Road Ahead
The system works, but there's always room for improvement.
1. Cross-Cloud Support
Currently, this implementation is AWS-specific (EC2, ASG, tags, lifecycle hooks). But the abstraction layer was designed with multi-cloud in mind.
Challenges:
- Different cloud providers have different autoscaling primitives
- Tag semantics vary (AWS tags vs Azure tags vs GCP labels)
- Lifecycle hooks work differently (or don't exist)
- API rate limits and consistency guarantees differ
The Fleet interface should work, but each cloud provider needs its own implementation of TagBasedFleet. That's significant engineering work.
2. Predictive Scaling Integration
AWS ASG supports predictive scaling—it forecasts future load and scales proactively. If ASG knows it will scale up 100 instances in 10 minutes, could we pre-deploy to those instances?
Potential optimizations:
- Pre-warm artifacts in regional caches before scale-up
- Optimize AMIs for the predicted workload
- Coordinate deployments to avoid scaling events
This is still research territory, but the payoff could be significant.
3. GitOps Integration
Currently, deployment requests come from various tools and APIs. What if fleet definitions and deployment configs lived in git, and the system continuously reconciled actual state with desired state?
The GitOps model:
- Tag-based fleet definitions stored in git
- Deployment configs as code
- Continuous reconciliation (every instance running the right version)
- Rollback = git revert
This would shift the paradigm from "push deployments" to "declarative infrastructure."
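The reconciliation loop at the heart of that model could be as simple as the sketch below; every helper interface here is hypothetical:

import time

def reconcile_loop(desired_state_repo, fleet_resolver, deployer, interval=60):
    """Continuously compare the version declared in git with what each fleet
    is running, and converge any instance that has drifted."""
    while True:
        for fleet_def in desired_state_repo.load_fleet_definitions():   # parsed from git
            desired_version = fleet_def.version
            for instance in fleet_resolver.resolve(fleet_def.tag_query):
                if instance.running_version != desired_version:
                    deployer.deploy_to_instance(instance.instance_id, desired_version)
        time.sleep(interval)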
4. Multi-Region Coordination
Deploying to tagged fleets across regions simultaneously introduces new challenges:
- Do we resolve fleets per-region or globally?
- How do we handle region-specific failures?
- What if tag propagation latency varies by region?
- Should rollout be sequential (region by region) or parallel?
We have some multi-region support, but it's not as sophisticated as single-region deployments. There's work to do here.
Key Takeaways
- Expanding a deployment system to cloud-native infrastructure is a paradigm shift, not just a new target
- Tag-based fleet resolution enables dynamic infrastructure but introduces consistency challenges
- Lifecycle hooks are essential for integrating with autoscaling
- The abstraction layer is critical—unify different infrastructure models behind a common interface
- Dynamic fleets require different deployment semantics than static fleets
- Observability and incremental rollout are essential for managing complexity
- API rate limits and eventual consistency are unavoidable at scale
- Testing dynamic systems requires chaos engineering and shadow mode
Implementation details have been abstracted. All examples are simplified for illustration.
Further Reading
- AWS Auto Scaling Documentation: Understanding lifecycle hooks and scaling policies
- "Going Faster: Continuous Delivery at Scale" - Werner Vogels' perspective on deployment systems
- "The Story of Apollo" - Amazon's deployment engine history
- AWS CodeDeploy documentation - Public-facing tool inspired by internal systems
- "Building Scalable Systems" - Patterns for handling dynamic infrastructure
About this article: This covers a real expansion of a large-scale internal deployment system to support cloud-native infrastructure. Specific metrics, company details, and implementation specifics have been abstracted, but the technical challenges, architectural decisions, and lessons learned are authentic.