Scaling the World's Largest Deployment System: From Bare Metal to Cloud Native
The Mission
I'm working on what is likely the world's largest internal deployment system. Not hyperbole—the scale is genuinely staggering.
The numbers:
- 5-6 million deployments per day
- Tens of thousands of distinct services
- Hundreds of thousands of hosts across multiple regions
- Core infrastructure services that power some of the most-used APIs on the planet
For years, this system handled deployments to internal server types: bare metal, custom networking, proprietary infrastructure. In 2025, we undertook a massive expansion: enable deployments to native cloud compute resources—specifically EC2 instances, Auto Scaling Groups, and the full cloud-native stack.
This isn't just adding another deployment target. It's a fundamental shift in how we think about fleets, capacity, and deployment orchestration.
When your deployment system already handles millions of deployments a day to bare metal infrastructure, and you need to expand it to support dynamic, elastic cloud fleets that scale up and down automatically, every design decision matters: at this scale, small inefficiencies multiply into massive problems.
The Challenge: Two Worlds Colliding
The existing deployment system was built for a world where capacity was relatively static. You provisioned servers, they got racked, and they stayed there until they were decommissioned. Fleet membership changed slowly and predictably.
Cloud-native infrastructure is fundamentally different. Instances come and go based on traffic patterns. Fleets are defined by queries, not static lists. The entire mental model shifts from "deploy to these specific hosts" to "deploy to whatever hosts match these criteria right now."
The Old World: Bare Metal
Characteristics:
- Fixed capacity: Servers are provisioned and sit there
- Static fleet membership: You know every server by name
- Predictable networking: Custom topology, well-understood
- Lifecycle control: We own the hardware from power-on to decommission
The deployment model was straightforward: maintain a registry of all hosts, query the registry for hosts matching a service name, push the deployment artifact to those hosts. The host list might have hundreds or thousands of entries, but it changed slowly—maybe a few additions or removals per week.
The New World: Cloud Native
Characteristics:
- Dynamic capacity: Instances come and go based on autoscaling
- Fluid fleet membership: The set of instances changes constantly
- Standard networking: VPCs, security groups, IAM roles
- Shared lifecycle: Cloud provider manages underlying infrastructure
The paradigm shift: you can no longer think of deployments as "push this artifact to this list of hosts." Instead, you're deploying to a constantly-evolving query result. An Auto Scaling Group might have 50 instances right now, 75 instances in an hour, and 30 instances tomorrow. Your deployment system needs to handle all of these scenarios gracefully.
Problem 1: How Do You Deploy to a Moving Target?
In the bare metal world, deployments were deterministic. Start with a list of 1,000 hosts, deploy to all 1,000, you're done. Success is binary: did all hosts get the update?
In the cloud-native world, the question becomes: what does "all hosts" even mean? The fleet membership changes while you're deploying. New instances launch. Old instances terminate. At any given moment, "all hosts" is a snapshot that's already stale.
The Tag-Based Fleet Model
In cloud environments, infrastructure is often described using tags:
{
  "InstanceId": "i-abc123",
  "Tags": [
    {"Key": "Environment", "Value": "production"},
    {"Key": "Service", "Value": "api-gateway"},
    {"Key": "Version", "Value": "v2"},
    {"Key": "Region", "Value": "us-east-1"}
  ]
}
The insight: What if deployments targeted tag queries instead of static instance lists?
Instead of maintaining a static registry of hosts, we could query EC2 for instances matching a tag combination. This is how Auto Scaling Groups already work—instances are ephemeral, but the logical fleet is defined by tags.
The deployment model becomes: "Deploy to all instances where Service=API and Environment=production". The actual instances that match can change over time, but the deployment definition remains stable.
Dynamic Fleet Resolution
# Conceptual model
import boto3

class TagBasedFleet:
    def __init__(self, tag_query):
        self.tag_query = tag_query  # e.g., {"Service": "api-gateway", "Environment": "prod"}
        self.ec2_client = boto3.client('ec2')
        self.cache_ttl = 60  # seconds

    def resolve(self):
        """Resolve the tag query to the current set of instances."""
        # Query the EC2 API for running instances matching the tags
        filters = [
            {'Name': f'tag:{k}', 'Values': [v]}
            for k, v in self.tag_query.items()
        ]
        filters.append({'Name': 'instance-state-name', 'Values': ['running']})
        response = self.ec2_client.describe_instances(Filters=filters)
        # Extract instance IDs from the (potentially paginated) response
        instances = []
        for reservation in response['Reservations']:
            for instance in reservation['Instances']:
                instances.append(instance['InstanceId'])
        # This set changes over time!
        # By the time we finish deploying, the fleet may have changed.
        return instances

    def deploy(self, artifact):
        """Deploy to all matching instances."""
        # Snapshot the fleet at the start of the deployment
        current_instances = self.resolve()
        # But what if instances are added/removed during deployment?
        # - New instances won't have the artifact
        # - Terminated instances will fail deployment
        # - We need a strategy to handle both
        for instance_id in current_instances:
            try:
                self._deploy_to_instance(instance_id, artifact)
            except InstanceTerminatedException:
                # Expected - the instance scaled down during deployment
                continue
The consistency challenge is real: at 5-6 million deployments per day across fleets that can scale by hundreds of instances per minute, this isn't a theoretical problem—it's a daily reality.
The Snapshot Problem
Deployment starts with 100 instances matching the tag query. Mid-deployment, autoscaling adds 20 more. What happens?
Option 1: Deploy to snapshot
- Pro: Consistent deployment target
- Con: New instances don't get the update
Option 2: Continuously resolve
- Pro: New instances get the update
- Con: Deployment never "completes"—it's a continuous process
The approach we took: Hybrid
- Deploy to snapshot - We snapshot the fleet at deployment start and deploy to those instances
- Lifecycle hook integration - New instances that launch during or after deployment get the current version via Auto Scaling lifecycle hooks
- Convergence monitoring - Track which instances are running which versions, alert on divergence
This gives us the best of both worlds: deterministic deployments that complete, plus automated handling of newly-launched instances.
The tradeoff: increased complexity. Now the deployment system needs to integrate deeply with the Auto Scaling lifecycle, not just push artifacts.
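The convergence-monitoring piece of the hybrid can be as simple as comparing each instance's reported version against the fleet's target. A minimal sketch, assuming each instance record carries a running_version field (an illustrative shape, not our actual data model):

from collections import Counter

def check_convergence(fleet_instances, target_version):
    """Summarize which versions a fleet is running and flag divergence."""
    versions = Counter(i['running_version'] for i in fleet_instances)
    diverged = [i['instance_id'] for i in fleet_instances
                if i['running_version'] != target_version]
    return {
        'target_version': target_version,
        'version_counts': dict(versions),      # e.g. {'v2': 118, 'v1': 2}
        'diverged_instances': diverged,        # instances to alert on
        'converged': not diverged,
    }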
Problem 2: Integration with Autoscaling
Auto Scaling Groups (ASGs) are constantly making decisions about capacity. They launch instances when load increases, terminate instances when it decreases. These decisions are made independently of our deployment system.
The challenge: these two systems need to coordinate. If we're deploying to an instance and ASG decides to terminate it, what happens? If ASG launches a new instance mid-deployment, does it get the new version or the old one?
At our scale, with thousands of ASGs and millions of deployments daily, these race conditions aren't edge cases—they're the normal operating mode.
The Autoscaling Lifecycle
Cloud autoscaling groups have a lifecycle:
Scale Up Event → Instance Launch → Initialization → In Service → Running
↓
Scale Down Event ← Instance Draining ← Termination Signal
The Deployment Challenge:
- Don't deploy to instances that are terminating
- Do deploy to instances that are launching
- Handle instances that scale up mid-deployment
- Gracefully handle instances that scale down mid-deployment
The race condition matrix:
| Deployment State | Scale Up Event | Scale Down Event |
|------------------|----------------|------------------|
| Not Started | New instance needs current version | Instance terminated, skip it |
| In Progress | New instance needs current version | Cancel deployment to that instance |
| Completed | New instance needs current version | Normal termination |
Lifecycle Hooks: The Integration Point
EC2 Auto Scaling offers lifecycle hooks—callbacks during scaling events. When an instance is launching or terminating, ASG can pause and notify you, giving your system time to prepare the instance or clean it up.
How it works:
- You create a lifecycle hook on the ASG
- When a scaling event occurs, ASG publishes a message to SNS/SQS
- Your service receives the message and performs actions
- You complete the lifecycle action, allowing ASG to continue
This is the bridge between the deployment system and Auto Scaling. Lifecycle hooks let us ensure every instance—whether launched before, during, or after a deployment—ends up with the correct version.
# Conceptual lifecycle hook integration
class AutoscalingIntegration:
    def __init__(self, deployment_service, asg_client):
        self.deployment_service = deployment_service
        self.asg_client = asg_client                      # boto3 'autoscaling' client
        self.lifecycle_hook_name = 'deployment-hook'      # hook created on the ASG (illustrative name)
        self.max_wait_time = 300  # 5 minutes

    def on_instance_launching(self, instance_id, lifecycle_action_token, asg_name):
        """Called when autoscaling is launching a new instance."""
        try:
            # Wait for the instance to be reachable (SSH/SSM)
            self._wait_for_instance_ready(instance_id, timeout=120)
            # Query what version this fleet should be running
            target_version = self.deployment_service.get_target_version_for_fleet(asg_name)
            # Deploy that version to this instance
            self.deployment_service.deploy_to_instance(instance_id, target_version)
            # Run health checks
            if not self._health_check(instance_id):
                raise HealthCheckFailedException()
            # Success! Complete the lifecycle action
            self.asg_client.complete_lifecycle_action(
                LifecycleHookName=self.lifecycle_hook_name,
                LifecycleActionToken=lifecycle_action_token,
                AutoScalingGroupName=asg_name,
                LifecycleActionResult='CONTINUE'
            )
        except Exception:
            # Deployment failed - abandon this instance
            self.asg_client.complete_lifecycle_action(
                LifecycleHookName=self.lifecycle_hook_name,
                LifecycleActionToken=lifecycle_action_token,
                AutoScalingGroupName=asg_name,
                LifecycleActionResult='ABANDON'  # ASG will terminate it
            )

    def on_instance_terminating(self, instance_id, lifecycle_action_token, asg_name):
        """Called when autoscaling is terminating an instance."""
        # Cancel any in-progress deployment to this instance
        self.deployment_service.cancel_deployment(instance_id)
        # Drain connections (let in-flight requests complete)
        self._drain_connections(instance_id, timeout=60)
        # Complete the lifecycle action - OK to terminate now
        self.asg_client.complete_lifecycle_action(
            LifecycleHookName=self.lifecycle_hook_name,
            LifecycleActionToken=lifecycle_action_token,
            AutoScalingGroupName=asg_name,
            LifecycleActionResult='CONTINUE'
        )
The implementation looks straightforward, but the devil is in the details: timeouts, retries, error handling, and observability at scale.
The Coordination Challenge
Now you have two systems trying to manage instance state:
- Deployment system: "I'm updating this instance"
- Autoscaling: "I'm terminating this instance"
Who wins? How do you coordinate?
The coordination protocol:
We use a distributed state machine tracked in a highly available data store (think DynamoDB). Each instance has a state:
- DEPLOYING - Deployment in progress
- DEPLOYED - Successfully deployed
- TERMINATING - ASG is terminating
- DRAINING - Connections being drained
- ERROR - Deployment failed
State transitions are atomic. When ASG signals termination, we try to transition from DEPLOYING → TERMINATING. If that succeeds, we cancel the deployment. If the deployment already completed (DEPLOYED state), we just drain and terminate normally.
The key insight: both systems need to agree on instance state, and state transitions must be atomic to avoid race conditions.
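A minimal sketch of that atomic transition using a DynamoDB conditional write. The table name and attribute names here are illustrative, not our actual schema:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def try_transition(instance_id: str, from_state: str, to_state: str) -> bool:
    """Atomically move an instance between states; fail if another actor
    (deployment vs. termination) already changed the state."""
    try:
        dynamodb.update_item(
            TableName='instance-deployment-state',      # assumed table name
            Key={'InstanceId': {'S': instance_id}},
            UpdateExpression='SET #s = :to',
            ConditionExpression='#s = :from',           # only succeed if still in from_state
            ExpressionAttributeNames={'#s': 'State'},
            ExpressionAttributeValues={
                ':from': {'S': from_state},
                ':to': {'S': to_state},
            },
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False   # lost the race; the other system owns this instance now
        raise

# e.g. on a termination signal, only cancel the deployment if we win the race:
# if try_transition(instance_id, 'DEPLOYING', 'TERMINATING'):
#     deployment_service.cancel_deployment(instance_id)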
Architecture Evolution
The original architecture was designed for a static world. Adapting it for dynamic, elastic infrastructure required fundamental changes—not just bolting on new features, but rethinking core abstractions.
Original Architecture (Bare Metal)
Deployment Controller
↓
Fixed Server Registry
↓
Server Agents (long-lived connections)
This worked well for years. The registry was authoritative. Servers rarely changed. Deployments were predictable. The system scaled horizontally by sharding based on service name.
At 5-6 million deployments per day, this architecture handles bare metal beautifully. But it doesn't handle dynamic fleets.
New Architecture (Cloud Native)
Deployment Controller
↓
├─→ Fixed Server Registry (legacy)
└─→ Tag Query Resolver → Cloud Provider API
↓
Dynamic Instance Set
↓
Instance Agents (ephemeral connections)
↓
Lifecycle Hook Integration
Key additions:
- Tag Query Resolver: Queries EC2 API to resolve tag-based fleet definitions
- Dynamic Instance Set: The fleet membership changes continuously
- Lifecycle Hook Integration: SQS queues receiving ASG lifecycle events
- Dual-mode support: Both static registry and dynamic tag queries work simultaneously
The system needs to handle both models during the transition period, then continue supporting bare metal even as cloud adoption grows.
The Abstraction Layer
We needed an abstraction that unified both models:
# Conceptual abstraction
from abc import ABC, abstractmethod
from typing import Iterator, List

class Fleet(ABC):
    @abstractmethod
    def get_instances(self) -> List['Instance']:
        """Return the current set of instances in the fleet."""
        ...

    @abstractmethod
    def watch_changes(self) -> Iterator['FleetChangeEvent']:
        """Stream of add/remove events."""
        ...

class StaticFleet(Fleet):
    """Fixed set of instances (bare metal)."""
    def __init__(self, registry_client, service_name):
        self.registry = registry_client
        self.service_name = service_name

    def get_instances(self):
        # Query the static registry
        return self.registry.get_hosts_for_service(self.service_name)

    def watch_changes(self):
        # Bare metal fleets change rarely; poll the registry every few minutes
        return self.registry.poll_for_changes(self.service_name, interval=300)

class TagBasedFleet(Fleet):
    """Dynamic set based on a tag query (cloud)."""
    def __init__(self, ec2_client, tag_query):
        self.ec2 = ec2_client
        self.tag_query = tag_query

    def get_instances(self):
        # Query the EC2 API for instances matching the tags
        return self._resolve_tag_query()

    def watch_changes(self):
        # Subscribe to ASG lifecycle events via SQS:
        # real-time notifications of instance adds/removes
        return self._subscribe_to_lifecycle_events()

class DeploymentOrchestrator:
    def deploy(self, fleet: Fleet, artifact: 'Artifact'):
        """Deploy to any Fleet implementation."""
        # Get the initial snapshot
        instances = fleet.get_instances()
        # Start the deployment
        deployment_id = self._create_deployment(instances, artifact)
        # Watch for fleet changes and handle them
        for event in fleet.watch_changes():
            if event.type == 'INSTANCE_ADDED':
                self._deploy_to_instance(event.instance_id, artifact)
            elif event.type == 'INSTANCE_REMOVED':
                self._cancel_deployment(event.instance_id)
        return deployment_id
This abstraction is critical. It lets us write deployment logic once and have it work for both bare metal and cloud-native fleets. The Deployment Orchestrator doesn't need to know whether it's deploying to a static registry or a dynamic tag query—it just works with the Fleet interface.
Benefits:
- Unified codebase for both infrastructure types
- Easier testing (mock the Fleet interface)
- Future-proof (new fleet types just implement the interface)
- Gradual migration (teams can move from StaticFleet to TagBasedFleet at their own pace)
Deploying to Tags: The Devil Is in the Details
Tag-based deployment sounds elegant in theory. In practice, you hit a dozen edge cases.
Challenge 1: Tag Propagation Latency
You add a tag to an instance. How long until it's visible in the deployment system?
EC2 tags are eventually consistent. When you add or update a tag, it typically shows up in API queries within seconds, but there's no guarantee. During periods of high API load, propagation can take longer.
Our caching strategy:
- Fleet resolution cache: 60-second TTL
- Forced refresh: On deployment start, force a fresh query (bypass cache)
- Invalidation: When we receive lifecycle hooks, invalidate cache for that ASG
The tradeoff: fresher data means more API calls, which means hitting rate limits faster. Stale data means potentially deploying to the wrong instances.
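A minimal sketch of that caching layer, with the 60-second TTL from above and otherwise illustrative names:

import time

class FleetResolutionCache:
    """TTL cache over tag-query resolution, with forced refresh and
    per-fleet invalidation. Interfaces here are illustrative."""

    def __init__(self, resolver, ttl_seconds=60):
        self.resolver = resolver       # callable: tag_query -> list of instance IDs
        self.ttl = ttl_seconds
        self._entries = {}             # cache key -> (timestamp, instances)

    def get(self, tag_query, force_refresh=False):
        key = tuple(sorted(tag_query.items()))
        entry = self._entries.get(key)
        if force_refresh or entry is None or time.time() - entry[0] > self.ttl:
            instances = self.resolver(tag_query)         # fresh EC2 query
            self._entries[key] = (time.time(), instances)
            return instances
        return entry[1]

    def invalidate(self, tag_query):
        """Called when a lifecycle hook signals a change for this fleet."""
        self._entries.pop(tuple(sorted(tag_query.items())), None)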
Challenge 2: Tag Mutation During Deployment
What if someone changes tags on instances mid-deployment?
Example: You start deploying to all instances tagged Version=v1. Mid-deployment, someone updates half the fleet to Version=v2. Should the deployment continue to those instances?
Our solution: snapshot-based versioning. When a deployment starts, we snapshot the resolved fleet and record it with a deployment ID. The deployment operates on that snapshot, regardless of tag changes. This gives us:
- Consistency: Deployment targets don't change mid-flight
- Audit trail: We know exactly which instances were targeted
- Idempotency: Retrying a deployment targets the same instances
Tradeoff: If someone urgently needs to remove an instance from the deployment target, they can't just change tags—they need to cancel the deployment.
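A sketch of what the snapshot record might contain; the field names are illustrative, and the Fleet interface is the abstraction described earlier:

import time
import uuid

def create_deployment_snapshot(fleet, artifact_version):
    """Resolve the fleet once, persist the result with the deployment,
    and operate only on that frozen record afterwards."""
    return {
        'deployment_id': str(uuid.uuid4()),
        'artifact_version': artifact_version,
        'created_at': time.time(),
        # Frozen for the lifetime of the deployment: consistency, audit trail,
        # and idempotent retries all come from this one field.
        'target_instances': sorted(fleet.get_instances()),
    }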
Challenge 3: Tag Query Complexity
Simple tag query: Environment=production
Complex tag query: (Environment=production AND Service=api) OR (Environment=staging AND Tag=canary-enabled)
We support a simple query language that maps to EC2 filter syntax:
- AND/OR logic
- Negation (NOT)
- Wildcard matching
- Multiple values per key
But complex queries are slow—each OR clause potentially requires a separate API call. At scale, a single complex query for a large fleet can take 10+ seconds.
Optimization: pre-compute and cache common queries, expire based on fleet churn rate.
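Roughly how an AND clause maps onto EC2 filters, and why each OR branch costs its own API call. This is a sketch of the mapping only, not our production query compiler (NOT and wildcards are omitted):

def and_clause_to_filters(clause):
    """Translate {"Environment": "production", "Service": "api"} into
    EC2 DescribeInstances filters, restricted to running instances."""
    filters = [{'Name': f'tag:{key}',
                'Values': values if isinstance(values, list) else [values]}
               for key, values in clause.items()]
    filters.append({'Name': 'instance-state-name', 'Values': ['running']})
    return filters

def resolve_or_query(ec2_client, or_clauses):
    """Each OR branch becomes a separate paginated DescribeInstances call;
    results are unioned. This is why complex queries get slow at scale."""
    instance_ids = set()
    for clause in or_clauses:
        paginator = ec2_client.get_paginator('describe_instances')
        for page in paginator.paginate(Filters=and_clause_to_filters(clause)):
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_ids.add(instance['InstanceId'])
    return instance_ids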
Challenge 4: Authorization
In the old model, we controlled server access. In the cloud, IAM roles and permissions are critical.
The authorization model:
- Deployment service IAM role: Has ec2:DescribeInstances and ec2:DescribeTags permissions
- Tag-based access control: Teams can restrict deployments using tag-based IAM policies
- Service-level auth: Teams must be authorized to deploy a specific service, regardless of where it runs
Example policy: "You can deploy ServiceX to any instances tagged with Team=platform, but not to Team=frontend"
This decentralizes control—teams own their infrastructure via tags, and the deployment system respects those boundaries.
Performance at Scale
At 5-6 million deployments per day, performance isn't just a nice-to-have—it's existential. Small inefficiencies compound into massive problems.
Fleet Resolution Latency
Resolving a tag query to instances:
- Bare metal registry lookup: < 10ms (in-memory cache)
- Cloud API query (uncached): 200-800ms depending on fleet size
At scale:
- 1,000 instances: ~250ms (single DescribeInstances call)
- 10,000 instances: ~2 seconds (pagination required, multiple API calls)
- 100,000 instances: ~15-20 seconds (heavy pagination, potential rate limiting)
Caching strategy:
- Hot cache: 60-second TTL for frequently-queried fleets
- Pre-warming: Background jobs refresh caches for large fleets before they're likely to be queried
- Partial caching: Cache instances by tag value, combine cached results for multi-tag queries
This reduced average resolution time from 5 seconds to under 200ms for 90% of queries.
Deployment Velocity
- Peak throughput: 8,000-10,000 deployments per minute
- Average instance deployment time: 30-60 seconds (download artifact, install, health check)
- Parallel deployments per fleet: 100-500 instances simultaneously (configurable based on fleet size and risk tolerance)
The bottleneck shifted from fleet resolution to artifact distribution. We addressed this with:
- Regional artifact caches (S3 with CloudFront)
- Delta deployments (only transfer changed files)
- Peer-to-peer distribution for very large artifacts
API Rate Limiting
EC2 API rate limits are real, especially for DescribeInstances. At our scale, we hit them daily.
Strategies for staying under limits:
- Request coalescing: Batch multiple deployment requests for the same fleet
- Exponential backoff with jitter: When throttled, back off with randomized retry timing
- Regional partitioning: Distribute API calls across regions to use separate quota pools
- Quota monitoring: Track API call rates, alert before hitting limits
We also worked with AWS to increase our rate limits for specific APIs based on our usage patterns.
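The backoff item above, as a minimal sketch. The retryable error codes and parameter values are illustrative, not an exhaustive list:

import random
import time
from botocore.exceptions import ClientError

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a throttled API call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ClientError as e:
            code = e.response['Error']['Code']
            if code not in ('Throttling', 'RequestLimitExceeded'):
                raise                                  # only retry throttling errors
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# e.g. call_with_backoff(lambda: ec2.describe_instances(Filters=filters))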
The Lifecycle Hook Dance
Lifecycle hooks are conceptually simple but operationally complex. Here's what actually happens in production.
Launching Instances
1. Autoscaling decides to scale up
2. Cloud provider launches new instance
3. Instance transitions to "Pending"
4. Lifecycle hook fires: "Instance launching"
5. Deployment system receives notification
6. Wait for instance to be reachable
7. Deploy current version
8. Run health checks
9. Complete lifecycle action
10. Instance transitions to "InService"
Timing details:
- Steps 1-4: < 1 second
- Step 5-6 (wait for reachable): 30-120 seconds (boot time + initialization)
- Step 7 (deploy): 30-60 seconds
- Step 8 (health check): 10-30 seconds
- Total: 70-210 seconds per instance launch
Failure scenarios:
- Instance fails to become reachable (timeout: 5 min) → ABANDON instance
- Deployment fails (artifact corruption, disk full, etc.) → ABANDON instance
- Health check fails → ABANDON instance
- Lifecycle hook timeout (our limit: 10 min) → ASG proceeds anyway, instance may be unhealthy
If we ABANDON an instance, ASG terminates it and launches a replacement. This automatically retries, which usually resolves transient failures.
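One way to implement the "wait for instance to be reachable" step is to poll SSM until the agent on the new instance reports in. A sketch under the assumption that instances run the SSM agent; the production check may differ:

import time
import boto3

def wait_for_instance_ready(instance_id, timeout=120, poll_interval=10):
    """Poll SSM until the instance's agent is Online, or give up after timeout."""
    ssm = boto3.client('ssm')
    deadline = time.time() + timeout
    while time.time() < deadline:
        info = ssm.describe_instance_information(
            Filters=[{'Key': 'InstanceIds', 'Values': [instance_id]}]
        )
        if any(i['PingStatus'] == 'Online' for i in info['InstanceInformationList']):
            return True
        time.sleep(poll_interval)
    return False   # caller will ABANDON the lifecycle action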
Terminating Instances
1. Autoscaling decides to scale down
2. Lifecycle hook fires: "Instance terminating"
3. Deployment system receives notification
4. Cancel any in-progress deployment
5. Begin connection draining
6. Wait for draining to complete (timeout: X minutes)
7. Complete lifecycle action
8. Instance terminates
Edge cases:
- Deployment was already complete → Just drain and terminate normally
- No deployment was happening → Just drain and terminate
- Draining times out (> 60s) → Complete lifecycle action anyway, let ASG terminate
- Multiple termination hooks for same instance (rare but possible) → Idempotent handling, only drain once
Race Conditions
Scenario: Deployment starts, then scale-down event
- Deployment system resolves fleet, gets 100 instances, starts deploying
- 50 instances deployed successfully
- ASG decides to scale down, selects 20 instances for termination (some already deployed, some not)
- Lifecycle hooks fire for those 20 instances
- Deployment system receives termination notifications
- For instances already deployed: drain and complete
- For instances mid-deployment: cancel deployment, drain, complete
- For instances not yet deployed: skip them entirely
The atomic state transitions in DynamoDB ensure we handle this correctly—we can't deploy and terminate simultaneously.
Scenario: Scale-up event during deployment
- Deployment resolves fleet: 100 instances
- Start deploying to those 100
- ASG scales up, adds 20 new instances
- Lifecycle hooks fire for the 20 new instances
- Deployment system receives launch notifications
- Query: what version should these instances run?
- Deploy that version (same one the existing fleet is getting)
- Original deployment completes with 100 instances
- Lifecycle deployments complete with 20 instances
- Net result: All 120 instances running the same version
This is where the hybrid approach shines—we don't need to continuously resolve the fleet during deployment. Lifecycle hooks handle the new instances automatically.
Monitoring and Observability
At this scale, observability isn't optional—it's how you debug distributed systems spanning millions of instances.
Monitoring infrastructure:
- Metrics: CloudWatch + internal time-series database
- Logs: Centralized logging (CloudWatch Logs + internal log aggregation)
- Traces: Distributed tracing for end-to-end deployment lifecycle
- Dashboards: Real-time dashboards for each service team + central ops dashboard
Key Metrics
Deployment Metrics:
- Active deployments
- Deployment velocity (instances/min)
- Success rate
- Rollback rate
Fleet Metrics:
- Fleet size by tag query
- Churn rate (instances added/removed)
- Tag propagation latency
Lifecycle Metrics:
- Lifecycle hook latency
- Hook timeout rate
- Instance launch-to-ready time
Alerting:
- Deployment success rate < 95% → Page on-call
- Lifecycle hook timeout rate > 5% → Alert
- API throttling detected → Alert + auto-scale workers
- Fleet resolution time > 10s → Warning
We also built anomaly detection: if deployment failure rate for a specific service suddenly spikes, automatically alert that service's on-call.
Testing Strategies
Testing a system this complex requires multiple strategies.
Unit Tests
Standard fare—test individual components in isolation:
- Fleet resolution logic (mocked EC2 APIs)
- Deployment state machine transitions
- Tag query parsing and optimization
- Authorization logic
High coverage (> 85%) on core business logic.
Integration Tests
Test deployment orchestration with mock cloud APIs:
- Spin up mock EC2 API endpoints
- Simulate tag queries returning different results over time
- Simulate lifecycle hooks firing during deployments
- Verify correct state transitions
These tests caught race conditions that unit tests missed.
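As an illustration of the approach (not our actual harness), a test against a mocked EC2 API using the open-source moto library might look like this:

import boto3
from moto import mock_aws   # moto's decorator for in-process mocked AWS APIs

@mock_aws
def test_tag_query_resolution_sees_only_matching_instances():
    """Launch fake tagged instances, then verify that the tag-filter query
    used by fleet resolution returns exactly those instances."""
    ec2 = boto3.client('ec2', region_name='us-east-1')
    ec2.run_instances(
        ImageId='ami-12345678', MinCount=2, MaxCount=2,
        TagSpecifications=[{
            'ResourceType': 'instance',
            'Tags': [{'Key': 'Service', 'Value': 'api-gateway'},
                     {'Key': 'Environment', 'Value': 'prod'}],
        }],
    )
    response = ec2.describe_instances(Filters=[
        {'Name': 'tag:Service', 'Values': ['api-gateway']},
        {'Name': 'tag:Environment', 'Values': ['prod']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ])
    instance_ids = [i['InstanceId']
                    for r in response['Reservations'] for i in r['Instances']]
    assert len(instance_ids) == 2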
Chaos Tests
Inject failures and verify graceful degradation:
- EC2 API timeouts (throttling simulation)
- Autoscaling events mid-deployment
- Network partitions between components
- DynamoDB unavailability
- Lifecycle hook message delivery delays
We ran chaos tests in a staging environment weekly, and found real bugs every time.
Shadow Mode
Before going live, we ran in shadow mode:
- Resolve tag queries but don't deploy
- Integrate with lifecycle hooks but don't block
- Collect metrics and validate behavior
Shadow mode learnings:
- Tag queries for some large fleets took 20+ seconds (led to caching strategy)
- Lifecycle hook messages had ~1% delivery delays > 30s (led to retry logic)
- Some tag queries returned inconsistent results across API calls (eventual consistency handling)
- DynamoDB hot partition on deployment state table (led to partition key redesign)
Shadow mode ran for 4 weeks, processing millions of shadow deployments. This caught issues that would have been catastrophic in production.
Rollout Strategy
Given the scale and criticality, rollout was extremely cautious.
Phase 1: Read-Only Integration
- Integrate with cloud APIs
- Resolve tag queries
- No actual deployments
Phase 2: Opt-In Beta
- Select teams deploy to cloud resources
- High-touch support
- Rapid iteration based on feedback
Phase 3: General Availability
- All teams can deploy to cloud
- Documentation and training
- Self-service tooling
Adoption metrics:
- Week 1: 50 teams, 10,000 deployments
- Month 1: 300 teams, 200,000 deployments
- Month 3: 800 teams, 1.5 million deployments
- Month 6: 1,500+ teams, 3 million+ cloud deployments per day
Success rate remained above 98% throughout the rollout.
What Went Wrong
Let's be honest—at this scale, incidents are inevitable. Here are three that taught us important lessons.
Incident 1: The Great Tag Storm
What happened:
An automation script had a bug that caused it to rapidly update tags on 50,000 instances—adding and removing the same tag every few seconds. This created a storm of cache invalidations in our fleet resolution system.
Our tag query cache relied on change notifications from lifecycle hooks. The rapid tag changes caused:
- Cache thrashing (constant invalidation and refresh)
- EC2 API throttling (excessive DescribeInstances calls)
- Deployment delays (couldn't resolve fleets reliably)
Impact: Deployments to affected fleets delayed by 10-30 minutes for 2 hours.
Resolution:
- Rate-limited cache invalidations (max one per fleet per 30 seconds)
- Added tag change rate monitoring/alerting
- Fixed the buggy automation script
Learning: Never assume external systems will behave rationally. Protect yourself with rate limits.
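In practice, the guard is a per-fleet rate limiter on cache invalidations. A minimal sketch with the illustrative 30-second window mentioned above:

import time

class RateLimitedInvalidator:
    """Allow at most one cache invalidation per fleet per window,
    no matter how fast tags churn. Interfaces are illustrative."""

    def __init__(self, invalidate_fn, min_interval_seconds=30):
        self.invalidate_fn = invalidate_fn       # e.g. cache.invalidate
        self.min_interval = min_interval_seconds
        self._last = {}                          # fleet key -> last invalidation time

    def maybe_invalidate(self, fleet_key):
        now = time.time()
        if now - self._last.get(fleet_key, 0.0) < self.min_interval:
            return False    # drop it; the cache TTL will refresh the entry anyway
        self._last[fleet_key] = now
        self.invalidate_fn(fleet_key)
        return True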
Incident 2: Lifecycle Hook Timeout Cascade
What happened:
A deployment artifact for a high-traffic service was corrupted in S3. New instances launched via ASG would:
- Receive lifecycle hook
- Attempt to deploy the artifact
- Artifact download fails
- Retry download (with backoff)
- Eventually timeout the lifecycle hook (10 min)
- ASG abandons the instance and launches a replacement
- Replacement hits the same issue
This created a launch/fail/relaunch loop. Over 30 minutes, ASG launched and abandoned 200+ instances trying to meet capacity.
Impact: Service capacity reduced by 30% for 45 minutes, users experienced elevated latency.
Resolution:
- Detected pattern (rapidly abandoned instances)
- Automatic alert triggered
- On-call identified corrupted artifact
- Rolled back to previous version
- ASG stabilized
Learning: Build circuit breakers. If multiple consecutive instances fail lifecycle hooks for the same reason, pause and alert rather than burning through instances.
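Conceptually, the circuit breaker tracks consecutive lifecycle failures per ASG and trips before the launch/abandon loop burns through capacity. A sketch with illustrative thresholds:

import time

class LaunchCircuitBreaker:
    """Trip after N consecutive failed launches for the same ASG; while open,
    pause abandoning instances and page a human instead."""

    def __init__(self, failure_threshold=5, cooldown_seconds=900):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self._failures = {}      # asg_name -> consecutive failure count
        self._opened_at = {}     # asg_name -> time the breaker tripped

    def record_success(self, asg_name):
        self._failures[asg_name] = 0

    def record_failure(self, asg_name):
        self._failures[asg_name] = self._failures.get(asg_name, 0) + 1
        if self._failures[asg_name] >= self.failure_threshold:
            self._opened_at[asg_name] = time.time()

    def is_open(self, asg_name):
        opened = self._opened_at.get(asg_name)
        if opened is None:
            return False
        if time.time() - opened > self.cooldown:
            # Half-open: allow another attempt after the cooldown
            del self._opened_at[asg_name]
            self._failures[asg_name] = 0
            return False
        return True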
Incident 3: The Deployment That Never Ended
What happened:
A deployment started for a fleet of 1,000 instances. Due to a bug in our snapshot logic, we didn't properly snapshot the fleet—instead, we continuously re-resolved it.
Meanwhile, the fleet was experiencing heavy autoscaling (demand spike). Instances were launching and terminating rapidly. Our deployment system tried to deploy to every instance it saw, including ones that would terminate seconds later.
The deployment ran for 6 hours, deploying to over 3,500 instances (due to churn), before we manually cancelled it.
Impact: Deployment resources exhausted, other deployments queued behind it were delayed.
Resolution:
- Manually cancelled the deployment
- Fixed the snapshot bug
- Added deployment duration alerts (anything > 2 hours triggers investigation)
Learning: Always snapshot your target fleet. Never continuously resolve during deployment unless you explicitly intend to.
What Went Right
Not everything was incidents and firefighting. Several architectural decisions proved invaluable.
1. The Abstraction Layer
The unified Fleet interface allowed us to support both models without duplicating logic.
2. Incremental Rollout
Shadow mode and phased rollout prevented catastrophic failures.
3. Strong Observability
Metrics and logging allowed rapid debugging of complex distributed issues.
Lessons Learned
1. Dynamic Infrastructure Requires Different Thinking
You can't treat cloud instances like bare metal servers with different names.
The mental model shift is profound: you're no longer deploying to a fixed set of hosts. You're deploying to a logical fleet defined by a query, and that query result changes over time. This affects everything—how you think about deployment completion, how you handle failures, how you monitor convergence.
Teams that tried to use cloud instances exactly like bare metal servers struggled. Teams that embraced the dynamic model thrived.
2. Tags Are Powerful But Tricky
Tag-based fleet management is flexible, but consistency is hard.
Best practices we discovered:
- Always snapshot the fleet at deployment start
- Cache aggressively but invalidate intelligently
- Monitor tag propagation latency
- Use lifecycle hooks as the source of truth for fleet changes
- Document your tagging schema and enforce it (consider tag policies)
- Never use tag queries that can return wildly different results second-to-second
3. Lifecycle Integration Is Critical
You can't just deploy to cloud instances—you must integrate with their lifecycle.
Autoscaling groups will make capacity decisions independent of your deployment system. If you don't integrate with lifecycle hooks, you'll constantly be fighting with ASG. Instances will launch without the right version. Instances will terminate mid-deployment causing failures.
The investment in lifecycle hook integration was significant—but it's non-negotiable for production-quality deployments to elastic infrastructure.
4. API Limits Are Real
At scale, cloud provider APIs become a bottleneck.
Strategies that worked for us:
- Cache everything you can
- Use lifecycle hooks for change notifications instead of polling
- Batch API calls when possible
- Monitor your API call rates and set alerts
- Work with your cloud provider to understand and optimize quotas
- Build rate limiting and backoff into every API client
5. Observability Is Non-Negotiable
Debugging distributed systems without metrics is impossible.
The ROI of our observability investment was immediate and ongoing. Every incident was resolved faster because we had:
- Deployment-level traces showing exactly what happened
- Metrics at every component boundary
- Automated anomaly detection
- Dashboards that made patterns obvious
We spent ~20% of development time on observability infrastructure. It paid for itself within the first month of production traffic.
The Road Ahead
The system works, but there's always room for improvement.
1. Cross-Cloud Support
Currently, this implementation is AWS-specific (EC2, ASG, tags, lifecycle hooks). But the abstraction layer was designed with multi-cloud in mind.
Challenges:
- Different cloud providers have different autoscaling primitives
- Tag semantics vary (AWS tags vs Azure tags vs GCP labels)
- Lifecycle hooks work differently (or don't exist)
- API rate limits and consistency guarantees differ
The Fleet interface should work, but each cloud provider needs its own implementation of TagBasedFleet. That's significant engineering work.
2. Predictive Scaling Integration
AWS ASG supports predictive scaling—it forecasts future load and scales proactively. If ASG knows it will scale up 100 instances in 10 minutes, could we pre-deploy to those instances?
Potential optimizations:
- Pre-warm artifacts in regional caches before scale-up
- Optimize AMIs for the predicted workload
- Coordinate deployments to avoid scaling events
This is still research territory, but the payoff could be significant.
3. GitOps Integration
Currently, deployment requests come from various tools and APIs. What if fleet definitions and deployment configs lived in git, and the system continuously reconciled actual state with desired state?
The GitOps model:
- Tag-based fleet definitions stored in git
- Deployment configs as code
- Continuous reconciliation (every instance running the right version)
- Rollback = git revert
This would shift the paradigm from "push deployments" to "declarative infrastructure."
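The reconciliation loop at the heart of that model could be as simple as the sketch below; every helper interface here is hypothetical:

import time

def reconcile_loop(desired_state_repo, fleet_resolver, deployer, interval=60):
    """Continuously compare the version declared in git with what each fleet
    is running, and converge any instance that has drifted."""
    while True:
        for fleet_def in desired_state_repo.load_fleet_definitions():   # parsed from git
            desired_version = fleet_def.version
            for instance in fleet_resolver.resolve(fleet_def.tag_query):
                if instance.running_version != desired_version:
                    deployer.deploy_to_instance(instance.instance_id, desired_version)
        time.sleep(interval)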
4. Multi-Region Coordination
Deploying to tagged fleets across regions simultaneously introduces new challenges:
- Do we resolve fleets per-region or globally?
- How do we handle region-specific failures?
- What if tag propagation latency varies by region?
- Should rollout be sequential (region by region) or parallel?
We have some multi-region support, but it's not as sophisticated as single-region deployments. There's work to do here.
Key Takeaways
- Expanding a deployment system to cloud-native infrastructure is a paradigm shift, not just a new target
- Tag-based fleet resolution enables dynamic infrastructure but introduces consistency challenges
- Lifecycle hooks are essential for integrating with autoscaling
- The abstraction layer is critical—unify different infrastructure models behind a common interface
- Dynamic fleets require different deployment semantics than static fleets
- Observability and incremental rollout are essential for managing complexity
- API rate limits and eventual consistency are unavoidable at scale
- Testing dynamic systems requires chaos engineering and shadow mode
Implementation details have been abstracted. All examples are simplified for illustration.
Further Reading
- AWS Auto Scaling Documentation: Understanding lifecycle hooks and scaling policies
- "Going Faster: Continuous Delivery at Scale" - Werner Vogels' perspective on deployment systems
- "The Story of Apollo" - Amazon's deployment engine history
- AWS CodeDeploy documentation - Public-facing tool inspired by internal systems
- "Building Scalable Systems" - Patterns for handling dynamic infrastructure
About this article: This covers a real expansion of a large-scale internal deployment system to support cloud-native infrastructure. Specific metrics, company details, and implementation specifics have been abstracted, but the technical challenges, architectural decisions, and lessons learned are authentic.