Patching 25,000 Servers Without Breaking the Internet
The Wake-Up Call
March 2020. A critical security vulnerability lands—the kind that gets a CVE number and a scary name. Your infrastructure team has 72 hours to patch over 25,000 compute instances across 12 regions and hundreds of availability zones.
The vulnerability was in a core system library. Every instance needed an update. The patch couldn't wait for the next maintenance window. It couldn't wait for the weekend. It needed to happen now, and it needed to happen without taking down the services that millions of users depended on.
I was on the infrastructure team responsible for making this happen. We had 47 different service types, running on instance sizes ranging from t3.small to r5.24xlarge. Some services were stateless and could handle rolling restarts. Others maintained in-memory state and needed careful coordination. A few were single points of failure that required manual intervention.
The Deceptively Simple Plan
The initial design review lasted about 30 minutes:
- Use infrastructure-as-code tooling to orchestrate the rollout
- Deploy in waves across regions
- Leverage existing deployment tools
- Monitor and rollback if needed
The plan looked solid on the whiteboard. It didn't survive contact with reality.
Our initial estimate was 48 hours. The actual time: 11 days. Most of that was debugging, retrying, and recovering from partial failures.
When Infrastructure-as-Code Hits Its Limits
The CloudFormation Problem
CloudFormation handles rolling updates well for typical deployments. You specify an UpdatePolicy, set a batch size, and let AWS handle it. For 100 instances, it's elegant.
For 25,000 instances across a dozen regions? The elegance breaks down.
```yaml
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 10
        MinInstancesInService: 95  # an absolute instance count, not a percentage
        PauseTime: PT5M
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
```
This configuration works, slowly, until you hit reality:
API throttling. CloudFormation makes dozens of API calls per instance update. At 25,000 instances, you're making hundreds of thousands of API calls. AWS rate limits kick in. Your stack update stalls. You can't even check status because DescribeStacks is also throttled.
Stack update timeouts. CloudFormation stack updates have a default timeout. Large stacks with rolling updates can exceed it. When they do, the update fails, and now you have a partially-updated fleet with no clear rollback path.
Partial failure recovery. When an update fails midway, CloudFormation tries to roll back. But rolling back 12,000 already-updated instances is its own multi-hour operation. And sometimes the rollback itself fails.
Observability gaps. CloudFormation events don't give you enough visibility. "Instance update in progress" isn't useful when you need to know which specific instances are stuck.
The CodeDeploy Scaling Wall
Our deployment orchestration system had never been tested at this scale. It was designed for hundreds of deployments per day, not thousands of instances per hour.
Coordination overhead. The system tracked deployment state in a central database. Every instance update wrote to that database. At 25,000 instances, the database became the bottleneck. Write latency spiked from 5ms to 500ms. The deployment coordinator couldn't keep up.
State management. Tracking the state of 25,000 instances sounds simple—just store instance ID and status. But you also need to track: which patch version, when it started, health check results, retry count, error messages, and dependencies. The state management code wasn't designed for this volume.
Timeout cascades. One stuck instance would cause its batch to timeout. The batch timeout would cause the region rollout to pause. The pause would cause the global coordinator to wait. One bad instance could stall the entire operation for hours while we manually investigated.
Cellular Architecture: Divide and Conquer
The breakthrough came from thinking about the problem differently. Instead of treating 25,000 instances as one big deployment, we treated them as 50 independent deployments of 500 instances each.
What Is a Cell?
Think of it like bulkheads on a ship. If one compartment floods, it doesn't sink the whole vessel.
```
Global Fleet (25,000 instances)
├── US-East-1 (8,000 instances)
│   ├── AZ-A (2,700 instances)
│   │   ├── Cell-1 (500 instances) ← independent deployment unit
│   │   ├── Cell-2 (500 instances)
│   │   ├── Cell-3 (500 instances)
│   │   └── ...
│   ├── AZ-B (2,700 instances)
│   └── AZ-C (2,600 instances)
├── EU-West-1 (6,000 instances)
│   └── ...
└── AP-Southeast-1 (4,000 instances)
    └── ...
```
Cell sizing was a tradeoff. Too small (50 instances) meant excessive coordination overhead—you're managing 500 separate deployments. Too large (2,000 instances) meant large blast radius—one bad deploy affects 2,000 instances. We settled on 500 instances per cell based on empirical testing.
The Deployment Wave Strategy
Wave 1: Canary Cells (1% of fleet)
Pick one cell in each region. Deploy to those cells first. Watch everything:
- Error rates in the 15 minutes post-deployment
- Latency percentiles (p50, p95, p99)
- CPU and memory metrics
- Application-specific health checks
- Customer-facing error rates
Success criteria: No statistically significant degradation in any metric. We used a simple threshold: if any metric moved more than 2 standard deviations from baseline, stop and investigate.
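That threshold check is simple enough to sketch in a few lines. This is an illustration, not our production code; the metric names and sample values are made up:

```python
import statistics

def exceeds_baseline(baseline_samples, current_value, n_sigma=2.0):
    """Return True if current_value deviates more than n_sigma
    standard deviations from the baseline mean."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)
    return abs(current_value - mean) > n_sigma * stdev

# Example: p99 latency samples (ms) from the window before the deploy
baseline = [120, 118, 125, 122, 119, 121, 123, 120]
print(exceeds_baseline(baseline, 124))  # → False (within 2 sigma)
print(exceeds_baseline(baseline, 180))  # → True (clearly degraded)
```

The real system compared many metrics at once, but each comparison reduced to this shape: a baseline window, a post-deploy value, and a sigma threshold.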
Wave 2: Regional Expansion (10% of fleet)
Deploy to 2-3 cells per region in parallel. This tests the patch across different instance types, service configurations, and load patterns.
At this point we discovered our first major issue: certain instance types had a kernel compatibility problem with the patch. The canary cells happened to not include any r5.4xlarge instances. Wave 2 did. We caught it, rolled back those cells, and worked with the security team on a fix.
Wave 3: Full Rollout (remaining 89%)
Deploy to all remaining cells, but with rate limiting: maximum 10 cells simultaneously, with 2-minute gaps between batches. This kept the blast radius bounded—even a bad deploy would only affect 5,000 instances before we could detect and stop it.
The emergency brake: if more than 5% of any cell's instances failed health checks post-deploy, automatically pause the entire rollout. This fired twice during the full rollout, both times correctly identifying real issues.
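The brake logic itself was not complicated. A minimal sketch of the check, assuming each cell reports a (failed, total) pair after its post-deploy health checks:

```python
def should_pause_rollout(cell_results, failure_threshold=0.05):
    """Pause the global rollout if any cell's post-deploy health-check
    failure rate exceeds the threshold (5% here).

    cell_results maps cell ID → (failed_count, total_count)."""
    for cell_id, (failed, total) in cell_results.items():
        if total and failed / total > failure_threshold:
            return True, cell_id
    return False, None

results = {"cell-1": (3, 500), "cell-2": (30, 500)}
print(should_pause_rollout(results))  # → (True, 'cell-2')
```

The hard part was not this function; it was trusting it enough to let it stop a time-critical security rollout automatically.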
The Rolling Update Problem
How do you update 500 instances in an auto-scaling group without causing a stampede?
Naive Approach: Sequential Updates
```
Update instance 1   ... wait for health check (30 sec) ... success!
Update instance 2   ... wait for health check (30 sec) ... success!
...
Update instance 500 ... wait for health check (30 sec) ... success!

Total time: 500 × 30 sec = 4.2 hours per cell
50 cells × 4.2 hours = 208 hours = 8.7 days
```
Just for the patching. Not counting failures and retries.
Sequential updates don't scale.
Parallel Approach: Controlled Chaos
We could update instances in parallel, but needed to maintain constraints:
Capacity constraint: At any moment, at least 90% of instances must be healthy and serving traffic. For a 500-instance cell, that means no more than 50 instances updating simultaneously.
Rate limiting: Don't hammer the ASG API. Space out update requests by 500ms to avoid throttling.
Health awareness: Before updating an instance, verify it's actually healthy. Don't update an instance that's already degraded.
AZ distribution: Don't update all instances in one AZ simultaneously. Spread updates across AZs to maintain fault tolerance.
```python
class CellDeploymentController:
    def __init__(self, cell, config):
        self.cell = cell
        self.max_concurrent = config.max_concurrent_updates      # 50
        self.health_check_timeout = config.health_check_timeout  # 60 sec
        self.update_interval = config.update_interval            # 500ms
        self.rate_limiter = RateLimiter(self.update_interval)    # spaces out API calls

    async def deploy(self, patch_version):
        instances = self.cell.get_healthy_instances()
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def update_instance(instance):
            async with semaphore:
                await self.rate_limiter.acquire()
                try:
                    await instance.apply_patch(patch_version)
                    await self.wait_for_health(instance)
                    return UpdateResult.SUCCESS
                except Exception as e:
                    return UpdateResult.FAILED(instance.id, str(e))

        # Group by AZ, then interleave so concurrent updates
        # spread across AZs instead of draining one AZ at a time
        by_az = self.group_by_az(instances)
        interleaved = self.interleave_azs(by_az)

        results = await asyncio.gather(*[
            update_instance(inst) for inst in interleaved
        ])
        return self.summarize_results(results)
```
With parallel updates, a 500-instance cell took about 15 minutes instead of 4 hours. The full fleet could be patched in under 24 hours (assuming no failures).
The Coordination Tax
At scale, coordination overhead dominates:
| Fleet Size | Sequential | Parallel (50) | Coordination Overhead |
|---|---|---|---|
| 100 | 50m | 3m | ~5% of total time |
| 1,000 | 8h | 25m | ~15% of total time |
| 10,000 | 83h | 4h | ~35% of total time |
| 25,000 | 208h | 10h | ~50% of total time |
At 25,000 instances, half our time was spent on coordination: checking instance health, updating state, handling retries, and managing the deployment controller. The actual patching was fast; the orchestration was slow.
Testing at Scale
You can't test 25,000 instances without, well, 25,000 instances. We didn't have a test environment at that scale. So we tested what we could and treated the production rollout as a progressive test.
Testing layers:
1. Unit tests for orchestration logic: Does the cell controller correctly handle partial failures? Does the rate limiter work?
2. Integration tests with small fleets (100 instances): Deploy test patches to a staging environment. Verify the mechanics work.
3. Canary deployments (500 instances): First production cells. Real traffic, real instances, real consequences. But limited blast radius.
4. Progressive rollout with escape hatches: Each wave is a test of the next larger scale. If wave 2 fails, we've only touched 10% of the fleet.
Architecture Deep Dive
Components
Orchestration Layer (Control Plane)
The brain of the operation. Maintained the global view: which cells exist, which are patched, which failed. Made decisions about what to deploy next based on the current state and success criteria.
State machine for each cell:
```
PENDING → DEPLOYING → VERIFYING → COMPLETE
              ↓            ↓
            FAILED → ROLLED_BACK
```
Cell Coordinator
One per cell. Responsible for deploying to instances within its cell. Reported status back to the orchestration layer. Made local decisions (retry this instance, skip that one) without needing global coordination.
Instance Update Agent
Running on each instance. Received patch commands, applied them, ran local health checks, reported status. This was existing infrastructure—we didn't build new agents for this project.
Observability System
The nervous system. Collected metrics from every component, aggregated them, alerted on anomalies. Without this, we'd be flying blind.
Data Flow
```
Orchestration Layer
    ↓ (select next cell batch)
Cell Coordinators (parallel)
    ↓ (select instance batches)
Instance Update Agents (parallel)
    ↓ (apply patch, health check)
Health Signals
    ↓ (aggregate)
Observability System
    ↓ (analyze, alert)
Orchestration Layer (continue/pause/rollback)
```
The Gotchas Nobody Warns You About
1. API Rate Limits Are Real
EC2 DescribeInstances: 100 requests/second. Sounds like a lot. But when you have 50 cell coordinators each polling instance status every 5 seconds, you're at 10 requests/second just for status checks. Add in the actual update operations, health checks, and state queries—you're over limit fast.
We implemented exponential backoff with jitter, request batching (one DescribeInstances call for up to 1000 instance IDs), and local caching of instance metadata.
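The backoff and batching pieces are straightforward to sketch. The exception type and batch size here are illustrative stand-ins, not the actual SDK types we used:

```python
import asyncio
import random

class ThrottlingError(Exception):
    """Stand-in for the SDK's throttling exception."""

async def call_with_backoff(func, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry an async API call with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the capped exponential
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            await asyncio.sleep(delay)

def batch_instance_ids(instance_ids, batch_size=1000):
    """Chunk IDs so one DescribeInstances call covers many instances
    instead of one call per instance."""
    return [instance_ids[i:i + batch_size]
            for i in range(0, len(instance_ids), batch_size)]
```

Full jitter matters: without it, all 50 cell coordinators back off and retry in lockstep, and the throttling never clears.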
2. Eventual Consistency Is Frustrating
You terminate an instance. EC2 says it's terminated. But DescribeInstances still returns it as "running" for the next 30 seconds. Your deployment coordinator thinks the instance is still there. It tries to patch it. The patch fails. The coordinator marks it as failed. The retry logic kicks in...
We added a grace period: ignore instances that changed state in the last 60 seconds. Let the state settle before making decisions.
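As a sketch of the grace-period filter, assuming we track a last-state-change timestamp per instance (the field name here is hypothetical):

```python
import time

GRACE_PERIOD_SECONDS = 60

def settled_instances(instances, now=None):
    """Drop instances whose state changed within the grace period;
    eventual consistency means their reported state may be stale."""
    now = now if now is not None else time.time()
    return [i for i in instances
            if now - i["last_state_change"] >= GRACE_PERIOD_SECONDS]

fleet = [
    {"id": "i-aaa", "last_state_change": 1000.0},  # settled
    {"id": "i-bbb", "last_state_change": 1070.0},  # changed 30s ago
]
print([i["id"] for i in settled_instances(fleet, now=1100.0)])  # → ['i-aaa']
```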
3. Health Checks Are Harder Than They Look
"Instance is running" doesn't mean "instance is healthy." An instance can be running but:
- Still booting (not serving traffic yet)
- Kernel panic'd but not terminated (rare, but happens)
- Network-partitioned from the load balancer
- Serving traffic but returning errors
We used multiple health signals: EC2 status checks, ELB health checks, application-level health endpoints, and custom metrics. An instance was healthy only if all signals agreed.
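The "all signals agree" rule reduces to a conjunction over named checks. A minimal sketch, with illustrative signal names:

```python
def instance_is_healthy(signals):
    """An instance counts as healthy only if every signal agrees.
    Signal names here are illustrative, not our actual check names."""
    required = ("ec2_status", "elb_health", "app_endpoint", "custom_metrics")
    return all(signals.get(name) is True for name in required)

print(instance_is_healthy({"ec2_status": True, "elb_health": True,
                           "app_endpoint": True, "custom_metrics": True}))  # → True
print(instance_is_healthy({"ec2_status": True, "elb_health": False,
                           "app_endpoint": True, "custom_metrics": True}))  # → False
```

Note that a missing signal counts as unhealthy. That conservatism cost us some false alarms, but it never let a broken instance slip through as "healthy by omission."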
4. Network Partitions During Deployment
Mid-rollout, a network issue caused one AZ to become unreachable from our control plane. The cell coordinator for that AZ couldn't report status. From the orchestration layer's perspective, those instances were stuck.
Our recovery: if a cell doesn't report for 10 minutes, mark it as "unknown" and skip it. Continue with other cells. Investigate the stuck cell manually. Don't let one problem region block the entire rollout.
5. The Dependency Graph Is Never Complete
We thought we knew our service dependencies. We didn't. A database update required a connection pool flush. The connection pool flush caused a brief latency spike. The latency spike triggered alerts in a downstream service. The downstream service's on-call paged us asking what we broke.
After that, we added a "blast radius map"—which services are likely to be affected by patching which other services? Not perfect, but better than discovering dependencies during an incident.
Performance Optimizations
1. Instance Metadata Caching
Fetching metadata for 25,000 instances is expensive. Tags, instance type, AZ, launch time—each requires API calls. We cached instance metadata with a 5-minute TTL. Reduced API calls by 90%.
2. Parallel API Calls with Circuit Breakers
Make API calls in parallel, but with protection. If too many calls fail, stop making new ones. This prevents one API issue from cascading into thousands of failed requests.
```python
import asyncio

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = 'CLOSED'

    async def call(self, func):
        if self.state == 'OPEN':
            raise CircuitOpenError()
        try:
            result = await func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = 'OPEN'
                asyncio.create_task(self.schedule_reset())
            raise

    async def schedule_reset(self):
        # After the cooldown, allow calls through again
        await asyncio.sleep(self.reset_timeout)
        self.failures = 0
        self.state = 'CLOSED'
```
3. Incremental State Updates
Don't write the full state on every change. Write deltas. "Instance X: DEPLOYING → COMPLETE" instead of rewriting the entire 25,000-instance state map.
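A delta log is just an append-only stream of transitions replayed over a base snapshot. A minimal sketch, with a hypothetical record format:

```python
import json

def emit_delta(instance_id, old_state, new_state):
    """One append-only delta record instead of rewriting the full state map."""
    return json.dumps({"instance": instance_id,
                       "from": old_state, "to": new_state})

def apply_deltas(state, delta_lines):
    """Rebuild current state by replaying deltas over a base snapshot."""
    for line in delta_lines:
        d = json.loads(line)
        state[d["instance"]] = d["to"]
    return state

log = [emit_delta("i-123", "DEPLOYING", "COMPLETE")]
print(apply_deltas({"i-123": "DEPLOYING"}, log))  # → {'i-123': 'COMPLETE'}
```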
4. Strategic Checkpointing
Checkpoint state every 5 minutes and after each cell completes. If the orchestration layer crashes, restart from the last checkpoint, not from the beginning.
This saved us twice. Once when the orchestration server had a memory leak and crashed. Once when we deployed a buggy coordinator update (ironic, yes).
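The checkpoint itself has one subtlety worth showing: the write must be atomic, or a crash mid-checkpoint corrupts the very file you planned to recover from. A sketch of the temp-file-plus-rename pattern (the file layout is illustrative):

```python
import json
import os
import tempfile

def checkpoint(state, path):
    """Write state atomically: temp file + rename, so a crash mid-write
    never leaves a half-written checkpoint on disk."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

def restore(path, default=None):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return default if default is not None else {}
    with open(path) as f:
        return json.load(f)
```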
Monitoring and Observability
We built a custom dashboard for this rollout:
Key Metrics:
- Deployment velocity: instances successfully patched per minute
- Success rate by cell: which cells are healthy vs. struggling
- Error distribution: what's failing and why
- Regional health: per-region aggregate metrics
- API call latency: are we getting throttled?
Alerting thresholds:
- Cell success rate < 95%: page on-call
- API throttling detected: warn, then page if sustained
- Any cell stuck for > 30 minutes: page
- Global velocity drops below 100 instances/minute: warn
The dashboard became our primary interface. We stared at it for 11 days.
What Went Wrong
The Great Timeout Cascade
Day 3. We were 40% through the rollout when everything stopped.
What happened:
An instance in us-east-1 cell-7 was having hardware issues. It accepted the patch but took 8 minutes to restart instead of the usual 30 seconds. The cell coordinator waited for the health check timeout (60 seconds), then marked it failed and moved on. Except it didn't move on—a bug in our retry logic caused it to wait for all instances in the batch before continuing, including the slow one.
The cell stalled. The orchestration layer noticed the cell wasn't making progress. It paused the entire rollout to prevent further damage. We got paged.
The investigation:
It took 2 hours to understand what happened. The slow instance wasn't obvious in the metrics—it looked like a normal "pending" instance. We had to dig through logs to find it.
The fix:
- Set a hard per-instance timeout (5 minutes). If an instance doesn't respond, skip it and continue.
- Add monitoring for "slow instances"—any instance taking more than 2x average time.
- Fix the retry logic bug.
Time lost: 4 hours.
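The hard per-instance timeout fix maps naturally onto asyncio.wait_for. A sketch, where the coroutine passed in stands for the real patch-and-verify step:

```python
import asyncio

async def update_with_deadline(update_coro, deadline_s=300):
    """Enforce a hard per-instance deadline: a stuck instance gets
    skipped after 5 minutes instead of stalling its whole batch."""
    try:
        await asyncio.wait_for(update_coro, timeout=deadline_s)
        return "SUCCESS"
    except asyncio.TimeoutError:
        return "SKIPPED_TIMEOUT"

async def demo():
    async def slow():               # simulates the stuck instance
        await asyncio.sleep(10)
    print(await update_with_deadline(slow(), deadline_s=0.1))  # → SKIPPED_TIMEOUT

asyncio.run(demo())
```

wait_for cancels the underlying task on timeout, which is exactly the semantics we wanted: stop waiting, record the skip, move on.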
The r5.4xlarge Incident
Day 5. Wave 2 revealed a kernel incompatibility.
What happened:
The security patch updated a kernel module. On r5.4xlarge instances (and only that instance type), the new module conflicted with a driver needed for the enhanced networking feature. Instances would patch, reboot, and fail to come back on the network.
Why didn't canary catch it?
Our canary cells happened to not include any r5.4xlarge instances. The service that used them was only deployed in wave 2.
The fix:
- Work with the security team on a modified patch for affected instance types.
- Roll back the 200 affected instances manually (some needed hardware replacement).
- Add instance type diversity as an explicit requirement for canary selection.
Time lost: 18 hours.
What I'd Do Differently
With hindsight:
1. Start with cellular architecture from day one. We built it mid-crisis. We should have had it ready.
2. Test canary selection. Our canary cells weren't representative. Build a tool that validates canary coverage across instance types, regions, and service configurations.
3. Invest in observability earlier. The dashboard we built for this rollout became permanent infrastructure. We should have built it before we needed it desperately.
4. Document dependencies. Our mental model of service dependencies was incomplete. A maintained dependency graph would have prevented several surprises.
5. Practice at scale. Run periodic large-scale tests. Not security patches—just no-op updates that exercise the machinery. You'll find bugs before they matter.
Key Takeaways
- Infrastructure-as-code tools have scaling limits—design for them
- Cellular architecture isn't just for runtime; it's for deployment too
- Coordination overhead grows non-linearly with fleet size
- Progressive rollout is your testing strategy at scale
- Observability is the only way to maintain confidence during large deployments
- The right abstraction layer matters: instance vs cell vs region
- Rate limits and eventual consistency will bite you—plan accordingly
- Every "simple" operation becomes complex at 25,000 instances
Implementation details have been abstracted and simplified. All examples are for illustrative purposes.
Further Reading: For more on resource classification in complex infrastructure, see How 'Production' Are You Really?.