Patching 25,000 Servers Without Breaking the Internet

November 08, 2020 · 9 min read · infrastructure, automation, scaling, deployment, distributed-systems

The Wake-Up Call

March 2020. A critical security vulnerability is announced. Your infrastructure team has 72 hours to patch over 25,000 compute instances across dozens of regions and hundreds of availability zones.

Oh, and you can't cause an outage. Customers are depending on you.

Welcome to my 2020.

The Deceptively Simple Plan

The initial design review lasted about 30 minutes:

  1. Use infrastructure-as-code tooling to orchestrate the rollout
  2. Deploy in waves across regions
  3. Leverage existing deployment tools
  4. Monitor and roll back if needed

"Should be straightforward," someone said. They were very, very wrong.

When Infrastructure-as-Code Hits Its Limits

The CloudFormation Problem

Our first instinct was to use infrastructure-as-code templates to manage the rollout. In theory, it's perfect: declarative, version-controlled, supports rolling updates.

In practice, at 25,000 instances? Not so much.

# Simplified example of the problem: a rolling update policy on a single ASG

Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingRollingUpdate:
        # This works great for 100 instances.
        # At 5,000 instances? Different story.
        MaxBatchSize: 10
        MinInstancesInService: 95          # absolute instance count, not a percentage
        MinSuccessfulInstancesPercent: 95  # percentage-based success threshold
        PauseTime: PT5M

What broke: the arithmetic. With MaxBatchSize: 10 and a five-minute pause per batch, a single 5,000-instance auto-scaling group needs 500 batches, roughly 42 hours of wall-clock time, and that's one group out of many against a 72-hour deadline.

The CodeDeploy Scaling Wall

Our deployment orchestration system had never been tested at this scale.

Issues we hit:

  1. Coordination overhead (more on this in "The Coordination Tax" below)
  2. State management
  3. Timeout cascades (see the incident writeup near the end)

Cellular Architecture: Divide and Conquer

The breakthrough came from thinking about isolation boundaries differently.

What Is a Cell?

Think of it like bulkheads on a ship. If one compartment floods, it doesn't sink the whole vessel. In our case, a cell is a fixed-size slice of the fleet, roughly 500 instances within a single availability zone, that gets deployed to, monitored, and rolled back as one unit.

Global Fleet (25,000 instances)
├── US-East-1 (8,000 instances)
│   ├── AZ-A (2,700 instances)
│   │   ├── Cell-1 (500 instances) ← Deployment unit
│   │   ├── Cell-2 (500 instances)
│   │   └── ...
│   ├── AZ-B (2,700 instances)
│   └── AZ-C (2,600 instances)
├── EU-West-1 (6,000 instances)
└── ...
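
To make that hierarchy concrete, here's a minimal sketch, in Python with made-up names, of the core idea: group instances by region and availability zone, then chunk each group into fixed-size cells. The real tooling carried far more metadata, but the partitioning itself is this simple.

from dataclasses import dataclass
from typing import Dict, List, Tuple

CELL_SIZE = 500  # instances per cell, i.e. per deployment unit

@dataclass(frozen=True)
class Instance:
    instance_id: str
    region: str
    az: str

def partition_into_cells(instances: List[Instance]) -> Dict[str, List[Instance]]:
    """Group instances by (region, AZ), then chunk each group into fixed-size cells."""
    by_zone: Dict[Tuple[str, str], List[Instance]] = {}
    for inst in instances:
        by_zone.setdefault((inst.region, inst.az), []).append(inst)

    cells: Dict[str, List[Instance]] = {}
    for (region, az), zone_instances in sorted(by_zone.items()):
        zone_instances.sort(key=lambda i: i.instance_id)  # deterministic ordering
        for n, start in enumerate(range(0, len(zone_instances), CELL_SIZE), start=1):
            cells[f"{region}/{az}/cell-{n}"] = zone_instances[start:start + CELL_SIZE]
    return cells

Deterministic ordering matters more than it looks: the same fleet snapshot should always produce the same cells, so a paused or crashed rollout can resume against the same boundaries.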

The Deployment Wave Strategy

Wave 1: Canary Cells (1% of fleet)

Wave 2: Regional Expansion (10% of fleet)

Wave 3: Full Rollout (remaining 89%)
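
In code, wave assignment is just a split of the cell list by cumulative fleet share. A hedged sketch, reusing the hypothetical partition_into_cells output from the previous section:

def assign_waves(cells: Dict[str, List[Instance]]) -> Dict[int, List[str]]:
    """Walk cells in a stable order and bucket them into three waves by cumulative fleet share."""
    total = sum(len(insts) for insts in cells.values())
    waves: Dict[int, List[str]] = {1: [], 2: [], 3: []}
    deployed = 0
    for cell_id, insts in sorted(cells.items()):
        share = deployed / total if total else 0.0
        if share < 0.01:
            waves[1].append(cell_id)   # Wave 1: canary cells (~1%)
        elif share < 0.11:
            waves[2].append(cell_id)   # Wave 2: regional expansion (~10%)
        else:
            waves[3].append(cell_id)   # Wave 3: everything else
        deployed += len(insts)
    return waves

A real scheduler would also spread each wave across regions and AZs rather than filling waves in sorted order; this sketch skips that for brevity.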

The Rolling Update Problem

Here's a fun problem: how do you update 5,000 instances in an auto-scaling group without causing a stampede?

Naive Approach: Sequential Updates

Update instance 1... wait for health check... success!
Update instance 2... wait for health check... success!
Update instance 3...
...
Update instance 5000...

Estimated completion: 347 hours

Not great.

Parallel Approach: Controlled Chaos

Key Constraints:

  • Maintain minimum healthy instance count (no customer impact)
  • Respect API rate limits (don't get throttled)
  • Monitor aggregate metrics (detect failures early)
  • Coordinate across AZs (maintain fault tolerance)
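
Here's roughly what those constraints look like wired together: bounded concurrency within a cell, a shared health floor that pauses new work when the cell dips below it, and a crude client-side throttle to stay under API rate limits. Everything below is a sketch with hypothetical names (update_one and healthy_fraction are callables you'd supply), not the real orchestrator.

import asyncio
import time

MAX_CONCURRENT_UPDATES = 100   # parallelism ceiling within one cell
MIN_HEALTHY_FRACTION = 0.95    # never dip below 95% healthy instances
API_CALLS_PER_SECOND = 20      # hypothetical budget; tune to the provider's limits

class RateLimiter:
    """Spread API calls evenly over time so the rollout stays under rate limits."""
    def __init__(self, calls_per_second: float):
        self._interval = 1.0 / calls_per_second
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)        # this call's scheduled time
            self._next_slot = slot + self._interval
        await asyncio.sleep(max(0.0, slot - now))

async def update_cell(instance_ids, update_one, healthy_fraction, limiter: RateLimiter):
    """Update every instance in one cell with bounded parallelism and a health floor."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_UPDATES)

    async def update_instance(instance_id: str) -> None:
        async with sem:
            # Hold off while the cell is below its minimum healthy fraction.
            while await healthy_fraction() < MIN_HEALTHY_FRACTION:
                await asyncio.sleep(30)
            await limiter.wait()
            await update_one(instance_id)

    await asyncio.gather(*(update_instance(i) for i in instance_ids))

The important property is that the health check and the rate limiter are shared across every in-flight update: a caller creates one RateLimiter(API_CALLS_PER_SECOND) per API and passes it to every cell, which is what keeps a wide rollout from becoming a self-inflicted throttling incident.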

The Coordination Tax

As scale increases, coordination overhead dominates:

| Fleet Size | Sequential | Parallel (10) | Parallel (100) | Coordination Overhead |
|------------|------------|---------------|----------------|-----------------------|
| 100        | 8h         | 50m           | 6m             | Negligible            |
| 1,000      | 83h        | 8.3h          | 50m            | Moderate              |
| 10,000     | 833h       | 83h           | 8.3h           | Significant           |
| 25,000     | ?          | ?             | ?              | Dominates             |
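
The timing columns assume roughly five minutes per instance (patch, reboot, health check) and zero coordination cost, which is exactly why the last row is question marks: a back-of-the-envelope like the one below stops being predictive once tracking, retrying, and reporting on tens of thousands of in-flight updates becomes the dominant cost.

MINUTES_PER_INSTANCE = 5  # rough assumption: patch + reboot + health check

def ideal_duration_hours(fleet_size: int, parallelism: int = 1) -> float:
    """Idealized wall-clock estimate that ignores coordination overhead entirely."""
    return fleet_size * MINUTES_PER_INSTANCE / parallelism / 60

for size in (100, 1_000, 10_000, 25_000):
    print(f"{size:>6} instances at parallelism 100: {ideal_duration_hours(size, 100):.1f}h")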

Testing at Scale: The Limits of Pre-Production

You can't simulate 25,000 instances in a test environment. Well, technically you can, but the cost of building and keeping a full-scale replica faithful to production outweighs what you'd learn from it.

What We Did Instead

Testing Layers:

  1. Unit tests for orchestration logic (a sketch follows this list)
  2. Integration tests with small fleets (100 instances)
  3. Canary deployments in production (500 instances)
  4. Progressive rollout with escape hatches
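
As a flavor of what "unit tests for orchestration logic" meant in practice, here's a sketch against the hypothetical partition_into_cells helper from earlier (it assumes those definitions are in scope). The properties worth pinning down are the ones the rollout machinery leans on: cells never exceed their size budget, and partitioning is deterministic.

import unittest

class PartitionTests(unittest.TestCase):
    def setUp(self):
        # 1,200 fake instances spread across three AZs in one region.
        self.fleet = [
            Instance(f"i-{n:05d}", "us-east-1", f"us-east-1{'abc'[n % 3]}")
            for n in range(1_200)
        ]

    def test_cells_respect_max_size(self):
        cells = partition_into_cells(self.fleet)
        self.assertTrue(all(len(members) <= CELL_SIZE for members in cells.values()))

    def test_partition_is_deterministic(self):
        self.assertEqual(partition_into_cells(self.fleet),
                         partition_into_cells(list(reversed(self.fleet))))

if __name__ == "__main__":
    unittest.main()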

Architecture Deep Dive

Components

1. Orchestration Layer

2. Cell Coordinator

3. Instance Update Agent

4. Observability System

Data Flow

Control Plane
    ↓ (select next cell)
Cell Coordinator
    ↓ (batch instances)
Update Orchestrator
    ↓ (parallel execution)
Instance Agents
    ↓ (health signals)
Monitoring & Feedback Loop
    ↑
Control Plane (continue/pause/rollback)
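
The feedback arrow at the bottom of that diagram is the part that matters most. A minimal sketch of the decision it encodes, with hypothetical thresholds, looks like this:

from enum import Enum

class Decision(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    ROLLBACK = "rollback"

# Hypothetical thresholds; real values would be tuned per service tier.
MAX_CELL_ERROR_RATE = 0.02      # roll the cell back above 2% failed updates
MAX_REGIONAL_ALARMS = 0         # pause on any regional health alarm

def control_plane_decision(cell_error_rate: float, regional_alarms: int) -> Decision:
    """Health signals in, one of three verdicts out: continue, pause, or rollback."""
    if cell_error_rate > MAX_CELL_ERROR_RATE:
        return Decision.ROLLBACK
    if regional_alarms > MAX_REGIONAL_ALARMS:
        return Decision.PAUSE
    return Decision.CONTINUE

The specific thresholds aren't the point; the point is that the verdict is computed from aggregate signals rather than individual instance results, so one flaky box can't stall the whole fleet.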

The Gotchas Nobody Warns You About

1. API Rate Limits Are Real
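
There's no avoiding them, only being a polite client. The standard defense is retries with exponential backoff and full jitter; a generic sketch, deliberately not tied to any particular SDK:

import random
import time

class ThrottledError(Exception):
    """Stand-in for whatever throttling error your client library raises."""

def call_with_backoff(call, max_attempts: int = 8, base_delay: float = 0.5, max_delay: float = 30.0):
    """Retry a throttled call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: pick a random delay up to the exponential cap.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

At this fleet size the jitter matters as much as the backoff: thousands of workers retrying on the same schedule just move the thundering herd a few seconds to the right.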

2. Eventual Consistency Is Frustrating

3. Health Checks Are Harder Than They Look

4. Network Partitions During Deployment

5. The Dependency Graph Is Never Complete

Performance Optimizations

1. Instance Metadata Caching

2. Parallel API Calls with Circuit Breakers

3. Incremental State Updates

4. Strategic Checkpointing
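
Checkpointing deserves a concrete illustration, because it's what turns a crashed orchestrator from a disaster into an inconvenience. A minimal sketch, assuming a plain JSON file as the store (a real system would want something more durable and concurrent-safe):

import json
from pathlib import Path

CHECKPOINT_PATH = Path("rollout_checkpoint.json")  # hypothetical location

def load_completed_cells() -> set:
    """Return the set of cell IDs already finished, or an empty set on a fresh run."""
    if CHECKPOINT_PATH.exists():
        return set(json.loads(CHECKPOINT_PATH.read_text()))
    return set()

def mark_cell_complete(cell_id: str) -> None:
    """Record progress after each cell so a restart resumes instead of starting over."""
    done = load_completed_cells()
    done.add(cell_id)
    CHECKPOINT_PATH.write_text(json.dumps(sorted(done)))

Checkpointing per cell rather than per instance is the "strategic" part: write volume stays trivial, and the worst a restart can cost is redoing one cell.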

Monitoring and Observability

Key Metrics:

  • Deployment velocity (instances/minute)
  • Success rate per cell
  • Error distribution by type
  • Regional health scores
  • API call latency and error rates

What Went Wrong

Let's be honest—not everything worked perfectly.

Incident: The Great Timeout Cascade

What I'd Do Differently

With the benefit of hindsight:

  1. Start with cellular architecture from day one

Key Takeaways

  • Infrastructure-as-code tools have scaling limits—design for them
  • Cellular architecture isn't just for runtime; it's for deployment too
  • Coordination overhead grows non-linearly with fleet size
  • Progressive rollout is your testing strategy at scale
  • Observability is the only way to maintain confidence during large deployments
  • The right abstraction layer matters: instance vs cell vs region
  • Rate limits and eventual consistency will bite you—plan accordingly

Implementation details have been abstracted and simplified. All examples are for illustrative purposes.