Patching 25,000 Servers Without Breaking the Internet
The Wake-Up Call
March 2020. A critical security vulnerability is announced. Your infrastructure team has 72 hours to patch over 25,000 compute instances across dozens of regions and hundreds of availability zones.
Oh, and you can't cause an outage. Customers are depending on you.
Welcome to my 2020.
The Deceptively Simple Plan
The initial design review lasted about 30 minutes:
- Use infrastructure-as-code tooling to orchestrate the rollout
- Deploy in waves across regions
- Leverage existing deployment tools
- Monitor and rollback if needed
"Should be straightforward," someone said. They were very, very wrong.
When Infrastructure-as-Code Hits Its Limits
The CloudFormation Problem
Our first instinct was to use infrastructure-as-code templates to manage the rollout. In theory, it's perfect: declarative, version-controlled, supports rolling updates.
In practice, at 25,000 instances? Not so much.
```yaml
# Simplified example of the problem
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingRollingUpdate:
        # This works great for 100 instances.
        # At 5,000 instances? Different story.
        MaxBatchSize: 10
        MinInstancesInService: 95   # an instance count, i.e. 95% of a 100-instance group
        PauseTime: PT5M
```
What broke: the arithmetic. With a batch size of 10 and a five-minute pause, a single 5,000-instance group takes more than 40 hours to roll, and a failed signal anywhere along the way sends CloudFormation into a rollback of everything it just did.
The CodeDeploy Scaling Wall
Our deployment orchestration system had never been tested at this scale.
Issues we hit:
- Coordination Overhead: a single orchestrator tracking thousands of in-flight updates spends more time on bookkeeping than on deploying
- State Management: keeping an accurate, current picture of which instances were patched, mid-patch, or failed
- Timeout Cascades: one slow batch blows a timeout, the retry makes the backlog longer, and the next batch times out too (more on this incident later)
Cellular Architecture: Divide and Conquer
The breakthrough came from thinking about isolation boundaries differently.
What Is a Cell?
Think of it like bulkheads on a ship. If one compartment floods, it doesn't sink the whole vessel. A cell is a fixed-size slice of the fleet, here roughly 500 instances inside a single AZ, that can be updated, monitored, and rolled back independently of every other cell.
```
Global Fleet (25,000 instances)
├── US-East-1 (8,000 instances)
│   ├── AZ-A (2,700 instances)
│   │   ├── Cell-1 (500 instances) ← Deployment unit
│   │   ├── Cell-2 (500 instances)
│   │   └── ...
│   ├── AZ-B (2,700 instances)
│   └── AZ-C (2,600 instances)
├── EU-West-1 (6,000 instances)
└── ...
```
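
Here's a minimal sketch in Python of how that hierarchy can be modeled and how an AZ gets partitioned into cells. The class and function names are hypothetical, not a production schema, and the 500-instance cell size simply mirrors the diagram above.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    """Smallest deployment unit: a fixed-size slice of one AZ."""
    cell_id: str
    instance_ids: list[str]

@dataclass
class AvailabilityZone:
    az_id: str
    cells: list[Cell] = field(default_factory=list)

@dataclass
class Region:
    region_id: str
    azs: list[AvailabilityZone] = field(default_factory=list)

def partition_into_cells(az_id: str, instance_ids: list[str], cell_size: int = 500) -> list[Cell]:
    """Slice an AZ's instances into fixed-size cells, the unit of deployment and rollback."""
    return [
        Cell(cell_id=f"{az_id}-cell-{i // cell_size + 1}",
             instance_ids=instance_ids[i:i + cell_size])
        for i in range(0, len(instance_ids), cell_size)
    ]

# Example: 2,700 instances in one AZ become six cells (5 x 500 + 1 x 200).
az_instances = [f"i-{n:05d}" for n in range(2700)]
cells = partition_into_cells("us-east-1a", az_instances)
print(len(cells), [len(c.instance_ids) for c in cells])
```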
The Deployment Wave Strategy
- Wave 1: Canary Cells (1% of fleet)
- Wave 2: Regional Expansion (10% of fleet)
- Wave 3: Full Rollout (remaining 89%)
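
To make the split concrete, here's a sketch of assigning cells to waves. The 1% / 10% / 89% proportions come from the list above; the shuffle-and-slice selection is an illustrative assumption, not necessarily how the real planner picked its canaries.

```python
import math
import random

def plan_waves(cells: list[str], canary_frac: float = 0.01, expand_frac: float = 0.10,
               seed: int = 42) -> dict[str, list[str]]:
    """Split cells into canary (1%), regional expansion (10%), and full rollout (89%)."""
    shuffled = cells[:]
    random.Random(seed).shuffle(shuffled)  # deterministic but arbitrary ordering
    n = len(shuffled)
    n_canary = max(1, math.ceil(n * canary_frac))
    n_expand = max(1, math.ceil(n * expand_frac))
    return {
        "wave1_canary": shuffled[:n_canary],
        "wave2_expansion": shuffled[n_canary:n_canary + n_expand],
        "wave3_full": shuffled[n_canary + n_expand:],
    }

# 25,000 instances at 500 per cell is 50 cells total.
all_cells = [f"cell-{i:03d}" for i in range(50)]
waves = plan_waves(all_cells)
print({name: len(c) for name, c in waves.items()})
# {'wave1_canary': 1, 'wave2_expansion': 5, 'wave3_full': 44}
```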
The Rolling Update Problem
Here's a fun problem: how do you update 5,000 instances in an auto-scaling group without causing a stampede?
Naive Approach: Sequential Updates
```
Update instance 1... wait for health check... success!
Update instance 2... wait for health check... success!
Update instance 3...
...
Update instance 5000...

Estimated completion: 347 hours
```
Not great.
Parallel Approach: Controlled Chaos
Key Constraints:
- Maintain minimum healthy instance count (no customer impact)
- Respect API rate limits (don't get throttled)
- Monitor aggregate metrics (detect failures early)
- Coordinate across AZs (maintain fault tolerance)
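
Here's a minimal sketch of the control logic those constraints imply: bounded concurrency, a floor on healthy capacity, and crude pacing of API calls. `patch_instance` is a hypothetical stand-in for the real update agent plus its health check, and every number is illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MIN_HEALTHY_FRACTION = 0.95   # never take more than 5% of a cell out of service
MAX_PARALLEL = 25             # hard cap on concurrent updates
API_CALLS_PER_SECOND = 20     # crude pacing to stay under control-plane throttles

def patch_instance(instance_id: str) -> bool:
    """Hypothetical stand-in for the real update agent and post-patch health check."""
    time.sleep(0.01)  # placeholder for the real work
    return True

def update_cell(instance_ids: list[str]) -> None:
    """Update one cell in batches while keeping the healthy-instance floor intact."""
    min_healthy = int(len(instance_ids) * MIN_HEALTHY_FRACTION)
    batch_size = min(MAX_PARALLEL, max(1, len(instance_ids) - min_healthy))
    healthy = set(instance_ids)

    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(instance_ids), batch_size):
            batch = instance_ids[start:start + batch_size]
            healthy -= set(batch)                       # this batch leaves service
            futures = {}
            for iid in batch:
                futures[pool.submit(patch_instance, iid)] = iid
                time.sleep(1.0 / API_CALLS_PER_SECOND)  # pace the calls
            for fut in as_completed(futures):
                if fut.result():
                    healthy.add(futures[fut])           # back in service
            if len(healthy) < min_healthy:
                raise RuntimeError("healthy floor breached: pause the rollout")

update_cell([f"i-{n:04d}" for n in range(100)])  # small cell for illustration
print("cell updated")
```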
The Coordination Tax
As scale increases, coordination overhead dominates:
| Fleet Size | Sequential | Parallel (10) | Parallel (100) | Coordination Overhead |
|------------|------------|---------------|----------------|-----------------------|
| 100        | 8h         | 50m           | 6m             | Negligible            |
| 1,000      | 83h        | 8.3h          | 50m            | Moderate              |
| 10,000     | 833h       | 83h           | 8.3h           | Significant           |
| 25,000     | ?          | ?             | ?              | Dominates             |
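
The table's numbers fall out of simple arithmetic, assuming roughly five minutes of update-plus-health-check time per instance (the same ballpark as the 347-hour sequential estimate above). What the formula deliberately ignores is coordination overhead, which is exactly the part that dominates at 25,000.

```python
def rollout_hours(fleet_size: int, parallelism: int, minutes_per_instance: float = 5.0) -> float:
    """Wall-clock estimate that ignores coordination overhead entirely.
    The default of ~5 minutes per instance reproduces the table above."""
    return fleet_size * minutes_per_instance / parallelism / 60

for fleet in (100, 1_000, 10_000, 25_000):
    print(fleet,
          round(rollout_hours(fleet, 1), 1),     # sequential
          round(rollout_hours(fleet, 10), 1),    # parallel (10)
          round(rollout_hours(fleet, 100), 1))   # parallel (100)
```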
Testing at Scale: The Limits of Pre-Production
You can't simulate 25,000 instances in a test environment. Well, you can, but the bill is staggering and the simulation still won't reproduce production traffic patterns, data shapes, or the failure modes that only show up under real load.
What We Did Instead
Testing Layers:
- Unit tests for orchestration logic
- Integration tests with small fleets (100 instances)
- Canary deployments in production (500 instances)
- Progressive rollout with escape hatches
Architecture Deep Dive
Components
1. Orchestration Layer: owns the global plan; decides which cell goes next and whether to continue, pause, or roll back
2. Cell Coordinator: batches the instances inside a cell and tracks their state through the update
3. Instance Update Agent: runs on each host, applies the patch, and reports health back up
4. Observability System: aggregates health signals across cells and regions and feeds the go/no-go decision
Data Flow
```
Control Plane
    ↓ (select next cell)
Cell Coordinator
    ↓ (batch instances)
Update Orchestrator
    ↓ (parallel execution)
Instance Agents
    ↓ (health signals)
Monitoring & Feedback Loop
    ↑
Control Plane (continue/pause/rollback)
```
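
In code, the loop at the top of that diagram is a small state machine: deploy a cell, read the aggregate signals, then continue, pause, or roll back. In this sketch, `deploy_cell` and `aggregate_health` are hypothetical stand-ins, and the thresholds are illustrative.

```python
import enum

class Decision(enum.Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    ROLLBACK = "rollback"

def aggregate_health(cell_id: str) -> float:
    """Hypothetical: fraction of the cell passing health checks after the update."""
    return 1.0

def deploy_cell(cell_id: str) -> None:
    """Hypothetical: hand the cell to the update orchestrator and block until done."""

def control_loop(cells: list[str], pause_below: float = 0.99, rollback_below: float = 0.95) -> Decision:
    """Drive the rollout cell by cell, gating each step on aggregate health."""
    for cell_id in cells:
        deploy_cell(cell_id)
        health = aggregate_health(cell_id)
        if health < rollback_below:
            return Decision.ROLLBACK       # widespread failure: undo this cell
        if health < pause_below:
            return Decision.PAUSE          # degraded: stop and get a human involved
    return Decision.CONTINUE               # all cells healthy: proceed to the next wave

print(control_loop([f"cell-{i:03d}" for i in range(5)]))
```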
The Gotchas Nobody Warns You About
1. API Rate Limits Are Real: the control plane throttles you long before you can issue 25,000 instances' worth of calls, so every call needs backoff (see the sketch after this list)
2. Eventual Consistency Is Frustrating: what one API reports as done can take seconds or minutes to show up in another
3. Health Checks Are Harder Than They Look: answering on a port is not the same as serving customer traffic correctly
4. Network Partitions During Deployment: an agent that goes silent mid-update is indistinguishable from one that failed
5. The Dependency Graph Is Never Complete: something always depends on the thing you're restarting, and you discover it mid-deployment
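
For the first gotcha, the standard defense is retrying every control-plane call with exponential backoff and jitter. A minimal sketch, with `describe_fleet_page` standing in for any throttle-prone call; the error type and numbers are illustrative, not a real SDK's.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for the throttling error your cloud SDK raises."""

def describe_fleet_page(page: int) -> list[str]:
    """Hypothetical control-plane call that occasionally gets throttled."""
    if random.random() < 0.3:
        raise ThrottledError
    return [f"i-{page:03d}{n:02d}" for n in range(10)]

def call_with_backoff(fn, *args, max_attempts: int = 6, base_delay: float = 0.5):
    """Retry on throttling with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, base_delay * 2 ** attempt)  # full jitter
            time.sleep(delay)

instances = call_with_backoff(describe_fleet_page, 1)
print(len(instances), "instances described")
```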
Performance Optimizations
1. Instance Metadata Caching: cache describe results instead of re-querying per instance
2. Parallel API Calls with Circuit Breakers: fan out calls, but stop hammering an API that is already failing (sketch after this list)
3. Incremental State Updates: persist deltas, not full fleet snapshots, on every state change
4. Strategic Checkpointing: record progress at cell boundaries so a restart resumes instead of starting over
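
Optimization 2 pairs fan-out with a breaker that trips after consecutive failures, so a struggling API isn't hammered by every worker at once. A minimal, thread-safe sketch with illustrative thresholds:

```python
import threading
import time

class CircuitBreaker:
    """Trip open after consecutive failures; allow a retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._failures = 0
        self._opened_at = 0.0
        self._lock = threading.Lock()

    def call(self, fn, *args):
        with self._lock:
            if self._failures >= self.failure_threshold:
                if time.monotonic() - self._opened_at < self.cooldown_seconds:
                    raise RuntimeError("circuit open: skipping call")
                self._failures = 0  # half-open: let one attempt through
        try:
            result = fn(*args)
        except Exception:
            with self._lock:
                self._failures += 1
                self._opened_at = time.monotonic()
            raise
        with self._lock:
            self._failures = 0      # success closes the circuit
        return result

breaker = CircuitBreaker()
print(breaker.call(lambda x: x * 2, 21))  # 42
```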
Monitoring and Observability
Key Metrics:
- Deployment velocity (instances/minute)
- Success rate per cell
- Error distribution by type
- Regional health scores
- API call latency and error rates
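
Those metrics only matter if something acts on them automatically. Here's a sketch of the kind of gate that can hang off the feedback loop: compare a cell's signals against thresholds and emit a go / no-go decision. The field names and thresholds are illustrative, not production values.

```python
from dataclasses import dataclass

@dataclass
class CellMetrics:
    success_rate: float        # fraction of instances updated successfully
    error_rate_delta: float    # change in customer-facing error rate vs baseline
    p99_latency_ms: float      # post-update p99 latency
    api_error_rate: float      # control-plane call failures during the update

def gate(m: CellMetrics) -> str:
    """Turn one cell's metrics into a go / no-go decision."""
    if m.success_rate < 0.95 or m.error_rate_delta > 0.02:
        return "rollback"
    if m.p99_latency_ms > 500 or m.api_error_rate > 0.10:
        return "pause"
    return "continue"

print(gate(CellMetrics(success_rate=0.999, error_rate_delta=0.0,
                       p99_latency_ms=180, api_error_rate=0.01)))  # continue
```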
What Went Wrong
Let's be honest—not everything worked perfectly.
Incident: The Great Timeout Cascade
What I'd Do Differently
With the benefit of hindsight:
- Start with cellular architecture from day one
Key Takeaways
- Infrastructure-as-code tools have scaling limits—design for them
- Cellular architecture isn't just for runtime; it's for deployment too
- Coordination overhead grows non-linearly with fleet size
- Progressive rollout is your testing strategy at scale
- Observability is the only way to maintain confidence during large deployments
- The right abstraction layer matters: instance vs cell vs region
- Rate limits and eventual consistency will bite you—plan accordingly
Implementation details have been abstracted and simplified. All examples are for illustrative purposes.