Patching 25,000 Servers Without Breaking the Internet
The Wake-Up Call
March 2020. A critical security vulnerability lands—the kind that gets a CVE number and a scary name. Your infrastructure team has 72 hours to patch over 25,000 compute instances across 12 regions and hundreds of availability zones.
The vulnerability was in a core system library. Every instance needed an update. The patch couldn't wait for the next maintenance window. It couldn't wait for the weekend. It needed to happen now, and it needed to happen without taking down the services that millions of users depended on.
I was on the infrastructure team responsible for making this happen. We had 47 different service types, running on instance sizes ranging from t3.small to r5.24xlarge. Some services were stateless and could handle rolling restarts. Others maintained in-memory state and needed careful coordination. A few were single points of failure that required manual intervention.
The Deceptively Simple Plan
The initial design review lasted about 30 minutes:
- Use infrastructure-as-code tooling to orchestrate the rollout
- Deploy in waves across regions
- Leverage existing deployment tools
- Monitor and rollback if needed
The plan looked solid on the whiteboard. It didn't survive contact with reality.
Our initial estimate was 48 hours. The actual time: 11 days. Most of that was debugging, retrying, and recovering from partial failures.
When Infrastructure-as-Code Hits Its Limits
The CloudFormation Problem
CloudFormation handles rolling updates well for typical deployments. You specify an UpdatePolicy, set a batch size, and let AWS handle it. For 100 instances, it's elegant.
For 25,000 instances across a dozen regions? The elegance breaks down.
```yaml
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 10
        MinInstancesInService: 95  # an absolute instance count, not a percentage
        PauseTime: PT5M
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
```
This configuration works, slowly, until you hit reality:
API throttling. CloudFormation makes dozens of API calls per instance update. At 25,000 instances, you're making hundreds of thousands of API calls. AWS rate limits kick in. Your stack update stalls. You can't even check status because DescribeStacks is also throttled.
Stack update timeouts. CloudFormation stack updates have a default timeout. Large stacks with rolling updates can exceed it. When they do, the update fails, and now you have a partially-updated fleet with no clear rollback path.
Partial failure recovery. When an update fails midway, CloudFormation tries to roll back. But rolling back 12,000 already-updated instances is its own multi-hour operation. And sometimes the rollback itself fails.
Observability gaps. CloudFormation events don't give you enough visibility. "Instance update in progress" isn't useful when you need to know which specific instances are stuck.
The CodeDeploy Scaling Wall
Our deployment orchestration system had never been tested at this scale. It was designed for hundreds of deployments per day, not thousands of instances per hour.
Coordination overhead. The system tracked deployment state in a central database. Every instance update wrote to that database. At 25,000 instances, the database became the bottleneck. Write latency spiked from 5ms to 500ms. The deployment coordinator couldn't keep up.
State management. Tracking the state of 25,000 instances sounds simple—just store instance ID and status. But you also need to track: which patch version, when it started, health check results, retry count, error messages, and dependencies. The state management code wasn't designed for this volume.
Timeout cascades. One stuck instance would cause its batch to timeout. The batch timeout would cause the region rollout to pause. The pause would cause the global coordinator to wait. One bad instance could stall the entire operation for hours while we manually investigated.
Cellular Architecture: Divide and Conquer
The breakthrough came from thinking about the problem differently. Instead of treating 25,000 instances as one big deployment, we treated them as 50 independent deployments of 500 instances each.
What Is a Cell?
Think of it like bulkheads on a ship. If one compartment floods, it doesn't sink the whole vessel.
```
Global Fleet (25,000 instances)
├── US-East-1 (8,000 instances)
│   ├── AZ-A (2,700 instances)
│   │   ├── Cell-1 (500 instances) ← independent deployment unit
│   │   ├── Cell-2 (500 instances)
│   │   ├── Cell-3 (500 instances)
│   │   └── ...
│   ├── AZ-B (2,700 instances)
│   └── AZ-C (2,600 instances)
├── EU-West-1 (6,000 instances)
│   └── ...
└── AP-Southeast-1 (4,000 instances)
    └── ...
```
Cell sizing was a tradeoff. Too small (50 instances) meant excessive coordination overhead—you're managing 500 separate deployments. Too large (2,000 instances) meant large blast radius—one bad deploy affects 2,000 instances. We settled on 500 instances per cell based on empirical testing.
The Deployment Wave Strategy
Wave 1: Canary Cells (1% of fleet)
Pick one cell in each region. Deploy to those cells first. Watch everything:
- Error rates in the 15 minutes post-deployment
- Latency percentiles (p50, p95, p99)
- CPU and memory metrics
- Application-specific health checks
- Customer-facing error rates
Success criteria: No statistically significant degradation in any metric. We used a simple threshold: if any metric moved more than 2 standard deviations from baseline, stop and investigate.
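That threshold check is simple enough to sketch in a few lines. This is an illustration, not our production code; the metric names and sample values are made up:

```python
import statistics

def exceeds_baseline(baseline_samples, current_value, n_sigma=2.0):
    """Return True if current_value deviates more than n_sigma
    standard deviations from the baseline mean."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)
    return abs(current_value - mean) > n_sigma * stdev

# Example: p99 latency samples (ms) from the window before the deploy
baseline = [120, 118, 125, 122, 119, 121, 123, 120]
print(exceeds_baseline(baseline, 124))  # → False (within 2 sigma)
print(exceeds_baseline(baseline, 180))  # → True (clearly degraded)
```

The real system compared many metrics at once, but each comparison reduced to this shape: a baseline window, a post-deploy value, and a sigma threshold.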
Wave 2: Regional Expansion (10% of fleet)
Deploy to 2-3 cells per region in parallel. This tests the patch across different instance types, service configurations, and load patterns.
At this point we discovered our first major issue: certain instance types had a kernel compatibility problem with the patch. The canary cells happened to not include any r5.4xlarge instances. Wave 2 did. We caught it, rolled back those cells, and worked with the security team on a fix.
Wave 3: Full Rollout (remaining 89%)
Deploy to all remaining cells, but with rate limiting: maximum 10 cells simultaneously, with 2-minute gaps between batches. This kept the blast radius bounded—even a bad deploy would only affect 5,000 instances before we could detect and stop it.
The emergency brake: if more than 5% of any cell's instances failed health checks post-deploy, automatically pause the entire rollout. This fired twice during the full rollout, both times correctly identifying real issues.
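The brake logic itself was not complicated. A minimal sketch of the check, assuming each cell reports a (failed, total) pair after its post-deploy health checks:

```python
def should_pause_rollout(cell_results, failure_threshold=0.05):
    """Pause the global rollout if any cell's post-deploy health-check
    failure rate exceeds the threshold (5% here).

    cell_results maps cell ID → (failed_count, total_count)."""
    for cell_id, (failed, total) in cell_results.items():
        if total and failed / total > failure_threshold:
            return True, cell_id
    return False, None

results = {"cell-1": (3, 500), "cell-2": (30, 500)}
print(should_pause_rollout(results))  # → (True, 'cell-2')
```

The hard part was not this function; it was trusting it enough to let it stop a time-critical security rollout automatically.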
The Rolling Update Problem
How do you update 500 instances in an auto-scaling group without causing a stampede?
Naive Approach: Sequential Updates
```
Update instance 1   ... wait for health check (30 sec) ... success!
Update instance 2   ... wait for health check (30 sec) ... success!
...
Update instance 500 ... wait for health check (30 sec) ... success!

Total time: 500 × 30 sec = 4.2 hours per cell
50 cells × 4.2 hours = 208 hours = 8.7 days
```
Just for the patching. Not counting failures and retries.
Sequential updates don't scale.
Parallel Approach: Controlled Chaos
We could update instances in parallel, but needed to maintain constraints:
Capacity constraint: At any moment, at least 90% of instances must be healthy and serving traffic. For a 500-instance cell, that means no more than 50 instances updating simultaneously.
Rate limiting: Don't hammer the ASG API. Space out update requests by 500ms to avoid throttling.
Health awareness: Before updating an instance, verify it's actually healthy. Don't update an instance that's already degraded.
AZ distribution: Don't update all instances in one AZ simultaneously. Spread updates across AZs to maintain fault tolerance.
```python
class CellDeploymentController:
    def __init__(self, cell, config):
        self.cell = cell
        self.max_concurrent = config.max_concurrent_updates      # 50
        self.health_check_timeout = config.health_check_timeout  # 60 sec
        self.update_interval = config.update_interval            # 500ms
        self.rate_limiter = RateLimiter(self.update_interval)    # spaces out API calls

    async def deploy(self, patch_version):
        instances = self.cell.get_healthy_instances()
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def update_instance(instance):
            async with semaphore:
                await self.rate_limiter.acquire()
                try:
                    await instance.apply_patch(patch_version)
                    await self.wait_for_health(instance)
                    return UpdateResult.SUCCESS
                except Exception as e:
                    return UpdateResult.FAILED(instance.id, str(e))

        # Group by AZ, then interleave so concurrent updates
        # spread across AZs instead of draining one AZ at a time
        by_az = self.group_by_az(instances)
        interleaved = self.interleave_azs(by_az)

        results = await asyncio.gather(*[
            update_instance(inst) for inst in interleaved
        ])
        return self.summarize_results(results)
```
With parallel updates, a 500-instance cell took about 15 minutes instead of 4 hours. The full fleet could be patched in under 24 hours (assuming no failures).
The Coordination Tax
At scale, coordination overhead dominates:
| Fleet Size | Sequential | Parallel (50) | Coordination Overhead |
|---|---|---|---|
| 100 | 50m | 3m | ~5% of total time |
| 1,000 | 8h | 25m | ~15% of total time |
| 10,000 | 83h | 4h | ~35% of total time |
| 25,000 | 208h | 10h | ~50% of total time |
At 25,000 instances, half our time was spent on coordination: checking instance health, updating state, handling retries, and managing the deployment controller. The actual patching was fast; the orchestration was slow.
Testing at Scale
You can't test 25,000 instances without, well, 25,000 instances. We didn't have a test environment at that scale. So we tested what we could and treated the production rollout as a progressive test.
Testing layers:
1. Unit tests for orchestration logic: Does the cell controller correctly handle partial failures? Does the rate limiter work?
2. Integration tests with small fleets (100 instances): Deploy test patches to a staging environment. Verify the mechanics work.
3. Canary deployments (500 instances): First production cells. Real traffic, real instances, real consequences. But limited blast radius.
4. Progressive rollout with escape hatches: Each wave is a test of the next larger scale. If wave 2 fails, we've only touched 10% of the fleet.
Architecture Deep Dive
Components
Orchestration Layer (Control Plane)
The brain of the operation. Maintained the global view: which cells exist, which are patched, which failed. Made decisions about what to deploy next based on the current state and success criteria.
State machine for each cell:
```
PENDING → DEPLOYING → VERIFYING → COMPLETE
              ↓            ↓
            FAILED → ROLLED_BACK
```
Cell Coordinator
One per cell. Responsible for deploying to instances within its cell. Reported status back to the orchestration layer. Made local decisions (retry this instance, skip that one) without needing global coordination.
Instance Update Agent
Running on each instance. Received patch commands, applied them, ran local health checks, reported status. This was existing infrastructure—we didn't build new agents for this project.
Observability System
The nervous system. Collected metrics from every component, aggregated them, alerted on anomalies. Without this, we'd be flying blind.
Data Flow
```
Orchestration Layer
    ↓ (select next cell batch)
Cell Coordinators (parallel)
    ↓ (select instance batches)
Instance Update Agents (parallel)
    ↓ (apply patch, health check)
Health Signals
    ↓ (aggregate)
Observability System
    ↓ (analyze, alert)
Orchestration Layer (continue/pause/rollback)
```
The Gotchas Nobody Warns You About
1. API Rate Limits Are Real
EC2 DescribeInstances: 100 requests/second. Sounds like a lot. But when you have 50 cell coordinators each polling instance status every 5 seconds, you're at 10 requests/second just for status checks. Add in the actual update operations, health checks, and state queries—you're over limit fast.
We implemented exponential backoff with jitter, request batching (one DescribeInstances call for up to 1000 instance IDs), and local caching of instance metadata.
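The backoff and batching pieces are straightforward to sketch. The exception type and batch size here are illustrative stand-ins, not the actual SDK types we used:

```python
import asyncio
import random

class ThrottlingError(Exception):
    """Stand-in for the SDK's throttling exception."""

async def call_with_backoff(func, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry an async API call with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the capped exponential
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            await asyncio.sleep(delay)

def batch_instance_ids(instance_ids, batch_size=1000):
    """Chunk IDs so one DescribeInstances call covers many instances
    instead of one call per instance."""
    return [instance_ids[i:i + batch_size]
            for i in range(0, len(instance_ids), batch_size)]
```

Full jitter matters: without it, all 50 cell coordinators back off and retry in lockstep, and the throttling never clears.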
2. Eventual Consistency Is Frustrating
You terminate an instance. EC2 says it's terminated. But DescribeInstances still returns it as "running" for the next 30 seconds. Your deployment coordinator thinks the instance is still there. It tries to patch it. The patch fails. The coordinator marks it as failed. The retry logic kicks in...
We added a grace period: ignore instances that changed state in the last 60 seconds. Let the state settle before making decisions.
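As a sketch of the grace-period filter, assuming we track a last-state-change timestamp per instance (the field name here is hypothetical):

```python
import time

GRACE_PERIOD_SECONDS = 60

def settled_instances(instances, now=None):
    """Drop instances whose state changed within the grace period;
    eventual consistency means their reported state may be stale."""
    now = now if now is not None else time.time()
    return [i for i in instances
            if now - i["last_state_change"] >= GRACE_PERIOD_SECONDS]

fleet = [
    {"id": "i-aaa", "last_state_change": 1000.0},  # settled
    {"id": "i-bbb", "last_state_change": 1070.0},  # changed 30s ago
]
print([i["id"] for i in settled_instances(fleet, now=1100.0)])  # → ['i-aaa']
```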
3. Health Checks Are Harder Than They Look
"Instance is running" doesn't mean "instance is healthy." An instance can be running but:
- Still booting (not serving traffic yet)
- Kernel panic'd but not terminated (rare, but happens)
- Network-partitioned from the load balancer
- Serving traffic but returning errors
We used multiple health signals: EC2 status checks, ELB health checks, application-level health endpoints, and custom metrics. An instance was healthy only if all signals agreed.
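The "all signals agree" rule reduces to a conjunction over named checks. A minimal sketch, with illustrative signal names:

```python
def instance_is_healthy(signals):
    """An instance counts as healthy only if every signal agrees.
    Signal names here are illustrative, not our actual check names."""
    required = ("ec2_status", "elb_health", "app_endpoint", "custom_metrics")
    return all(signals.get(name) is True for name in required)

print(instance_is_healthy({"ec2_status": True, "elb_health": True,
                           "app_endpoint": True, "custom_metrics": True}))  # → True
print(instance_is_healthy({"ec2_status": True, "elb_health": False,
                           "app_endpoint": True, "custom_metrics": True}))  # → False
```

Note that a missing signal counts as unhealthy. That conservatism cost us some false alarms, but it never let a broken instance slip through as "healthy by omission."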
4. Network Partitions During Deployment
Mid-rollout, a network issue caused one AZ to become unreachable from our control plane. The cell coordinator for that AZ couldn't report status. From the orchestration layer's perspective, those instances were stuck.
Our recovery: if a cell doesn't report for 10 minutes, mark it as "unknown" and skip it. Continue with other cells. Investigate the stuck cell manually. Don't let one problem region block the entire rollout.
5. The Dependency Graph Is Never Complete
We thought we knew our service dependencies. We didn't. A database update required a connection pool flush. The connection pool flush caused a brief latency spike. The latency spike triggered alerts in a downstream service. The downstream service's on-call paged us asking what we broke.
After that, we added a "blast radius map"—which services are likely to be affected by patching which other services? Not perfect, but better than discovering dependencies during an incident.
Performance Optimizations
1. Instance Metadata Caching
Fetching metadata for 25,000 instances is expensive. Tags, instance type, AZ, launch time—each requires API calls. We cached instance metadata with a 5-minute TTL. Reduced API calls by 90%.
2. Parallel API Calls with Circuit Breakers
Make API calls in parallel, but with protection. If too many calls fail, stop making new ones. This prevents one API issue from cascading into thousands of failed requests.
```python
import asyncio

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = 'CLOSED'

    async def call(self, func):
        if self.state == 'OPEN':
            raise CircuitOpenError()
        try:
            result = await func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = 'OPEN'
                asyncio.create_task(self.schedule_reset())
            raise

    async def schedule_reset(self):
        # After the cooldown, allow calls through again
        await asyncio.sleep(self.reset_timeout)
        self.failures = 0
        self.state = 'CLOSED'
```
3. Incremental State Updates
Don't write the full state on every change. Write deltas. "Instance X: DEPLOYING → COMPLETE" instead of rewriting the entire 25,000-instance state map.
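A delta log is just an append-only stream of transitions replayed over a base snapshot. A minimal sketch, with a hypothetical record format:

```python
import json

def emit_delta(instance_id, old_state, new_state):
    """One append-only delta record instead of rewriting the full state map."""
    return json.dumps({"instance": instance_id,
                       "from": old_state, "to": new_state})

def apply_deltas(state, delta_lines):
    """Rebuild current state by replaying deltas over a base snapshot."""
    for line in delta_lines:
        d = json.loads(line)
        state[d["instance"]] = d["to"]
    return state

log = [emit_delta("i-123", "DEPLOYING", "COMPLETE")]
print(apply_deltas({"i-123": "DEPLOYING"}, log))  # → {'i-123': 'COMPLETE'}
```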
4. Strategic Checkpointing
Checkpoint state every 5 minutes and after each cell completes. If the orchestration layer crashes, restart from the last checkpoint, not from the beginning.
This saved us twice. Once when the orchestration server had a memory leak and crashed. Once when we deployed a buggy coordinator update (ironic, yes).
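The checkpoint itself has one subtlety worth showing: the write must be atomic, or a crash mid-checkpoint corrupts the very file you planned to recover from. A sketch of the temp-file-plus-rename pattern (the file layout is illustrative):

```python
import json
import os
import tempfile

def checkpoint(state, path):
    """Write state atomically: temp file + rename, so a crash mid-write
    never leaves a half-written checkpoint on disk."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

def restore(path, default=None):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return default if default is not None else {}
    with open(path) as f:
        return json.load(f)
```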
Monitoring and Observability
We built a custom dashboard for this rollout:
Key Metrics:
- Deployment velocity: instances successfully patched per minute
- Success rate by cell: which cells are healthy vs. struggling
- Error distribution: what's failing and why
- Regional health: per-region aggregate metrics
- API call latency: are we getting throttled?
Alerting thresholds:
- Cell success rate < 95%: page on-call
- API throttling detected: warn, then page if sustained
- Any cell stuck for > 30 minutes: page
- Global velocity drops below 100 instances/minute: warn
The dashboard became our primary interface. We stared at it for 11 days.
What Went Wrong
The Great Timeout Cascade
Day 3. We were 40% through the rollout when everything stopped.
What happened:
An instance in us-east-1 cell-7 was having hardware issues. It accepted the patch but took 8 minutes to restart instead of the usual 30 seconds. The cell coordinator waited for the health check timeout (60 seconds), then marked it failed and moved on. Except it didn't move on—a bug in our retry logic caused it to wait for all instances in the batch before continuing, including the slow one.
The cell stalled. The orchestration layer noticed the cell wasn't making progress. It paused the entire rollout to prevent further damage. We got paged.
The investigation:
It took 2 hours to understand what happened. The slow instance wasn't obvious in the metrics—it looked like a normal "pending" instance. We had to dig through logs to find it.
The fix:
- Set a hard per-instance timeout (5 minutes). If an instance doesn't respond, skip it and continue.
- Add monitoring for "slow instances"—any instance taking more than 2x average time.
- Fix the retry logic bug.
Time lost: 4 hours.
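The hard per-instance timeout fix maps naturally onto asyncio.wait_for. A sketch, where the coroutine passed in stands for the real patch-and-verify step:

```python
import asyncio

async def update_with_deadline(update_coro, deadline_s=300):
    """Enforce a hard per-instance deadline: a stuck instance gets
    skipped after 5 minutes instead of stalling its whole batch."""
    try:
        await asyncio.wait_for(update_coro, timeout=deadline_s)
        return "SUCCESS"
    except asyncio.TimeoutError:
        return "SKIPPED_TIMEOUT"

async def demo():
    async def slow():               # simulates the stuck instance
        await asyncio.sleep(10)
    print(await update_with_deadline(slow(), deadline_s=0.1))  # → SKIPPED_TIMEOUT

asyncio.run(demo())
```

wait_for cancels the underlying task on timeout, which is exactly the semantics we wanted: stop waiting, record the skip, move on.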
The r5.4xlarge Incident
Day 5. Wave 2 revealed a kernel incompatibility.
What happened:
The security patch updated a kernel module. On r5.4xlarge instances (and only that instance type), the new module conflicted with a driver needed for the enhanced networking feature. Instances would patch, reboot, and fail to come back on the network.
Why didn't canary catch it?
Our canary cells happened to not include any r5.4xlarge instances. The service that used them was only deployed in wave 2.
The fix:
- Work with the security team on a modified patch for affected instance types.
- Roll back the 200 affected instances manually (some needed hardware replacement).
- Add instance type diversity as an explicit requirement for canary selection.
Time lost: 18 hours.
What I'd Do Differently
With hindsight:
1. Start with cellular architecture from day one. We built it mid-crisis. We should have had it ready.
2. Test canary selection. Our canary cells weren't representative. Build a tool that validates canary coverage across instance types, regions, and service configurations.
3. Invest in observability earlier. The dashboard we built for this rollout became permanent infrastructure. We should have built it before we needed it desperately.
4. Document dependencies. Our mental model of service dependencies was incomplete. A maintained dependency graph would have prevented several surprises.
5. Practice at scale. Run periodic large-scale tests. Not security patches—just no-op updates that exercise the machinery. You'll find bugs before they matter.
Key Takeaways
- Infrastructure-as-code tools have scaling limits—design for them
- Cellular architecture isn't just for runtime; it's for deployment too
- Coordination overhead grows non-linearly with fleet size
- Progressive rollout is your testing strategy at scale
- Observability is the only way to maintain confidence during large deployments
- The right abstraction layer matters: instance vs cell vs region
- Rate limits and eventual consistency will bite you—plan accordingly
- Every "simple" operation becomes complex at 25,000 instances
Implementation details have been abstracted and simplified. All examples are for illustrative purposes.
Further Reading: For more on resource classification in complex infrastructure, see How 'Production' Are You Really?.