How 'Production' Are You Really? Building a Resource Classification System
The Question That Shouldn't Be Hard
"Is this server production?"
Seems straightforward. But when you're building automation that touches infrastructure, this binary question reveals unexpected complexity.
In 2021, I built a resource classification service to answer: How production is this resource?
Not just yes or no, but with quantified confidence. The goal: enable automated systems to make informed decisions about which resources they could safely modify.
[IMAGE: A flowchart with the question "Is this production?" at the top. Instead of simple Yes/No branches, it splits into dozens of confusing paths labeled "Maybe?", "Technically?", "Used to be?", "Depends who you ask?", "It's complicated". A developer at the bottom looking exhausted.]
Why This Problem Exists
In a perfect world, resources are tagged consistently:
server-prod-001
server-staging-002
server-dev-003
In the real world:
john-test-server-important
legacy-app-v2-final-FINAL-v3
production-backup-maybe-can-delete
temp-server-do-not-delete-prod-depends-on-this
The Naming Problem
Naming conventions are rarely followed consistently. The reasons are understandable: teams move fast, naming is genuinely difficult, and documented standards often aren't discovered until after resources are created.
More realistic examples I've seen:
db-replica-prod-staging (which is it?!)
important-server-dont-touch (helpful, thanks)
new-server (created in 2017)
temporary-testing-keep-forever (honest, at least)
The Temporal Problem
Resources change purpose over time. A staging server from Q2 might be running production traffic by Q4 after an incident workaround. Tags don't get updated. Documentation drifts. The resource's classification becomes stale.
[IMAGE: A timeline showing a server. January: labeled "STAGING - temporary test server". December: same server, now labeled "PRODUCTION (oops)" with sticky notes all over it saying "DON'T TOUCH" and "CRITICAL"]
The Dependency Problem
Is a database production if it only serves test environments?
What if those test environments are blocking production deployments?
What if the test environment is used for load testing before production releases?
What if production has a read-only replica that points to it for some reason nobody can remember?
Dependencies create transitive classification challenges. The problem compounds recursively.
The Risk Spectrum
Not all "production" is equally critical:
| Resource | Impact if Down | Classification |
|----------|----------------|----------------|
| Payment processing DB | Revenue stops, CEO panics | Critical Production |
| API gateway | Customers can't login | Production |
| Internal dashboard | Developers annoyed | Non-Critical Production |
| Staging environment | Tests delayed | Non-Production |
| Personal sandbox | One person sad | Development |
Classification also depends on organizational context. An "internal dashboard" might be business-critical if executives rely on it for decision-making.
The Classification Challenge
Where do you even get the data to make this determination?
Signal Sources
1. Explicit Tags
The dream: Environment=production
The reality: Environment=prod, Env=Production, env=PROD, Type=production-like, Stage=prod-ish
Tags are great when they exist and are correct. They're usually neither.
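To make "when they exist and are correct" concrete, here's a minimal sketch of the normalization the explicit-tag check has to do, assuming a simple key/value tag model. The `ENV_KEYS`, `ENV_PATTERNS`, and `normalize_environment_tag` names are illustrative, not the actual implementation:

```python
import re

# Hypothetical mapping from messy tag values to a canonical environment.
# The real service used more signals; this only illustrates normalization.
ENV_PATTERNS = [
    (re.compile(r"^prod(uction)?([-_ ]?(like|ish))?$", re.IGNORECASE), "production"),
    (re.compile(r"^stag(e|ing)$", re.IGNORECASE), "staging"),
    (re.compile(r"^dev(elopment)?$", re.IGNORECASE), "development"),
]

# Tag keys that teams actually use to mean "environment".
ENV_KEYS = {"environment", "env", "stage", "type"}


def normalize_environment_tag(tags: dict[str, str]) -> str | None:
    """Return a canonical environment name from messy tags, or None."""
    for key, value in tags.items():
        if key.strip().lower() not in ENV_KEYS:
            continue
        for pattern, canonical in ENV_PATTERNS:
            if pattern.match(value.strip()):
                return canonical
    return None


# "Environment=prod", "Env=Production", "env=PROD" all normalize the same way.
print(normalize_environment_tag({"Env": "Production"}))  # production
print(normalize_environment_tag({"Stage": "prod-ish"}))  # production
print(normalize_environment_tag({"Name": "john-test"}))  # None
```

Even this toy version shows the problem: every team spells "production" differently, and the normalizer only catches the spellings you've already seen.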
2. Network Topology
Production resources tend to live in production VPCs. Unless they don't. Unless someone put a staging server in the prod VPC because "it was easier." Unless there's a shared services VPC that hosts both.
But it's still a signal. Prod VPC? Probably production. Probably.
3. Traffic Patterns
Production resources see consistent traffic. 24/7. Peaks during business hours. Geographic distribution.
Dev resources? Quiet at night. Quiet on weekends. Only traffic from your office IPs.
This is actually a pretty reliable signal. Traffic doesn't lie.
[IMAGE: Two graphs side by side. Left graph labeled "Production": consistent 24/7 traffic with daily peaks. Right graph labeled "Dev": flatlines at nights and weekends, spikes during work hours. Both graphs have coffee cup icons marking "when devs arrive"]
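As a rough sketch of how that signal could be quantified, compare off-hours request volume to business-hours volume over a week. The 168-hour input shape, the business-hours window, and the clamping are assumptions for illustration, not the real heuristics:

```python
def traffic_pattern_signal(hourly_requests: list[int]) -> float:
    """Score how 'production-like' a week of hourly request counts looks.

    Assumption for this sketch: hourly_requests holds one count per hour
    (168 values for a week), index 0 = Monday 00:00 local time.
    Returns a value in [0, 1]; higher means more production-like.
    """
    if len(hourly_requests) != 168 or sum(hourly_requests) == 0:
        return 0.0  # not enough data to say anything

    business, off_hours = [], []
    for hour_index, count in enumerate(hourly_requests):
        day, hour = divmod(hour_index, 24)
        is_business = day < 5 and 9 <= hour < 18  # weekday, 09:00-18:00
        (business if is_business else off_hours).append(count)

    # Production traffic keeps flowing nights and weekends;
    # dev traffic collapses when people go home.
    off_ratio = (sum(off_hours) / len(off_hours)) / max(
        sum(business) / len(business), 1e-9
    )
    return min(off_ratio, 1.0)
```

A dev box that goes quiet at night and on weekends scores near zero; a steadily loaded service scores near one.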
4. Change Frequency
Production resources change slowly. Carefully. With approval processes.
Dev resources? Wild west. Deploy 47 times a day. Break things. Fix things. Break them again.
If a server has had 500 deployments this month, it's probably not production. Unless it is, and you have other problems.
5. Owner Metadata
Who owns this resource? If it's owned by the platform team and has an on-call rotation, it's probably production.
If it's owned by someone named "test-user-delete-me," it's probably not.
6. Historical Incidents
If it has caused pages, it's production.
If it's in your incident reports, it's production.
If people get angry when it's down, it's definitely production.
The Confidence Score
Rather than a binary yes/no, we used a confidence score:
```json
{
  "resourceId": "srv-abc123",
  "classification": "PRODUCTION",
  "confidence": 0.87,
  "signals": [
    {
      "type": "EXPLICIT_TAG",
      "value": "Environment=production",
      "weight": 0.9
    },
    {
      "type": "TRAFFIC_PATTERN",
      "value": "24x7_high_volume",
      "weight": 0.85
    },
    {
      "type": "NETWORK_TOPOLOGY",
      "value": "prod_vpc",
      "weight": 0.8
    },
    {
      "type": "CHANGE_FREQUENCY",
      "value": "low_change_rate",
      "weight": 0.7
    }
  ],
  "lastUpdated": "2021-06-10T14:23:00Z",
  "reasoning": "High confidence based on explicit tag, consistent traffic, and production VPC placement"
}
```
This is far more useful than a bare "yes" or "no". Automation can make decisions based on confidence thresholds (a scoring sketch follows the list below):
- Confidence > 0.9: Require manual approval for changes
- Confidence 0.7-0.9: Automatic with extra monitoring
- Confidence < 0.7: Proceed normally
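Here is a minimal sketch of how per-signal weights could be folded into one score and mapped to an action. The plain average and the action names are illustrative assumptions; the real combination logic was different (which is why the example above reports 0.87 rather than a simple average of its weights):

```python
from dataclasses import dataclass


@dataclass
class Signal:
    type: str      # e.g. "EXPLICIT_TAG", "TRAFFIC_PATTERN"
    weight: float  # how strongly this signal says "production", 0..1


def production_confidence(signals: list[Signal]) -> float:
    """Combine per-signal weights into one confidence value (sketch only)."""
    if not signals:
        return 0.5  # no evidence either way
    # Simple blend for illustration: every signal contributes equally.
    return sum(s.weight for s in signals) / len(signals)


def change_policy(confidence: float) -> str:
    """Map production confidence to how automation should behave."""
    if confidence > 0.9:
        return "require_manual_approval"
    if confidence >= 0.7:
        return "automatic_with_extra_monitoring"
    return "proceed_normally"


signals = [
    Signal("EXPLICIT_TAG", 0.9),
    Signal("TRAFFIC_PATTERN", 0.85),
    Signal("NETWORK_TOPOLOGY", 0.8),
    Signal("CHANGE_FREQUENCY", 0.7),
]
confidence = production_confidence(signals)   # 0.8125 with these inputs
print(confidence, change_policy(confidence))  # automatic_with_extra_monitoring
```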
[IMAGE: A confidence meter like a speedometer, with zones marked "Definitely Not Production" (green), "Probably Not?" (yellow), "Maybe?" (orange), "Probably Yes?" (red), "Definitely Production" (dark red). A needle pointing to the "Maybe?" zone with a confused face emoji next to it.]
Building the System
The Three-Layer Approach
Layer 1: Explicit Signals (The Easy Stuff)
Check the tags. Check the name. Check the owner. If everything screams "production," great! High confidence, move on.
Layer 2: Behavioral Signals (The Detective Work)
Analyze traffic patterns over the last 30 days. Check deployment frequency. Look at the network topology. Build a behavioral profile.
This is where it gets interesting. A server with no explicit tags but consistent 24/7 traffic, low change frequency, and production VPC placement? Probably production.
Layer 3: Contextual Signals (The Social Network)
What depends on this resource? What does it depend on? If it's upstream of production services, it's probably production. If it's only accessed by staging environments, probably not.
This applies graph analysis to infrastructure relationships.
The Feedback Loop
The system had to learn from mistakes. When someone overrode a classification ("no, that's actually production!"), we:
- Recorded the correction
- Analyzed what signals we missed
- Adjusted signal weights
- Re-classified similar resources
This feedback loop improved classification accuracy over time without requiring complex ML models.
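A minimal sketch of that correction step, assuming a per-signal-type trust table and a fixed learning rate. The signal names, the `apply_correction` helper, and the adjustment rule are illustrative:

```python
# Hypothetical global weights for how much each signal type is trusted.
signal_trust = {
    "EXPLICIT_TAG": 0.9,
    "TRAFFIC_PATTERN": 0.85,
    "NETWORK_TOPOLOGY": 0.8,
    "CHANGE_FREQUENCY": 0.7,
}

LEARNING_RATE = 0.05  # small nudges so one override can't swing everything


def apply_correction(observed_signals: dict[str, bool], actually_production: bool) -> None:
    """Nudge trust in each signal type after a human override.

    observed_signals maps signal type -> whether that signal voted
    "production" for the misclassified resource.
    """
    for signal_type, voted_production in observed_signals.items():
        agreed = voted_production == actually_production
        delta = LEARNING_RATE if agreed else -LEARNING_RATE
        current = signal_trust.get(signal_type, 0.5)
        # Keep trust inside [0.05, 0.95] so no signal becomes absolute.
        signal_trust[signal_type] = min(0.95, max(0.05, current + delta))


# Example: a resource the system called non-production, corrected by a human.
apply_correction(
    {"EXPLICIT_TAG": False, "TRAFFIC_PATTERN": True},
    actually_production=True,
)
# Tags pointed the wrong way, so EXPLICIT_TAG trust drops slightly;
# traffic pointed the right way, so TRAFFIC_PATTERN trust rises slightly.
```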
Use Cases
1. Automated Patching
Our patching system needed to know: can I reboot this server?
- Production server, low confidence: Ask first
- Production server, high confidence: Use careful rollout
- Non-production, high confidence: YOLO reboot
Suddenly we could patch thousands of servers without asking permission for each one.
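A sketch of how the patching workflow could consume a classification record like the JSON shown earlier. The thresholds, classification strings, and action names are assumptions for illustration:

```python
def patch_action(classification: str, confidence: float) -> str:
    """Decide how to patch a server given its classification (sketch only)."""
    if classification == "PRODUCTION":
        if confidence >= 0.8:
            return "careful_rollout"   # canary first, then staged reboots
        return "ask_owner_first"       # production, but we're not sure
    if confidence >= 0.8:
        return "reboot_immediately"    # confidently non-production
    return "ask_owner_first"           # unsure either way: be conservative


print(patch_action("PRODUCTION", 0.95))     # careful_rollout
print(patch_action("PRODUCTION", 0.6))      # ask_owner_first
print(patch_action("NON_PRODUCTION", 0.9))  # reboot_immediately
```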
2. Cost Optimization
Find dev resources that are tagged as production (expensive!). Find oversized staging servers. Find that test database that's bigger than production.
We saved ~$200k/month just by identifying misclassified resources.
3. Chaos Engineering
Want to inject failures? Don't target production by accident.
The classification service became the safety guardrail: "This resource has 95% confidence of being production. Really sure you want to terminate it?"
Contextual warnings significantly reduced accidental chaos experiments on production resources.
[IMAGE: A big red button labeled "TERMINATE SERVER". A dialog box asking "This resource is 94% likely to be PRODUCTION. Are you REALLY sure?" with a tiny "Yes" button and a huge "OH GOD NO" button.]
The Edge Cases That Broke Everything
The "Permanent Temporary" Server
Started as a quick test. Never got deleted. Gradually became critical. Now production depends on it.
Classification by name: "Non-production"
Classification by behavior: "Production"
The system learned to detect these through traffic analysis and dependency graphs rather than relying on names.
Blue-Green Deployments
Server alternates between production and standby every week.
Is it production? The answer alternates weekly.
We introduced a "transiently production" classification for resources that alternate between active and standby roles.
Multi-Tenant Resources
One database. Partition A: production data. Partition B: test data.
The database is simultaneously production and non-production depending on which data you're querying.
Shadow Production
Load testing environments replay production traffic. They don't serve customer requests, but their availability blocks production deployments.
We introduced a "production-critical" category for resources that don't serve production traffic but are critical to production operations.
The Gotchas
False Positives Are Expensive
False positives have costs: unnecessary monitoring, security scans, and operational overhead for non-production resources.
False negatives are worse: automation can impact production resources, affecting customers.
We biased the system toward false positives: conservative classification is safer than aggressive automation.
Tag Hygiene Is Really Hard
Tags aren't maintained consistently. People are focused on delivery, not metadata hygiene.
We built automated tag suggestions: the system would detect likely misclassifications and suggest corrections. This approach worked better than mandating compliance.
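A minimal sketch of that suggestion logic: when behavioral signals contradict the explicit tag, emit a suggestion instead of forcing a change. The thresholds, messages, and the idea of feeding in a pre-normalized tag value (like the output of the earlier `normalize_environment_tag` sketch) are assumptions for illustration:

```python
def suggest_tag_fix(tagged_env: str | None, behavioral_confidence: float) -> str | None:
    """Suggest a tag correction when behavior contradicts the explicit tag.

    tagged_env is the normalized environment from the resource's tags;
    behavioral_confidence is production confidence derived from traffic,
    change frequency, and topology alone (0..1). Thresholds are illustrative.
    """
    if tagged_env != "production" and behavioral_confidence > 0.85:
        return "behaves like production; suggest adding Environment=production"
    if tagged_env == "production" and behavioral_confidence < 0.3:
        return "tagged production but behaves like dev; suggest an owner review"
    return None  # tags and behavior agree; nothing to suggest


print(suggest_tag_fix(None, 0.92))         # suggests Environment=production
print(suggest_tag_fix("production", 0.1))  # suggests an owner review
print(suggest_tag_fix("production", 0.9))  # None
```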
Classification Disputes
Teams disagreed about what counted as production.
Engineering: "It's not production, no customer traffic!" Product: "It blocks production releases, so it's production-critical!" Finance: "Does it cost money? Then I want to know what it is."
We added a "disputed" flag and let humans fight it out.
What Worked
1. Confidence scores instead of binary classification
"87% confident this is production" is way more useful than "yes."
2. Multiple signal sources
No single signal is reliable. Combine everything.
3. Human override with feedback
Let people correct mistakes, then learn from them.
4. Clear explanations
The system always explained its classifications. Showing the underlying signals and their weights built trust with operators.
5. Conservative defaults
When in doubt, classify as more production than less. Safe beats sorry.
[IMAGE: A decision tree diagram showing the classification logic, with a developer looking at it and nodding approvingly. Annotations pointing to different parts: "Clear logic", "Shows reasoning", "Conservative defaults", "Can override if wrong"]
What Didn't Work
The ML Experiment
We experimented with ML-based classification using resource features.
The model achieved reasonable accuracy but was uninterpretable. When misclassifications occurred, we couldn't explain the reasoning. Teams lost confidence in the system.
For this use case, interpretability mattered more than marginal accuracy gains.
Fully Automated Reclassification
We tried automatic reclassification based on behavioral changes.
This failed: a staging server experienced a traffic spike during load testing and was reclassified as production, triggering incorrect policy enforcement.
Human approval for reclassifications prevented these false positives.
Dependency-Based Classification
"If you depend on production, you're production."
This rule sounds reasonable but creates classification creep. Eventually every resource depends on something production (DNS, authentication, etc.).
We implemented dependency depth limits and weight decay. Only direct dependencies strongly influenced classification.
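A minimal sketch of the depth limit and decay, assuming the dependency graph is available as a simple adjacency map and that each resource already has a direct score from its own signals. The graph contents, `MAX_DEPTH`, and `DECAY` values are illustrative:

```python
from collections import deque

# Hypothetical dependency graph: resource -> resources it is linked to.
DEPENDENCY_GRAPH = {
    "report-generator": ["auth-service"],
    "auth-service": ["payments-db"],
    "payments-db": [],
}

# Hypothetical per-resource scores from direct signals (tags, traffic, ...).
DIRECT_PRODUCTION_SCORE = {
    "report-generator": 0.1,
    "auth-service": 0.9,
    "payments-db": 0.95,
}

MAX_DEPTH = 2  # stop walking the graph after two hops
DECAY = 0.5    # each hop contributes half as much as the previous one


def dependency_signal(resource: str) -> float:
    """Production-ness inherited from nearby resources, with depth decay."""
    best = 0.0
    queue = deque([(resource, 0)])
    seen = {resource}
    while queue:
        node, depth = queue.popleft()
        if depth > 0:  # the resource's own score is handled elsewhere
            contribution = DIRECT_PRODUCTION_SCORE.get(node, 0.0) * (DECAY ** depth)
            best = max(best, contribution)
        if depth < MAX_DEPTH:
            for neighbor in DEPENDENCY_GRAPH.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return best


# The one-hop link to auth-service dominates: 0.9 * 0.5 = 0.45.
# payments-db, two hops away, only contributes 0.95 * 0.25 ≈ 0.24,
# so "everything eventually touches prod" no longer drags every score to 1.0.
print(dependency_signal("report-generator"))  # 0.45
```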
Key Lessons
1. "Production" Is a Spectrum
Not binary. Not even a simple scale. More like a multi-dimensional risk profile.
2. Metadata Quality Is the Bottleneck
Classification accuracy is limited by metadata quality. Investing in tagging standards and tooling from the start pays dividends.
3. Conservative by Default
False positives are annoying. False negatives are catastrophic.
When in doubt, treat it as production.
4. Explain Your Decisions
"This is production because I said so" doesn't work.
"This is production because: explicit tag (90% weight), consistent traffic (85%), prod VPC (80%)" builds trust.
5. Humans Are Part of the System
Human judgment remains essential. Build systems that learn from human corrections rather than trying to eliminate human involvement.
The Impact
After a year:
- Classified 50,000+ resources
- Saved ~$2M annually in cost optimization
- Reduced patch deployment time by 60%
- Zero production incidents from automation gone wrong
- Enabled safe chaos engineering experiments
The system proved that resource classification, done well, enables significant improvements in operational safety and efficiency.
[IMAGE: A before/after comparison. Before: A messy spreadsheet with ???s everywhere and a stressed person. After: A clean dashboard showing classified resources with confidence scores and a person drinking coffee calmly with their feet up.]
Implementation details abstracted. All examples simplified for illustration.
Further Reading: For more on building safe automation, see Patching 25,000 Servers Without Breaking the Internet.