How 'Production' Are You Really? Building a Resource Classification System

June 15, 2021 · 9 min read · safety, classification, infrastructure, automation, risk-management

The Question That Shouldn't Be Hard

"Is this server production?"

Seems straightforward. But when you're building automation that touches infrastructure, this binary question reveals unexpected complexity.

In 2021, I built a resource classification service to answer: How production is this resource?

Not just yes or no, but with quantified confidence. The goal: enable automated systems to make informed decisions about which resources they could safely modify.

Classification Decision Tree

[Interactive decision tree: a simplified chain of yes/no questions ending in Production (95% confidence), Dev/Staging, or Uncertain. Real classification involves dozens of signals.]

Why This Problem Exists

In a perfect world, resources are tagged consistently:

server-prod-001
server-staging-002
server-dev-003

In the real world:

john-test-server-important
legacy-app-v2-final-FINAL-v3
production-backup-maybe-can-delete
temp-server-do-not-delete-prod-depends-on-this

The Naming Problem

Naming conventions are rarely followed consistently. The reasons are understandable: teams move fast, naming is genuinely difficult, and documented standards often aren't discovered until after resources are created.

More realistic examples I've seen:

  • db-replica-prod-staging (which is it?!)
  • important-server-dont-touch (helpful, thanks)
  • new-server (created in 2017)
  • temporary-testing-keep-forever (honest, at least)

The Temporal Problem

Resources change purpose over time. A staging server from Q2 might be running production traffic by Q4 after an incident workaround. Tags don't get updated. Documentation drifts. The resource's classification becomes stale.

Classification Drift: How "Temporary" Becomes "Critical"

The lifecycle of a server nobody meant to keep

  • Jan: Created. Temporary test server
  • Mar: Forgot. Team moves to other project
  • Jun: Discovered. "Hey, what's this server?"
  • Aug: Traffic. Wait, users are hitting this?
  • Oct: Critical. DO NOT TOUCH
  • Dec: Panic. It's definitely production now

(Along the way, the classification drifts from staging to unknown to production.)

The Drift Problem

Without continuous classification, resources silently transition from "temporary" to "critical production". By the time you notice, it's too late to safely remove them.

The Dependency Problem

Is a database production if it only serves test environments?

What if those test environments are blocking production deployments?

What if the test environment is used for load testing before production releases?

What if production has a read-only replica that points to it for some reason nobody can remember?

Dependencies create transitive classification challenges. The problem compounds recursively.

The Risk Spectrum

Not all "production" is equally critical:

Resource               | Impact if Down             | Classification
Payment processing DB  | Revenue stops, CEO panics  | Critical Production
API gateway            | Customers can't log in     | Production
Internal dashboard     | Developers annoyed         | Non-Critical Production
Staging environment    | Tests delayed              | Non-Production
Personal sandbox       | One person sad             | Development

Classification also depends on organizational context. An "internal dashboard" might be business-critical if executives rely on it for decision-making.

The Classification Challenge

Where do you even get the data to make this determination?

Signal Sources

1. Explicit Tags

The dream: Environment=production

The reality: Environment=prod, Env=Production, env=PROD, Type=production-like, Stage=prod-ish

Tags are great when they exist and are correct. They're usually neither.
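
Here's roughly what that normalization layer looks like. A minimal sketch, not the real code: the tag keys and values come from the examples above, while the regexes and function name are invented for illustration.

import re

# Keys and values people actually use for "environment" tags vary wildly;
# map the common spellings onto a canonical answer.
ENV_KEYS = {"environment", "env", "stage", "type"}
NONPROD = re.compile(r"dev|test|stag|sandbox|qa", re.IGNORECASE)
PROD = re.compile(r"prod", re.IGNORECASE)

def normalize_environment_tag(tags):
    """Return 'production', 'non-production', or None if no usable tag exists."""
    for key, value in tags.items():
        if key.strip().lower() not in ENV_KEYS:
            continue
        if NONPROD.search(value):
            return "non-production"
        if PROD.search(value):   # matches prod, PROD, production-like, prod-ish
            return "production"
    return None

print(normalize_environment_tag({"Env": "Production"}))   # production
print(normalize_environment_tag({"Stage": "prod-ish"}))   # production
print(normalize_environment_tag({"Owner": "alice"}))      # None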

2. Network Topology

Production resources tend to live in production VPCs. Unless they don't. Unless someone put a staging server in the prod VPC because "it was easier." Unless there's a shared services VPC that hosts both.

But it's still a signal. Prod VPC? Probably production. Probably.

3. Traffic Patterns

Production resources see consistent traffic. 24/7. Peaks during business hours. Geographic distribution.

Dev resources? Quiet at night. Quiet on weekends. Only traffic from your office IPs.

This is actually a pretty reliable signal. Traffic doesn't lie.
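
A sketch of how request timestamps can become a traffic-shape signal. The thresholds and function names are illustrative, not what we actually shipped:

from datetime import datetime

def off_hours_ratio(request_times):
    """Fraction of requests arriving outside weekday business hours (8am-6pm)."""
    if not request_times:
        return 0.0
    off = sum(
        1 for t in request_times
        if t.weekday() >= 5 or t.hour < 8 or t.hour >= 18
    )
    return off / len(request_times)

def traffic_signal(request_times):
    """Rough production likelihood (0..1) from traffic shape alone.

    Roughly 70% of the week is off-hours, so uniform 24/7 traffic scores
    near 1.0 here, while strictly 9-5 weekday traffic scores near 0.0.
    """
    return min(off_hours_ratio(request_times) / 0.7, 1.0)

# A dev box that only saw weekday mid-morning and mid-afternoon traffic:
print(traffic_signal([datetime(2021, 6, 14, 10), datetime(2021, 6, 14, 15)]))  # 0.0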

Traffic Patterns: Production vs Development

Production traffic follows user behavior (24/7). Dev traffic follows engineer schedules (9-5).

  • Production: 24/7 traffic, peaks during business hours, never truly quiet
  • Development: 9-5 traffic, flatlines nights/weekends, follows engineer schedules

4. Change Frequency

Production resources change slowly. Carefully. With approval processes.

Dev resources? Wild west. Deploy 47 times a day. Break things. Fix things. Break them again.

If a server has had 500 deployments this month, it's probably not production. Unless it is, and you have other problems.
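
Same idea for change frequency. The cutoffs below are made up; the shape of the heuristic is the point:

def change_frequency_signal(deploys_last_30_days):
    """Rough production likelihood (0..1) from deployment cadence alone."""
    if deploys_last_30_days <= 4:     # roughly weekly or slower: careful change
        return 0.8
    if deploys_last_30_days <= 30:    # roughly daily: could go either way
        return 0.5
    return 0.2                        # hundreds of deploys: probably not production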

5. Owner Metadata

Who owns this resource? If it's owned by the platform team and has an on-call rotation, it's probably production.

If it's owned by someone named "test-user-delete-me," it's probably not.

6. Historical Incidents

If it has caused pages, it's production.

If it's in your incident reports, it's production.

If people get angry when it's down, it's definitely production.

The Confidence Score

Rather than a binary yes/no, we used a confidence score:

{
  "resourceId": "srv-abc123",
  "classification": "PRODUCTION",
  "confidence": 0.87,
  "signals": [
    {
      "type": "EXPLICIT_TAG",
      "value": "Environment=production",
      "weight": 0.9
    },
    {
      "type": "TRAFFIC_PATTERN",
      "value": "24x7_high_volume",
      "weight": 0.85
    },
    {
      "type": "NETWORK_TOPOLOGY",
      "value": "prod_vpc",
      "weight": 0.8
    },
    {
      "type": "CHANGE_FREQUENCY",
      "value": "low_change_rate",
      "weight": 0.7
    }
  ],
  "lastUpdated": "2021-06-10T14:23:00Z",
  "reasoning": "High confidence based on explicit tag, consistent traffic, and production VPC placement"
}
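
The actual aggregation formula isn't shown here, so treat this as one plausible way the per-signal weights could roll up into an overall confidence (a self-weighted vote; the real system's math may well differ):

from dataclasses import dataclass

@dataclass
class Signal:
    type: str
    value: str
    weight: float   # how strongly this signal, on its own, says "production"

def aggregate_confidence(signals):
    """Self-weighted vote: stronger signals count for more in the average."""
    if not signals:
        return 0.5   # no evidence either way
    total = sum(s.weight for s in signals)
    return sum(s.weight * s.weight for s in signals) / total

signals = [
    Signal("EXPLICIT_TAG", "Environment=production", 0.9),
    Signal("TRAFFIC_PATTERN", "24x7_high_volume", 0.85),
    Signal("NETWORK_TOPOLOGY", "prod_vpc", 0.8),
    Signal("CHANGE_FREQUENCY", "low_change_rate", 0.7),
]
print(round(aggregate_confidence(signals), 2))   # 0.82 with this toy formula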

A structured response like this is way more useful than a bare "yes" or "no". Automation can make decisions based on confidence thresholds:

  • Confidence > 0.9: Require manual approval for changes
  • Confidence 0.7-0.9: Automatic with extra monitoring
  • Confidence < 0.7: Proceed normally
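
A minimal sketch of how an automation client might apply those thresholds; the policy names are invented:

def change_policy(production_confidence):
    """Turn the thresholds above into a decision for an automated change."""
    if production_confidence > 0.9:
        return "require_manual_approval"
    if production_confidence >= 0.7:
        return "automatic_with_extra_monitoring"
    return "proceed_normally"

print(change_policy(0.87))   # automatic_with_extra_monitoring
print(change_policy(0.94))   # require_manual_approval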

Production Confidence Score

How confident are we that this resource is production?

[Interactive gauge running from Definitely Dev through Probably Dev, Uncertain, and Probably Prod to Definitely Prod; example reading: 67%, "Probably Prod".]

Building the System

The Three-Layer Approach

Layer 1: Explicit Signals (The Easy Stuff)

Check the tags. Check the name. Check the owner. If everything screams "production," great! High confidence, move on.

Layer 2: Behavioral Signals (The Detective Work)

Analyze traffic patterns over the last 30 days. Check deployment frequency. Look at the network topology. Build a behavioral profile.

This is where it gets interesting. A server with no explicit tags but consistent 24/7 traffic, low change frequency, and production VPC placement? Probably production.

Layer 3: Contextual Signals (The Social Network)

What depends on this resource? What does it depend on? If it's upstream of production services, it's probably production. If it's only accessed by staging environments, probably not.

This applies graph analysis to infrastructure relationships.
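
Here's a sketch of what that graph analysis can look like: walk the "who depends on me" edges a few hops out and treat known-production dependents as evidence. The data structures, decay factor, and function name are all illustrative:

from collections import deque

def contextual_signal(resource, dependents, known_production, max_hops=2):
    """Walk the 'who depends on me' graph a few hops out.

    A known-production resource sitting downstream is evidence that this
    resource is production too, with the evidence decaying per extra hop.
    """
    seen = {resource}
    queue = deque([(resource, 0)])
    best = 0.0
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for dependent in dependents.get(node, set()):
            if dependent in seen:
                continue
            seen.add(dependent)
            if dependent in known_production:
                best = max(best, 0.9 * (0.5 ** hops))   # halve per extra hop
            queue.append((dependent, hops + 1))
    return best

# db-42 -> api-gateway -> web-frontend (known production, two hops away)
dependents = {"db-42": {"api-gateway"}, "api-gateway": {"web-frontend"}}
print(contextual_signal("db-42", dependents, {"web-frontend"}))   # 0.45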

The Feedback Loop

The system had to learn from mistakes. When someone overrode a classification ("no, that's actually production!"), we:

  1. Recorded the correction
  2. Analyzed what signals we missed
  3. Adjusted signal weights
  4. Re-classified similar resources

This feedback loop improved classification accuracy over time without requiring complex ML models.
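
The actual update rule isn't described here, but the loop can be as simple as nudging per-signal weights whenever a human override disagrees with them. A toy version, with an invented learning rate:

def apply_correction(weights, observed_signals, was_actually_production,
                     learning_rate=0.05):
    """Nudge per-signal weights after a human override.

    observed_signals maps signal type -> whether that signal said "production".
    Signals that agreed with the human get a little more weight; signals
    that disagreed get a little less.
    """
    updated = dict(weights)
    for signal, said_production in observed_signals.items():
        if signal not in updated:
            continue
        agreed = said_production == was_actually_production
        delta = learning_rate if agreed else -learning_rate
        updated[signal] = min(1.0, max(0.0, updated[signal] + delta))
    return updated

weights = {"EXPLICIT_TAG": 0.9, "TRAFFIC_PATTERN": 0.85}
# Human says "this is production" but the tag signal had said otherwise:
print(apply_correction(weights, {"EXPLICIT_TAG": False, "TRAFFIC_PATTERN": True}, True))
# tag weight nudged down toward 0.85, traffic weight nudged up toward 0.9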

Use Cases

1. Automated Patching

Our patching system needed to know: can I reboot this server?

  • Production server, low confidence: Ask first
  • Production server, high confidence: Use careful rollout
  • Non-production, high confidence: YOLO reboot

Suddenly we could patch thousands of servers without asking permission for each one.

2. Cost Optimization

Find dev resources that are tagged as production (expensive!). Find oversized staging servers. Find that test database that's bigger than production.

We saved ~$200k/month just by identifying misclassified resources.

3. Chaos Engineering

Want to inject failures? Don't target production by accident.

The classification service became the safety guardrail: "This resource has 95% confidence of being production. Really sure you want to terminate it?"

Contextual warnings significantly reduced accidental chaos experiments on production resources.

Safety Prompt: Classification-Aware Deletion

Would you click "Yes" on a 94% production confidence resource?

⚠️ Terminate Instance?

  • Resource: i-0abc123def456
  • Classification: PRODUCTION
  • Confidence: 94%

This resource is 94% likely to be PRODUCTION. Are you REALLY sure you want to terminate it?

Design Pattern: Asymmetric Friction

Make dangerous actions harder. Small "Yes" button, large "No" button. Show classification confidence to make risk explicit. Humans make fewer mistakes when consequences are visible.
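
In code, asymmetric friction can be as simple as demanding more deliberate input as production confidence goes up. This is a CLI stand-in for the web prompt above, not the real implementation:

def confirm_terminate(resource_id, classification, confidence):
    """The riskier the resource, the more deliberate the confirmation."""
    print(f"Terminate {resource_id}? Classified {classification} "
          f"at {confidence:.0%} confidence.")
    if classification == "PRODUCTION" and confidence >= 0.7:
        # High-risk path: no one-keystroke "y" here; retype the resource id.
        typed = input(f"Type '{resource_id}' to confirm termination: ")
        return typed.strip() == resource_id
    return input("Proceed? [y/N] ").strip().lower() == "y"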

The Edge Cases That Broke Everything

The "Permanent Temporary" Server

Started as a quick test. Never got deleted. Gradually became critical. Now production depends on it.

Classification by name: non-production. Classification by behavior: production.

The system learned to detect these through traffic analysis and dependency graphs rather than relying on names.

Blue-Green Deployments

Server alternates between production and standby every week.

Is it production? The answer alternates weekly.

We introduced a "transiently production" classification for resources that alternate between active and standby roles.

Multi-Tenant Resources

One database. Partition A: production data. Partition B: test data.

The database is simultaneously production and non-production depending on which data you're querying.

Shadow Production

Load testing environments replay production traffic. They don't serve customer requests, but their availability blocks production deployments.

We introduced a "production-critical" category for resources that don't serve production traffic but are critical to production operations.

The Gotchas

False Positives Are Expensive

False positives have costs: unnecessary monitoring, security scans, and operational overhead for non-production resources.

False negatives are worse: automation can impact production resources, affecting customers.

The system biases toward false positives. Conservative classification is safer than aggressive automation.

Tag Hygiene Is Really Hard

Tags aren't maintained consistently. People are focused on delivery, not metadata hygiene.

We built automated tag suggestions: the system would detect likely misclassifications and suggest corrections. This approach worked better than mandating compliance.
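
A sketch of what a tag suggestion might look like: compare the explicit tag against the behavioral classification and only speak up when the system is fairly confident. The threshold and message format are invented:

def suggest_tag_fix(resource_id, tagged_env, behavioral_classification,
                    confidence, threshold=0.8):
    """Emit a suggestion (never a mandate) when tags and behavior disagree."""
    if confidence < threshold:
        return None                    # not confident enough to bother anyone
    behaves_prod = behavioral_classification == "PRODUCTION"
    tagged_prod = tagged_env == "production"
    if tagged_env is not None and behaves_prod == tagged_prod:
        return None                    # tag and behavior already agree
    wanted = "production" if behaves_prod else "non-production"
    return (f"{resource_id}: behavior looks {wanted} at {confidence:.0%} "
            f"confidence, but the tag says {tagged_env!r}. "
            f"Consider setting Environment={wanted}.")

print(suggest_tag_fix("srv-abc123", "staging", "PRODUCTION", 0.91))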

Classification Disputes

Teams disagreed about what counted as production.

  • Engineering: "It's not production, no customer traffic!"
  • Product: "It blocks production releases, so it's production-critical!"
  • Finance: "Does it cost money? Then I want to know what it is."

We added a "disputed" flag and let humans fight it out.

What Worked

1. Confidence scores instead of binary classification

"87% confident this is production" is way more useful than "yes."

2. Multiple signal sources

No single signal is reliable. Combine everything.

3. Human override with feedback

Let people correct mistakes, then learn from them.

4. Clear explanations

The system always explained its classifications. Showing the underlying signals and their weights built trust with operators.

5. Conservative defaults

When in doubt, classify as more production than less. Safe beats sorry.

What Didn't Work

The ML Experiment

We experimented with ML-based classification using resource features.

The model achieved reasonable accuracy but was uninterpretable. When misclassifications occurred, we couldn't explain the reasoning. Teams lost confidence in the system.

For this use case, interpretability mattered more than marginal accuracy gains.

Fully Automated Reclassification

We tried automatic reclassification based on behavioral changes.

This failed: a staging server experienced a traffic spike during load testing and was reclassified as production, triggering incorrect policy enforcement.

Human approval for reclassifications prevented these false positives.

Dependency-Based Classification

"If you depend on production, you're production."

This rule sounds reasonable but creates classification creep. Eventually every resource depends on something production (DNS, authentication, etc.).

We implemented dependency depth limits and weight decay. Only direct dependencies strongly influenced classification.
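
The decay itself is simple. Something like this, where the specific decay factor and depth cutoff are illustrative:

def dependency_influence(base_weight, depth, decay=0.5, max_depth=3):
    """Direct dependencies (depth 1) carry full weight; each extra hop
    decays the influence, and beyond max_depth it stops counting at all.
    That cutoff is what keeps "everything depends on DNS" from dragging
    every resource toward production.
    """
    if depth < 1 or depth > max_depth:
        return 0.0
    return base_weight * (decay ** (depth - 1))

print(dependency_influence(0.8, depth=1))   # 0.8, direct dependency
print(dependency_influence(0.8, depth=2))   # 0.4
print(dependency_influence(0.8, depth=4))   # 0.0, too far away to matter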

Key Lessons

1. "Production" Is a Spectrum

Not binary. Not even a simple scale. More like a multi-dimensional risk profile.

2. Metadata Quality Is the Bottleneck

Classification accuracy is limited by metadata quality. Investing in tagging standards and tooling from the start pays dividends.

3. Conservative by Default

False positives are annoying. False negatives are catastrophic.

When in doubt, treat it as production.

4. Explain Your Decisions

"This is production because I said so" doesn't work.

"This is production because: explicit tag (90% weight), consistent traffic (85%), prod VPC (80%)" builds trust.

5. Humans Are Part of the System

Human judgment remains essential. Build systems that learn from human corrections rather than trying to eliminate human involvement.

The Impact

After a year:

  • Classified 50,000+ resources
  • Saved ~$2M annually in cost optimization
  • Reduced patch deployment time by 60%
  • Zero production incidents from automation gone wrong
  • Enabled safe chaos engineering experiments

The system proved that resource classification, done well, enables significant improvements in operational safety and efficiency.


Implementation details abstracted. All examples simplified for illustration.

Further Reading: For more on building safe automation, see Patching 25,000 Servers Without Breaking the Internet.