How 'Production' Are You Really? Building a Resource Classification System

June 15, 2021 · 10 min read · safety, classification, infrastructure, automation, risk-management


The Question That Shouldn't Be Hard

"Is this server production?"

Seems straightforward. But when you're building automation that touches infrastructure, this binary question reveals unexpected complexity.

In 2021, I built a resource classification service to answer: How production is this resource?

Not just yes or no, but with quantified confidence. The goal: enable automated systems to make informed decisions about which resources they could safely modify.

[IMAGE: A flowchart with the question "Is this production?" at the top. Instead of simple Yes/No branches, it splits into dozens of confusing paths labeled "Maybe?", "Technically?", "Used to be?", "Depends who you ask?", "It's complicated". A developer at the bottom looking exhausted.]

Why This Problem Exists

In a perfect world, resources are tagged consistently:

server-prod-001
server-staging-002
server-dev-003

In the real world:

john-test-server-important
legacy-app-v2-final-FINAL-v3
production-backup-maybe-can-delete
temp-server-do-not-delete-prod-depends-on-this

The Naming Problem

Naming conventions are rarely followed consistently. The reasons are understandable: teams move fast, naming is genuinely difficult, and documented standards often aren't discovered until after resources are created.

More realistic examples I've seen:

  • db-replica-prod-staging (which is it?!)
  • important-server-dont-touch (helpful, thanks)
  • new-server (created in 2017)
  • temporary-testing-keep-forever (honest, at least)

The Temporal Problem

Resources change purpose over time. A staging server from Q2 might be running production traffic by Q4 after an incident workaround. Tags don't get updated. Documentation drifts. The resource's classification becomes stale.

[IMAGE: A timeline showing a server. January: labeled "STAGING - temporary test server". December: same server, now labeled "PRODUCTION (oops)" with sticky notes all over it saying "DON'T TOUCH" and "CRITICAL"]

The Dependency Problem

Is a database production if it only serves test environments?

What if those test environments are blocking production deployments?

What if the test environment is used for load testing before production releases?

What if production has a read-only replica that points to it for some reason nobody can remember?

Dependencies create transitive classification challenges. The problem compounds recursively.

The Risk Spectrum

Not all "production" is equally critical:

| Resource | Impact if Down | Classification |
|----------|----------------|----------------|
| Payment processing DB | Revenue stops, CEO panics | Critical Production |
| API gateway | Customers can't log in | Production |
| Internal dashboard | Developers annoyed | Non-Critical Production |
| Staging environment | Tests delayed | Non-Production |
| Personal sandbox | One person sad | Development |

Classification also depends on organizational context. An "internal dashboard" might be business-critical if executives rely on it for decision-making.

The Classification Challenge

Where do you even get the data to make this determination?

Signal Sources

1. Explicit Tags

The dream: Environment=production

The reality: Environment=prod, Env=Production, env=PROD, Type=production-like, Stage=prod-ish

Tags are great when they exist and are correct. They're usually neither.
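One mitigation is a small normalization layer that maps the common variants onto canonical values before anything else looks at them. Here's a minimal sketch in Python; the key/value variants and function name are illustrative, not the actual mapping we shipped:

# Minimal sketch: normalize messy environment tags into a canonical value.
# The key/value variants below are illustrative, not an exhaustive list.
CANONICAL_ENVS = {
    "prod": "production",
    "production": "production",
    "production-like": "production",
    "prod-ish": "production",
    "stage": "staging",
    "staging": "staging",
    "dev": "development",
    "development": "development",
}

TAG_KEYS = {"environment", "env", "stage", "type"}

def environment_from_tags(tags: dict) -> str | None:
    """Return a canonical environment, or None if no tag matches."""
    for key, value in tags.items():
        if key.strip().lower() not in TAG_KEYS:
            continue
        normalized = CANONICAL_ENVS.get(value.strip().lower())
        if normalized:
            return normalized
    return None

print(environment_from_tags({"Env": "PROD"}))               # production
print(environment_from_tags({"Type": "production-like"}))   # production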

2. Network Topology

Production resources tend to live in production VPCs. Unless they don't. Unless someone put a staging server in the prod VPC because "it was easier." Unless there's a shared services VPC that hosts both.

But it's still a signal. Prod VPC? Probably production. Probably.

3. Traffic Patterns

Production resources see consistent traffic. 24/7. Peaks during business hours. Geographic distribution.

Dev resources? Quiet at night. Quiet on weekends. Only traffic from your office IPs.

This is actually a pretty reliable signal. Traffic doesn't lie.

[IMAGE: Two graphs side by side. Left graph labeled "Production": consistent 24/7 traffic with daily peaks. Right graph labeled "Dev": flatlines at nights and weekends, spikes during work hours. Both graphs have coffee cup icons marking "when devs arrive"]
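One cheap proxy for this, assuming you already collect per-hour request counts, is the ratio of off-hours traffic to business-hours traffic. A rough sketch; the thresholds and the 9-to-18 window are illustrative:

# Rough sketch: score "production-ness" from hourly request counts.
# Thresholds and the business-hours window are illustrative.

def traffic_signal(hourly_requests):
    """hourly_requests: iterable of (hour_of_day_utc, request_count) pairs
    covering roughly the last 30 days."""
    business = [count for hour, count in hourly_requests if 9 <= hour < 18]
    off_hours = [count for hour, count in hourly_requests if hour < 9 or hour >= 18]
    if not business or sum(business) == 0:
        return 0.0  # effectively no traffic: very unlikely to be production
    avg_business = sum(business) / len(business)
    avg_off = sum(off_hours) / len(off_hours) if off_hours else 0.0
    ratio = avg_off / avg_business
    # Production traffic keeps flowing overnight; dev traffic mostly doesn't.
    if ratio > 0.5:
        return 0.85
    if ratio > 0.2:
        return 0.5
    return 0.15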

4. Change Frequency

Production resources change slowly. Carefully. With approval processes.

Dev resources? Wild west. Deploy 47 times a day. Break things. Fix things. Break them again.

If a server has had 500 deployments this month, it's probably not production. Unless it is, and you have other problems.

5. Owner Metadata

Who owns this resource? If it's owned by the platform team and has an on-call rotation, it's probably production.

If it's owned by someone named "test-user-delete-me," it's probably not.

6. Historical Incidents

If it has caused pages, it's production.

If it's in your incident reports, it's production.

If people get angry when it's down, it's definitely production.

The Confidence Score

Rather than a binary yes/no, we used a confidence score:

{
  "resourceId": "srv-abc123",
  "classification": "PRODUCTION",
  "confidence": 0.87,
  "signals": [
    {
      "type": "EXPLICIT_TAG",
      "value": "Environment=production",
      "weight": 0.9
    },
    {
      "type": "TRAFFIC_PATTERN",
      "value": "24x7_high_volume",
      "weight": 0.85
    },
    {
      "type": "NETWORK_TOPOLOGY",
      "value": "prod_vpc",
      "weight": 0.8
    },
    {
      "type": "CHANGE_FREQUENCY",
      "value": "low_change_rate",
      "weight": 0.7
    }
  ],
  "lastUpdated": "2021-06-10T14:23:00Z",
  "reasoning": "High confidence based on explicit tag, consistent traffic, and production VPC placement"
}

This is way more useful than "yes" or "no". Automation can make decisions based on confidence thresholds:

  • Confidence > 0.9 that it's production: Require manual approval for changes
  • Confidence 0.7-0.9: Proceed automatically, with extra monitoring
  • Confidence < 0.7: Proceed normally
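In practice, the consuming automation just compares the confidence against those thresholds. A minimal sketch of that gate; the function and action names are made up for illustration:

# Minimal sketch of how automation might act on a classification result.
# The thresholds mirror the ones above; the action names are illustrative.

def change_policy(classification: str, confidence: float) -> str:
    if classification == "PRODUCTION" and confidence > 0.9:
        return "REQUIRE_MANUAL_APPROVAL"
    if classification == "PRODUCTION" and confidence >= 0.7:
        return "AUTOMATIC_WITH_EXTRA_MONITORING"
    return "PROCEED_NORMALLY"

print(change_policy("PRODUCTION", 0.87))  # AUTOMATIC_WITH_EXTRA_MONITORING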

[IMAGE: A confidence meter like a speedometer, with zones marked "Definitely Not Production" (green), "Probably Not?" (yellow), "Maybe?" (orange), "Probably Yes?" (red), "Definitely Production" (dark red). A needle pointing to the "Maybe?" zone with a confused face emoji next to it.]

Building the System

The Three-Layer Approach

Layer 1: Explicit Signals (The Easy Stuff)

Check the tags. Check the name. Check the owner. If everything screams "production," great! High confidence, move on.

Layer 2: Behavioral Signals (The Detective Work)

Analyze traffic patterns over the last 30 days. Check deployment frequency. Look at the network topology. Build a behavioral profile.

This is where it gets interesting. A server with no explicit tags but consistent 24/7 traffic, low change frequency, and production VPC placement? Probably production.

Layer 3: Contextual Signals (The Social Network)

What depends on this resource? What does it depend on? If it's upstream of production services, it's probably production. If it's only accessed by staging environments, probably not.

This applies graph analysis to infrastructure relationships.
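Conceptually, all three layers feed the same scoring step: each signal contributes a weighted vote, and the result is the confidence score shown in the JSON above. A simplified sketch; the weights and the combination rule are illustrative, not the formula we actually ran:

# Simplified sketch: combine weighted signals into a confidence score plus
# a human-readable reasoning string. Weights and the averaging rule are
# illustrative; the real system used more signals and a more careful rule.

def combine_signals(signals):
    """signals: list of (signal_type, weight) pairs with weights in [0, 1]."""
    if not signals:
        return 0.0, "No signals available; classification unknown."
    confidence = sum(weight for _, weight in signals) / len(signals)
    reasoning = "Based on: " + ", ".join(
        f"{sig_type} ({weight:.0%})" for sig_type, weight in signals
    )
    return round(confidence, 2), reasoning

signals = [
    ("EXPLICIT_TAG", 0.9),
    ("TRAFFIC_PATTERN", 0.85),
    ("NETWORK_TOPOLOGY", 0.8),
    ("CHANGE_FREQUENCY", 0.7),
]
confidence, reasoning = combine_signals(signals)
print(confidence, reasoning)  # 0.81 Based on: EXPLICIT_TAG (90%), ...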

The Feedback Loop

The system had to learn from mistakes. When someone overrode a classification ("no, that's actually production!"), we:

  1. Recorded the correction
  2. Analyzed what signals we missed
  3. Adjusted signal weights
  4. Re-classified similar resources

This feedback loop improved classification accuracy over time without requiring complex ML models.
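The weight adjustment itself can be very simple. A sketch of the idea; the learning rate, update rule, and data structures are illustrative:

# Sketch of the feedback step: when a human overrides a classification,
# nudge the weights of the signals that pointed the wrong way.
# The learning rate and update rule are illustrative.

LEARNING_RATE = 0.05

def apply_correction(signal_weights, observed_signals, corrected_is_production):
    """signal_weights: dict of signal type -> weight in [0, 1].
    observed_signals: dict of signal type -> True if that signal voted 'production'.
    corrected_is_production: the human's answer."""
    for sig_type, voted_production in observed_signals.items():
        if sig_type not in signal_weights:
            continue
        if voted_production == corrected_is_production:
            signal_weights[sig_type] = min(1.0, signal_weights[sig_type] + LEARNING_RATE)
        else:
            signal_weights[sig_type] = max(0.0, signal_weights[sig_type] - LEARNING_RATE)
    return signal_weights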

Use Cases

1. Automated Patching

Our patching system needed to know: can I reboot this server?

  • Production server, low confidence: Ask first
  • Production server, high confidence: Use careful rollout
  • Non-production, high confidence: YOLO reboot

Suddenly we could patch thousands of servers without asking permission for each one.
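In code, the integration boiled down to mapping classification and confidence to a rollout strategy. A hypothetical sketch; the strategy names are illustrative:

# Hypothetical sketch: map classification + confidence to a reboot strategy.

def reboot_strategy(classification: str, confidence: float) -> str:
    if classification == "PRODUCTION":
        if confidence < 0.7:
            return "ASK_OWNER_FIRST"       # production, but we're not sure
        return "CAREFUL_ROLLING_REBOOT"    # production, high confidence
    if confidence >= 0.9:
        return "IMMEDIATE_REBOOT"          # confidently non-production
    return "ASK_OWNER_FIRST"               # unclear: be conservative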

2. Cost Optimization

Find dev resources that are tagged as production (expensive!). Find oversized staging servers. Find that test database that's bigger than production.

We saved ~$200k/month just by identifying misclassified resources.

3. Chaos Engineering

Want to inject failures? Don't target production by accident.

The classification service became the safety guardrail: "This resource has 95% confidence of being production. Really sure you want to terminate it?"

Contextual warnings significantly reduced accidental chaos experiments on production resources.

[IMAGE: A big red button labeled "TERMINATE SERVER". A dialog box asking "This resource is 94% likely to be PRODUCTION. Are you REALLY sure?" with a tiny "Yes" button and a huge "OH GOD NO" button.]

The Edge Cases That Broke Everything

The "Permanent Temporary" Server

Started as a quick test. Never got deleted. Gradually became critical. Now production depends on it.

Classification by name: "Non-production"
Classification by behavior: Production

The system learned to detect these through traffic analysis and dependency graphs rather than relying on names.

Blue-Green Deployments

Server alternates between production and standby every week.

Is it production? The answer alternates weekly.

We introduced a "transiently production" classification for resources that alternate between active and standby roles.

Multi-Tenant Resources

One database. Partition A: production data. Partition B: test data.

The database is simultaneously production and non-production depending on which data you're querying.

Shadow Production

Load testing environments replay production traffic. They don't serve customer requests, but their availability blocks production deployments.

We introduced a "production-critical" category for resources that don't serve production traffic but are critical to production operations.

The Gotchas

False Positives Are Expensive

False positives have costs: unnecessary monitoring, security scans, and operational overhead for non-production resources.

False negatives are worse: automation can impact production resources, affecting customers.

We biased the system toward false positives: conservative classification is safer than aggressive automation.

Tag Hygiene Is Really Hard

Tags aren't maintained consistently. People are focused on delivery, not metadata hygiene.

We built automated tag suggestions: the system would detect likely misclassifications and suggest corrections. This approach worked better than mandating compliance.
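A hypothetical sketch of the suggestion check: compare the tag-derived environment with the behavior-derived classification and flag mismatches, rather than blocking anyone. The function name and thresholds are illustrative:

# Hypothetical sketch: suggest a tag fix when explicit tags disagree with
# the behavioral classification, rather than enforcing anything.

def suggest_tag_fix(resource_id, tagged_env, behavioral_class, confidence):
    if tagged_env is None and confidence >= 0.8:
        return f"{resource_id}: no Environment tag; suggest Environment={behavioral_class.lower()}"
    if tagged_env and tagged_env.upper() != behavioral_class and confidence >= 0.8:
        return (f"{resource_id}: tagged Environment={tagged_env} but behaves like "
                f"{behavioral_class} ({confidence:.0%} confidence); please review")
    return None  # tags and behavior agree, or we're not confident enough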

Classification Disputes

Teams disagreed about what counted as production.

Engineering: "It's not production, no customer traffic!"
Product: "It blocks production releases, so it's production-critical!"
Finance: "Does it cost money? Then I want to know what it is."

We added a "disputed" flag and let humans fight it out.

What Worked

1. Confidence scores instead of binary classification

"87% confident this is production" is way more useful than "yes."

2. Multiple signal sources

No single signal is reliable. Combine everything.

3. Human override with feedback

Let people correct mistakes, then learn from them.

4. Clear explanations

The system always explained its classifications. Showing the underlying signals and their weights built trust with operators.

5. Conservative defaults

When in doubt, classify as more production than less. Safe beats sorry.

[IMAGE: A decision tree diagram showing the classification logic, with a developer looking at it and nodding approvingly. Annotations pointing to different parts: "Clear logic", "Shows reasoning", "Conservative defaults", "Can override if wrong"]

What Didn't Work

The ML Experiment

We experimented with ML-based classification using resource features.

The model achieved reasonable accuracy but was uninterpretable. When misclassifications occurred, we couldn't explain the reasoning. Teams lost confidence in the system.

For this use case, interpretability mattered more than marginal accuracy gains.

Fully Automated Reclassification

We tried automatic reclassification based on behavioral changes.

This failed: a staging server experienced a traffic spike during load testing and was reclassified as production, triggering incorrect policy enforcement.

Human approval for reclassifications prevented these false positives.

Dependency-Based Classification

"If you depend on production, you're production."

This rule sounds reasonable but creates classification creep. Eventually every resource depends on something production (DNS, authentication, etc.).

We implemented dependency depth limits and weight decay. Only direct dependencies strongly influenced classification.
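Concretely, a dependency's influence decays with its distance in the graph and gets cut off entirely after a couple of hops. A sketch; the decay factor, depth limit, and base weight are illustrative:

# Sketch: dependency influence decays with graph distance and is cut off
# after MAX_DEPTH hops. Decay factor, depth limit, and base weight are
# illustrative.

MAX_DEPTH = 2
DECAY = 0.5

def dependency_signal(depth_of_nearest_production_dependency: int) -> float:
    """Return the weight a production dependency contributes, given its depth.
    Depth 1 = direct dependency, depth 2 = dependency of a dependency, etc."""
    depth = depth_of_nearest_production_dependency
    if depth < 1 or depth > MAX_DEPTH:
        return 0.0  # no production dependency, or too far away to matter
    return 0.8 * (DECAY ** (depth - 1))

print(dependency_signal(1))  # 0.8 - direct dependency on production
print(dependency_signal(2))  # 0.4 - one hop removed
print(dependency_signal(3))  # 0.0 - beyond the cutoff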

Key Lessons

1. "Production" Is a Spectrum

Not binary. Not even a simple scale. More like a multi-dimensional risk profile.

2. Metadata Quality Is the Bottleneck

Classification accuracy is limited by metadata quality. Investing in tagging standards and tooling from the start pays dividends.

3. Conservative by Default

False positives are annoying. False negatives are catastrophic.

When in doubt, treat it as production.

4. Explain Your Decisions

"This is production because I said so" doesn't work.

"This is production because: explicit tag (90% weight), consistent traffic (85%), prod VPC (80%)" builds trust.

5. Humans Are Part of the System

Human judgment remains essential. Build systems that learn from human corrections rather than trying to eliminate human involvement.

The Impact

After a year:

  • Classified 50,000+ resources
  • Saved ~$2M annually in cost optimization
  • Reduced patch deployment time by 60%
  • Zero production incidents from automation gone wrong
  • Enabled safe chaos engineering experiments

The system proved that resource classification, done well, enables significant improvements in operational safety and efficiency.

[IMAGE: A before/after comparison. Before: A messy spreadsheet with ???s everywhere and a stressed person. After: A clean dashboard showing classified resources with confidence scores and a person drinking coffee calmly with their feet up.]


Implementation details abstracted. All examples simplified for illustration.

Further Reading: For more on building safe automation, see Patching 25,000 Servers Without Breaking the Internet.