We Inherited a 200-Service Kubernetes Cluster. Here's What We Found.

June 12, 2026

The CTO's message on day one: "We'll give you admin access this afternoon. The previous team left three weeks ago. We're not sure what's running or why."

This is not unusual. We've taken over clusters that had been hand-configured over years by people who are no longer there, clusters that grew from a 10-service prototype into something with 200 workloads that nobody has a full picture of, clusters that are technically running fine until the day they aren't. The company was a marketplace startup, Series B, about 60 engineers. Production was handling real traffic. Our first rule in every takeover is: one week of pure read access, no changes, just understanding. You don't fix what you don't understand.

Here's what that first week found.

The audit: seven findings

1. No resource limits on 60% of pods

Of 200-odd workloads running across the cluster, 118 had no CPU or memory limits set. Some had requests, most had neither. In Kubernetes, a pod with no limits can consume as much CPU and memory as the node offers — which means one misbehaving service can starve everything else on the node. We'd seen this cause cascading failures before: a memory leak in a low-priority background job takes down the node, which evicts the pods running on it, including the payment service.

The absence of limits wasn't negligence — it was history. The cluster had started small with a single team who knew what every service did. Limits felt unnecessary when you can see everything. By the time it was 200 services, nobody had gone back.
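A check like this is easy to script against the API output. The sketch below assumes the pod list was saved with kubectl get pods -A -o json. The function name and the sample data are hypothetical and illustrative only.

```python
import json

def containers_missing_limits(pod_list):
    """Return (namespace, pod, container, missing) for containers lacking a CPU or memory limit."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for container in pod.get("spec", {}).get("containers", []):
            limits = (container.get("resources") or {}).get("limits") or {}
            missing = [r for r in ("cpu", "memory") if r not in limits]
            if missing:
                findings.append(
                    (meta.get("namespace"), meta.get("name"), container.get("name"), missing)
                )
    return findings

# Hypothetical two-pod sample in the shape of `kubectl get pods -A -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "prod", "name": "checkout-1"},
   "spec": {"containers": [{"name": "api",
     "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}}]}},
  {"metadata": {"namespace": "batch", "name": "report-1"},
   "spec": {"containers": [{"name": "job",
     "resources": {"requests": {"cpu": "100m"}}}]}}
]}
""")

for finding in containers_missing_limits(sample):
    print(finding)
```

A container with requests but no limits, like the second one in the sample, is still reported, because requests alone do not cap what it can consume.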

2. Containers running as root

About 40% of the workloads were running with no security context defined, which means the container runs as whatever user its image specifies; unless the Dockerfile sets a USER, that is root (UID 0). A subset had explicitly set runAsUser: 0. Some of those containers had broad host path mounts. If a container is compromised, running as root dramatically increases the blast radius: container escapes are hard but not impossible, and root in a container with host mounts is root with access to the node.

This wasn't a theoretical risk for them: they handled payment data, and their SOC 2 audit was six months out. Root containers in roughly 40% of workloads were going to be a finding.
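The rule for deciding whether a container can end up as root fits in a few lines. This is a minimal sketch, assuming the pod spec and container are plain dicts taken from kubectl get pods -o json; the function name is hypothetical. It relies on the documented behavior that container-level securityContext fields override pod-level ones.

```python
def may_run_as_root(pod_spec, container):
    """True if nothing in the spec prevents this container from running as UID 0."""
    pod_sc = pod_spec.get("securityContext") or {}
    ctr_sc = container.get("securityContext") or {}

    # Container-level settings take precedence over pod-level settings.
    run_as_user = ctr_sc.get("runAsUser", pod_sc.get("runAsUser"))
    run_as_non_root = ctr_sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))

    if run_as_user is not None:
        return run_as_user == 0
    # No explicit UID: the image decides, unless runAsNonRoot forbids root.
    return run_as_non_root is not True

# No security context at all: the image default applies, so flag it.
print(may_run_as_root({}, {"name": "api"}))
# Pod-level non-root UID, no container override: not flagged.
print(may_run_as_root({"securityContext": {"runAsUser": 1000}}, {"name": "api"}))
```

A flagged container is not proof that the process is root, only that the manifest does not rule it out. Confirming it means checking the image's USER or the running process.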

3. Secrets stored in ConfigMaps

Database passwords, API keys for third-party services, internal service tokens: seventeen of them were stored as plaintext values in ConfigMaps rather than Secrets. Kubernetes Secrets are only base64-encoded (not encrypted by default), but they're the correct primitive: they can be encrypted at rest with a KMS integration, audit policies are typically written to record them at the metadata level so their values stay out of audit logs, and most secret management tooling integrates with them rather than with ConfigMaps. More practically, RBAC can restrict who can read Secrets separately from ConfigMaps. These credentials had effectively no access control beyond cluster-level permissions.
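A first pass at finding these can look at key names only, so that no credential values end up printed in a terminal. The sketch below assumes input in the shape of kubectl get configmap -A -o json. The pattern list and function name are hypothetical and should be extended to match local naming conventions.

```python
import re

# Hypothetical pattern; key names vary a lot between teams.
SUSPECT_KEY = re.compile(r"(password|passwd|secret|token|api[_-]?key)", re.IGNORECASE)

def suspicious_configmap_keys(configmap_list):
    """Return (namespace, configmap, key) for data keys whose names look like credentials."""
    hits = []
    for cm in configmap_list.get("items", []):
        meta = cm.get("metadata", {})
        for key in (cm.get("data") or {}):
            if SUSPECT_KEY.search(key):
                hits.append((meta.get("namespace"), meta.get("name"), key))
    return hits

# Hypothetical sample data.
sample_configmaps = {"items": [
    {"metadata": {"namespace": "prod", "name": "orders-config"},
     "data": {"DB_PASSWORD": "example-only", "LOG_LEVEL": "info"}},
    {"metadata": {"namespace": "prod", "name": "ui-config"},
     "data": {"THEME": "dark"}},
]}

for hit in suspicious_configmap_keys(sample_configmaps):
    print(hit)
```

Matching on key names misses credentials embedded in values, such as a connection string stored under a neutral key, so the hits are a starting list rather than a complete one.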

4. Two minor versions behind, with three critical CVEs unpatched

The cluster was running 1.27. Current at the time was 1.29, meaning two minor versions behind. That's not catastrophic on its own — 1.27 was still in standard support — but the specific CVE set it had missed included two that affected the API server authentication chain and one in the container runtime. All three had public proof-of-concept exploits available.

The previous team had noted the upgrade requirement in a Confluence doc eighteen months prior. It never got prioritized because "nothing is broken." Kubernetes upgrades are disruptive, require testing, and have no immediate visible payoff. They slip. Then they slip further until suddenly you're three versions behind and the upgrade path requires multiple intermediate hops.

5. Zombie namespaces from abandoned experiments

Twelve namespaces that were not part of any documented system, with names like ml-pipeline-v2-test, data-team-scratch, and migration-staging. Several were running active pods — workloads consuming CPU and memory 24/7 for projects that had been deprioritized months earlier. One namespace had a GPU-enabled node affinity that was pulling a GPU node into the pool that nothing in production needed. That node was costing $2,800/month.

6. No PodDisruptionBudgets on critical services

PodDisruptionBudgets (PDBs) tell Kubernetes how many pods of a given service must stay available during voluntary disruptions — node drains, cluster upgrades, rolling restarts. Without them, a node drain can terminate every pod of a service simultaneously. Most of the critical services ran two replicas with no PDB, which meant a routine node maintenance operation could cause a complete service outage.

We confirmed this had happened once in the past year: a node was drained for an emergency OS patch at 2am, which evicted both replicas of the checkout API, causing a 12-minute outage. The postmortem had correctly identified the lack of PDBs as the cause. It was marked as a to-do. Nobody had followed through.

7. No liveness or readiness probes on 35 services

35 workloads had no health probes configured. Without a readiness probe, Kubernetes routes traffic to a pod the moment it starts — before the application has finished initializing, connected to its database, or warmed up its cache. Without a liveness probe, a deadlocked process that's technically "running" but serving errors keeps receiving traffic indefinitely. We found one service that had been in a degraded state for weeks — not crashing, just returning 503s on about 30% of requests — which nobody had connected to the missing liveness probe.
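Missing probes can be listed per workload rather than per pod. This sketch assumes input in the shape of kubectl get deployments -A -o json and only covers Deployments; the function name is hypothetical, and StatefulSets or DaemonSets would need the same treatment.

```python
def workloads_missing_probes(deployment_list):
    """Return (namespace, deployment, container, missing) for containers without health probes."""
    findings = []
    for dep in deployment_list.get("items", []):
        meta = dep.get("metadata", {})
        pod_spec = dep.get("spec", {}).get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            missing = [p for p in ("readinessProbe", "livenessProbe") if p not in container]
            if missing:
                findings.append(
                    (meta.get("namespace"), meta.get("name"), container.get("name"), missing)
                )
    return findings

# Hypothetical sample: one container has only a readiness probe.
sample_deployments = {"items": [
    {"metadata": {"namespace": "prod", "name": "search"},
     "spec": {"template": {"spec": {"containers": [
         {"name": "web", "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}}}
     ]}}}},
]}

print(workloads_missing_probes(sample_deployments))
```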

What we did about it, and in what order

Not everything is equally urgent. Part of what we do in the first week is triage: which of these findings can cause an outage in the next 30 days? Which are compliance blockers? Which are important but low urgency?

Our priority order:

Week 2–3: Secrets migration

The ConfigMap secrets were the most immediate compliance risk. We migrated all 17 credentials to Kubernetes Secrets, updated the deployments to reference the new Secrets, and then set up Sealed Secrets for GitOps-compatible encrypted secret management. The migration was done service by service with zero downtime — update the deployment to pull from Secret, roll out, verify, delete the ConfigMap value. No big-bang migration.

Week 2–4: Resource limits audit and rollout

Setting resource limits is not "just add a YAML value." If you set limits too low, pods OOMKill under normal load. We spent two weeks profiling actual usage across the 118 unconstrained pods using VPA (Vertical Pod Autoscaler) in recommendation mode — run it, let it observe, read what it recommends, then set limits at 1.5× the p99 observed values. The rollout was staged by namespace criticality: non-production first, then internal tools, then production services one deployment at a time.
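The arithmetic behind "1.5× the p99" is simple enough to show. This is a sketch only: it assumes usage samples already exported as plain numbers (for example memory in MiB), uses a nearest-rank percentile, and the function names are hypothetical. VPA produces its own recommendations, and this does not reproduce its internal algorithm.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def suggested_limit(samples, headroom=1.5):
    """Limit set at `headroom` times the observed p99."""
    return p99(samples) * headroom

# Hypothetical memory samples in MiB: 1, 2, ..., 100.
memory_mib = list(range(1, 101))
print(p99(memory_mib), suggested_limit(memory_mib))
```

The headroom matters most for memory, where exceeding the limit means an OOMKill; exceeding a CPU limit only throttles the container.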

Week 3: PodDisruptionBudgets for all stateful and critical services

Straightforward once you have an inventory. We added PDBs requiring at minimum 1 pod always available for every service running more than one replica. For services with 3+ replicas we set minAvailable: 2. This took two days and had zero rollout risk.
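The replica rule above can be turned into manifests mechanically. The sketch below builds a policy/v1 PodDisruptionBudget as a dict; the function name, the "-pdb" naming suffix, and the example labels are hypothetical, and the selector must match the labels the workload's pods actually carry.

```python
import json

def pdb_for(name, namespace, match_labels, replicas):
    """Build a PodDisruptionBudget manifest, or None for single-replica services."""
    if replicas < 2:
        # minAvailable: 1 on a single replica would block every node drain.
        return None
    min_available = 2 if replicas >= 3 else 1
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{name}-pdb", "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }

# Hypothetical two-replica service.
manifest = pdb_for("checkout-api", "prod", {"app": "checkout-api"}, replicas=2)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be reviewed and then applied with kubectl apply -f.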

Week 3–4: Health probes for all 35 missing services

For each of the 35 services, we looked at the application code to find the right health endpoint (or added one where none existed), set a reasonable initialDelaySeconds based on observed startup time, and rolled out. The degraded service recovered as soon as it had a liveness probe: the kubelet began killing and restarting the deadlocked process, and the error rate dropped to zero within 20 minutes of the rollout.

Month 2: Namespace cleanup and node rightsizing

We scheduled a 30-minute meeting for each of the twelve zombie namespaces with whoever had created them or was closest to the project. Eight were confirmed dead and deleted. Three had ongoing work and got cleaned up and documented. One — the GPU namespace — turned out to be for an active ML project that had just been deprioritized; we moved the workload to a spot GPU node at one-third the cost.

Month 2–3: Security context rollout

Moving containers from running as root to a non-root user requires knowing what the container actually does: some legitimately need specific Linux capabilities, and changing the UID takes care. We audited each workload, updated Dockerfiles and deployment specs, and rolled out incrementally. A few services needed allowPrivilegeEscalation: false with specific capabilities.drop entries; one needed a custom seccomp profile. Took longer than expected. Done right, not rushed.

Month 3–4: Kubernetes upgrade

Upgrading from 1.27 to 1.29 took two hops (1.27 → 1.28 → 1.29), because Kubernetes control-plane upgrades cannot skip a minor version. We tested each hop in a staging cluster first, documented every deprecated API removed between versions, and ran a compatibility scan against all deployed Helm charts and custom manifests using pluto. The production upgrade was done during a Saturday maintenance window with the team on standby. Total downtime: zero. Time from start to finish: four hours.
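Because minor versions cannot be skipped, the number of hops grows with every release you fall behind. A small sketch of that calculation, with a hypothetical function name and handling only forward upgrades within the same major version:

```python
def upgrade_hops(current, target):
    """List each minor version to upgrade to, in order, one hop at a time."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only same-major forward upgrades are handled")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

print(upgrade_hops("1.27", "1.29"))  # ['1.28', '1.29']
```

Each entry in the list is a separate upgrade that needs its own staging test and deprecated-API check.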

The pattern in inherited clusters

We've done this enough times to know the findings before we audit. Not because teams are careless — but because clusters grow faster than the practices around them, and the people who built the foundation often aren't the people running it. What starts as a sensible decision to skip limits on a three-service cluster is still in place five years and 200 services later because nobody had the time or mandate to go back.

The audit week is the most valuable thing we do at the start of an engagement. Not because we're looking for blame — there's never any — but because you can't prioritize what you haven't mapped. By the end of week one, we have a ranked list of findings, a rough remediation timeline, and a clear answer to the question every CTO asks: "Is our cluster going to fall over?"

In this case: not imminently, but a lot closer than it should have been.

"I knew there were things we'd let slip, but I assumed it was cosmetic. The ConfigMap secrets thing — that one genuinely surprised me. Those were production database credentials sitting in plaintext in our audit logs."

If you're about to do this yourself

Before you touch anything in a cluster you've inherited:

Get full audit log access for the past 30 days. Understand what changes have been made recently before you assume the state you see is stable.
Run kubectl get pods -A | grep -v Running | grep -v Completed — anything not running or completed is a signal.
Run kubectl describe nodes across the cluster. Look for nodes under memory pressure or with high pod counts.
Check ConfigMaps for anything that looks like a credential: kubectl get configmap -A -o yaml | grep -i password is blunt but effective.
List namespaces and ask: does anyone recognize all of these? A namespace nobody can explain is worth 20 minutes of investigation.
Check the Kubernetes version against the support matrix. If you're behind, find out why and plan the upgrade before it becomes urgent.
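The grep in the second check passes any pod whose status reads Running, including pods that are running but not ready. A slightly stricter version can be sketched over the JSON output instead. It assumes input in the shape of kubectl get pods -A -o json; the function name is hypothetical.

```python
def pods_needing_attention(pod_list):
    """Return (namespace, pod, reason) for pods that are neither healthy nor finished."""
    flagged = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # finished pods; kubectl lists these as Completed
        if phase != "Running":
            flagged.append((meta.get("namespace"), meta.get("name"), phase))
            continue
        # Running pods can still have containers that are failing readiness.
        not_ready = [cs.get("name") for cs in status.get("containerStatuses", [])
                     if not cs.get("ready")]
        if not_ready:
            flagged.append((meta.get("namespace"), meta.get("name"),
                            "not ready: " + ", ".join(not_ready)))
    return flagged

# Hypothetical sample: one pending pod and one running-but-unready pod.
sample_pods = {"items": [
    {"metadata": {"namespace": "prod", "name": "ok"},
     "status": {"phase": "Running", "containerStatuses": [{"name": "web", "ready": True}]}},
    {"metadata": {"namespace": "prod", "name": "stuck"}, "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "prod", "name": "flappy"},
     "status": {"phase": "Running", "containerStatuses": [{"name": "web", "ready": False}]}},
]}

for entry in pods_needing_attention(sample_pods):
    print(entry)
```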

Don't try to fix everything in week one. Map it, prioritize it, and fix it in the right order.

Inheriting a Kubernetes cluster you didn't build?

We'll do a full week-one audit — resource configuration, security posture, version status, and everything running that shouldn't be — and give you a prioritized remediation plan.
