Kubernetes Best Practices for Production Clusters

There's a gap between "Kubernetes works" and "Kubernetes works at 2am on a Friday when the on-call engineer is three time zones away and has only a phone." The difference between those two states is not how well you understand Kubernetes — it's how carefully you set it up before anything went wrong.

We manage 50+ production clusters across EKS, GKE, AKS, and bare metal. The problems we get called in to fix are almost always the same six things. Not exotic Kubernetes bugs, not cloud provider edge cases — the basics that were skipped during initial setup because there was pressure to ship.

This is what we apply to every cluster we touch, and why.

Set Resource Requests and Limits — Every Container, No Exceptions

Missing resource limits are the most common cause of cascading failures in Kubernetes. Here's the failure mode: one container has a memory leak. Without a limit, it grows until the node OOM-kills it — but by that point it's already consumed enough memory that the node is under pressure, evicting other pods to reclaim memory. Those pods land on other nodes, which causes similar pressure, and you have a cascade that takes out the cluster rather than just the leaky service.

With a memory limit set, the container gets OOM-killed before it can pressure the node. The kubelet restarts it, your restart alerting catches it, the on-call gets a page, and the incident is contained to one service.

The setup that works:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m       # soft ceiling — CPU is throttled, not killed
    memory: 256Mi   # hard ceiling — exceeded = OOM kill

CPU limits are more nuanced. A CPU limit is enforced as a CFS quota: the container is capped at the limit over each scheduling period, so it can be throttled at burst even when the node has idle CPU. For latency-sensitive services, consider omitting the CPU limit and setting only a request; the request still guarantees the container its fair share under contention, and it can use idle CPU when available. For batch workloads, set both.

Use VPA in recommendation mode to calibrate requests. Run it for 7-14 days, then review its recommendations: requests should sit around P50 of actual usage, limits around P99. Don't set limits at 10x requests "to be safe"; that defeats the purpose.
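
A minimal VPA in recommendation-only mode looks roughly like this, assuming the VPA components are installed in the cluster (the Deployment name here is illustrative):

# VPA in recommendation-only mode: it observes usage but never evicts pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only; read the values with kubectl describe vpa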

RBAC from Day One, Not Day 100

We've audited clusters where the CI/CD pipeline ran as cluster-admin. Where every developer had the same service account token with wide permissions. Where the default service account in the production namespace had been accidentally granted access to secrets across the cluster.

The reason this happens isn't negligence — it's that Kubernetes RBAC is verbose and the path of least resistance during setup is "grant more permissions, it works." The cost of fixing it later is high because you have to audit what's actually using each permission before you can remove it.

The principle: namespaces are your isolation boundary, service accounts are your identity, RBAC is how you wire them together.

# Namespace-scoped Role for a typical application
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-app
  name: app-runner
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update", "patch"]

For OIDC integration: EKS can authenticate against your identity provider (Okta, Google Workspace, Azure AD), and the groups claim from the IdP maps to Kubernetes groups that you bind with RoleBindings and ClusterRoleBindings. Engineers authenticate with their SSO credentials, not a shared kubeconfig. This is table stakes for any cluster with more than two engineers.
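
Once groups flow through from the IdP, access is RBAC again. A sketch binding an SSO group to the built-in read-only ClusterRole (the group name is whatever your IdP puts in the groups claim; "engineering" is hypothetical):

# Grant read-only cluster access to everyone in the IdP group "engineering"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineering-view
subjects:
- kind: Group
  name: engineering
  apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view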

Use kubectl auth can-i --list --as=system:serviceaccount:&lt;namespace&gt;:&lt;name&gt; to audit what a service account can actually do. Run it against your CI service account and your application service accounts. The results are often surprising.

Network Policies: The Security Control Nobody Sets

By default, every pod in a Kubernetes cluster can talk to every other pod. If your frontend deployment has a vulnerability and gets compromised, the attacker has network access to your database, your internal APIs, your secrets management service — anything in the cluster.

Network policies are the Kubernetes-native way to fix this, and they're one of the most consistently skipped security controls we see. The reason: without a CNI plugin that enforces network policies (Calico, Cilium, Weave Net), the resources exist but do nothing. On EKS, the default VPC CNI only enforces network policies if you explicitly enable its network policy support (available in recent versions); otherwise, install Calico or Cilium alongside it.

The default-deny pattern:

# Block all ingress and egress in a namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then add explicit allow rules for what should communicate. Your API service should accept traffic from your ingress controller and nothing else. Your database service should accept traffic from your API service and nothing else.
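
For example, an allow rule admitting traffic to the API pods only from the ingress controller's namespace (the labels and port are illustrative; match them to your own deployment):

# Allow ingress to the API pods only from the ingress-nginx namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080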

Cilium is worth the operational overhead at production scale — it does network policy enforcement at the eBPF layer (faster than iptables), gives you L7 policy enforcement (allow GET /api/v2/health but deny everything else), and comes with Hubble for network observability so you can see which pods are talking to which and write policies based on actual traffic.
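
As a sketch of what L7 enforcement looks like with Cilium's own CRD (the labels and port are hypothetical):

# L7 rule: monitoring pods may call GET /api/v2/health on the API, nothing else
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-health-only
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: monitoring
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v2/health"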

GitOps Instead of kubectl apply

The operational problem with imperative Kubernetes management: you apply a change in a hurry during an incident, the cluster drifts from what's in git, and six months later nobody knows what's actually running vs what the manifests say. Then someone does a kubectl apply from git to "restore" a service and undoes a critical live fix that was never committed.

GitOps — using ArgoCD or Flux to continuously reconcile cluster state with git — eliminates this class of problem. Every change goes through a pull request. ArgoCD applies the change. If the cluster drifts from git (someone runs kubectl edit in a panic), ArgoCD will either flag it or auto-remediate it, depending on your configuration.
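
Auto-remediation is a sync policy setting on the Application. A minimal sketch (the repo URL and path are placeholders):

# ArgoCD Application that prunes removed resources and reverts manual drift
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-manifests   # placeholder
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # a kubectl edit in a panic gets reverted to what git says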

Secrets are the tricky part of GitOps. You can't store plaintext secrets in git. The approaches that work:

  • External Secrets Operator pulling from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. The manifest in git references a secret name, the operator fetches the value at runtime. Clean separation between secret references (safe to commit) and secret values (never in git). See the sketch after this list.
  • Sealed Secrets — encrypt secrets with a cluster-public-key, commit the encrypted form, unseal happens in-cluster. Works, but key rotation is painful and you can't use the same sealed secret on multiple clusters.
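
A sketch of the External Secrets pattern, assuming the operator is installed and a SecretStore named aws-secrets-manager already exists (both names are hypothetical):

# ExternalSecret: git holds only the reference; the operator fetches the value
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # hypothetical SecretStore
    kind: SecretStore
  target:
    name: db-credentials        # the Kubernetes Secret the operator creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/db/password     # path in AWS Secrets Manager (placeholder)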

For ArgoCD specifically: use ApplicationSets for managing multiple environments from a single configuration. The matrix generator lets you combine a list of clusters with a list of applications and generate Applications programmatically — instead of maintaining 50 Application manifests by hand.
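
A sketch of the matrix generator idea (cluster registration is assumed, and the app names and repo URL are placeholders):

# ApplicationSet: every registered cluster x every app entry = one Application
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-apps
  namespace: argocd
spec:
  generators:
  - matrix:
      generators:
      - clusters: {}              # every cluster registered in ArgoCD
      - list:
          elements:
          - app: my-app           # placeholder app names
          - app: billing
  template:
    metadata:
      name: '{{name}}-{{app}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deploy-manifests   # placeholder
        targetRevision: main
        path: 'apps/{{app}}'
      destination:
        server: '{{server}}'
        namespace: '{{app}}'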

The Upgrade Strategy Nobody Talks About

Kubernetes releases a new minor version every 4 months. Support for each minor version ends 14 months after release. If you're not upgrading at least twice a year, you're running unsupported software with known CVEs.

The upgrade strategy that's worked for us across 50+ clusters:

  1. Staging cluster runs N+1 before production. Upgrade staging to 1.32 while production runs 1.31. Run your application test suite against staging. If anything breaks, fix it before upgrading production.
  2. Check deprecated APIs before every upgrade. Pluto or the deprecation warnings from kubectl apply --dry-run=server will show you which manifests use deprecated APIs; kubectl convert can then rewrite them to the current versions. Fix those before the control plane upgrade.
  3. Control plane first, node groups second. Upgrade the EKS control plane. Then upgrade node groups one at a time, using a rolling update that cordons and drains nodes before replacing them.
  4. Schedule upgrades during low-traffic windows. Not midnight, because that's miserable. A Tuesday or Wednesday at 10am when your on-call team is awake and traffic is predictable.

Node upgrades with Karpenter are significantly less painful than with managed node groups: Karpenter supports drift detection and will automatically replace nodes that don't match the desired configuration (including the desired AMI version). You configure the upgrade policy and Karpenter handles the rolling replacement.
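
A sketch of what that policy looks like with Karpenter's v1 API (the EC2NodeClass referenced here is assumed to exist and to pin the desired AMI):

# NodePool: drifted nodes get replaced, at most 10% of nodes at a time
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default        # assumed to exist; defines AMI selection
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
    - nodes: "10%"           # cap on how many nodes Karpenter disrupts in parallel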

Observability at the Kubernetes Layer

Application-level observability is necessary but not sufficient. You also need visibility at the Kubernetes layer itself — the scheduler, the kubelet, the control plane. What to collect and what to alert on:

kube-state-metrics exposes Kubernetes object state as Prometheus metrics. The signals that matter:

  • kube_pod_status_phase{phase="Failed"} — pods stuck in Failed state
  • kube_deployment_status_replicas_unavailable — deployments with unavailable replicas
  • kube_persistentvolumeclaim_status_phase{phase="Pending"} — PVCs that haven't bound
  • kube_node_status_condition{condition="Ready",status="false"} — nodes that aren't ready

node-exporter gives you host-level metrics: CPU, memory, disk, network. Alert when a node's disk crosses 80% full, before Kubernetes reports DiskPressure; by the time Kubernetes evicts pods for disk pressure, you already have an incident.
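
Wiring two of those signals into alert rules, assuming the Prometheus Operator is running (the thresholds and for: durations are starting points, not gospel):

# Alert rules for unavailable replicas and node disk usage
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-baseline-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-baseline
    rules:
    - alert: DeploymentReplicasUnavailable
      expr: kube_deployment_status_replicas_unavailable > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
    - alert: NodeDiskAlmostFull
      expr: |
        (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.20
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Disk on {{ $labels.instance }} is over 80% full"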

What not to alert on: every Kubernetes warning event. That generates noise. Scope your alerts to things that require human action within a defined time window. A pod restart at 3am that recovers immediately is information, not a page.

Want this applied to your cluster?

We audit production Kubernetes clusters and deliver a findings report covering RBAC posture, network policy coverage, resource configuration, observability gaps, and upgrade readiness. Free, no commitment — we do it because it usually surfaces something worth fixing.

Book Free Audit
