Senior DevOps Engineer
Gourav specializes in helping organizations design secure and scalable Kubernetes infrastructures on AWS.
We all dread that call: "Production is down!" But what happens when the outage becomes a regular occurrence? This is the story of how we helped a customer escape a recurring production nightmare by upgrading their Amazon EKS cluster to version 1.30. It wasn't just about new features; it was about bringing back stability and trust.
Our customer was running a large-scale production workload on Amazon EKS v1.29, heavily utilizing Apache Airflow to trigger pods at high frequency. They faced a persistent issue: their Amazon EKS nodes would repeatedly go into a NotReady state, causing significant production disruptions.
Initial investigations into the kubelet logs from affected nodes revealed a critical pattern.
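The original log excerpt isn't reproduced here, but this class of symptom can typically be surfaced with commands like the following; the node name shown is a placeholder:

```shell
# Identify nodes flapping into NotReady
kubectl get nodes -o wide | grep -i notready

# Events recorded against a suspect node (placeholder node name)
kubectl describe node ip-10-0-1-23.ec2.internal | sed -n '/Events:/,$p'

# On the node itself (e.g. via an SSM session), tail the kubelet journal
journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50
```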
A deep dive into the system surfaced significant issues with the CSI drivers: pods attempting to mount Amazon EFS volumes were failing with "connection refused" errors, and the efs.csi.aws.com driver's communication socket was missing from the affected nodes.
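A quick way to confirm this kind of failure is to inspect the driver daemonset and the registration socket on an affected node. The resource names and socket path below assume the defaults used by the EKS add-on:

```shell
# EFS CSI node daemonset and its pods (default add-on labels assumed)
kubectl -n kube-system get daemonset efs-csi-node
kubectl -n kube-system get pods -l app=efs-csi-node -o wide

# Recent logs from the node plugin container
kubectl -n kube-system logs -l app=efs-csi-node -c efs-plugin --tail=50

# On an affected node: the driver's socket should exist at the default path
ls -l /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
```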
The customer was running aws-efs-csi-driver v2.0.1-eksbuild.1, which was well behind the latest release, v2.1.6. The newer version included crucial bug fixes for socket handling, systemd compatibility, and mount retries; the outdated driver was not merely buggy, it was directly contributing to node instability and crashes.

The Secrets Store CSI driver (v1.4.4) was also failing. The cluster ran Kubernetes v1.29, which enforced VOLUME_MOUNT_GROUP compatibility, but the deployed driver version didn't fully implement it. As a result, secret mounts failed, leading to pod crashes and Airflow DAG failures.

Adding to the complexity, the customer's custom node AMIs were effectively black boxes: base image versions, containerd setups, and init systems were undocumented, which hindered effective troubleshooting and made consistent patching nearly impossible.
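As a sketch, the installed add-on version can be audited and bumped with the AWS CLI. The cluster name and the exact eksbuild suffix below are placeholders:

```shell
CLUSTER=my-cluster   # placeholder cluster name

# Currently installed add-on version
aws eks describe-addon --cluster-name "$CLUSTER" \
  --addon-name aws-efs-csi-driver \
  --query 'addon.addonVersion' --output text

# Add-on versions compatible with the target Kubernetes version
aws eks describe-addon-versions --addon-name aws-efs-csi-driver \
  --kubernetes-version 1.30 \
  --query 'addons[0].addonVersions[].addonVersion' --output text

# Upgrade the add-on (version string is illustrative)
aws eks update-addon --cluster-name "$CLUSTER" \
  --addon-name aws-efs-csi-driver \
  --addon-version v2.1.6-eksbuild.1 \
  --resolve-conflicts OVERWRITE
```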
Initially, we implemented several stabilization measures.
While these steps provided some stability, the underlying issues, such as a misaligned kube-proxy version and drifted CNI plugin configuration, indicated the cluster was fundamentally unhealthy. We decided to perform a full upgrade to Amazon EKS v1.30, not for new features, but primarily for enhanced stability and predictability.
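Version drift of this kind can be spotted by comparing the image tags actually running against the control-plane version, for example:

```shell
# Control-plane version
kubectl version -o json | jq -r '.serverVersion.gitVersion'

# kube-proxy image tag currently deployed
kubectl -n kube-system get daemonset kube-proxy \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# VPC CNI (aws-node) image tag currently deployed
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```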
We presented two upgrade strategies:
1. In-Place Upgrade: Lower disruption, but no easy rollback once the control plane is upgraded. This involves upgrading the control plane, launching new v1.30 node groups, and gradually draining old nodes.
2. Blue-Green Migration: Safe rollback, but higher complexity and time commitment due to the need to mirror workloads, secrets, CI/CD, and observability in a new cluster.
Given the cluster's complex dependencies (VPC peering, numerous route tables, hardcoded selectors in Helm charts), the customer opted for the In-Place Upgrade.
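In outline, an in-place upgrade of this shape looks like the following. Cluster, node group, and node names are placeholders, and a production run would drain old nodes gradually behind PodDisruptionBudgets rather than all at once:

```shell
CLUSTER=my-cluster   # placeholder names throughout

# 1. Upgrade the control plane to 1.30 and wait for it to settle
aws eks update-cluster-version --name "$CLUSTER" --kubernetes-version 1.30
aws eks wait cluster-active --name "$CLUSTER"

# 2. Roll a managed node group onto 1.30 AMIs
aws eks update-nodegroup-version --cluster-name "$CLUSTER" \
  --nodegroup-name workers-a

# 3. Drain remaining old nodes one at a time, honoring PDBs
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets --delete-emptydir-data --timeout=300s
```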
Our execution followed a meticulous, staged process.
The upgrade went smoothly, an outcome we attribute largely to the extensive preparation and validation performed at each step.
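Post-upgrade, a few quick checks help confirm the cluster is healthy before declaring victory, for example:

```shell
# All nodes should report Ready on the new kubelet version
kubectl get nodes -o wide

# Any pods stuck outside Running/Succeeded across the cluster
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Watch for NotReady regressions over time
kubectl get events -A --field-selector reason=NodeNotReady --watch
```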
The transformation post-upgrade was significant: the recurring NotReady incidents that had plagued production stopped, and the cluster settled into stable, predictable operation.
This experience reinforced several key lessons: keep managed add-ons current, avoid opaque custom AMIs, and treat cluster upgrades as stability work first and feature work second.
This upgrade transcended a simple version bump; it marked a pivotal shift from reactive firefighting to proactive foresight, transforming a brittle system into a reliable, predictable, and "boring" (in the best way possible) production environment.
Is your Kubernetes environment causing more chaos than confidence? We specialize in stabilizing complex Amazon EKS deployments and can help you achieve predictable, reliable operations. Learn more about our expert-led Kubernetes Management & Optimization services.