
We all dread that call: "Production is down!" But what happens when the outage becomes a regular occurrence? This is the story of how we helped a customer escape a recurring production nightmare by upgrading their Amazon EKS cluster to version 1.30. It wasn't just about new features; it was about bringing back stability and trust. 

Identifying the Root Cause: The Vanishing Nodes and Silent Kubelet 

Our customer was running a large-scale production workload on Amazon EKS v1.29, relying heavily on Apache Airflow to trigger pods at high frequency. They faced a persistent issue: their Amazon EKS nodes kept dropping into a NotReady state, leading to significant production disruptions.

Initial investigations into the kubelet logs from affected nodes revealed a critical pattern: 

  • Nodes were dropping every 5-7 days. 
  • Errors like "Failed to get node when trying to set owner ref to the node lease" and "update node status exceeds retry count" indicated the kubelet was failing to update its heartbeat lease with the API server. 
  • The most critical finding was containerd[3661]: SIGABRT: abort, signifying a containerd crash. This crash, in turn, brought down the aws-node pod responsible for VPC networking, leading to a complete loss of API server communication and the node being marked NotReady. (A minimal node-health check in this spirit is sketched below.)
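
Below is a minimal sketch of the kind of node-health check referenced above, using the Kubernetes Python client to flag nodes whose Ready condition is anything other than True. The kubeconfig setup and script structure are our assumptions, not the customer's tooling.

```python
# Minimal sketch (not the customer's tooling): list nodes whose Ready
# condition is anything other than "True", i.e. the NotReady pattern above.
# Assumes a kubeconfig with read access to the cluster.
from kubernetes import client, config

def not_ready_nodes():
    config.load_kube_config()  # inside a pod, use config.load_incluster_config()
    v1 = client.CoreV1Api()
    flagged = []
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            # The kubelet heartbeat drives this condition; a stale node lease
            # typically surfaces here as status "Unknown".
            if cond.type == "Ready" and cond.status != "True":
                flagged.append((node.metadata.name, cond.status, cond.reason))
    return flagged

if __name__ == "__main__":
    for name, status, reason in not_ready_nodes():
        print(f"{name}: Ready={status} reason={reason}")
```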

Uncovering Critical CSI Driver Issues 

Our deep dive into the system revealed significant issues with CSI drivers. Pods attempting to mount Amazon EFS volumes were failing with "connection refused" errors, and the efs.csi.aws.com driver's communication socket was missing.
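
A useful first check for this failure mode is whether the EFS CSI node plugin is actually running and ready on every node. Below is a minimal sketch using the Kubernetes Python client; the app=efs-csi-node label and the kube-system namespace match the driver's defaults but are assumptions about any particular install.

```python
# Sketch: check that the EFS CSI node plugin is running and ready on each node.
# The "app=efs-csi-node" label and kube-system namespace are the driver's
# defaults and are assumptions about this particular install.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("kube-system", label_selector="app=efs-csi-node")
for pod in pods.items:
    statuses = pod.status.container_statuses or []
    ready = all(cs.ready for cs in statuses)
    restarts = sum(cs.restart_count for cs in statuses)
    print(f"{pod.spec.node_name}: phase={pod.status.phase} "
          f"ready={ready} restarts={restarts}")
```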

The customer was running aws-efs-csi-driver:v2.0.1-eksbuild.1, which was well behind the latest release. The then-current version, v2.1.6, included crucial bug fixes for socket handling, systemd compatibility, and mount retries. The outdated driver was not merely buggy; it was directly contributing to node instability and crashes.

The Secrets Store CSI driver (v1.4.4) was also failing. The cluster ran Kubernetes v1.29, which enforces VOLUME_MOUNT_GROUP compatibility, but the deployed driver version didn't fully implement it. This caused secret mounts to fail, leading to pod crashes and Airflow DAG failures.

Adding to the complexity, the customer's nodes ran custom AMIs. These "black box" images lacked transparent configuration for base image versions, containerd setup, and init systems, which hindered effective troubleshooting and made consistent patching nearly impossible.
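
To quantify how far behind the cluster's components had drifted, we could compare each installed add-on version against the newest build EKS publishes for the cluster's Kubernetes version. The sketch below assumes the relevant drivers are installed as EKS managed add-ons and uses a placeholder cluster name; it is illustrative, not the customer's actual tooling.

```python
# Sketch: compare each installed managed add-on with the newest version EKS
# publishes for the cluster's Kubernetes version. Cluster name is a placeholder.
import boto3

eks = boto3.client("eks")
CLUSTER = "prod-cluster"  # placeholder

k8s_version = eks.describe_cluster(name=CLUSTER)["cluster"]["version"]

for addon in ("aws-efs-csi-driver", "vpc-cni", "coredns", "kube-proxy"):
    installed = eks.describe_addon(clusterName=CLUSTER, addonName=addon)
    current = installed["addon"]["addonVersion"]

    versions = eks.describe_addon_versions(
        addonName=addon, kubernetesVersion=k8s_version
    )["addons"][0]["addonVersions"]
    latest = versions[0]["addonVersion"]  # assumption: newest build listed first

    marker = "" if current == latest else "  <-- behind"
    print(f"{addon}: installed={current} latest={latest}{marker}")
```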

Strategic Decision: From Patching to a Full Amazon EKS Upgrade

Initially, we implemented several stabilization measures: 

  • Upgraded aws-efs-csi-driver to v2.1.6 (see the add-on update sketch after this list).
  • Upgraded secrets-store-csi-driver to v1.4.6. 
  • Installed AWS's node auto-repair agent. 
  • Enabled detailed metrics and alerts via Datadog. 
  • Resized memory limits for OOMKilled containers. 
  • Collected comprehensive logs using the Amazon EKS Log Collector. 
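
For the managed-add-on items in the list above, the bump can be driven through the EKS add-on API, as in the minimal sketch below. The cluster name and eksbuild suffix are placeholders, and the Secrets Store CSI driver is typically Helm-managed, so it is omitted here.

```python
# Sketch of the EFS CSI driver bump via the EKS managed add-on API.
# Cluster name and the exact eksbuild suffix are placeholders.
import boto3

eks = boto3.client("eks")
CLUSTER = "prod-cluster"  # placeholder

resp = eks.update_addon(
    clusterName=CLUSTER,
    addonName="aws-efs-csi-driver",
    addonVersion="v2.1.6-eksbuild.1",   # placeholder build suffix
    resolveConflicts="OVERWRITE",       # let EKS reset drifted add-on config
)
print(resp["update"]["id"], resp["update"]["status"])
```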

While these steps provided some stability, the underlying issues, such as misaligned kube-proxy and drifted CNI plugin configuration, indicated the cluster was fundamentally unhealthy. The decision was made to perform a full upgrade to Amazon EKS v1.30, not just for new features, but primarily for enhanced stability and predictability. 
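
One concrete symptom of that drift is version skew between the control plane and node-level components such as kube-proxy. A minimal skew check, assuming the standard kube-proxy DaemonSet in kube-system and a placeholder cluster name, might look like this:

```python
# Sketch: surface version skew by comparing the kube-proxy image tag running in
# the cluster with the control-plane version. Assumes the standard kube-proxy
# DaemonSet in kube-system; cluster name is a placeholder.
import boto3
from kubernetes import client, config

CLUSTER = "prod-cluster"  # placeholder

cp_version = boto3.client("eks").describe_cluster(name=CLUSTER)["cluster"]["version"]

config.load_kube_config()
ds = client.AppsV1Api().read_namespaced_daemon_set("kube-proxy", "kube-system")
image = ds.spec.template.spec.containers[0].image  # e.g. ...:v1.29.x-eksbuild.y

print(f"control plane: {cp_version}, kube-proxy image: {image}")
if f"v{cp_version}." not in image:
    print("kube-proxy image does not match the control-plane minor version")
```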

Upgrade Strategy and Execution 

We presented two upgrade strategies: 

1. In-Place Upgrade: Lower disruption, but no easy rollback once the control plane is upgraded. This involves upgrading the control plane, launching new v1.30 node groups, and gradually draining old nodes. 

2. Blue-Green Migration: Safe rollback, but higher complexity and time commitment due to the need to mirror workloads, secrets, CI/CD, and observability in a new cluster. 

Given the cluster's complex dependencies (VPC peering, numerous route tables, hardcoded selectors in Helm charts), the customer opted for the In-Place Upgrade. 
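
For orientation, the control-plane leg of an in-place upgrade can be initiated and tracked through the EKS API, as in the sketch below. The cluster name is a placeholder, and the real rollout was additionally gated on the probes and validation described next.

```python
# Sketch: kick off the in-place control-plane upgrade and poll the update
# until it finishes. Cluster name is a placeholder.
import time
import boto3

eks = boto3.client("eks")
CLUSTER = "prod-cluster"  # placeholder

update_id = eks.update_cluster_version(name=CLUSTER, version="1.30")["update"]["id"]

while True:
    status = eks.describe_update(name=CLUSTER, updateId=update_id)["update"]["status"]
    print("control-plane update:", status)
    if status in ("Successful", "Failed", "Cancelled"):
        break
    time.sleep(60)
```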

Our execution involved a meticulous process: 

  • Control plane upgrade: Achieved with zero downtime, verified by synthetic probes and real traffic traces.  
  • Add-on upgrades: VPC-CNI, CoreDNS, and kube-proxy were upgraded and validated. 
  • New node group provisioning: A custom-built v1.30 node group was provisioned, with all taints, labels, and IAM roles surgically replicated. 
  • Drain-and-validate: Old nodes were drained one by one (see the drain sketch after this list), with continuous monitoring of logs (piped to OpenSearch via Fluent Bit) for any anomalies. 
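
For reference, each drain followed the same pattern `kubectl drain` uses: cordon the node, then evict its pods through the Eviction API so PodDisruptionBudgets are honored. The sketch below is a simplified illustration with a placeholder node name.

```python
# Simplified drain: cordon the node, then evict its non-DaemonSet pods through
# the Eviction API so PodDisruptionBudgets are respected (what `kubectl drain`
# does under the hood). Node name is a placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def drain(node_name: str) -> None:
    # Cordon: mark the node unschedulable so nothing new lands on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    )
    for pod in pods.items:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue  # DaemonSet pods stay with the node
        v1.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=client.V1Eviction(
                metadata=client.V1ObjectMeta(
                    name=pod.metadata.name, namespace=pod.metadata.namespace
                )
            ),
        )

drain("ip-10-0-1-23.ec2.internal")  # placeholder node name
```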

The upgrade went smoothly, which we attribute largely to the extensive preparation and validation performed at each step. 

Post-Upgrade Impact: Restored Trust and Predictability 

The transformation post-upgrade was significant: 

  • Airflow DAGs executed without misfires. 
  • EFS mounts attached without delay. 
  • Secrets were injected on the first attempt. 
  • The team's trust in its infrastructure was restored; the psychological shift from constant firefighting to confident operation mattered as much as the technical fixes. 

This experience reinforced several key lessons: 

  • Add-ons are not optional dependencies: Outdated CSI drivers can lead to cascading failures. 
  • Custom AMIs pose control challenges: Without consistent validation and patching, they introduce significant exposure. 
  • Version upgrades enhance reliability: Newer Kubernetes versions, like 1.30, bring crucial stability improvements. 
  • Effective debugging requires forensic analysis: The truth of system behavior is often hidden in detailed log analysis. 

This upgrade transcended a simple version bump; it marked a pivotal shift from reactive firefighting to proactive foresight, transforming a brittle system into a reliable, predictable, and "boring" (in the best way possible) production environment. 

Is your Kubernetes environment causing more chaos than confidence? We specialize in stabilizing complex Amazon EKS deployments and can help you achieve predictable, reliable operations. Learn more about our Expert-led Kubernetes Management & Optimization Services.

Meet the Author
  • Gourav Kumar Pandey
    Senior DevOps Engineer

    Gourav specializes in helping organizations design secure and scalable Kubernetes infrastructures on AWS.
