How Foundation AI re-architected and streamlined their Kubernetes environment with an upgrade to EKS v1.30

Industry: Gen AI
Headquarters: Irvine, California
Founded in: 2019
Company Size: 51 - 200 Employees

Overview

Foundation AI is an AI-powered automation company helping enterprises streamline document-intensive workflows across legal, healthcare, and financial services. Their platform leverages machine learning and intelligent document processing to transform unstructured data into actionable insights.

Values delivered

Eliminated recurring node failures and NotReady states.
Upgraded 10+ components including EKS add-ons and CSI drivers.
Saved 200+ person-hours and enabled seamless upgrades.
Reduced EFS mount failures and pod crash loops.
Restored production stability and reliable infrastructure operations.

Challenges

Recurring Production Outages

Foundation AI faced ongoing disruptions with EKS v1.29, as nodes would frequently become unresponsive and enter a NotReady state, severely affecting business-critical workloads, including Apache Airflow DAGs.
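To illustrate how nodes in this state can be spotted, here is a minimal sketch in Python. It assumes node objects in the shape returned by `kubectl get nodes -o json`; the function name and the sample data are hypothetical and are not taken from Foundation AI's tooling.

```python
def not_ready_nodes(node_list: dict) -> list:
    """Return the names of nodes whose Ready condition is not "True"."""
    flagged = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False" or "Unknown",
        # is treated as NotReady here.
        if ready is None or ready.get("status") != "True":
            flagged.append(node["metadata"]["name"])
    return flagged


# Hypothetical sample data, trimmed to the fields the function reads.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(not_ready_nodes(sample_nodes))  # ['node-b']
```

A status of "Unknown" typically appears when the kubelet stops reporting to the control plane, which matches the unresponsive-node symptom described above.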

Container Runtime and Network Failures

Container crashes triggered failures in the aws-node pod responsible for VPC networking, severing communication with the control plane and taking down entire nodes.

Outdated CSI Drivers

EFS mounts were failing because of an outdated aws-efs-csi-driver, and Secrets Store mounts were failing because the deployed Secrets Store CSI driver was incompatible with Kubernetes v1.29’s VOLUME_MOUNT_GROUP enforcement.

Add-on Drift and Inconsistent Configuration

The kube-proxy and VPC-CNI add-ons were misaligned, causing unpredictable behavior. Without uniform configuration management, the cluster became increasingly unstable.
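One way to make this kind of drift visible is to compare what is deployed against a pinned baseline. The sketch below assumes both are available as simple name-to-version mappings; the function name and the version strings are placeholders for illustration, not real add-on releases or the versions used in this engagement.

```python
def find_addon_drift(desired: dict, deployed: dict) -> dict:
    """Map each drifted add-on to (desired_version, deployed_version)."""
    drift = {}
    for name, want in desired.items():
        have = deployed.get(name)  # None means the add-on is not installed
        if have != want:
            drift[name] = (want, have)
    return drift


# Placeholder versions, chosen only to show the comparison.
desired_addons = {"vpc-cni": "1.2.3", "kube-proxy": "1.2.3", "coredns": "1.2.3"}
deployed_addons = {"vpc-cni": "1.2.3", "kube-proxy": "1.1.0"}

for addon, (want, have) in find_addon_drift(desired_addons, deployed_addons).items():
    print(f"{addon}: expected {want}, found {have}")
```

Running a check like this on a schedule, with the baseline kept in version control, is one simple form of the uniform configuration management the cluster was missing.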

Opaque Custom AMIs

Custom-built AMIs lacked transparency and versioning, making troubleshooting difficult and consistent patching nearly impossible.

Solution

To address the persistent instability and operational challenges in their Amazon EKS environment, Foundation AI partnered with CloudKeeper for a comprehensive Kubernetes stabilization initiative. Given the complexity of the cluster, an in-place upgrade to Amazon EKS v1.30 was chosen.

Flawless In-Place EKS Upgrade

With minimal disruption as a priority, CloudKeeper supported a zero-downtime control plane upgrade, validated with synthetic probes and live traffic. New v1.30 node groups were built to mirror the existing taints, labels, and IAM roles. A drain-and-validate process was adopted: old nodes were decommissioned gradually while logs were continuously monitored for anomalies using Fluent Bit and OpenSearch. Despite the cluster’s deeply embedded dependencies, the upgrade was executed seamlessly.
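The drain-and-validate idea can be sketched as a small loop: cordon a node, drain it, run a health check, and stop at the first failure. This is an illustrative sketch only, not CloudKeeper's actual procedure. The function names, the timeout value, and the `validate` callback are assumptions; the `kubectl cordon` and `kubectl drain` commands and flags are standard.

```python
import subprocess
from typing import Callable, List, Sequence


def run_kubectl(args: Sequence[str]) -> int:
    """Run a kubectl command and return its exit code."""
    return subprocess.run(["kubectl", *args]).returncode


def drain_and_validate(
    old_nodes: Sequence[str],
    validate: Callable[[], bool],
    run: Callable[[Sequence[str]], int] = run_kubectl,
) -> List[str]:
    """Drain nodes one at a time; stop at the first failed step or check.

    Returns the nodes that were drained and passed validation.
    """
    done = []
    for node in old_nodes:
        # Stop new pods from being scheduled onto the node.
        if run(["cordon", node]) != 0:
            break
        # Evict workloads; DaemonSet pods are left in place.
        if run(["drain", node, "--ignore-daemonsets",
                "--delete-emptydir-data", "--timeout=300s"]) != 0:
            break
        # Hypothetical health check, e.g. a synthetic probe or a log query.
        if not validate():
            break
        done.append(node)
    return done


def dry_run(args: Sequence[str]) -> int:
    """Print the command instead of executing it."""
    print("kubectl", *args)
    return 0


# Dry run with hypothetical node names; no cluster is contacted.
drain_and_validate(["old-node-1", "old-node-2"], validate=lambda: True, run=dry_run)
```

Stopping on the first failure keeps the remaining old nodes in service, so capacity is never removed faster than the new node groups have been shown to absorb it.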

Resolved Node and Network Failures

The team upgraded essential EKS components, including the container runtime, VPC-CNI, kube-proxy, and CoreDNS. Additionally, AWS’s node auto-repair agent was deployed to enhance node self-healing. These upgrades eliminated frequent containerd crashes and fixed the underlying issues causing node disconnections and NotReady states.

Upgraded CSI Drivers and Secured Secrets Management

Outdated storage and secrets drivers were a major source of instability. With specialist guidance from CloudKeeper, the Foundation AI team upgraded:

The aws-efs-csi-driver, to fix failed volume attachments and EFS socket errors.
The secrets-store-csi-driver, to align with Kubernetes v1.29’s volume mount requirements.

These updates stabilized secret injection and resolved crash loops, ensuring Airflow and other workloads ran smoothly.

Improved Observability and Resilience

Backed by deep expertise in container orchestration and telemetry, the CloudKeeper team supported the enablement of end-to-end observability using Datadog, CloudWatch alarms, and Route 53 health checks. The EKS Log Collector was deployed to collect diagnostic data, while memory settings were fine-tuned to reduce OOMKill incidents, boosting workload resilience and uptime.
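As an illustration of how OOMKill incidents can be identified before tuning memory settings, the sketch below scans pod status for containers terminated with the reason OOMKilled. It assumes pod objects in the shape returned by `kubectl get pods -o json`; the function name and sample data are hypothetical.

```python
def oom_killed_containers(pod_list: dict) -> list:
    """Return (pod, container) pairs whose current or last state is OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # Check the current state and the previous one, since a restarted
            # container reports the kill under lastState.
            for key in ("state", "lastState"):
                terminated = status.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, status["name"]))
                    break
    return hits


# Hypothetical sample data, trimmed to the fields the function reads.
sample_pods = {
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"containerStatuses": [
             {"name": "task", "state": {"running": {}},
              "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"name": "web-0"},
         "status": {"containerStatuses": [
             {"name": "app", "state": {"running": {}}, "lastState": {}}]}},
    ]
}

print(oom_killed_containers(sample_pods))  # [('worker-0', 'task')]
```

Containers that show up repeatedly in this list are natural candidates for higher memory requests and limits.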

Post-Upgrade Impact

The EKS upgrade restored production stability and confidence across the board.

Airflow DAGs executed without misfires
EFS volumes mounted without delay
Secrets were injected reliably on the first attempt

Most importantly, the Foundation AI team transitioned from constant firefighting to operating with trust, predictability, and peace of mind in their infrastructure.