How Foundation AI rearchitected and streamlined their Kubernetes environment with an upgrade to EKS v1.30

Foundation
Industry: Gen AI
Headquarters: Irvine, California
Founded in: 2019
Company Size: 51 - 200 Employees
Customer Speaks
CloudKeeper really helps us keep control of our cloud costs in a simple and concise format. It is easy to track down cost increases and anomalies across all of our cloud accounts.

Damien Camilleri

Chief Technology Officer

About
Overview
Foundation AI is an AI-powered automation company helping enterprises streamline document-intensive workflows across legal, healthcare, and financial services. Their platform leverages machine learning and intelligent document processing to transform unstructured data into actionable insights.

Values delivered

  • Eliminated recurring node failures and NotReady states.
  • Upgraded 10+ components including EKS add-ons and CSI drivers.
  • Saved 200+ person-hours and enabled seamless upgrades.
  • Reduced EFS mount failures and pod crash loops.
  • Restored production stability and reliable infra operations.

Challenges

Recurring Production Outages

Foundation AI faced ongoing disruptions with EKS v1.29, as nodes would frequently become unresponsive and enter a NotReady state, severely affecting business-critical workloads, including Apache Airflow DAGs.

Container Runtime and Network Failures

Container crashes triggered failures in the aws-node pod responsible for VPC networking, severing communication with the control plane and taking down entire nodes.

Outdated CSI Drivers

EFS mounts were failing due to an outdated aws-efs-csi-driver, and secrets store mounts failed due to incompatibility between the deployed Secrets Store CSI driver and Kubernetes v1.29’s VOLUME_MOUNT_GROUP enforcement.

Add-on Drift and Inconsistent Configuration

Kube-proxy and VPC-CNI plugins were misaligned, causing unpredictable behavior. Without uniform configuration management, the cluster became increasingly unstable.
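As an illustration of what a uniform configuration check could look like, the sketch below compares the add-on versions a team intends to run against what is actually deployed. The function name, the dictionary layout, and the version strings are all hypothetical placeholders; nothing here reflects Foundation AI's actual tooling or real release tags.

```python
from typing import Dict, List, Tuple


def find_addon_drift(
    pinned: Dict[str, str], deployed: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (addon, pinned_version, deployed_version) for every mismatch."""
    drift = []
    for name, want in sorted(pinned.items()):
        # An add-on that is pinned but not deployed counts as drift too.
        have = deployed.get(name, "<missing>")
        if have != want:
            drift.append((name, want, have))
    return drift


# Placeholder versions for illustration only; these are not real release tags.
pinned = {"kube-proxy": "example-1", "vpc-cni": "example-1", "coredns": "example-1"}
deployed = {"kube-proxy": "example-1", "vpc-cni": "example-0"}

for name, want, have in find_addon_drift(pinned, deployed):
    print(f"{name}: pinned {want}, deployed {have}")
```

In practice the `deployed` mapping would be filled from whatever source of truth the cluster exposes, and the check would run on a schedule so drift is caught before it causes instability.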

Opaque Custom AMIs

Custom-built AMIs lacked transparency and versioning, making troubleshooting difficult and consistent patching nearly impossible.

Solution

To address the persistent instability and operational challenges in their Amazon EKS environment, Foundation AI partnered with CloudKeeper for a comprehensive Kubernetes stabilization initiative. Given the complexity of the cluster, an in-place upgrade to Amazon EKS v1.30 was chosen.

Flawless In-Place EKS Upgrade

With minimal disruption as a priority, CloudKeeper supported a zero-downtime control plane upgrade using synthetic probes and live traffic validation. New v1.30 node groups were custom-built to mirror existing taints, labels, and IAM roles. A drain-and-validate process was adopted, gradually decommissioning old nodes while continuously tracking logs using Fluent Bit and OpenSearch for anomalies. Despite the cluster’s deeply embedded dependencies, the upgrade was executed seamlessly.
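The drain-and-validate process described above can be sketched as a simple loop. This is a minimal illustration, not CloudKeeper's actual tooling: the `drain` and `validate` callables are hypothetical hooks standing in for steps such as cordoning a node and evicting its pods, then running a synthetic probe and checking logs for anomalies.

```python
from typing import Callable, Iterable, List


def drain_and_validate(
    old_nodes: Iterable[str],
    drain: Callable[[str], None],
    validate: Callable[[], bool],
) -> List[str]:
    """Drain nodes one at a time and stop at the first failed validation.

    Returns the names of nodes whose drain was followed by a passing check.
    """
    drained = []
    for node in old_nodes:
        drain(node)          # hypothetical hook: cordon the node, evict its pods
        if not validate():   # hypothetical hook: synthetic probe, log anomaly check
            break            # halt the rollout so an operator can investigate
        drained.append(node)
    return drained


# Stub hooks for demonstration: the second validation fails on purpose.
log: List[str] = []
results = iter([True, False])
done = drain_and_validate(["node-a", "node-b", "node-c"], log.append, lambda: next(results))
print(done)
print(log)
```

Draining one node at a time and validating before moving on limits the blast radius: if a check fails, the remaining old nodes are still serving traffic.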

Resolved Node and Network Failures

The team upgraded essential EKS components, including the container runtime, VPC-CNI, kube-proxy, and CoreDNS. Additionally, AWS’s node auto-repair agent was deployed to enhance node self-healing. These upgrades eliminated frequent containerd crashes and fixed the underlying issues causing node disconnections and NotReady states.
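For readers who want to spot NotReady nodes programmatically, here is a minimal sketch. It assumes the input is a dictionary shaped like the JSON a Kubernetes NodeList returns (for example from `kubectl get nodes -o json`), where each node carries a `Ready` condition whose status is "True", "False", or "Unknown". The function name and the sample data are illustrative only.

```python
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions") or []
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # "False" and "Unknown" both mean the node is not usable for scheduling.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


# Illustrative sample data, not taken from a real cluster.
sample = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}
print(not_ready_nodes(sample))
```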

Upgraded CSI Drivers and Secured Secrets Management

Outdated storage and secret drivers were a major source of instability. With specialist guidance from CloudKeeper, the Foundation AI team upgraded:

  • The aws-efs-csi-driver, to fix failed volume attachments and EFS socket errors.
  • The secrets-store-csi-driver, to align with Kubernetes v1.29’s volume mount requirements.

These updates stabilized secret injection and resolved crash loops, ensuring Airflow and other workloads ran smoothly.

Improved Observability and Resilience

Drawing on its expertise in container orchestration and telemetry, the CloudKeeper team helped enable end-to-end observability using Datadog, CloudWatch alarms, and Route 53 health checks. The EKS Log Collector was deployed to collect diagnostic data, and memory settings were tuned to reduce OOMKill incidents, improving workload resilience and uptime.
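One way to find which workloads need their memory settings revisited is to look for containers whose last termination reason was `OOMKilled`. The sketch below assumes a dictionary shaped like a Kubernetes PodList in JSON form (for example from `kubectl get pods -o json`); the function name and sample pods are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def oom_killed_containers(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod_name, container_name) pairs last terminated with OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            # lastState.terminated is only present if the container restarted.
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod["metadata"]["name"], cs["name"]))
    return hits


# Illustrative sample data, not taken from a real cluster.
pods = {
    "items": [
        {"metadata": {"name": "worker-1"},
         "status": {"containerStatuses": [
             {"name": "task", "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"name": "worker-2"},
         "status": {"containerStatuses": [{"name": "task", "lastState": {}}]}},
    ]
}
print(oom_killed_containers(pods))
```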
 

Post-Upgrade Impact
  • Airflow DAGs executed without misfires
  • EFS volumes mounted without delay
  • Secrets were injected reliably on first attempt

Most importantly, the Foundation AI team transitioned from constant firefighting to operating with trust, predictability, and peace of mind in their infrastructure.
