Have you ever been deploying a database on Amazon EKS when your pods suddenly crashed with `Too many open files` errors, followed by AWS CNI plugin failures that broke your networking? Here are the steps you can follow to solve it.
Running workloads on Amazon Elastic Kubernetes Service (EKS) is generally smooth, but like any distributed system, unexpected problems arise. Recently, I encountered a tricky issue while deploying a database on an EKS cluster. The deployment kept failing with ulimit (file descriptor) errors and was soon followed by AWS CNI plugin failures, which broke pod networking.
While spinning up a database pod, the logs showed `Too many open files` (file descriptor exhaustion) errors.
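The exact log lines depend on the database engine, so they are not reproduced here, but you can check the limit the container is actually running under directly (the pod name below is illustrative):

```bash
# Inspect the open-file limits the database container actually sees
# ("my-database-0" is an illustrative pod name)
kubectl exec -it my-database-0 -- sh -c 'ulimit -n; grep "open files" /proc/self/limits'
```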
Initially, this appeared to be a straightforward resource configuration issue. However, soon after, the AWS CNI plugin pods (`aws-node`) began restarting repeatedly with errors of their own.
This created a cascading failure: not only was the database pod crashing, but other workloads also started failing due to broken pod networking.
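When this happens, the quickest way to gauge the blast radius is to inspect the `aws-node` DaemonSet directly. These are standard kubectl commands; `k8s-app=aws-node` is the label the AWS VPC CNI DaemonSet ships with:

```bash
# List the CNI pods and watch their restart counts
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
# Pull recent logs and events from the failing pods
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50
kubectl describe pod -n kube-system -l k8s-app=aws-node
```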
The fix broke down into three parts: updating the node Launch Template, rolling the node group, and restarting the CNI DaemonSet.
First, I modified the EC2 Launch Template to apply proper ulimit values via user data:
```bash
#!/bin/bash
# Raise the kernel-wide file handle ceiling
echo "fs.file-max = 2097152" >> /etc/sysctl.conf
sysctl -p   # apply the sysctl change immediately
# Raise per-process open-file limits for all users
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
ulimit -n 65535   # also raise the limit for the current boot shell
```
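As a sketch of how this script gets onto the nodes, the user data can be published as a new Launch Template version. The template name and script filename below are illustrative, and depending on how your node group consumes user data it may need to be wrapped in MIME multi-part format:

```bash
# Publish the updated user data as a new Launch Template version
# ("eks-db-nodes" and "bootstrap-ulimits.sh" are illustrative names)
USERDATA=$(base64 -w0 bootstrap-ulimits.sh)
aws ec2 create-launch-template-version \
  --launch-template-name eks-db-nodes \
  --source-version 1 \
  --launch-template-data "{\"UserData\":\"${USERDATA}\"}"
```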
Next, I performed a rolling replacement of nodes in the Amazon EKS Node Group so that new instances inherited the updated limits.
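For a managed node group, one way to do this is to point the group at the new Launch Template version, which triggers the standard rolling update (cluster, node group, and template names are illustrative):

```bash
# Roll the managed node group onto the new Launch Template version
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name db-nodes \
  --launch-template name=eks-db-nodes,version=2
```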
Finally, I restarted the aws-node DaemonSet:
```bash
kubectl rollout restart ds aws-node -n kube-system
```
This reloaded the CNI plugin with the corrected host limits.
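To confirm the recovery, it's worth checking that the DaemonSet rollout completed and the CNI pods are no longer accumulating restarts (standard kubectl checks):

```bash
# Wait for the restarted DaemonSet to become fully available
kubectl rollout status ds aws-node -n kube-system
# Confirm the CNI pods are Running with stable restart counts
kubectl get pods -n kube-system -l k8s-app=aws-node
```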
What looked like a database-specific error turned out to be a deeper issue with EKS worker node system limits. By adjusting the ulimit settings in the Launch Template and restarting the AWS CNI plugin, I restored stability to the cluster.
I completely agree with your initial analysis; however, I believe most ops teams neglect to check the file descriptor limits on their nodes. If you are running network-heavy or stateful workloads, raising the ulimit should be on the checklist from the start, precisely to avoid the error chain described in this article.