Have you ever been deploying a database on Amazon EKS when your pods suddenly crashed with `Too many open files` errors, followed by AWS CNI plugin failures that broke your networking? Here are the steps you can follow to solve it.
Running workloads on Amazon Elastic Kubernetes Service (EKS) is generally smooth, but like any distributed system, unexpected problems arise. Recently, I encountered a tricky issue while deploying a database on an EKS cluster. The deployment kept failing with ulimit (file descriptor) errors and was soon followed by AWS CNI plugin failures, which broke pod networking.
While spinning up a database pod, the logs showed `Too many open files` (file descriptor exhaustion) errors.
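The exact log lines depend on the database engine, so they are not reproduced here, but you can check the limit the container is actually running under directly (the pod name below is illustrative):

```bash
# Inspect the open-file limits the database container actually sees
# ("my-database-0" is an illustrative pod name)
kubectl exec -it my-database-0 -- sh -c 'ulimit -n; grep "open files" /proc/self/limits'
```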
Initially, this appeared to be a straightforward resource configuration issue. However, soon after, the AWS CNI plugin pods (`aws-node`) began restarting repeatedly with errors of their own.
This created a cascading failure: not only was the database pod crashing, but other workloads also started failing due to broken pod networking.
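When this happens, the quickest way to gauge the blast radius is to inspect the `aws-node` DaemonSet directly. These are standard kubectl commands; `k8s-app=aws-node` is the label the AWS VPC CNI DaemonSet ships with:

```bash
# List the CNI pods and watch their restart counts
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
# Pull recent logs and events from the failing pods
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50
kubectl describe pod -n kube-system -l k8s-app=aws-node
```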
The fix broke down into three parts: updating the node Launch Template, rolling the node group, and restarting the CNI DaemonSet.
First, I modified the EC2 Launch Template to apply proper ulimit values via user data:
```bash
#!/bin/bash
# Raise the kernel-wide file handle ceiling
echo "fs.file-max = 2097152" >> /etc/sysctl.conf
sysctl -p   # apply the sysctl change immediately
# Raise per-process open-file limits for all users
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
ulimit -n 65535   # also raise the limit for the current boot shell
```
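As a sketch of how this script gets onto the nodes, the user data can be published as a new Launch Template version. The template name and script filename below are illustrative, and depending on how your node group consumes user data it may need to be wrapped in MIME multi-part format:

```bash
# Publish the updated user data as a new Launch Template version
# ("eks-db-nodes" and "bootstrap-ulimits.sh" are illustrative names)
USERDATA=$(base64 -w0 bootstrap-ulimits.sh)
aws ec2 create-launch-template-version \
  --launch-template-name eks-db-nodes \
  --source-version 1 \
  --launch-template-data "{\"UserData\":\"${USERDATA}\"}"
```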
Next, I performed a rolling replacement of nodes in the Amazon EKS Node Group so that new instances inherited the updated limits.
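For a managed node group, one way to do this is to point the group at the new Launch Template version, which triggers the standard rolling update (cluster, node group, and template names are illustrative):

```bash
# Roll the managed node group onto the new Launch Template version
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name db-nodes \
  --launch-template name=eks-db-nodes,version=2
```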
Finally, I restarted the aws-node DaemonSet:
```bash
kubectl rollout restart ds aws-node -n kube-system
```
This reloaded the CNI plugin with the corrected host limits.
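To confirm the recovery, it's worth checking that the DaemonSet rollout completed and the CNI pods are no longer accumulating restarts (standard kubectl checks):

```bash
# Wait for the restarted DaemonSet to become fully available
kubectl rollout status ds aws-node -n kube-system
# Confirm the CNI pods are Running with stable restart counts
kubectl get pods -n kube-system -l k8s-app=aws-node
```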
What looked like a database-specific error turned out to be a deeper issue with EKS worker node system limits. By adjusting the ulimit settings in the Launch Template and restarting the AWS CNI plugin, I restored stability to the cluster.
I completely agree with your initial analysis; however, I believe most ops teams neglect to check the file descriptor limits on their nodes. If you are running network-heavy or stateful workloads, raising the ulimit should be on the checklist from the start, precisely to avoid the error chain described in this article.