Senior DevOps Engineer
Aamir has hands-on experience across AWS, Kubernetes, Terraform, Docker, and Python, with a strong foundation in cloud infrastructure, automation, and container orchestration.
A recent engagement with a client revealed a critical issue in their Amazon Elastic Kubernetes Service (EKS) cluster. Their GPU-enabled nodes (g4dn.xlarge instances) had CPU and memory to spare, yet new pods were failing to schedule onto them. The error message they encountered was:
"pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 6 max node group size reached."
This post walks through our troubleshooting approach, key discoveries, and the solution that resolved the problem.
The primary issue was that each pod was requesting a full GPU, preventing multiple pods from running on a single node. Since the client needed multiple pods to share a GPU, we implemented GPU time-slicing using the NVIDIA device plugin.
Understanding GPU Time-Slicing
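With time-slicing, the NVIDIA device plugin advertises each physical GPU as multiple schedulable nvidia.com/gpu resources, so several pods can be bound to the same card and share it in time rather than each claiming it exclusively. Once the configuration described in the steps below is in place, the node's allocatable GPU count reflects the replica setting. As a rough illustration (the node name and expected output are placeholders, not taken from the client's cluster):

```bash
# Inspect the GPU node's advertised capacity after time-slicing is enabled.
# With 4 replicas configured on a single-GPU g4dn.xlarge node, this should
# report "nvidia.com/gpu: 4" instead of "nvidia.com/gpu: 1".
kubectl describe node <gpu-node-name> | grep -A1 "nvidia.com/gpu"
```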
Step 1: Uninstall the Existing NVIDIA Device Plugin
We first removed the existing plugin so that it would not conflict with the new time-slicing configuration.
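The exact removal command depends on how the plugin was originally installed. A minimal sketch covering the two common cases (the DaemonSet name matches NVIDIA's static manifest; the Helm release name and namespace are assumptions, so check with `helm list -A` first):

```bash
# If the plugin was applied as a static manifest, delete its DaemonSet directly.
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system --ignore-not-found

# If it was installed via Helm, uninstall the release instead
# (release name "nvdp" is an assumption).
helm uninstall nvdp -n kube-system
```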
Step 2: Configure GPU Sharing via ConfigMap
We then defined a ConfigMap allowing up to 4 pods to share each GPU:
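A minimal sketch of such a ConfigMap, following the NVIDIA device plugin's time-slicing configuration format. The ConfigMap name, namespace, and the config key ("any") are assumptions; whatever you choose must match the values passed to Helm in the next step:

```bash
# Create a ConfigMap holding the device plugin's time-slicing configuration.
# "replicas: 4" tells the plugin to advertise each physical GPU as 4
# schedulable nvidia.com/gpu resources.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
EOF
```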
Step 3: Deploy the Updated NVIDIA Device Plugin via Helm
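A sketch of the Helm installation, pointing the chart at the ConfigMap created above. The release name and chart version are assumptions; `config.name` is the chart value that references an existing ConfigMap in recent versions of the chart:

```bash
# Add the NVIDIA device plugin Helm repository and install/upgrade the plugin,
# referencing the time-slicing ConfigMap created in Step 2.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.14.5 \
  --set config.name=nvidia-device-plugin-config
```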
Step 4: Apply Node Labels and Taints
To ensure only GPU workloads are scheduled on these nodes:
Label the Node:
(Required for newer NVIDIA plugin versions.)
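A sketch of the labeling command. Newer versions of the device plugin select a named configuration per node via the `nvidia.com/device-plugin.config` label; the node name is a placeholder and the config key ("any") must match the key used in the ConfigMap above:

```bash
# Point the device plugin on this node at the "any" entry of the ConfigMap.
kubectl label node <gpu-node-name> nvidia.com/device-plugin.config=any --overwrite
```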
Taint the Node (Recommended):
(Prevents non-GPU workloads from using GPU nodes.)
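A sketch of the taint, using the conventional `nvidia.com/gpu` key, which the NVIDIA device plugin's own DaemonSet tolerates by default (the node name is a placeholder):

```bash
# Repel pods that do not explicitly tolerate GPU nodes.
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=true:NoSchedule
```

With this taint in place, GPU workloads need a matching toleration in their pod spec; an example is shown in the Key Takeaways section below.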
Beyond the time-slicing fix itself, we shared two additional recommendations with the client.
1. Use EKS-Optimized AMIs for GPU Workloads
The client was using a custom AMI based on Amazon Linux 2. We recommended:
Switching to the EKS-optimized Amazon Linux 2023 NVIDIA AMI, which ships with the NVIDIA drivers pre-installed (a sketch of creating a node group with this AMI follows below).
Reverse-engineering that AMI as the basis for custom builds, if further customization is needed.
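As a rough illustration of the first recommendation, a managed node group can be created with the EKS-optimized NVIDIA AMI via the AWS CLI. Cluster name, role ARN, and subnet IDs are placeholders, and `AL2023_x86_64_NVIDIA` is what we understand to be the AMI type for the Amazon Linux 2023 NVIDIA variant:

```bash
# Create a GPU node group backed by the EKS-optimized Amazon Linux 2023 NVIDIA AMI.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-nodes \
  --ami-type AL2023_x86_64_NVIDIA \
  --instance-types g4dn.xlarge \
  --scaling-config minSize=1,maxSize=6,desiredSize=1 \
  --node-role arn:aws:iam::123456789012:role/eksNodeRole \
  --subnets subnet-aaaa1111 subnet-bbbb2222
```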
2. Right-Sizing GPU Instances
For compute-heavy workloads, consider larger instances (e.g., g4dn.2xlarge).
For lightweight workloads, time-slicing helps reduce costs by maximizing GPU utilization.
Key Takeaways
By enabling GPU time-slicing, we allowed multiple pods to share a single GPU, resolving the scheduling bottleneck without growing a node group that had already hit its maximum size.
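To make the result concrete, here is a sketch of what a GPU workload can look like after these changes: it still requests `nvidia.com/gpu: 1`, but with 4 time-sliced replicas per card, up to four such pods can land on a single g4dn.xlarge node. The pod name and container image are placeholders:

```bash
# Example workload that requests one time-sliced GPU replica and tolerates
# the GPU node taint applied earlier.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-demo
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```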