A recent engagement with a client revealed a critical issue in their Amazon Elastic Kubernetes Service (EKS) cluster: new pods were failing to schedule onto their GPU-enabled nodes (g4dn.xlarge instances) despite the nodes having available CPU and memory, and the team initially suspected the nodes were running out of pod IP addresses. The error message they encountered was:
"pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 6 max node group size reached."
This post walks through our troubleshooting approach, key discoveries, and the solution that resolved the problem.
Problem Analysis
Client’s Challenges
- The Amazon EKS cluster was hosting GPU workloads on g4dn.xlarge instances.
- Only one pod was running per GPU node, despite available resources.
- New pods failed to schedule due to node affinity and scaling constraints.
Initial Investigation Findings
- Node Resource Allocation: Each g4dn.xlarge instance provides 1 GPU, 4 vCPUs, and 16 GiB of memory. The NVIDIA device plugin advertises the GPU as a single nvidia.com/gpu resource, so Kubernetes allocates the entire GPU to one pod unless sharing is explicitly configured.
- Pod Affinity & Selector Rules: Anti-affinity and node selector configurations were correct and non-conflicting.
Potential Root Causes
- GPU Resource Requests: Pods requesting a full GPU restricted scheduling to one pod per node.
- ENI Limitations: IP exhaustion on the nodes' ENIs was initially suspected but ruled out after further analysis.
Root Cause: GPU Resource Allocation
The primary issue was that each pod was requesting a full GPU, preventing multiple pods from running on a single node. Since the client needed multiple pods to share a GPU, we implemented GPU time-slicing using the NVIDIA device plugin.
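To make the constraint concrete, here is a minimal sketch of the kind of pod spec involved (the pod name and image are placeholders, not the client's actual workload). With the default device plugin configuration, the single nvidia.com/gpu request below claims the node's only GPU, so a second such pod cannot be scheduled on the same g4dn.xlarge:

```yaml
# Illustrative only: a pod requesting one GPU. On a g4dn.xlarge (1 physical GPU),
# the default NVIDIA device plugin advertises nvidia.com/gpu: 1, so only one
# pod with this request fits per node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                       # placeholder name
spec:
  containers:
    - name: inference
      image: my-registry/inference:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                # claims an entire GPU; fractional requests are not allowed
```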
Understanding GPU Time-Slicing
- Enables multiple pods to share a single GPU by time-division multiplexing.
- No memory or fault isolation is provided; a crashing pod can affect others sharing the GPU.
- Ideal for lightweight workloads that don’t require dedicated GPU resources.
Solution: Implementing GPU Time-Slicing
Step 1: Uninstall the Existing NVIDIA Device Plugin
Ensured the previous plugin was removed before applying the new configuration.
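The exact removal command depends on how the plugin was originally installed. Assuming it was deployed with Helm under the release name nvdp (a common convention, not confirmed from the client's setup):

```bash
# If the plugin was installed via Helm (release name and namespace are assumptions):
helm uninstall nvdp -n kube-system

# If it was installed from the static manifest instead, delete the DaemonSet:
# kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system
```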
Step 2: Configure GPU Sharing via ConfigMap
Defined a ConfigMap to allow 4 pods per GPU:
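The sketch below follows the time-slicing configuration format documented for the NVIDIA device plugin; the ConfigMap name, namespace, and data key are placeholders. Setting replicas: 4 makes the plugin advertise four schedulable nvidia.com/gpu resources per physical GPU:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # referenced by the Helm install in Step 3
  namespace: kube-system
data:
  # The data key ("any") is arbitrary; the plugin treats each key as a named config.
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4    # each physical GPU is exposed as 4 schedulable GPUs
```

Apply it with kubectl apply -f before deploying the plugin.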

Step 3: Deploy the Updated NVIDIA Device Plugin via Helm
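A sketch of the Helm deployment, pointing the chart at the ConfigMap above through its config.name value (the release name and namespace are illustrative and should match wherever the ConfigMap was created):

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install (or upgrade) the device plugin and point it at the time-slicing ConfigMap.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set config.name=nvidia-device-plugin-config

# Once the plugin pods restart, the node should advertise 4 nvidia.com/gpu resources:
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"
```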

Step 4: Apply Node Labels and Taints
To ensure only GPU workloads are scheduled on these nodes:
Label the Node:
(Required for newer NVIDIA plugin versions.)
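A sketch of the label command; the node name is a placeholder:

```bash
# Newer device plugin charts use this label in their node affinity rules.
kubectl label node <gpu-node-name> nvidia.com/gpu.present=true
```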
Taint the Node (Recommended):
(Prevents non-GPU workloads from using GPU nodes.)
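A sketch of the taint command; the key and value shown are a common convention rather than a required name, and the node name is a placeholder:

```bash
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=true:NoSchedule
```

GPU workloads then need a matching toleration in their pod spec so they can still land on the tainted nodes:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```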
Best Practices & Recommendations
1. Use EKS-Optimized AMIs for GPU Workloads
The client was using a custom AMI based on Amazon Linux 2. We recommended:
- Switching to the EKS-optimized Amazon Linux 2023 NVIDIA AMI, which is pre-configured with the NVIDIA drivers (an AMI lookup example follows this list).
- Reverse-engineering that AMI for custom builds, if needed.
2. Right-Sizing GPU Instances
- For compute-heavy workloads, consider larger instances (e.g., g4dn.2xlarge).
- For lightweight workloads, time-slicing helps reduce costs by maximizing GPU utilization.
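For reference, the current recommended AMI ID for the EKS-optimized Amazon Linux 2023 NVIDIA image can be looked up in SSM Parameter Store; the Kubernetes version and region below are examples, not the client's actual values:

```bash
# Look up the recommended EKS-optimized Amazon Linux 2023 NVIDIA AMI
# (adjust the Kubernetes version and region to match the cluster).
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.30/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text
```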
Key Takeaways
By enabling GPU time-slicing, we allowed multiple pods to share a single GPU, resolving the scheduling bottleneck. Key insights:
- Default GPU requests assign one full GPU per pod; configure time-slicing if pods need to share a GPU.
- Always label GPU nodes (nvidia.com/gpu.present=true) for compatibility with newer NVIDIA plugins.
- Leverage EKS-optimized AMIs for reliable GPU support.