Troubleshooting GPU Scheduling Issues in Amazon EKS

A recent engagement with a client revealed a critical issue in their Amazon Elastic Kubernetes Service (EKS) cluster. Their GPU-enabled nodes (G4dn.xlarge instances) were failing to assign IP addresses to new pods, despite having available CPU and memory. The error message they encountered was:

"pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 6 max node group size reached."

This post walks through our troubleshooting approach, key discoveries, and the solution that resolved the problem.

Problem Analysis

Client’s Challenges

  • The Amazon EKS cluster was hosting GPU workloads on G4dn.xlarge instances.
  • Only one pod was running per GPU node, despite available resources.
  • New pods failed to schedule due to node affinity and scaling constraints.

Initial Investigation Findings

  • Node Resource Allocation: Each G4dn.xlarge instance provides 1 GPU, 4 vCPUs, and 16GB memory. Kubernetes was allocating an entire GPU per pod unless explicitly configured otherwise.
  • Pod Affinity & Selector Rules: Anti-affinity and node selector configurations were correct and non-conflicting.

Potential Root Causes

  • GPU Resource Requests: Each pod requested a full GPU, which restricted scheduling to one pod per node.
  • ENI Limitations: Initially considered but ruled out after further analysis.

Root Cause: GPU Resource Allocation

The primary issue was that each pod was requesting a full GPU, preventing multiple pods from running on a single node. Since the client needed multiple pods to share a GPU, we implemented GPU time-slicing using the NVIDIA device plugin.
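For context, the behavior can be reproduced with a minimal pod spec like the sketch below (illustrative only; the pod name and image are hypothetical). Because the container requests nvidia.com/gpu: 1, the scheduler reserves the node's single GPU for that pod, so no other GPU pod can land on the same G4dn.xlarge node.

    # Illustrative sketch: a pod that requests one whole GPU (name and image are hypothetical)
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-inference
    spec:
      containers:
        - name: app
          image: registry.example.com/inference:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # reserves the entire GPU, so only one such pod fits per node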

Understanding GPU Time-Slicing 

  • Enables multiple pods to share a single GPU by time-division multiplexing.
  • No memory isolation—a crashing pod may affect others sharing the GPU.
  • Ideal for lightweight workloads that don’t require dedicated GPU resources.

Solution: Implementing GPU Time-Slicing

Step 1: Uninstall the Existing NVIDIA Device Plugin
Ensured the previous plugin was removed before applying the new configuration.
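A minimal sketch of this step, assuming the plugin had been installed either as a Helm release named nvdp or as the upstream static DaemonSet in kube-system (adjust the release name and namespace to match your cluster):

    # If the plugin was installed via Helm (release name and namespace assumed):
    helm uninstall nvdp -n kube-system

    # If it was applied as the upstream static manifest instead (DaemonSet name assumed):
    kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system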

Step 2: Configure GPU Sharing via ConfigMap
Defined a ConfigMap to allow 4 pods per GPU:
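A minimal sketch of that ConfigMap, using the time-slicing format documented for the NVIDIA device plugin (the ConfigMap name, namespace, and data key are assumptions; replicas: 4 is what exposes four schedulable GPU slots per physical GPU):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-device-plugin-config   # referenced by the Helm install in Step 3
      namespace: kube-system
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

With replicas set to 4, each G4dn.xlarge node advertises nvidia.com/gpu: 4, so up to four pods that each request one GPU can share the card.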


Step 3: Deploy the Updated NVIDIA Device Plugin via Helm
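A sketch of the Helm commands, assuming the chart comes from NVIDIA's public nvdp repository and is pointed at the ConfigMap from Step 2 (release name and namespace are examples; pin a chart version you have validated):

    # Add the NVIDIA device plugin chart repository
    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update

    # Install or upgrade the plugin and reference the time-slicing ConfigMap
    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace kube-system \
      --set config.name=nvidia-device-plugin-config

Once the plugin pods restart, running kubectl describe node against a GPU node should report nvidia.com/gpu: 4 under allocatable resources.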


Step 4: Apply Node Labels and Taints

To ensure only GPU workloads are scheduled on these nodes:

Label the Node:
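A sketch with the node name as a placeholder; nvidia.com/gpu.present=true is the label newer plugin versions use to select GPU nodes:

    # Label the GPU node so the device plugin DaemonSet schedules onto it
    kubectl label node <gpu-node-name> nvidia.com/gpu.present=true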

(Required for newer NVIDIA plugin versions.)

Taint the Node (Recommended):
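A sketch using an example taint key and value; any key works as long as GPU workloads carry a matching toleration (the upstream NVIDIA device plugin DaemonSet already tolerates the nvidia.com/gpu taint key):

    # Keep non-GPU pods off the GPU node unless they tolerate this taint
    kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule

GPU workloads then need a toleration for that key in their pod spec, while everything else is scheduled elsewhere.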

(Prevents non-GPU workloads from using GPU nodes.)

Best Practices & Recommendations

1. Use EKS-Optimized AMIs for GPU Workloads 

The client was using a custom AMI (Amazon Linux 2). We recommended:

  • Switching to the Amazon Linux 2023 NVIDIA AMI, which ships with the NVIDIA drivers pre-configured (see the lookup sketch below).
  • Reverse-engineering that AMI as the base for custom builds, if needed.
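As a sketch, the current ID of the EKS-optimized Amazon Linux 2023 NVIDIA AMI can be looked up from its public SSM parameter (the parameter path, Kubernetes version, and region below are assumptions to verify against the AWS documentation for your cluster):

    # Look up the EKS-optimized AL2023 NVIDIA AMI for a given Kubernetes version and region
    aws ssm get-parameter \
      --name /aws/service/eks/optimized-ami/1.30/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
      --region us-east-1 \
      --query 'Parameter.Value' \
      --output text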

2. Right-Sizing GPU Instances

  • For compute-heavy workloads, consider larger instances (e.g., G4dn.2xlarge).
  • For lightweight workloads, time-slicing helps reduce costs by maximizing GPU utilization.

Key Takeaways

By enabling GPU time-slicing, we allowed multiple pods to share a single GPU, resolving the scheduling bottleneck. Key insights:

  • Default GPU requests assign 1 GPU per pod—adjust via time-slicing if sharing is needed.
  • Always label GPU nodes (nvidia.com/gpu.present=true) for compatibility with newer NVIDIA plugins.
  • Leverage EKS-optimized AMIs for reliable GPU support.

Meet the Author
  • Aamir Shahab
    Senior DevOps Engineer

    Aamir has hands-on experience across AWS, Kubernetes, Terraform, Docker, and Python, with a strong foundation in cloud infrastructure, automation, and container orchestration.
