Troubleshooting GPU Scheduling Issues in Amazon EKS

A recent engagement with a client revealed a critical issue in their Amazon Elastic Kubernetes Service (EKS) cluster. Their GPU-enabled nodes (G4dn.xlarge instances) were failing to assign IP addresses to new pods, despite having available CPU and memory. The error message they encountered was:

"pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 6 max node group size reached."

This post walks through our troubleshooting approach, key discoveries, and the solution that resolved the problem.

Problem Analysis

Client’s Challenges

  • The Amazon EKS cluster was hosting GPU workloads on G4dn.xlarge instances.
  • Only one pod was running per GPU node, despite available resources.
  • New pods failed to schedule due to node affinity and scaling constraints.

Initial Investigation Findings

  • Node Resource Allocation: Each G4dn.xlarge instance provides 1 GPU, 4 vCPUs, and 16GB memory. Kubernetes was allocating an entire GPU per pod unless explicitly configured otherwise.
  • Pod Affinity & Selector Rules: Anti-affinity and node selector configurations were correct and non-conflicting.

Potential Root Causes

  • GPU Resource Requests: Each pod requested a full GPU, which restricted scheduling to one pod per node.
  • ENI Limitations: Initially considered but ruled out after further analysis.

Root Cause: GPU Resource Allocation

The primary issue was that each pod was requesting a full GPU, preventing multiple pods from running on a single node. Since the client needed multiple pods to share a GPU, we implemented GPU time-slicing using the NVIDIA device plugin.
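For context, the behavior can be reproduced with a minimal pod spec like the sketch below (illustrative only; the pod name and image are hypothetical). Because the container requests nvidia.com/gpu: 1, the scheduler reserves the node's single GPU for that pod, so no other GPU pod can land on the same G4dn.xlarge node.

    # Illustrative sketch: a pod that requests one whole GPU (name and image are hypothetical)
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-inference
    spec:
      containers:
        - name: app
          image: registry.example.com/inference:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # reserves the entire GPU, so only one such pod fits per node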

Understanding GPU Time-Slicing 

  • Enables multiple pods to share a single GPU by time-division multiplexing.
  • No memory isolation—a crashing pod may affect others sharing the GPU.
  • Ideal for lightweight workloads that don’t require dedicated GPU resources.

Solution: Implementing GPU Time-Slicing

Step 1: Uninstall the Existing NVIDIA Device Plugin
Ensured the previous plugin was removed before applying the new configuration.
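A minimal sketch of this step, assuming the plugin had been installed either as a Helm release named nvdp or as the upstream static DaemonSet in kube-system (adjust the release name and namespace to match your cluster):

    # If the plugin was installed via Helm (release name and namespace assumed):
    helm uninstall nvdp -n kube-system

    # If it was applied as the upstream static manifest instead (DaemonSet name assumed):
    kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system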

Step 2: Configure GPU Sharing via ConfigMap
Defined a ConfigMap to allow 4 pods per GPU:
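A minimal sketch of that ConfigMap, using the time-slicing format documented for the NVIDIA device plugin (the ConfigMap name, namespace, and data key are assumptions; replicas: 4 is what exposes four schedulable GPU slots per physical GPU):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-device-plugin-config   # referenced by the Helm install in Step 3
      namespace: kube-system
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

With replicas set to 4, each G4dn.xlarge node advertises nvidia.com/gpu: 4, so up to four pods that each request one GPU can share the card.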


Step 3: Deploy the Updated NVIDIA Device Plugin via Helm
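A sketch of the Helm commands, assuming the chart comes from NVIDIA's public nvdp repository and is pointed at the ConfigMap from Step 2 (release name and namespace are examples; pin a chart version you have validated):

    # Add the NVIDIA device plugin chart repository
    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update

    # Install or upgrade the plugin and reference the time-slicing ConfigMap
    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace kube-system \
      --set config.name=nvidia-device-plugin-config

Once the plugin pods restart, running kubectl describe node against a GPU node should report nvidia.com/gpu: 4 under allocatable resources.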


Step 4: Apply Node Labels and Taints

To ensure only GPU workloads are scheduled on these nodes:

Label the Node:
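A sketch with the node name as a placeholder; nvidia.com/gpu.present=true is the label newer plugin versions use to select GPU nodes:

    # Label the GPU node so the device plugin DaemonSet schedules onto it
    kubectl label node <gpu-node-name> nvidia.com/gpu.present=true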

(Required for newer NVIDIA plugin versions.)

Taint the Node (Recommended):
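A sketch using an example taint key and value; any key works as long as GPU workloads carry a matching toleration (the upstream NVIDIA device plugin DaemonSet already tolerates the nvidia.com/gpu taint key):

    # Keep non-GPU pods off the GPU node unless they tolerate this taint
    kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule

GPU workloads then need a toleration for that key in their pod spec, while everything else is scheduled elsewhere.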

(Prevents non-GPU workloads from using GPU nodes.)

Best Practices & Recommendations

1. Use EKS-Optimized AMIs for GPU Workloads 

The client was using a custom AMI (Amazon Linux 2). We recommended:

  • Switching to the Amazon Linux 2023 NVIDIA AMI, which ships with the NVIDIA drivers pre-configured (see the lookup sketch below).
  • Reverse-engineering that AMI as the base for custom builds, if needed.
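As a sketch, the current ID of the EKS-optimized Amazon Linux 2023 NVIDIA AMI can be looked up from its public SSM parameter (the parameter path, Kubernetes version, and region below are assumptions to verify against the AWS documentation for your cluster):

    # Look up the EKS-optimized AL2023 NVIDIA AMI for a given Kubernetes version and region
    aws ssm get-parameter \
      --name /aws/service/eks/optimized-ami/1.30/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
      --region us-east-1 \
      --query 'Parameter.Value' \
      --output text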

2. Right-Sizing GPU Instances

  • For compute-heavy workloads, consider larger instances (e.g., G4dn.2xlarge).
  • For lightweight workloads, time-slicing helps reduce costs by maximizing GPU utilization.

Key Takeaways

By enabling GPU time-slicing, we allowed multiple pods to share a single GPU, resolving the scheduling bottleneck. Key insights:

  • Default GPU requests assign 1 GPU per pod—adjust via time-slicing if sharing is needed.
  • Always label GPU nodes (nvidia.com/gpu.present=true) for compatibility with newer NVIDIA plugins.
  • Leverage EKS-optimized AMIs for reliable GPU support.

Meet the Author
  • Aamir Shahab
    Senior DevOps Engineer

    Aamir has hands-on experience across AWS, Kubernetes, Terraform, Docker, and Python, with a strong foundation in cloud infrastructure, automation, and container orchestration.
