
When you run Kubernetes clusters on AWS, one of the biggest operational challenges is handling unexpected Amazon EC2 instance interruptions, whether caused by Spot reclamation, scheduled maintenance events, or Auto Scaling Group (ASG) scale-ins.
Without proper handling, workloads can be cut off mid-flight, leading to failed requests, degraded performance, or even downtime.

Enter the AWS Node Termination Handler (NTH), an open-source project by AWS that makes AWS EC2 interruptions graceful, predictable, and Kubernetes-native.

What is the AWS Node Termination Handler?

The AWS Node Termination Handler is a Kubernetes component that listens for AWS events like instance terminations or maintenance notices and ensures the node is cordoned and drained before shutdown.

That means:

  • No new Pods get scheduled on the node that’s going down.
  • Existing Pods get time to exit gracefully.
  • The cluster remains stable even during infrastructure churn.

Note: NTH is mainly needed for self-managed node groups (Amazon EC2-based). For Amazon EKS managed node groups, AWS automatically handles some termination behaviors, though NTH can still enhance observability and handling in edge cases.

NTH listens for AWS events such as:

  • Amazon EC2 Spot Instance Termination Notices
  • Scheduled Maintenance Events
  • Instance Rebalance Recommendations
  • ASG Scale-In Terminations
  • Instance State-Change Notifications

How AWS Node Termination Handler Works

NTH can run in two distinct modes, and you must choose only one at a time.

1. IMDS Processor Mode (DaemonSet)

This mode runs as a DaemonSet, with one pod on every node. It polls the Instance Metadata Service (IMDS) for signals like:

  • Spot Interruption Notices
  • Scheduled Maintenance Events
  • Instance Rebalance Recommendations

Pros

  • Lightweight (no extra AWS infra required)
  • Perfect for Spot-heavy or test clusters

Limitations

  • Does not support ASG lifecycle hooks or lifecycle heartbeats
  • Can’t handle Amazon EC2 instance state-change notifications

Installation

Installation script
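A minimal sketch of an IMDS-mode install with Helm, wrapped in a function so nothing runs until you call it. The chart location and value names are the ones published for the NTH Helm chart as far as I know, so check them against the chart version you deploy. The release name and namespace are illustrative choices, not requirements.

```shell
# Sketch: install NTH in IMDS (DaemonSet) mode.
# Release name and namespace are illustrative; adjust to your cluster.
install_nth_imds() {
  helm upgrade --install aws-node-termination-handler \
    oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler \
    --namespace kube-system \
    --set enableSpotInterruptionDraining=true \
    --set enableRebalanceMonitoring=true \
    --set enableScheduledEventDraining=true
}

# Usage (needs helm and access to the cluster):
#   install_nth_imds
```

Each flag maps to one of the IMDS signals listed above, so you can leave out the ones you do not want NTH to act on.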

2. Queue Processor Mode (Deployment)

This mode runs as a centralized Deployment that consumes events routed from Amazon EventBridge into an Amazon SQS queue.
Alongside the Spot and rebalance signals that IMDS mode handles, it processes events that never reach instance metadata:

  • ASG Lifecycle Hooks (EC2_INSTANCE_TERMINATING)
  • Amazon EC2 Instance State-Change Notifications
  • Spot Interruptions & Rebalance Recommendations

Pros

  • Full coverage of all event types
  • Supports lifecycle heartbeats, extending termination time up to 48 hours
  • Ideal for production clusters

Requirements

  • Amazon EventBridge rules
  • SQS queue
  • IAM permissions via IRSA (or Kiam/Kube2iam)
  • ASG lifecycle hook setup

Installation

Queue processor mode installation
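A hedged sketch of a Queue-mode install, again as a function. It assumes the SQS queue, EventBridge rules, and an IAM role for IRSA already exist. The three arguments (queue URL, region, role ARN) are placeholders you supply, and the value names should be verified against your chart version.

```shell
# Sketch: install NTH in Queue Processor (Deployment) mode.
# $1 = SQS queue URL, $2 = AWS region, $3 = IAM role ARN used through IRSA
install_nth_queue() {
  helm upgrade --install aws-node-termination-handler \
    oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler \
    --namespace kube-system \
    --set enableSqsTerminationDraining=true \
    --set queueURL="$1" \
    --set awsRegion="$2" \
    --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$3"
}

# Usage (all three values are hypothetical):
#   install_nth_queue \
#     "https://sqs.us-east-1.amazonaws.com/111122223333/nth-queue" \
#     us-east-1 \
#     "arn:aws:iam::111122223333:role/nth-role"
```

The escaped dots in the annotation key stop Helm from reading eks.amazonaws.com as nested keys.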

Best practice: If enableSqsTerminationDraining=true, do not enable IMDS draining in the same release. Choose one mode only.

Lifecycle Heartbeats - Buying More Time

When ASG scale-in starts and a termination lifecycle hook is configured, the instance moves into the Terminating:Wait state.

It stays there for the hook's heartbeat timeout (3600s by default, though setups often use a few minutes, such as 300s) before it's forcibly terminated.

In Queue mode, NTH can send lifecycle heartbeats (RecordLifecycleActionHeartbeat) that keep the instance in Terminating:Wait, up to 48 hours total or 100 times the heartbeat timeout, whichever is smaller, letting long-running Pods finish cleanly.

Use cases:

  • Batch jobs or ML workloads that need longer drain times
  • Stateful workloads that require controlled rebalance across AZs

Note: Each heartbeat resets the lifecycle hook's timeout, so the heartbeat interval must be shorter than that timeout. If heartbeats stop or draining completes, AWS proceeds with termination.
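To make the timing rule concrete, here is a small sanity check you could run before choosing heartbeat settings. It is only arithmetic and calls no AWS API. The function name and its three arguments are made up for illustration. The cap it applies (48 hours or 100 times the hook timeout, whichever is smaller) follows the Auto Scaling lifecycle hook documentation as I understand it.

```shell
# heartbeat_plan INTERVAL HOOK_TIMEOUT UNTIL   (all values in seconds)
# Prints roughly how many heartbeats would be sent, or fails if the
# settings cannot work.
heartbeat_plan() {
  interval=$1
  hook_timeout=$2
  until_s=$3

  # Maximum wait: 48 hours or 100x the hook timeout, whichever is smaller.
  max_wait=$(( hook_timeout * 100 ))
  if [ "$max_wait" -gt 172800 ]; then
    max_wait=172800
  fi

  if [ "$interval" -ge "$hook_timeout" ]; then
    echo "heartbeat interval must be shorter than the hook timeout" >&2
    return 1
  fi
  if [ "$until_s" -gt "$max_wait" ]; then
    echo "requested duration exceeds the maximum wait of ${max_wait}s" >&2
    return 1
  fi

  echo $(( until_s / interval ))
}
```

For example, heartbeat_plan 120 300 3600 prints 30: a heartbeat every two minutes for an hour, against a 300-second hook timeout.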

Prerequisites for Queue Mode

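The requirements listed for Queue mode can be sketched with the AWS CLI roughly as follows. The ASG, queue, hook, and rule names are hypothetical, and the sketch covers only two event types (ASG termination and Spot interruption); treat it as a starting point rather than a complete setup.

```shell
# Hypothetical names; substitute your own.
ASG_NAME="my-asg"
QUEUE_NAME="nth-queue"

# add_rule NAME EVENT_PATTERN QUEUE_ARN: create a rule and target the queue.
add_rule() {
  aws events put-rule --name "$1" --event-pattern "$2"
  aws events put-targets --rule "$1" --targets "Id=1,Arn=$3"
}

create_nth_prereqs() {
  # 1. SQS queue that NTH polls.
  queue_url=$(aws sqs create-queue --queue-name "$QUEUE_NAME" \
    --query QueueUrl --output text)
  queue_arn=$(aws sqs get-queue-attributes --queue-url "$queue_url" \
    --attribute-names QueueArn --query Attributes.QueueArn --output text)

  # 2. Termination lifecycle hook so scale-in pauses in Terminating:Wait.
  aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name nth-termination-hook \
    --auto-scaling-group-name "$ASG_NAME" \
    --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
    --default-result CONTINUE \
    --heartbeat-timeout 300

  # 3. EventBridge rules that forward events to the queue.
  add_rule nth-asg-termination \
    '{"source":["aws.autoscaling"],"detail-type":["EC2 Instance-terminate Lifecycle Action"]}' \
    "$queue_arn"
  add_rule nth-spot-interruption \
    '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}' \
    "$queue_arn"
}

# Usage (needs AWS credentials): create_nth_prereqs
```

The queue also needs a resource policy that lets events.amazonaws.com send messages to it; that step is omitted here to keep the sketch short.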

Minimal IAM Policy Example:

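A sketch of what such a policy could look like, printed by a small function so you can redirect it to a file. The action list reflects the calls Queue mode is described as making (reading and deleting SQS messages, describing instances, completing and heartbeating lifecycle actions); treat it as an assumption and scope Resource down for real use. The function and file names are illustrative.

```shell
# nth_policy: print a minimal IAM policy document for Queue mode.
nth_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:CompleteLifecycleAction",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "autoscaling:RecordLifecycleActionHeartbeat",
        "ec2:DescribeInstances",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Usage (policy name is hypothetical):
#   nth_policy > nth-policy.json
#   aws iam create-policy --policy-name nth-queue-policy \
#     --policy-document file://nth-policy.json
```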

Observability with Prometheus

NTH emits metrics you can scrape using Prometheus Operator:

  • actions_total → number of node drains performed
  • events_error_total → errors while processing AWS events

Depending on your deployment:

  • Use PodMonitor for IMDS mode
  • Use ServiceMonitor for Queue mode

Also, NTH can emit Kubernetes Events (PreDrain, NodeDraining, PostDrain) for audit visibility.
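As a rough illustration of the PodMonitor versus ServiceMonitor split, the helper below prints the Helm flags for each mode. The value names (enablePrometheusServer, podMonitor.create, serviceMonitor.create, emitKubernetesEvents) are the ones I believe the NTH chart exposes, so confirm them against your chart version. The function name is made up.

```shell
# nth_metrics_flags MODE: print Helm --set flags for "imds" or "queue" mode.
nth_metrics_flags() {
  case "$1" in
    imds)  monitor="podMonitor.create=true" ;;
    queue) monitor="serviceMonitor.create=true" ;;
    *)     echo "mode must be imds or queue" >&2; return 1 ;;
  esac
  echo "--set enablePrometheusServer=true --set $monitor --set emitKubernetesEvents=true"
}

# Usage (left unquoted on purpose so the flags split into separate words):
#   helm upgrade aws-node-termination-handler \
#     oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler \
#     --namespace kube-system --reuse-values $(nth_metrics_flags queue)
```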

Best Practices for Using AWS Node Termination Handler

  • Pick your mode wisely
    a) Use Queue mode for production or mixed node groups (ASG hooks + Spot).
    b) Use IMDS mode for lightweight Spot clusters or dev environments.
  • Protect workloads
    a) Define PodDisruptionBudgets (PDBs).
    b) Give pods time with a sensible terminationGracePeriodSeconds (60s+).
    c) Ensure drains don’t deadlock due to strict PDBs.
  • Monitor and alert
    a) Use Prometheus + Grafana dashboards.
    b) Monitor EventBridge and SQS for dropped messages.
  • Don’t double-enable modes
    a) Running both IMDS and Queue at once causes unpredictable behavior.
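For the "protect workloads" points, here is a small sketch: one helper creates a PodDisruptionBudget with kubectl, and another lists budgets that currently allow zero disruptions, which are the ones most likely to stall a drain. The function names and example values are hypothetical.

```shell
# create_pdb NAME SELECTOR MIN_AVAILABLE NAMESPACE
create_pdb() {
  kubectl create poddisruptionbudget "$1" \
    --selector="$2" --min-available="$3" --namespace "$4"
}

# blocking_pdbs: list PDBs that currently allow zero disruptions.
blocking_pdbs() {
  # Columns: NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
  kubectl get pdb --all-namespaces --no-headers |
    awk '$5 == 0 { print $1 "/" $2 }'
}

# Usage:
#   create_pdb web-pdb app=web 2 default
#   blocking_pdbs
```

A budget showing zero allowed disruptions is not always a problem (it can be temporary during a rollout), but it is worth checking before you rely on drains completing in time.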

Testing AWS Node Termination Handler

Let’s validate your setup:

  1. Pick an instance in your Amazon EKS cluster.
  2. Terminate it manually:
    aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx
  3. Watch the node cordon and drain:
    kubectl get node <node-name>    # STATUS includes SchedulingDisabled once cordoned
    kubectl get pods -A -o wide | grep <node-name>
  4. Check events and logs:
    kubectl get events -A --field-selector reason=PreDrain
    kubectl logs -n kube-system -l app=aws-node-termination-handler
  5. You should see:
    a) Node cordoned
    b) Pods draining gracefully
    c) Termination completed cleanly
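If you would rather not re-run the check by hand, the cordon step can be polled with a small helper. This is a sketch: the function name, the 5-second poll, and the 300-second default timeout are arbitrary choices.

```shell
# wait_for_cordon NODE [TIMEOUT_SECONDS]
# Polls until the node is marked unschedulable (cordoned) or the timeout expires.
wait_for_cordon() {
  node=$1
  remaining=${2:-300}
  while [ "$remaining" -gt 0 ]; do
    # .spec.unschedulable is "true" once the node has been cordoned.
    state=$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}' 2>/dev/null) || state=""
    if [ "$state" = "true" ]; then
      echo "$node is cordoned"
      return 0
    fi
    sleep 5
    remaining=$(( remaining - 5 ))
  done
  echo "timed out waiting for $node to be cordoned" >&2
  return 1
}

# Usage:
#   wait_for_cordon <node-name> 120
```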

Troubleshooting Quick Notes

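One quick check for Queue mode, in line with the advice to monitor SQS: look at how many messages are waiting or in flight. This is a sketch using a standard SQS call; the function name is made up, and the queue URL is whatever you configured for NTH.

```shell
# nth_queue_depth QUEUE_URL: show waiting and in-flight message counts.
nth_queue_depth() {
  aws sqs get-queue-attributes --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query Attributes --output table
}

# Usage:
#   nth_queue_depth "https://sqs.us-east-1.amazonaws.com/111122223333/nth-queue"
```

A waiting count that keeps growing suggests NTH is not consuming the queue. A queue that stays empty during a test termination suggests the EventBridge rules are not delivering events.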

Conclusion

Running Kubernetes on AWS means dealing with ephemeral infrastructure.

By default, Amazon EC2 interruptions are abrupt, but with AWS Node Termination Handler you can make them graceful and predictable.

  • IMDS Mode - lightweight, simple, good for Spot use cases.
  • Queue Processor Mode - production-grade, full-featured, heartbeat-powered.

If you’re running Amazon EC2 Spot Instances, using AWS Auto Scaling Groups, or operating in environments where graceful shutdown matters, NTH should be a must-have component in your cluster.

Meet the Author
  • Aamir Shahab
    Senior DevOps Engineer

    Aamir has hands-on experience across AWS, Kubernetes, Terraform, Docker, and Python, with a strong foundation in cloud infrastructure, automation, and container orchestration.
