Graceful Amazon EC2 Shutdowns in Kubernetes with AWS Node Termination Handler

When you run Kubernetes clusters on AWS, one of the biggest operational challenges is handling Amazon EC2 instance interruptions, whether they come from Spot Instance reclaims, scheduled maintenance events, or Auto Scaling group (ASG) scale-ins.
Without proper handling, workloads can be cut off mid-flight, leading to failed requests, degraded performance, or even downtime.

Enter the AWS Node Termination Handler (NTH), an open-source project by AWS that makes Amazon EC2 interruptions graceful, predictable, and Kubernetes-native.

What is the AWS Node Termination Handler?

The AWS Node Termination Handler is a Kubernetes component that listens for AWS events like instance terminations or maintenance notices and ensures the node is cordoned and drained before shutdown.

That means:

No new Pods get scheduled on the node that’s going down.
Existing Pods get time to exit gracefully.
The cluster remains stable even during infrastructure churn.

Note: NTH is mainly needed for self-managed node groups (Amazon EC2-based). For Amazon EKS managed node groups, AWS automatically handles some termination behaviors, though NTH can still enhance observability and handling in edge cases.

NTH listens for AWS events such as:

Amazon EC2 Spot Instance Termination Notices
Scheduled Maintenance Events
Instance Rebalance Recommendations
ASG Scale-In Terminations
Instance State-Change Notifications

How AWS Node Termination Handler Works

NTH can run in two distinct modes, and you must choose only one at a time.

1. IMDS Processor Mode (DaemonSet)

This mode runs as a DaemonSet, with one pod on every node. It polls the Instance Metadata Service (IMDS) for signals like:

Spot Interruption Notices
Scheduled Maintenance Events
Instance Rebalance Recommendations
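
To make the polling idea concrete, here is a minimal Python sketch of how a Spot interruption notice could be interpreted once it shows up in IMDS. It is not NTH's actual code. The metadata path and the JSON shape follow the AWS Spot documentation; the function names parse_spot_action and seconds_until are hypothetical, and the HTTP request itself (including the IMDSv2 session token) is deliberately left out so the example runs anywhere.

```python
import json
from datetime import datetime, timezone

# Path AWS documents for Spot interruption notices. It is not fetched in this
# sketch; on a real node IMDS answers 404 here until a notice has been issued.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_spot_action(body):
    """Parse the JSON body IMDS serves once a Spot interruption is scheduled."""
    doc = json.loads(body)
    # The time is given in UTC, for example 2017-09-18T08:22:00Z.
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def seconds_until(when, now):
    """Time left to cordon and drain before the action happens (never negative)."""
    return max(0.0, (when - now).total_seconds())


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    action, when = parse_spot_action(sample)
    now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
    print(action, seconds_until(when, now))
```

A per-node agent would call something like this on every poll and start the cordon and drain as soon as a notice appears, since a Spot notice leaves roughly two minutes.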

Pros

Lightweight (no extra AWS infra required)
Perfect for Spot-heavy or test clusters

Limitations

Does not support ASG lifecycle hooks or lifecycle heartbeats
Can’t handle Amazon EC2 instance state-change notifications

Installation

2. Queue Processor Mode (Deployment)

This mode runs as a centralized Deployment that consumes events from Amazon EventBridge → Amazon SQS.
It listens for all the events IMDS mode handles, plus a few more:

ASG Lifecycle Hooks (EC2_INSTANCE_TERMINATING)
Amazon EC2 Instance State-Change Notifications
Spot Interruptions & Rebalance Recommendations
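
The event types above arrive as EventBridge events in the SQS message body. The following Python sketch shows one way such a message could be classified; it illustrates the idea and is not NTH's implementation. The detail-type strings and the instance ID fields are the ones AWS uses for these events, while the short kind labels and the classify function are hypothetical names chosen for this example.

```python
import json

# EventBridge "detail-type" values for the events listed above, mapped to
# short labels that exist only in this sketch.
DETAIL_TYPE_TO_KIND = {
    "EC2 Instance-terminate Lifecycle Action": "asg-lifecycle",
    "EC2 Instance State-change Notification": "state-change",
    "EC2 Spot Instance Interruption Warning": "spot-interruption",
    "EC2 Instance Rebalance Recommendation": "rebalance",
}


def classify(sqs_body):
    """Return (kind, instance_id) for an event delivered via SQS, or None."""
    event = json.loads(sqs_body)
    kind = DETAIL_TYPE_TO_KIND.get(event.get("detail-type"))
    if kind is None:
        return None  # not an event this sketch knows how to handle
    detail = event.get("detail", {})
    # ASG lifecycle events carry "EC2InstanceId"; the EC2 events carry "instance-id".
    instance_id = detail.get("EC2InstanceId") or detail.get("instance-id")
    return kind, instance_id


if __name__ == "__main__":
    body = json.dumps({
        "source": "aws.autoscaling",
        "detail-type": "EC2 Instance-terminate Lifecycle Action",
        "detail": {"EC2InstanceId": "i-0123456789abcdef0"},
    })
    print(classify(body))
```

Because a single Deployment reads the queue, the instance ID is what ties an event back to a Kubernetes node before the cordon and drain begins.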

Pros

Full coverage of all event types
Supports lifecycle heartbeats, extending termination time up to 48 hours
Ideal for production clusters

Requirements

Amazon EventBridge rules
SQS queue
IAM permissions via IRSA (or Kiam/Kube2iam)
ASG lifecycle hook setup

Installation

Best practice: If enableSqsTerminationDraining=true, do not enable IMDS draining in the same release. Choose one mode only.

Lifecycle Heartbeats - Buying More Time

When ASG scale-in starts, the instance moves into the Terminating:Wait state.

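Normally, the instance stays there only for the lifecycle hook's heartbeat timeout before it's forcibly terminated. That timeout is often set to a few minutes (for example 300s); if you don't set it, Auto Scaling defaults to 3600s.

In Queue mode, NTH can send lifecycle heartbeats (RecordLifecycleActionHeartbeat) that keep the instance in the Terminating:Wait state for up to 48 hours or 100 times the heartbeat timeout, whichever is smaller, letting long-running Pods finish cleanly.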

Use cases:

Batch jobs or ML workloads that need longer drain times
Stateful workloads that require controlled rebalance across AZs

Note: Heartbeat interval must be shorter than the ASG lifecycle hook timeout. If heartbeats stop or draining completes, AWS proceeds with termination.
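
The arithmetic behind heartbeats can be sketched in a few lines of Python. This is a back-of-the-envelope model rather than NTH's scheduler: it assumes a heartbeat is sent every interval_s seconds while the drain is still running, and that the wait state is capped at 48 hours or 100 times the hook's heartbeat timeout, whichever is smaller. The function and parameter names are hypothetical.

```python
MAX_WAIT_SECONDS = 48 * 3600  # hard ceiling on time spent in a wait state


def max_wait_seconds(hook_timeout_s):
    """Longest an instance can be held: 48h or 100x the hook timeout, whichever is smaller."""
    return min(MAX_WAIT_SECONDS, 100 * hook_timeout_s)


def heartbeats_needed(drain_s, hook_timeout_s, interval_s):
    """Heartbeats required to keep the hook alive for a drain lasting drain_s seconds."""
    if interval_s >= hook_timeout_s:
        raise ValueError("heartbeat interval must be shorter than the hook timeout")
    if drain_s > max_wait_seconds(hook_timeout_s):
        raise ValueError("drain would exceed the maximum wait time")
    # Heartbeats go out at interval_s, 2 * interval_s, ... while the drain is running.
    return max(0, (drain_s - 1) // interval_s)


if __name__ == "__main__":
    # A 30-minute drain with a 300s hook timeout and a heartbeat every 120s.
    print(heartbeats_needed(1800, 300, 120))
    print(max_wait_seconds(300))
```

With a 300s hook timeout the ceiling works out to 30000s, a little over eight hours, so reaching the full 48 hours requires a hook timeout of at least 1728s.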

Prerequisites for Queue Mode

Minimal IAM Policy Example:
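
As a hedged sketch of what such a policy could look like, the Python snippet below builds the policy document and prints it as JSON. The actions are the ones the NTH project documents for Queue mode, plus autoscaling:RecordLifecycleActionHeartbeat when heartbeats are used. The helper name nth_queue_policy, the example queue ARN, and the choice to scope the SQS actions to a single queue are assumptions made for illustration; review the result against your own security requirements before using it.

```python
import json


def nth_queue_policy(queue_arn, with_heartbeats=False):
    """Build an IAM policy document for NTH in Queue mode (illustrative only)."""
    actions = [
        "autoscaling:CompleteLifecycleAction",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstances",
    ]
    if with_heartbeats:
        # Only needed when lifecycle heartbeats are enabled.
        actions.append("autoscaling:RecordLifecycleActionHeartbeat")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["sqs:DeleteMessage", "sqs:ReceiveMessage"],
                # Assumption: SQS access is limited to the one NTH queue.
                "Resource": queue_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical queue ARN; substitute your own.
    arn = "arn:aws:sqs:us-east-1:123456789012:nth-queue"
    print(json.dumps(nth_queue_policy(arn, with_heartbeats=True), indent=2))
```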

Observability with Prometheus

NTH emits metrics you can scrape using Prometheus Operator:

actions_total → number of node drains performed
events_error_total → errors while processing AWS events

Depending on your deployment:

Use PodMonitor for IMDS mode
Use ServiceMonitor for Queue mode

Also, NTH can emit Kubernetes Events (PreDrain, NodeDraining, PostDrain) for audit visibility.
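
If you want to sanity-check the two counters without a full Prometheus setup, a small Python sketch like the one below can read them from the text exposition format. The metric names come from the list above; the label names and values in the sample are hypothetical, and the simple regular expression assumes label values contain no closing brace.

```python
import re

# One sample line of the Prometheus text format: name, optional {labels}, value.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def sum_counter(exposition, metric):
    """Sum every sample of `metric` across all of its label sets."""
    total = 0.0
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comment lines
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == metric:
            total += float(m.group(2))
    return total


if __name__ == "__main__":
    # Hypothetical scrape output; real label names may differ.
    sample = (
        "# TYPE actions_total counter\n"
        'actions_total{node_action="cordon-and-drain"} 2\n'
        'actions_total{node_action="uncordon"} 1\n'
        "events_error_total 0\n"
    )
    print(sum_counter(sample, "actions_total"), sum_counter(sample, "events_error_total"))
```

A rising events_error_total alongside a flat actions_total is the kind of pattern worth alerting on, since it suggests events are arriving but not turning into drains.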

Best Practices for Using AWS Node Termination Handler

1. Pick your mode wisely
    a) Use Queue mode for production or mixed node groups (ASG hooks + Spot).
    b) Use IMDS mode for lightweight Spot clusters or dev environments.
2. Protect workloads
    a) Define PodDisruptionBudgets (PDBs).
    b) Give pods time with a sensible terminationGracePeriodSeconds (60s+).
    c) Ensure drains don’t deadlock due to strict PDBs.
3. Monitor and alert
    a) Use Prometheus + Grafana dashboards.
    b) Monitor EventBridge and SQS for dropped messages.
4. Don’t double-enable modes
    a) Running both IMDS and Queue at once causes unpredictable behavior.
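
The point about strict PDBs is easy to check with a little arithmetic. The Python sketch below follows the way Kubernetes derives allowed disruptions from a PDB (healthy pods minus the number that must stay healthy), restricted to integer minAvailable and maxUnavailable values; percentages are not handled, and the function names are hypothetical.

```python
def disruptions_allowed(healthy_pods, expected_pods, min_available=None, max_unavailable=None):
    """How many pods a PDB lets the eviction API remove right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)


def blocks_drain(healthy_pods, expected_pods, **pdb):
    """True when the PDB leaves no room for a single eviction."""
    return disruptions_allowed(healthy_pods, expected_pods, **pdb) == 0


if __name__ == "__main__":
    # 3 replicas with minAvailable=3: no eviction is ever allowed, so a drain stalls.
    print(blocks_drain(3, 3, min_available=3))
    # 3 replicas with maxUnavailable=1: one pod at a time can be evicted.
    print(blocks_drain(3, 3, max_unavailable=1))
```

A PDB that blocks every eviction will hold a drain until the node is terminated anyway, which defeats the purpose of a graceful shutdown.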

Testing AWS Node Termination Handler

Let’s validate your setup:

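1. Pick an instance in your Amazon EKS cluster.
2. Terminate it manually. This exercises Queue mode through the instance state-change notification; IMDS mode does not see manual terminations.
    aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx
3. Watch the node cordon and drain. A cordoned node reports Unschedulable: true, so filter on that rather than on the word "cordon":
    kubectl describe node <node-name> | grep -i unschedulable
    kubectl get pods -A -o wide | grep <node-name>
4. Check events and logs (adjust the namespace and label selector if your release uses different ones):
    kubectl get events -A --field-selector reason=PreDrain
    kubectl logs -n kube-system -l app=aws-node-termination-handler
5. You should see:
    a) Node cordoned
    b) Pods draining gracefully
    c) Termination completed cleanly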

Troubleshooting Quick Notes

Conclusion

Running Kubernetes on AWS means dealing with ephemeral infrastructure.

By default, Amazon EC2 interruptions are abrupt, but with AWS Node Termination Handler you can make them graceful and predictable.

IMDS Mode - lightweight, simple, good for Spot use cases.
Queue Processor Mode - production-grade, full-featured, heartbeat-powered.

If you’re running Amazon EC2 Spot Instances, using AWS Auto Scaling Groups, or operating in environments where graceful shutdown matters, NTH should be a must-have component in your cluster.

Meet the Author

Aamir Shahab
Senior DevOps Engineer
Aamir has hands-on experience across AWS, Kubernetes, Terraform, Docker, and Python, with a strong foundation in cloud infrastructure, automation, and container orchestration.
