Values delivered
- Improved CPU utilization across clusters.
- Reduced pod restarts and oversized nodes.
- Enabled granular pod cost attribution.
- Lowered EC2 costs with optimized instances.
- Reduced pod instability with smart scheduling.
Challenges
Though Locus had a solid EKS foundation, the team faced persistent inefficiencies in resource usage, scheduling behavior, and cost visibility, limiting their ability to convert insights into action.
EC2 nodes showed average CPU utilization under 10% due to memory-heavy but CPU-light workloads. Karpenter’s bin-packing struggled to schedule these efficiently, leading to large, underutilized instances across clusters.
Tools like Kubecost and CastAI provided cost insights, but didn’t align well with internal workflows. The team lacked the clarity and granularity needed to take decisive action on their cost data.
Overly aggressive consolidation windows led to frequent pod evictions and churn. While the team had separated workloads by type, scheduling still caused instability and unnecessary overhead.
NodeClaim policies unintentionally allowed provisioning of 32xlarge and 48xlarge instances. These oversized types weren’t regularly used but posed cost risks during burst scenarios or resource crunches.
Solution
CloudKeeper worked with Locus to implement a targeted optimization strategy covering right-sizing, Karpenter tuning, cost visibility, and infrastructure hardening, improving resource utilization and workload stability while reducing cloud spend.
After analyzing underutilized EC2 nodes and misaligned instance choices, we reclassified services based on memory and CPU profiles. A tailored mix of R-series and M-series instances improved bin-packing efficiency and reduced waste, guided by average (not peak) CPU usage.
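To illustrate how such an instance mix can be encoded, the sketch below renders a Karpenter NodePool fragment that limits a pool to R (memory-optimized) and M (general-purpose) instance categories. It is a minimal sketch assuming a recent Karpenter NodePool CRD; the pool name, generation filter, and exact values are illustrative rather than taken from the Locus setup.

```python
# Hypothetical sketch: a Karpenter NodePool constrained to R- and M-series
# instance categories, rendered to YAML with PyYAML (pip install pyyaml).
import yaml

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "memory-optimized"},  # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Limit provisioning to R (memory-optimized) and
                    # M (general-purpose) instance categories.
                    {"key": "karpenter.k8s.aws/instance-category",
                     "operator": "In", "values": ["r", "m"]},
                    # Prefer recent hardware generations; cutoff is illustrative.
                    {"key": "karpenter.k8s.aws/instance-generation",
                     "operator": "Gt", "values": ["4"]},
                ],
            },
        },
    },
}

print(yaml.safe_dump(node_pool, sort_keys=False))
```

The rendered YAML can be applied like any other manifest once adjusted to the cluster's installed Karpenter version.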
We optimized node pool segmentation by isolating workloads by type and duration, and raised consolidation idle timers from 1 minute to 15 to 30 minutes. This stabilized pods and minimized churn without compromising responsiveness to load changes.
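For reference, the fragment below shows what a longer consolidation window can look like in a Karpenter NodePool's disruption block. It is a sketch assuming the karpenter.sh/v1 API; the 30-minute value reflects the upper end of the window described above.

```python
# Hypothetical sketch: disruption settings for a Karpenter NodePool that wait
# 30 minutes of underutilization before consolidating, instead of ~1 minute.
import yaml

disruption = {
    "disruption": {
        # Consolidate nodes that are empty or underutilized...
        "consolidationPolicy": "WhenEmptyOrUnderutilized",
        # ...but only after they have been candidates for 30 minutes, which
        # damps the eviction churn described above. The duration is illustrative.
        "consolidateAfter": "30m",
    },
}

# This fragment belongs under `spec` of a NodePool manifest (karpenter.sh/v1).
print(yaml.safe_dump(disruption, sort_keys=False))
```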
Vertical Pod Autoscaler (VPA) was introduced in recommendation mode for staging and in off mode for production. Its suggestions helped right-size resource requests, particularly for Java services, resulting in better CPU utilization and less overprovisioning without service disruptions.
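A minimal example of a recommendation-only rollout is sketched below: a VerticalPodAutoscaler object with updateMode set to "Off", which records suggested requests without evicting pods. The target Deployment name is hypothetical.

```python
# Hypothetical sketch: a VerticalPodAutoscaler in "Off" mode, which records
# resource recommendations without evicting or resizing running pods.
import yaml

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "orders-api-vpa"},        # hypothetical name
    "spec": {
        "targetRef": {                             # hypothetical target workload
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "orders-api",
        },
        # "Off" = recommendation-only; suggestions land in the VPA status.
        "updatePolicy": {"updateMode": "Off"},
    },
}

print(yaml.safe_dump(vpa, sort_keys=False))
```

Recommendations then surface in the object's status (for example via kubectl describe verticalpodautoscaler) and can be folded into the workload's resource requests by hand.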
We discovered and removed overly large instances (e.g., 32xlarge, 48xlarge) from autoscaling configurations. Spot-based pools with TTL policies were introduced for short-lived workloads, helping isolate bursty cron jobs from long-running production services to enhance cost control.
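The sketch below combines both ideas in a single hypothetical Karpenter NodePool for bursty, short-lived jobs: spot-only capacity, the oversized instance sizes excluded, and a TTL-style node expiry. Field names assume the karpenter.sh/v1 API; names and durations are illustrative.

```python
# Hypothetical sketch: a spot-only Karpenter NodePool for short-lived jobs that
# excludes the oversized instance types and expires nodes after a few hours.
import yaml

spot_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "burst-spot"},            # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Spot capacity only, for interruptible cron-style work.
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot"]},
                    # Keep the oversized instance types out of this pool.
                    {"key": "karpenter.k8s.aws/instance-size",
                     "operator": "NotIn", "values": ["32xlarge", "48xlarge"]},
                ],
                # TTL-style cap on node lifetime; in the older v1beta1 API this
                # lives under spec.disruption instead. Duration is illustrative.
                "expireAfter": "4h",
            },
        },
    },
}

print(yaml.safe_dump(spot_pool, sort_keys=False))
```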
Locus switched from Kubecost to CloudKeeper Lens to gain deeper cost attribution across pods, containers, and namespaces. The platform provided clearer insights and required no additional configuration, helping teams tie usage to spend more effectively.
A full cluster audit flagged root-level pods and recommended hardening. All pods used IRSA, and ENI limits were healthy. We also advised moving from sidecar logging to stdout/stderr streams for simpler, more scalable log management.
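As a small illustration of that logging advice, the snippet below points a Python application logger at stdout with one JSON object per line, so the container runtime captures logs without a sidecar; the field names and service name are illustrative.

```python
# Minimal sketch: write application logs to stdout so the container runtime
# captures them, removing the need for a per-pod logging sidecar.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per line; keys are illustrative.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, not a file or sidecar
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders-api").info("request handled")  # hypothetical service
```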
Within 48 hours of the changes going live, CPU over-allocation dropped and node churn fell dramatically.
Beyond the immediate results, Locus committed to a phased rollout of the optimizations across multiple production regions.
Speak with our advisors to learn how you can take control of your cloud costs.