Values delivered
- Improved CPU utilization across clusters.
- Reduced pod restarts and oversized nodes.
- Enabled granular pod cost attribution.
- Lowered EC2 costs with optimized instances.
- Reduced pod instability with smart scheduling.
Challenges
Though Locus had a solid EKS foundation, the team faced persistent inefficiencies in resource usage, scheduling behavior, and cost visibility, limiting their ability to convert insights into action.
EC2 nodes showed average CPU utilization under 10% due to memory-heavy but CPU-light workloads. Karpenter’s bin-packing struggled to schedule these efficiently, leading to large, underutilized instances across clusters.
Tools like Kubecost and CastAI provided cost insights, but didn’t align well with internal workflows. The team lacked the clarity and granularity needed to take decisive action on their cost data.
Overly aggressive consolidation windows led to frequent pod evictions and churn. While the team had separated workloads by type, scheduling still caused instability and unnecessary overhead.
NodeClaim policies unintentionally allowed provisioning of 32xlarge and 48xlarge instances. These oversized types weren’t regularly used but posed cost risks during burst scenarios or resource crunches.
Solution
CloudKeeper worked with Locus to implement a targeted optimization strategy covering right-sizing, Karpenter tuning, cost visibility, and infrastructure hardening, improving resource utilization and workload stability while reducing cloud spend.
After analyzing underutilized EC2 nodes and misaligned instance choices, we reclassified services based on memory and CPU profiles. A tailored mix of R-series and M-series instances improved bin-packing efficiency and reduced waste, guided by average (not peak) CPU usage.
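To illustrate how such an instance mix can be encoded, the sketch below renders a Karpenter NodePool fragment that limits a pool to R (memory-optimized) and M (general-purpose) instance categories. It is a minimal sketch assuming a recent Karpenter NodePool CRD; the pool name, generation filter, and exact values are illustrative rather than taken from the Locus setup.

```python
# Hypothetical sketch: a Karpenter NodePool constrained to R- and M-series
# instance categories, rendered to YAML with PyYAML (pip install pyyaml).
import yaml

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "memory-optimized"},  # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Limit provisioning to R (memory-optimized) and
                    # M (general-purpose) instance categories.
                    {"key": "karpenter.k8s.aws/instance-category",
                     "operator": "In", "values": ["r", "m"]},
                    # Prefer recent hardware generations; cutoff is illustrative.
                    {"key": "karpenter.k8s.aws/instance-generation",
                     "operator": "Gt", "values": ["4"]},
                ],
            },
        },
    },
}

print(yaml.safe_dump(node_pool, sort_keys=False))
```

The rendered YAML can be applied like any other manifest once adjusted to the cluster's installed Karpenter version.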
We optimized node pool segmentation by isolating workloads by type and duration, and raised consolidation idle timers from 1 minute to 15 to 30 minutes. This stabilized pods and minimized churn without compromising responsiveness to load changes.
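For reference, the fragment below shows what a longer consolidation window can look like in a Karpenter NodePool's disruption block. It is a sketch assuming the karpenter.sh/v1 API; the 30-minute value reflects the upper end of the window described above.

```python
# Hypothetical sketch: disruption settings for a Karpenter NodePool that wait
# 30 minutes of underutilization before consolidating, instead of ~1 minute.
import yaml

disruption = {
    "disruption": {
        # Consolidate nodes that are empty or underutilized...
        "consolidationPolicy": "WhenEmptyOrUnderutilized",
        # ...but only after they have been candidates for 30 minutes, which
        # damps the eviction churn described above. The duration is illustrative.
        "consolidateAfter": "30m",
    },
}

# This fragment belongs under `spec` of a NodePool manifest (karpenter.sh/v1).
print(yaml.safe_dump(disruption, sort_keys=False))
```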
Vertical Pod Autoscaler (VPA) was introduced in recommendation mode for staging and in off mode for production. Its suggestions helped right-size resource requests, particularly for Java services, resulting in better CPU utilization and less overprovisioning without service disruptions.
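A minimal example of a recommendation-only rollout is sketched below: a VerticalPodAutoscaler object with updateMode set to "Off", which records suggested requests without evicting pods. The target Deployment name is hypothetical.

```python
# Hypothetical sketch: a VerticalPodAutoscaler in "Off" mode, which records
# resource recommendations without evicting or resizing running pods.
import yaml

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "orders-api-vpa"},        # hypothetical name
    "spec": {
        "targetRef": {                             # hypothetical target workload
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "orders-api",
        },
        # "Off" = recommendation-only; suggestions land in the VPA status.
        "updatePolicy": {"updateMode": "Off"},
    },
}

print(yaml.safe_dump(vpa, sort_keys=False))
```

Recommendations then surface in the object's status (for example via kubectl describe verticalpodautoscaler) and can be folded into the workload's resource requests by hand.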
We discovered and removed overly large instances (e.g., 32xlarge, 48xlarge) from autoscaling configurations. Spot-based pools with TTL policies were introduced for short-lived workloads, helping isolate bursty cron jobs from long-running production services to enhance cost control.
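The sketch below combines both ideas in a single hypothetical Karpenter NodePool for bursty, short-lived jobs: spot-only capacity, the oversized instance sizes excluded, and a TTL-style node expiry. Field names assume the karpenter.sh/v1 API; names and durations are illustrative.

```python
# Hypothetical sketch: a spot-only Karpenter NodePool for short-lived jobs that
# excludes the oversized instance types and expires nodes after a few hours.
import yaml

spot_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "burst-spot"},            # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Spot capacity only, for interruptible cron-style work.
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot"]},
                    # Keep the oversized instance types out of this pool.
                    {"key": "karpenter.k8s.aws/instance-size",
                     "operator": "NotIn", "values": ["32xlarge", "48xlarge"]},
                ],
                # TTL-style cap on node lifetime; in the older v1beta1 API this
                # lives under spec.disruption instead. Duration is illustrative.
                "expireAfter": "4h",
            },
        },
    },
}

print(yaml.safe_dump(spot_pool, sort_keys=False))
```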
Locus switched from Kubecost to CloudKeeper Lens to gain deeper cost attribution across pods, containers, and namespaces. The platform provided clearer insights and required no additional configuration, helping teams tie usage to spend more effectively.
A full cluster audit flagged root-level pods and recommended hardening. All pods used IRSA, and ENI limits were healthy. We also advised moving from sidecar logging to stdout/stderr streams for simpler, more scalable log management.
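As a small illustration of that logging advice, the snippet below points a Python application logger at stdout with one JSON object per line, so the container runtime captures logs without a sidecar; the field names and service name are illustrative.

```python
# Minimal sketch: write application logs to stdout so the container runtime
# captures them, removing the need for a per-pod logging sidecar.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per line; keys are illustrative.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, not a file or sidecar
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders-api").info("request handled")  # hypothetical service
```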
Within 48 hours of the changes going live, CPU over-allocation dropped and node churn fell dramatically.
Beyond the immediate results, Locus committed to a phased rollout of the optimizations across multiple production regions.
Speak with our advisors to learn how you can take control of your cloud costs.