Helping a Global Energy Intelligence Platform resolve EKS logging issues

Industry: Energy Intelligence / SaaS
Headquarters: Mountain View, California
Founded in: 2011
Company Size: 500+ Employees

Overview

The customer is a global leader in AI-powered energy intelligence solutions, helping utilities and energy providers turn smart meter and IoT data into actionable insights. Their platform supports load disaggregation, energy efficiency, demand response, and customer engagement initiatives.

Challenges

The customer runs backend workloads on Amazon EKS. While Spot Instances kept costs down, operational and reliability issues began to affect platform stability and daily operations.

  • Pods frequently crashed due to application logs filling ephemeral storage and triggering disk pressure conditions.
  • Unexpected pod failures caused permanent log loss, creating auditability gaps and compliance risks.
  • Limited log visibility slowed incident investigation and increased recovery times during production issues.
  • Engineering teams spent excessive time firefighting failures, increasing operational overhead and impacting service reliability.

The customer needed a scalable, fault-tolerant logging strategy that prevented storage-related crashes, ensured persistent logs, and required minimal or no application code changes.

The Solution

Solution: Partner-led Support

CloudKeeper partnered with the customer’s engineering teams to design a resilient, low-touch logging architecture that improved platform stability, ensured log durability, and aligned with AWS best practices.

Log Offloading via Sidecar Architecture
  • Implemented Fluent Bit sidecar containers in EKS pods to stream logs off the pod continuously, so that logs persist regardless of pod termination.
  • Enabled log synchronization at defined intervals without requiring application code changes.
  • Enforced log rotation policies to prevent ephemeral storage exhaustion and disk pressure issues.
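
The two mechanisms above, interval-based synchronization and rotation under a storage budget, can be illustrated with a short Python sketch. It is not the customer's Fluent Bit configuration, which the case study does not publish. The function names `read_new_bytes` and `prune_rotated` and the byte budget are hypothetical, chosen only to show the idea: a sidecar remembers how far it has read, ships only what is new on each pass, and removes the oldest rotated files before the volume fills up.

```python
from __future__ import annotations

from pathlib import Path


def read_new_bytes(path: Path, offset: int) -> tuple[bytes, int]:
    """Return the bytes appended to `path` since `offset`, plus the new offset.

    Hypothetical helper: a sidecar would call this once per sync interval
    and ship the returned bytes to remote storage.
    """
    size = path.stat().st_size
    if size < offset:
        # The file shrank, so it was rotated or truncated: start over.
        offset = 0
    with path.open("rb") as fh:
        fh.seek(offset)
        data = fh.read()
    return data, offset + len(data)


def prune_rotated(log_dir: Path, budget_bytes: int) -> list[str]:
    """Delete the oldest files in `log_dir` until the total fits the budget.

    Returns the names of the deleted files, oldest first. The budget is an
    assumed value; in practice it would sit below the pod's ephemeral
    storage limit.
    """
    files = sorted(
        (p for p in log_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in files)
    deleted: list[str] = []
    for p in files:
        if total <= budget_bytes:
            break
        total -= p.stat().st_size
        p.unlink()
        deleted.append(p.name)
    return deleted
```

In this sketch the sync step runs before the prune step on every interval, so a rotated file is only removed after its contents have been shipped. Keeping that order is what lets rotation free space without losing logs.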
Centralized and Structured Log Storage
  • Offloaded logs to Amazon S3 using a structured, date- and application-based hierarchy.
  • Enabled clear segregation of logs by workload and time for improved governance.
  • Simplified log retrieval for audits, investigations, and operational troubleshooting.
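
A date- and application-based hierarchy can be sketched as a small key-building function. The case study does not give the actual layout, so the ordering used here (prefix, application, year, month, day), the `logs` prefix, and the names `build_log_key` and `day_prefix` are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def build_log_key(app: str, pod: str, ts: datetime, prefix: str = "logs") -> str:
    """Build an object key of the form prefix/app/YYYY/MM/DD/pod-HHMMSS.log.

    The layout is an assumed example, not the customer's real scheme.
    `ts` must be timezone-aware; it is normalized to UTC so that keys
    sort consistently regardless of where the pod runs.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return f"{prefix}/{app}/{ts:%Y/%m/%d}/{pod}-{ts:%H%M%S}.log"


def day_prefix(app: str, day: datetime, prefix: str = "logs") -> str:
    """Return the key prefix covering one application's logs for one UTC day."""
    if day.tzinfo is None:
        raise ValueError("day must be timezone-aware")
    day = day.astimezone(timezone.utc)
    return f"{prefix}/{app}/{day:%Y/%m/%d}/"
```

With a layout like this, an audit request for one workload on one day reduces to listing a single prefix, which is the kind of simplified retrieval the bullets above describe.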
Stability and Reliability Improvements
  • Decoupled logging from the application lifecycle to ensure log persistence during unexpected pod failures.
  • Eliminated disk pressure–related crashes while aligning logging operations with AWS scalability and cost-efficiency best practices.

This solution established a stable, scalable logging foundation, reducing operational risk and enabling engineering teams to focus on reliability, performance improvements, and delivering consistent experiences.

Values Delivered

Post-Optimization Impact

  • Zero Disk Pressure Failures - Pod crashes due to disk pressure were fully resolved.
  • Reliable Log Retention - Logs remained consistently available for audit, even in cases of abrupt pod termination.
  • Operational Efficiency - DevOps teams reclaimed time from manual firefighting, redirecting efforts toward innovation.
  • Improved Customer Experience - Stable backend services significantly reduced disruption and enhanced reliability.

Conclusion

CloudKeeper helped the customer resolve their logging and storage constraints by decoupling log management from pod lifecycles and eliminating storage pressure. The resulting architecture improved fault tolerance and observability, and reduced operational noise across production workloads.

With a stable, scalable logging foundation, the customer now operates and scales Kubernetes workloads with greater reliability, control, and confidence.
