Helping a Global Energy Intelligence Platform resolve EKS logging issues

Overview
The customer is a global leader in AI-powered energy intelligence solutions, helping utilities and energy providers turn smart meter and IoT data into actionable insights. Their platform supports load disaggregation, energy efficiency, demand response, and customer engagement initiatives.
Challenges
The customer runs backend workloads on Amazon EKS. Spot Instances kept compute costs down, but operational and reliability issues began to affect platform stability and daily operations.
- Pods frequently crashed due to application logs filling ephemeral storage and triggering disk pressure conditions.
- Unexpected pod failures caused permanent log loss, creating auditability gaps and compliance risks.
- Limited log visibility slowed incident investigation and increased recovery times during production issues.
- Engineering teams spent excessive time firefighting failures, increasing operational overhead and impacting service reliability.
The customer needed a scalable, fault-tolerant logging strategy that prevented storage-related crashes, ensured persistent logs, and required minimal or no application code changes.
The Solution
Solution: Partner-led Support
CloudKeeper partnered with the customer’s engineering teams to design a resilient, low-touch logging architecture that improved platform stability, ensured log durability, and aligned with AWS best practices.
- Implemented Fluent Bit sidecar containers in EKS pods to continuously stream logs independent of pod termination.
- Enabled log synchronization at defined intervals without requiring application code changes.
- Enforced log rotation policies to prevent ephemeral storage exhaustion and disk pressure issues.
- Offloaded logs to Amazon S3 using a structured, date- and application-based hierarchy.
- Enabled clear segregation of logs by workload and time for improved governance.
- Simplified log retrieval for audits, investigations, and operational troubleshooting.
- Decoupled logging from the application lifecycle to ensure log persistence during unexpected pod failures.
- Eliminated disk pressure–related crashes while aligning logging operations with AWS scalability and cost-efficiency best practices.
This solution established a stable, scalable logging foundation, reducing operational risk and enabling engineering teams to focus on reliability, performance improvements, and delivering consistent experiences.
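As an illustration only, two of the mechanisms listed above can be sketched in Python: size-based log rotation that needs no cooperation from the application, and a date- and application-based S3 key layout. This is not the customer's implementation, which the case study attributes to Fluent Bit sidecars. The function names, the `logs` prefix, the `app/YYYY/MM/DD` ordering, and the copy-then-truncate strategy are all assumptions made for the example.

```python
import os
import shutil
from datetime import datetime, timezone
from pathlib import Path


def s3_key_for(app: str, ts: datetime, filename: str, prefix: str = "logs") -> str:
    """Build an object key of the form prefix/app/YYYY/MM/DD/filename.

    The layout is hypothetical; the source only says the hierarchy is
    date- and application-based.
    """
    day = ts.astimezone(timezone.utc)  # normalise to UTC so keys are stable
    return f"{prefix}/{app}/{day:%Y/%m/%d}/{filename}"


def rotate_copytruncate(path: Path, max_bytes: int, keep: int = 3) -> bool:
    """Rotate `path` once it reaches `max_bytes`, keeping `keep` backups.

    The live file is copied to path.1 and then truncated in place, so a
    process that already has it open can keep writing without changes.
    Lines written between the copy and the truncate can be lost.
    Returns True if a rotation happened.
    """
    if not path.exists() or path.stat().st_size < max_bytes:
        return False
    # Shift path.n -> path.(n+1), highest index first; the oldest is overwritten.
    for n in range(keep - 1, 0, -1):
        src = path.with_name(f"{path.name}.{n}")
        if src.exists():
            src.replace(path.with_name(f"{path.name}.{n + 1}"))
    shutil.copyfile(path, path.with_name(f"{path.name}.1"))
    os.truncate(path, 0)  # empty the live file without replacing it
    return True
```

With these assumptions, disk usage for one log file stays at roughly `(keep + 1) * max_bytes`, and a call such as `s3_key_for("billing-api", ts, "app.log")` (a made-up workload name) groups objects by workload first and date second, so a per-application, per-day prefix listing is enough to find the logs for an audit.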
Post-Optimization Impact
Zero Disk Pressure Failures - Pod crashes due to disk pressure were fully resolved.
Reliable Log Retention - Logs remained consistently available for audit, even in cases of abrupt pod termination.
Operational Efficiency - DevOps teams reclaimed time from manual firefighting, redirecting efforts toward innovation.
Improved Customer Experience - Stable backend services significantly reduced disruption and enhanced reliability.
Conclusion
CloudKeeper helped the customer resolve their logging and storage constraints by decoupling log management from pod lifecycles and eliminating storage pressure. The resulting architecture improved fault tolerance and observability, and reduced operational noise across production workloads.
With a stable, scalable logging foundation, the customer now operates and scales Kubernetes workloads with greater reliability, control, and confidence.
