Sustaining resilience during rapid growth
Solidus Labs had already implemented several capabilities to keep its Kubernetes environment running smoothly and efficiently. But it wasn’t until it introduced PerfectScale by DoiT that it could comprehensively right-size its pod resources and address the root cause of recurring CPU throttling and out-of-memory (OOM) issues.
PerfectScale by DoiT helped Solidus navigate an infrastructure that’s in a constant state of change. “R&D is releasing changes on an hourly basis due to the nature of our business,” said Ben Hoffman, R&D Director at Solidus Labs. “Some of our clients send data in large batches, while others use us as a real-time service, making it hard to predict the load fluctuations on our services.”
By automating resource recommendations and scaling decisions, PerfectScale enabled the team to shift away from reactive, manual interventions. Previously, they spent hours stabilizing their largest cluster and replicating configurations across others—only to see results quickly degrade. With PerfectScale, that effort became unnecessary, and resource waste across smaller clusters was eliminated.
“I would jump into a Grafana and pull in metrics from Prometheus and logs from Logz.io, and make adjustments to the requests based on the different peaks of our environment,” said Shemtov Fisher, DevOps Engineer at Solidus Labs/Develeap. “Then a few weeks would pass, and we’d start seeing throttling and memory issues resurface, leading to a second round of adjustments. When I jumped in a third time, I knew we needed a solution in place to help automate this process. PerfectScale by DoiT is the exact solution we needed to fill this gap.”
Improving Kubernetes stability by reducing CPU throttling and OOM issues by 90%
Shortly after implementing PerfectScale by DoiT, Solidus could proactively “right-scale” its pod resources, significantly reducing CPU throttling and OOM issues.
“We went from multiple issues a day, to maybe one or two issues in the last month,” said Hoffman. “With PerfectScale, we have seen over a 90% reduction helping us ensure our applications have the capacity to meet our customer demand.”
Additionally, PerfectScale has drastically reduced the mean-time-to-resolution (MTTR) for capacity-related issues.
“Before PerfectScale, the DevOps team would get an alert when an issue occurred, then we would triage the issue to the proper service owner to resolve,” said Barak Arzuan, DevOps Engineer at Solidus Labs/Develeap. “Depending on the criticality, it could take hours or even more for the service owners to evaluate the issue and provide us with the proper resource requirements. With PerfectScale, we can immediately provide the service providers with evidence on why the issue is happening along with precise recommendations on how to resolve it. This has helped a lot with our day-to-day operations.”
No more continuous manual work for system health and cost-efficiency
Adding capacity to improve system resilience and stability comes at a price. To offset the extra cost, the team used PerfectScale’s cost-optimization capabilities to move unused resources to areas that needed additional capacity.
“In some of our clusters, we found significant cost savings opportunities,” explained Arzuan. “We were able to reinvest these savings into our clusters that were lacking resources. This resulted in a fully stable, resilient, and cost-effective environment with no impacts on our budget.”
“We have a large number of clients, each using our application slightly differently. Keeping our Kubernetes environment optimized is essential for Solidus Labs to ensure our applications have the resources they need to support our customers today, and as our company continues to grow in the future,” said Hoffman. “PerfectScale is removing time-consuming manual tasks we have faced in the past, making it easy to continuously maintain our system’s health and cost-effectiveness.”