It is fascinating to see how FinOps blends into observability. Cloud monitoring is not just about keeping things up and running; it goes beyond watching for outages and reviewing application errors and infrastructure anomalies. For us at Apptio, an issue is not limited to an outage or performance degradation; an issue is anything that does not align with our business objectives, so overspending is an issue.
Monitoring for migrations
Recently, we had such an issue with one of our large-scale data processing jobs. The service had recently migrated to our primary Kubernetes platform from a legacy environment where the jobs ran directly on the hosts. In addition to focused post-migration monitoring, we had alerts in place to help detect any issues. Soon after the migration, we received a Cloudability alert indicating that our spending on this service was rising and had almost doubled, putting us well outside our budgeted costs. Interestingly, we had not received any alerts or notifications for performance or application issues. In fact, the application was performing better than before the migration, and the data processing jobs ran faster because some of the more demanding jobs were now allocated to much larger EC2 instances.
Root cause analysis
We looked at the data from different perspectives using various views in Cloudability. We could see that On-Demand spend had gone up and that the Spot/On-Demand ratio had flipped in favor of On-Demand. A cost breakdown by instance type indicated that the primary issue involved the r6i.16xlarge and c5.9xlarge instances.
We drilled further into the data and analyzed the wasted resources by instance type. This revealed that our pods were not efficiently packed, especially on the r6i.16xlarge instances.
First, we dug into the root cause of the poor pod allocation. We use Kyverno policies to match applications with the appropriate nodes and achieve optimal allocation. The data processing job comes in three flavors based on its resource requirements: small, large, and huge. According to the Kyverno policies, huge workloads are directed to the r6i.16xlarge group and large workloads go to c5.9xlarge, but small workloads were not accounted for. As a result, they could be scheduled onto any node, including those reserved for huge and large workloads. Because of the missing Kyverno policy, the largest and most expensive instances became fragmented and remained underutilized, occupied by just one small workload. To remedy this, we added another policy to schedule small workloads onto c5.9xlarge or c5.4xlarge instances, depending on their resource requests. This ensured that each processing job would be matched to the best-fitting instance, maximizing allocation density and minimizing waste.
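To illustrate, here is a minimal sketch of what such a routing policy can look like in Kyverno. It is a single simplified rule: the label keys and values and the instance types in the affinity term are assumptions for illustration, and the real policies split small workloads between the two instance sizes based on their resource requests.

```yaml
# Hypothetical sketch: steer "small" data-processing pods onto the c5 instance
# families so they no longer fragment the r6i.16xlarge nodes reserved for huge jobs.
# Label keys/values are illustrative, not the actual policy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: route-small-processing-jobs
spec:
  rules:
    - name: pin-small-jobs-to-c5
      match:
        any:
          - resources:
              kinds:
                - Pod
              selector:
                matchLabels:
                  app: data-processing    # hypothetical label
                  workload-size: small    # hypothetical label
      mutate:
        patchStrategicMerge:
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node.kubernetes.io/instance-type
                          operator: In
                          values:
                            - c5.4xlarge
                            - c5.9xlarge
```

Mutating pods at admission keeps the scheduling constraints in one place instead of baking node selectors into every job spec.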
Maximize pod density
Another action for optimizing deployment density had to do with the Cluster Autoscaler. By default, the Cluster Autoscaler tries to evict pods from nodes that are less than 50% utilized in order to pack them more tightly onto other nodes. When it initiates the eviction, it puts a “ToBeDeletedByClusterAutoscaler” taint on the node to prevent other pods from being scheduled on it. The problem is that some pods have long termination grace periods: the node cannot be reclaimed until those pods finish, and in the meantime the taint leaves it underutilized and unusable by other workloads. We disabled this default setting, allowing short-running workloads to use a node that was otherwise slated for deletion.
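The exact flag we changed is not shown here, but the behavior described above is controlled by well-known Cluster Autoscaler options. The trimmed sketch below is an illustration of those knobs, not our production configuration; the image tag and values are placeholders.

```yaml
# Hypothetical sketch of the Cluster Autoscaler deployment args.
# --scale-down-utilization-threshold is the knob behind the "evict pods from
# nodes under 50% utilization" default; --scale-down-enabled=false turns
# scale-down (and the ToBeDeletedByClusterAutoscaler taint it applies) off entirely.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3  # example tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --scale-down-enabled=false                 # disable scale-down eviction altogether
            # - --scale-down-utilization-threshold=0.5   # the 50% default mentioned above
```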
As a result of these two actions, we cut the waste on r6i.16xlarge instances in half, as shown in the chart below.
Next steps — future optimizations
We are still not out of the woods and have a few follow-up action items. While we improved the utilization of the most expensive instances and reduced the cost of our waste, we pushed the problem down to the next tier: the c5.9xlarge instances, with 72 GB of memory, now host pods that request 64 GB. To reduce the 8 GB of waste per host, we are planning to switch these workloads to c6i.16xlarge instances, which will allow us to run two pods per host without wasting resources.
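As a rough sketch, the re-homed workload could look like the pod below: it requests the 64 GB mentioned above and is pinned to c6i.16xlarge (128 GiB of memory) via the standard instance-type label. The names, labels, and image are placeholders, and in practice the request may need to sit slightly under 64 Gi because the kubelet and system daemons reserve part of the node's memory.

```yaml
# Hypothetical sketch: a data-processing pod pinned to c6i.16xlarge so that
# two 64 GB pods can share one 128 GiB host. Names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: data-processing-large
  labels:
    app: data-processing
    workload-size: large
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: c6i.16xlarge
  containers:
    - name: worker
      image: example.com/data-processing:latest   # placeholder image
      resources:
        requests:
          memory: 64Gi   # two of these fit a 128 GiB node, minus kubelet/system reservations
        limits:
          memory: 64Gi
```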
We also introduced Karpenter as a replacement for the Cluster Autoscaler and expect several improvements, such as faster termination of unused nodes and better deployment density.
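For illustration, a minimal Karpenter NodePool covering these instance types could look roughly like the sketch below (v1beta1 API); the instance types, capacity types, and EC2NodeClass name are placeholders rather than our production configuration.

```yaml
# Hypothetical sketch of a Karpenter NodePool (v1beta1 API) for the data-processing
# workloads. Instance types, capacity types, and the EC2NodeClass name are placeholders.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: data-processing
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default                    # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.4xlarge", "c5.9xlarge", "c6i.16xlarge", "r6i.16xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    # Consolidate underutilized nodes instead of leaving them tainted and idle.
    consolidationPolicy: WhenUnderutilized
```

Karpenter's consolidation replaces the taint-and-drain flow we had to tune above, which is where we expect the faster node termination and better packing to come from.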
It is too early to tell how effective these changes are, but one thing is clear: the FinOps angle of observability enables us to spot problems that are otherwise hard to detect and to quickly understand their scale and financial impact. If we don't continually improve here and address these issues, the inefficiencies could cost the company hundreds of thousands of dollars.