Running a Successful RI Program With Metrics-Driven Optimization

Learn how a metrics-driven Reserved Instance program differs from a cadence-driven one, and how it can unlock even more savings.

Over the years, we’ve talked a lot about strategies for purchasing Reserved Instances (RIs) and the theory behind them. Hopefully that has helped you take a significant amount off your cloud bill. These savings can be achieved without necessarily impacting your engineering teams, and they give your finance team a powerful lever.

Now we want to take this to another level by tying all these learnings together and producing business-meaningful metrics to drive our RI program. These metrics can tell us how successful our RI program is and make sure actions are taken at the right time. This approach gives you significant benefits compared to current cadence-driven RI programs.

Let’s start by building these concepts up.

Vive le Histograms and Break Even Points

There have been numerous innovations to RIs over the years, but at their very heart there’s one thing that’s remained constant: Working out what RIs to purchase starts with knowing individual break even points for the instances you’re running, then building histograms of your historical usage to enable a smart decision.

An outcome of this analysis might be that based on my recent usage patterns, if I purchase five particular RIs through the partial upfront payment option, then I’ll save 50% over the course of the next 12 months and be cash flow neutral after seven months compared to running On-Demand.
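
As a rough illustration of the break-even idea, here is a minimal Python sketch. The prices are hypothetical, not real AWS rates, and the function names are our own. It assumes a partial upfront RI is modelled as one upfront fee plus a flat monthly charge, compared against a flat monthly On-Demand cost.

```python
def break_even_month(on_demand_monthly, ri_upfront, ri_monthly, term_months=12):
    """Return the first month in which cumulative RI spend is at or below
    cumulative On-Demand spend, or None if that never happens in the term."""
    for month in range(1, term_months + 1):
        ri_total = ri_upfront + ri_monthly * month
        od_total = on_demand_monthly * month
        if ri_total <= od_total:
            return month
    return None


def term_savings_rate(on_demand_monthly, ri_upfront, ri_monthly, term_months=12):
    """Fraction saved over the full term versus running On-Demand throughout."""
    od_total = on_demand_monthly * term_months
    ri_total = ri_upfront + ri_monthly * term_months
    return (od_total - ri_total) / od_total


# Hypothetical prices for a single instance (assumed values for illustration).
month = break_even_month(on_demand_monthly=100.0, ri_upfront=300.0, ri_monthly=40.0)
savings = term_savings_rate(on_demand_monthly=100.0, ri_upfront=300.0, ri_monthly=40.0)
print(f"break even in month {month}, saving {savings:.0%} over the term")
```

Running the same two functions per instance type gives the individual break-even points that the histogram analysis then builds on.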

RI Classifications and Embracing Regional Benefit/ISF

When we purchase RIs, we’re committing to pay for instances that have some very specific attributes. The shape of those attributes can be affected by Instance Size Flexibility (ISF) and the Regional Benefit where applicable, but regardless we can think of each combination of attributes as an RI Classification. If we ignore tenancy for the moment (the vast majority of instances are shared tenancy), there are three key attributes of a classification: instance type (effectively instance family when ISF applies), location (availability zone or region) and operating system. We can use the flexibility of the Regional Benefit and ISF to our advantage to reduce the overall number of classifications we have to consider, which in turn simplifies our RI planning and lowers risk. Here are some example RI Classifications for AWS:

RI instance type | RI location | RI OS | Notes
m5.2xlarge | ap-southeast-2a | rhel | Simple scenario. The RI needs to match the underlying AZ and instance type exactly.
r5.2xlarge | ap-southeast-2 | windows | Regional Benefit applies (the underlying instances can be in any AZ). We need to match the underlying instance type exactly, as ISF doesn’t apply to Windows.
c5.large | ap-southeast-2 | linux | ISF and Regional Benefit apply. We purchase the smallest size in the family to cover the wide variety of sizes we may be running.

Even though these all look a little bit different (for example, reserving into an AZ versus a region), we can think of each as a valid classification to purchase RIs into. In fact, these are what constitute the list of recommendations you can find within Cloudability’s RI planner. When we evaluate metrics to drive a successful RI program, we do so in the context of these classifications.
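
To make the idea of a classification concrete, here is a small Python sketch that maps a running instance to the classification it would fall under. It follows only the three example rows above: zonal reservations match instance type and AZ exactly, regional Linux reservations collapse to the instance family, and other regional reservations keep the exact instance type. The `Classification` type, the `classify` function and the `zonal` flag are hypothetical names for illustration, not part of any Cloudability or AWS API.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    instance_type: str  # exact instance type, or the family when ISF applies
    location: str       # availability zone or region
    os: str


def classify(instance_type: str, az: str, os: str, zonal: bool = False) -> Classification:
    """Map a running instance to the RI classification it would be covered by."""
    if zonal:
        # Zonal reservation: AZ and instance type must match exactly.
        return Classification(instance_type, az, os)
    # Assumption: an AZ name is the region name plus one trailing letter.
    region = az[:-1]
    if os == "linux":
        # ISF applies: any size in the family counts toward the same classification.
        family = instance_type.split(".")[0]
        return Classification(family, region, os)
    # Regional Benefit only: any AZ in the region, but the exact instance type.
    return Classification(instance_type, region, os)


print(classify("c5.4xlarge", "ap-southeast-2b", "linux"))
print(classify("r5.2xlarge", "ap-southeast-2c", "windows"))
print(classify("m5.2xlarge", "ap-southeast-2a", "rhel", zonal=True))
```

Grouping usage by this key is what shrinks many instance types and AZs down to a handful of classifications to plan against.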

Leveraging the 90% Waterline to Establish a Coverable Area per Classification

As we established with Atlassian, a smart waterline of 90% can help maximise savings while providing a buffer against potential usage fluctuations. It’s important to recognise that this analysis is always done in the context of individual RI classifications. What this does give us is a coverable area for each classification, and that will be a critical concept going forward.

[Figure: coverable area and current RI coverage for a single RI classification]

The above diagram is a great representation of how the coverable area fits into the bigger story: we’re now clear on what level of usage makes sense to cover (ignoring spiky areas) and how we’re tracking against that. We can see that for this specific classification, the cloud user has RIs covering about 50% of the coverable area, so we could say they have a true RI coverage of about 50%.
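
The waterline is not defined precisely here, so the following Python sketch rests on an assumption: the 90% waterline is taken to be the largest instance count that usage met or exceeded in at least 90% of the observed hours. The hourly counts are made-up sample data, and `waterline` and `true_ri_coverage` are illustrative names.

```python
def waterline(hourly_counts, threshold_pct=90):
    """Largest instance count met or exceeded in at least threshold_pct% of hours.

    Assumed definition for illustration; the coverable area is this count.
    """
    if not hourly_counts:
        raise ValueError("need at least one hour of usage")
    ordered = sorted(hourly_counts, reverse=True)
    # Ceiling of threshold_pct% of the hours, using integer arithmetic.
    k = (len(ordered) * threshold_pct + 99) // 100
    # The k-th largest value is met or exceeded in at least k hours.
    return ordered[max(k, 1) - 1]


def true_ri_coverage(ri_count, coverable):
    """Share of the coverable area that existing RIs already cover (capped at 100%)."""
    return min(ri_count, coverable) / coverable


# Made-up hourly instance counts for one classification, including a spike and a dip.
usage = [10, 12, 11, 10, 30, 10, 12, 11, 4, 10]
coverable = waterline(usage)
coverage = true_ri_coverage(ri_count=5, coverable=coverable)
print(f"coverable area: {coverable} instances, true RI coverage: {coverage:.0%}")
```

Under this assumption the spike to 30 instances does not raise the coverable area, which is the buffer against usage fluctuations that the waterline is meant to provide.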

Don’t Boil the RI Ocean, Just Focus Your Classifications

Now that we understand RI purchases in the context of classifications and coverable areas, we have to decide which ones to prioritise and which aren’t going to help us achieve our goals. At Cloudability we see a lot of RI profiles, and a regular theme is having a very long tail of small classifications that aren’t worth considering. Here are some good reasons to ignore certain classifications:

  • They’re an old instance family
  • They only address a small number of underlying instances
  • Usage for those instances is trending downwards
  • The footprint and resultant savings are tiny

One good approach to start with is using an official whitelist to focus on your top four or five classifications (if ISF is in play, there’s a good chance this will cover the vast majority of your potential savings). This gives us a manageable surface area to assess our overall RI health. With the rich set of customer data available to Cloudability, we’re focused on identifying patterns and training ML models so that we can provide customers a pre-prioritised list of RI classifications that have characteristics likely to deliver ongoing savings.

The Ultimate RI Metric: Weighted RI Coverage Rate

So far we’ve described a metric that is relevant per RI classification — true RI coverage. We’ve also discussed ignoring classifications altogether that shouldn’t be covered. Our job now is to find a way to get an aggregate coverage score that tells us how well we’re doing overall and possibly informs when actions are required. It’s probably not a surprise that when calculating an aggregate score, some RI classifications are going to be considered more significant than others. For example, achieving a high true RI coverage for a large set of 10xlarge instances is more important than doing so for a handful of nano-sized instances. The metric we’re going to use to surface this weighted aggregate score is called the Weighted RI Coverage Rate. We use the On-Demand cost of each coverable area to drive the weighting. Here’s a very simple scenario where a customer has two classifications to cover with RIs.

Instance Type | Coverable Area | Current RIs | True RI Coverage | Cost (On-Demand) | Weighting
m5.large | 100 instances | 20 | 20% | $100k | 0.67
c5.xlarge | 25 instances | 20 | 80% | $50k | 0.33

Note: The On-Demand cost above represents the cost if the entire coverable area were run On-Demand.

Looking at the table, we can see this organisation has a Weighted RI Coverage Rate of about 40% (20%*0.67 + 80%*0.33). The beautiful thing here is that we now have an objective and universal way to measure RI health, one that can drive your RI actions even if your organisation has some special considerations. In this simple example, the organisation might decide to target a Weighted RI Coverage Rate of 60%. If so, it could look at exactly which levers it can pull to achieve that. Purchasing 30 m5.large RIs would achieve just that: true RI coverage for that classification rises to 50%, and the weighted rate becomes about 60% (50%*0.67 + 80%*0.33), so it’s probably a good move! We should look to automate this loop as much as we can.
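
The arithmetic above can be reproduced with a short Python sketch. It uses the two classifications from the table and derives the weights from the On-Demand cost of each coverable area. The dictionary keys and the `weighted_coverage` function are illustrative names, not a Cloudability API. Because it uses exact weights (2/3 and 1/3) rather than the rounded 0.67 and 0.33, the results come out at exactly 40% and 60%.

```python
# The two classifications from the table above.
classifications = [
    {"name": "m5.large", "coverable": 100, "ris": 20, "on_demand_cost": 100_000},
    {"name": "c5.xlarge", "coverable": 25, "ris": 20, "on_demand_cost": 50_000},
]


def weighted_coverage(rows):
    """Weighted RI Coverage Rate: true RI coverage per classification,
    weighted by each classification's share of total On-Demand cost."""
    total_cost = sum(r["on_demand_cost"] for r in rows)
    rate = 0.0
    for r in rows:
        true_coverage = min(r["ris"], r["coverable"]) / r["coverable"]
        weight = r["on_demand_cost"] / total_cost
        rate += true_coverage * weight
    return rate


current = weighted_coverage(classifications)

# Model the purchase of 30 more m5.large RIs without changing the original data.
after_purchase = [dict(r) for r in classifications]
after_purchase[0]["ris"] += 30
projected = weighted_coverage(after_purchase)

print(f"current: {current:.0%}, after purchase: {projected:.0%}")
```

Recomputing the rate for a candidate purchase like this is one simple way to check whether it would reach a chosen target before committing to it.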

Summary

To run a successful RI program, we recommend moving to a metrics-driven approach rather than a cadence-driven one. Doing so will provide meaningful targets that help you understand your RI health over your cloud journey and drive actions that result in real cost savings. As your organisation matures its RI program, and possibly takes advantage of features like convertible RIs, it can look to raise its target rate and further increase its RI saving potential.
