Spot instances decoded: Understanding, implementing, and saving

Despite potentially reducing your compute costs by anywhere from 20% to 80% and indirectly making your applications more resilient, Spot instances (or Spot VMs on Google Cloud) aren’t used nearly as much as they could be.

While the reasons range from their less-than-predictable nature and a fear of workload interruptions to the complexity of setting up and managing them, the most common reason many teams avoid Spot instances is simply a lack of familiarity with them.

However, understanding how to navigate these perceived hurdles will help you realize previously untapped compute savings.

In this two-part series, we’re going to cover everything you need to know to gain the fullest benefit from Spot instances.

In part one, we’ll cover what Spot instances are, why you should use them, and when you should use them. In part two, we’ll cover how to optimize Spot instance utilization with Auto Scaling groups and how to maximize your savings using Spot Scaling.

What are Spot instances?

Spot instances are sort of like cheap, same-day flights that become available due to last-minute cancellations or unsold inventory. Airlines often reduce prices significantly to fill these empty seats quickly, before the flight takes off.

In the case of compute instances, cloud providers offer their unused computing capacity at much lower prices than on-demand instances (up to 90% off) as a way to make use of that excess capacity.

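You can set a maximum price for the Spot instance (the most you’re willing to pay per hour, historically called a bid price), and if the Spot price (the current price in the Spot market) is below your maximum price and capacity is available, the instance runs. On AWS, setting a maximum price is now optional; if you don’t set one, it defaults to the on-demand price.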
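As a rough illustration of the pricing rule above, the sketch below checks whether an instance would run at a given Spot price and how much it would save relative to on-demand. The function names and all prices are hypothetical, chosen only to make the arithmetic concrete, and treating a Spot price equal to the maximum as "runs" is an assumption.

```python
def instance_runs(spot_price: float, max_price: float) -> bool:
    """Return True if the current Spot price does not exceed the maximum (bid) price."""
    return spot_price <= max_price


def savings_percent(spot_price: float, on_demand_price: float) -> float:
    """Percentage saved by paying the Spot price instead of the on-demand price."""
    if on_demand_price <= 0:
        raise ValueError("on_demand_price must be positive")
    return (1 - spot_price / on_demand_price) * 100


# Hypothetical hourly prices in USD, for illustration only.
on_demand = 0.20
spot = 0.05
max_price = 0.10

print(instance_runs(spot, max_price))               # True
print(round(savings_percent(spot, on_demand), 1))   # 75.0
```

Capacity also has to be available in the pool for the instance to run, which this sketch does not model.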

However, Spot instances can be reclaimed by the cloud provider with only a two-minute notice if the demand for regular-priced, on-demand instances increases, potentially interrupting your application.
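On AWS, the two-minute notice is exposed through the instance metadata service at the `spot/instance-action` path, which returns HTTP 404 until an interruption is scheduled. The sketch below only covers interpreting that response; the HTTP request itself (and the session token that IMDSv2 requires) is left out so the example runs anywhere. The function names and the sample payload are illustrative assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# EC2 instance metadata path for Spot interruption notices (only reachable from the instance).
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(status: int, body: str) -> Optional[dict]:
    """Interpret a metadata response; a 404 means no interruption is scheduled."""
    if status == 404:
        return None
    if status != 200:
        raise RuntimeError(f"unexpected metadata status: {status}")
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return {"action": notice["action"], "time": when.replace(tzinfo=timezone.utc)}


def seconds_remaining(notice: dict, now: datetime) -> float:
    """Seconds left between `now` and the scheduled interruption time."""
    return (notice["time"] - now).total_seconds()


# Sample payload in the shape the metadata service returns (values are made up).
sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
notice = parse_instance_action(200, sample)
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(notice["action"], seconds_remaining(notice, now))  # terminate 120.0
```

In practice a small loop would poll this path every few seconds and start draining work as soon as a notice appears.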

Spot instances and flight tickets both present a chance to acquire something (computing power for the former, and flight tickets for the latter) at a reduced cost. However, there's a level of uncertainty and risk involved — Spot instances might be reclaimed if the market price rises above your bid, and last-minute flight tickets might vanish if someone else purchases them before you.

The two most common situations in which interruptions occur are when:

  1. There's a surge in demand for on-demand or reserved instances
  2. Spot prices rise above the bid (less likely now)

Why you should use Spot instances (hint: it’s not just because of cost savings)

While the potential EC2 savings (as high as 80%!) are often touted as the major benefit of using Spot instances, they’re not the only benefit.

Using Spot instances doesn't inherently make applications more resilient, but it often requires applications to already possess certain levels of resilience to effectively accommodate the potential interruptions associated with Spot instances. 

For example, because applications running on Spot instances should be architected to handle interruptions gracefully, they typically already include checkpoints, auto-saving mechanisms, or workloads distributed across multiple instances.

That way, your infrastructure:

  1. Better handles fluctuations, 
  2. Maintains performance during peak times, and 
  3. Mitigates risks associated with potential interruptions or failures.

During peak loads, Spot instances can be integrated into your system to handle increased demand, ensuring that your system can accommodate fluctuations in traffic or workload without performance degradation.

With cheaper instances, you can allocate more resources toward redundancy and failover mechanisms, and distribute workloads across more instances. 

And if one instance experiences an interruption, other instances can continue processing parts of the workload, minimizing the impact of any single failure. This ensures that your workloads seamlessly shift to another instance without significant cost implications.

EC2 Instance Pools Explained

To best leverage AWS Spot instances, it's important to conceptually understand EC2 instance pools. An EC2 instance pool refers to the total capacity of an instance type (e.g. m5.xlarge) in a given availability zone.

When there’s unused capacity in an instance pool, that spare capacity is referred to as a Spot Capacity pool.

Each combination of instance family, instance size, availability zone, and region has its own EC2 instance pool, and therefore its own Spot capacity pool.

As such, you shouldn’t “put all your eggs in one basket.” The more pools you tap into, the more diversified your potential instance selection will be — which minimizes the chances that Spot instances aren’t available for your application to use.
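To make the idea of tapping into several pools concrete, here is a minimal sketch that enumerates pools as (instance type, availability zone) pairs. The helper name is hypothetical and the specific types and zones are only examples; which ones suit your workload is an assumption you would need to check.

```python
from itertools import product


def spot_pools(instance_types, availability_zones):
    """Treat each (instance type, availability zone) pair as one Spot capacity pool."""
    return list(product(instance_types, availability_zones))


# Example inputs only; choose types with similar CPU and memory for your workload.
types = ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
zones = ["us-east-1a", "us-east-1b"]

pools = spot_pools(types, zones)
print(len(pools))  # 6
for instance_type, zone in pools:
    print(instance_type, zone)
```

Allowing three instance types across two zones gives the application six pools to draw from instead of one.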

When you should use Spot instances

In general, Spot instances are best suited for workloads that:

  • Are flexible,
  • Don’t have specific time requirements,
  • Are distributable / can be split into concurrently-running tasks, and 
  • Can tolerate interruptions

We’ll cover the specific use cases where using Spot instances makes sense, but first, here are three questions to help you figure out whether your workloads are suitable for Spot instances:

  1. Are my workloads fault-tolerant?

    Since Spot instances can be interrupted, workloads must be designed to handle interruptions without causing a critical failure or data loss. Fault-tolerant workloads can continue running or can quickly recover when instances are interrupted or terminated.

  2. Can the workload be stopped in under two minutes?

    Workloads must be stoppable within a short notice period to prevent data loss or disruption. If your workload can be stopped in less than two minutes, it becomes easier to respond to Spot instance interruptions. For this reason, stateless applications are well-suited for Spot instances, since they don’t store session data. This makes it easy for them to migrate between instances without losing functionality or data, making them resilient to interruptions.

  3. Can I be flexible about instance types and availability zones?

    Distributing your workloads across multiple instance types and availability zones reduces their vulnerability to interruptions by spreading the risk. Remember, capacity is a property of a Spot instance pool. Each different instance type in each different availability zone is a separate pool. When you’re able to tap into more than one pool, the risk of interruptions in all of the pools at the same time is lower than the risk of an interruption in a single pool. Spreading across multiple availability zones decreases dependency on a single pool, ensuring continuity even if one zone experiences capacity constraints or price spikes.
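The claim in the third question, that losing every pool at once is less likely than losing a single pool, can be illustrated with a little arithmetic. The sketch below assumes interruptions in different pools are independent, which is a simplification (real pools can be correlated), and the 10% figure is invented purely for the example.

```python
import math


def prob_all_pools_interrupted(per_pool_probs):
    """Probability that every pool is interrupted at once, ASSUMING independence."""
    for p in per_pool_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
    return math.prod(per_pool_probs)


# Hypothetical: each pool has a 10% chance of interruption in some time window.
one_pool = prob_all_pools_interrupted([0.1])
three_pools = prob_all_pools_interrupted([0.1, 0.1, 0.1])

print(f"{one_pool:.3f}")     # 0.100
print(f"{three_pools:.3f}")  # 0.001
```

Even if the independence assumption only roughly holds, each additional pool shrinks the chance of being left with no capacity at all.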

 

More specifically, you should consider using Spot instances in the following situations.

Testing Environments and CI/CD

Testing/Dev environments and CI/CD tasks usually don’t need continuous uptime because they’re used intermittently to work on specific features or test changes. Additionally, development and testing tasks can be restarted, or paused and resumed (if planned ahead), without critical data loss, making them more tolerant of interruptions.

These workloads are often flexible in terms of resource requirements and can adapt to different instance types or availability zones without compromising the work being performed.

Batch processing tasks

Batch processing and ETL jobs oftentimes aren’t time-critical, allowing for flexibility that makes Spot instances a great fit.

These tasks can also be broken down into smaller, independent units that can be distributed across multiple instances without significant impact if an instance is interrupted. 
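A small simulation can show how independent units survive the loss of an instance. This is only a sketch: the worker names, the `fails_on` parameter (a stand-in for a Spot reclaim), and the use of `sum` as the "work" are all hypothetical.

```python
from collections import deque


def chunk(items, size):
    """Split a list into independent units of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_with_requeue(units, workers, fails_on):
    """Assign units round-robin; if a worker is interrupted, requeue its unit.

    `fails_on` maps a worker name to the number of units it completes
    before being interrupted.
    """
    queue = deque(units)
    active = list(workers)
    completed = {w: 0 for w in workers}
    results = []
    turn = 0
    while queue:
        if not active:
            raise RuntimeError("no workers left to finish the job")
        worker = active[turn % len(active)]
        unit = queue.popleft()
        if worker in fails_on and completed[worker] >= fails_on[worker]:
            # Worker was reclaimed mid-unit: put the unit back and drop the worker.
            queue.appendleft(unit)
            active.remove(worker)
            continue
        results.append(sum(unit))  # placeholder for real work
        completed[worker] += 1
        turn += 1
    return results


units = chunk(list(range(1, 11)), 2)
# Worker "b" is interrupted after finishing one unit; "a" picks up the rest.
print(process_with_requeue(units, ["a", "b"], {"b": 1}))  # [3, 7, 11, 15, 19]
```

Because each unit is independent, the interrupted unit is simply retried elsewhere and the job still completes.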

This way, the interruption of one instance doesn't hinder the completion of the entire job, as the workload can be distributed among other available instances. And if no instances are available, jobs can be structured to save intermediate states and resume from the last checkpoint after an interruption.
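Saving intermediate state and resuming from it might look something like the following minimal sketch. The file layout, function names, and the `stop_after` parameter (which simulates an interruption) are assumptions for illustration; a real job would keep its checkpoint on durable storage outside the instance rather than in a local temp directory.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint via a temp file so an interruption cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_job(items, path, stop_after=None):
    """Process items starting at the last checkpoint; `stop_after` simulates an interruption."""
    start = load_checkpoint(path)
    processed = []
    for i in range(start, len(items)):
        if stop_after is not None and len(processed) >= stop_after:
            break
        processed.append(items[i] * 2)  # placeholder for real work
        save_checkpoint(path, i + 1)
    return processed


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.json")
    first = run_job([1, 2, 3, 4, 5], ckpt, stop_after=2)  # interrupted after two items
    second = run_job([1, 2, 3, 4, 5], ckpt)               # resumes at the third item
    print(first, second)  # [2, 4] [6, 8, 10]
```

The second run skips the two items already completed, so an interruption costs at most the work done since the last checkpoint.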


High-performance computing (HPC) and big data processing

High-performance computing tasks involve handling and analyzing vast amounts of data. Spot instances make sense for these types of workloads because these tasks can be distributed across various instances and allow for easy scaling up and down. 

Typically these tasks are costly, since processing large datasets requires substantial compute resources. With Spot instances, the cost of each instance is much lower, and across thousands of instances the savings add up.

Web servers

Web servers are great candidates for Spot instances because they are usually stateless. They don’t typically store data locally or rely on information from previous sessions, and therefore they can be interrupted without significant impact.

In many cases with web servers, each request is processed independently without relying on stored session information.

Containerized workloads / Kubernetes

Containerized applications are often designed to be stateless, making them a good candidate for Spot instances.

Since containers don't usually store session-specific data, new containers can be spun up or shut down without affecting the overall system. 

Also, since containers divide applications into smaller, independent units, containerized workloads can adapt easily to different instance types or availability zones. This flexibility aligns perfectly with the variable nature of Spot instances.

Conclusion

We've covered everything you need to know about Spot instances — from their concept to leveraging their advantages effectively, and which use cases allow you to maximize their advantages.

In part two of our series, we’ll cover Auto Scaling groups (ASGs), which help you handle Spot interruptions and optimize their utilization, and Spot Scaling, which simplifies the process of configuring and managing ASGs so you can maximize Spot savings and application availability.
