
Optimizing ML Costs with Azure Machine Learning


Scaling Machine Learning (ML) initiatives can get expensive. This post outlines common financial challenges in ML and provides actionable strategies using Azure Machine Learning (AML) to optimize your spend. The key takeaway is that point fixes aren’t enough: A systematic approach is the way to manage costs effectively.

MLOps is Expensive

The High Cost of ML: Understanding the Strain on Your Budget

Several inherent characteristics of ML development and deployment make it costly.

  • Modern ML models, especially in deep learning, require massive datasets. Storing, moving, and processing this data contribute significantly to costs.
  • The complex algorithms themselves, such as deep neural networks with numerous layers or computationally demanding reinforcement learning techniques, require substantial processing power.
  • The reliance on specialized and scarce hardware like GPUs for both training and inference adds a premium to compute costs.
  • ML development is iterative. Retraining multiple times with varied hyperparameters, different data splits, or updated data means that each experimental run incurs additional compute expense. For instance, a single hyperparameter tuning sweep might launch hundreds of individual training jobs.
  • ML development is a complex multistep process: data ingestion, cleansing, transformation, training, hyperparameter tuning, prediction, and more. This Machine Learning Operations (MLOps) process raises the risk of unnecessary repetitions and operations that add cost.
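To get a feel for how iteration multiplies cost, here is a back-of-the-envelope sketch. The prices, durations, and trial counts are illustrative assumptions, not real Azure rates:

```python
# Rough cost model for a hyperparameter sweep: every trial is a full
# training job, so total cost scales linearly with the number of trials.
# All prices below are illustrative placeholders, not real Azure rates.

def sweep_cost(trials: int, hours_per_trial: float, vm_price_per_hour: float) -> float:
    """Total compute cost of a sweep that runs each trial on its own VM."""
    return trials * hours_per_trial * vm_price_per_hour

single_run = sweep_cost(trials=1, hours_per_trial=4, vm_price_per_hour=3.00)
full_sweep = sweep_cost(trials=200, hours_per_trial=4, vm_price_per_hour=3.00)

print(f"one training run: ${single_run:,.2f}")   # $12.00
print(f"200-trial sweep:  ${full_sweep:,.2f}")   # $2,400.00
```

A $12 experiment becomes a $2,400 line item once a sweep automates it, which is why the orchestration choices discussed below matter.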

Customers have asked me, “What does ML training cost?” I tell them that these factors can push costs arbitrarily high: single LLM training runs have cost over 150 million dollars, and entire training programs have cost billions. You won’t be paying that much, but you do need to understand that there is no ceiling.

AML for Cost Optimization

Although the points here are universal across MLOps systems, AML itself helps you save money. Microsoft’s comprehensive cloud-based ML platform is designed to streamline the entire lifecycle of machine learning models, from building and training to deployment and ongoing management. Each AML service can be used alone through REST interfaces, but they also integrate deeply with each other and the broader Azure ecosystem. By building on time-tested, efficient designs, AML services let you run your ML less expensively than a do-it-yourself approach, even when you are paying for the services themselves.

I’ll leave pre-built models out of this discussion, such as Large Language Models (LLMs) and Azure Cognitive Services for Vision or Translation. These offer fewer granular “knobs” for tuning, necessitating a different approach to cost optimization.

Infrastructure Cost Drivers

To effectively manage costs, it’s crucial to understand where your budget is being allocated. The primary drivers, roughly in descending order, include:

  1. Compute: This is usually the largest expense and encompasses the processors (CPUs, GPUs) and memory consumed during model training and for serving predictions.
  2. Storage: Azure Blob Storage is heavily used for datasets, model artifacts, and container images in Azure Container Registry. The chosen storage tier, redundancy options, and the sheer volume of data influence costs.
  3. Networking: Though core training and prediction processes should not generate extreme networking costs, charges can accumulate from data egress, VNet peering, ExpressRoute connections, and NAT Gateway usage. For example, transferring terabytes of image data from on-premises storage to Azure Blob Storage for training, or frequent data exchanges between microservices in an MLOps workflow, can lead to networking expenses.
  4. Services: This includes fees for Azure SaaS APIs, such as Azure AI Search, Document Intelligence, or Bot Service.
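To make the networking driver concrete, here is a simple egress estimate. The per-GB rate is a placeholder assumption; real Azure egress pricing is tiered and region-dependent, so check current rates:

```python
# Back-of-the-envelope network egress estimate. The rate below is a
# placeholder; real Azure egress pricing is tiered and region-dependent.

def egress_cost(gb_transferred: float, price_per_gb: float = 0.08) -> float:
    """Flat-rate estimate of data-transfer cost."""
    return gb_transferred * price_per_gb

# Moving 5 TB of training images across a billed boundary once a month:
monthly = egress_cost(5 * 1024)
print(f"monthly egress: ${monthly:,.2f}")  # $409.60
```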

Guiding Principles for ML Cost Optimization

Adopting a FinOps mindset means embracing several core principles.

First and foremost is to avoid waste. Architectural choices such as which service to use are important, but the significant avoidable costs arise from misuse: For example, using GPUs where CPUs can do the training, or storing masses of unused data in expensive Blob Storage tiers.

Secondly, standardize your architecture. This means using AML services, such as Compute Targets in AML Workspaces for training, rather than managing your own fleets of Azure Virtual Machines. The Azure team has built efficient systems that save you money: for example, with AML training you pay for just the compute needed for training, rather than for VMs that bill continually (unless you manage the autoscaling yourself). It also means adopting standard workflows, such as a Continuous Training (CT) pattern in which new code or data automatically triggers an AML Pipeline run. This way, data ingestion, training, verification, and deployment occur exactly when needed, without excess runs or delays that make processes inefficient.

Thirdly, do not overoptimize, falling into the “Illusion of Efficiency.” For instance, aggressively compressing training data to save on storage might paradoxically increase overall costs due to significantly higher CPU time spent on decompression during every training epoch.
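A quick worked example shows how the compression “saving” can backfire. The storage and CPU rates here are illustrative assumptions, not real Azure prices:

```python
# Storage-vs-CPU trade-off sketch for compressing training data.
# All rates are illustrative assumptions, not real Azure prices.

def monthly_cost(storage_gb: float, storage_rate: float,
                 extra_cpu_hours: float, cpu_rate: float) -> float:
    """Storage bill plus the CPU bill for any decompression overhead."""
    return storage_gb * storage_rate + extra_cpu_hours * cpu_rate

# Uncompressed: 1000 GB stored, no decompression overhead.
uncompressed = monthly_cost(1000, 0.02, 0, 0.10)
# Compressed to 400 GB, but decompression adds 2 CPU-hours per epoch
# across, say, 100 training epochs this month.
compressed = monthly_cost(400, 0.02, 2 * 100, 0.10)

print(f"uncompressed: ${uncompressed:.2f}")  # $20.00
print(f"compressed:   ${compressed:.2f}")    # $28.00 -- the "saving" costs more
```

The point is not these particular numbers but the habit of pricing both sides of a trade-off before committing to it.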

Don’t forget that your engineers’ time is costly; avoid spending excessive effort on micro-optimizations. Use their time well by prioritizing clean, maintainable architectures over specific optimizations: you can’t predict future needs now, but when the time comes to cut costs, you want an architecture that makes that work feasible.

Lastly, remember to iterate your optimizations, starting with the biggest cost drivers. You can’t implement every optimization in one cycle, so it is best to address the low-hanging fruit first, then check again where the big costs are.

Getting the most out of Cost Analytics Tools

Don’t just target the cost overrun that first catches your eye: the time you spend fixing it might be better spent elsewhere.

DoiT Cloud Intelligence (console.doit.com) equips Azure users with powerful tools for comprehending and managing cloud expenditure. You can create billing reports and dashboards, set budgets and alerts, get warnings about cost anomalies, and receive proactive recommendations for cost savings. Consistent use of these tools is key to identifying trends, highlighting outliers, and homing in on the biggest opportunities for cost savings.

Optimizing Training Costs

The training phase is frequently the most resource-intensive and, consequently, the most expensive part of the ML lifecycle. It consumes substantial data and compute power, and requires numerous iterative cycles.

Right-sizing machines: By monitoring resource utilization (CPU, GPU, memory) with Azure Monitor during training, you can make informed decisions. If a high-end GPU (like an ND H100 v5-series) is consistently underutilized, switching to a more cost-effective option (perhaps an NCasT4_v3-series VM) makes sense.
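The decision rule behind right-sizing can be as simple as the sketch below. The utilization threshold and SKU names are illustrative assumptions; in practice you would pull the utilization samples from Azure Monitor:

```python
# Toy right-sizing check: if average GPU utilization over recent training
# runs stays low, flag the SKU for downsizing. The 40% threshold and the
# SKU names are illustrative; pull real utilization from Azure Monitor.

def suggest_downsize(gpu_util_samples: list[float], threshold: float = 0.40) -> bool:
    """True when average utilization is below the threshold."""
    avg = sum(gpu_util_samples) / len(gpu_util_samples)
    return avg < threshold

samples = [0.22, 0.31, 0.28, 0.35, 0.30]   # fraction of GPU in use per run
if suggest_downsize(samples):
    print("GPU underutilized -- consider a smaller SKU "
          "(e.g. NCasT4_v3 instead of ND H100 v5)")
```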

Use GPUs only when necessary. If your model isn’t GPU-accelerated, opting for CPU-optimized VMs (such as F-series) is more economical. When using GPUs, ensure that your code is fully optimized to exploit their capabilities, for example, by using appropriate batch sizes and efficient data-loading pipelines.

Azure Spot Virtual Machines (Low-Priority VMs) give excellent savings. For fault-tolerant training jobs (and your systems should be fault-tolerant!), Spot VMs can yield savings of up to 90% compared to pay-as-you-go prices. They are well suited, for example, to hyperparameter tuning tasks involving many independent trials, where the preemption of a single trial doesn’t jeopardize the entire process.
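Even after accounting for re-running preempted work, the Spot discount usually wins by a wide margin. The discount and retry overhead below are illustrative assumptions:

```python
# Spot-VM savings sketch: even with retries after preemption, a steep
# discount usually wins. The discount and retry overhead are illustrative.

def expected_cost(hours: float, rate: float, retry_overhead: float = 0.0) -> float:
    """Compute cost; retry_overhead is the fraction of work repeated
    after preemptions (e.g. 0.2 means 20% of the hours are re-run)."""
    return hours * (1 + retry_overhead) * rate

on_demand = expected_cost(100, 3.00)                             # no preemption
spot      = expected_cost(100, 3.00 * 0.10, retry_overhead=0.2)  # 90% discount
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")  # $300.00 vs $36.00
```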

Development Environments

For the development phase, Jupyter Notebooks or Visual Studio Code in an AML workspace offer managed, cloud-based workstations. With auto-shutdown policies, you pay only for the time they are actively running, unlike a powerful laptop that you amortize 24x7 or a VM that runs until you remember to shut it down. To save more, offload heavy work: powerful resources in your dev environment mean you are paying for a fixed set of resources for the whole workday. For example, rather than running training in your notebook, submit long-running training as AML jobs that run on autoscaled, cost-efficient Compute Clusters.
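The difference between an always-on VM and an auto-shutdown workstation adds up quickly over a month. The hourly rate here is a placeholder; plug in your actual SKU price:

```python
# Always-on VM vs. auto-shutdown compute instance, over a 30-day month.
# The hourly rate is a placeholder assumption, not a real Azure price.

RATE = 1.20  # $/hour, illustrative

always_on = 24 * 30 * RATE   # a VM nobody remembered to shut down
auto_stop = 8 * 22 * RATE    # 8 hours/day across 22 working days

print(f"always-on: ${always_on:,.2f}")   # $864.00
print(f"auto-stop: ${auto_stop:,.2f}")   # $211.20
```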

Data Storage

For MLOps on Azure, Azure Blob Storage is the standard for object storage. I’ve seen projects that start with a simple local disk and move to Managed Disks, or that start on a local network and move to Azure Files, but these are costly: Blob Storage is the standard for ML and far less expensive. Selecting the right access tiers (Hot, Cool, Archive) according to access frequency is essential, and lifecycle management policies can automate the transitions. For example, if you train only on new data, automatically archive or delete the old data after a month.
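The tiering rule a lifecycle policy automates is essentially the function below. The day cutoffs are illustrative assumptions; in production you would express this as a Blob Storage lifecycle-management policy (JSON), not application code:

```python
# Sketch of the tiering rule that a Blob Storage lifecycle-management
# policy automates: tier a blob by days since last access. The cutoffs
# are illustrative assumptions, not recommended defaults.

def pick_tier(days_since_access: int) -> str:
    """Map staleness to a Blob Storage access tier."""
    if days_since_access <= 30:
        return "Hot"
    if days_since_access <= 180:
        return "Cool"
    return "Archive"

print(pick_tier(7))     # Hot
print(pick_tier(90))    # Cool
print(pick_tier(365))   # Archive
```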

Prediction

After training your models, you deploy them to serve predictions (inference). AML Endpoints save money by autoscaling based on built-in metrics. As with training, choosing the smallest effective instance also saves money. Model co-hosting, or multi-model deployment, allows multiple smaller models to share the same endpoint deployment, reducing per-model overhead if the models are often called sequentially or by the same application. However, if an endpoint goes unused, autoscaling won’t take it down to zero resources, so shut it down yourself. If you have a very low-traffic inference app, take care to scale the endpoints to zero, or else deploy the model to Azure Container Apps or Azure Functions.

For non-real-time use cases, Batch Endpoints are significantly cheaper than online prediction and offer better throughput, though also higher latency. Optimizing the batch size and the configuration of the underlying compute cluster gives you the best cost savings.
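A cost-per-prediction comparison makes the batch advantage concrete. The throughput figures and VM rate are illustrative assumptions; batch trades latency for much higher throughput per VM:

```python
# Cost-per-prediction comparison, online vs. batch serving.
# Throughput figures and the VM rate are illustrative assumptions.

def cost_per_1k(preds_per_hour: float, vm_rate: float) -> float:
    """Dollar cost of serving 1,000 predictions on one VM."""
    return vm_rate / preds_per_hour * 1000

online = cost_per_1k(preds_per_hour=10_000, vm_rate=1.00)   # request-at-a-time
batch  = cost_per_1k(preds_per_hour=100_000, vm_rate=1.00)  # large batches
print(f"online: ${online:.4f} per 1k, batch: ${batch:.4f} per 1k")
```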

Monitoring: Keeping an Eye on Costs and Performance

AML services have built-in monitoring, another advantage over do-it-yourself. Monitoring drives down costs by ensuring you get the most out of your resources in producing high-quality models.

There are two types of monitoring. Infrastructure Monitoring, primarily through Azure Monitor, tracks resource utilization (CPU, GPU, memory) as well as training job durations, prediction latency, and QPS.

In contrast, Model Monitoring tracks model-specific metrics, like F1-score. After deployment, this monitoring helps you detect data drift, feature skew, and prediction bias, so that you can decide when retraining is worth the money. For example, a fraud detection model might drift and need retraining precisely when transaction amounts gradually change or the type of fraud evolves, but not otherwise.
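A minimal version of this retrain-only-on-drift decision might look like the sketch below. The feature, baseline, and threshold are illustrative assumptions; AML’s model monitoring offers richer drift signals out of the box:

```python
# Minimal drift check: compare a live feature's mean against its training
# baseline and flag retraining only when the relative shift is large.
# Feature, baseline, and threshold are illustrative assumptions.

def needs_retraining(baseline_mean: float, live_values: list[float],
                     rel_threshold: float = 0.25) -> bool:
    """True when the live mean drifts more than rel_threshold from baseline."""
    live_mean = sum(live_values) / len(live_values)
    drift = abs(live_mean - baseline_mean) / abs(baseline_mean)
    return drift > rel_threshold

# Transaction amounts crept up from a $50 baseline:
print(needs_retraining(50.0, [68.0, 72.0, 65.0]))   # True  -- retrain
print(needs_retraining(50.0, [51.0, 49.5, 50.2]))   # False -- save the money
```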

Tying it All Together: AML Pipelines

AML Pipelines reduce ML costs by tying steps together efficiently, and they control engineering costs by automating repetitive tasks. Their robust orchestration for defining and managing complex ML workflows prevents unnecessary execution steps and data pileups. Capabilities include parallelization (fan-out/fan-in processing, useful for hyperparameter tuning), conditional execution (running steps only if certain conditions are met, like deploying a model only if its accuracy surpasses a set threshold), and caching or component reuse (if a pipeline step’s inputs and code are unchanged, its cached output is reused, saving compute).
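The caching idea behind component reuse can be sketched in a few lines: hash the step’s code and inputs, and only pay for compute on a cache miss. This is purely illustrative of the mechanism, not how AML implements it internally:

```python
# Sketch of the caching idea behind AML Pipelines' component reuse:
# if a step's code and inputs hash to a value seen before, skip the
# compute and return the cached output. Purely illustrative.

import hashlib

_cache: dict[str, str] = {}

def run_step(code: str, inputs: str, compute) -> str:
    """Run `compute` on `inputs` only when (code, inputs) is new."""
    key = hashlib.sha256((code + "|" + inputs).encode()).hexdigest()
    if key not in _cache:                 # only pay for compute on a miss
        _cache[key] = compute(inputs)
    return _cache[key]

calls = []
def expensive_train(data: str) -> str:
    calls.append(data)                    # record each real execution
    return f"model({data})"

run_step("train.py@v1", "dataset-2024", expensive_train)
run_step("train.py@v1", "dataset-2024", expensive_train)  # cache hit
print(f"compute ran {len(calls)} time(s)")  # compute ran 1 time(s)
```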

Take Action Today

Optimizing ML costs is an ongoing endeavor that blends intelligent technology choices with a robust FinOps process. By harnessing the full capabilities of AML, adhering to sound architectural principles, and maintaining continuous vigilance over your expenditure, you ensure your ML initiatives deliver maximum business value without straining your budget. Begin by identifying your most significant cost drivers and commit to implementing one or two of these discussed strategies in the coming quarter. Your bottom line will thank you.

As a cloud architect at DoiT, I help customers with cost optimization, security, robustness and more. Schedule a demo and a call with our dedicated team today to discover how DoiT Cloud Intelligence — architects and software alike — can elevate your experience and drive results!
