Azure Cost Management & Optimisations


Cost Optimisations:

Optimised SKUs for resources

Most Azure resources come in a range of SKUs - as a general rule the more features and 'power' a SKU offers, whether that be compute units for a VM or throughput for a VPN Gateway, the more it costs

Unless you have a serverless workload or a specific set of requirements, you'll likely be able to use a lower SKU or one more relevant to your workload

Examples include lowering the SKU based on metrics (relevant for VMs and App Service Plans) or switching to a different type of VM SKU.
If your workload has high memory usage paired with moderate CPU usage you would be better off using a memory-optimised (E-family) VM instead of scaling up a general purpose VM
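The SKU-selection logic above can be sketched as a simple rule. The utilisation thresholds and family labels are illustrative assumptions, not official Azure sizing guidance:

```python
# Sketch: pick a VM family from observed utilisation ratios.
# Thresholds and family names are illustrative assumptions.

def suggest_vm_family(avg_cpu_pct: float, avg_mem_pct: float) -> str:
    """Suggest a VM family based on average CPU and memory utilisation."""
    if avg_mem_pct > 70 and avg_cpu_pct < 50:
        return "E-family (memory-optimised)"   # high memory, moderate CPU
    if avg_cpu_pct > 70 and avg_mem_pct < 50:
        return "F-family (compute-optimised)"  # high CPU, moderate memory
    return "D-family (general purpose)"

print(suggest_vm_family(avg_cpu_pct=35, avg_mem_pct=85))
# -> E-family (memory-optimised)
```

In practice the inputs would come from Azure Monitor metrics averaged over a representative window, not a single sample.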

Regular auditing of resources

This involves auditing existing resources with 2 different strategies, the first being to regularly ask teams/resource owners whether they still need a resource or if their needs have changed.
Often an application is left on the back burner but its resources remain provisioned

The second involves being aware of new Azure offerings - as a general rule the closer you are to a SaaS service (IaaS > PaaS > SaaS) the more cost-effective it is, both from a monetary and a man-hour perspective.
While an application may have followed best practice/cost optimisations at the time it was architected, a new Azure service can later offer drastic cost savings.

You may have had to run a backup service on a VM that can now be replaced by Azure Backup. You may have needed a 3rd party service to manage your on-prem footprint, which can now be consolidated into your existing Azure environment using Azure Arc. Or the newer provisioned v2 billing model for Azure Files can replace your existing Azure File Share - it offers predictable pricing (based on the provisioned storage/IOPS/throughput) and is often cheaper than the previous PAYG model

The landscape is regularly changing so it's important to be aware of new services - while many may require development time that offsets the potential cost savings, other services may only need simple changes to offer substantial savings

Scale & Schedule resources

One of the main benefits of the cloud is elasticity - someone can spin up 1,000 VMs for 2 hours during their peak usage and then shut them all down, while only being charged for those 2,000 VM-hours

Elasticity should be one of the primary focuses during an app's development lifecycle - if an app supports scaling then you can use metrics to scale resources.
The most relevant example is web apps with peaks in traffic - you can scale your App Service Plans out or in to match the traffic automatically, creating a stable experience for users while also optimising running costs
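A metric-driven autoscale rule of the kind App Service supports boils down to logic like the following. The CPU thresholds and instance bounds are illustrative assumptions:

```python
# Sketch: a CPU-based scale decision similar to an autoscale rule.
# Thresholds and instance bounds are illustrative assumptions.

def autoscale(current_instances: int, avg_cpu_pct: float,
              min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the new instance count for a simple CPU-based autoscale rule."""
    if avg_cpu_pct > 70 and current_instances < max_instances:
        return current_instances + 1   # scale out under load
    if avg_cpu_pct < 30 and current_instances > min_instances:
        return current_instances - 1   # scale in when quiet, saving cost
    return current_instances           # within the comfortable band

print(autoscale(3, 85))  # -> 4 (scale out)
print(autoscale(3, 20))  # -> 2 (scale in)
```

The min/max bounds matter as much as the thresholds: the floor keeps the app responsive, while the ceiling caps your worst-case spend.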

Resources can also be scheduled to start up and to shut down & deallocate.
This option is best for user VMs or development machines that are only used during set hours (i.e. working hours) - you can automatically deallocate a user's VM at the end of their work day and start it 1 hour before their day begins, so updates and patches run without interrupting them during the day
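The saving from this schedule is easy to quantify. The hourly rate below is a made-up example figure, not a real Azure price:

```python
# Sketch: estimate compute savings from deallocating a dev VM outside
# working hours. The hourly rate is a made-up example figure.

HOURLY_RATE = 0.20          # example compute rate in USD, not a real price
HOURS_PER_WEEK = 24 * 7     # always-on baseline

# Started 1 hour before a 9-hour work day, Monday to Friday.
scheduled_hours = (1 + 9) * 5

always_on_cost = HOURLY_RATE * HOURS_PER_WEEK
scheduled_cost = HOURLY_RATE * scheduled_hours
saving_pct = 100 * (1 - scheduled_cost / always_on_cost)

print(f"Weekly compute: ${always_on_cost:.2f} -> ${scheduled_cost:.2f} "
      f"({saving_pct:.0f}% saving)")
```

Note this only covers compute - a deallocated VM still pays for its disks and any reserved public IP, so the whole-resource saving is somewhat lower.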

Enforce tagging of resources

Tags are key:value pairs you can assign to resources (individual resources, resource groups & subscriptions) - on their own they have no cost impact but they allow you to implement a billing strategy.
Tags are commonly used to track deployment environments by implementing an 'Environment' tag with values such as dev, test & prod

This lets you distinguish billing data between environments, and you can add further information to segregate billing - e.g. you can tag an entire subscription as owned by IT with the Department:IT tag for simplicity, but tag individual VMs back to other departments with tags such as Department:Marketing

You can also build automation around the tags, e.g. you can force all resources to start with an autoDelete:true tag using Azure Policy while still allowing people to change it.
You can then create automation to delete any resources still tagged autoDelete:true at the end of the day, reducing the number of test resources persisting in the environment
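The selection step of that cleanup job is essentially a tag filter. The resource dicts below stand in for whatever an SDK or CLI inventory call would return; the names and tag values are assumptions:

```python
# Sketch: select resources still tagged autoDelete:true for end-of-day
# cleanup. The inventory dicts stand in for an SDK/CLI listing; names
# and tag values are made-up examples.

def resources_to_delete(resources: list[dict]) -> list[str]:
    """Return names of resources whose autoDelete tag is still 'true'."""
    return [r["name"] for r in resources
            if r.get("tags", {}).get("autoDelete") == "true"]

inventory = [
    {"name": "vm-test-01", "tags": {"autoDelete": "true"}},
    {"name": "vm-demo-02", "tags": {"autoDelete": "false"}},  # owner opted out
    {"name": "vm-legacy",  "tags": {}},                       # untagged, left alone
]
print(resources_to_delete(inventory))  # -> ['vm-test-01']
```

Matching on the literal string "true" (rather than truthiness) mirrors how tag values arrive from the platform - tags are always strings.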

Storage lifecycle management

Storage can quickly balloon in cost if you leave all your data in the default Hot tier - while it has the lowest transactional (read/write) costs, it has the highest storage costs

If you have data that becomes less frequently accessed as time goes on, you can implement lifecycle management to move blobs from the Hot tier to the Cool/Cold tiers, or further down to the Archive tier

You may have a compliance requirement to hold backup data for a year, but realistically you wouldn't need to access any backups older than 30 days.
You can set up lifecycle management to automatically move blobs to lower tiers depending on their age - data starts in the Hot tier, as this is the most cost-effective for transactional costs, but after 30 days it can move to the Cool/Cold tier, then after another 30 days it can move to the Archive tier.
You can also automate blob deletion, so after 1 year the blob is deleted entirely to optimise storage costs
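The tiering schedule above can be expressed as a rule over blob age. The cut-offs mirror the example (30/60/365 days); real lifecycle management policies are defined as JSON rules on the storage account rather than code:

```python
# Sketch: the example tiering schedule as an age-based rule.
# Cut-offs (30/60/365 days) follow the example above; actual lifecycle
# policies are JSON rules on the storage account.

def target_tier(age_days: int) -> str:
    """Map a blob's age to a storage tier per the example schedule."""
    if age_days >= 365:
        return "delete"    # retention met, remove the blob entirely
    if age_days >= 60:
        return "archive"   # rarely read, cheapest at-rest storage
    if age_days >= 30:
        return "cool"      # infrequent access
    return "hot"           # recent, frequently accessed

print([target_tier(d) for d in (0, 45, 200, 400)])
# -> ['hot', 'cool', 'archive', 'delete']
```

One caveat worth modelling before committing: Archive blobs must be rehydrated (hours of latency) before they can be read, and early moves back up the tiers incur retrieval charges.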

Spot instances

Spot instances let you use currently idle VM capacity for up to 90% off the regular cost, with the catch that the capacity can be taken away from you (after a 30 second warning).
If you have stateless workloads/batch jobs that run for a short time, or that can handle failing over to another node, you can use spot instances for drastic cost savings

Azure has an eviction policy that notifies the VM when it is about to be evicted - depending on the policy the VM is either deallocated or deleted along with its disks. You can set a maximum price you're willing to pay, and when the spot price exceeds this, or the capacity is needed by another customer, the VM receives an eviction notice (which can be used programmatically to end running jobs safely)

While capacity isn't guaranteed and the discounted rates change regularly, if your workload is engineered specifically for spot instances you can save a significant amount compared to running on regular VMs
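The max-price mechanism and the headline saving come down to simple arithmetic. The rates below are invented example figures - real spot prices vary by region and SKU:

```python
# Sketch: spot pricing arithmetic with a max-price cap. The rates are
# invented example figures; real spot prices vary by region and SKU.

ON_DEMAND_RATE = 0.40     # example on-demand $/hour
SPOT_RATE = 0.06          # example current spot $/hour
MAX_PRICE = 0.10          # highest $/hour we are willing to pay

def should_keep_running(current_spot_rate: float) -> bool:
    """Once the spot price exceeds our cap, expect an eviction notice."""
    return current_spot_rate <= MAX_PRICE

saving_pct = 100 * (1 - SPOT_RATE / ON_DEMAND_RATE)
print(f"Spot saving vs on-demand: {saving_pct:.0f}%")
print("keep running at $0.06/h:", should_keep_running(0.06))
print("keep running at $0.12/h:", should_keep_running(0.12))
```

In a real deployment the price check is moot once Azure decides to evict - the job's shutdown hook should react to the eviction notice itself, using the price cap only as a budget guard.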