Principles and Practices of Cost Optimization on Google Cloud Platform

With great complexity come many options for optimization. This article first examines some of the overarching principles of cloud cost optimization for Google Cloud Platform (GCP), then considers some practical actions you can take now.

While the following will call out some GCP specifics, I find that these principles can be broadly transferred to any cloud system.

Cost optimization – Know your target.

One challenge with cost optimization is that there are myriad ways to achieve it, but not all of them will produce positive outcomes for your business. It is worth keeping in mind that the real benefit of cloud systems is not that they are cheaper than physical infrastructure (though they may be), but that they provide a faster time to value. In other words, if you keep your cloud costs reasonable, you are free to focus on delivering greater value to your customers, which translates to increased revenue.

Cost-cutting thus requires an understanding of not just cost but also performance and reliability targets. This brings into play not only the developers and site reliability engineers (SREs) but also the company's business side: executives and finance.

If you can define service level objectives for costs, performance, and reliability, you now have a target to aim for. 

Implement and leverage cloud cost metering and monitoring

Traditional IT infrastructures had fairly dependable budgets. Funds were doled out to business units for capital expenditures after approval, and forecasting used historical data to derive future budget needs. This model's static nature provided budget certainty. One of its pain points was that purchasing was not instantaneous: a purchase often required approval, ordering, delivery, and installation, all of which took time.

With cloud environments, costs are now operational rather than capital. Budgets can be spent on demand, as resources are needed, and you pay only for the resources you use. But because purchase decisions are subject to less review, having a way to understand actual operational costs becomes critical.

As it turns out, one of the key features of a cloud system is having measurable services, and GCP provides ways to track and measure costs for each of your cloud services. Still, this becomes a challenge in larger systems, where the sheer number of measurements can be difficult to parse if standards are not put in place.

It therefore becomes critical to define standards for resource labeling, set unit budgets and spending alerts, and develop a model that ensures engineering and finance communicate cost requirements effectively. One holdover from the CapEx world is developers and engineers treating servers as 'theirs' and keeping them reserved even when idle. Another common CapEx holdover is over-provisioning resources to ensure performance and then paying for the overpowered system during non-peak times. Standards help define boundaries and expectations around resource use in an otherwise easy-to-overspend environment.
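As a concrete illustration, here is a minimal sketch of what enforcing such a labeling standard might look like in code. The required keys and the validate_labels helper are hypothetical conventions, not a GCP API:

    # Hypothetical labeling standard: every billable resource must carry these keys.
    REQUIRED_LABEL_KEYS = {"team", "env", "cost-center"}
    ALLOWED_ENVS = {"prod", "staging", "dev"}

    def validate_labels(labels: dict) -> list:
        """Return a list of violations for a resource's labels (empty = compliant)."""
        violations = [f"missing label: {key}" for key in REQUIRED_LABEL_KEYS - labels.keys()]
        if labels.get("env") not in ALLOWED_ENVS:
            violations.append(f"env must be one of {sorted(ALLOWED_ENVS)}")
        return violations

    # Example: check a VM's labels in a CI pipeline before deployment.
    print(validate_labels({"team": "web", "env": "dev"}))  # ['missing label: cost-center']

Run in a deployment pipeline, a check like this keeps unlabeled, and therefore unattributable, resources from ever reaching production.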

The GCP Cloud Billing reports are an incredibly powerful tool for understanding service costs, especially when services are appropriately labeled so that costs are attributable to specific teams or departments. The ability to create customized dashboards with Cloud Billing also lets you move beyond raw service costs: appropriate labeling allows business-relevant evaluation of GCP resource costs against the revenue generated by specific customers. This can then feed back into the discussion of budget and resource allocation between finance and engineering.

Do you know your cloud’s value versus its cost? 

The CapEx world was very much about controlling and reducing costs. In part, this was because the server you bought for that project was not something you could return. Things are very different in an OpEx world, and it is no longer just about cutting costs. The cost optimization focus is better thought of as eliminating waste and maximizing the value derived from your spending. Here again, finance and engineering can work together to define standards and metrics to understand a service's operational cost/value proposition.

  • What is the value that Service X provides to our customers?
  • What does it cost to provide Service X?
  • How can I optimize the cost of Service X without degrading performance?

Depending on where you are in your digital transformation journey, this may be a more or less difficult task. Still, understanding the actual cost-benefit of a service provides valuable insight when you are trying to keep costs down and customers happy at the same time.

Start with standards and automate processes early on.

If you can define standards early in a project, you will be better off than if you try to enforce them retroactively. How resources are labeled and what limits apply to resource deployment are both good to define upfront. Coupling standards with automation tools such as GCP's Cloud Deployment Manager or Terraform will help ensure they are applied consistently.
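Deployment Manager templates can be written in Python, which makes it easy to bake a labeling standard into every deployment. A minimal sketch, with placeholder label values; a complete instance resource would also need disks and a network interface:

    # Deployment Manager template (Python): every VM it emits carries standard labels.
    STANDARD_LABELS = {"team": "web", "env": "dev", "cost-center": "cc-123"}  # placeholders

    def GenerateConfig(context):
        """Entry point Deployment Manager calls to produce the resource manifest."""
        return {"resources": [{
            "name": "example-vm",
            "type": "compute.v1.instance",
            "properties": {
                "zone": "us-central1-a",
                "machineType": "zones/us-central1-a/machineTypes/e2-small",
                "labels": STANDARD_LABELS,
                # A real instance also needs 'disks' and 'networkInterfaces';
                # omitted here to keep the labeling pattern in focus.
            },
        }]}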

Setting up a sensible cost management hierarchy to create logical resource groupings is also best done upfront. You can, and should, define a simple structure that meets your initial needs for cost management and attribution, and then add complexity if needed. With GCP, you can leverage the setup wizard to get recommendations for setting up your environment.

I’ve mentioned the importance of labeling resources as key to effective cost management, and it is worth reiterating. Well-labeled resources tie costs to a specific business unit or project, making it possible to connect service costs to business value. Without labels, you can see only that your company is spending $35,000 on Google Kubernetes Engine (GKE). With the two services you run on GKE labeled, you can see that you spend $10,000 on GKE for your web photo catalog, which generated $11,000 in revenue, and $25,000 on GKE for an AI image search engine, which generated $150,000.
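If you export billing data to BigQuery, label-based attribution like the GKE example above becomes a simple query. A sketch, assuming a standard billing export table and a label key of 'service'; the project, dataset, and table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Billing export rows carry a repeated (key, value) 'labels' field;
    # UNNEST it to group GKE costs by your own label.
    query = """
        SELECT l.value AS labeled_service, SUM(cost) AS total_cost
        FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,  -- placeholder table
             UNNEST(labels) AS l
        WHERE l.key = 'service'
          AND service.description = 'Kubernetes Engine'
        GROUP BY labeled_service
    """
    for row in client.query(query):
        print(f"{row.labeled_service}: ${row.total_cost:,.2f}")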

Review cloud costs and optimization practices regularly.

Your specific review cadence will depend on your development and customer environment, but having a regular review process is important to avoid spending surprises. Again, this means having teams with diverse responsibilities (engineering, development, finance, executive) meet to review usage data and potentially adjust use and cost forecasts. The default GCP Cloud Billing console makes it simple and quick to review and audit your cloud costs regularly, and putting effort into a custom dashboard that surfaces the metrics most important to your company can be well worth it.

It is worth considering that if your customer base is fairly stable and your system is relatively small, reviews may not need to be as frequent as when dealing with a dynamic or very large system. Large systems with multiple applications and cloud spend in the seven-figure range per month can rapidly and unnecessarily lose money to inefficiency and, conversely, rapidly realize significant savings from optimization efforts.

Setting actionable priorities

As previously mentioned, there is often, but not always, a tradeoff when optimizing for cost, performance, velocity, and reliability. Across a company, it can be difficult to make decisions among competing requirements and goals. A multi-disciplinary team review of optimization objectives helps find a realistic balance between cost savings and customer value impacts. If there is a clear understanding of the required effort (development or engineering), the potential cost savings, and the potential business value, it becomes easier to make informed decisions.

The Practice of Cost Optimization

Cloud optimization is not something that can be solved with a default procedure or checklist. Everyone's cloud environment is in some way unique, as are the specifics of the applications you run. Once the structure for communicating, monitoring, evaluating, and tuning cloud costs is in place, it is time to apply some specific tools and practices. Google Cloud's optimization tools can generally be binned into three categories: cost visibility, efficient resource allocation, and pricing optimization.

Cost visibility

The variable nature of cloud systems, and the on-demand capabilities that let DevOps and SRE teams address that variation, can result in unintended and unexpected costs. Understanding and controlling spending is the first step in optimizing your costs. With Google Cloud, you immediately have access to several no-cost billing and cost management tools that provide the visibility needed to start making spending decisions. We've already touched on the value of GCP's default billing management tools and the ability to customize your billing dashboard for greater insight. Additional cost management and visibility features include quotas, budgets, and alerts, which give you real-time warnings, greater control over costs, and a lower probability of unintentional overspending.
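As a sketch of the budgets-and-alerts piece, the Cloud Billing Budget API lets you create a budget with threshold alert rules programmatically; the billing account and project identifiers below are placeholders:

    from google.cloud.billing import budgets_v1

    client = budgets_v1.BudgetServiceClient()

    budget = budgets_v1.Budget(
        display_name="web-team-monthly",
        # Scope the budget to a single project so overspend is attributable.
        budget_filter=budgets_v1.Filter(projects=["projects/000000000000"]),  # project number
        amount=budgets_v1.BudgetAmount(
            specified_amount={"currency_code": "USD", "units": 1000}
        ),
        # Fire alerts at 50%, 90%, and 100% of actual spend.
        threshold_rules=[
            budgets_v1.ThresholdRule(threshold_percent=p) for p in (0.5, 0.9, 1.0)
        ],
    )

    client.create_budget(
        parent="billingAccounts/XXXXXX-XXXXXX-XXXXXX",  # placeholder billing account
        budget=budget,
    )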

Resource Usage Optimization

Overprovisioning is a pervasive opportunity for cost optimization efforts. While overprovisioning has its roots in traditional IT principles, it remains an issue in environments where automation to monitor and right-size resources is not fully developed. The GCP Recommender can help identify over-provisioned, under-provisioned, or idle resources. Automating this kind of management should be a key SRE team goal, and it is where Opsani's continuous optimization is ideally applied.
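A sketch of querying the idle-VM recommender through the Recommender API; the project and zone are placeholders:

    from google.cloud import recommender_v1

    client = recommender_v1.RecommenderClient()

    # Compute recommenders are zonal; this one flags largely inactive instances.
    parent = (
        "projects/my-project/locations/us-central1-a/"  # placeholders
        "recommenders/google.compute.instance.IdleResourceRecommender"
    )
    for rec in client.list_recommendations(parent=parent):
        print(rec.description)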

Some specific resource use optimization actions to take include:

Delete Idle VMs: Getting rid of resources you are paying for but not using is a clear way to cut costs. There is even a GCP Idle VM Recommender that will find inactive VMs and persistent disks. Deletion should be done with care, ideally by the person or team that created the resource.
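Deletion itself is a single call with the Compute client, which is exactly why it deserves a human sign-off first; all identifiers here are placeholders:

    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()

    # Irreversible: confirm with the owning team before running.
    operation = instances.delete(
        project="my-project", zone="us-central1-a", instance="idle-vm-1"
    )
    operation.result()  # block until the deletion completes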

Schedule non-prod VMs: Most cloud systems charge only for the resources you have running. If you have dev/test environments that are not active during non-business hours, turning them off can provide substantial savings.
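A sketch of the shutdown half of such a schedule, stopping anything labeled as a dev resource in one zone; pair it with a nightly Cloud Scheduler trigger. Identifiers are placeholders, and the list-filter syntax is worth verifying against your client version:

    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()

    # Find running instances labeled env=dev and stop them; stopped VMs
    # stop accruing compute charges (attached disks still bill).
    request = compute_v1.ListInstancesRequest(
        project="my-project",
        zone="us-central1-a",
        filter='labels.env = "dev" AND status = "RUNNING"',
    )
    for instance in instances.list(request=request):
        instances.stop(project="my-project", zone="us-central1-a", instance=instance.name)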

Rightsize VMs: Overprovisioned VMs have you paying for resources you are not getting value from. Machines that are too small (e.g., worker nodes in a GKE cluster) can also result in inefficient bin packing, and again you pay for unused resources. GCP's rightsizing recommendations can show you how to downsize your machine type based on observed CPU and RAM usage, and often simply choosing a better standard machine type resolves the problem. If you really want to tune your environment, you can create custom GCP machine types with exactly the CPU and RAM your use case needs.
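Applying a rightsizing recommendation can be as small as a set-machine-type call. A sketch, with placeholder identifiers; the VM must be stopped before its machine type can change, and custom-2-4096 (2 vCPUs, 4 GB) is just an example size:

    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()

    # The VM must be TERMINATED (stopped) before its machine type can change.
    instances.stop(project="my-project", zone="us-central1-a", instance="web-1").result()

    instances.set_machine_type(
        project="my-project", zone="us-central1-a", instance="web-1",
        instances_set_machine_type_request_resource=compute_v1.InstancesSetMachineTypeRequest(
            machine_type="zones/us-central1-a/machineTypes/custom-2-4096"  # 2 vCPU, 4 GB
        ),
    ).result()

    instances.start(project="my-project", zone="us-central1-a", instance="web-1")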

Pricing efficiency

GCP, like most other cloud vendors, offers a wide range of pricing options. For VMs, these include volume, sustained-use, and committed-use discounts, preemptible instances, per-second billing, and others. GCP also offers multiple storage classes, where price is generally correlated with the frequency of data access and the rate at which data can be retrieved. The specific pricing options you can use will depend on your use case and should be backed by performance data. A couple of examples:

Preemptible VMs: These virtual servers are highly affordable and appropriate for fault-tolerant and generally short-lived workloads. They live for a maximum of 24 hours and can cost up to 80% less than normal VMs.
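A sketch of requesting a preemptible VM with the Compute client; the image, machine type, and other identifiers are placeholders:

    from google.cloud import compute_v1

    instance = compute_v1.Instance(
        name="batch-worker-1",
        machine_type="zones/us-central1-a/machineTypes/e2-standard-4",
        # The key line: request preemptible capacity.
        scheduling=compute_v1.Scheduling(preemptible=True),
        disks=[compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12"
            ),
        )],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )

    compute_v1.InstancesClient().insert(
        project="my-project", zone="us-central1-a", instance_resource=instance
    ).result()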

VM Sustained-use Discounts: These use resource-based pricing that applies sustained-use discounts to all predefined machine types in a region collectively, rather than to individual machine types. The greater the aggregate use, the lower the overall cost.
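As a rough worked example, assuming the published N1 sustained-use tiers, where each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate:

    # Effective price for an N1 VM that runs the entire month under the
    # assumed tier schedule above.
    tier_rates = [1.00, 0.80, 0.60, 0.40]  # fraction of base rate per quarter-month
    effective_rate = sum(tier_rates) / len(tier_rates)
    print(f"effective rate: {effective_rate:.0%} of list price")  # 70%, i.e. a 30% discount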

Cloud Storage Classes: The default GCP storage class is Standard, but if you rarely access some data (e.g., data archives), the Nearline or Coldline storage classes may provide savings. If you are maintaining data that you are unlikely to ever access (e.g., for legal discovery requirements), the Archive class may provide even further savings.
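A sketch using the Cloud Storage client to move aging objects to a colder class automatically via a lifecycle rule; the bucket name and the 90-day threshold are placeholders:

    from google.cloud import storage

    bucket = storage.Client().get_bucket("my-archive-bucket")

    # Transition objects untouched for 90 days to Coldline to cut storage cost.
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.patch()  # persist the new lifecycle rule on the bucket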

From principles to practice

While this article provides some practical examples, there are truly myriad options for reducing your GCP cloud costs. Look for a deeper dive into hands-on recommendations in an upcoming post.

Much like the DevOps journey that many of us are on today, cost optimization is a journey that parallels the growth, tuning, and automation of your cloud environment. I hope you appreciate that visibility into your actual costs, and the sources of those costs, comes first. Just as important is including an appropriately diverse set of views on business, operational, and development goals to understand the true effects of cost on business value, application reliability, and performance. Opsani can be one piece of the puzzle, providing an automated optimization solution that stays up to date no matter how your GCP environment changes.