Peter led the development team that created one of the first cloud autoscalers back in 2006. He has over 15 years of experience building cloud platforms and helping customers reliably operate and scale high-performance enterprise applications in the cloud.

Misconceptions about Horizontal Autoscaling

As traditionally understood, horizontal scaling means adding more replicas of an application to meet increased load and removing unnecessary replicas when the load decreases. While it is possible to scale applications up and down manually, autoscaling is a mechanism for doing this automatically. Fortunately, Kubernetes has built-in support for horizontal autoscaling: the HorizontalPodAutoscaler, often abbreviated as HPA.
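
To make this concrete, here is a minimal sketch of an HPA manifest; the Deployment name my-app and the 70% CPU target are illustrative assumptions, not recommendations:

```yaml
# Minimal HPA sketch: keep average CPU utilization near 70%
# for a Deployment named "my-app" (names and values are illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```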

We frequently talk about autoscaling simply as adding more instances to meet increased traffic. Ironically, this is the opposite of what autoscaling was created to do. Without autoscaling, you have to provision your application for peak traffic and leave most replicas underutilized when traffic tapers off at night. What autoscaling actually does is reduce the number of replicas when your app is not at peak load and add them back when they are needed.

Sounds great, right? The problems that arise with autoscaling, however, are usually not on scale-down (overspending rarely raises alarms or triggers PagerDuty) but on scale-up: not adding replicas fast enough to meet traffic increases.

Another misconception about horizontal autoscaling is that it works on high-watermark and low-watermark triggers: when traffic (or resource utilization) goes above a high watermark, we add more replicas; when it goes below a low watermark, we remove replicas. Indeed, this is how some of the early autoscalers worked. However, modern autoscalers borrow pages from industrial control theory, using approaches such as proportional-integral-derivative (PID) controllers and even predictive autoscaling.

The practical takeaway is that HPA uses a target value for a metric and decides on the number of replicas so that the metric is kept near the target value; no high or low watermark values are used.
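
Concretely, the scaling rule in the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). For example, if 5 replicas are running at 90% average CPU utilization against a 60% target, the HPA requests ceil(5 × 90 / 60) = 8 replicas; if utilization drops to 30%, it requests ceil(5 × 30 / 60) = 3.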

Is Autoscaling Necessary?

To benefit from autoscaling, your app should need more than one or two replicas to serve its anticipated peak load. Otherwise, even with autoscaling, it will still run the same one or two replicas.

Once you have a big-enough application with variable traffic, there are two complementary reasons to autoscale:

  1. Reduce costs when traffic is lower. While autoscaling has some advantages in on-premises deployments, scaling down your application in the cloud can save significant money because of the cloud's elastic, pay-for-what-you-use pricing.
  2. Readiness for increased traffic. If you anticipate increased traffic (e.g., due to a promotion your company is running) but don't know exactly when it will arrive or how large it will be, setting up autoscaling can help you prepare for the deluge and capture the new customer inflow.

If you need autoscaling but don't use it, the consequences can be:

  1. If you have allocated replicas for peak traffic, you are reserving and paying for computing resources that are not being used most of the time. Not disruptive, but it will not keep your CFO happy.
  2. Suppose the traffic exceeds what your current replicas can handle. In that case, all your customers will likely experience significantly increased response times (from 100 ms to 5-20 seconds per page load or API call) and even a high rate of errors (failed requests). This will have a significant negative impact on your business, and it will likely start with a surge of pager events to your team (you do have monitoring and alarms, right?).

In short, if your applications experience variable load, you need horizontal autoscaling. There are plenty of high-quality how-tos for HPA, so we will focus instead on highlighting some of the gotchas you want to be aware of when using autoscaling.

What Are the 5 Gotchas with Horizontal Scaling?

While horizontal autoscaling is appropriate and needed for most cloud-native applications, there are cases where it is not appropriate or where additional concerns must be addressed first. Here are the 5 gotchas with autoscaling to save you a headache down the line.

1) Applications that don’t scale by adding more instances because they are stateful.

These are often databases, but a somewhat unexpected cloud-native app that doesn't scale out horizontally is Prometheus (in its stock configuration). These applications simply don't support running multiple instances: at best, new instances have no impact; at worst, they can lead to data corruption. In short, if your app cannot be manually scaled (by changing the number of replicas in your deployment), then it cannot be autoscaled. Such applications typically have application-specific scaling capabilities (e.g., MySQL clustering, or Cortex and Thanos for Prometheus). In Kubernetes, these are frequently packaged as operators to make them easy to run and scale.

2) Applications that don’t scale by adding more instances because they rely on an upstream service that becomes a bottleneck (e.g., a database or an external service).

In this case, adding more instances simply does not make the app faster or more reliable. This becomes quite insidious when autoscaling is configured to work not on resource utilization (e.g., CPU and memory) but on black-box metrics (e.g., response time or error rate). In these cases, hitting an upstream bottleneck may lead an autoscaler to quickly scale your app to the maximum number of replicas and increase cost 10x, all the while not improving performance. The solution for these applications is to first scale the upstream service until the bottleneck is removed, and only then consider horizontal autoscaling. The sketch after this paragraph shows how such a runaway configuration might look.
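
For illustration, here is a hedged sketch of an HPA driven by a latency metric; the metric name http_request_latency_p95_ms and the adapter setup are assumptions, not a specific product's API. If the real bottleneck is an upstream database, this configuration will keep adding replicas toward maxReplicas without improving latency:

```yaml
# Sketch: HPA driven by a black-box latency metric exposed through a
# custom-metrics adapter (e.g., Prometheus Adapter). Metric name and
# values are illustrative. With an upstream bottleneck, latency stays
# above target no matter how many replicas run, so HPA scales to max.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-latency
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_request_latency_p95_ms  # assumed custom metric name
      target:
        type: AverageValue
        averageValue: "250"                # target p95 of 250 ms
```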

3) Unavailable cluster resources.

While the HPA will automatically start additional pods when needed, it relies on the cluster having enough node resources to schedule those pods. Kubernetes separates the concerns of workload autoscaling (application pods) from infrastructure autoscaling (cluster nodes). If the cluster does not have sufficient resources to schedule the additional pods, they will remain pending and never run. There are two approaches to solving this: (a) the traditional approach, using a cluster autoscaler that adds and removes nodes in the cluster, or (b) the more modern nodeless Kubernetes services now available from all major cloud providers (e.g., AWS Fargate for EKS).
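
You can spot this condition by looking for pods stuck in Pending with a FailedScheduling event; the pod names below are illustrative and the output is abridged:

```
$ kubectl get pods -l app=my-app
NAME                     READY   STATUS    RESTARTS   AGE
my-app-7d4b9c8f6d-x2k9   0/1     Pending   0          4m

$ kubectl describe pod my-app-7d4b9c8f6d-x2k9
...
Events:
  Warning  FailedScheduling  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
```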

4) Control lag.

When HPA detects the need for more replicas due to increased load, it sends a request to Kubernetes to start the additional replicas. Between the moment of this decision and the new replicas being up and serving requests, a lot of things need to happen: (a) the Kubernetes control plane must accept the request and schedule the pods to nodes (including adding nodes to the cluster, if cluster autoscaling is needed), (b) the nodes must download the container images and start the pods, and (c) the application must start and, in some cases, warm up. For applications that need more time to start (especially some Java applications), this delay may leave the application with insufficient replicas for many minutes! To compensate for control lag, you can either (a) set the autoscaling target somewhat lower to provide headroom, as sketched below, or (b) use predictive autoscaling.
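
Here is a hedged sketch of the headroom approach, as a fragment of an HPA v2 spec; the specific numbers are illustrative, not recommendations:

```yaml
# Sketch: compensate for control lag by leaving headroom and by
# scaling up quickly while scaling down conservatively.
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # headroom below a "fully loaded" 70-80%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to load increases immediately
    scaleDown:
      stabilizationWindowSeconds: 300  # avoid flapping on brief dips
```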

5) Incorrect or unvalidated maximum replicas.

Because autoscaling is essentially connected to your credit card (or whatever pays for the additional replicas), HPA has a configuration attribute for the maximum number of replicas. Once that limit is reached, HPA will not request additional replicas, even if the traffic requires them. While this is an important cost-protection measure, it can prevent your application from scaling out when needed. To deal with this issue, periodically review the peak number of replicas reached vs. the maximum configured; as your business scales over time, you will need to raise the limit.
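
One quick way to check (output columns abridged): if REPLICAS is pinned at MAXPODS during peak traffic, the configured maximum is likely throttling your scale-out:

```
$ kubectl get hpa my-app
NAME     REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS
my-app   Deployment/my-app   71%/70%   2         10        10
```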

For critical production applications, it is essential to periodically test scaling to the maximum desired levels. This way, you ensure that all elements are scaled to support the target load, addressing many of the issues above — from upstream service bottlenecks to cluster autoscaling and maximum replicas.

It is also desirable to validate that the application can scale down and release resources (including scaling down cluster nodes) to reduce costs when traffic is lower.

Properly tuning your applications' configuration ensures that they can scale up and down reliably. We will cover some of the critical elements and mechanisms for achieving this reliability in an upcoming post. To learn how Opsani can boost your Kubernetes performance, click here!