A quick search online turns up plenty of tools that examine your AWS logs or bills and produce a list of VMs with low utilization. The idea, of course, is that if you shrink those VMs, or delete them outright, you’ll save money.

Increasing utilization is a seductively simple idea.

After all, if a VM isn’t using its resources, you’re clearly spending too much on it. Plus, utilization is an easy metric to understand and optimize against. If what you’re running is just a collection of independent VMs, this can work great. However, if you’re running a service made up of multiple VMs, it can create new problems.
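For context, here’s a minimal sketch of the kind of check these tools perform, assuming boto3 and EC2’s built-in CPUUtilization metric. The threshold and lookback window are made-up values for illustration, not recommendations, and a real tool would also paginate and look at more than CPU.

```python
# Minimal sketch: flag running EC2 instances with low average CPU utilization.
# Assumes boto3 credentials are configured; thresholds below are arbitrary examples.
from datetime import datetime, timedelta, timezone
import boto3

LOOKBACK_DAYS = 14
LOW_CPU_PERCENT = 10.0   # example threshold, not a recommendation

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=LOOKBACK_DAYS)

# Note: no pagination handling here; this is just the shape of the check.
for reservation in ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]:
    for instance in reservation["Instances"]:
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
            StartTime=start,
            EndTime=end,
            Period=86400,            # one datapoint per day
            Statistics=["Average"],
        )["Datapoints"]
        if datapoints:
            avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
            if avg_cpu < LOW_CPU_PERCENT:
                print(f'{instance["InstanceId"]} ({instance["InstanceType"]}): '
                      f"avg CPU {avg_cpu:.1f}% -> candidate for a smaller type")
```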

To understand how this simplistic approach can backfire, consider a simple two-VM service with a web front-end and a DB back-end. Based solely on utilization, you might get an alert that the back-end VM is underutilized, along with a suggestion to choose a smaller VM type. But WHY is the back-end VM’s utilization low? The problem could be that there simply isn’t enough traffic to use the resources in the VM, in which case choosing a smaller VM type would be advantageous. Then again, the issue could be that the front-end VM type wasn’t chosen properly. There may be heavy traffic, but if the front-end VM is too small, it may be unable to process all the incoming requests. In that case, the low utilization of the back-end VM is actually the result of a bottleneck in the front-end.
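A rough back-of-the-envelope illustration of that second case, with entirely made-up capacity numbers:

```python
# Hypothetical capacities in requests/sec -- made-up numbers for illustration only.
incoming_traffic = 900          # offered load
frontend_capacity = 500         # front-end VM is too small for the traffic
backend_capacity = 1000         # back-end VM sized for the real traffic

# The back-end only ever sees what the front-end manages to pass through.
traffic_reaching_backend = min(incoming_traffic, frontend_capacity)

backend_utilization = traffic_reaching_backend / backend_capacity
print(f"Back-end utilization: {backend_utilization:.0%}")   # 50%, despite 900 req/s offered
```

The back-end looks half idle, yet shrinking it would be exactly the wrong move: fix the front-end bottleneck and the back-end needs every bit of its capacity.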

Utilization and Bottlenecks

A typical service is obviously more complex than two VMs, and that only makes things harder. A VM creating a bottleneck doesn’t always result in low utilization of the next VM in the chain; sometimes the effects show up where you least expect them. So what do we do about this? Well, there’s a simple brute-force approach, and a more insightful long-term approach that requires a little thought.

The brute-force approach is to first go through all the VMs in a service looking for high utilization. Adjust the VM type for these to provide MORE resources; this ensures they’re not bottlenecks holding back traffic from other VMs. Next, look for low-utilization VMs and adjust those VM types to provide fewer resources. Repeat these two steps a couple of times until the lowest utilization is acceptable. Finally, adjust the VMs you sized up in the first step back down, repeating a couple of times, and you’ll have a well-tuned service.
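In code, the loop looks roughly like the sketch below. The observe_utilization() and resize() helpers are hypothetical placeholders for whatever monitoring and provisioning tooling you actually use, and the thresholds and pass counts are arbitrary examples.

```python
# Rough sketch of the brute-force tuning loop described above.
HIGH, LOW = 0.80, 0.30    # example thresholds, not recommendations
MAX_PASSES = 3            # "repeat a couple of times"

def observe_utilization(vm) -> float:
    """Return the VM's recent utilization (0.0-1.0). Hypothetical placeholder."""
    raise NotImplementedError

def resize(vm, direction: str) -> None:
    """Move the VM one instance size 'up' or 'down'. Hypothetical placeholder."""
    raise NotImplementedError

def tune(service_vms) -> None:
    for _ in range(MAX_PASSES):
        # Step 1: grow anything that looks like a bottleneck,
        # so downstream VMs see the real traffic.
        for vm in service_vms:
            if observe_utilization(vm) > HIGH:
                resize(vm, "up")
        # Step 2: shrink anything that is genuinely idle.
        for vm in service_vms:
            if observe_utilization(vm) < LOW:
                resize(vm, "down")
    # Finally: trim back the VMs that were grown in step 1 but no longer
    # need the extra headroom, repeating until utilization settles.
    for _ in range(MAX_PASSES):
        for vm in service_vms:
            if observe_utilization(vm) < LOW:
                resize(vm, "down")
```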

If that seems like a lot of work, it is. Clearly we need a more automated approach, but before we get to that, note that in our example we’re only tuning a couple of variables: memory and CPU. There are many other settings that affect performance that you’re probably not tuning at all.

What are those parameters?

More on that in a future post.