
Getting the problem statement right is crucial. In this blog, you will learn how Opsani’s understanding of our customers’ scaling and stability concerns has evolved and directed the specific technical problems we solve. This blog will conclude with our working problem statement, turning “just make my app scale and not cost infinity dollars” into something a bit less vague that we engineers can work with.

How is Opsani designed to support scalability and stability?

At first, we designed the Opsani engine around black-box sample-efficient optimization techniques. In an active learning setting, we thought we needed to achieve the best results in as few tries as we could. Get results to the customer as fast as possible, right?

A few things were wrong with this framing. First, we’d assumed our customers have and believe in testing environments – this turns out not to be the case. We turned away too many customers because they didn’t have a testing environment with realistic load generation. Just as many didn’t have the APM we needed to measure the effects of our adjustments to their application settings.

Moreover, when a customer has a testing environment, it inevitably isn’t “production-like”. The testing environment serves to smoke test the application – it doesn’t stress test the application like production traffic will. In this case, every time we find new, improved settings for an app in the testing environment, the customer invariably asks “but how will this work in production?” The only believable answer to “how will it work in production?” from a third-party vendor is “it already does”, so tuning live in production is a key component of how we frame our problem.

What is different about tuning applications live in production?

Tuning live in production introduces new risks and affordances. When we try new settings, there’s always a risk of disrupting production traffic. That risk carries substantial consequences for our customers’ business, even though we only tune one pod in an entire deployment. To address this concern, we refocused on safety rather than just speed.

How did Opsani shift focus towards safety? 

Safety means, for Opsani, keeping SLO violations to a bare minimum. Of course, the safest solution is to overprovision to the hilt and stay scaled up redundantly at all times – but that’s absurd, and it’s unnecessarily expensive. We’ve developed a balanced, opinionated stance on how frequently we can try new things vs how often we can cause disruptions, and this is a key to making the customer experience smooth. When tuning live in production, it’s vital to minimize the number of SLO violations while we explore and optimize our customers’ application settings.
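One way to picture the trade-off between trying new settings and causing disruptions is a simple gate that pauses exploration when recent SLO violations exceed a budget. The sketch below is illustrative only and is not Opsani's implementation; the class name `ExplorationGate`, the window size, and the violation budget are all hypothetical.

```python
from collections import deque


class ExplorationGate:
    """Hypothetical gate: allow a new trial only while recent SLO
    violations stay under a fixed budget (assumed values, not Opsani's)."""

    def __init__(self, window: int = 100, max_violations: int = 2):
        # Sliding window over the most recent measurement intervals.
        self.recent = deque(maxlen=window)
        self.max_violations = max_violations

    def record(self, violated: bool) -> None:
        # Store whether the latest interval breached the SLO.
        self.recent.append(violated)

    def may_explore(self) -> bool:
        # Exploration is allowed only while the budget is not used up.
        return sum(self.recent) < self.max_violations


gate = ExplorationGate(window=5, max_violations=2)
gate.record(True)
gate.record(True)
paused = not gate.may_explore()  # budget exhausted, so hold current settings
```

A larger window or a smaller budget makes the tuner more conservative; the right values depend on how much disruption a given service can tolerate.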

How does Opsani summarize these combined concerns of safety, optimization efficiency, and cost?

Finally, we’ve arrived at Opsani’s working problem statement: “tune the customer’s application and infrastructure settings for SLO adherence and cost – in that order – live, in production, without disrupting production traffic”. Notice that sample efficiency has been dropped. It’s better to tune slowly and steadily in production than to tune as fast as possible in a test environment.
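The phrase “in that order” can be read as a lexicographic objective: compare candidate settings on SLO adherence first, and let cost break ties only among candidates that meet the SLO equally well. Here is a minimal sketch of that reading, assuming a hypothetical `Trial` record and an assumed tolerance for SLO violations; none of these names or numbers come from Opsani's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one set of settings observed in production."""
    name: str
    slo_violation_rate: float  # fraction of requests breaching the SLO
    hourly_cost: float         # cost in dollars per hour


def rank_key(trial: Trial, slo_tolerance: float = 0.001) -> tuple:
    # Violation rates at or below the tolerance count as equally acceptable,
    # so cost only matters among trials that adhere to the SLO equally well.
    excess = max(0.0, trial.slo_violation_rate - slo_tolerance)
    return (excess, trial.hourly_cost)


def best(trials: list, slo_tolerance: float = 0.001) -> Trial:
    # Tuples compare element by element: SLO excess first, then cost.
    return min(trials, key=lambda t: rank_key(t, slo_tolerance))


candidates = [
    Trial("cheap", slo_violation_rate=0.05, hourly_cost=1.0),
    Trial("safe", slo_violation_rate=0.0005, hourly_cost=3.0),
    Trial("safe_cheaper", slo_violation_rate=0.001, hourly_cost=2.0),
]
winner = best(candidates)
```

In this example the cheapest candidate loses because it breaches the SLO tolerance, and the cheaper of the two SLO-adherent candidates wins.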

If you’re interested in seeing our solution in action, we’d love to show you how we do it using our free trial.