[ARTICLE]

SRE Service Level Agreement Terms Explained: SLA, SLO, SLI

When you sign a technology-related service contract, you probably want to be sure you know what you are getting for your money. If you are responsible for delivering on that contract, you likely want to know whether you are meeting its obligations. The acronyms SLA, SLO, and SLI are interrelated terms that help define exactly that. You’ve probably heard the term SLA (service level agreement), as it is important to both service users and providers. The terms SLO (service level objective) and SLI (service level indicator) are more operational terms that matter to site reliability engineers (SREs), who use them to make sure that SLAs are not being violated. Before we venture into the details, it is helpful to have an initial understanding of what the related terms mean:

Service Level Terminology

  • Metric: something that is measurable and related to a service level objective/agreement
  • Metric value: the value of a metric at a single point in time
  • Service Level Indicator (SLI): a metric and its target values (range) over a period of time
  • Service Level Objective (SLO): all SLIs representing the SLA objective
  • Service Level Agreement (SLA): legal agreement about the SLO (e.g., how it is measured, notifications, service credits)

You can think of an SLA as a legal promise to users, an SLO as an objective that helps you ensure you keep that promise, and the SLI as the metric or metrics that let you and the customer know how you are performing in keeping the promise.
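To make the relationship between these terms concrete, here is a minimal sketch of how they might nest as data structures. The class names, field names, and the choice to model only "higher is better" metrics are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class SLI:
    """A metric plus the target it should meet over a period of time."""
    metric: str          # hypothetical metric name, e.g. "availability_percent"
    target_min: float    # lowest acceptable value over the window
    window_days: int     # measurement period

    def is_met(self, measured: float) -> bool:
        return measured >= self.target_min


@dataclass
class SLO:
    """All SLIs that together represent the objective."""
    name: str
    slis: list[SLI] = field(default_factory=list)

    def is_met(self, measured: dict[str, float]) -> bool:
        # The objective holds only if every indicator meets its target.
        return all(sli.is_met(measured[sli.metric]) for sli in self.slis)


@dataclass
class SLA:
    """The legal wrapper around an SLO: what is promised and what happens on a miss."""
    slo: SLO
    remedy: str          # free-text consequence, e.g. "10% service credit"


availability = SLI("availability_percent", target_min=99.9, window_days=30)
api_slo = SLO("public-api", [availability])
api_sla = SLA(api_slo, remedy="10% service credit")
```

A metric such as latency, where lower is better, would need an upper bound instead of `target_min`; the sketch leaves that out to stay short.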

Service Level Agreement (SLA)

What is a Service Level Agreement?

An SLA (service level agreement) is an agreement, generally contractual, between a service provider and a service user about performance metrics such as uptime, latency, and capacity. Because these agreements tend to be contractual between a company and its client, they are generally written up by the company’s business and legal teams. However, the service levels themselves are initially defined by the SRE team. An SLA typically includes both the SLA metrics and the business consequences of failing to meet the SLA. These might include refunds, service credits, or similar penalties to the service provider.
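As an illustration of how the consequences in an SLA can be tied to a measured metric, the sketch below maps monthly uptime to a service credit. The tier boundaries and credit percentages are invented for this example; a real SLA defines its own schedule.

```python
# Hypothetical credit schedule: (minimum uptime %, credit as % of the monthly fee).
# Tiers are ordered from best to worst; the first tier the uptime reaches applies.
CREDIT_TIERS = [
    (99.9, 0.0),
    (99.0, 10.0),
    (95.0, 25.0),
]
WORST_CASE_CREDIT = 50.0  # applies below the lowest tier


def service_credit_percent(measured_uptime: float) -> float:
    """Return the credit owed for a month with the given uptime percentage."""
    for min_uptime, credit in CREDIT_TIERS:
        if measured_uptime >= min_uptime:
            return credit
    return WORST_CASE_CREDIT
```

With this schedule, a month at 99.5% uptime falls into the second tier and would earn the customer a 10% credit.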

The SLA is considered the overall service agreement related to a system’s reliability or availability. Although the SLA is a single agreement, it should derive from a rationally defined SLO (whose targets may differ from those in the SLA), and that SLO is typically an aggregate of multiple individual metrics (the SLIs).

Challenges when using SLAs

SLAs can be a big challenge to create and implement correctly. They are at times written by people who are not involved in building and running the technology that the SLA is meant to cover. An SLA that fails to clearly define service expectations, the associated metrics, and the consequences of missing them can create a promise that is difficult or impossible to keep.

Ensuring that the legal and business development teams include the IT and DevOps teams will greatly increase the chance of creating a functional SLA. An SLA should not be a wish list coming from either the business or the client. It should be grounded in what the system can realistically support and provide.

It is important to consider the effects of client-side delays when defining SLAs. If a client inadvertently causes a situation that degrades performance, you don’t want that to count as a breach of the SLA.
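One way to keep client-caused problems out of the measurement is to exclude them before computing availability. The sketch below assumes HTTP status codes are available and treats 4xx responses as client-caused and 5xx responses as service failures. That split is a simplification (some 4xx responses, such as rate limiting, may be the provider's doing), and the function name is hypothetical.

```python
from collections import Counter


def availability_percent(status_codes: list[int]) -> float:
    """Share of requests the service handled correctly, ignoring client errors."""
    counts = Counter(code // 100 for code in status_codes)
    client_errors = counts[4]   # 4xx: assumed to be caused by the client
    server_errors = counts[5]   # 5xx: counted against the service
    eligible = len(status_codes) - client_errors
    if eligible == 0:
        return 100.0            # nothing the service could have failed on
    return 100.0 * (eligible - server_errors) / eligible


# 97 successes, 2 client errors, and 1 server error:
# only 98 requests count, and 97 of them succeeded.
sample = [200] * 97 + [404] * 2 + [500]
sample_availability = availability_percent(sample)
```

Whatever exclusion rule you choose, it needs to be written into the SLA itself so that both sides measure the same thing.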

Do you need an SLA?

If you are providing a free service, an SLA is generally not provided. On the other hand, paying customers generally expect SLAs, as they provide a guarantee of the level of service and of the consequences, such as compensation, if the guarantee is not met.

Service Level Objective (SLO)

What is an SLO?

An SLO (service level objective) is the aggregate of a set of metrics, like uptime or response time, that are used to evaluate system performance. So, if the SLA is the formal agreement between you and your customer, the SLO is what sets the customer’s expectations and helps IT and DevOps teams understand what goals they need to hit and measure themselves against.

Challenges when using SLOs

While SLOs are similar to SLAs, they aren’t typically written by the legal team and will benefit from being written simply and clearly. The metrics that define SLOs should be limited to those that truly measure performance. Also, consider the potential for client-side impacts on service when writing them, as this makes it easier to carry your SLO requirements over to the SLA.

When defining an SLO, the SLI value(s) chosen should be those that define the lowest acceptable level of reliability. While this may initially seem counterintuitive, greater reliability incurs greater cost, so the goal is service that keeps the customer happy without the additional work of providing increased performance that may not even be noticed. In SRE, the price of increased reliability is not only increased cost but also slower development, as changes introduced by the internal development process can impact SLOs as well.
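A quick calculation shows why each extra "nine" gets expensive: the downtime you are allowed shrinks by a factor of ten each time. The sketch below assumes a 30-day measurement window and an availability-style SLO; the function name is made up for this example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target permits in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Over 30 days, 99% allows about 432 minutes of downtime, 99.9% about 43 minutes, and 99.99% only a little over 4 minutes, which leaves very little room for deployments or incident response.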

It is common to have two SLOs per service: one that is ‘customer-facing’ and used to derive the SLA, and a stricter internal SLO. The internal SLO may include more metrics or have a higher availability value than the one used for the SLA. For example, the difference between a customer-facing 99.9% SLO and an internal 99.95% SLO is an error budget in SRE terms. The value in doing this is that if the internal SLO is violated, there is still room to take action before the customer-facing SLO is violated.
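The two-tier arrangement can be sketched as a simple check that distinguishes an early warning from an actual breach. The 99.95% and 99.9% defaults come from the example above; the status labels, function names, and 30-day window are assumptions for illustration.

```python
def slo_status(measured: float,
               internal_slo: float = 99.95,
               customer_slo: float = 99.9) -> str:
    """Classify measured availability against the internal and customer-facing SLOs."""
    if measured >= internal_slo:
        return "ok"
    if measured >= customer_slo:
        # Internal target missed, but the customer-facing promise still holds.
        return "internal_missed"
    return "customer_breached"


def margin_minutes(internal_slo: float = 99.95,
                   customer_slo: float = 99.9,
                   window_days: int = 30) -> float:
    """Downtime that separates missing the internal SLO from missing the customer one."""
    return window_days * 24 * 60 * (internal_slo - customer_slo) / 100.0
```

With these defaults, the gap between the two targets works out to roughly 21.6 minutes per 30 days, which is the time the team has to react once the internal SLO is missed.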

Do you need an SLO?

Unlike SLAs, which mainly provide value for paying customers, SLOs can also be used for free accounts. If you manage software systems for your own company, they can be used for internal management programs as well. Creating SLOs for internal databases, networks, and the like helps set expectations and measures for the performance of internal systems, just as for customer-facing ones.

Service Level Indicator (SLI)

What is an SLI?

An SLI (service level indicator) is the metric that allows both a provider and customer to measure compliance with an SLO (service level objective). Rather than a single metric value, it is typically an aggregate of values over time. In any case, the SLI must meet or exceed the SLO and SLA cutoffs. If your SLA requires 99.9% uptime, your SLO is likely also 99.9%, and your measured SLI must meet or exceed it (e.g., 99.96% uptime). If you have an internal SLO, you might set it slightly higher, perhaps 99.95%.
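To show what "an aggregate of values over time" can look like, the sketch below turns a window of health-probe results into an uptime SLI and compares it to the targets from the example above. Measuring uptime as the fraction of successful probes is only one possible definition, and the function name is hypothetical.

```python
def uptime_sli(probe_results: list[bool]) -> float:
    """Percentage of successful health probes over the measurement window."""
    if not probe_results:
        raise ValueError("no probe results in the window")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(probe_results) / len(probe_results)


# 10,000 probes in the window, 4 of which failed.
window = [True] * 9996 + [False] * 4
sli = uptime_sli(window)

meets_customer_slo = sli >= 99.9    # customer-facing target
meets_internal_slo = sli >= 99.95   # stricter internal target
```

Here each individual probe result is a metric value, and the SLI is the percentage computed over the whole window.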

While SLIs can include any metric deemed relevant, typical SLIs tend to focus on the RED metrics:

  • Rate (e.g., request rate, throughput, transaction rate)
  • Error rate
  • Duration (e.g., response time, latency)
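As a rough sketch, the three RED metrics can be derived from a window of request records like this. The `Request` record and the output key names are assumptions for this example, and duration is summarized with a simple mean, whereas real latency SLIs often use a percentile instead.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float
    is_error: bool


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Error rate, and Duration for one measurement window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_rate": sum(r.is_error for r in requests) / len(requests),
        "mean_duration_ms": mean(r.duration_ms for r in requests),
    }


# Four requests observed over a two-second window, one of which failed.
observed = [
    Request(100.0, False),
    Request(200.0, False),
    Request(300.0, True),
    Request(400.0, False),
]
metrics = red_metrics(observed, window_seconds=2.0)
```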

Challenges when using SLIs

In a world where creating metrics can be as simple as a few clicks with a mouse, it is critical to lean towards simplicity when defining SLIs. Think of them as capturing key performance indicators (KPIs) rather than all possible performance indicators. First, decide whether the metric under consideration matters to the client. If not, don’t use it to create the SLO or SLA. Second, decide whether it will help improve your internal SLO, if you have one. It may, but if it is not needed, it is better to exclude the metric from your SLOs.

Do you need SLIs?

Good SLIs let you measure how reliable your service is. Having appropriate SLIs provides value whether you are using an SLA or just an SLO.  By now, you should understand that the SLA/SLO defines an acceptable level of performance.  The SLIs are how you can evaluate that performance against the SLO/SLA standard to inform your operations team and your customers. 

Conclusion

The use of SLAs, SLOs, and SLIs clearly defines expectations for system reliability for both your customers and SRE teams. Well-written SLAs and SLOs are derived from customer needs and from appropriate SLIs that are used to verify that those needs are being met. Defining an error budget with a stricter, internal SLO can help focus SREs on improving overall system performance and on addressing reliability issues.

Opsani uses SLOs to guide the automated and continuous improvement of application performance and cost reduction. If you have SLOs and SLIs defined, you can load them into Opsani directly. For systems that don’t have SLOs defined, Opsani can recommend an appropriate SLO and then update the SLO as the system is optimized. If you’d like to see Opsani in action for yourself, our free trial allows you to run continuous optimization on the application of your choice.