SRE SLOs: A Practical Guide for SREs on Defining SLAs, SLIs, and SLOs

Site reliability engineers bridge the gap between development and IT operations, so that the applications developers create can deliver consistent and predictable real-world performance and availability. Service level agreements (SLAs), service level objectives (SLOs), and service level indicators (SLIs) together play a key part in defining and quantifying what it means for a service to be “available” and “performant”, through clearly defined numerical measurements that can be tracked and reported against.

This article will first look at the DevOps and site reliability engineering concepts. It will then define what the terms SLI, SLA and SLO mean, then take a deeper look at how these metrics can be adopted in DevOps cultures and site reliability engineering.

DevOps and Site Reliability Engineering

While the two are linked, there is a difference between the domains of DevOps and SRE.

DevOps culture integrates development with operations so that applications and infrastructure that host those applications work better together. DevOps teams often focus on automating deployment, tasks, and features of applications in order to remove any manual interventions, thereby reducing unintended human error during those operations.

Site reliability engineering (SRE) is an engineering discipline that helps organizations sustainably achieve predefined levels of reliability in their corporate infrastructure and application systems. These reliability levels and their measurements must be in line with everyone’s expectations: too much reliability and overly complex measurements will be expensive and will delay changes to application and infrastructure systems, which can impact the business. Too little would be just as expensive and would drive away both customers and stakeholders, again causing business impact.

Site reliability engineers will therefore focus on meeting DevOps goals by automating redundancy and all manual tasks as much as possible. It’s important to note that this automation never ends, as SREs will constantly re-evaluate and find new tasks worthy of automation during any application’s development, testing, releasing, and operating cycle. Learn more about DevOps versus SRE here.

3 Key Metrics for Service Reliability

In order to measure service reliability, it is important to understand some of the key concepts used in site reliability engineering, specifically service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs).

SRE SLI: Service Level Indicators (SLI)

A service level indicator (SLI) expresses the reliability of a service as a numerical indicator that can be accurately measured over time. SLIs directly indicate the health, availability, and performance of a service with metrics such as latency, throughput, and errors or failures per X number of requests.

Google defines a service level indicator as consisting of two parts: the SLI specification itself (such as latency, throughput, or errors/failures per number of requests) and the SLI implementation (which defines how the SLI is measured in real life).
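To make the split between specification and implementation more concrete, here is a minimal Python sketch. The specification is "errors per 1,000 requests" and "availability"; the implementation is one possible way of computing them from request records. The RequestRecord type and both function names are hypothetical and chosen only for illustration, and treating an empty window as healthy is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One observed request (hypothetical shape, not from any real library)."""
    latency_ms: float  # response time as measured by the chosen implementation
    succeeded: bool    # True if the request was served without an error


def errors_per(records: Sequence[RequestRecord], per: int = 1000) -> float:
    """SLI: number of failed requests per `per` requests."""
    if not records:
        return 0.0  # assumption: no traffic means no errors
    failures = sum(1 for r in records if not r.succeeded)
    return failures * per / len(records)


def availability_ratio(records: Sequence[RequestRecord]) -> float:
    """SLI: fraction of requests that succeeded (0.0 to 1.0)."""
    if not records:
        return 1.0  # assumption: no traffic counts as fully available
    return sum(1 for r in records if r.succeeded) / len(records)
```

In practice the records would come from whatever the SLI implementation measures, for example load balancer logs or client-side telemetry; that choice is exactly what the implementation half of the definition pins down.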

SRE SLO: Service Level Objectives (SLO)

Service level objectives are typically defined as an SLI threshold or range: a range of values of a service level indicator that must be maintained for the service to be considered acceptably reliable. An example would be the response latency of an application, defined to stay within a range of 1–2 ms for the service to count as reliably available.

Within site reliability engineering, SLOs are used to define the levels of acceptable service availability. This is typically then communicated to all stakeholders in the organization, focusing on each of their specific deliverables on which everyone can agree.

Essentially, an SLO defines the operational boundaries of how reliable a service should be. It is also recommended that SLOs built on these SLIs be specified as SMART goals: they need to be Specific, Measurable, Achievable, Relevant, and Time-bound, so that they are precise and can be clearly measured.

SRE SLA: Service Level Agreement (SLA)

SLAs are the contractual agreements that define the levels of service that can typically be expected by the end users of the service (i.e., customers), as defined through service level objectives.

A note about SLAs: SLAs typically also include the consequences of missing such SLO commitments, usually some form of compensation (monetary, service credits, etc.). It is important to note that site reliability engineering doesn’t often involve SLAs, as it is more focused on the definition of SLOs and SLIs. Defining SLAs often involves business, product, and legal entities; however, the ramifications of missing SLAs need to be factored into SLOs and SLIs during their definition.

Site Reliability Engineering Considerations

A developer's view

As the name suggests, DevOps combines the two fields of development and operations. While developers are trying to solve a problem, site reliability engineers will focus on deployment, running, and maintenance to ensure the reliability of a service.

Developers need to adhere strictly to the SLO and SLI parameters, and work within them, during the design and development stages of application services. Any deviation from these will ultimately affect the SLA commitments, with potentially punitive consequences for the business.

As an example, a developer working on a new application may need to ensure that an API endpoint answers REST API calls within a 50 ms response time for 99% of each day's requests. To meet this SLO, the developer would need to consider SLIs such as the “endpoint response time” (latency) as well as the “daily uptime” (availability), and ensure that the application can consistently meet the requirement.
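A check like the one in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, "within 50 ms" is read as less than or equal to 50 ms, and a day with no requests is treated as meeting the SLO.

```python
from typing import Sequence


def meets_latency_slo(
    latencies_ms: Sequence[float],
    threshold_ms: float = 50.0,  # SLO threshold from the example
    target: float = 0.99,        # fraction of requests that must be fast enough
) -> bool:
    """Return True if at least `target` of the requests finished within the threshold."""
    if not latencies_ms:
        return True  # assumption: no traffic means no violation
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms) >= target


# One day of (made-up) measurements: 99 fast requests and 1 slow one.
day = [20.0] * 99 + [180.0]
print(meets_latency_slo(day))
```

A developer could run a check of this kind against load-test results before release, so that a regression against the SLO shows up during development rather than in production.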

An infrastructure view

Site reliability engineers will ensure that the SLOs are also adhered to once an application service has been deployed into production infrastructure. They will continuously monitor and measure SLO performance and make the necessary adjustments to both the application and the underlying infrastructure to maintain the SLO in the short and long term. SREs will also ensure that ongoing engineering work, such as service improvements or bug fixes, is appropriately prioritized based on the SLO and on its impact on the error budget (the amount of errors that can be tolerated under the SLO).

Consider the same example as above. With an SLO requiring a REST API response time of under 50 ms for 99% of each day's requests, an application that receives 1 million API requests a day has an error budget of 10,000 requests per day that may take longer than 50 ms (1,000,000 × 1/100). An SRE may therefore decide to prioritize the engineering work that does the most to protect this error budget, such as bug fixes that improve uptime and performance ahead of new features, thereby increasing reliability.
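The error budget arithmetic above can be written as a small Python sketch. The function names are hypothetical, and the SLO target is passed as a whole-number percentage so the calculation stays in exact integer arithmetic.

```python
def error_budget(total_requests: int, slo_target_percent: int) -> int:
    """Number of requests per period that are allowed to miss the SLO."""
    return total_requests * (100 - slo_target_percent) // 100


def budget_remaining(
    total_requests: int, slo_target_percent: int, slow_requests: int
) -> int:
    """Budget left after `slow_requests` have already missed the SLO.

    A negative result means the SLO has been breached for the period.
    """
    return error_budget(total_requests, slo_target_percent) - slow_requests


# The example from the text: 1 million requests a day with a 99% target.
print(error_budget(1_000_000, 99))             # 10000
print(budget_remaining(1_000_000, 99, 2_500))  # 7500
```

Tracking the remaining budget over the day gives SREs a simple signal: while it is comfortably positive, riskier changes can ship; as it approaches zero, reliability work takes priority.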

Other SRE Considerations to Keep in Mind

SLI selection

Selecting an appropriate number of key SLIs is vital for evaluating the reliability of a service efficiently. Too few indicators will not provide an accurate picture of reliability, while too many will make measuring and tracking it impractical.

Relevance of the end user

It is important to select these SLIs to be as closely representative of the ultimate user experience as possible. After all, there is no point in an application that is delivering sub-2ms latency on the server side for API requests, if the user is not able to see the results due to a latency problem with the page’s JavaScript.

Distributions over averages

Also, most SLI metrics are better used as distributions rather than averages. Averages tend to mask peaks and troughs; using averages as SLI measurements can therefore lead to incorrect diagnoses of an application's reliability. For example, an application SLI which requires a typical API request to be served within 50ms on average may not detect 5% of the API requests taking 30 times longer over the course of a day. As such, it is advisable to use percentiles, such as the 99th percentile, which will cover plausible worst-case values that can impact the user experience and therefore the perceived reliability of the application service.
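The masking effect of averages can be shown with a short Python sketch. The sample data is made up to match the example above (5% of requests taking 30 times longer than the rest), and the percentile helper is a simple nearest-rank implementation written for illustration, not a function from any monitoring library.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up day of traffic: 95% of requests take 20 ms, 5% take 30 times longer.
latencies_ms = [20.0] * 950 + [600.0] * 50

print(mean(latencies_ms))            # 49.0, which still looks fine against a 50 ms average target
print(percentile(latencies_ms, 50))  # 20.0
print(percentile(latencies_ms, 99))  # 600.0, which exposes the slow requests
```

Here the average stays just under 50 ms even though one request in twenty takes 600 ms, while the 99th percentile makes the problem visible immediately.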

Measurable SLOs

When rolling up SLIs into measurable ranges as SLOs, it is very important to consider what the application service’s target users care about in their experience. This ensures that the SLOs reflect real business impact. Choosing SLIs for an SLO merely because they are easy to measure may otherwise render the SLO meaningless in the eyes of users and business stakeholders, because it can be met while the user experience remains poor.

SLO ranges based on business needs

SLO targets need to be set based on business implications for maximum effect. Keeping SLOs as ranges rather than absolute numbers and avoiding aggregating multiple SLIs can also be good practice, both for effectiveness and for easier stakeholder approval. SLOs should always be a major driver for prioritizing the work of site reliability engineers and developers because they directly represent what the users care about.

Conclusion and Next Steps

Site reliability engineering ensures that development and operations communicate well and share a good understanding, so that they reach the common goal of a robust application service with high levels of reliability, in the form of availability and performance.

Service Level Indicators help define what these reliability factors look like from quantifiable measurements while the Service Level Objectives define the acceptable ranges for these quantifiable measurements. Service Level Agreements then help wrap these into business contracts with clearly defined repercussions should they be missed.

NetApp Cloud Volumes ONTAP can help site reliability engineers meet various site reliability engineering challenges and make DevOps better. To find out more, read about all the SRE benefits that come with Cloud Volumes ONTAP.

FAQs

What is SLI in SRE?

In Site Reliability Engineering, SLI refers to the service level indicator which is a numerical indicator that can be measured to gauge the reliability of an application service. An example would be the “Application latency” for a web application.

What is SLO in SRE?

In Site Reliability Engineering, SLO refers to a range of values of a service level indicator (a numerical indicator that can be measured to gauge the reliability of an application service) within which the service is considered reliable. An example would be “Maximum application response time is to be within 1–2 ms latency, 99% of the time during a day”.

What is SRE SLA?

A service level agreement is a business agreement to meet a specific set of service level objectives, including any punitive damages awarded if they are not met. SREs typically don’t focus on SLAs, though these types of agreements can help SREs define SLOs and SLIs.

Yifat Perry, Product Marketing Lead