What is Site Reliability Engineering (SRE)
Updated: Jun 5, 2020
When it comes to Site Reliability Engineering, SRE for short, the resources available online are largely limited to the books published by Google itself. They share some useful case studies that help us understand what SRE is and how to interpret its concepts, but they do not clearly explain how to build your own SRE team for your organisation. The concept of SRE was cooked fresh within the walls of Google and later released to the general public as a practice for anyone to follow.
In this article I would like to give a brief introduction to SRE and why it is important to any Software Engineering organisation. This is based on my experiences and learnings from leading a Site Reliability Engineering team for one of the leading organisations in the US.
At the early stages of any new software product, the user base is very small, and the primary focus is on delivering features as quickly as you can to reach a stable position in the market. During this period, you might get a few tickets that the developers can handle themselves, and the same goes for DevOps tasks. But as the system grows, the developers will have to focus primarily on development, and you will have to start hiring Support Engineers and SysAdmins to take care of the operational tasks. But what happens when the system grows so large that the SysAdmins are no longer able to tackle it by themselves? You will have to hire more SysAdmins and Support Engineers to take care of the reliability of the system. As the system grows, the cost of operational tasks grows linearly with it, and where will it stop?
The day-to-day responsibilities of Software Engineers and Operations Engineers keep increasing, and growing organisations need to find approaches that keep the system as stable as possible. You need your site to become more and more reliable as it grows, in terms of scalability, availability and other aspects. If you fail to keep up with your customers' expectations, your product will fail in the industry and lose its traction. How can we tackle problems like this in the real world whilst ensuring that the operational costs stay within our budgets? The burning question that was asked a long time back was:
It is a truth universally acknowledged that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—be run?
This is the basic need for Site Reliability Engineering, where a specific set of engineers build their own set of practices to ensure that the site is reliable at any given point in time. In any growing system, you need a set of engineers who will look for new ways to improve the stability of production systems with proper monitoring and automation-first practices.
What is Site Reliability Engineering
Google founded SRE in 2003 as a framework for operating large-scale production systems in a reliable manner. This may sound like an operations function, but it is not. According to the founder of Google's SRE team, Ben Treynor:
SRE is what happens when you ask a Software Engineer to design an operations function for your system.
It's a very versatile approach which allows you to reliably run mission-critical systems, no matter how small or big the system is.
SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. A Site Reliability Engineer will spend up to 50% of their time doing operational work such as being on call, manual workloads and documentation. For the remaining 50% of the time, an SRE is expected to do actual development, such as new features, deployments and automation tasks. A system managed by SREs is meant to be self-healing and very proactive. SRE owns the entire production environment and has to ensure that the site is reliable, no matter what gets released to production.
In my opinion, an ideal SRE is a software engineer with a strong background in administering and operating production systems. From what I see, you can do Site Reliability Engineering without having a Site Reliability Engineer, and you may already have engineers playing the role of SRE without even having an SRE team.
Site reliability engineering is a cross-functional role, assuming responsibilities traditionally siloed off to development, operations, and other IT groups. SREs will seek to automate everything that comes their way, to make room for actual engineering work rather than manual labor.
Demand for Site Reliability Engineering
The usual question about SRE is whether it is suitable for small organisations. This is highly debatable, but my belief is that it is. Even in a small organisation, there is always someone who takes care of the operations work from time to time. As I said earlier, you may already have SREs working under you without even knowing it. SRE has grown as a practice for larger organisations, but small organisations can adopt the practices just as well, even without establishing an SRE team.
The mindset of an SRE is different from that of a Software engineer or an Operations Engineer. SREs always think of ways to automate most of the operations work, rather than just doing them. This mindset is something that needs to grow within, where you think of ways and tools to alert, monitor, do, and automate most of the tasks you are doing at the moment, in order to make the system more reliable.
As an SE, you will grow the depth of your knowledge on a single area. But as an SRE, you will grow the breadth of your knowledge on a vast area by learning about different technologies available in the industry.
The demand for Site Reliability Engineers has grown rapidly throughout the world, and the average salary of an SRE is higher than that of an SE. Currently, if you search for SRE positions on Glassdoor, you will find over 70k positions available worldwide. The demand is growing rapidly as organisations start to understand the value of SREs in keeping the site reliable. If you really want to be a Site Reliability Engineer, ask yourself the following questions.
Do you want to improve your coding skills in terms of scripting, dashboarding, monitoring and alerting?
Are you interested in learning about how complex production systems work?
Do you possess the leadership and communication skills to communicate with different stakeholders?
Are you willing to research and read about new technologies in the market? (You need to become a Jack of all Trades in terms of the Software Industry.)
DevOps vs SRE
DevOps is a set of practices that combines software development (Dev) and IT operations (Ops), and aims to shorten the systems development life cycle and provide continuous delivery with high software quality. SRE and DevOps are highly related to each other, because they both work towards the same targets. But the way SRE sees the system is different from a traditional DevOps culture. There is a common saying in software terms, as follows.
SRE Implements DevOps
First let's understand the 5 key pillars of success of DevOps.
DevOps - 5 Key Pillars of Success
How SRE satisfies these 5 Pillars
1) Reduce organisational silos
SRE shares ownership with developers to create a shared responsibility
SREs use the same tools that developers use, and vice versa
2) Accept failure as normal
SREs embrace risk
SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
SRE mandates blameless postmortems
3) Implement gradual changes
SRE encourages developers and product owners to do small deployments gradually to reduce the cost of failure
4) Leverage tooling and automation
SREs have a charter to automate menial tasks (called "toil") away. I will explain a bit more on toil later.
5) Measure everything
SRE defines prescriptive ways to measure values
SRE fundamentally believes that systems operation is a software problem
What does a Site Reliability Engineer Do?
The role of a Site Reliability Engineer (SRE) is never properly defined anywhere. It's more of a culture and a set of norms built by organisations for tackling production-related matters on their own. Hence, the role of an SRE differs from organisation to organisation. But there is a common set of practices that SREs follow, which includes, but is not limited to, the following.
1) Monitor Almost Everything
Most of the systems we see today are highly distributed, and we very rarely see non-distributed, monolithic architectures. Based on my understanding, the role of an SRE is not limited to just monitoring the distributed system, but extends to monitoring almost everything.
Monitoring can/should be done on the production applications, deployment servers, underlying infrastructure, code quality, and even Mean Time To Deliver a system etc.
2) Ensuring System Compliance with industry standards
The system you are maintaining might have agreed to comply with industry standards like the ISO 27001 security standard and the ISO 9001 quality standard. In this case, there should be a way to monitor whether the system is in line with these standards or whether its compliance is declining over time.
3) Measuring Service Level Objectives (SLOs)
SLOs are a key aspect of any system, and they describe the overall behaviour of a production system. I will explain this more in the sections below. But for now, just assume that this is about measuring the uptime of a system. Have you seen systems stating that they are available 99 percent of the time? The more 9s a system adds to this figure, the less downtime it is allowed and the stricter its target becomes. The following table will give you the idea.
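The arithmetic behind the number of 9s can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime_minutes` and the 30-day window are assumptions made for this example, not part of any standard.

```python
# Illustrative arithmetic only: how much downtime an availability target permits.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime a target permits over `days` days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


if __name__ == "__main__":
    # The 30-day window is an assumption; pick whatever period you report on.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Each extra 9 divides the permitted downtime by ten, which is why every additional 9 is much harder to honour than the previous one.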
4) Provide Compensation for SLA breaches
Many production applications have licensed or paying customers, and we need to provide a reliable system to them. If the system is not reliable enough for the customers, they will ask why they should keep paying for this software. Even highly reliable systems can go down unexpectedly. But based on past metrics, every company defines its availability, and a breach of this number will require the company to pay back the customers in terms of cash, credit, or discounts.
But this is something the customers cannot monitor. This can only be monitored by SLOs, and this is where SRE comes into play. Have a look at how Google Cloud Platform penalises its own services, if they fail to adhere to the SLAs. The below screenshot is from Google Compute Engine. (Compute Engine SLA).
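As a rough sketch of how such compensation could be calculated, the snippet below maps a measured monthly uptime to a credit percentage. The tier values in `CREDIT_TIERS` are hypothetical and invented for this example; they are not taken from Google's or any other provider's SLA.

```python
# Hypothetical credit tiers: (minimum uptime percent, credit as percent of the bill).
# These numbers are made up for illustration and do not come from any real SLA.
CREDIT_TIERS = [
    (99.99, 0),
    (99.0, 10),
    (95.0, 25),
    (0.0, 50),
]


def uptime_percent(total_minutes: int, downtime_minutes: float) -> float:
    """Percentage of the billing period during which the service was up."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def credit_percent(measured_uptime: float) -> int:
    """Return the credit owed for the first tier the measured uptime satisfies."""
    for minimum_uptime, credit in CREDIT_TIERS:
        if measured_uptime >= minimum_uptime:
            return credit
    return CREDIT_TIERS[-1][1]
```

For instance, 432 minutes of downtime in a 30-day month (43,200 minutes) gives 99% uptime, which falls into the second hypothetical tier.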
5) Automate Everything
You should try and reduce the level of manual work you do as an SRE. You will have to build a lot of automation scripts in order to make sure that you can just sit back and have a cup of coffee while your system is running smoothly. So, as an SRE, your first question should be to ask the following.
Can I automate this task as well?
6) Provide On-Call Support for Major Incidents
SRE is not a support engineer position. The first call regarding complaints should come to the support engineers. Then, if it is a High Severity incident, the Support Team decides to wake up the SRE Team.
In this case the SRE is responsible for analysing the incident and waking up the others required to solve this crisis. I use the word crisis because this process should not happen unless it is defined to be an organisation-wide incident.
SRE will take care of the incident from top to bottom, and after the incident is resolved, SRE will create a postmortem document and hold a retrospective with the leads to ensure that this will not happen again. These postmortems should be blameless, and conducted only as a learning exercise.
7) Communication with Development Teams and Management
SRE needs to ensure they build a good rapport with the development teams, and that they provide the necessary details to the management when needed. The management will depend on the metrics provided by SRE to make many business decisions in the organisation.
Service Level Terminology
Have you ever wondered how to measure the behaviour of a service? How could you actually measure whether a production application is running smoothly or not? We sometimes go with the gut feeling and determine if the users are happy, where the conclusion would be that the service is running smoothly. These applications could be internal APIs, or even Public Applications used by the general public. Nevertheless, the service should have proper metrics that we can investigate to measure the quality of the application. Some applications might behave as intended for some users, and some might not. This is where we need to define levels of service to the user, so that they understand what to expect in an application when using it. This does not indicate the actual features/requirements provided by the application. These define how the application behaves in a live production environment.
This is where we need to introduce proper metrics and keep monitoring them so that stakeholders are aware of the behaviour of the application over time. In Site Reliability Engineering, there are three main concepts around which these metrics are defined and collected.
Service Level Objectives
Service Level Indicators
Service Level Agreements
These measurements describe the basic properties the application should have (SLI), what values we want these metrics to have or maintain (SLO), and how we should react if we are not able to provide the expected service (SLA). Defining these metrics is very important for SREs to understand the behaviour of the application and to be confident about the production environment.
The term Service Level Agreement (SLA) is something we are all familiar with, but the word has taken different forms in the Software Industry, based on the context of its usage. This section intends to explain the terminology in depth, so that readers have exact definitions and defining these metrics becomes crystal clear.
Service Level Indicator - SLI
This is a carefully defined quantitative measure of some aspect of the level of service that is provided. Some values that we actually need to measure might not be directly available for us to monitor. For example, the network delays on the client side might not be directly measurable by our monitoring tools. For that reason, these might not be usable as metrics, and some other aspects will have to stand in for them. The most commonly used SLIs are given below.
Request Latency: How long it takes to return a response to a request.
Error Rate: The number of failed requests over the total number of requests.
System Throughput: This is measured in requests per second. E.g. the service can handle 20 requests per second.
Availability: The fraction of the time the service is usable. This is directly related to the other SLIs defined above, and how they add up to the definition of "Availability" is a separate discussion.
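The first three indicators above can be sketched from a list of request records. This is a minimal illustration, assuming a hypothetical `Request` record with a latency and a success flag; real monitoring tools usually compute these values for you.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """Hypothetical request record used only for this example."""
    latency_ms: float
    succeeded: bool


def error_rate(requests):
    """Failed requests divided by total requests."""
    failed = sum(1 for r in requests if not r.succeeded)
    return failed / len(requests)


def latency_percentile(requests, percentile):
    """Nearest-rank percentile of request latency in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests handled per second over the observation window."""
    return len(requests) / window_seconds
```

Latency is reported as a percentile rather than an average here, since a handful of very slow requests can hide behind a healthy-looking mean.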
Service Level Objective - SLO
This is a target value or a range of values for a service level that is directly measurable by an SLI. Deciding on a proper Service Level Objective is a bit tough and opinion-based. What matters is whether it can be monitored, because an SLO which you cannot monitor will not have any value at all. For example, measuring the network delays on the user side is impossible unless you maintain a frontend client app to do so. Hence, having that as an SLO is not very practical.
Having a proper SLO defined for the application is critical not only for the management, but also for the users of the application. It sets the expectations for everyone on how the application will perform. If someone complains that the application is running very slowly, we can correlate this with the metrics gathered for the SLOs to see whether the affected user's experience has been properly captured in them. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is.
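Checking measured SLIs against their SLO targets could look something like the sketch below. The `Slo` class, its field names and the example targets are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO: a named target for one measured indicator."""
    name: str
    target: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # Availability should stay above its target, latency below it.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def evaluate(slos, measurements):
    """Map each SLO name to True (met) or False (missed)."""
    return {slo.name: slo.is_met(measurements[slo.name]) for slo in slos}
```

With targets of 99.9 for `availability_percent` and 300 for `p95_latency_ms`, a month measured at 99.95 and 350 would meet the first objective and miss the second.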
Service Level Agreements - SLA
This is the legally binding agreement which answers the question, "What happens if the SLOs are not met?" This agreement speaks directly to the customer and communicates the consequences of failing to maintain a defined SLO. If the SLOs are met, the customer is happy, and if they are not met, the service provider will have to pay a penalty (in some form) back to the customer. This mostly concerns applications involving licenses and paid subscriptions.
The SRE Team does not get involved in deciding the SLAs, because SLAs are closely tied to business and product decisions. But SRE will get involved in taking actions if the SLOs are not met as per the SLA.
Some organisations might not have a direct SLA with their customers, but an implicit one. For example, Google Search does not have an agreement with its users. But still, if the search results are generated slowly or incorrectly, the organisation will end up paying a penalty, which is its reputation. Nevertheless, SLOs and SLIs are important, and you can decide later on how to implement an SLA for the service being provided.
What is Toil
This section will explain the concept of Toil in Site Reliability Engineering. As Site Reliability Engineers, we are required to perform a certain amount of operational activities in our day-to-day processes. That being said, if these operational activities turn into Toil, they should be taken off the SREs' plates. As SREs, we have much more crucial long-running engineering tasks to carry out, rather than spending most of our time on Toil. So this section will try to give a definition of Toil and explain how SREs should tackle Toil in their day-to-day processes.
What we often see in the Engineering domain is the misuse of the word Toil. Toil is not just work that we have to do regularly, or work we get bored with. Work like writing documentation, conducting meetings and sending out emails cannot be considered Toil. This is merely administrative work, and in management terms it can simply be called Overhead. So, when it comes to understanding Toil, it is definitely not simply work that irritates or discomforts us. These kinds of feelings are highly subjective and can be interpreted in different ways.
Toil is work which generally carries the following characteristics. It does not necessarily need to carry all of the properties below, but at least a combination of them.
1) Manual Work
This includes executing or triggering a script. If a human needs to manually trigger a script in order to execute the steps in it, this is Manual Work. This time can be considered Toil time.
2) Repetitive Work
If you are performing a task once or twice, this is not Toil. But if you have to do this continuously, then it becomes Toil. For example, sending out an email daily to stakeholders is definitely Toil.
3) Automatable Work
If the manual work you are doing can simply be converted to a script or automated program, then that work is definitely Toil. By automating it, you will reduce the human effort needed to execute the task. But if it needs human judgement, like deciding whether something is a bug or a feature, then it is not Toil. This statement is still arguable, since you can use sophisticated tools like Machine Learning to optimise the judgements. In that case, it can be called Toil as well.
4) Tactical Work
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. For example, when an incident happens, we have to create a channel, create pages, create postmortems and so on. This involves a lot of work, and what we do differs from one incident to the next. So it will be hard to fully eliminate this work, but we should definitely work towards reducing it.
5) No Enduring Value
If the operational task did not change the state of the system, then it is definitely Toil. If the work you did changed the performance of the application or added a new feature to the system, then it cannot be considered Toil.
6) O(n) with service growth
If the work you do grows with the size of the system, requiring more resources and taking more time, then it is considered Toil. For example, if you are supposed to send a daily mail on the new incidents you get, and you suddenly get around 50 incidents overnight, then you will have to manually analyse all of them and summarise them in an email. This is something the SREs should try to automate.
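Automating that daily summary could start with something as small as the sketch below. The incident format (a dictionary with a `severity` key) and the function name `summarise_incidents` are hypothetical; in practice the incidents would come from your ticketing or paging tool, and the resulting text would be handed to whatever sends your mail.

```python
from collections import Counter


def summarise_incidents(incidents):
    """Build a plain-text digest from a list of incident dictionaries."""
    # Assumed record shape: {"id": ..., "severity": "high" | "low" | ...}
    by_severity = Counter(incident["severity"] for incident in incidents)
    lines = [f"New incidents: {len(incidents)}"]
    for severity in sorted(by_severity):
        lines.append(f"  {severity}: {by_severity[severity]}")
    return "\n".join(lines)
```

Whether 5 or 50 incidents arrive overnight, the human effort stays the same, which is exactly what removes the O(n) growth from this task.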