
Traditional Definition of SRE

Site Reliability Engineering, or "SRE", is a relatively new title and position that has its roots at Google, where the role was created around 2003, and has since made its way to the broader software community. It has matured and evolved, often taking on different meanings depending on who you talk to or where you go. This can lead to some confusion, as well as blurred lines between SRE and DevOps. What I hope to share is my experience and personal opinion on what the "traditional" intent was for SRE, how that compares to what companies are doing now, and what the term and role of SRE describe to me.

In the traditional sense, and in what Google had in mind when it created the role of SRE, DevOps was to be thought of as a philosophy, and SRE as a prescriptive way of accomplishing that philosophy: taking the developer mindset, workflows, tools, and so on, and applying them to the operations world. Under this definition, DevOps is the "what to do", and SRE is the "how to do it".

However, as mentioned, this "traditional" definition of SRE can be blurred with DevOps, and can vary depending on where you go and who you talk to. Since its inception, SRE has evolved to encompass many different meanings and responsibilities. But the core principles and the reason for its creation, many of which overlap with DevOps, still hold true: to solve the pain point of running infrastructure that you continually roll changes out to. At its core, the objective is to create reliable, redundant, fault-tolerant, immutable infrastructure using infrastructure as code and a set of guiding principles, standards, and workflows. This, of course, comes with the understanding of working closely with both developers and operations.

An SRE mentality is that the idea that "once software is stabilized and deployed in production it needs much less attention" is wrong. We want systems that are automatic, not just automated. According to Google, Ops work should be capped at 50% of an SRE's time. The remaining time should be focused on development work: making the application more reliable and reducing toil by automating repetitive tasks. The Ops percentage should decrease over time, as the work that an SRE does should make the application run and repair itself on its own. The number of SREs needed to run, repair, and monitor a system should scale sub-linearly with the size of the system.
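As a rough illustration of the 50% cap, the check below takes the hours logged against Ops work and the total hours for a period and reports whether the cap has been exceeded. The function names and the idea of logging hours this way are assumptions made for this sketch, not something Google prescribes.

```python
def ops_share(ops_hours: float, total_hours: float) -> float:
    """Return the fraction of working time spent on Ops work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= ops_hours <= total_hours:
        raise ValueError("ops_hours must be between 0 and total_hours")
    return ops_hours / total_hours


def exceeds_ops_cap(ops_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when Ops work takes more than the cap (50% by default)."""
    return ops_share(ops_hours, total_hours) > cap


# Hypothetical week: 26 of 40 hours went to tickets, on-call, and manual work.
print(exceeds_ops_cap(26, 40))
```

If a check like this keeps coming back true, that is a signal to push work back into automation rather than to absorb more manual Ops work.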

SRE Foundational Practices

Forty to ninety percent of business costs come after the birth, or creation, of something, yet most of the effort is put in before it is created. So what happens after? That is where SRE comes in.

Site Reliability Engineers are Software Engineers whose focus is on that forty to ninety percent. They work cross-functionally, but are geared towards business objectives, goals, and saving costs. As such, an SRE's main focus is on the Production environment and, most importantly (as the name suggests), on system reliability: being able to shape the data and protect it from failure.

Concepts

Observability - closely related to monitoring, and something an SRE needs to know a lot about, because being able to see what a system is doing helps keep it reliable.

Observability vs Monitoring - observability is how much monitoring you have in place, or your level / scope of coverage; monitoring is the set of tools you use to visualize and alert in order to have observability.

Redundancy - essentially a way to keep backup copies of data, like having a leader / follower setup. For example, we would want a cluster of databases where one DB takes in the writes and then communicates them to the other nodes in the cluster.
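To make the leader / follower idea concrete, here is a minimal in-memory sketch. The `Node` and `Cluster` classes are hypothetical names invented for illustration, and replication here is synchronous and never fails, which real databases cannot assume.

```python
class Node:
    """A single database node holding its own copy of the data."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Cluster:
    """One leader accepts writes and forwards them to every follower."""

    def __init__(self, leader: Node, followers):
        self.leader = leader
        self.followers = list(followers)

    def write(self, key, value):
        # The leader takes the write first, then communicates it to the others.
        self.leader.apply(key, value)
        for follower in self.followers:
            follower.apply(key, value)

    def read(self, key):
        return self.leader.data[key]

    def fail_over(self):
        """Simulate losing the leader by promoting the first follower."""
        if not self.followers:
            raise RuntimeError("no follower available to promote")
        self.leader = self.followers.pop(0)


cluster = Cluster(Node("db-1"), [Node("db-2"), Node("db-3")])
cluster.write("user:42", "alice")
cluster.fail_over()              # db-1 is gone, db-2 takes over
print(cluster.read("user:42"))   # the data survived the loss of the leader
```

The point of the sketch is only that a write lands on more than one node, so losing the leader does not lose the data.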

Defining Availability Through SLAs, SLOs, and SLIs

Because SREs are geared towards business objectives and goals, and the prerequisite to success is Availability, it falls under the responsibility of SRE to define the Availability of the services the business provides. A system that is unavailable cannot perform its function and will fail by default.

Defining Availability can look like the following:

Define Availability - Whether a system is able to fulfill its intended function at a point in time.

Level of Availability - The more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with for each service.

Plan In Case of Failure - In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

This is done through what are called SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).

Service-Level Objectives

An agreement between stakeholders on how reliable a service should be. A precise numerical target for system availability.
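A numerical availability target translates directly into an amount of downtime you are allowed over a period. The helper below is a small sketch of that arithmetic; the function name `allowed_downtime` and the 30-day period are assumptions chosen for the example.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime an availability SLO tolerates over a period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    unavailable_fraction = (100 - slo_percent) / 100
    return period * unavailable_fraction


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
budget = allowed_downtime(99.9, timedelta(days=30))
print(budget.total_seconds() / 60)
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the cost of running a service climbs so quickly as the target rises.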

Service-Level Agreements

An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. Because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO.

Service-Level Indicators

A direct measurement of a service’s behavior: the frequency of successful probes of our system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage.
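Putting the three terms together, the sketch below computes an availability SLI from probe results and compares it against an internal SLO and a looser SLA. The probe data and the 99.9% / 99.5% targets are made-up values for illustration only.

```python
def availability_sli(probe_results) -> float:
    """Return the percentage of probes that succeeded."""
    results = list(probe_results)
    if not results:
        raise ValueError("at least one probe result is required")
    return 100 * sum(1 for ok in results if ok) / len(results)


def meets_target(sli_percent: float, target_percent: float) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli_percent >= target_percent


# Hypothetical week of probes: 10,000 probes, 10 of which failed.
probes = [True] * 9990 + [False] * 10
sli = availability_sli(probes)

INTERNAL_SLO = 99.9   # assumed internal objective
EXTERNAL_SLA = 99.5   # assumed looser promise made to customers

print(sli, meets_target(sli, INTERNAL_SLO), meets_target(sli, EXTERNAL_SLA))
```

Keeping the SLA below the internal SLO gives the team room to notice and fix a problem before any penalty applies.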

SRE Practices

Broadly speaking, in the "traditional" definition, DevOps describes what needs to be done to unify software development and operations, whereas the "traditional" intent with SRE prescribes how this can be done. While SRE culture prioritizes reliability over the speed of change, DevOps instead accentuates agility across all stages of the product development cycle. However, both approaches try to find a balance between the two poles and can complement each other in terms of methods, practices, and solutions.

1) Reduce Organizational Silos

2) Accept Failure as Normal

3) Implement Gradual Change

4) Leverage Tooling and Automation

5) Measure Everything

The Four Golden Signals

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

1) Latency

The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

2) Traffic

A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

3) Errors

The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

4) Saturation

How "full" your service is. A measure of what fraction of your system's capacity is in use, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.

In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation. Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours." If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
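As a rough sketch of how the four signals could be computed from a window of request records, consider the following. The `Request` record, the one-second latency policy, and the linear disk-growth forecast are all assumptions made for this example; a real monitoring stack would do this for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_s: float   # time taken to serve the request, in seconds


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s: float, latency_policy_s: float = 1.0) -> dict:
    """Summarize latency, traffic, and errors for one window of requests."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    ok = [r.latency_s for r in requests if r.status < 500]
    failed = [r.latency_s for r in requests if r.status >= 500]
    # Errors by policy: served successfully, but slower than we committed to.
    by_policy = sum(1 for latency in ok if latency > latency_policy_s)
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": (len(failed) + by_policy) / len(requests),
        # Latency is tracked separately for successes and failures.
        "p99_success_latency_s": percentile(ok, 99) if ok else None,
        "p99_error_latency_s": percentile(failed, 99) if failed else None,
    }


def hours_until_full(used_gb: float, capacity_gb: float, growth_gb_per_hour: float):
    """Linear forecast of impending saturation; None if usage is not growing."""
    if growth_gb_per_hour <= 0:
        return None
    return max(0.0, capacity_gb - used_gb) / growth_gb_per_hour


window = [Request(200, 0.1)] * 98 + [Request(200, 2.0), Request(500, 0.01)]
print(golden_signals(window, window_s=10))
print(hours_until_full(used_gb=60, capacity_gb=100, growth_gb_per_hour=10))
```

Note how the fast HTTP 500 would drag the overall latency figure down if it were mixed in with the successful requests, which is exactly the distortion described under Latency.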

SRE Principles

Everything Should Be Reproducible

IaC (Infrastructure as Code)

Don't be perfect

Iterate over time

Plan for failure

Document everything

As Much As Necessary Should Be Automated

Develop tools and systems reducing toil and repetitive work from engineers.

Automate everything, or as much as possible (deployments, maintenances, tests, scaling, mitigation).

Security Should Not Be Ignored

Starting well saves time

Coding standards simplify things

System standards also help

Simple Is Better Than Magical

Naming things appropriately

Time taken to explain systems is better spent developing

Assumptions are bad (chaos engineering)

Explicitness grants transparency

Visibility Is Key, Access Is Not

Monitoring / Metrics / logs

Data extraction

Read only infrastructure

Applications are self managed

Cattle, not pets (special snowflake)

In addition to the concepts and principles above, the general mindset of an SRE when working on any project always has these fundamental propositions in mind:

Monitor everything.

Think scalable from the start.

Build resilient-enough architectures.

Handle change and risk through SLAs, SLOs and SLIs.

Learn from outages.

The primary focus of SRE is system reliability, which is considered the most fundamental feature of any product. The pyramid below illustrates the elements contributing to reliability, from the most basic (monitoring) to the most advanced (reliable product launches).

DevOps Pillars and SRE Practices

Help Create and Maintain Immutable Infrastructure

The main pain point both DevOps and SRE seek to solve is how to build stable infrastructure that you can continually, easily, and reliably roll changes out to.

The mindset shift is that you start thinking in terms of building artifacts (machine images, for example) that you then roll out through your environments. The idea is that you're always dealing with the same thing. You're assured that the thing you have in your configuration is the thing in your test environment, your staging environment, and so on, and is the thing in your production environment.

Infrastructure as code is also a major part of this process and mindset, and is the tool that gives you confidence that what you have defined is what is running in your environments. It also allows you to continually and reliably roll changes out to these various environments. In addition, IaC (Infrastructure as Code) allows the tracking of changes and provides a form of audit log when kept in source control such as Git. This eliminates creating, editing, or updating anything manually via "ClickOps", which, with a large enough infrastructure and many people working simultaneously, can quickly turn into an unfavorable and challenging situation.

Infrastructure as code has thus become the new standard in upholding the set of guiding principles, standards, and workflows that are all part of being a DevOps Engineer and SRE.
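The confidence that "what you have defined is what is running" comes down to comparing declared state against actual state. The toy comparison below sketches that idea with plain dictionaries; the setting names and values are invented for the example, and real IaC tools perform a far richer version of this check (for instance when Terraform produces a plan).

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Return {setting: (declared_value, actual_value)} for every mismatch.

    A setting that is missing on one side is reported as None on that side.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical state: what is committed in Git vs. what is actually running.
declared = {"instance_type": "m5.large", "min_replicas": 3, "image": "app-1.4.2"}
actual = {"instance_type": "m5.large", "min_replicas": 2, "image": "app-1.4.2",
          "debug_port": 9229}  # someone changed things by hand

for setting, (want, have) in detect_drift(declared, actual).items():
    print(f"{setting}: declared={want!r} actual={have!r}")
```

An empty result means the environment matches the code; anything else is a manual change that bypassed source control.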

Benefits of Infrastructure as Code

Lets you functionally test the whole infrastructure and application package as a unit.

Application or infrastructure behavior you see will then start corresponding to a specific Git commit or branch.

Lets you KNOW the production environment is in the state that you think it is. Without it, or if you are just running a playbook, you might not know the state, or the state might not be what you think it is.

Multiple people can work on larger infrastructure projects, and their commits can then be tested individually.

Leads to "pipeline" thinking: build the application, build the image, create the environment, move to staging, and so on.

Benefits can be summarized into:

1. Can trace problems to a single base or a single change in your infrastructure.

2. Progress can be done on branches, not a bunch of people changing things in a random interface.

IaC Tools

Terraform

Ansible

Puppet

Chef

Many new and emerging others

Benefits of These Guiding Principles

In summary, we can break down the technical and business benefits of adopting the SRE mindset into a very condensed list. Overall, the SRE implementation of the DevOps mindset can help organizations ensure higher success rates for releases, reduce the lead time for bug fixes, streamline continuous delivery through automation, and reduce overall manpower costs.

Technical Benefits

Continuous software delivery.

Less complex problems to manage.

Early detection and faster correction of defects.

Business Benefits

Faster delivery of features.

Stable operating environments.

Improved communication and collaboration between the teams.

Improved DevOps KPIs which in turn improve the business.

Mean time to recovery from failure (the average time taken to recover from a failure).

Deployment frequency (how often deployments occur).

Percentage of failed deployments (the share of deployments that fail).
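The three KPIs above are simple to compute once deployments are recorded. The sketch below assumes a hypothetical `Deployment` record and, as a simplification, measures recovery time from the moment of the failed deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None  # set only for failed deployments


def deployment_frequency(deployments, period: timedelta) -> float:
    """Deployments per day over the given period."""
    return len(deployments) / (period / timedelta(days=1))


def failed_percentage(deployments) -> float:
    """Share of deployments that failed, as a percentage."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    return 100 * sum(1 for d in deployments if d.failed) / len(deployments)


def mean_time_to_recovery(deployments) -> Optional[timedelta]:
    """Average time from a failed deployment to its recovery, if any failed."""
    durations = [d.recovered_at - d.deployed_at
                 for d in deployments
                 if d.failed and d.recovered_at is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


start = datetime(2024, 2, 1, 9, 0)
history = [
    Deployment(start),
    Deployment(start + timedelta(hours=5), failed=True,
               recovered_at=start + timedelta(hours=5, minutes=30)),
    Deployment(start + timedelta(days=1)),
    Deployment(start + timedelta(days=1, hours=4), failed=True,
               recovered_at=start + timedelta(days=1, hours=5, minutes=30)),
]
print(deployment_frequency(history, timedelta(days=2)))
print(failed_percentage(history))
print(mean_time_to_recovery(history))
```

Tracking these numbers over time, rather than at a single point, is what shows whether the practices described in this post are actually paying off.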

Written February 24, 2024