Category: Book Summary

  • Book Summary: Becoming an Effective Software Engineering Manager by James Stanier

    Introduction:

    The introduction of the book provides an overview of the role of a software engineering manager, and the skills and qualities needed to excel in this role. The author emphasizes that software engineering managers must be effective communicators, strategic thinkers, and leaders, with the ability to work collaboratively with their team members, stakeholders, and other departments within the organization.

    Part 1: Building and Managing a Team

    The first section of the book, “Building and Managing a Team,” focuses on the importance of building and managing a high-performing software development team. The author emphasizes that technical skills alone are not enough for a successful team, and that team culture, communication, and leadership are equally important.

    The section begins with a chapter on hiring, where the author provides practical advice on how to attract the best talent and build a diverse team. He discusses the importance of developing job descriptions, creating effective interview questions, and evaluating candidates based on their skills, experience, and cultural fit.

    The following chapters focus on team culture and performance management. The author explains how to create a positive team culture that fosters collaboration, innovation, and a sense of ownership among team members. He also provides guidance on how to manage team performance effectively, including how to set goals, provide feedback, and conduct performance evaluations.

    The section concludes with a chapter on coaching, where the author explains how to coach team members to improve their skills, identify and overcome obstacles, and take ownership of their work. He provides practical advice on how to provide constructive feedback, set development goals, and help team members grow professionally.

    Part 2: Project Management

    The second section of the book, “Project Management,” focuses on the importance of effective project management in software development. The author emphasizes that effective project management is key to delivering high-quality software products on time and within budget.

    The section begins with a chapter on project planning, where the author explains how to plan software development projects, including how to identify project goals, create a project plan, and develop a project schedule. He also provides guidance on how to manage project scope, identify and manage risks, and create a project budget.

    The following chapters focus on agile methodologies, including how to use agile methodologies to manage software development projects effectively, how to tailor agile processes to fit the needs of the team and the project, and how to facilitate effective team meetings, stand-ups, and retrospectives.

    The section concludes with a chapter on stakeholder management, where the author emphasizes the importance of effective communication with stakeholders, including how to identify stakeholders, establish communication channels, and manage stakeholder expectations.

    Part 3: Personal Growth and Development

    The final section of the book, “Personal Growth and Development,” focuses on the importance of continuous learning and development as a software engineering manager. The author emphasizes that staying up-to-date with the latest trends and technologies in software engineering is essential to being an effective manager.

    The section begins with a chapter on time management, where the author provides practical advice on how to manage time effectively, including how to prioritize tasks, set realistic deadlines, and avoid distractions.

    The following chapters focus on personal development, including how to set goals, identify areas for improvement, and seek feedback from team members and stakeholders. The author explains how to use feedback to develop new skills, improve performance, and enhance personal growth.

    The section concludes with a chapter on work-life balance, where the author emphasizes the importance of maintaining a healthy work-life balance, including how to set boundaries, manage stress, and prioritize personal well-being.

    Conclusion:

    The conclusion of the book summarizes the key takeaways from each section, and emphasizes the importance of ongoing learning and growth in the software engineering management field. The author encourages readers to apply the principles and techniques presented in the book to their own work as software engineering managers, and to adapt them to fit the needs of their teams and organizations.

    Overall, “Becoming an Effective Software Engineering Manager” provides a comprehensive guide to building and managing high-performing software development teams, managing software development projects effectively, and continuously developing personal and professional skills. The book is highly practical, with numerous real-world examples and case studies, and provides actionable advice that readers can apply immediately in their own work as software engineering managers.

    One of the strengths of the book is its emphasis on the importance of communication and collaboration in software development. The author provides practical advice on how to build a positive team culture, facilitate effective team meetings, and manage stakeholder relationships, all of which are essential to delivering high-quality software products on time and within budget.

    Another strength of the book is its focus on personal development. The author emphasizes the importance of continuous learning and growth as a software engineering manager, and provides practical advice on how to manage time effectively, set goals, seek feedback, and maintain a healthy work-life balance.

    In short, “Becoming an Effective Software Engineering Manager” is a must-read for anyone interested in building and managing high-performing software development teams. It offers practical, actionable advice you can apply immediately, grounded in an ethos of ongoing learning and growth.

  • Book Summary: SRE, Part 4, Best Practices for Building Monitoring and Alerting

    Monitoring is a crucial aspect of Site Reliability Engineering (SRE) because it allows teams to detect, diagnose, and resolve issues in distributed systems. In this article, we’ll explore the principles of monitoring and best practices for monitoring distributed systems.

    First principle: Measure what matters

    Teams should identify key performance indicators (KPIs) that directly impact user experience and business outcomes. These KPIs should be tracked over time, and teams should establish service level objectives (SLOs) that define acceptable levels of performance.

    Second principle: Understand dependencies

    Distributed systems are composed of many components, and it’s essential to understand how they interact with each other. Teams should create dependency diagrams that show the relationships between components and use them to prioritize monitoring efforts.

    Third principle: Define actionable alerts

    Teams should create alerts that trigger when KPIs deviate from acceptable levels. Alerts should be designed to be actionable, meaning they should provide enough context to help teams diagnose and resolve issues quickly. It’s also essential to ensure that alerts are not too noisy, so teams don’t become desensitized to them.
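
    As a rough illustration, here is a minimal sketch of such a check in Python. The service name, threshold, and runbook URL are hypothetical; the point is that the alert stays quiet while the error-rate SLI is within bounds and carries enough context to act on when it fires:

    ```python
    from typing import Optional

    def check_error_rate(total_requests: int, failed_requests: int,
                         threshold: float = 0.001) -> Optional[dict]:
        """Return an actionable alert payload if the error rate breaches the threshold."""
        if total_requests == 0:
            return None
        error_rate = failed_requests / total_requests
        if error_rate <= threshold:
            return None  # within bounds: stay quiet to avoid alert fatigue
        return {
            "service": "shakespeare-frontend",  # hypothetical service name
            "summary": f"error rate {error_rate:.4%} exceeds {threshold:.4%}",
            "window": "last 5 minutes",
            "runbook": "https://example.com/runbooks/error-rate",  # placeholder
        }

    alert = check_error_rate(total_requests=120_000, failed_requests=180)
    if alert:
        print(alert["summary"])  # error rate 0.1500% exceeds 0.1000%
    ```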

    Fourth principle: Automation

    Manual monitoring is error-prone, time-consuming, and difficult to scale. Teams should invest in automated monitoring tools that can detect issues in real time and provide insights into the root cause of a problem.

    Fifth principle: End-to-End monitoring

    Monitoring should cover the entire system, from the user interface to the backend infrastructure. Teams should use synthetic monitoring to simulate user interactions and track performance from the user’s perspective.
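
    For example, a bare-bones synthetic probe might look like the following sketch, which assumes the third-party `requests` library and a placeholder URL. It simulates a user request and measures success and latency from the client’s perspective:

    ```python
    import time
    import requests

    def probe(url: str, timeout: float = 5.0) -> tuple[bool, float]:
        """Issue one synthetic request; return (success, latency in seconds)."""
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=timeout)
            ok = resp.status_code == 200
        except requests.RequestException:
            ok = False
        return ok, time.monotonic() - start

    ok, latency = probe("https://example.com/search?q=hamlet")  # placeholder URL
    print(f"success={ok} latency={latency:.3f}s")
    ```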

    Sixth principle: Perform post-incident analysis (postmortem)

    After an incident, teams should conduct a post-incident analysis to understand what happened, why it happened, and how it can be prevented in the future. This analysis should involve all stakeholders, including developers, operators, and business owners.

    To implement these principles effectively, teams should use a monitoring framework that provides a consistent approach to monitoring. The monitoring framework should define monitoring goals, identify KPIs, establish SLOs, create alerts, and automate monitoring tasks. It should also integrate with other tools and systems, such as incident management tools, log analysis tools, and dashboards.

    In conclusion, monitoring is essential to maintaining the reliability and performance of distributed systems. By following these principles and best practices, teams can develop effective monitoring strategies that help them detect, diagnose, and resolve issues quickly, ultimately improving the user experience and business outcomes.

  • Book Summary: SRE, Part 3, Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs)

    In this article, we are going to learn about the core Site Reliability Engineering (SRE) terminology. It’s important to understand these terms because they are used a lot nowadays in the software industry. I know that learning terminology might sound boring or complex, but I will try to make it as simple and practical as possible. We will use the Shakespeare service explained in part 1 as an example service, so please make sure you check that first. It’s also important to check part 2, where we talked about error budgets, if you haven’t already. Without further ado, let’s start with Service Level Indicators (SLIs).

    SLI or Service Level Indicator

    An SLI, or Service Level Indicator, is a metric (a number) that helps us define how our service is performing. For example:

    • Request Latency: how long it takes to return a response to a request.
    • Error Rate: the fraction of requests with errors (e.g. an API returning 500).
    • System throughput: how many requests we receive per second.
    • Availability: the fraction of well-formed requests that succeed. 100% availability is impossible, but near-100% availability is achievable. We express high-availability values in terms of the number of “nines” in the availability percentage. For example, availability of 99.99% can be referred to as “4 nines” availability.
    • Durability: the likelihood that data will be retained over a long period of time. It’s important for data storage systems.

    There are more metrics we can collect to give us more insight into our system’s health, but how can we actually identify which metrics are meaningful to our system? The answer is simple: “It depends!” It depends on what you and your users care about.

    We shouldn’t use every metric we can track in our monitoring system as an SLI. Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of our system unexamined. We typically find that a handful of representative indicators are enough to evaluate and reason about a system’s health. Services tend to fall into a few broad categories in terms of the SLIs they find relevant:

    • User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
    • Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it?
    • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion?

    To use those metrics as SLIs, we need to collect and aggregate them on the server side, using a monitoring system such as Prometheus. However, some systems should be instrumented with client-side collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics. For example, concentrating on the response latency of the Shakespeare search backend might miss poor user latency due to problems with the page’s JavaScript: in this case, measuring how long it takes for a page to become usable in the browser is a better proxy for what the user actually experiences.
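
    As a rough illustration, here is a tiny sketch of computing availability and latency SLIs from raw request records. The data shape is assumed; in practice the monitoring system does this aggregation for you:

    ```python
    # Each record is (latency_ms, status_code); the shape is assumed for illustration.
    requests_log = [(42, 200), (88, 200), (130, 500), (95, 200), (61, 200)]

    total = len(requests_log)
    successes = sum(1 for _, status in requests_log if status < 500)
    availability = successes / total                           # availability SLI
    avg_latency = sum(lat for lat, _ in requests_log) / total  # latency SLI

    print(f"availability={availability:.2%}, avg latency={avg_latency:.1f} ms")
    # availability=80.00%, avg latency=83.2 ms
    ```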

    SLO or Service Level Objective

    An SLO, or Service Level Objective, is a target value or range of values for a service level that is measured by an SLI. For example, we can set the SLOs for the Shakespeare service as follows:

    • average search request latency should be less than 100 milliseconds
    • availability should be 99.99%, which means the error rate should be at most 0.01%

    SLOs should specify how they’re measured and the conditions under which they’re valid. For instance, we might say the following:

    • 99% (averaged over 1 minute) of Get requests will complete in less than 300 ms (measured across all the backend servers).

    If the shape of the performance curve is important, then you can specify multiple SLO targets (see the sketch after this list):

    • 90% of Get requests will complete in less than 100 ms.
    • 99% of Get requests will complete in less than 300 ms.
    • 99.9% of Get requests will complete in less than 500 ms.
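
    Here is a minimal sketch of checking multiple latency targets like these against an observed sample, using a simple nearest-rank percentile. The sample values are illustrative, not from the book:

    ```python
    import math

    def percentile(sorted_values: list, pct: float) -> float:
        """Nearest-rank percentile over a pre-sorted sample."""
        k = math.ceil(pct / 100 * len(sorted_values)) - 1
        return sorted_values[max(0, k)]

    latencies_ms = sorted([20, 25, 30, 35, 40, 45, 50, 60, 70, 95])  # illustrative sample
    targets = [(90, 100), (99, 300), (99.9, 500)]  # (percentile, max latency in ms)

    for pct, limit in targets:
        observed = percentile(latencies_ms, pct)
        status = "OK" if observed < limit else "VIOLATED"
        print(f"p{pct}: {observed} ms (target < {limit} ms) -> {status}")
    ```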

    It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both. Instead, it is better to allow an error budget.

    So, how can we actually choose targets (SLOs)? Here are a few lessons from Google that can help:

    • Keep it simple. Complicated aggregations in SLIs can obscure changes to system performance, and are also harder to reason about.
    • Avoid absolutes. While it’s tempting to ask for a system that can scale its load “infinitely” without any latency increase and that is “always” available, this requirement is unrealistic.
    • Have as few SLOs as possible. Choose just enough SLOs to provide good coverage of your system’s attributes. If you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.
    • Perfection can wait. You can always refine SLO definitions and targets over time as you learn about a system’s behavior. It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.

    SLOs should be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A poorly thought-out SLO can result in wasted work if a team goes to extreme efforts to meet it, or in a bad product if it is too loose.

    SLA or Service Level Agreement

    An SLA, or Service Level Agreement, is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains.

    SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE helps to avoid triggering the consequences of missed SLOs. They can also help to define the SLIs.

    Conclusion

    An SLI is a metric that helps us define how our service is performing, for example the request latency or the error rate. An SLO is a target value for a service level that is measured by an SLI, for example “request latency should be less than 100 milliseconds” or “availability should be 99.99%”, which means an error rate of at most 0.01%. An SLA is an explicit or implicit contract with the users that includes consequences of meeting (or missing) the SLOs it contains.

    Next, we are going to learn more about how to automate boring and repetitive tasks.

  • Book Summary: Site Reliability Engineering, Part 2, Error Budgets and Service Level Objectives (SLOs)

    It would be nice to build 100% reliable services, ones that never fail, right? Absolutely not. Attempting that would actually be a bad idea: it’s very expensive, and it limits how fast new features can be developed and delivered to users. Also, users typically won’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are using. With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.

    Here is how we measure availability for a service:

    Aggregate availability = successful requests / total requests

    For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
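
    In code, the arithmetic above looks like this:

    ```python
    # Allowed errors for one day, given the traffic and availability target above.
    daily_requests = 2_500_000
    availability_target = 0.9999

    allowed_errors = daily_requests * (1 - availability_target)
    print(round(allowed_errors))  # 250
    ```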

    Why Error Budgets

    There is always tension between product development teams and SRE teams, given that they are generally evaluated on different metrics. Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change.

    For example, let’s say we want to define the push frequency for a service. Given that every push is risky, SRE will push for fewer deployments. On the other hand, the product development team will push for more deployments because they want their work to reach the users.

    Our goal here is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way. The more data-based the decision can be, the better.

    How to define Your Error Budget?

    In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO. The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

    Our practice is then as follows:

    • Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
    • The actual uptime is measured by our monitoring/observability system.
    • The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
    • As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.

    For example, imagine that a service’s SLO is to successfully serve 99.999% of all queries per quarter. This means that the service’s error budget is a failure rate of 0.001% for a given quarter. If a problem causes us to fail 0.0002% of the expected queries for the quarter, the problem spends 20% of the service’s quarterly error budget.

    The Benefits of Error Budgets

    The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

    Many products use this control loop to manage release velocity: as long as the system’s SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on. More subtle and effective approaches than this simple on/off technique are available: for instance, slowing down releases or rolling them back when the SLO-violation error budget is close to being used up.
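
    As a sketch, such a control loop could look like the following. The thresholds and the “slow down” band are illustrative assumptions, not prescriptions from the book:

    ```python
    def release_decision(budget_total: float, budget_spent: float) -> str:
        """Map remaining error budget to a release policy."""
        remaining = budget_total - budget_spent
        if remaining <= 0:
            return "halt releases; invest in reliability work"
        if remaining / budget_total < 0.2:  # nearly drained (assumed threshold)
            return "slow down releases; add extra testing"
        return "releases may continue"

    # SLO of 99.999% per quarter -> error budget of 0.001%, as in the example above.
    print(release_decision(budget_total=0.00001, budget_spent=0.000002))
    # releases may continue (only 20% of the budget is spent)
    ```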

    For example, if product development wants to skimp on testing or increase push velocity and SRE is resistant, the error budget guides the decision. When the budget is large, the product developers can take more risks. When the budget is nearly drained, the product developers themselves will push for more testing or slower push velocity, as they don’t want to risk using up the budget and stall their launch. In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken.)

    What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget. As a result, the number of new pushes may be reduced for the remainder of the quarter. The entire team supports this reduction because everyone shares the responsibility for uptime.

    The budget also helps to highlight some of the costs of overly high reliability targets, in terms of both inflexibility and slow innovation. If the team is having trouble launching new features, they may elect to loosen the SLO (thus increasing the error budget) in order to increase innovation.

    Conclusion

    • Managing service reliability is largely about managing risk, and managing risk can be costly.
    • 100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
    • An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases, to defuse discussions about outages with stakeholders, and to let multiple teams reach the same conclusion about production risk without friction.

  • Book Summary: Site Reliability Engineering, Part 1, How a service would be deployed at Google scale

    How do you deploy an application so that it works well at large scale? Of course, there is no easy answer to such a question; it would probably take an entire book to explain. Fortunately, in the Site Reliability Engineering book, Google briefly explains what it might look like.

    They explain how a sample service would be deployed in the Google production environment. This gives us insight into how complex things can get when deploying even a simple service that serves millions of users around the world.

    Suppose we want to offer a service that lets you determine where a given word is used throughout all of Shakespeare’s works. It’s a typical search problem which means that it can be divided into two components:

    1. Indexing and writing the index into a Bigtable. Indexing can be run once or repeatedly, depending on the problem (in Shakespeare’s case, running it once is enough). It can be implemented using MapReduce (scroll down for a simpler example of a MapReduce task): split Shakespeare’s works into hundreds of parts, assign each part to a worker, and run all workers in parallel; the workers send their results to a reducer task, which creates a tuple of (word, list of locations) and writes it to a row in a Bigtable, using the word as the key.
    2. A frontend application for users to be able to search for words and see the results.

    Here is how a user request will be served:

    [Figure: how a user request is served]

    First, the user goes to shakespeare.google.com to obtain the corresponding IP address from Google’s DNS server, which talks to GSLB to pick which server IP address to send to this user (1). The browser then connects to the HTTP server on this IP. This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2).

    The GFE looks up which service is required (web search, maps, or—in this case—Shakespeare). Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).

    The Shakespeare frontend server now needs to contact the Shakespeare backend server: The frontend server contacts GSLB to obtain the BNS address of a suitable and unloaded backend server (4).

    That Shakespeare backend server now contacts a Bigtable server to obtain the requested data (5).

    The answer is returned to the Shakespeare backend server. The backend hands the results to the Shakespeare frontend server, which assembles the HTML and returns the answer to the user.

    This entire chain of events is executed in the blink of an eye—just a few hundred milliseconds! Because many moving parts are involved, there are many potential points of failure; in particular, a failing GSLB would break the entire application. How can we protect our application from single points of failure and make it more reliable? That’s what will be covered in the next section.

    Ensuring Reliability

    Let’s assume we did load testing on our infrastructure and found that one backend server can handle about 100 queries per second (QPS). Let’s also assume that we expect a peak load of about 3,500 QPS, so we need at least 35 replicas of the backend server. In fact, we need 37 tasks in the job, or N+2, because:

    • During updates, one task at a time will be unavailable, leaving 36 tasks.
    • A machine failure might occur during a task update, leaving only 35 tasks, just enough to serve peak load.

    A closer examination of user traffic shows our peak usage is distributed globally:

    • 1,430 QPS from North America,
    • 290 QPS from South America,
    • 1,400 QPS from Europe and Africa,
    • 350 QPS from Asia and Australia.

    Instead of locating all backends at one site, we distribute them across the USA, South America, Europe, and Asia. Allowing for N+2 redundancy per region means that we end up with:

    • 17 tasks in the USA,
    • 16 in Europe,
    • 6 in Asia,
    • 5 in South America.

    However, we decided to use 4 tasks (instead of 5) in South America, to lower the overhead from N+2 to N+1. In this case, we’re willing to tolerate a small risk of higher latency in exchange for lower hardware costs: if GSLB redirects traffic from one continent to another when our South American datacenter is overloaded, we can save 20% of the resources we’d spend on hardware. In the larger regions, we’ll spread tasks across two or three clusters for extra resiliency.
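
    As a rough sketch, the capacity arithmetic used above fits in a few lines (the QPS figures come from the text; the helper function is ours):

    ```python
    import math

    def tasks_needed(peak_qps: float, qps_per_task: float, redundancy: int = 2) -> int:
        """Tasks required to serve peak load, plus spare capacity (N+2 by default)."""
        return math.ceil(peak_qps / qps_per_task) + redundancy

    print(tasks_needed(3500, 100))               # 37 tasks globally (N+2)
    print(tasks_needed(1430, 100))               # 17 tasks in the USA (N+2)
    print(tasks_needed(290, 100, redundancy=1))  # 4 tasks in South America (N+1)
    ```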

    Because the backends need to contact the Bigtable holding the data, we need to also design this storage element strategically. A backend in Asia contacting a Bigtable in the USA adds a significant amount of latency, so we replicate the Bigtable in each region. Bigtable replication helps us in two ways:

    1. It provides resilience when a Bigtable server fails.
    2. It lowers data-access latency.

    Conclusion

    This was just a quick introduction to what designing a reliable system looks like. Of course, reality is much more complicated than this. Next, we will take a deeper look at some SRE terminology and how to implement it in our organisations.

    A simpler example of MapReduce

    This is a very simple example of MapReduce. No matter the amount of data you need to analyze, the key principles remain the same.

    Assume you have ten CSV files with three columns (date, city, temperature). We want to find the maximum temperature for each city across the data files (note that each file might have the same city represented multiple times).

    Using the MapReduce framework, we can break this down into ten map tasks, where each mapper works on one of the files. The mapper task goes through the data and returns the maximum temperature for each city.

    For example, (Cairo, 45) (Berlin, 32) (Porto, 33) (Rome, 36)

    After all map tasks are done, the output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result.

    For example, (Cairo, 46) (Berlin, 32) (Porto, 38) (Rome, 36) (Barcelona, 40), …
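
    To make this concrete, here is a toy MapReduce in Python for the max-temperature problem, using small in-memory “files” instead of real CSVs:

    ```python
    from collections import defaultdict

    def mapper(rows):
        """Emit the max temperature per city within one file."""
        local_max = {}
        for _date, city, temp in rows:
            local_max[city] = max(temp, local_max.get(city, temp))
        return local_max.items()

    def reducer(mapped_outputs):
        """Combine all mappers' outputs into a single max per city."""
        final = defaultdict(lambda: float("-inf"))
        for pairs in mapped_outputs:
            for city, temp in pairs:
                final[city] = max(final[city], temp)
        return dict(final)

    # Tiny in-memory stand-ins for two of the ten CSV files.
    file1 = [("2023-01-01", "Cairo", 45), ("2023-01-02", "Berlin", 32)]
    file2 = [("2023-01-03", "Cairo", 46), ("2023-01-04", "Porto", 38)]
    print(reducer([mapper(file1), mapper(file2)]))
    # {'Cairo': 46, 'Berlin': 32, 'Porto': 38}
    ```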

  • Book Summary Series

    A friend tweeted about something, or shared a story on Instagram about her latest trip to Rome, or started a new job, or got laid off, or started a new relationship, or moved to a new city… There is an endless number of things happening around us every day, and we get a notification about all of them. This consumes a lot of our time and energy, so when it comes to doing the things that matter to us, the things that would affect our lives in a positive way if we did them well (like working or spending time with our families), it becomes really hard to achieve our goals. It becomes hard to complete the tasks assigned to us, or to be present when spending time with friends or family.

    We are living in a very noisy world that’s full of distractions. That’s why reading has become more challenging these days. It’s challenging for me too. I am not someone who reads tens of books every year. I am just a normal guy who really struggles to read a book every few months. I believe there are lots of people like me who really want to read but it’s not that easy for them. I decided to change that.

    In 2023, I decided to start my book summary series. I will simply read books and share summaries of these books with you.

    I will start with a very interesting book that I read in 2020 and really enjoyed: `Site Reliability Engineering` by Google. I believe this is a must-read for all Software Engineers, Product Managers, Engineering Managers, QA Engineers, and pretty much anyone who works in the software industry. It’s available for free from here if you prefer to read it online, or here if you prefer a PDF version.

    For those who don’t know what SRE is, here is ChatGPT’s answer to that question:

    Site Reliability Engineering (SRE) is a discipline that combines software engineering and IT operations to ensure that software systems are reliable, scalable, and available. SRE teams are responsible for designing, building, and maintaining systems to meet the needs of their users.

    The goal of SRE is to improve the reliability and performance of software systems by applying engineering principles and practices to the tasks of IT operations. This includes automating processes, monitoring systems, and implementing tools and processes to improve the reliability and efficiency of software systems.

    SRE teams often work closely with developers to ensure that software is designed and implemented in a way that is easy to operate and maintain. They also work with IT operations teams to ensure that systems are reliable and available to users. SRE teams may also be responsible for incident response and problem resolution, as well as implementing changes and updates to systems.