I believe reading is fundamental. Site reliability engineers (SREs) need deep knowledge of a wide range of subjects, such as coding, operating systems, computer networking, large-scale distributed systems, SRE best practices, and more, to be successful at their job. In this article, I discuss a few books that will help SREs become better at their job.
1. Site Reliability Engineering, by the Google SRE team
Google originally coined the term “Site Reliability Engineering.” This book is a must-read for anyone interested in the field. It covers a wide range of topics that SREs deal with day to day, such as SLOs, eliminating toil, monitoring distributed systems, release management, incident management, infrastructure, and more, and it gives an overview of the different elements that SREs work on. Although many topics in the book are specific to Google, it provides a good framework and mental model for thinking about SRE. The online version of this book is freely available here, so there is no excuse not to read it.
2. The Site Reliability Workbook, by the Google SRE team
After the success of the original site reliability engineering book, the Google SRE team released this book as a continuation that adds implementation details to the topics covered in the first book. One of my favorite chapters is “Introducing Non-Abstract Large System Design,” and I have read it multiple times. Like the first book, this one is also available to read online for free here.
3. Systems Performance, by Brendan Gregg
I was introduced to Brendan Gregg’s work through his famous blog post “Linux Performance Analysis in 60,000 Milliseconds.” This book introduced me to the USE Method, which can help you quickly troubleshoot performance issues. USE stands for utilization, saturation, and errors: for every resource, you check each of the three. The book covers topics such as Linux kernel internals, various observability tools (to analyze CPU, memory, disk, file systems, and network), and application performance. The USE Method helped me apply methodical problem solving while troubleshooting complex distributed system issues. This book can help you gain a deeper understanding of troubleshooting performance issues on a Linux operating system. More information about his book can be found here.
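To make the USE Method a little more concrete, here is a minimal Python sketch of the checklist idea. It is my own illustration, not code from the book: the `ResourceSample` type, the `use_findings` function, the 80% utilization threshold, and the sample numbers are all hypothetical, and in practice the values would come from tools such as vmstat or iostat.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (hypothetical structure for illustration)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: int     # amount of queued work the resource cannot service yet
    errors: int         # count of error events


def use_findings(samples, util_threshold=0.8):
    """Walk every resource and flag errors, saturation, and high utilization.

    The 0.8 threshold is an assumed value, not a rule from the USE Method.
    """
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append(f"{s.name}: errors={s.errors}")
        if s.saturation > 0:
            findings.append(f"{s.name}: saturated (queued={s.saturation})")
        if s.utilization >= util_threshold:
            findings.append(f"{s.name}: high utilization ({s.utilization:.0%})")
    return findings


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    samples = [
        ResourceSample("cpu", utilization=0.95, saturation=3, errors=0),
        ResourceSample("disk", utilization=0.20, saturation=0, errors=0),
    ]
    for finding in use_findings(samples):
        print(finding)
```

The point of the sketch is the shape of the method: the same three questions are asked of every resource, so nothing gets skipped while you chase the first suspicious metric.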
4. The Linux Programming Interface, by Michael Kerrisk
Having a deeper understanding of operating systems can provide a valuable advantage for SREs. Most of the time, SREs use many commands to configure systems and troubleshoot various OS-related issues. However, understanding how the operating system works internally helps make troubleshooting easier. This book provides a deeper understanding of the Linux OS, with a focus on its system call interface.
Most teams and companies use Linux to run production systems. However, you may work in teams where other operating systems, such as Windows, are used. If that is the case, then including a book specific to that OS in your reading list is worthwhile. You can check out the above-mentioned book here.
5. TCP/IP Illustrated: The Protocols, Volume 1, by Kevin Fall and Richard Stevens
This book is great for learning about core networking protocols such as IP (Internet Protocol), ICMP (Internet Control Message Protocol), ARP (Address Resolution Protocol), UDP (User Datagram Protocol), and TCP (Transmission Control Protocol). Having a strong understanding of the TCP/IP protocol suite, and knowing how to use various tools to debug networking issues, is one of the core skills for SREs. This book gives the reader a strong understanding of how the protocols work under the hood. Details about the book are found here.
6. The Illustrated Network: How TCP/IP Works in a Modern Network, by Walter Goralski
While TCP/IP Illustrated provides an in-depth explanation of the core TCP/IP protocols, this book focuses on the fundamental principles and how they work in a modern networking context. It is a great addition to your library alongside TCP/IP Illustrated; together, the two give you a deeper and broader understanding of TCP/IP protocols. More about this book can be found here.
7. Designing Data-Intensive Applications, by Martin Kleppmann
This is a great book for understanding how distributed systems work through the lens of data-oriented systems. If you are working on distributed database systems, this book is a must-read. I personally learned a lot from this book because I currently work as an SRE on CosmosDB (a globally distributed database service). What makes this book especially useful for SREs is that it focuses on the reliability, scalability, and maintainability of data-intensive applications. It dives deep into distributed database concepts such as replication, partitioning, transactions, and the problems with distributed consensus. You can learn more about this book here.
8. Building Secure and Reliable Systems, by the Google SRE team
This book extends the principles of site reliability engineering to encompass security, and argues that security and reliability are not separate concerns but are deeply related and should be addressed together. It advocates for integrating security practices into every stage of the system lifecycle, from design and development to deployment and operations. Google has made this book available for free here.
9. Domain-Specific Books
Often, SREs work in specific domains such as databases, real-time communication systems, ERP/CRM systems, AI/ML systems, and more, and having a general understanding of your domain is important to be effective at your job. Including a book in your reading list that provides a breadth of knowledge about that domain is a great idea.
Conclusion
By reading these books, you can develop a deeper understanding of various subjects such as coding, operating systems, computer networking, distributed systems, and SRE principles, which will help you become a better site reliability engineer. Personally, these books broadened my understanding of the essential knowledge I need to perform my job as an SRE effectively, and they also helped me while I was pursuing opportunities across teams and organizations. Happy reading!