Site Reliability Engineering

Video description

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
Management: Explore Google’s best practices for training, communication, and meetings that your organization can use

Title Page

Foreword

Preface

Part I. Introduction

Chapter 1. Introduction

Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE

Part II. Principles

Chapter 3. Embracing Risk

Chapter 4. Service Level Objectives

Chapter 5. Eliminating Toil

Chapter 6. Monitoring Distributed Systems

Chapter 7. The Evolution of Automation at Google

Chapter 8. Release Engineering

Chapter 9. Simplicity

Part III. Practices

Chapter 10. Practical Alerting from Time-Series Data

Chapter 11. Being On-Call

Chapter 12. Effective Troubleshooting

Chapter 13. Emergency Response

Chapter 14. Managing Incidents

Chapter 15. Postmortem Culture: Learning from Failure

Chapter 16. Tracking Outages

Chapter 17. Testing for Reliability

Chapter 18. Software Engineering in SRE

Chapter 19. Load Balancing at the Frontend

Chapter 20. Load Balancing in the Datacenter

Chapter 21. Handling Overload

Chapter 22. Addressing Cascading Failures

Chapter 23. Managing Critical State: Distributed Consensus for Reliability

Chapter 24. Distributed Periodic Scheduling with Cron

Chapter 25. Data Processing Pipelines

Chapter 26. Data Integrity: What You Read Is What You Wrote

Chapter 27. Reliable Product Launches at Scale

Part IV. Management

Chapter 28. Accelerating SREs to On-Call and Beyond

Chapter 29. Dealing with Interrupts

Chapter 30. Embedding an SRE to Recover from Operational Overload

Chapter 31. Communication and Collaboration in SRE

Chapter 32. The Evolving SRE Engagement Model

Part V. Conclusions

Chapter 33. Lessons Learned from Other Industries

Chapter 34. Conclusion
