Site Reliability Engineering
Video description
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
Management: Explore Google’s best practices for training, communication, and meetings that your organization can use
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Part II. Principles
Chapter 3. Embracing Risk
Chapter 4. Service Level Objectives
Chapter 5. Eliminating Toil
Chapter 6. Monitoring Distributed Systems
Chapter 7. The Evolution of Automation at Google
Chapter 8. Release Engineering
Chapter 9. Simplicity
Part III. Practices
Chapter 10. Practical Alerting from Time-Series Data
Chapter 11. Being On-Call
Chapter 12. Effective Troubleshooting
Chapter 13. Emergency Response
Chapter 14. Managing Incidents
Chapter 15. Postmortem Culture: Learning from Failure
Chapter 16. Tracking Outages
Chapter 17. Testing for Reliability
Chapter 18. Software Engineering in SRE
Chapter 19. Load Balancing at the Frontend
Chapter 20. Load Balancing in the Datacenter
Chapter 21. Handling Overload
Chapter 22. Addressing Cascading Failures
Chapter 23. Managing Critical State: Distributed Consensus for Reliability
Chapter 24. Distributed Periodic Scheduling with Cron
Chapter 25. Data Processing Pipelines
Chapter 26. Data Integrity: What You Read Is What You Wrote
Chapter 27. Reliable Product Launches at Scale
Part IV. Management
Chapter 28. Accelerating SREs to On-Call and Beyond
Chapter 29. Dealing with Interrupts
Chapter 30. Embedding an SRE to Recover from Operational Overload
Chapter 31. Communication and Collaboration in SRE
Chapter 32. The Evolving SRE Engagement Model
Part V. Conclusions
Chapter 33. Lessons Learned from Other Industries
Chapter 34. Conclusion
Copyright