Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy are members of the Google SRE team, which is responsible for operating Google's production systems.
Foreword
Preface
Part Ⅰ. Introduction
1. Introduction
    The Sysadmin Approach to Service Management
    Google's Approach to Service Management: Site Reliability Engineering
    Tenets of SRE
    The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
    Hardware
    System Software That "Organizes" the Hardware
    Other System Software
    Our Software Infrastructure
    Our Development Environment
    Shakespeare: A Sample Service
Part Ⅱ. Principles
3. Embracing Risk
    Managing Risk
    Measuring Service Risk
    Risk Tolerance of Services
    Motivation for Error Budgets
4. Service Level Objectives
    Service Level Terminology
    Indicators in Practice
    Objectives in Practice
    Agreements in Practice
5. Eliminating Toil
    Toil Defined
    Why Less Toil Is Better
    What Qualifies as Engineering?
    Is Toil Always Bad?
    Conclusion
6. Monitoring Distributed Systems
    Definitions
    Why Monitor?
    Setting Reasonable Expectations for Monitoring
    Symptoms Versus Causes
    Black-Box Versus White-Box
    The Four Golden Signals
    Worrying About Your Tail (or, Instrumentation and Performance)
    Choosing an Appropriate Resolution for Measurements
    As Simple as Possible, No Simpler
    Tying These Principles Together
    Monitoring for the Long Term
    Conclusion
7. The Evolution of Automation at Google
    The Value of Automation
    The Value for Google SRE
    The Use Cases for Automation
    Automate Yourself Out of a Job: Automate ALL the Things!
    Soothing the Pain: Applying Automation to Cluster Turnups
    Borg: Birth of the Warehouse-Scale Computer
    Reliability Is the Fundamental Feature
    Recommendations
8. Release Engineering
    The Role of a Release Engineer
    Philosophy
    ……