1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:16,080 --> 00:00:20,640 third time's a charm um so i've caused 3 00:00:18,640 --> 00:00:22,640 worse outages than myself at work so 4 00:00:20,640 --> 00:00:24,640 this is fine um 5 00:00:22,640 --> 00:00:27,039 and now for our final keynote speaker of 6 00:00:24,640 --> 00:00:29,439 the conference liz fong-jones 7 00:00:27,039 --> 00:00:31,199 as a bit of an aside as followers of liz 8 00:00:29,439 --> 00:00:32,800 on social media will know she's had 9 00:00:31,199 --> 00:00:34,559 something of an odyssey over the last 10 00:00:32,800 --> 00:00:36,719 couple days having to fly from sydney to 11 00:00:34,559 --> 00:00:38,800 the west coast of usa to get some urgent 12 00:00:36,719 --> 00:00:40,239 documents notarized then 13 00:00:38,800 --> 00:00:42,079 getting back on a plane to sydney 14 00:00:40,239 --> 00:00:45,200 returning yesterday morning and it all 15 00:00:42,079 --> 00:00:47,280 seems to be sorted now about liz liz is 16 00:00:45,200 --> 00:00:49,840 a developer advocate labor and ethics 17 00:00:47,280 --> 00:00:51,760 organizer and site reliability engineer 18 00:00:49,840 --> 00:00:52,879 with 16 plus years of 19 00:00:51,760 --> 00:00:55,840 experience 20 00:00:52,879 --> 00:00:56,800 she's an advocate at honeycomb for the sre 21 00:00:55,840 --> 00:00:58,879 and 22 00:00:56,800 --> 00:01:00,960 observability communities 23 00:00:58,879 --> 00:01:03,120 and previously was an sre working on 24 00:01:00,960 --> 00:01:05,600 projects ranging from the google cloud 25 00:01:03,120 --> 00:01:07,840 load balancer to google flights 26 00:01:05,600 --> 00:01:10,000 liz will be talking about cultivating 27 00:01:07,840 --> 00:01:13,560 production excellence 28 00:01:10,000 --> 00:01:13,560 over to you liz 29 00:01:15,920 --> 00:01:20,880 good day everyone and thank you for 30 00:01:18,320 --> 00:01:22,960 joining me um i wanted to acknowledge 31 00:01:20,880 --> 00:01:25,280 that i am living on the lands 
of the 32 00:01:22,960 --> 00:01:27,119 gadigal people and i also wanted to 33 00:01:25,280 --> 00:01:28,960 acknowledge that there are 34 00:01:27,119 --> 00:01:31,840 things happening in the world right now 35 00:01:28,960 --> 00:01:33,680 and that um tonga is under threat from 36 00:01:31,840 --> 00:01:36,240 climate change and from the volcano that 37 00:01:33,680 --> 00:01:38,640 has just erupted 38 00:01:36,240 --> 00:01:42,240 okay let us go ahead and begin 39 00:01:38,640 --> 00:01:44,399 so today i'm really excited to be 40 00:01:42,240 --> 00:01:47,119 telling you about all the lessons i've 41 00:01:44,399 --> 00:01:49,280 learned 42 00:01:47,119 --> 00:01:51,840 from the past 15 years of my experience 43 00:01:49,280 --> 00:01:54,560 working in the field of site reliability 44 00:01:51,840 --> 00:01:56,560 engineering at companies large and small 45 00:01:54,560 --> 00:01:58,960 and how we can take those lessons and 46 00:01:56,560 --> 00:02:00,960 make our workplaces more humane for us 47 00:01:58,960 --> 00:02:03,680 and to make our systems more operable 48 00:02:00,960 --> 00:02:03,680 and reliable 49 00:02:03,840 --> 00:02:07,280 so 50 00:02:04,640 --> 00:02:09,119 those of us who are attending lca are 51 00:02:07,280 --> 00:02:10,720 very familiar with the idea of operating 52 00:02:09,119 --> 00:02:12,720 in production 53 00:02:10,720 --> 00:02:15,520 but some of our colleagues may not 54 00:02:12,720 --> 00:02:17,680 necessarily have that same notion so to 55 00:02:15,520 --> 00:02:20,239 get everyone on the same page we have to 56 00:02:17,680 --> 00:02:22,080 orient around what the objectives are of 57 00:02:20,239 --> 00:02:24,319 the business what the objectives are of 58 00:02:22,080 --> 00:02:26,640 the code that we are writing 59 00:02:24,319 --> 00:02:28,959 and fundamentally we're writing code in 60 00:02:26,640 --> 00:02:30,879 order to solve problems but 61 00:02:28,959 --> 00:02:32,879 the problem is that just landing the 
62 00:02:30,879 --> 00:02:35,519 code in the main branch of git does not 63 00:02:32,879 --> 00:02:37,680 necessarily mean that your job is done 64 00:02:35,519 --> 00:02:39,360 that there are so many additional steps 65 00:02:37,680 --> 00:02:41,280 that you have to complete in order to 66 00:02:39,360 --> 00:02:44,319 make sure that your services are 67 00:02:41,280 --> 00:02:46,319 delivering value to end users 68 00:02:44,319 --> 00:02:48,239 because if your change has to be 69 00:02:46,319 --> 00:02:50,720 repeatedly rolled back or it takes 70 00:02:48,239 --> 00:02:54,640 months to push out you're not actually 71 00:02:50,720 --> 00:02:54,640 delivering that value until months later 72 00:02:54,800 --> 00:02:57,760 another challenge that we frequently 73 00:02:56,239 --> 00:02:59,599 encounter is that production is 74 00:02:57,760 --> 00:03:02,319 increasingly complex 75 00:02:59,599 --> 00:03:05,200 that we have all of these challenges of 76 00:03:02,319 --> 00:03:07,120 scale and 77 00:03:05,200 --> 00:03:09,599 reliability needs and future 78 00:03:07,120 --> 00:03:12,000 needs and we've developed these layers 79 00:03:09,599 --> 00:03:13,360 of infrastructure to try to 80 00:03:12,000 --> 00:03:15,200 make it easier 81 00:03:13,360 --> 00:03:16,720 but we've also made it hard to fit that 82 00:03:15,200 --> 00:03:18,480 all into the head of one individual 83 00:03:16,720 --> 00:03:20,879 person 84 00:03:18,480 --> 00:03:23,120 that with microservices the theory was 85 00:03:20,879 --> 00:03:25,040 that we could decouple our services from 86 00:03:23,120 --> 00:03:27,440 each other and decouple the release and 87 00:03:25,040 --> 00:03:29,120 launch cycles so that each team could 88 00:03:27,440 --> 00:03:31,200 ship independently 89 00:03:29,120 --> 00:03:33,040 but that now means that a service three 90 00:03:31,200 --> 00:03:34,959 layers away from you can break you in a 91 00:03:33,040 --> 00:03:37,040 way that you didn't anticipate 
in a way 92 00:03:34,959 --> 00:03:39,680 that was not a failure mode of a more 93 00:03:37,040 --> 00:03:41,599 tightly coupled system 94 00:03:39,680 --> 00:03:44,159 we also have problems with big data 95 00:03:41,599 --> 00:03:47,040 where big data can result in increasing 96 00:03:44,159 --> 00:03:49,040 demands upon compute and networking and 97 00:03:47,040 --> 00:03:52,080 we have to be able to scale our systems 98 00:03:49,040 --> 00:03:53,840 to meet and address that demand 99 00:03:52,080 --> 00:03:56,480 so some of the systems that we're 100 00:03:53,840 --> 00:03:58,879 creating add complexity in order to try 101 00:03:56,480 --> 00:04:01,920 to solve some of these challenges 102 00:03:58,879 --> 00:04:04,319 but when the systems don't work exactly 103 00:04:01,920 --> 00:04:06,480 as planned for instance as the audio 104 00:04:04,319 --> 00:04:08,799 feed this morning did not necessarily 105 00:04:06,480 --> 00:04:10,720 work right it takes a substantial amount 106 00:04:08,799 --> 00:04:13,680 of effort to try to figure out what's 107 00:04:10,720 --> 00:04:15,200 going on and how do we fix it 108 00:04:13,680 --> 00:04:18,320 so that's what the subject of today's 109 00:04:15,200 --> 00:04:21,519 talk is how do we fix our problems with 110 00:04:18,320 --> 00:04:23,360 more confidence more quickly 111 00:04:21,519 --> 00:04:25,120 but let's start with defining what the 112 00:04:23,360 --> 00:04:26,840 problem is 113 00:04:25,120 --> 00:04:30,000 what does uptime 114 00:04:26,840 --> 00:04:33,360 mean when i was a wee game developer 115 00:04:30,000 --> 00:04:35,680 when i was about 16 or 17 years old 116 00:04:33,360 --> 00:04:38,320 i worked at a small game studio and we 117 00:04:35,680 --> 00:04:40,560 had the database and we had the game 118 00:04:38,320 --> 00:04:42,639 world server and if those were up 119 00:04:40,560 --> 00:04:45,280 everything was fine but if they were not 120 00:04:42,639 --> 00:04:46,960 up everything was 100% down 121 
00:04:45,280 --> 00:04:49,840 but that's no longer the world we live 122 00:04:46,960 --> 00:04:51,840 in today where you may have hundreds or 123 00:04:49,840 --> 00:04:53,919 thousands of linux servers that are 124 00:04:51,840 --> 00:04:57,919 running as vms on amazon's 125 00:04:53,919 --> 00:05:00,080 infrastructure or in azure or on gcp 126 00:04:57,919 --> 00:05:01,680 and we can't wait until our customers 127 00:05:00,080 --> 00:05:03,120 complain to us and ring us up on the 128 00:05:01,680 --> 00:05:05,600 phone and say i'd like to cancel my 129 00:05:03,120 --> 00:05:07,360 subscription right like we would like to 130 00:05:05,600 --> 00:05:09,120 be a little bit more proactive at 131 00:05:07,360 --> 00:05:11,520 detecting when our customers are having 132 00:05:09,120 --> 00:05:13,440 problems without necessarily getting 133 00:05:11,520 --> 00:05:15,759 paged every single time a single server 134 00:05:13,440 --> 00:05:17,520 flaps 135 00:05:15,759 --> 00:05:18,960 there are all of these demands for 136 00:05:17,520 --> 00:05:21,199 reliability 137 00:05:18,960 --> 00:05:22,400 for features and for 138 00:05:21,199 --> 00:05:23,759 and for investment and future 139 00:05:22,400 --> 00:05:26,320 scalability 140 00:05:23,759 --> 00:05:28,880 it's a lot to deal with 141 00:05:26,320 --> 00:05:31,440 and honestly as someone who has been in 142 00:05:28,880 --> 00:05:34,160 the role of a hero in an organization 143 00:05:31,440 --> 00:05:36,639 who's the one person holding up the 144 00:05:34,160 --> 00:05:39,360 reliability on her shoulders 145 00:05:36,639 --> 00:05:41,360 it doesn't work not for not for you know 146 00:05:39,360 --> 00:05:43,039 a decade two decades you get really 147 00:05:41,360 --> 00:05:45,440 tired 148 00:05:43,039 --> 00:05:48,240 so what strategies can we employ in 149 00:05:45,440 --> 00:05:51,199 order to make our services more reliable 150 00:05:48,240 --> 00:05:54,240 and friendlier for the people 151 00:05:51,199 --> 
00:05:56,160 so as miles introduced me yes hi my name 152 00:05:54,240 --> 00:05:58,400 is liz and i'm a principal developer 153 00:05:56,160 --> 00:06:00,800 advocate at honeycomb 154 00:05:58,400 --> 00:06:02,720 and honeycomb is a company that aims to 155 00:06:00,800 --> 00:06:04,639 help make developers 156 00:06:02,720 --> 00:06:08,560 happier and more productive through 157 00:06:04,639 --> 00:06:10,400 improving observability into production 158 00:06:08,560 --> 00:06:11,680 so here are some of the lessons i've 159 00:06:10,400 --> 00:06:14,000 learned from both working with 160 00:06:11,680 --> 00:06:16,000 honeycomb's clients as well as in my 161 00:06:14,000 --> 00:06:18,319 previous life at google on the customer 162 00:06:16,000 --> 00:06:21,520 reliability engineering team as well as 163 00:06:18,319 --> 00:06:23,039 on various other sre teams at google 164 00:06:21,520 --> 00:06:24,880 the lesson that i learned is that 165 00:06:23,039 --> 00:06:26,639 heroism doesn't work and that we need 166 00:06:24,880 --> 00:06:28,240 different strategies 167 00:06:26,639 --> 00:06:30,720 and that those strategies cannot 168 00:06:28,240 --> 00:06:33,840 necessarily just be buying tools 169 00:06:30,720 --> 00:06:35,600 to fix things 170 00:06:33,840 --> 00:06:37,199 and i know that this is ironic because i 171 00:06:35,600 --> 00:06:39,520 do work at a company that sells 172 00:06:37,199 --> 00:06:43,199 developer tooling and here i am telling 173 00:06:39,520 --> 00:06:46,240 you don't buy devops right why is that 174 00:06:43,199 --> 00:06:47,919 the answer is that when you have all of 175 00:06:46,240 --> 00:06:49,680 these technologies that you're trying to 176 00:06:47,919 --> 00:06:51,599 integrate together that you're trying to 177 00:06:49,680 --> 00:06:53,039 make work together sometimes you're 178 00:06:51,599 --> 00:06:54,880 adding to your workload rather than 179 00:06:53,039 --> 00:06:57,120 reducing your workload 180 00:06:54,880 --> 
00:06:58,400 when you actually try to glom on all 181 00:06:57,120 --> 00:07:01,199 these things that are supposed to make 182 00:06:58,400 --> 00:07:02,880 your developers' lives easier 183 00:07:01,199 --> 00:07:04,880 for instance 184 00:07:02,880 --> 00:07:06,319 you get told to write more tests but 185 00:07:04,880 --> 00:07:07,840 what happens if you add more tests and 186 00:07:06,319 --> 00:07:09,919 add more tests and add more tests and 187 00:07:07,840 --> 00:07:12,080 they start flaking all the time and then 188 00:07:09,919 --> 00:07:13,599 people ignore the red build because oh 189 00:07:12,080 --> 00:07:16,160 that's just flaky tests i'm just going 190 00:07:13,599 --> 00:07:17,599 to merge anyways right congratulations 191 00:07:16,160 --> 00:07:19,680 you've just made your problem harder not 192 00:07:17,599 --> 00:07:22,080 easier 193 00:07:19,680 --> 00:07:24,080 or let's talk about the drive recently 194 00:07:22,080 --> 00:07:26,400 to kind of encourage people to push to 195 00:07:24,080 --> 00:07:27,919 production quickly recklessly almost 196 00:07:26,400 --> 00:07:30,000 right what happens if your continuous 197 00:07:27,919 --> 00:07:32,479 integration and continuous delivery 198 00:07:30,000 --> 00:07:34,960 that's meant to ship [ __ ] fast 199 00:07:32,479 --> 00:07:37,120 ships [ __ ] fast 200 00:07:34,960 --> 00:07:37,919 oops 201 00:07:37,120 --> 00:07:39,520 or 202 00:07:37,919 --> 00:07:41,039 those of us who have been around the 203 00:07:39,520 --> 00:07:44,000 community for a long time you've 204 00:07:41,039 --> 00:07:45,919 probably run rm -rf 205 00:07:44,000 --> 00:07:47,360 a couple of times before 206 00:07:45,919 --> 00:07:49,360 right now imagine doing that to your 207 00:07:47,360 --> 00:07:50,960 entire production infrastructure well 208 00:07:49,360 --> 00:07:52,319 congratulations you can do that now with 209 00:07:50,960 --> 00:07:53,759 infrastructure as code you can delete 210 00:07:52,319 --> 00:07:56,800 your 
entire production infrastructure 211 00:07:53,759 --> 00:07:56,800 with one stray commit 212 00:07:56,840 --> 00:08:03,759 whoops or let's talk about kubernetes 213 00:08:00,960 --> 00:08:06,000 not everyone needs kubernetes it's a lot 214 00:08:03,759 --> 00:08:08,560 of complexity and it's only worth it to 215 00:08:06,000 --> 00:08:10,240 solve certain problems 216 00:08:08,560 --> 00:08:11,599 but the number one thing that i see 217 00:08:10,240 --> 00:08:14,479 people getting wrong 218 00:08:11,599 --> 00:08:16,319 is adopting the idea of production 219 00:08:14,479 --> 00:08:18,319 ownership or let's put everyone into 220 00:08:16,319 --> 00:08:20,240 pagerduty 221 00:08:18,319 --> 00:08:22,000 and i think that that can have some 222 00:08:20,240 --> 00:08:24,000 negative side effects let's talk about 223 00:08:22,000 --> 00:08:26,080 them 224 00:08:24,000 --> 00:08:27,759 when you put everyone on call for 225 00:08:26,080 --> 00:08:29,520 systems that they are not prepared to 226 00:08:27,759 --> 00:08:30,960 run right those of us who have scar 227 00:08:29,520 --> 00:08:33,360 tissue from years and years of being 228 00:08:30,960 --> 00:08:36,240 around production we can handle to an 229 00:08:33,360 --> 00:08:37,680 extent um being paged at 3am 230 00:08:36,240 --> 00:08:39,760 but let's suppose you're a brand new 231 00:08:37,680 --> 00:08:44,159 engineer and your first experience with 232 00:08:39,760 --> 00:08:47,200 on-call is being paged at 1 am 2 am 3 am 233 00:08:44,159 --> 00:08:48,000 multiple times per week 234 00:08:47,200 --> 00:08:50,399 and 235 00:08:48,000 --> 00:08:52,399 you eventually are going to say take me 236 00:08:50,399 --> 00:08:53,440 off this on call rotation or i quit 237 00:08:52,399 --> 00:08:55,760 right 238 00:08:53,440 --> 00:08:57,839 it's not a happy situation if you have a 239 00:08:55,760 --> 00:08:58,959 system that is continuously generating 240 00:08:57,839 --> 00:09:00,959 noise 241 00:08:58,959 --> 00:09:03,279 and 
even when you do try to debug it you 242 00:09:00,959 --> 00:09:05,600 don't know heads from tails where do i 243 00:09:03,279 --> 00:09:07,839 get started 244 00:09:05,600 --> 00:09:09,200 and if you have dashboards those 245 00:09:07,839 --> 00:09:11,600 dashboards are often a source of 246 00:09:09,200 --> 00:09:13,839 technical debt in and of themselves 247 00:09:11,600 --> 00:09:15,839 because they're chasing your last outage 248 00:09:13,839 --> 00:09:17,920 there are you know 20 different graphs 249 00:09:15,839 --> 00:09:19,120 on 20 different pages and you're trying 250 00:09:17,920 --> 00:09:21,519 to figure out which line wiggled at the same 251 00:09:19,120 --> 00:09:23,200 time as that other line right like 252 00:09:21,519 --> 00:09:25,040 this isn't actually helping you solve 253 00:09:23,200 --> 00:09:26,880 the problem all that's happening is 254 00:09:25,040 --> 00:09:28,880 you're spending time in a state of 255 00:09:26,880 --> 00:09:30,720 cognitive overload while your customers 256 00:09:28,880 --> 00:09:33,519 are suffering and incidents are taking 257 00:09:30,720 --> 00:09:35,440 forever to fix 258 00:09:33,519 --> 00:09:38,160 so finally you pick up the phone and you 259 00:09:35,440 --> 00:09:40,720 call someone like me or like miles and 260 00:09:38,160 --> 00:09:42,399 we you know tell you oh you just 261 00:09:40,720 --> 00:09:43,920 frob the thing restart it it'll be fine 262 00:09:42,399 --> 00:09:45,360 in the morning right 263 00:09:43,920 --> 00:09:47,200 except for someone like me or miles 264 00:09:45,360 --> 00:09:49,200 like we've gotten really tired of 265 00:09:47,200 --> 00:09:50,959 getting woken up every single week at 266 00:09:49,200 --> 00:09:54,080 least once a week even if we're not on 267 00:09:50,959 --> 00:09:54,080 call for years on end 268 00:09:54,160 --> 00:09:57,040 and finally 269 00:09:55,360 --> 00:09:59,120 you go back to sleep you wake up in the 270 00:09:57,040 --> 00:10:01,200 morning cup of coffee in 
the hand and 271 00:09:59,120 --> 00:10:03,200 you realize that you can't actually push 272 00:10:01,200 --> 00:10:04,399 fix because someone has broken the build 273 00:10:03,200 --> 00:10:06,480 overnight 274 00:10:04,399 --> 00:10:08,160 and even though each individual set of 275 00:10:06,480 --> 00:10:09,760 unit tests passes 276 00:10:08,160 --> 00:10:12,160 the integration tests don't because 277 00:10:09,760 --> 00:10:13,680 there is a problem in the gaps between 278 00:10:12,160 --> 00:10:16,000 our services that is causing things to 279 00:10:13,680 --> 00:10:17,760 flake 280 00:10:16,000 --> 00:10:19,920 so this is what we described in the 281 00:10:17,760 --> 00:10:22,320 field of site reliability engineering as 282 00:10:19,920 --> 00:10:24,320 a state of operational overload 283 00:10:22,320 --> 00:10:26,560 where there's no time to do projects and 284 00:10:24,320 --> 00:10:28,560 even if you did have time to do projects 285 00:10:26,560 --> 00:10:30,640 there's really not much of a plan of how 286 00:10:28,560 --> 00:10:32,959 do i get myself out of this situation 287 00:10:30,640 --> 00:10:35,040 how do i spend one or two spare hours in 288 00:10:32,959 --> 00:10:38,399 order to chip away at this pile of 289 00:10:35,040 --> 00:10:40,320 technical and operational debt 290 00:10:38,399 --> 00:10:42,480 so this can feel very very draining this 291 00:10:40,320 --> 00:10:43,600 can feel stressful right like many of us 292 00:10:42,480 --> 00:10:45,040 have been here 293 00:10:43,600 --> 00:10:46,720 and it can feel like you're barely 294 00:10:45,040 --> 00:10:49,279 hanging on to the edge of the cliff with 295 00:10:46,720 --> 00:10:51,519 your fingernails 296 00:10:49,279 --> 00:10:52,800 so how do we make this better what are 297 00:10:51,519 --> 00:10:55,839 we missing 298 00:10:52,800 --> 00:10:57,680 and why is it that tools don't help 299 00:10:55,839 --> 00:11:00,800 well i think the thing that we need to 300 00:10:57,680 --> 00:11:02,480 focus 
on is that people are the ones who 301 00:11:00,800 --> 00:11:05,040 operate your systems that you cannot 302 00:11:02,480 --> 00:11:07,600 have a healthy system without healthy 303 00:11:05,040 --> 00:11:10,640 people standing behind it 304 00:11:07,600 --> 00:11:12,240 for instance let's take me as an example 305 00:11:10,640 --> 00:11:15,120 every single time i've probably given 306 00:11:12,240 --> 00:11:17,040 this talk several dozen times now 307 00:11:15,120 --> 00:11:19,279 every time i go through this section my 308 00:11:17,040 --> 00:11:22,079 heart rate goes up right like i have all 309 00:11:19,279 --> 00:11:25,600 that pent-up anxiety about production 310 00:11:22,079 --> 00:11:28,560 outages of getting paged at 3 am 311 00:11:25,600 --> 00:11:30,480 and it gets really challenging right but 312 00:11:28,560 --> 00:11:32,079 i need to take a moment and focus on me 313 00:11:30,480 --> 00:11:34,240 i encourage you to focus on yourself 314 00:11:32,079 --> 00:11:35,839 right like to take care of your needs in 315 00:11:34,240 --> 00:11:37,200 my case i'm going to take a deep breath 316 00:11:35,839 --> 00:11:38,160 and i'm going to take a drink of 317 00:11:37,200 --> 00:11:40,079 water 318 00:11:38,160 --> 00:11:42,079 and that will help me deliver a better 319 00:11:40,079 --> 00:11:44,480 talk in the same way you should make 320 00:11:42,079 --> 00:11:46,240 sure to take a deep breath everything's 321 00:11:44,480 --> 00:11:47,600 going to be fine and drink water even 322 00:11:46,240 --> 00:11:51,800 when you're in the middle of an outage 323 00:11:47,600 --> 00:11:51,800 focus on the people first 324 00:11:58,399 --> 00:12:01,680 so much better right 325 00:12:00,079 --> 00:12:03,360 so you cannot run a healthy system 326 00:12:01,680 --> 00:12:06,000 without healthy people 327 00:12:03,360 --> 00:12:07,920 and that means that tools cannot fix a 328 00:12:06,000 --> 00:12:09,600 culture if your people are not healthy 329 00:12:07,920 --> 
00:12:11,600 right 330 00:12:09,600 --> 00:12:13,519 that your tools can help automate 331 00:12:11,600 --> 00:12:15,519 processes that you already are inclined 332 00:12:13,519 --> 00:12:17,360 to do um but they can't fix a culture 333 00:12:15,519 --> 00:12:18,959 where people are blamed where people are 334 00:12:17,360 --> 00:12:20,079 on call all the time and stressed out 335 00:12:18,959 --> 00:12:21,680 right like 336 00:12:20,079 --> 00:12:24,240 tooling that generates more alerts is 337 00:12:21,680 --> 00:12:26,160 not going to be helpful there 338 00:12:24,240 --> 00:12:28,079 so what should we do instead 339 00:12:26,160 --> 00:12:29,680 i argue that we should invest in people 340 00:12:28,079 --> 00:12:32,000 culture and process 341 00:12:29,680 --> 00:12:33,920 that investing in those three things is 342 00:12:32,000 --> 00:12:37,600 how we get out of this mess of 343 00:12:33,920 --> 00:12:38,959 operational overload and production pain 344 00:12:37,600 --> 00:12:41,120 this is what i call production 345 00:12:38,959 --> 00:12:43,760 excellence the idea that our systems 346 00:12:41,120 --> 00:12:46,160 should be more reliable and friendlier 347 00:12:43,760 --> 00:12:49,839 that we shouldn't have to feed computers 348 00:12:46,160 --> 00:12:51,600 with the blood and sweat of human beings 349 00:12:49,839 --> 00:12:54,079 that we have to plan intentionally and 350 00:12:51,600 --> 00:12:56,880 figure out how are we getting there 351 00:12:54,079 --> 00:12:58,560 what milestones are we aiming for 352 00:12:56,880 --> 00:13:00,639 and what key performance indicators are 353 00:12:58,560 --> 00:13:02,079 we measuring 354 00:13:00,639 --> 00:13:04,079 we also have to think about getting the 355 00:13:02,079 --> 00:13:05,360 right set of people into the room 356 00:13:04,079 --> 00:13:06,959 and this means 357 00:13:05,360 --> 00:13:08,880 a lot of the people who should be in 358 00:13:06,959 --> 00:13:11,600 this room are not necessarily in this 359 
00:13:08,880 --> 00:13:13,760 room uh even at lca 360 00:13:11,600 --> 00:13:15,839 that we need to involve everyone from 361 00:13:13,760 --> 00:13:17,680 the business level to the support teams 362 00:13:15,839 --> 00:13:19,200 right we need to involve people who are 363 00:13:17,680 --> 00:13:21,040 even in sales right like we need to 364 00:13:19,200 --> 00:13:22,880 involve people all across the spectrum 365 00:13:21,040 --> 00:13:24,880 in order to make sure that we're aligned 366 00:13:22,880 --> 00:13:27,839 about delivering value to customers in a 367 00:13:24,880 --> 00:13:27,839 sustainable fashion 368 00:13:28,000 --> 00:13:30,560 so that means that we need a 369 00:13:28,959 --> 00:13:32,880 psychologically safe environment that 370 00:13:30,560 --> 00:13:34,320 people need to be able to contribute and 371 00:13:32,880 --> 00:13:36,000 that if you cannot feel safe to 372 00:13:34,320 --> 00:13:38,639 contribute none of the rest of this talk 373 00:13:36,000 --> 00:13:40,240 is going to make sense 374 00:13:38,639 --> 00:13:42,160 but let's talk about the four key 375 00:13:40,240 --> 00:13:44,639 elements of production excellence that 376 00:13:42,160 --> 00:13:46,720 will make your life easier 377 00:13:44,639 --> 00:13:48,399 first of all in order to operate an 378 00:13:46,720 --> 00:13:50,480 excellent system you need to know when 379 00:13:48,399 --> 00:13:52,480 that system is too broken 380 00:13:50,480 --> 00:13:55,360 when it's outside of the bounds of the 381 00:13:52,480 --> 00:13:57,360 normal expected system behavior 382 00:13:55,360 --> 00:13:59,199 secondly you need to be able to debug 383 00:13:57,360 --> 00:14:01,120 those failures in order to restore them 384 00:13:59,199 --> 00:14:02,720 to working order 385 00:14:01,120 --> 00:14:04,560 third you have to be able to collaborate 386 00:14:02,720 --> 00:14:06,240 across multiple teams 387 00:14:04,560 --> 00:14:08,560 in order to solve these problems across 388 00:14:06,240 --> 
00:14:10,399 multiple microservices 389 00:14:08,560 --> 00:14:12,560 and then fourth and finally you need to 390 00:14:10,399 --> 00:14:14,480 close that feedback loop we don't live 391 00:14:12,560 --> 00:14:16,560 in the world of groundhog day it's not 392 00:14:14,480 --> 00:14:18,880 okay to repeat the same outages over and 393 00:14:16,560 --> 00:14:20,720 over so we need that feedback loop to 394 00:14:18,880 --> 00:14:22,880 solve the longer running issues that we 395 00:14:20,720 --> 00:14:24,639 face as engineers 396 00:14:22,880 --> 00:14:26,560 so if you do those four things you'll 397 00:14:24,639 --> 00:14:28,000 have a system that is much friendlier to 398 00:14:26,560 --> 00:14:30,320 your human beings who are operating the 399 00:14:28,000 --> 00:14:32,399 system 400 00:14:30,320 --> 00:14:34,320 so why did i say 401 00:14:32,399 --> 00:14:35,839 know when our systems are failing too much 402 00:14:34,320 --> 00:14:37,920 right why did i not say know when our 403 00:14:35,839 --> 00:14:39,519 systems are failing at all 404 00:14:37,920 --> 00:14:41,920 the answer is that 405 00:14:39,519 --> 00:14:44,480 if you got alerted every single time a 406 00:14:41,920 --> 00:14:46,160 packet dropped in the uh in the fiber 407 00:14:44,480 --> 00:14:48,399 optic channel between here and la right 408 00:14:46,160 --> 00:14:50,160 like you'd get paged all the time right 409 00:14:48,399 --> 00:14:53,040 our systems are always failing in small 410 00:14:50,160 --> 00:14:55,199 microscopic ways but we build redundancy 411 00:14:53,040 --> 00:14:57,120 into our systems in order to try to 412 00:14:55,199 --> 00:14:58,639 solve those problems 413 00:14:57,120 --> 00:15:00,480 so instead of thinking about are the 414 00:14:58,639 --> 00:15:02,560 systems failing at all right like is the 415 00:15:00,480 --> 00:15:04,399 lawn outside on the domain outside where 416 00:15:02,560 --> 00:15:06,800 i am right like is every single blade of 417 00:15:04,399 --> 00:15:09,279 
grass on that lawn green no 418 00:15:06,800 --> 00:15:11,360 but overall does it look green enough 419 00:15:09,279 --> 00:15:13,279 yes right like it's soft it's lush i 420 00:15:11,360 --> 00:15:14,800 can go lie down and have a picnic 421 00:15:13,279 --> 00:15:16,399 right like that i think is good 422 00:15:14,800 --> 00:15:17,680 enough 423 00:15:16,399 --> 00:15:19,600 so instead of saying you know is the 424 00:15:17,680 --> 00:15:22,880 system failing at all we need to talk 425 00:15:19,600 --> 00:15:24,560 about is the system failing too much 426 00:15:22,880 --> 00:15:26,160 and that means we need a quantitative 427 00:15:24,560 --> 00:15:28,480 measure of that that we can use to 428 00:15:26,160 --> 00:15:30,720 operate the system 429 00:15:28,480 --> 00:15:32,639 this is the idea from site reliability 430 00:15:30,720 --> 00:15:34,399 engineering at google the idea of the 431 00:15:32,639 --> 00:15:37,759 service level indicator and its 432 00:15:34,399 --> 00:15:40,000 companion the service level objective 433 00:15:37,759 --> 00:15:41,279 a service level indicator and service 434 00:15:40,000 --> 00:15:43,279 level objective 435 00:15:41,279 --> 00:15:45,920 what they represent is a common language 436 00:15:43,279 --> 00:15:47,600 between us as engineers our business 437 00:15:45,920 --> 00:15:49,759 stakeholders and our customers about 438 00:15:47,600 --> 00:15:51,920 what the expected level of reliability 439 00:15:49,759 --> 00:15:53,839 is for our services 440 00:15:51,920 --> 00:15:56,240 and we need to think about our services 441 00:15:53,839 --> 00:15:57,680 in terms of the broader context 442 00:15:56,240 --> 00:15:58,800 what is it that our customers are trying 443 00:15:57,680 --> 00:16:00,720 to achieve 444 00:15:58,800 --> 00:16:03,680 what is that critical user journey that 445 00:16:00,720 --> 00:16:03,680 people are trying to do 446 00:16:04,240 --> 00:16:08,720 for instance maybe it's people buying 447 00:16:06,800 --> 
00:16:10,639 tickets to attend the outdoor film 448 00:16:08,720 --> 00:16:13,279 festival in sydney 449 00:16:10,639 --> 00:16:16,000 or maybe it's someone who is trying to 450 00:16:13,279 --> 00:16:18,160 place an order for uh for computer parts 451 00:16:16,000 --> 00:16:20,079 from scorpiotech right 452 00:16:18,160 --> 00:16:22,240 either way there's some action that a 453 00:16:20,079 --> 00:16:23,680 customer is trying to achieve and we 454 00:16:22,240 --> 00:16:25,199 need to make sure that we're measuring 455 00:16:23,680 --> 00:16:27,040 it to ensure that it is of a 456 00:16:25,199 --> 00:16:28,720 satisfactory level of performance and 457 00:16:27,040 --> 00:16:30,800 quality 458 00:16:28,720 --> 00:16:32,560 so we need to understand is a given 459 00:16:30,800 --> 00:16:33,519 interaction between our customers and 460 00:16:32,560 --> 00:16:35,040 ourselves 461 00:16:33,519 --> 00:16:36,720 creating a good or bad experience for 462 00:16:35,040 --> 00:16:38,079 that customer 463 00:16:36,720 --> 00:16:40,000 how do you know 464 00:16:38,079 --> 00:16:42,399 well this is where product managers and 465 00:16:40,000 --> 00:16:44,399 user experience researchers and customer 466 00:16:42,399 --> 00:16:46,639 success can really help you out to help 467 00:16:44,399 --> 00:16:50,240 you understand what differentiates a 468 00:16:46,639 --> 00:16:52,160 successful from a failing transaction 469 00:16:50,240 --> 00:16:54,720 or if you work in a field where you are 470 00:16:52,160 --> 00:16:56,079 able to experiment on your own systems 471 00:16:54,720 --> 00:16:58,240 and where you use those systems 472 00:16:56,079 --> 00:17:00,320 yourselves my main piece of advice is to 473 00:16:58,240 --> 00:17:02,399 do chaos engineering to deliberately 474 00:17:00,320 --> 00:17:04,079 slow down transactions 475 00:17:02,399 --> 00:17:06,400 so you can understand where does it 476 00:17:04,079 --> 00:17:09,280 start to feel sluggish and then set your 477 00:17:06,400 
--> 00:17:11,280 threshold just shy of there 478 00:17:09,280 --> 00:17:13,199 because our goal is to give machines 479 00:17:11,280 --> 00:17:15,760 rules for deciding whether customers are 480 00:17:13,199 --> 00:17:18,319 having a good or bad experience 481 00:17:15,760 --> 00:17:19,360 and sure we're all used to trans-pacific 482 00:17:18,319 --> 00:17:21,199 latency 483 00:17:19,360 --> 00:17:22,079 but we can in general say you know hey 484 00:17:21,199 --> 00:17:24,319 like 485 00:17:22,079 --> 00:17:26,559 a request is successful if it responds 486 00:17:24,319 --> 00:17:28,640 in you know maybe not 100 milliseconds 487 00:17:26,559 --> 00:17:30,320 but 300 milliseconds right that's fast 488 00:17:28,640 --> 00:17:32,400 enough and then it has to return a 489 00:17:30,320 --> 00:17:36,080 success it can't return a fast fail is 490 00:17:32,400 --> 00:17:36,080 not a success to us as a customer 491 00:17:36,160 --> 00:17:39,600 and then we can think about what's the 492 00:17:37,840 --> 00:17:41,919 total denominator right how many 493 00:17:39,600 --> 00:17:44,480 eligible events did we see 494 00:17:41,919 --> 00:17:46,640 how many transactions were attempted 495 00:17:44,480 --> 00:17:47,919 and i don't just mean like you know just 496 00:17:46,640 --> 00:17:49,039 take everything comes into your load 497 00:17:47,919 --> 00:17:50,960 balancer 498 00:17:49,039 --> 00:17:52,720 because there's often health check 499 00:17:50,960 --> 00:17:54,880 traffic there's often denial of service 500 00:17:52,720 --> 00:17:56,559 attack traffic right like we need to 501 00:17:54,880 --> 00:17:59,520 think only about the real customer 502 00:17:56,559 --> 00:18:01,760 traffic not synthetic events 503 00:17:59,520 --> 00:18:04,320 and then we can compute our actual 504 00:18:01,760 --> 00:18:06,080 achieved availability the number of good 505 00:18:04,320 --> 00:18:07,120 events divided by the number of eligible 506 00:18:06,080 --> 00:18:08,960 events 507 00:18:07,120 
--> 00:18:10,799 and we can set a target that we can 508 00:18:08,960 --> 00:18:12,160 measure that is our service level 509 00:18:10,799 --> 00:18:14,720 objective 510 00:18:12,160 --> 00:18:16,320 so the service level indicator earlier 511 00:18:14,720 --> 00:18:18,799 helped separate good events from bad 512 00:18:16,320 --> 00:18:21,120 events but the service level objective 513 00:18:18,799 --> 00:18:23,360 helps us decide over a longer period of 514 00:18:21,120 --> 00:18:26,840 time what percentage of the events in 515 00:18:23,360 --> 00:18:29,679 the sli should be 516 00:18:26,840 --> 00:18:32,000 successful so for instance 517 00:18:29,679 --> 00:18:34,480 if the website that i'm operating was 518 00:18:32,000 --> 00:18:36,000 100% down today 519 00:18:34,480 --> 00:18:38,080 but i came to you and said you know what 520 00:18:36,000 --> 00:18:39,919 but tomorrow we're going to be 100% up 521 00:18:38,080 --> 00:18:41,600 you'd laugh at me right 522 00:18:39,919 --> 00:18:43,600 so it turns out that human beings have 523 00:18:41,600 --> 00:18:45,679 memories longer than a day so you need to 524 00:18:43,600 --> 00:18:47,360 think about measuring reliability and 525 00:18:45,679 --> 00:18:50,160 setting goals for it over a longer 526 00:18:47,360 --> 00:18:51,919 period like 30 or 90 days 527 00:18:50,160 --> 00:18:53,600 we also need to set an appropriate 528 00:18:51,919 --> 00:18:56,799 target percentage 529 00:18:53,600 --> 00:18:58,960 for instance i might aim to have 99.9% of 530 00:18:56,799 --> 00:19:00,960 events be successful over the past 30 531 00:18:58,960 --> 00:19:03,200 days where an event is successful if it 532 00:19:00,960 --> 00:19:07,120 was served in less than 300 milliseconds 533 00:19:03,200 --> 00:19:07,919 and with an http response code of 200.
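[editor's sketch] the definition just given (good events divided by eligible events, compared against a target) can be written down in a few lines of python. this is only a rough illustration under stated assumptions, not honeycomb's implementation: the `Event` type, its field names, the `synthetic` flag and the helper functions are all hypothetical, and the caller is assumed to pass in only the events from the trailing 30-day window.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical event record; field names are illustrative assumptions.
@dataclass
class Event:
    latency_ms: float
    status: int
    synthetic: bool = False  # health checks, probes, attack traffic, etc.

LATENCY_THRESHOLD_MS = 300.0  # "fast enough" threshold from the talk
SLO_TARGET = 0.999            # 99.9% of eligible events should be good

def is_eligible(event: Event) -> bool:
    """Only real customer traffic counts toward the denominator."""
    return not event.synthetic

def is_good(event: Event) -> bool:
    """SLI: a fast failure is not a success, and neither is a slow 200."""
    return event.status == 200 and event.latency_ms < LATENCY_THRESHOLD_MS

def availability(events: Iterable[Event]) -> Optional[float]:
    """Good events divided by eligible events; None if nothing was eligible."""
    eligible = [e for e in events if is_eligible(e)]
    if not eligible:
        return None
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)

def meets_slo(events: Iterable[Event], target: float = SLO_TARGET) -> bool:
    """True when achieved availability over the window is at or above target."""
    achieved = availability(events)
    return achieved is None or achieved >= target
```

the choice to treat an empty window as meeting the objective is a design assumption of this sketch; a real system might prefer to flag missing traffic separately.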
534 00:19:07,120 --> 00:19:10,280 so 535 00:19:07,919 --> 00:19:12,400 why did i not say aim for 536 00:19:10,280 --> 00:19:14,799 99.99999% why should i not have infinite 537 00:19:12,400 --> 00:19:16,160 nines right 538 00:19:14,799 --> 00:19:18,960 the answer is 539 00:19:16,160 --> 00:19:20,240 that every nine is additionally costly 540 00:19:18,960 --> 00:19:23,120 right 541 00:19:20,240 --> 00:19:24,799 and indeed we see this in australia with 542 00:19:23,120 --> 00:19:26,160 nbn right 543 00:19:24,799 --> 00:19:28,000 that sure 544 00:19:26,160 --> 00:19:29,679 we could spend 545 00:19:28,000 --> 00:19:31,679 trillions of dollars 546 00:19:29,679 --> 00:19:33,600 putting you know dozens of 547 00:19:31,679 --> 00:19:35,760 satellites into space 548 00:19:33,600 --> 00:19:37,840 in order to fulfill 549 00:19:35,760 --> 00:19:39,919 99.99999% 550 00:19:37,840 --> 00:19:42,080 population coverage 551 00:19:39,919 --> 00:19:43,919 or we could put two satellites into 552 00:19:42,080 --> 00:19:46,160 space and say you know what we're 553 00:19:43,919 --> 00:19:48,400 covering four nines of the population 554 00:19:46,160 --> 00:19:50,799 that's good enough right 555 00:19:48,400 --> 00:19:52,640 there is a cost to each incremental 556 00:19:50,799 --> 00:19:54,960 failure that you prevent 557 00:19:52,640 --> 00:19:56,880 and an incremental benefit associated 558 00:19:54,960 --> 00:19:58,000 with that individual failure that was 559 00:19:56,880 --> 00:20:00,400 prevented 560 00:19:58,000 --> 00:20:01,919 at some point those two lines cross over 561 00:20:00,400 --> 00:20:04,720 so it's up to you to figure out how 562 00:20:01,919 --> 00:20:06,799 critical is a service and then decide 563 00:20:04,720 --> 00:20:08,400 okay what is the level of reliability 564 00:20:06,799 --> 00:20:11,120 that we're aiming for and what's going 565 00:20:08,400 --> 00:20:13,039 to be cost effective 566 00:20:11,120 --> 00:20:14,559 now what do i do with the service level
567 00:20:13,039 --> 00:20:16,240 objective right all this is abstract 568 00:20:14,559 --> 00:20:17,919 right setting goals why do these goals 569 00:20:16,240 --> 00:20:20,000 matter how do i make my life easier as 570 00:20:17,919 --> 00:20:21,679 an operations engineer let me tell you 571 00:20:20,000 --> 00:20:23,840 why 572 00:20:21,679 --> 00:20:25,600 the answer is going back to that first 573 00:20:23,840 --> 00:20:27,840 point i made about operationally 574 00:20:25,600 --> 00:20:30,400 overloaded engineers 575 00:20:27,840 --> 00:20:33,120 often when you get paged at 2 am there's 576 00:20:30,400 --> 00:20:35,039 no actual end user impact it's just 577 00:20:33,120 --> 00:20:37,520 something paging you because the disk 578 00:20:35,039 --> 00:20:41,520 usage got to 90.1% 579 00:20:37,520 --> 00:20:43,919 90.01% oh no such an emergency right 580 00:20:41,520 --> 00:20:46,840 were any users impacted 581 00:20:43,919 --> 00:20:50,559 no was this worth waking someone up for 582 00:20:46,840 --> 00:20:53,120 no so we can instead think about using 583 00:20:50,559 --> 00:20:55,200 our measurements of what the reliability 584 00:20:53,120 --> 00:20:57,039 level experienced by our customers is 585 00:20:55,200 --> 00:20:59,919 and use that to decide is this important 586 00:20:57,039 --> 00:21:02,400 enough or not to wake a human up 587 00:20:59,919 --> 00:21:03,840 so for instance let's talk about 588 00:21:02,400 --> 00:21:06,000 the idea of the error budget which is 589 00:21:03,840 --> 00:21:09,600 the inverse of your service level 590 00:21:06,000 --> 00:21:11,280 objective if i'm targeting 99.9% 591 00:21:09,600 --> 00:21:14,080 reliability 592 00:21:11,280 --> 00:21:15,760 that means 1 in 1,000 requests are 593 00:21:14,080 --> 00:21:18,400 allowed to fail 594 00:21:15,760 --> 00:21:21,200 so if i'm serving 10 million requests 595 00:21:18,400 --> 00:21:23,360 per month that means 10,000 requests per 596 00:21:21,200 --> 00:21:25,120 month can fail 597
00:21:23,360 --> 00:21:27,919 and i can have a different degree of 598 00:21:25,120 --> 00:21:30,080 urgency based off of the error rate 599 00:21:27,919 --> 00:21:33,440 right it's no longer an emergency if one 600 00:21:30,080 --> 00:21:35,120 out of one requests fails at 2 am 601 00:21:33,440 --> 00:21:36,960 right instead i ask 602 00:21:35,120 --> 00:21:39,360 if i continue having this number of 603 00:21:36,960 --> 00:21:40,640 errors how many hours will it be until i 604 00:21:39,360 --> 00:21:43,039 run out 605 00:21:40,640 --> 00:21:45,360 if i'm serving a thousand bad requests 606 00:21:43,039 --> 00:21:47,360 per hour and i'm allowed to have 10,000 607 00:21:45,360 --> 00:21:49,039 bad requests per month i'm going to run 608 00:21:47,360 --> 00:21:50,720 out in 10 hours that's probably worth 609 00:21:49,039 --> 00:21:52,720 waking someone up for 610 00:21:50,720 --> 00:21:54,480 but if i'm only going to run out in days 611 00:21:52,720 --> 00:21:56,640 it can wait until the next business day 612 00:21:54,480 --> 00:21:58,000 it's okay 613 00:21:56,640 --> 00:21:59,919 and the beauty of this is that it 614 00:21:58,000 --> 00:22:02,880 catches subtle failures as well that are 615 00:21:59,919 --> 00:22:04,799 not entirely up or down 616 00:22:02,880 --> 00:22:06,320 here's an example of an actual failure 617 00:22:04,799 --> 00:22:08,240 that we encountered at honeycomb when we 618 00:22:06,320 --> 00:22:11,360 were first turning on service level 619 00:22:08,240 --> 00:22:13,280 objectives for our own services 620 00:22:11,360 --> 00:22:14,880 we started having an intermittent brownout 621 00:22:13,280 --> 00:22:16,799 where two percent of our traffic 622 00:22:14,880 --> 00:22:19,760 would fail for 20 minutes at a time 623 00:22:16,799 --> 00:22:21,440 repeating every three hours 624 00:22:19,760 --> 00:22:23,600 now it turns out that there was a memory 625 00:22:21,440 --> 00:22:25,280 leak and all of our servers had started 626 00:22:23,600 -->
00:22:26,720 at the same time and were crashing and 627 00:22:25,280 --> 00:22:28,559 stampeding at the same time but we 628 00:22:26,720 --> 00:22:29,919 didn't know that at the time 629 00:22:28,559 --> 00:22:32,880 but what we did know was that our 630 00:22:29,919 --> 00:22:35,039 service level objective fired saying 631 00:22:32,880 --> 00:22:36,799 that too many users were experiencing 632 00:22:35,039 --> 00:22:38,080 problems uploading their telemetry to 633 00:22:36,799 --> 00:22:39,840 honeycomb 634 00:22:38,080 --> 00:22:42,320 whereas our conventional black box 635 00:22:39,840 --> 00:22:44,640 alerts were not able to catch the issue 636 00:22:42,320 --> 00:22:46,080 because of the fact that they wait for 637 00:22:44,640 --> 00:22:48,080 two consecutive 638 00:22:46,080 --> 00:22:49,919 probes in a row to fail and we did not 639 00:22:48,080 --> 00:22:52,159 have two consecutive probe failures in a 640 00:22:49,919 --> 00:22:52,159 row 641 00:22:52,240 --> 00:22:56,240 so besides making your alerting much 642 00:22:54,559 --> 00:22:58,000 higher fidelity the other thing that 643 00:22:56,240 --> 00:23:00,880 service level objectives do is that they 644 00:22:58,000 --> 00:23:03,200 help you navigate the tension between 645 00:23:00,880 --> 00:23:04,559 product development and reliability 646 00:23:03,200 --> 00:23:06,080 instead of having product managers 647 00:23:04,559 --> 00:23:08,159 saying ship more features i want more 648 00:23:06,080 --> 00:23:10,400 features right like now you have a 649 00:23:08,159 --> 00:23:12,400 framework to treat reliability as a 650 00:23:10,400 --> 00:23:14,159 product feature 651 00:23:12,400 --> 00:23:16,480 so for instance if you're having too 652 00:23:14,159 --> 00:23:18,000 much reliability that can almost be a 653 00:23:16,480 --> 00:23:19,919 bad thing because people's expectations 654 00:23:18,000 --> 00:23:21,600 will ratchet up and your competitors 655 00:23:19,919 --> 00:23:23,200 might be innovating 
faster than you if 656 00:23:21,600 --> 00:23:25,200 you're focusing on delivering 657 00:23:23,200 --> 00:23:26,559 reliability at the expense of product 658 00:23:25,200 --> 00:23:28,480 features 659 00:23:26,559 --> 00:23:30,559 so you might decide i'm going to do an 660 00:23:28,480 --> 00:23:32,080 experiment and i'm going to 661 00:23:30,559 --> 00:23:33,679 roll something out to one percent or two 662 00:23:32,080 --> 00:23:36,080 percent or five percent of my users 663 00:23:33,679 --> 00:23:37,600 knowing that even if it fails 100% 664 00:23:36,080 --> 00:23:39,600 you know for that five percent 665 00:23:37,600 --> 00:23:40,720 of users you can roll it back within 666 00:23:39,600 --> 00:23:44,240 five minutes and you're not going to 667 00:23:40,720 --> 00:23:45,679 damage that many people's experiences 668 00:23:44,240 --> 00:23:47,600 but conversely if you had a set of 669 00:23:45,679 --> 00:23:49,120 really bad outages recently 670 00:23:47,600 --> 00:23:50,799 you can think instead about how do we 671 00:23:49,120 --> 00:23:52,240 invest in more reliability how do we 672 00:23:50,799 --> 00:23:54,400 make the business case right the answer 673 00:23:52,240 --> 00:23:56,159 is we're failing our objectives 674 00:23:54,400 --> 00:23:58,000 therefore our customers are at risk of 675 00:23:56,159 --> 00:23:59,520 no longer trusting our service no 676 00:23:58,000 --> 00:24:01,360 feature that we ship is going to matter 677 00:23:59,520 --> 00:24:03,600 unless we bring reliability back up to 678 00:24:01,360 --> 00:24:03,600 par 679 00:24:03,760 --> 00:24:07,279 now you don't have to have super complex 680 00:24:05,840 --> 00:24:08,640 slos to start with even your load 681 00:24:07,279 --> 00:24:10,320 balancer logs are great because they 682 00:24:08,640 --> 00:24:11,600 help you understand 683 00:24:10,320 --> 00:24:13,760 what is it that your customers are 684 00:24:11,600 --> 00:24:15,520 experiencing from a neutral-ish point of 685
00:24:13,760 --> 00:24:16,400 view right it helps you measure what you 686 00:24:15,520 --> 00:24:18,559 can 687 00:24:16,400 --> 00:24:20,720 in order to deliver a better customer 688 00:24:18,559 --> 00:24:22,720 experience and actually put yourselves 689 00:24:20,720 --> 00:24:24,240 in the shoes of those users 690 00:24:22,720 --> 00:24:25,919 and over time you can iterate to make 691 00:24:24,240 --> 00:24:27,520 those slos better right like you can 692 00:24:25,919 --> 00:24:29,360 start incorporating things like real 693 00:24:27,520 --> 00:24:31,039 user monitoring from your client devices 694 00:24:29,360 --> 00:24:32,480 right like there are a million things 695 00:24:31,039 --> 00:24:34,880 you can do 696 00:24:32,480 --> 00:24:36,559 but start by meeting your user needs and 697 00:24:34,880 --> 00:24:38,159 then if users are having problems and 698 00:24:36,559 --> 00:24:40,720 your slo isn't firing you need to 699 00:24:38,159 --> 00:24:42,880 correct your slo 700 00:24:40,720 --> 00:24:44,960 so slos help you reduce alerting noise 701 00:24:42,880 --> 00:24:46,400 and really help you hone in only on what 702 00:24:44,960 --> 00:24:49,039 matters but that's only half of the 703 00:24:46,400 --> 00:24:51,440 story to me why 704 00:24:49,039 --> 00:24:53,039 because we don't just need to focus on 705 00:24:51,440 --> 00:24:55,200 the issue of operator burnout we also 706 00:24:53,039 --> 00:24:57,039 need to focus on restoring customer 707 00:24:55,200 --> 00:24:58,799 experience as quickly as possible when 708 00:24:57,039 --> 00:25:00,799 we do confirm that there's a genuine 709 00:24:58,799 --> 00:25:03,039 issue 710 00:25:00,799 --> 00:25:05,120 so let's talk about how that works 711 00:25:03,039 --> 00:25:06,720 our outages are never exactly identical 712 00:25:05,120 --> 00:25:08,559 right that we 713 00:25:06,720 --> 00:25:10,480 always have if we're doing our job as 714 00:25:08,559 --> 00:25:12,000 engineers we're always going to have 715
00:25:10,480 --> 00:25:14,320 these new kinds of failures that are 716 00:25:12,000 --> 00:25:15,360 happening that we didn't anticipate 717 00:25:14,320 --> 00:25:16,799 because you wouldn't have written the 718 00:25:15,360 --> 00:25:18,640 bug in the first place if you knew it 719 00:25:16,799 --> 00:25:20,240 was going to break so therefore by 720 00:25:18,640 --> 00:25:21,600 definition anything that goes wrong in 721 00:25:20,240 --> 00:25:23,840 production is going to be something 722 00:25:21,600 --> 00:25:25,520 unexpected 723 00:25:23,840 --> 00:25:27,200 and not only that 724 00:25:25,520 --> 00:25:29,360 it may be something that is challenging 725 00:25:27,200 --> 00:25:31,279 or hard to reproduce in staging because 726 00:25:29,360 --> 00:25:32,559 staging is not production 727 00:25:31,279 --> 00:25:34,480 right staging doesn't have the same 728 00:25:32,559 --> 00:25:37,120 scale and it's a futile waste of money 729 00:25:34,480 --> 00:25:39,120 to make staging look like production 730 00:25:37,120 --> 00:25:40,720 we have to be able to debug things as 731 00:25:39,120 --> 00:25:42,559 they're happening live in production 732 00:25:40,720 --> 00:25:44,240 rather than waiting weeks to reproduce 733 00:25:42,559 --> 00:25:45,919 them 734 00:25:44,240 --> 00:25:48,000 and we have to avoid creating these 735 00:25:45,919 --> 00:25:51,360 silos between different tooling across 736 00:25:48,000 --> 00:25:52,880 different uh problem domains services uh 737 00:25:51,360 --> 00:25:55,120 environments right like we have to be 738 00:25:52,880 --> 00:25:56,720 able to use the same tooling to very 739 00:25:55,120 --> 00:25:58,559 quickly iterate and understand what's 740 00:25:56,720 --> 00:26:00,080 happening 741 00:25:58,559 --> 00:26:02,080 now the thing that i've discovered is 742 00:26:00,080 --> 00:26:04,240 that in my 15 years as a site 743 00:26:02,080 --> 00:26:05,760 reliability engineer 744 00:26:04,240 --> 00:26:07,520 when we have an outage when 
something goes 745 00:26:05,760 --> 00:26:08,960 bump in the middle of the night 746 00:26:07,520 --> 00:26:11,279 the thing that takes the longest is 747 00:26:08,960 --> 00:26:13,120 figuring out what do we think is going 748 00:26:11,279 --> 00:26:14,880 wrong and how can we verify that as 749 00:26:13,120 --> 00:26:16,720 quickly as possible that's what takes 750 00:26:14,880 --> 00:26:19,679 the most time 751 00:26:16,720 --> 00:26:20,880 or kind of going to um fighter pilot 752 00:26:19,679 --> 00:26:23,279 school for a moment right like there's 753 00:26:20,880 --> 00:26:25,760 this idea in fighter pilot school of uh 754 00:26:23,279 --> 00:26:27,440 observe orient decide act 755 00:26:25,760 --> 00:26:29,520 right the ooda loop 756 00:26:27,440 --> 00:26:31,600 so we have to think about how do we 757 00:26:29,520 --> 00:26:33,279 accelerate that orient and observe part 758 00:26:31,600 --> 00:26:35,600 as quickly as possible 759 00:26:33,279 --> 00:26:37,520 how do we actually explore that data in 760 00:26:35,600 --> 00:26:39,440 order to ask new questions rather than 761 00:26:37,520 --> 00:26:40,960 just leafing through the existing 762 00:26:39,440 --> 00:26:43,600 dashboards that showed us the questions 763 00:26:40,960 --> 00:26:45,360 we thought to ask before 764 00:26:43,600 --> 00:26:47,440 all this is to say that our services 765 00:26:45,360 --> 00:26:49,760 must be observable 766 00:26:47,440 --> 00:26:51,919 in control theory observability is the 767 00:26:49,760 --> 00:26:55,120 ability to based off of the outputs of a 768 00:26:51,919 --> 00:26:56,799 system infer its inner state 769 00:26:55,120 --> 00:26:59,120 but i prefer to in the systems 770 00:26:56,799 --> 00:27:00,240 engineering context of computer systems 771 00:26:59,120 --> 00:27:02,880 think about 772 00:27:00,240 --> 00:27:04,320 how do we ask and answer unknown 773 00:27:02,880 --> 00:27:05,679 unknowns things that we didn't 774 00:27:04,320 -->
00:27:05,679 anticipate would break in the first 775 00:27:05,679 --> 00:27:09,279 place 776 00:27:07,279 --> 00:27:10,720 this requires us to be able to examine 777 00:27:09,279 --> 00:27:12,720 the events that are happening inside of 778 00:27:10,720 --> 00:27:14,960 our system in their full context to 779 00:27:12,720 --> 00:27:18,159 understand properties like 780 00:27:14,960 --> 00:27:20,240 you know which request is this from 781 00:27:18,159 --> 00:27:22,720 this happens to be some of the innards 782 00:27:20,240 --> 00:27:25,120 of our uh query engine 783 00:27:22,720 --> 00:27:27,840 and we can understand things like 784 00:27:25,120 --> 00:27:29,919 for this query which service issued it 785 00:27:27,840 --> 00:27:32,799 um you know was it an end user was it 786 00:27:29,919 --> 00:27:35,120 our uh was it our alerting service 787 00:27:32,799 --> 00:27:36,960 and where do we get slow right did it 788 00:27:35,120 --> 00:27:39,600 get slow waiting on aws lambda did it 789 00:27:36,960 --> 00:27:42,080 get slow uh reading files off of disk um 790 00:27:39,600 --> 00:27:44,240 was it blocked on cpu 791 00:27:42,080 --> 00:27:45,840 and which user is it right like not 792 00:27:44,240 --> 00:27:47,760 just kind of mushing all of our users 793 00:27:45,840 --> 00:27:49,200 together but instead thinking about how 794 00:27:47,760 --> 00:27:50,799 can we break apart and understand the 795 00:27:49,200 --> 00:27:52,640 behavior of each individual user in 796 00:27:50,799 --> 00:27:53,840 isolation 797 00:27:52,640 --> 00:27:54,960 we have to be able to explain the 798 00:27:53,840 --> 00:27:56,320 variance we have to be able to 799 00:27:54,960 --> 00:27:58,640 understand 800 00:27:56,320 --> 00:28:01,520 what separates a good user experience 801 00:27:58,640 --> 00:28:03,679 right a succeeding sli from a bad user 802 00:28:01,520 --> 00:28:05,600 experience a failing sli 803 00:28:03,679 --> 00:28:07,520 so for instance this is one way that you 804
00:28:05,600 --> 00:28:10,559 might visualize this data right to look 805 00:28:07,520 --> 00:28:11,760 at how can i see what dimensions are 806 00:28:10,559 --> 00:28:13,840 different between the succeeding and 807 00:28:11,760 --> 00:28:15,760 failing requests 808 00:28:13,840 --> 00:28:17,440 but even better yet why should we have 809 00:28:15,760 --> 00:28:18,960 to do this at 2am when we're at our 810 00:28:17,440 --> 00:28:21,279 worst 811 00:28:18,960 --> 00:28:23,760 i do not think that we should need to be 812 00:28:21,279 --> 00:28:27,200 doing debugging at 2am 813 00:28:23,760 --> 00:28:28,640 it's about mitigating impact first 814 00:28:27,200 --> 00:28:30,720 as long as you're collecting enough 815 00:28:28,640 --> 00:28:32,880 telemetry along the way and you don't 816 00:28:30,720 --> 00:28:34,559 have to catch it live in the act 817 00:28:32,880 --> 00:28:36,000 then you can roll back the bad release 818 00:28:34,559 --> 00:28:37,840 drain the bad data center and then you 819 00:28:36,000 --> 00:28:40,159 can debug it sitting cup of coffee in 820 00:28:37,840 --> 00:28:42,320 hand at 9am instead i think that's a lot 821 00:28:40,159 --> 00:28:44,399 better 822 00:28:42,320 --> 00:28:46,640 but to me observability is not just a 823 00:28:44,399 --> 00:28:48,799 benefit to the break fix it's not just a 824 00:28:46,640 --> 00:28:50,240 benefit to ops it's a benefit to 825 00:28:48,799 --> 00:28:52,320 everyone 826 00:28:50,240 --> 00:28:54,000 because observability helps us gain 827 00:28:52,320 --> 00:28:56,159 better confidence in what our code is 828 00:28:54,000 --> 00:28:57,520 doing in the first place why are tests 829 00:28:56,159 --> 00:28:59,679 failing 830 00:28:57,520 --> 00:29:01,679 or why is it that our deployment loop is 831 00:28:59,679 --> 00:29:03,919 taking two hours to run where it 832 00:29:01,679 --> 00:29:05,760 previously took 30 minutes 833 00:29:03,919 --> 00:29:07,200 or what are these users actually doing 834 00:29:05,760 
--> 00:29:09,120 with these features that we released 835 00:29:07,200 --> 00:29:11,840 last monday which users are making the 836 00:29:09,120 --> 00:29:13,600 most use of those features 837 00:29:11,840 --> 00:29:15,120 and do i have any single points of 838 00:29:13,600 --> 00:29:16,559 failure in my system or circular 839 00:29:15,120 --> 00:29:18,320 dependencies that i should be working to 840 00:29:16,559 --> 00:29:19,919 fix over the longer term 841 00:29:18,320 --> 00:29:21,840 those are all use cases that i think 842 00:29:19,919 --> 00:29:24,640 everyone can benefit from not just the 843 00:29:21,840 --> 00:29:26,080 people who are on call 844 00:29:24,640 --> 00:29:27,919 the other misconception that i often 845 00:29:26,080 --> 00:29:29,279 hear is that observability is logs 846 00:29:27,919 --> 00:29:31,360 traces and metrics that it's kind of 847 00:29:29,279 --> 00:29:33,440 these three pillars 848 00:29:31,360 --> 00:29:35,679 and that's not true to me 849 00:29:33,440 --> 00:29:37,399 observability as i described is a 850 00:29:35,679 --> 00:29:40,159 capability it is in fact a 851 00:29:37,399 --> 00:29:42,559 socio-technical capability it's one 852 00:29:40,159 --> 00:29:44,320 where the humans have to be able to use 853 00:29:42,559 --> 00:29:46,880 the tools appropriately to be able to 854 00:29:44,320 --> 00:29:48,799 answer their own questions 855 00:29:46,880 --> 00:29:50,240 and that means that it should be as easy 856 00:29:48,799 --> 00:29:51,919 to add 857 00:29:50,240 --> 00:29:53,520 the necessary debugging and 858 00:29:51,919 --> 00:29:55,520 instrumentation 859 00:29:53,520 --> 00:29:57,360 as it is to add a printf debug line 860 00:29:55,520 --> 00:29:59,279 because as much as i have done a lot of 861 00:29:57,360 --> 00:30:01,039 printf debugging and got here before 862 00:29:59,279 --> 00:30:02,720 like that's not a scalable method for 863 00:30:01,039 --> 00:30:05,120 working in prod how do we make it as 864 00:30:02,720 -->
00:30:07,360 easy to add good observability as it is 865 00:30:05,120 --> 00:30:09,120 to do a printf 866 00:30:07,360 --> 00:30:11,279 how do we send that data and store it in 867 00:30:09,120 --> 00:30:13,840 an economical fashion and finally most 868 00:30:11,279 --> 00:30:15,919 importantly can you actually query that 869 00:30:13,840 --> 00:30:17,600 data if you cannot actually query that 870 00:30:15,919 --> 00:30:18,640 data and answer your unknown unknown 871 00:30:17,600 --> 00:30:20,159 questions 872 00:30:18,640 --> 00:30:24,159 you don't have observability you just 873 00:30:20,159 --> 00:30:26,240 have a very expensive /dev/null 874 00:30:24,159 --> 00:30:28,799 so hopefully this elucidates for you why 875 00:30:26,240 --> 00:30:31,039 slos and observability go together 876 00:30:28,799 --> 00:30:32,880 because slos help you understand when 877 00:30:31,039 --> 00:30:34,720 things have gone too wrong and 878 00:30:32,880 --> 00:30:37,600 observability helps you piece together 879 00:30:34,720 --> 00:30:39,919 why so you can remediate the issue 880 00:30:37,600 --> 00:30:40,960 and then finally deliver a lasting fix 881 00:30:39,919 --> 00:30:44,240 to it 882 00:30:40,960 --> 00:30:46,000 as soon as it's convenient to you 883 00:30:44,240 --> 00:30:47,360 but we need to talk about the two other 884 00:30:46,000 --> 00:30:49,679 elements of production excellence 885 00:30:47,360 --> 00:30:51,760 besides slos and observability we need 886 00:30:49,679 --> 00:30:53,760 to talk about collaboration 887 00:30:51,760 --> 00:30:55,840 because the reality is that no 888 00:30:53,760 --> 00:30:58,080 individual human debugs and solves 889 00:30:55,840 --> 00:31:01,360 things alone no individual human is on 890 00:30:58,080 --> 00:31:03,360 call for services 24/7/365 that just is 891 00:31:01,360 --> 00:31:05,200 not sustainable anymore 892 00:31:03,360 --> 00:31:07,919 people deserve to go on vacations people 893 00:31:05,200 --> 00:31:09,440 deserve to uh retire
right like people 894 00:31:07,919 --> 00:31:11,360 might might 895 00:31:09,440 --> 00:31:13,279 be out on sick leave because of covid 896 00:31:11,360 --> 00:31:16,000 right we have to be able to allow that 897 00:31:13,279 --> 00:31:18,080 slack in our system 898 00:31:16,000 --> 00:31:20,000 so how do we make it possible for 899 00:31:18,080 --> 00:31:22,399 everyone to collaborate and debug 900 00:31:20,000 --> 00:31:24,559 together how do we raise every human 901 00:31:22,399 --> 00:31:25,919 being to the level of the best debugger 902 00:31:24,559 --> 00:31:28,320 on their team 903 00:31:25,919 --> 00:31:30,720 and how do we make sure that information 904 00:31:28,320 --> 00:31:33,039 is not lost as we cross organizational 905 00:31:30,720 --> 00:31:34,960 gaps 906 00:31:33,039 --> 00:31:36,559 we have to think about a broader set of 907 00:31:34,960 --> 00:31:39,279 users we need to think about the 908 00:31:36,559 --> 00:31:42,720 customer support agent or the uh or the 909 00:31:39,279 --> 00:31:44,399 cloud provider as equal customers to our 910 00:31:42,720 --> 00:31:46,720 engineering teams as far as who is a 911 00:31:44,399 --> 00:31:48,640 client of our debugging 912 00:31:46,720 --> 00:31:52,399 and are we working together have we 913 00:31:48,640 --> 00:31:54,240 practiced this at 3 pm and not at 3 am 914 00:31:52,399 --> 00:31:56,240 are we doing game days right are we 915 00:31:54,240 --> 00:31:58,840 doing disaster drills so that we work 916 00:31:56,240 --> 00:32:00,399 out those kinks rather than doing it 917 00:31:58,840 --> 00:32:02,320 live 918 00:32:00,399 --> 00:32:03,279 and we need to make sure that we have 919 00:32:02,320 --> 00:32:05,519 you know 920 00:32:03,279 --> 00:32:07,760 not service selfishness right service 921 00:32:05,519 --> 00:32:09,120 ownership does not mean selfishness 922 00:32:07,760 --> 00:32:10,720 we have to make sure that people 923 00:32:09,120 --> 00:32:12,320 understand that 924 00:32:10,720 --> 
00:32:14,320 we are working together and that 925 00:32:12,320 --> 00:32:16,000 hoarding knowledge inside of your head 926 00:32:14,320 --> 00:32:17,760 you know sure 927 00:32:16,000 --> 00:32:20,559 many of us have read the uh bastard 928 00:32:17,760 --> 00:32:21,679 operator from hell the bofh comics 929 00:32:20,559 --> 00:32:23,600 right 930 00:32:21,679 --> 00:32:25,679 but you know that's not a model to 931 00:32:23,600 --> 00:32:27,440 follow right it's not okay to have your 932 00:32:25,679 --> 00:32:29,360 job security tied up in how much you 933 00:32:27,440 --> 00:32:31,200 alone know 934 00:32:29,360 --> 00:32:33,760 a modern system is built up of people 935 00:32:31,200 --> 00:32:35,760 who are working together 936 00:32:33,760 --> 00:32:38,080 my colleague jessica kerr calls this the 937 00:32:35,760 --> 00:32:40,080 idea of symmathesy the idea of learning 938 00:32:38,080 --> 00:32:42,080 systems that are learning together made 939 00:32:40,080 --> 00:32:44,000 up of humans and machines that are trying 940 00:32:42,080 --> 00:32:45,679 to do better and to iterate and to work 941 00:32:44,000 --> 00:32:47,679 better together and build better tools 942 00:32:45,679 --> 00:32:48,960 over time 943 00:32:47,679 --> 00:32:51,039 we have to be able to work together and 944 00:32:48,960 --> 00:32:52,080 lean on our teams right we have to be 945 00:32:51,039 --> 00:32:54,159 able to hand off on-call 946 00:32:52,080 --> 00:32:56,159 responsibilities we have to recognize 947 00:32:54,159 --> 00:32:58,000 that on-call is a team level 948 00:32:56,159 --> 00:32:59,679 responsibility not an individual level 949 00:32:58,000 --> 00:33:01,600 responsibility 950 00:32:59,679 --> 00:33:03,200 if someone is a new parent and they are 951 00:33:01,600 --> 00:33:06,240 having sleepless nights because of their 952 00:33:03,200 --> 00:33:08,320 child don't put them on call right 953 00:33:06,240 --> 00:33:10,240 while i do believe that every developer 954 00:33:08,320
--> 00:33:11,919 should have some exposure to production 955 00:33:10,240 --> 00:33:13,919 it doesn't have to take the form of 956 00:33:11,919 --> 00:33:16,000 on-call it can take the form of working 957 00:33:13,919 --> 00:33:18,000 tickets during normal business hours 958 00:33:16,000 --> 00:33:20,000 right it just involves some form of 959 00:33:18,000 --> 00:33:21,679 feedback loop so that you are exposed to 960 00:33:20,000 --> 00:33:23,279 the consequences of the technical 961 00:33:21,679 --> 00:33:24,720 decisions of your team 962 00:33:23,279 --> 00:33:27,440 regardless of whether it's during 963 00:33:24,720 --> 00:33:29,440 working hours or not 964 00:33:27,440 --> 00:33:30,640 and we have to document things so that 965 00:33:29,440 --> 00:33:33,120 we're leaving 966 00:33:30,640 --> 00:33:35,120 as my colleague tanya reilly says we're 967 00:33:33,120 --> 00:33:36,240 leaving cookies for our future self not 968 00:33:35,120 --> 00:33:37,919 traps 969 00:33:36,240 --> 00:33:40,320 we have to keep things organized and not 970 00:33:37,919 --> 00:33:41,919 just you know have misleading documents 971 00:33:40,320 --> 00:33:43,279 that lead us astray that have 972 00:33:41,919 --> 00:33:45,039 gotten out of date but we have to at 973 00:33:43,279 --> 00:33:45,840 least have these common patterns right 974 00:33:45,039 --> 00:33:48,159 of 975 00:33:45,840 --> 00:33:50,399 what is the service for how do i shut it 976 00:33:48,159 --> 00:33:52,000 off how do i update it what happens if 977 00:33:50,399 --> 00:33:55,279 this breaks right those are kind of the 978 00:33:52,000 --> 00:33:55,279 key things that we need to remember 979 00:33:55,440 --> 00:33:58,320 and we need to make sure that you know 980 00:33:56,799 --> 00:33:59,840 you don't have that single person like 981 00:33:58,320 --> 00:34:01,600 miles or myself you know who has been 982 00:33:59,840 --> 00:34:03,919 that hero for ages and ages right we 983 00:34:01,600 --> 00:34:05,679 have to share
that knowledge and we have 984 00:34:03,919 --> 00:34:07,840 to make sure that people are able to 985 00:34:05,679 --> 00:34:09,599 collaborate by having common sources of 986 00:34:07,840 --> 00:34:11,280 data 987 00:34:09,599 --> 00:34:13,599 one common source of data that i want to 988 00:34:11,280 --> 00:34:15,679 point out is opentelemetry 989 00:34:13,599 --> 00:34:18,000 which is a vendor neutral standard that 990 00:34:15,679 --> 00:34:20,560 is developed by end users and multiple 991 00:34:18,000 --> 00:34:21,919 vendors working together in order to 992 00:34:20,560 --> 00:34:24,240 make sure that when you produce 993 00:34:21,919 --> 00:34:25,919 observability data there is a single 994 00:34:24,240 --> 00:34:27,200 source of truth that you can pipe to 995 00:34:25,919 --> 00:34:29,679 anywhere you like 996 00:34:27,200 --> 00:34:31,280 without experiencing lock-in and that's 997 00:34:29,679 --> 00:34:33,119 really powerful because it means that 998 00:34:31,280 --> 00:34:35,119 you no longer have data that is tied to 999 00:34:33,119 --> 00:34:37,280 one specific platform that for instance 1000 00:34:35,119 --> 00:34:39,440 no one else besides one team can access 1001 00:34:37,280 --> 00:34:41,200 or that you know you lose your data if 1002 00:34:39,440 --> 00:34:42,720 you wind up having to uh change 1003 00:34:41,200 --> 00:34:44,480 providers right like 1004 00:34:42,720 --> 00:34:45,839 having the shared understanding and 1005 00:34:44,480 --> 00:34:48,560 ground truth about what's happening in 1006 00:34:45,839 --> 00:34:49,839 our systems is really powerful 1007 00:34:48,560 --> 00:34:52,159 we also have to make sure that we're 1008 00:34:49,839 --> 00:34:54,639 rewarding curiosity and teamwork right 1009 00:34:52,159 --> 00:34:56,240 that instead of saying you know uh hey 1010 00:34:54,639 --> 00:34:57,920 jess i can't believe you didn't know 1011 00:34:56,240 --> 00:34:59,839 that right like do you think jess is gonna 1012 00:34:57,920 -->
00:35:01,040 ask me another question after that no 1013 00:34:59,839 --> 00:35:02,400 she's not 1014 00:35:01,040 --> 00:35:03,599 so instead we have to say things like 1015 00:35:02,400 --> 00:35:04,960 you know hey 1016 00:35:03,599 --> 00:35:06,560 thank you for asking the question i'm 1017 00:35:04,960 --> 00:35:08,240 sorry that the documentation wasn't 1018 00:35:06,560 --> 00:35:09,599 clear let's work together to document 1019 00:35:08,240 --> 00:35:11,839 that better so that no one else has to 1020 00:35:09,599 --> 00:35:14,079 stumble into that again 1021 00:35:11,839 --> 00:35:15,760 so by rewarding curiosity and teamwork 1022 00:35:14,079 --> 00:35:17,760 that really creates a healthier culture 1023 00:35:15,760 --> 00:35:19,280 of collaboration 1024 00:35:17,760 --> 00:35:21,119 but it's not just about collaborating 1025 00:35:19,280 --> 00:35:22,960 with your current colleagues it's also 1026 00:35:21,119 --> 00:35:25,280 about collaborating with your present 1027 00:35:22,960 --> 00:35:28,320 past and future self 1028 00:35:25,280 --> 00:35:29,760 that often you know we pull up git blame 1029 00:35:28,320 --> 00:35:31,520 and git blame says you know who that 1030 00:35:29,760 --> 00:35:33,520 idiot was who wrote that code that that 1031 00:35:31,520 --> 00:35:36,800 crashed 1032 00:35:33,520 --> 00:35:38,720 yeah it was me it was me again 1033 00:35:36,800 --> 00:35:40,400 right so you have to leave yourself 1034 00:35:38,720 --> 00:35:41,440 breadcrumbs and be kind to yourself 1035 00:35:40,400 --> 00:35:43,520 right to make sure that you're 1036 00:35:41,440 --> 00:35:45,040 documenting things for future you so 1037 00:35:43,520 --> 00:35:46,880 that you're not cursing future you're in 1038 00:35:45,040 --> 00:35:49,040 the future but be kind to yourself about 1039 00:35:46,880 --> 00:35:50,560 it 1040 00:35:49,040 --> 00:35:53,119 but while i did say earlier that 1041 00:35:50,560 --> 00:35:55,200 outages are not exactly the same 1042
00:35:53,119 --> 00:35:56,960 there are common patterns that we often 1043 00:35:55,200 --> 00:35:58,400 see in outages and it behooves us as 1044 00:35:56,960 --> 00:36:00,400 engineers to think about closing 1045 00:35:58,400 --> 00:36:04,400 feedback loops and eliminating some of 1046 00:36:00,400 --> 00:36:06,320 the more common categories of outages 1047 00:36:04,400 --> 00:36:08,000 and we can think about employing risk 1048 00:36:06,320 --> 00:36:10,839 analysis to help us become more 1049 00:36:08,000 --> 00:36:12,400 proactive in approaching uh our failure 1050 00:36:10,839 --> 00:36:14,160 cases 1051 00:36:12,400 --> 00:36:17,359 so i have out my window the sydney 1052 00:36:14,160 --> 00:36:20,079 harbour bridge and let's suppose that um 1053 00:36:17,359 --> 00:36:21,599 there's been uh some lack of maintenance 1054 00:36:20,079 --> 00:36:23,920 and cars are falling through the road 1055 00:36:21,599 --> 00:36:26,240 bed and the sydney transport trains are 1056 00:36:23,920 --> 00:36:28,400 falling through it's no good right and 1057 00:36:26,240 --> 00:36:30,000 also we all know um especially in light 1058 00:36:28,400 --> 00:36:31,680 of the volcanic activity that at some 1059 00:36:30,000 --> 00:36:34,400 point you know we live along the pacific 1060 00:36:31,680 --> 00:36:37,599 rim the earthquake's gonna happen right 1061 00:36:34,400 --> 00:36:38,800 and finally it's probably long past time 1062 00:36:37,599 --> 00:36:41,359 that we took down the christmas lighting 1063 00:36:38,800 --> 00:36:42,800 on the sydney harbour bridge right 1064 00:36:41,359 --> 00:36:44,720 which one of those three things do we 1065 00:36:42,800 --> 00:36:46,480 address first 1066 00:36:44,720 --> 00:36:48,000 we fix the cars that are falling through 1067 00:36:46,480 --> 00:36:49,040 the road bed and the trains that are 1068 00:36:48,000 --> 00:36:50,560 traveling along the tracks that are 1069 00:36:49,040 --> 00:36:52,000 leading nowhere right like we fix that 1070
00:36:50,560 --> 00:36:55,119 first because it's having the highest 1071 00:36:52,000 --> 00:36:56,640 most critical impact on our users 1072 00:36:55,119 --> 00:37:00,720 so we need to think about what's the 1073 00:36:56,640 --> 00:37:02,480 frequency and impact of a risk to users 1074 00:37:00,720 --> 00:37:05,359 now let's move back from the physical 1075 00:37:02,480 --> 00:37:07,359 world to the computer world 1076 00:37:05,359 --> 00:37:08,960 how many of you have that shudder down 1077 00:37:07,359 --> 00:37:12,400 your spine when i say 1078 00:37:08,960 --> 00:37:12,400 the mysql database 1079 00:37:13,359 --> 00:37:16,640 right 1080 00:37:14,800 --> 00:37:17,520 that is a single point of failure and we 1081 00:37:16,640 --> 00:37:20,000 know 1082 00:37:17,520 --> 00:37:21,520 as systems engineers that the mysql 1083 00:37:20,000 --> 00:37:23,040 database is going to fail at some point 1084 00:37:21,520 --> 00:37:24,400 in the next year right it's going to 1085 00:37:23,040 --> 00:37:25,680 happen 1086 00:37:24,400 --> 00:37:27,119 you don't often have control over 1087 00:37:25,680 --> 00:37:29,119 frequency but what you do have control 1088 00:37:27,119 --> 00:37:30,960 over is impact so how can we make the 1089 00:37:29,119 --> 00:37:33,119 mysql database failing 1090 00:37:30,960 --> 00:37:34,880 not take out everything well 1091 00:37:33,119 --> 00:37:36,960 you could shard the database right you 1092 00:37:34,880 --> 00:37:39,040 could um have it only take down two 1093 00:37:36,960 --> 00:37:40,400 percent of users data at a time if it 1094 00:37:39,040 --> 00:37:42,480 goes out 1095 00:37:40,400 --> 00:37:43,839 you could decrease the amount of time 1096 00:37:42,480 --> 00:37:45,520 that it takes to identify that it's the 1097 00:37:43,839 --> 00:37:46,800 mysql database 1098 00:37:45,520 --> 00:37:48,160 you could also decrease the amount of 1099 00:37:46,800 --> 00:37:49,520 time that it takes to fail over right 1100 00:37:48,160 -->
00:37:50,720 you could have a hot spare running 1101 00:37:49,520 --> 00:37:52,800 instead of needing to restore from a 1102 00:37:50,720 --> 00:37:55,359 backup right that cuts the time from you 1103 00:37:52,800 --> 00:37:57,760 know two hours to restore the backup to 1104 00:37:55,359 --> 00:37:59,040 two seconds before the uh the watchdog 1105 00:37:57,760 --> 00:38:01,680 notices and kicks things over to the 1106 00:37:59,040 --> 00:38:03,280 replica right 1107 00:38:01,680 --> 00:38:04,880 so think about what are the risks that 1108 00:38:03,280 --> 00:38:06,560 are the most significant what's 1109 00:38:04,880 --> 00:38:07,839 happening with the highest impact 1110 00:38:06,560 --> 00:38:10,800 impacting the highest percentage of 1111 00:38:07,839 --> 00:38:12,640 users and lasting the longest 1112 00:38:10,800 --> 00:38:14,880 and then trying to move that needle 1113 00:38:12,640 --> 00:38:16,880 trying to work on on reducing those 1114 00:38:14,880 --> 00:38:18,320 significant impacts that doesn't mean 1115 00:38:16,880 --> 00:38:20,880 you have to eliminate them entirely but 1116 00:38:18,320 --> 00:38:23,680 you can think about ways to mitigate the 1117 00:38:20,880 --> 00:38:25,280 impact so that they're less bad 1118 00:38:23,680 --> 00:38:27,040 and how do i decide what's a bad risk 1119 00:38:25,280 --> 00:38:28,880 what's not acceptable well the answer is 1120 00:38:27,040 --> 00:38:30,480 we just defined a service level 1121 00:38:28,880 --> 00:38:32,160 objective right 1122 00:38:30,480 --> 00:38:34,880 so if you have a service level objective 1123 00:38:32,160 --> 00:38:37,040 in mind then that helps you figure out 1124 00:38:34,880 --> 00:38:38,480 okay i'm allowed to have 10,000 bad 1125 00:38:37,040 --> 00:38:40,640 requests per month 1126 00:38:38,480 --> 00:38:43,040 this one failure case for instance the 1127 00:38:40,640 --> 00:38:44,800 mysql database is responsible for you 1128 00:38:43,040 --> 00:38:46,960 know in expectation it's going to
cause 1129 00:38:44,800 --> 00:38:48,720 5000 failed requests per month 1130 00:38:46,960 --> 00:38:50,880 that's too high because that doesn't 1131 00:38:48,720 --> 00:38:52,400 leave us enough room for our unknown 1132 00:38:50,880 --> 00:38:54,720 unknowns for things that we didn't think 1133 00:38:52,400 --> 00:38:56,240 were going to happen 1134 00:38:54,720 --> 00:38:58,320 so there right there that's your 1135 00:38:56,240 --> 00:39:00,320 business case for being proactive for 1136 00:38:58,320 --> 00:39:01,839 fixing those issues before they cause 1137 00:39:00,320 --> 00:39:02,800 you to burn through your entire error 1138 00:39:01,839 --> 00:39:04,640 budget 1139 00:39:02,800 --> 00:39:06,560 because it's not possible to operate a 1140 00:39:04,640 --> 00:39:08,400 sustainable error budget 1141 00:39:06,560 --> 00:39:10,320 if that error budget is already filled 1142 00:39:08,400 --> 00:39:12,560 with known things that is going that are 1143 00:39:10,320 --> 00:39:14,160 going to break 1144 00:39:12,560 --> 00:39:15,599 and finally this also helps give you a 1145 00:39:14,160 --> 00:39:18,000 framework for how to approach your 1146 00:39:15,599 --> 00:39:19,839 retrospectives from incidents 1147 00:39:18,000 --> 00:39:21,359 many of us when we create retrospectives 1148 00:39:19,839 --> 00:39:22,800 kind of create this list of 10 things 1149 00:39:21,359 --> 00:39:24,560 that would have prevented this specific 1150 00:39:22,800 --> 00:39:26,720 outage right it's not the right way to 1151 00:39:24,560 --> 00:39:28,800 think about it instead think about what 1152 00:39:26,720 --> 00:39:30,240 were the highest impact things that 1153 00:39:28,800 --> 00:39:32,160 resulted in 1154 00:39:30,240 --> 00:39:33,839 an outage or an outage similar to this 1155 00:39:32,160 --> 00:39:35,280 being able to happen in the future right 1156 00:39:33,839 --> 00:39:36,320 pick those one or two most important 1157 00:39:35,280 --> 00:39:38,079 things 1158 00:39:36,320 --> 
00:39:40,800 based off of 1159 00:39:38,079 --> 00:39:41,920 the frequency and the blast radius 1160 00:39:40,800 --> 00:39:43,680 and those are the ones that you should 1161 00:39:41,920 --> 00:39:46,400 actually fix and you can put the other 1162 00:39:43,680 --> 00:39:48,800 ones on a bug bash list or on a or a 1163 00:39:46,400 --> 00:39:50,640 hack week list right don't waste your 1164 00:39:48,800 --> 00:39:53,359 precious production improvement time 1165 00:39:50,640 --> 00:39:55,280 doing chrome polishing 1166 00:39:53,359 --> 00:39:57,040 but i want to call out two things 1167 00:39:55,280 --> 00:39:59,520 first of all that if you have a lack of 1168 00:39:57,040 --> 00:40:01,760 observability that is systemic risk 1169 00:39:59,520 --> 00:40:04,079 that if every outage is taking half an 1170 00:40:01,760 --> 00:40:05,599 hour an hour two hours three hours even 1171 00:40:04,079 --> 00:40:07,359 for people to figure out what's even 1172 00:40:05,599 --> 00:40:08,960 going on what services are impacted 1173 00:40:07,359 --> 00:40:10,480 who's impacted like what do we think is 1174 00:40:08,960 --> 00:40:12,079 happening right like 1175 00:40:10,480 --> 00:40:13,119 that is time that your users are 1176 00:40:12,079 --> 00:40:14,880 suffering 1177 00:40:13,119 --> 00:40:17,040 and if you shrink the amount of time 1178 00:40:14,880 --> 00:40:18,880 that it takes to debug things 1179 00:40:17,040 --> 00:40:21,200 and know like you know we don't believe 1180 00:40:18,880 --> 00:40:22,800 in the idea of mean time to recovery but 1181 00:40:21,200 --> 00:40:25,280 certainly you know you can think about 1182 00:40:22,800 --> 00:40:26,880 the number of dropped queries right as 1183 00:40:25,280 --> 00:40:29,040 users are experiencing pain as you're 1184 00:40:26,880 --> 00:40:30,560 figuring out what's even going on right 1185 00:40:29,040 --> 00:40:32,800 so better observability is something 1186 00:40:30,560 --> 00:40:34,800 that impacts every item on that
list of 1187 00:40:32,800 --> 00:40:36,079 risks 1188 00:40:34,800 --> 00:40:38,000 and similarly if you have a lack of 1189 00:40:36,079 --> 00:40:39,920 collaboration right if i'd talked to the 1190 00:40:38,000 --> 00:40:41,760 av team just now and i and i'd said 1191 00:40:39,920 --> 00:40:43,200 something like you know can't believe 1192 00:40:41,760 --> 00:40:45,119 this get your [ __ ] together how could 1193 00:40:43,200 --> 00:40:46,160 this possibly really 1194 00:40:45,119 --> 00:40:48,160 do you think that's going to make them 1195 00:40:46,160 --> 00:40:50,720 work better no that's going to make them 1196 00:40:48,160 --> 00:40:52,400 work worse right like 1197 00:40:50,720 --> 00:40:54,640 being friendly being supportive right 1198 00:40:52,400 --> 00:40:56,240 like that's how you ensure that people 1199 00:40:54,640 --> 00:40:59,839 report issues early that people 1200 00:40:56,240 --> 00:40:59,839 collaborate that they're transparent 1201 00:41:00,000 --> 00:41:04,400 in 2018 google cloud had a large outage 1202 00:41:02,800 --> 00:41:07,200 that impacted global networking for 1203 00:41:04,400 --> 00:41:08,400 about 45 or 50 minutes 1204 00:41:07,200 --> 00:41:10,400 do you know what caused that outage to be 1205 00:41:08,400 --> 00:41:11,760 solved in 45 50 minutes and not three 1206 00:41:10,400 --> 00:41:13,760 hours 1207 00:41:11,760 --> 00:41:14,720 the developer who had pushed the code 1208 00:41:13,760 --> 00:41:16,400 change 1209 00:41:14,720 --> 00:41:18,480 that had inadvertently tickled that 1210 00:41:16,400 --> 00:41:20,240 outage 1211 00:41:18,480 --> 00:41:21,599 they raised their hand and they said i 1212 00:41:20,240 --> 00:41:24,560 think it might be my change and i'm 1213 00:41:21,599 --> 00:41:26,720 already reverting it instead of cowering 1214 00:41:24,560 --> 00:41:28,800 and being in fear and not speaking up 1215 00:41:26,720 --> 00:41:31,200 for fear of being fired right 1216 00:41:28,800 --> 00:41:32,839 good collaboration
decreases the amount 1217 00:41:31,200 --> 00:41:35,760 of time that outages 1218 00:41:32,839 --> 00:41:37,200 take so you don't have to be a hero in 1219 00:41:35,760 --> 00:41:39,599 order to have a 1220 00:41:37,200 --> 00:41:41,760 system that is production excellent 1221 00:41:39,599 --> 00:41:43,599 you just have to follow this recipe of 1222 00:41:41,760 --> 00:41:45,200 those four things 1223 00:41:43,599 --> 00:41:47,200 and i'm not just talking in the abstract 1224 00:41:45,200 --> 00:41:48,880 i'm talking about what i've learned from 1225 00:41:47,200 --> 00:41:51,760 my time at honeycomb 1226 00:41:48,880 --> 00:41:53,280 that we're today about 40 50 engineers 1227 00:41:51,760 --> 00:41:54,720 but when i started we were 10 engineers 1228 00:41:53,280 --> 00:41:57,040 right and we're competing with companies 1229 00:41:54,720 --> 00:41:59,119 that are 10 times our size 1230 00:41:57,040 --> 00:42:00,640 and we managed to pull it off only 1231 00:41:59,119 --> 00:42:03,280 because of some of these factors of 1232 00:42:00,640 --> 00:42:06,000 production excellence of feeling like we 1233 00:42:03,280 --> 00:42:07,839 have the freedom to deploy on fridays to 1234 00:42:06,000 --> 00:42:10,160 deploy confidently 1235 00:42:07,839 --> 00:42:12,480 but also to have the responsibility to 1236 00:42:10,160 --> 00:42:15,119 look after changes after they go out 1237 00:42:12,480 --> 00:42:17,680 we do not push and run right like 1238 00:42:15,119 --> 00:42:19,839 it's not safe to push at 6 pm and walk 1239 00:42:17,680 --> 00:42:21,359 out the door and go get a and and go get 1240 00:42:19,839 --> 00:42:22,720 a beer right 1241 00:42:21,359 --> 00:42:24,240 no matter what day of the week it is on 1242 00:42:22,720 --> 00:42:25,520 the other hand if it's friday morning go 1243 00:42:24,240 --> 00:42:27,680 ahead and push as long as you're going 1244 00:42:25,520 --> 00:42:29,440 to be awake and around to look at it 1245 00:42:27,680 --> 00:42:32,400 so you can see
here that we push up to 1246 00:42:29,440 --> 00:42:34,160 14 times per day 1247 00:42:32,400 --> 00:42:35,920 and we've done this all while traffic 1248 00:42:34,160 --> 00:42:37,440 has gone up three to five times in a 1249 00:42:35,920 --> 00:42:39,200 single year 1250 00:42:37,440 --> 00:42:40,960 covid has been really interesting for a 1251 00:42:39,200 --> 00:42:42,560 business because more things are moving 1252 00:42:40,960 --> 00:42:45,680 online 1253 00:42:42,560 --> 00:42:47,839 we have had basically a tripling of the 1254 00:42:45,680 --> 00:42:49,200 amount of write workload of the amount 1255 00:42:47,839 --> 00:42:51,680 of telemetry data coming from our 1256 00:42:49,200 --> 00:42:53,839 customers into honeycomb 1257 00:42:51,680 --> 00:42:56,000 and we've also had three times as many 1258 00:42:53,839 --> 00:42:57,200 people who are asking us questions 1259 00:42:56,000 --> 00:42:59,440 right people who are trying to 1260 00:42:57,200 --> 00:43:01,280 understand the behavior of their systems 1261 00:42:59,440 --> 00:43:03,040 so we've had to add all these features 1262 00:43:01,280 --> 00:43:04,960 and scale out our system and keep it 1263 00:43:03,040 --> 00:43:07,040 reliable at the same time 1264 00:43:04,960 --> 00:43:07,920 and how did we do that 1265 00:43:07,040 --> 00:43:10,240 well 1266 00:43:07,920 --> 00:43:12,800 we defined our slos 1267 00:43:10,240 --> 00:43:14,560 we thought about our areas of risk 1268 00:43:12,800 --> 00:43:16,240 and then we designed experiments to 1269 00:43:14,560 --> 00:43:19,760 validate that that risk was not there 1270 00:43:16,240 --> 00:43:22,480 and to fix those risks if we found them 1271 00:43:19,760 --> 00:43:24,839 right that's how we approach this 1272 00:43:22,480 --> 00:43:28,000 so we actually practice slos at 1273 00:43:24,839 --> 00:43:30,079 honeycomb and our slos reflect the user 1274 00:43:28,000 --> 00:43:32,160 value that we provide 1275 00:43:30,079 --> 00:43:34,319 remember when i said
earlier that we're 1276 00:43:32,160 --> 00:43:36,079 an observability provider right what that 1277 00:43:34,319 --> 00:43:38,400 means fundamentally is we are a big data 1278 00:43:36,079 --> 00:43:40,960 platform right we ingest telemetry in 1279 00:43:38,400 --> 00:43:42,880 for instance opentelemetry format 1280 00:43:40,960 --> 00:43:45,040 and we have to break apart that data and 1281 00:43:42,880 --> 00:43:46,640 provide kind of almost an index on it so 1282 00:43:45,040 --> 00:43:49,040 that it can be quickly retrieved and 1283 00:43:46,640 --> 00:43:51,359 queried later 1284 00:43:49,040 --> 00:43:53,680 so therefore our slos are shaped like 1285 00:43:51,359 --> 00:43:56,400 those user journeys that if you view the 1286 00:43:53,680 --> 00:43:58,240 home page 99.9% of the time it should 1287 00:43:56,400 --> 00:43:59,920 load quickly enough within 250 1288 00:43:58,240 --> 00:44:02,000 milliseconds 1289 00:43:59,920 --> 00:44:04,319 but that it's okay for some queries to 1290 00:44:02,000 --> 00:44:05,599 fail even up to one percent of arbitrary 1291 00:44:04,319 --> 00:44:07,440 queries that those could take longer 1292 00:44:05,599 --> 00:44:08,480 than 10 seconds right like because the 1293 00:44:07,440 --> 00:44:10,079 idea is 1294 00:44:08,480 --> 00:44:12,000 you might be able to hit retry right 1295 00:44:10,079 --> 00:44:13,520 there that's okay 1296 00:44:12,000 --> 00:44:15,440 but the one thing that we do not mess 1297 00:44:13,520 --> 00:44:17,440 around with is user data coming in 1298 00:44:15,440 --> 00:44:19,359 because we know we in general have like 1299 00:44:17,440 --> 00:44:20,800 one maybe two chances to get it right if 1300 00:44:19,359 --> 00:44:23,040 you happen to have a buffer and retry 1301 00:44:20,800 --> 00:44:24,720 right that if we drop customer data on 1302 00:44:23,040 --> 00:44:27,119 the floor that's going to result in a 1303 00:44:24,720 --> 00:44:30,800 graph in a divot in the graph of every 1304 00:44:27,119 -->
00:44:30,800 honeycomb customer in perpetuity 1305 00:44:31,040 --> 00:44:35,440 so that means that we have to think 1306 00:44:32,720 --> 00:44:37,440 about how do we adhere to such a high 1307 00:44:35,440 --> 00:44:40,319 slo right 1308 00:44:37,440 --> 00:44:42,319 right that's 99.99% that's 4.3 minutes of 1309 00:44:40,319 --> 00:44:43,920 violation a month that can be really 1310 00:44:42,319 --> 00:44:46,720 hairy 1311 00:44:43,920 --> 00:44:48,640 so how do we stay within slo 1312 00:44:46,720 --> 00:44:50,400 the answer is the accelerate state 1313 00:44:48,640 --> 00:44:52,480 of devops metrics right which tell us 1314 00:44:50,400 --> 00:44:55,520 that if you deploy on demand multiple 1315 00:44:52,480 --> 00:44:56,319 times per day it's safer to do that 1316 00:44:55,520 --> 00:44:57,760 right 1317 00:44:56,319 --> 00:44:59,200 no amount of qa checking is going to 1318 00:44:57,760 --> 00:45:01,599 find these big giant issues so it's 1319 00:44:59,200 --> 00:45:04,720 better almost like riding a bicycle the 1320 00:45:01,599 --> 00:45:06,319 faster you ride to a degree the more 1321 00:45:04,720 --> 00:45:09,040 stable you are because you kind of have 1322 00:45:06,319 --> 00:45:10,720 that spinning gyroscope right 1323 00:45:09,040 --> 00:45:12,560 that the quicker the feedback loop is 1324 00:45:10,720 --> 00:45:14,400 between you pushing a change into 1325 00:45:12,560 --> 00:45:15,760 main and its landing in production the 1326 00:45:14,400 --> 00:45:17,680 more likely it is that you'll remember 1327 00:45:15,760 --> 00:45:19,520 what was going on and have context to 1328 00:45:17,680 --> 00:45:21,119 figure out what's happening 1329 00:45:19,520 --> 00:45:23,280 and that if there is a problem right 1330 00:45:21,119 --> 00:45:25,520 like you'll notice that the dora metrics 1331 00:45:23,280 --> 00:45:28,720 say that 15% of companies 1332 00:45:25,520 --> 00:45:30,800 or 15% of the changes pushed by elite 1333 00:45:28,720 -->
companies can fail that's okay the thing 1334 00:45:30,800 --> 00:45:34,000 that distinguishes them is that failures 1335 00:45:32,640 --> 00:45:35,920 are not expensive that they can roll 1336 00:45:34,000 --> 00:45:38,800 back in minutes or at most an hour 1337 00:45:35,920 --> 00:45:40,480 rather than than days 1338 00:45:38,800 --> 00:45:43,040 so for us 1339 00:45:40,480 --> 00:45:45,040 it starts with lead time we think about 1340 00:45:43,040 --> 00:45:46,400 how do you actually 1341 00:45:45,040 --> 00:45:47,520 take less than three hours from the time 1342 00:45:46,400 --> 00:45:49,200 that you're sitting hands on keyboard 1343 00:45:47,520 --> 00:45:51,280 writing the tests making sure they pass 1344 00:45:49,200 --> 00:45:53,119 to having it running in production 1345 00:45:51,280 --> 00:45:55,760 and the answer is we keep the build time 1346 00:45:53,119 --> 00:45:57,359 fast less than 10 minutes per build and 1347 00:45:55,760 --> 00:46:00,800 every time it gets higher than that we 1348 00:45:57,359 --> 00:46:03,119 do some debugging to figure out why 1349 00:46:00,800 --> 00:46:04,800 we automatically push once an hour and 1350 00:46:03,119 --> 00:46:06,480 the reason for that is that we are 1351 00:46:04,800 --> 00:46:08,160 trying to keep the number of commits per 1352 00:46:06,480 --> 00:46:09,839 build artifact low 1353 00:46:08,160 --> 00:46:11,760 and again to keep that lead time very 1354 00:46:09,839 --> 00:46:12,640 low 1355 00:46:11,760 --> 00:46:14,240 and 1356 00:46:12,640 --> 00:46:15,520 while it is true that you know maybe 1357 00:46:14,240 --> 00:46:16,720 five percent of the time our changes 1358 00:46:15,520 --> 00:46:18,480 don't work exactly the way that we 1359 00:46:16,720 --> 00:46:20,079 anticipated 1360 00:46:18,480 --> 00:46:21,920 we only have about one in a thousand 1361 00:46:20,079 --> 00:46:24,160 changes fail in a way that we cannot 1362 00:46:21,920 --> 00:46:25,920 quickly remediate right that require 1363 
00:46:24,160 --> 00:46:26,960 actually doing a fix forward or a 1364 00:46:25,920 --> 00:46:30,880 rollback 1365 00:46:26,960 --> 00:46:30,880 rather than just doing a flag flip 1366 00:46:30,960 --> 00:46:34,240 and that means that we've optimized 1367 00:46:32,480 --> 00:46:38,000 around that path of making things take 1368 00:46:34,240 --> 00:46:39,760 as little time as possible to repair 1369 00:46:38,000 --> 00:46:42,000 so what this means is that in practice 1370 00:46:39,760 --> 00:46:44,319 we add instrumentation right we add 1371 00:46:42,000 --> 00:46:45,920 opentelemetry spans as we write our code so 1372 00:46:44,319 --> 00:46:47,599 we can better understand that behavior 1373 00:46:45,920 --> 00:46:49,440 both as we're coding 1374 00:46:47,599 --> 00:46:51,599 writing our tests and also as it reaches 1375 00:46:49,440 --> 00:46:53,839 production 1376 00:46:51,599 --> 00:46:57,680 we don't do kind of heavyweight 1377 00:46:53,839 --> 00:46:59,680 you know selenium tests we do dom tests 1378 00:46:57,680 --> 00:47:02,000 to kind of serialize and diff the dom to 1379 00:46:59,680 --> 00:47:03,760 make sure that it's good enough 1380 00:47:02,000 --> 00:47:05,280 and we make sure every major change has 1381 00:47:03,760 --> 00:47:08,640 the ability to turn on and off so that 1382 00:47:05,280 --> 00:47:10,240 we decouple a release from a deploy 1383 00:47:08,640 --> 00:47:12,480 and we really prioritize making those 1384 00:47:10,240 --> 00:47:14,319 unit tests fast as i said earlier a 1385 00:47:12,480 --> 00:47:15,920 build time 10 minute or bust right and 1386 00:47:14,319 --> 00:47:18,160 if it takes longer than that we're doing 1387 00:47:15,920 --> 00:47:21,040 some digging to figure out why 1388 00:47:18,160 --> 00:47:23,440 and also human beings prioritize doing 1389 00:47:21,040 --> 00:47:25,040 their reviews as quickly as possible 1390 00:47:23,440 --> 00:47:26,160 so that people don't drop things on the floor 1391 00:47:25,040 --> 00:47:27,920 and
people are not blocking on each 1392 00:47:26,160 --> 00:47:29,280 other that doesn't mean you interrupt 1393 00:47:27,920 --> 00:47:30,640 each other all the time but like i'll 1394 00:47:29,280 --> 00:47:32,079 check once an hour to see whether there 1395 00:47:30,640 --> 00:47:33,839 are code reviews waiting for me because 1396 00:47:32,079 --> 00:47:36,319 i know that that's blocking someone from 1397 00:47:33,839 --> 00:47:38,160 getting their work into production 1398 00:47:36,319 --> 00:47:39,760 but what we don't do is we don't kind of 1399 00:47:38,160 --> 00:47:41,920 hold things in the holding pen forever 1400 00:47:39,760 --> 00:47:42,880 right as soon as the tests are green we 1401 00:47:41,920 --> 00:47:45,119 merge 1402 00:47:42,880 --> 00:47:46,960 and not only do we merge we push the 1403 00:47:45,119 --> 00:47:48,800 latest green build once an hour we 1404 00:47:46,960 --> 00:47:51,040 automatically push through each of the 1405 00:47:48,800 --> 00:47:53,119 environments in sequence 1406 00:47:51,040 --> 00:47:55,599 but we have the ability to stop to kind 1407 00:47:53,119 --> 00:47:57,280 of pull an andon cord to stop releases if 1408 00:47:55,599 --> 00:47:58,640 we think that there's a problem so 1409 00:47:57,280 --> 00:48:01,520 that's kind of how we keep the assembly 1410 00:47:58,640 --> 00:48:03,280 line of production changes rolling out 1411 00:48:01,520 --> 00:48:04,880 but the most important thing that we do 1412 00:48:03,280 --> 00:48:06,559 is that we observe real customer 1413 00:48:04,880 --> 00:48:08,800 behavior in production 1414 00:48:06,559 --> 00:48:11,440 that we have a set of environments not 1415 00:48:08,800 --> 00:48:13,599 just for customers to observe their data 1416 00:48:11,440 --> 00:48:15,520 but for us to observe what's happening 1417 00:48:13,599 --> 00:48:17,520 inside production how are customers 1418 00:48:15,520 --> 00:48:20,240 experiencing honeycomb 1419 00:48:17,520 --> 00:48:21,760 and are there any
problems right like 1420 00:48:20,240 --> 00:48:23,440 for instance how are people actually 1421 00:48:21,760 --> 00:48:25,760 using the feature that we built how fast 1422 00:48:23,440 --> 00:48:27,359 is it how performant is it 1423 00:48:25,760 --> 00:48:28,880 and we also have to be able to observe 1424 00:48:27,359 --> 00:48:30,000 dog food so of course we have a third 1425 00:48:28,880 --> 00:48:32,160 environment for that because it's 1426 00:48:30,000 --> 00:48:33,760 turtles all the way down 1427 00:48:32,160 --> 00:48:36,960 and that's how we have 40 engineers 1428 00:48:33,760 --> 00:48:38,240 deploying 18 times per day 1429 00:48:36,960 --> 00:48:39,760 but it's not just kind of product 1430 00:48:38,240 --> 00:48:41,440 deployment we also have applied that 1431 00:48:39,760 --> 00:48:43,200 same mentality to our to our 1432 00:48:41,440 --> 00:48:45,359 infrastructure 1433 00:48:43,200 --> 00:48:47,040 so for instance we think about using 1434 00:48:45,359 --> 00:48:48,880 terraform and chef and kind of all these 1435 00:48:47,040 --> 00:48:49,920 lovely tools to manage our linux vms 1436 00:48:48,880 --> 00:48:52,000 every day 1437 00:48:49,920 --> 00:48:54,319 and we apply these approaches of 1438 00:48:52,000 --> 00:48:56,319 continuous integration right 1439 00:48:54,319 --> 00:48:58,240 so for instance we use terraform cloud 1440 00:48:56,319 --> 00:49:00,960 to automatically push the latest green 1441 00:48:58,240 --> 00:49:02,640 build so we can't drift out of sync 1442 00:49:00,960 --> 00:49:05,359 we use feature flags to handle things 1443 00:49:02,640 --> 00:49:07,440 like you know hey if i'm behind i can 1444 00:49:05,359 --> 00:49:10,800 toggle a feature flag in terraform to 1445 00:49:07,440 --> 00:49:12,960 stand up a catch-up fleet automatically 1446 00:49:10,800 --> 00:49:14,400 and i can quarantine bad traffic that's 1447 00:49:12,960 --> 00:49:16,240 causing crashes or that i want to 1448 00:49:14,400 --> 00:49:17,680 profile just
with a feature flag in our 1449 00:49:16,240 --> 00:49:20,240 terraform code 1450 00:49:17,680 --> 00:49:22,800 and that really helps make life a lot 1451 00:49:20,240 --> 00:49:24,880 easier for us as as people who are 1452 00:49:22,800 --> 00:49:26,800 responsible for the platform who are 1453 00:49:24,880 --> 00:49:28,880 responsible for making it possible for 1454 00:49:26,800 --> 00:49:31,839 the product to move fast and scale on 1455 00:49:28,880 --> 00:49:31,839 top of our infrastructure 1456 00:49:32,240 --> 00:49:35,520 but we don't just design in the abstract we 1457 00:49:33,839 --> 00:49:37,839 actually validate we actually test are 1458 00:49:35,520 --> 00:49:41,680 these things working correctly 1459 00:49:37,839 --> 00:49:43,200 so we experiment using our error budgets 1460 00:49:41,680 --> 00:49:44,640 now when we talk about chaos engineering 1461 00:49:43,200 --> 00:49:45,680 we're not just talking about uncontrolled 1462 00:49:44,640 --> 00:49:47,920 chaos 1463 00:49:45,680 --> 00:49:49,599 we have a goal of the experiment in mind 1464 00:49:47,920 --> 00:49:51,359 what are we validating 1465 00:49:49,599 --> 00:49:53,440 and is there a stop button is there 1466 00:49:51,359 --> 00:49:55,760 ability to pause or reset an experiment 1467 00:49:53,440 --> 00:49:57,359 if it's causing a problem 1468 00:49:55,760 --> 00:49:59,920 so we'll use feature flags for instance 1469 00:49:57,359 --> 00:50:02,000 to control this sort of thing 1470 00:49:59,920 --> 00:50:04,000 in the event of our persistent services 1471 00:50:02,000 --> 00:50:05,359 we have to do a lot of work in order to 1472 00:50:04,000 --> 00:50:08,240 make sure that the persistence 1473 00:50:05,359 --> 00:50:10,400 mechanisms work as we anticipate 1474 00:50:08,240 --> 00:50:12,640 because about half of our microservices 1475 00:50:10,400 --> 00:50:14,079 are stateless but half are stateful and 1476 00:50:12,640 --> 00:50:15,440 we have to treat them as separate 1477 00:50:14,079 --> 00:50:17,119
classes 1478 00:50:15,440 --> 00:50:19,359 now there's a lot been said in the past 1479 00:50:17,119 --> 00:50:21,280 about kind of uh you know chaos monkey 1480 00:50:19,359 --> 00:50:22,640 of automatically restarting stateless 1481 00:50:21,280 --> 00:50:24,880 servers i'm not going to hone in on that 1482 00:50:22,640 --> 00:50:27,040 too much what i want to hone in on is 1483 00:50:24,880 --> 00:50:28,640 our stateful workload 1484 00:50:27,040 --> 00:50:30,000 where we have to be able to tolerate 1485 00:50:28,640 --> 00:50:30,720 things like 1486 00:50:30,000 --> 00:50:32,559 an 1487 00:50:30,720 --> 00:50:34,720 individual kafka broker going away or 1488 00:50:32,559 --> 00:50:36,880 one of our indexing workers going away 1489 00:50:34,720 --> 00:50:39,040 we have to be able to validate that we 1490 00:50:36,880 --> 00:50:40,559 are able to deploy safely that amazon 1491 00:50:39,040 --> 00:50:41,760 is able to terminate our instances 1492 00:50:40,559 --> 00:50:43,280 without causing failures to our 1493 00:50:41,760 --> 00:50:44,800 customers 1494 00:50:43,280 --> 00:50:47,280 so when we have these kind of 1495 00:50:44,800 --> 00:50:49,680 long-running storage instances that need 1496 00:50:47,280 --> 00:50:51,920 data integrity and consistency how do we 1497 00:50:49,680 --> 00:50:53,599 manage that 1498 00:50:51,920 --> 00:50:56,160 well the answer is we test these 1499 00:50:53,599 --> 00:50:58,240 failover dances we test the system to 1500 00:50:56,160 --> 00:50:59,760 make sure that in practice it works as 1501 00:50:58,240 --> 00:51:03,040 designed 1502 00:50:59,760 --> 00:51:05,599 for instance if a kafka broker is lost 1503 00:51:03,040 --> 00:51:07,920 are we able to successfully 1504 00:51:05,599 --> 00:51:08,800 start reading or writing to the new 1505 00:51:07,920 --> 00:51:10,880 leader 1506 00:51:08,800 --> 00:51:12,480 and in the background replicate in a new 1507 00:51:10,880 --> 00:51:14,319 kafka broker 1508 00:51:12,480 -->
00:51:16,960 or if we take out an indexing worker is 1509 00:51:14,319 --> 00:51:19,680 it able to replace successfully 1510 00:51:16,960 --> 00:51:22,880 based off of a snapshot stored to s3 1511 00:51:19,680 --> 00:51:25,680 plus replaying off of kafka 1512 00:51:22,880 --> 00:51:28,400 so we do experiments in production 1513 00:51:25,680 --> 00:51:29,599 so we restart one server one service at 1514 00:51:28,400 --> 00:51:31,040 a time right controlled experiments 1515 00:51:29,599 --> 00:51:33,200 we're not restarting all five things at 1516 00:51:31,040 --> 00:51:34,960 once we're just trying to test one thing 1517 00:51:33,200 --> 00:51:38,319 at a time to make sure 1518 00:51:34,960 --> 00:51:40,000 at 3 pm and not 3 a.m for two reasons 1519 00:51:38,319 --> 00:51:42,240 number one all hands are on deck right 1520 00:51:40,000 --> 00:51:44,000 bugs are more shallow with more eyes 1521 00:51:42,240 --> 00:51:46,240 but also number two 1522 00:51:44,000 --> 00:51:48,240 you know 2 p.m at least in the us is our 1523 00:51:46,240 --> 00:51:50,480 peak traffic time right if a failure 1524 00:51:48,240 --> 00:51:52,559 happens during peak traffic that's kind 1525 00:51:50,480 --> 00:51:53,760 of the pessimal situation as far as you 1526 00:51:52,559 --> 00:51:55,599 know the amount of load on the system 1527 00:51:53,760 --> 00:51:57,040 and we want to verify we can always 1528 00:51:55,599 --> 00:51:59,280 catch up and make progress even under 1529 00:51:57,040 --> 00:52:01,599 the most load 1530 00:51:59,280 --> 00:52:03,520 and we are monitoring to make sure that 1531 00:52:01,599 --> 00:52:06,000 we are not damaging user experience 1532 00:52:03,520 --> 00:52:08,079 using our slos and slis 1533 00:52:06,000 --> 00:52:10,079 and we are using our observability 1534 00:52:08,079 --> 00:52:11,760 infrastructure to debug to understand if 1535 00:52:10,079 --> 00:52:13,280 our experiment didn't go the way that we 1536 00:52:11,760 --> 00:52:15,200 anticipated 1537
00:52:13,280 --> 00:52:16,720 why is it happening and how do we repair 1538 00:52:15,200 --> 00:52:18,640 it 1539 00:52:16,720 --> 00:52:20,559 and we also need to make sure that we 1540 00:52:18,640 --> 00:52:21,920 are not you know dropping telemetry 1541 00:52:20,559 --> 00:52:24,000 right like that if we perform one of 1542 00:52:21,920 --> 00:52:25,680 these experiments this actually happened 1543 00:52:24,000 --> 00:52:27,440 uh last week 1544 00:52:25,680 --> 00:52:28,800 a kafka broker failed to come back 1545 00:52:27,440 --> 00:52:30,640 correctly after being killed and 1546 00:52:28,800 --> 00:52:32,480 restarted 1547 00:52:30,640 --> 00:52:34,319 and we didn't have the telemetry to tell 1548 00:52:32,480 --> 00:52:36,480 us that the kafka broker was missing it 1549 00:52:34,319 --> 00:52:37,839 didn't report itself as missing and 1550 00:52:36,480 --> 00:52:39,200 therefore we didn't actually have 1551 00:52:37,839 --> 00:52:41,680 telemetry to tell us that there was a 1552 00:52:39,200 --> 00:52:43,440 problem so that was a useful scenario 1553 00:52:41,680 --> 00:52:44,839 for us to know that that that was a 1554 00:52:43,440 --> 00:52:47,520 problem that we needed to 1555 00:52:44,839 --> 00:52:48,720 fix and then you verify your fixes right 1556 00:52:47,520 --> 00:52:51,119 like if something doesn't go according 1557 00:52:48,720 --> 00:52:53,920 to plan you gotta keep doing the painful 1558 00:52:51,119 --> 00:52:55,920 thing until it stops being painful 1559 00:52:53,920 --> 00:52:58,160 so for instance taking out indexing 1560 00:52:55,920 --> 00:53:01,520 workers taking out kafka brokers 1561 00:52:58,160 --> 00:53:02,400 guess what we do that once per week 1562 00:53:01,520 --> 00:53:04,079 right 1563 00:53:02,400 --> 00:53:06,079 we try to not keep things running 1564 00:53:04,079 --> 00:53:08,079 forever so that we can validate that if 1565 00:53:06,079 --> 00:53:09,440 things do go wrong it's not been more 1566 00:53:08,079 --> 00:53:11,680 
than a week since our most recent 1567 00:53:09,440 --> 00:53:13,280 failure test 1568 00:53:11,680 --> 00:53:14,800 same thing with zookeeper right like we 1569 00:53:13,280 --> 00:53:18,000 discovered oops 1570 00:53:14,800 --> 00:53:20,240 our zookeeper um coordinators 1571 00:53:18,000 --> 00:53:21,839 were we were only polling the first 1572 00:53:20,240 --> 00:53:24,079 zookeeper worker 1573 00:53:21,839 --> 00:53:25,359 and that if you took out the zookeeper 1574 00:53:24,079 --> 00:53:28,319 first worker 1575 00:53:25,359 --> 00:53:30,079 none of our alerts would run oops so we 1576 00:53:28,319 --> 00:53:31,760 fixed that right 1577 00:53:30,079 --> 00:53:33,680 when you de-risk things with the design 1578 00:53:31,760 --> 00:53:35,599 and automation it makes it a lot easier 1579 00:53:33,680 --> 00:53:36,880 to run a sustainable system that is not 1580 00:53:35,599 --> 00:53:38,800 going to page you in the middle of the 1581 00:53:36,880 --> 00:53:41,040 night 1582 00:53:38,800 --> 00:53:42,880 so we just have to continuously verify 1583 00:53:41,040 --> 00:53:45,119 and keep on making sure that our system 1584 00:53:42,880 --> 00:53:47,200 is working the way that we intend 1585 00:53:45,119 --> 00:53:49,440 and this doesn't have benefits just for 1586 00:53:47,200 --> 00:53:50,960 reliability it also has benefits for 1587 00:53:49,440 --> 00:53:52,400 cost 1588 00:53:50,960 --> 00:53:54,800 why because if you're allowed to 1589 00:53:52,400 --> 00:53:57,839 tolerate failures more often 1590 00:53:54,800 --> 00:54:01,200 you can do things like adopt spot 1591 00:53:57,839 --> 00:54:02,960 instances or preemptible uh instances 1592 00:54:01,200 --> 00:54:05,520 because you know that you can tolerate 1593 00:54:02,960 --> 00:54:07,200 losing those stateless workers 1594 00:54:05,520 --> 00:54:09,200 or for instance 1595 00:54:07,200 --> 00:54:12,000 if you know that there's a well-managed 1596 00:54:09,200 --> 00:54:13,760 process for replacing worker nodes
1597 00:54:12,000 --> 00:54:15,680 that means you can gradually and 1598 00:54:13,760 --> 00:54:18,000 incrementally roll your fleet without 1599 00:54:15,680 --> 00:54:20,960 fear of damaging user experience 1600 00:54:18,000 --> 00:54:23,359 including rolling out arm64 1601 00:54:20,960 --> 00:54:24,960 architecture instances 1602 00:54:23,359 --> 00:54:27,119 this is almost an entire talk in itself 1603 00:54:24,960 --> 00:54:29,040 but i firmly believe that the future of 1604 00:54:27,119 --> 00:54:30,640 solving climate change within the tech 1605 00:54:29,040 --> 00:54:33,440 industry at least 1606 00:54:30,640 --> 00:54:35,680 is by consuming less power per compute 1607 00:54:33,440 --> 00:54:37,839 and when you adopt arm instances you're 1608 00:54:35,680 --> 00:54:39,520 playing a direct role in fighting the 1609 00:54:37,839 --> 00:54:41,119 amount of carbon emissions that your 1610 00:54:39,520 --> 00:54:44,000 services produce 1611 00:54:41,119 --> 00:54:47,280 as well as of course saving about 40 1612 00:54:44,000 --> 00:54:47,280 to 60 percent on your bill 1613 00:54:47,760 --> 00:54:51,119 but not every experiment is going to 1614 00:54:49,359 --> 00:54:53,040 succeed and that's okay right you can 1615 00:54:51,119 --> 00:54:55,359 mitigate those risks and it's just 1616 00:54:53,040 --> 00:54:57,200 important to think about how do i bound 1617 00:54:55,359 --> 00:54:59,680 the risk to make it acceptable to do 1618 00:54:57,200 --> 00:55:01,520 these experiments how do i make it safe 1619 00:54:59,680 --> 00:55:03,280 to do these things and design for 1620 00:55:01,520 --> 00:55:06,720 reliability through my life cycle how do 1621 00:55:03,280 --> 00:55:07,440 i make my system excellent in production 1622 00:55:06,720 --> 00:55:10,079 so 1623 00:55:07,440 --> 00:55:12,640 feature flags can help um kind of doing 1624 00:55:10,079 --> 00:55:14,160 proactive risk experiments can help 1625 00:55:12,640 --> 00:55:15,359 but above all i think kind of you have
1626 00:55:14,160 --> 00:55:17,599 to ground yourself in the four things 1627 00:55:15,359 --> 00:55:20,160 that i said originally right slos 1628 00:55:17,599 --> 00:55:22,240 observability collaboration and risk 1629 00:55:20,160 --> 00:55:25,520 management if you do those things you're 1630 00:55:22,240 --> 00:55:25,520 going to be in a much better place 1631 00:55:26,160 --> 00:55:30,880 and always be prepared always expect the 1632 00:55:28,799 --> 00:55:32,640 unexpected because we're always working 1633 00:55:30,880 --> 00:55:34,960 in the turbulent situations of 1634 00:55:32,640 --> 00:55:37,599 production 1635 00:55:34,960 --> 00:55:40,240 make experimentation routine and you can 1636 00:55:37,599 --> 00:55:42,559 not so jokingly talk about the idea of 1637 00:55:40,240 --> 00:55:44,559 oh i'm just going to set a chaos 1638 00:55:42,559 --> 00:55:46,720 monkey running loose on our fleet you 1639 00:55:44,559 --> 00:55:48,160 don't get there you know from nothing you have 1640 00:55:46,720 --> 00:55:51,520 to do the groundwork you have to do the 1641 00:55:48,160 --> 00:55:51,520 preparation in order to get there 1642 00:55:51,760 --> 00:55:54,880 so 1643 00:55:52,480 --> 00:55:56,640 we're all part of sociotechnical systems 1644 00:55:54,880 --> 00:55:58,799 as customers as engineers and as 1645 00:55:56,640 --> 00:56:00,960 stakeholders 1646 00:55:58,799 --> 00:56:03,040 learn from your outages invest 1647 00:56:00,960 --> 00:56:04,720 appropriately in observability 1648 00:56:03,040 --> 00:56:06,559 and have those conversations about how 1649 00:56:04,720 --> 00:56:08,160 we can improve our tools and improve 1650 00:56:06,559 --> 00:56:11,520 ourselves and improve our processes in 1651 00:56:08,160 --> 00:56:13,680 order to react better in the future 1652 00:56:11,520 --> 00:56:15,440 and yes you know tools can help buy the 1653 00:56:13,680 --> 00:56:17,359 right tools buy just the right amount of 1654 00:56:15,440 --> 00:56:19,200 tools right it's
almost like uh that 1655 00:56:17,359 --> 00:56:21,280 author who said you know 1656 00:56:19,200 --> 00:56:23,200 eat food not too much mostly grains 1657 00:56:21,280 --> 00:56:25,200 right i think that's that's true that's 1658 00:56:23,200 --> 00:56:27,280 true about our socio-technical 1659 00:56:25,200 --> 00:56:28,720 systems too right like you know run your 1660 00:56:27,280 --> 00:56:31,280 systems 1661 00:56:28,720 --> 00:56:33,440 buy some tools buy the right tools 1662 00:56:31,280 --> 00:56:36,400 but above all else focus on your culture 1663 00:56:33,440 --> 00:56:38,960 first and the rest will follow 1664 00:56:36,400 --> 00:56:41,440 so i implore you think about how do you 1665 00:56:38,960 --> 00:56:42,559 measure your reliability levels how 1666 00:56:41,440 --> 00:56:43,760 do you actually debug how do you 1667 00:56:42,559 --> 00:56:45,440 actually understand what's happening in 1668 00:56:43,760 --> 00:56:46,960 production 1669 00:56:45,440 --> 00:56:48,960 do you have the ability to collaborate 1670 00:56:46,960 --> 00:56:50,880 across teams 1671 00:56:48,960 --> 00:56:52,480 and are you actually investing time in 1672 00:56:50,880 --> 00:56:53,920 closing that feedback loop and not just 1673 00:56:52,480 --> 00:56:56,559 repeating the same outages over and over 1674 00:56:53,920 --> 00:56:58,240 like groundhog day 1675 00:56:56,559 --> 00:56:59,599 so that's all that i have for you today 1676 00:56:58,240 --> 00:57:01,599 and i understand that we're a little bit 1677 00:56:59,599 --> 00:57:03,359 behind schedule because of the av issues 1678 00:57:01,599 --> 00:57:05,040 so i'm not going to delay you too long 1679 00:57:03,359 --> 00:57:08,319 um but if you'd like a copy of these slides 1680 00:57:05,040 --> 00:57:09,920 go check out honeycomb.io/liz and also 1681 00:57:08,319 --> 00:57:11,280 um i happen to be around sydney at least 1682 00:57:09,920 --> 00:57:12,960 for the next couple of months so if 1683 00:57:11,280 -->
00:57:15,599 you'd like to go for a walk in hyde park 1684 00:57:12,960 --> 00:57:17,760 um just uh i'll drop a link into the 1685 00:57:15,599 --> 00:57:19,200 conference chat and you can uh just book 1686 00:57:17,760 --> 00:57:21,280 an hour with me and i'd love to go walk 1687 00:57:19,200 --> 00:57:23,839 around and meet some of you 1688 00:57:21,280 --> 00:57:25,839 thanks cheers everyone 1689 00:57:23,839 --> 00:57:27,839 thanks liz that was that was really good 1690 00:57:25,839 --> 00:57:29,440 again another great speech 1691 00:57:27,839 --> 00:57:31,040 uh i was looking at the chat and they 1692 00:57:29,440 --> 00:57:32,640 were saying oh this is really hitting 1693 00:57:31,040 --> 00:57:35,520 with a lot of people 1694 00:57:32,640 --> 00:57:36,960 um really good stuff um i especially 1695 00:57:35,520 --> 00:57:39,040 like all those little illustrations 1696 00:57:36,960 --> 00:57:41,119 you've got they're fantastic 1697 00:57:39,040 --> 00:57:43,040 yes they're by the wonderful emily griffin 1698 00:57:41,119 --> 00:57:44,400 uh @emilywithcurls highly encourage 1699 00:57:43,040 --> 00:57:45,839 checking her out 1700 00:57:44,400 --> 00:57:48,319 excellent great 1701 00:57:45,839 --> 00:57:50,160 anyway so we're about to finish off this 1702 00:57:48,319 --> 00:57:51,760 session um we're just coming up to the 1703 00:57:50,160 --> 00:57:53,280 first break of the day sorry about all 1704 00:57:51,760 --> 00:57:54,960 the delays there we've still got a fair 1705 00:57:53,280 --> 00:57:57,040 amount of time 1706 00:57:54,960 --> 00:57:58,559 remember that the main sessions start at 1707 00:57:57,040 --> 00:58:00,000 10 45 1708 00:57:58,559 --> 00:58:02,799 and then the lunch break will be around 1709 00:58:00,000 --> 00:58:05,280 12 25 depending on when stuff finishes 1710 00:58:02,799 --> 00:58:06,880 the afternoon session starts at 1 30 and 1711 00:58:05,280 --> 00:58:09,280 don't forget the conference close at 5 1712
00:58:06,880 --> 00:58:11,440 35 where you'll get to you know cry and 1713 00:58:09,280 --> 00:58:13,200 cheer and all that stuff so please go 1714 00:58:11,440 --> 00:58:16,599 and have fun and we'll see you a little 1715 00:58:13,200 --> 00:58:16,599 bit later on