1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:15,360 --> 00:00:19,520 hi everyone welcome to the first session 3 00:00:17,119 --> 00:00:21,119 of sysadmin miniconf our first talk 4 00:00:19,520 --> 00:00:23,199 is from alan 5 00:00:21,119 --> 00:00:24,800 shone and it is on the importance of 6 00:00:23,199 --> 00:00:26,800 visibility 7 00:00:24,800 --> 00:00:28,720 so here down excellent 8 00:00:26,800 --> 00:00:31,359 thank you very much simon 9 00:00:28,720 --> 00:00:34,320 so hi everyone uh today i want to talk a 10 00:00:31,359 --> 00:00:37,760 bit about visibility and 11 00:00:34,320 --> 00:00:41,120 it's really a very uh simple 12 00:00:37,760 --> 00:00:42,840 concept um for me uh it's it's a very 13 00:00:41,120 --> 00:00:45,200 basic thing 14 00:00:42,840 --> 00:00:46,399 and uh 15 00:00:45,200 --> 00:00:48,719 like the 16 00:00:46,399 --> 00:00:51,760 the sort of first steps that you take 17 00:00:48,719 --> 00:00:54,800 for anything system related really 18 00:00:51,760 --> 00:00:57,920 and so i've got three particular things 19 00:00:54,800 --> 00:00:59,120 that i'm going to cover uh first off uh 20 00:00:57,920 --> 00:01:01,760 the basics 21 00:00:59,120 --> 00:01:03,760 which is really just the basics uh those 22 00:01:01,760 --> 00:01:04,960 first bits and pieces uh 23 00:01:03,760 --> 00:01:07,360 a bit about 24 00:01:04,960 --> 00:01:09,760 what i'm actually trying to talk about 25 00:01:07,360 --> 00:01:11,280 uh not necessarily what 26 00:01:09,760 --> 00:01:13,280 you might expect 27 00:01:11,280 --> 00:01:15,600 and a bit about why 28 00:01:13,280 --> 00:01:17,920 the second will be a particular 29 00:01:15,600 --> 00:01:19,920 situation a bit of a story 30 00:01:17,920 --> 00:01:21,520 to go along with this 31 00:01:19,920 --> 00:01:23,600 something that has 32 00:01:21,520 --> 00:01:25,280 a bit of background and 33 00:01:23,600 --> 00:01:26,799 hopefully either some lessons or 34 00:01:25,280 --> 00:01:29,119 takeaways 35 00:01:26,799 
--> 00:01:32,960 for everyone that's watching 36 00:01:29,119 --> 00:01:35,040 and the third uh incidents which really 37 00:01:32,960 --> 00:01:38,240 is another situation 38 00:01:35,040 --> 00:01:39,439 but a little bit different and hopefully 39 00:01:38,240 --> 00:01:42,240 a bit more 40 00:01:39,439 --> 00:01:44,240 color on the the actual topic of 41 00:01:42,240 --> 00:01:46,079 visibility in general 42 00:01:44,240 --> 00:01:48,799 so to start 43 00:01:46,079 --> 00:01:50,640 we have visibility 44 00:01:48,799 --> 00:01:51,920 what is it 45 00:01:50,640 --> 00:01:54,079 depending on 46 00:01:51,920 --> 00:01:55,200 who you talk to or what you're thinking 47 00:01:54,079 --> 00:01:57,680 about 48 00:01:55,200 --> 00:01:59,840 it's a buzz word really uh 49 00:01:57,680 --> 00:02:02,000 there's a lot of 50 00:01:59,840 --> 00:02:03,840 different interpretations of the one 51 00:02:02,000 --> 00:02:06,159 word and 52 00:02:03,840 --> 00:02:07,119 it doesn't necessarily mean 53 00:02:06,159 --> 00:02:09,599 what 54 00:02:07,119 --> 00:02:11,840 either i'm trying to talk about or what 55 00:02:09,599 --> 00:02:13,440 you might think it means uh 56 00:02:11,840 --> 00:02:15,200 so it's it's not 57 00:02:13,440 --> 00:02:17,520 uh necessarily 58 00:02:15,200 --> 00:02:20,239 what you might see talked about in an 59 00:02:17,520 --> 00:02:22,879 open source or or a public space uh from 60 00:02:20,239 --> 00:02:24,720 from companies and and uh tooling 61 00:02:22,879 --> 00:02:26,400 options 62 00:02:24,720 --> 00:02:29,440 what it should be though is is the bare 63 00:02:26,400 --> 00:02:30,400 minimum it's it's that first step 64 00:02:29,440 --> 00:02:31,200 that 65 00:02:30,400 --> 00:02:33,760 just 66 00:02:31,200 --> 00:02:34,720 knowing that something is there or 67 00:02:33,760 --> 00:02:37,519 knowing 68 00:02:34,720 --> 00:02:39,680 something is happening 69 00:02:37,519 --> 00:02:43,760 depending on the systems that you use as 70 00:02:39,680 --> 
00:02:45,920 well it's it's often already available 71 00:02:43,760 --> 00:02:48,560 you might see the aws console for 72 00:02:45,920 --> 00:02:50,480 instance or other cloud providers 73 00:02:48,560 --> 00:02:52,400 you can see every service that's up and 74 00:02:50,480 --> 00:02:53,440 running and so that's 75 00:02:52,400 --> 00:02:57,120 that's 76 00:02:53,440 --> 00:02:59,760 visibility in in a very basic sense so 77 00:02:57,120 --> 00:03:00,560 for for what i'm trying to talk about is 78 00:02:59,760 --> 00:03:02,720 just 79 00:03:00,560 --> 00:03:04,720 that bare definition 80 00:03:02,720 --> 00:03:06,400 the the fact state or degree of being 81 00:03:04,720 --> 00:03:07,599 visible 82 00:03:06,400 --> 00:03:10,159 and this was 83 00:03:07,599 --> 00:03:12,080 purely from the dictionary uh 84 00:03:10,159 --> 00:03:13,200 and it really highlights 85 00:03:12,080 --> 00:03:14,640 just 86 00:03:13,200 --> 00:03:16,400 what visibility 87 00:03:14,640 --> 00:03:18,800 is at least to me 88 00:03:16,400 --> 00:03:19,840 and the way that i think about it 89 00:03:18,800 --> 00:03:22,560 um 90 00:03:19,840 --> 00:03:25,200 distinctly from from other terms 91 00:03:22,560 --> 00:03:27,200 and other concepts 92 00:03:25,200 --> 00:03:29,760 which then means that 93 00:03:27,200 --> 00:03:31,840 it's not observability uh 94 00:03:29,760 --> 00:03:33,519 it's not 95 00:03:31,840 --> 00:03:35,519 the monitoring side of things nor the 96 00:03:33,519 --> 00:03:37,120 tooling side of things 97 00:03:35,519 --> 00:03:39,840 depending on exactly what you're 98 00:03:37,120 --> 00:03:42,640 thinking about or 99 00:03:39,840 --> 00:03:45,040 what you're talking about or trying to 100 00:03:42,640 --> 00:03:46,959 do it might encompass 101 00:03:45,040 --> 00:03:50,959 some of these things but visibility in a 102 00:03:46,959 --> 00:03:53,280 strict sense is not these things 103 00:03:50,959 --> 00:03:55,280 like again things like the the aws 104 00:03:53,280 --> 00:03:56,319 
console that's that's a tool of some 105 00:03:55,280 --> 00:03:59,120 sort 106 00:03:56,319 --> 00:04:02,720 which gives that visibility but it 107 00:03:59,120 --> 00:04:03,680 doesn't mean that the tool is visibility 108 00:04:02,720 --> 00:04:05,439 uh 109 00:04:03,680 --> 00:04:08,480 and really it's 110 00:04:05,439 --> 00:04:11,439 it comes back to visibility being 111 00:04:08,480 --> 00:04:13,599 just the ability to know that something 112 00:04:11,439 --> 00:04:16,079 is there and that 113 00:04:13,599 --> 00:04:18,239 it's theoretically doing 114 00:04:16,079 --> 00:04:22,000 what it should be doing 115 00:04:18,239 --> 00:04:24,080 at a very bare minimum level 116 00:04:22,000 --> 00:04:26,479 and so with that 117 00:04:24,080 --> 00:04:30,479 we have our situation um 118 00:04:26,479 --> 00:04:33,040 and i i picked this picture here because 119 00:04:30,479 --> 00:04:34,800 visibility is like an onion and 120 00:04:33,040 --> 00:04:36,639 in that same sense visibility is a bit 121 00:04:34,800 --> 00:04:39,280 like an ogre uh 122 00:04:36,639 --> 00:04:42,240 there's a lot of layers and 123 00:04:39,280 --> 00:04:44,479 once you get beyond those first steps 124 00:04:42,240 --> 00:04:46,720 and start digging in deep but that's 125 00:04:44,479 --> 00:04:48,560 where you get into those other terms the 126 00:04:46,720 --> 00:04:50,160 the observability the 127 00:04:48,560 --> 00:04:52,240 the monitoring and all the rest of it 128 00:04:50,160 --> 00:04:54,160 and then you get into a lot more 129 00:04:52,240 --> 00:04:55,759 complexity 130 00:04:54,160 --> 00:04:58,000 but really 131 00:04:55,759 --> 00:05:00,479 being the first step visibility just 132 00:04:58,000 --> 00:05:02,800 gives those those very base 133 00:05:00,479 --> 00:05:06,400 uh insights and understandings that 134 00:05:02,800 --> 00:05:08,160 something is available and is doing 135 00:05:06,400 --> 00:05:10,320 at least 136 00:05:08,160 --> 00:05:12,000 at face value what it's 
supposed to be 137 00:05:10,320 --> 00:05:13,759 doing 138 00:05:12,000 --> 00:05:15,759 and so for this situation we had a 139 00:05:13,759 --> 00:05:17,600 problem uh 140 00:05:15,759 --> 00:05:19,919 we were already lacking 141 00:05:17,600 --> 00:05:21,759 uh certain 142 00:05:19,919 --> 00:05:23,759 observances and and 143 00:05:21,759 --> 00:05:26,000 visible aspects to 144 00:05:23,759 --> 00:05:27,759 some of our systems 145 00:05:26,000 --> 00:05:29,919 and 146 00:05:27,759 --> 00:05:32,560 what was available 147 00:05:29,919 --> 00:05:34,160 for us to readily implement 148 00:05:32,560 --> 00:05:36,960 was 149 00:05:34,160 --> 00:05:39,199 incredibly expensive 150 00:05:36,960 --> 00:05:41,199 and even though 151 00:05:39,199 --> 00:05:43,759 we could have just implemented it 152 00:05:41,199 --> 00:05:45,520 uh it would have been 153 00:05:43,759 --> 00:05:47,039 something that would have turned a lot 154 00:05:45,520 --> 00:05:49,199 of heads and 155 00:05:47,039 --> 00:05:52,160 probably gotten quite a few people into 156 00:05:49,199 --> 00:05:54,160 a lot of hot water if it had been 157 00:05:52,160 --> 00:05:56,400 enabled straight away 158 00:05:54,160 --> 00:05:57,360 but what we did have 159 00:05:56,400 --> 00:05:59,520 is 160 00:05:57,360 --> 00:06:02,400 a set of scenarios that 161 00:05:59,520 --> 00:06:04,560 we knew we needed to be able to 162 00:06:02,400 --> 00:06:05,600 to make visible so 163 00:06:04,560 --> 00:06:08,160 we knew 164 00:06:05,600 --> 00:06:10,800 certain things 165 00:06:08,160 --> 00:06:12,720 were potentially going to happen if they 166 00:06:10,800 --> 00:06:14,960 weren't already happening and we just 167 00:06:12,720 --> 00:06:18,080 needed to make them visible 168 00:06:14,960 --> 00:06:19,520 and it was really that that thing 169 00:06:18,080 --> 00:06:22,080 of 170 00:06:19,520 --> 00:06:24,400 we know that these are possible but we 171 00:06:22,080 --> 00:06:27,680 don't know that they're not happening 172 
00:06:24,400 --> 00:06:29,120 and that was our visibility dilemma of 173 00:06:27,680 --> 00:06:30,960 the time 174 00:06:29,120 --> 00:06:33,600 and so 175 00:06:30,960 --> 00:06:36,240 we we had a solution um 176 00:06:33,600 --> 00:06:38,319 we can build our own tool 177 00:06:36,240 --> 00:06:40,720 which is 178 00:06:38,319 --> 00:06:43,440 always the best idea um 179 00:06:40,720 --> 00:06:45,600 but also not please don't uh 180 00:06:43,440 --> 00:06:46,479 we could make it composable though 181 00:06:45,600 --> 00:06:47,199 where 182 00:06:46,479 --> 00:06:49,759 we 183 00:06:47,199 --> 00:06:51,120 have our certain set of rules that we 184 00:06:49,759 --> 00:06:53,199 already know 185 00:06:51,120 --> 00:06:56,000 but we might have more in the future for 186 00:06:53,199 --> 00:06:59,120 instance um especially as 187 00:06:56,000 --> 00:07:01,759 the makeup of our platform changes 188 00:06:59,120 --> 00:07:05,599 we we need to 189 00:07:01,759 --> 00:07:06,880 make different components more visible 190 00:07:05,599 --> 00:07:08,960 and 191 00:07:06,880 --> 00:07:13,520 by making this particular tool 192 00:07:08,960 --> 00:07:13,520 composable we can split it up and 193 00:07:13,840 --> 00:07:18,319 i guess try a different paradigm for how 194 00:07:16,720 --> 00:07:19,440 we run 195 00:07:18,319 --> 00:07:22,160 tooling 196 00:07:19,440 --> 00:07:24,000 so this particular situation uses a 197 00:07:22,160 --> 00:07:26,560 series of lambda functions 198 00:07:24,000 --> 00:07:28,400 that interact with different 199 00:07:26,560 --> 00:07:31,520 components to 200 00:07:28,400 --> 00:07:33,120 retrieve data massage it and then look 201 00:07:31,520 --> 00:07:35,440 for certain things 202 00:07:33,120 --> 00:07:37,360 at different points in time based on the 203 00:07:35,440 --> 00:07:39,120 different rules for each from each of 204 00:07:37,360 --> 00:07:40,880 the different functions 205 00:07:39,120 --> 00:07:42,560 and this also meant that we could 206 
00:07:40,880 --> 00:07:46,560 integrate it with all of our existing 207 00:07:42,560 --> 00:07:48,639 workflows all of our existing tools and 208 00:07:46,560 --> 00:07:49,599 within the rest of our 209 00:07:48,639 --> 00:07:51,919 uh 210 00:07:49,599 --> 00:07:52,960 infrastructure in general 211 00:07:51,919 --> 00:07:55,199 and 212 00:07:52,960 --> 00:07:56,160 what it looked like then was a slack 213 00:07:55,199 --> 00:07:58,479 message 214 00:07:56,160 --> 00:08:01,360 so we would have this 215 00:07:58,479 --> 00:08:03,520 uh set of lambda functions that would do 216 00:08:01,360 --> 00:08:07,840 all of this work and 217 00:08:03,520 --> 00:08:10,240 if anything was found a message would be 218 00:08:07,840 --> 00:08:13,280 dropped into this particular channel 219 00:08:10,240 --> 00:08:15,039 so that way one of the team were able to 220 00:08:13,280 --> 00:08:16,800 investigate it 221 00:08:15,039 --> 00:08:19,120 and so one of the 222 00:08:16,800 --> 00:08:22,319 interesting components was we we kept 223 00:08:19,120 --> 00:08:25,360 track of uh known ip addresses 224 00:08:22,319 --> 00:08:26,879 and these were from infrastructure these 225 00:08:25,360 --> 00:08:30,080 were from 226 00:08:26,879 --> 00:08:31,120 internal usage so like my ip address for 227 00:08:30,080 --> 00:08:32,880 instance 228 00:08:31,120 --> 00:08:35,680 as a member of the team 229 00:08:32,880 --> 00:08:37,760 and we we stored all of these 230 00:08:35,680 --> 00:08:39,519 uh via one of these particular lambda 231 00:08:37,760 --> 00:08:41,440 functions so that way 232 00:08:39,519 --> 00:08:42,560 when we were looking at different 233 00:08:41,440 --> 00:08:44,720 activity 234 00:08:42,560 --> 00:08:46,399 from one of the other functions it 235 00:08:44,720 --> 00:08:48,240 it could request 236 00:08:46,399 --> 00:08:51,279 uh whether or not 237 00:08:48,240 --> 00:08:54,560 a an ip address was known and it would 238 00:08:51,279 --> 00:08:56,800 use this other other function 
to do that 239 00:08:54,560 --> 00:08:58,080 and so this sort of composable 240 00:08:56,800 --> 00:09:00,560 architecture 241 00:08:58,080 --> 00:09:03,519 it makes sense uh because we could reuse 242 00:09:00,560 --> 00:09:05,440 these things in different ways and 243 00:09:03,519 --> 00:09:08,959 for very different purposes than than 244 00:09:05,440 --> 00:09:12,320 what they were built for 245 00:09:08,959 --> 00:09:13,519 and this was good it worked for us 246 00:09:12,320 --> 00:09:15,760 it 247 00:09:13,519 --> 00:09:17,040 allowed us to have our own rules we 248 00:09:15,760 --> 00:09:19,600 could 249 00:09:17,040 --> 00:09:20,800 work our own way 250 00:09:19,600 --> 00:09:22,399 and it also 251 00:09:20,800 --> 00:09:24,000 meant that we only implemented the 252 00:09:22,399 --> 00:09:26,480 things that we needed 253 00:09:24,000 --> 00:09:28,880 we we didn't end up with 254 00:09:26,480 --> 00:09:31,680 all of this extra stuff 255 00:09:28,880 --> 00:09:33,040 that we just weren't going to use 256 00:09:31,680 --> 00:09:35,360 it wasn't 257 00:09:33,040 --> 00:09:37,120 it was more of a a toolbox of sorts 258 00:09:35,360 --> 00:09:38,160 rather than a swiss army knife for 259 00:09:37,120 --> 00:09:39,200 instance 260 00:09:38,160 --> 00:09:40,880 um 261 00:09:39,200 --> 00:09:42,160 like i've never used a saw on a swiss 262 00:09:40,880 --> 00:09:43,519 army knife 263 00:09:42,160 --> 00:09:45,120 because i'm not sure what kind of wood 264 00:09:43,519 --> 00:09:47,120 it's going to cut but 265 00:09:45,120 --> 00:09:49,360 it's probably not going to be useful for 266 00:09:47,120 --> 00:09:50,720 anything that i would try to try to use 267 00:09:49,360 --> 00:09:51,680 it for 268 00:09:50,720 --> 00:09:53,839 um 269 00:09:51,680 --> 00:09:55,440 there were some additional problems with 270 00:09:53,839 --> 00:09:56,959 this though um 271 00:09:55,440 --> 00:09:58,880 which is always the way when you build 272 00:09:56,959 --> 00:10:00,880 things 
yourself um how do we know it's 273 00:09:58,880 --> 00:10:03,680 working um 274 00:10:00,880 --> 00:10:04,720 when it comes to things like 275 00:10:03,680 --> 00:10:06,720 external 276 00:10:04,720 --> 00:10:09,519 or external to the business 277 00:10:06,720 --> 00:10:12,320 third-party integrations these sorts of 278 00:10:09,519 --> 00:10:14,320 saas type solutions 279 00:10:12,320 --> 00:10:15,920 there's usually 280 00:10:14,320 --> 00:10:17,920 feedback loops and 281 00:10:15,920 --> 00:10:20,320 there's there's ways to get information 282 00:10:17,920 --> 00:10:23,360 about the current state 283 00:10:20,320 --> 00:10:25,839 of of the uh platform 284 00:10:23,360 --> 00:10:27,760 and so we added some very basic 285 00:10:25,839 --> 00:10:28,720 visibility uh 286 00:10:27,760 --> 00:10:31,360 which 287 00:10:28,720 --> 00:10:33,360 as as you might guess was another lambda 288 00:10:31,360 --> 00:10:34,800 function um or at least a couple of 289 00:10:33,360 --> 00:10:36,720 lambda functions 290 00:10:34,800 --> 00:10:37,680 and these 291 00:10:36,720 --> 00:10:38,480 all 292 00:10:37,680 --> 00:10:40,640 uh 293 00:10:38,480 --> 00:10:43,279 worked within the the aws environment 294 00:10:40,640 --> 00:10:46,640 that we we managed all of this 295 00:10:43,279 --> 00:10:48,480 and would follow the same steps as the 296 00:10:46,640 --> 00:10:50,480 actual tool itself 297 00:10:48,480 --> 00:10:52,480 it would reuse the 298 00:10:50,480 --> 00:10:54,240 the function that sent the message into 299 00:10:52,480 --> 00:10:55,760 slack it would 300 00:10:54,240 --> 00:10:56,640 look at the 301 00:10:55,760 --> 00:10:58,880 known 302 00:10:56,640 --> 00:11:02,720 uh cloudwatch information that we would 303 00:10:58,880 --> 00:11:04,640 generate and do all the processing um ip 304 00:11:02,720 --> 00:11:05,519 addresses and and all that sort of stuff 305 00:11:04,640 --> 00:11:07,600 as well 306 00:11:05,519 --> 00:11:10,560 and so it was great it 307 00:11:07,600 
--> 00:11:11,680 was was an afterthought for sure 308 00:11:10,560 --> 00:11:15,980 but 309 00:11:11,680 --> 00:11:18,880 it was done and so that was fine 310 00:11:15,980 --> 00:11:20,560 [Music] 311 00:11:18,880 --> 00:11:22,480 except 312 00:11:20,560 --> 00:11:24,880 we didn't know when things weren't 313 00:11:22,480 --> 00:11:27,120 working 314 00:11:24,880 --> 00:11:29,360 we we knew when things were working and 315 00:11:27,120 --> 00:11:31,760 we could test that we could 316 00:11:29,360 --> 00:11:33,600 take our own actions 317 00:11:31,760 --> 00:11:36,560 to essentially canary 318 00:11:33,600 --> 00:11:39,120 the the processing pipeline and we could 319 00:11:36,560 --> 00:11:40,480 do things that we know would definitely 320 00:11:39,120 --> 00:11:42,480 trigger 321 00:11:40,480 --> 00:11:44,720 something happening 322 00:11:42,480 --> 00:11:44,720 and 323 00:11:45,120 --> 00:11:49,519 yeah we 324 00:11:46,560 --> 00:11:52,800 oh like this this visibility was very 325 00:11:49,519 --> 00:11:56,240 basic uh it it didn't really 326 00:11:52,800 --> 00:11:58,399 test everything um but it worked for 327 00:11:56,240 --> 00:11:59,600 just letting us know 328 00:11:58,399 --> 00:12:00,959 um 329 00:11:59,600 --> 00:12:03,519 and 330 00:12:00,959 --> 00:12:05,360 that was that was all good 331 00:12:03,519 --> 00:12:07,600 except um 332 00:12:05,360 --> 00:12:09,920 yeah we 333 00:12:07,600 --> 00:12:11,680 we didn't test things like the different 334 00:12:09,920 --> 00:12:13,680 integrations themselves 335 00:12:11,680 --> 00:12:15,120 so 336 00:12:13,680 --> 00:12:17,040 as as 337 00:12:15,120 --> 00:12:19,279 this image shows um 338 00:12:17,040 --> 00:12:21,600 we didn't test to make sure that the 339 00:12:19,279 --> 00:12:22,880 slack integration worked fine 340 00:12:21,600 --> 00:12:24,399 and 341 00:12:22,880 --> 00:12:26,959 we didn't know if it ever stopped 342 00:12:24,399 --> 00:12:29,760 working because the only messages we got 343 
00:12:26,959 --> 00:12:31,279 from the system were into slack 344 00:12:29,760 --> 00:12:34,000 which is also a problem if slack is 345 00:12:31,279 --> 00:12:36,959 unavailable and that sort of thing but 346 00:12:34,000 --> 00:12:39,360 if the integration itself breaks um 347 00:12:36,959 --> 00:12:40,720 we don't know about that 348 00:12:39,360 --> 00:12:43,040 and so one day 349 00:12:40,720 --> 00:12:43,839 uh one of the team members 350 00:12:43,040 --> 00:12:45,440 uh 351 00:12:43,839 --> 00:12:47,360 noticed this 352 00:12:45,440 --> 00:12:49,519 like hey 353 00:12:47,360 --> 00:12:50,399 why haven't we had anything for a while 354 00:12:49,519 --> 00:12:51,519 like 355 00:12:50,399 --> 00:12:53,680 surely 356 00:12:51,519 --> 00:12:56,560 nothing is running so good that there 357 00:12:53,680 --> 00:12:59,120 are no problems uh that's 358 00:12:56,560 --> 00:13:00,880 usually a good sign that something is 359 00:12:59,120 --> 00:13:02,480 not actually 360 00:13:00,880 --> 00:13:03,680 working 361 00:13:02,480 --> 00:13:06,639 and 362 00:13:03,680 --> 00:13:08,000 that was a problem uh because you don't 363 00:13:06,639 --> 00:13:09,120 really think about it unless there's an 364 00:13:08,000 --> 00:13:12,560 alert 365 00:13:09,120 --> 00:13:14,959 when you look in a channel or 366 00:13:12,560 --> 00:13:16,320 when you're used to having notifications 367 00:13:14,959 --> 00:13:18,480 about things 368 00:13:16,320 --> 00:13:20,079 the absence of a notification doesn't 369 00:13:18,480 --> 00:13:21,600 really 370 00:13:20,079 --> 00:13:24,880 trigger anything you don't really think 371 00:13:21,600 --> 00:13:26,880 about it until this sort of time where 372 00:13:24,880 --> 00:13:28,639 i think it had been about a month a 373 00:13:26,880 --> 00:13:30,720 month and a half 374 00:13:28,639 --> 00:13:33,040 since we had actually 375 00:13:30,720 --> 00:13:35,600 seen anything come from this 376 00:13:33,040 --> 00:13:36,800 and so we did what we did before we 
we 377 00:13:35,600 --> 00:13:39,360 ran through 378 00:13:36,800 --> 00:13:40,320 uh one of our canary tests and sure 379 00:13:39,360 --> 00:13:41,680 enough 380 00:13:40,320 --> 00:13:43,519 nothing happened 381 00:13:41,680 --> 00:13:44,560 and so 382 00:13:43,519 --> 00:13:46,079 that was 383 00:13:44,560 --> 00:13:47,040 the problem 384 00:13:46,079 --> 00:13:49,120 so 385 00:13:47,040 --> 00:13:51,680 we started digging in we started looking 386 00:13:49,120 --> 00:13:53,040 around we went back through 387 00:13:51,680 --> 00:13:54,240 how everything was set up in the first 388 00:13:53,040 --> 00:13:56,639 place 389 00:13:54,240 --> 00:13:58,720 and we we looked 390 00:13:56,639 --> 00:14:00,720 at all the functions we looked at the 391 00:13:58,720 --> 00:14:03,199 code for the functions we looked at the 392 00:14:00,720 --> 00:14:04,639 infrastructure and 393 00:14:03,199 --> 00:14:05,920 everything was there 394 00:14:04,639 --> 00:14:07,519 the config 395 00:14:05,920 --> 00:14:10,240 hadn't drifted 396 00:14:07,519 --> 00:14:13,600 we we managed everything with terraform 397 00:14:10,240 --> 00:14:15,279 and all of the code was in git along 398 00:14:13,600 --> 00:14:17,920 with that terraform 399 00:14:15,279 --> 00:14:19,680 there hadn't been changes for months uh 400 00:14:17,920 --> 00:14:21,199 everything looks fine we could see that 401 00:14:19,680 --> 00:14:23,120 the functions were running 402 00:14:21,199 --> 00:14:25,600 we could validate that 403 00:14:23,120 --> 00:14:27,600 they executed without errors 404 00:14:25,600 --> 00:14:30,800 and 405 00:14:27,600 --> 00:14:30,800 it looked weird 406 00:14:31,360 --> 00:14:33,839 and 407 00:14:32,639 --> 00:14:35,519 what we 408 00:14:33,839 --> 00:14:37,040 sort of got to the point of was that 409 00:14:35,519 --> 00:14:38,959 well 410 00:14:37,040 --> 00:14:41,199 we know that all of this is there 411 00:14:38,959 --> 00:14:42,959 it's all very disparate but 412 00:14:41,199 --> 00:14:45,120 we 
know that things are working fine 413 00:14:42,959 --> 00:14:46,480 just that messages aren't coming through 414 00:14:45,120 --> 00:14:48,399 and so 415 00:14:46,480 --> 00:14:50,959 we checked okay well if that's the only 416 00:14:48,399 --> 00:14:52,959 thing that we're not seeing then 417 00:14:50,959 --> 00:14:54,639 maybe we need to update how that 418 00:14:52,959 --> 00:14:56,320 integration is working 419 00:14:54,639 --> 00:14:59,120 and suddenly 420 00:14:56,320 --> 00:15:00,639 it works fine again um 421 00:14:59,120 --> 00:15:03,040 we 422 00:15:00,639 --> 00:15:05,199 had to do all this digging but 423 00:15:03,040 --> 00:15:06,480 we just didn't know what we didn't know 424 00:15:05,199 --> 00:15:09,120 and 425 00:15:06,480 --> 00:15:12,399 a lot of this was was built in isolation 426 00:15:09,120 --> 00:15:15,199 uh away from the rest of the team so we 427 00:15:12,399 --> 00:15:17,519 had to dig we we didn't have 428 00:15:15,199 --> 00:15:20,560 any way to just know where certain 429 00:15:17,519 --> 00:15:22,639 things are or be able to see 430 00:15:20,560 --> 00:15:26,160 how things are progressing 431 00:15:22,639 --> 00:15:27,600 and updating the details was was all we 432 00:15:26,160 --> 00:15:28,399 needed to do 433 00:15:27,600 --> 00:15:31,360 and 434 00:15:28,399 --> 00:15:32,560 then we were back on track 435 00:15:31,360 --> 00:15:35,839 so 436 00:15:32,560 --> 00:15:37,920 the problem was that the the slack uh 437 00:15:35,839 --> 00:15:40,480 integration which was just a web hook to 438 00:15:37,920 --> 00:15:41,519 send the message uh it had been disabled 439 00:15:40,480 --> 00:15:44,079 uh 440 00:15:41,519 --> 00:15:45,600 which is a a wonderful feature it may 441 00:15:44,079 --> 00:15:47,680 have changed actually but it was a 442 00:15:45,600 --> 00:15:50,399 wonderful feature from slack that 443 00:15:47,680 --> 00:15:52,480 whenever a user 444 00:15:50,399 --> 00:15:54,720 adds an integration or sets up an 445 00:15:52,480 
--> 00:15:57,440 application or anything like this it's 446 00:15:54,720 --> 00:16:00,480 owned by that user themself 447 00:15:57,440 --> 00:16:02,639 and that also means if that user is 448 00:16:00,480 --> 00:16:04,560 disabled or removed 449 00:16:02,639 --> 00:16:07,199 or otherwise 450 00:16:04,560 --> 00:16:09,199 unavailable to manage the integration 451 00:16:07,199 --> 00:16:11,040 anymore then 452 00:16:09,199 --> 00:16:12,480 that integration is disabled 453 00:16:11,040 --> 00:16:15,279 and so 454 00:16:12,480 --> 00:16:18,399 this web hook had 455 00:16:15,279 --> 00:16:22,320 become unavailable to be used and that 456 00:16:18,399 --> 00:16:24,639 meant that no messages were ever sent 457 00:16:22,320 --> 00:16:25,839 but yeah once we created a new one and 458 00:16:24,639 --> 00:16:27,920 we used 459 00:16:25,839 --> 00:16:29,440 a headless user to do so 460 00:16:27,920 --> 00:16:32,399 so that way there was 461 00:16:29,440 --> 00:16:33,519 a much lower risk of it being removed in 462 00:16:32,399 --> 00:16:35,040 the future 463 00:16:33,519 --> 00:16:36,800 we could update the function and it 464 00:16:35,040 --> 00:16:39,360 would work fine again 465 00:16:36,800 --> 00:16:42,079 and then retesting 466 00:16:39,360 --> 00:16:42,079 and it's all good 467 00:16:42,480 --> 00:16:46,560 i think this is this is another a good 468 00:16:44,800 --> 00:16:48,560 thing to think about as well 469 00:16:46,560 --> 00:16:51,279 when it comes to 470 00:16:48,560 --> 00:16:54,240 how these sorts of things fit uh 471 00:16:51,279 --> 00:16:56,240 how they fit together and i'm sure that 472 00:16:54,240 --> 00:16:57,360 if that original 473 00:16:56,240 --> 00:16:58,480 broken 474 00:16:57,360 --> 00:17:01,279 web hook 475 00:16:58,480 --> 00:17:02,399 had been attempted to be used 476 00:17:01,279 --> 00:17:05,919 there would have been an error message 477 00:17:02,399 --> 00:17:07,679 of some sort and 478 00:17:05,919 --> 00:17:09,919 that would have been a 
good thing to 479 00:17:07,679 --> 00:17:12,160 return from that lambda function 480 00:17:09,919 --> 00:17:13,760 if we had been able to see that there 481 00:17:12,160 --> 00:17:16,160 was that error message we would have 482 00:17:13,760 --> 00:17:17,839 known very quickly and it would have 483 00:17:16,160 --> 00:17:19,039 been something that we could have done 484 00:17:17,839 --> 00:17:20,880 something about 485 00:17:19,039 --> 00:17:23,039 and we 486 00:17:20,880 --> 00:17:24,799 wouldn't have missed whatever messages 487 00:17:23,039 --> 00:17:26,799 we had missed along the way up until 488 00:17:24,799 --> 00:17:27,679 that point 489 00:17:26,799 --> 00:17:29,600 but 490 00:17:27,679 --> 00:17:33,200 after a period of a couple of days we 491 00:17:29,600 --> 00:17:35,200 were all back on track again 492 00:17:33,200 --> 00:17:38,000 and so the third topic i wanted to cover 493 00:17:35,200 --> 00:17:40,400 was around incidents and 494 00:17:38,000 --> 00:17:41,840 the incidents uh the the sort of typical 495 00:17:40,400 --> 00:17:43,679 things from 496 00:17:41,840 --> 00:17:46,000 production infrastructure or just 497 00:17:43,679 --> 00:17:47,919 infrastructure in general the the things 498 00:17:46,000 --> 00:17:50,880 where you say hey there's 499 00:17:47,919 --> 00:17:53,280 like something is down or unavailable 500 00:17:50,880 --> 00:17:55,520 uh those those types of things and 501 00:17:53,280 --> 00:17:58,400 the the analogy here is 502 00:17:55,520 --> 00:18:00,320 if a tree falls in the woods 503 00:17:58,400 --> 00:18:01,679 and no one's around to hear it does it 504 00:18:00,320 --> 00:18:03,360 make a sound 505 00:18:01,679 --> 00:18:05,360 uh and 506 00:18:03,360 --> 00:18:06,960 that's where incidents come in 507 00:18:05,360 --> 00:18:07,919 for for us 508 00:18:06,960 --> 00:18:10,720 um 509 00:18:07,919 --> 00:18:11,760 so a couple of years ago um 510 00:18:10,720 --> 00:18:14,000 all 511 00:18:11,760 --> 00:18:16,480 sort of 
interruptions and incidents were 512 00:18:14,000 --> 00:18:19,840 very chaotic they were 513 00:18:16,480 --> 00:18:20,880 all over the place they were confusing 514 00:18:19,840 --> 00:18:23,919 and 515 00:18:20,880 --> 00:18:25,440 they typically happened in private 516 00:18:23,919 --> 00:18:26,720 channels 517 00:18:25,440 --> 00:18:29,200 they were 518 00:18:26,720 --> 00:18:30,480 typically between the same couple of 519 00:18:29,200 --> 00:18:32,160 people 520 00:18:30,480 --> 00:18:34,400 and 521 00:18:32,160 --> 00:18:37,520 often afterwards 522 00:18:34,400 --> 00:18:41,200 there weren't clear actions to come out 523 00:18:37,520 --> 00:18:43,360 from from handling that incident so 524 00:18:41,200 --> 00:18:45,360 the sorts of things like 525 00:18:43,360 --> 00:18:48,960 when there's a problem 526 00:18:45,360 --> 00:18:50,400 and we know that there is a fundamental 527 00:18:48,960 --> 00:18:53,039 uh issue 528 00:18:50,400 --> 00:18:54,000 whether it's design or architecture or 529 00:18:53,039 --> 00:18:56,720 even just 530 00:18:54,000 --> 00:18:58,400 bugs in code 531 00:18:56,720 --> 00:18:59,200 there was 532 00:18:58,400 --> 00:19:01,120 not 533 00:18:59,200 --> 00:19:03,520 always a very clear 534 00:19:01,120 --> 00:19:05,440 outcome to say okay we need to 535 00:19:03,520 --> 00:19:07,760 improve or shift 536 00:19:05,440 --> 00:19:09,600 these things so that way we can prevent 537 00:19:07,760 --> 00:19:10,720 this in the future 538 00:19:09,600 --> 00:19:13,600 and 539 00:19:10,720 --> 00:19:15,120 it was because of those things that we 540 00:19:13,600 --> 00:19:17,200 we decided okay we needed to do 541 00:19:15,120 --> 00:19:19,840 something about this we needed to 542 00:19:17,200 --> 00:19:22,000 make it visible 543 00:19:19,840 --> 00:19:23,760 and so we created our own 544 00:19:22,000 --> 00:19:25,679 incident management framework um and 545 00:19:23,760 --> 00:19:26,880 when i say created our own i don't mean 546 00:19:25,679 --> 
00:19:28,880 we 547 00:19:26,880 --> 00:19:32,160 took a blank blank slate and started 548 00:19:28,880 --> 00:19:34,320 from scratch um we borrowed a lot from a 549 00:19:32,160 --> 00:19:35,840 lot of the common uh publicly known 550 00:19:34,320 --> 00:19:39,679 frameworks such as 551 00:19:35,840 --> 00:19:42,720 uh the one from etsy and what a lot of 552 00:19:39,679 --> 00:19:44,960 places like i think dropbox and netflix 553 00:19:42,720 --> 00:19:45,919 and some of the other uh big software 554 00:19:44,960 --> 00:19:47,679 companies 555 00:19:45,919 --> 00:19:49,919 their frameworks that they've published 556 00:19:47,679 --> 00:19:53,039 and talked about 557 00:19:49,919 --> 00:19:55,120 the intention here was to to streamline 558 00:19:53,039 --> 00:19:56,480 handling uh those 559 00:19:55,120 --> 00:19:57,280 interruptions 560 00:19:56,480 --> 00:19:59,840 um 561 00:19:57,280 --> 00:20:02,320 but also distribute the workload so make 562 00:19:59,840 --> 00:20:04,640 sure that it's not always just those 563 00:20:02,320 --> 00:20:07,919 same handful of people 564 00:20:04,640 --> 00:20:10,080 make it more visible as well uh 565 00:20:07,919 --> 00:20:11,840 for two reasons i guess um 566 00:20:10,080 --> 00:20:13,039 one was 567 00:20:11,840 --> 00:20:15,760 to sort of 568 00:20:13,039 --> 00:20:17,600 show just how much of an impact it makes 569 00:20:15,760 --> 00:20:19,600 to the team when there are these sorts 570 00:20:17,600 --> 00:20:21,760 of interruptions where 571 00:20:19,600 --> 00:20:24,240 groups of people have to drop whatever 572 00:20:21,760 --> 00:20:25,760 work they're doing to to look at a 573 00:20:24,240 --> 00:20:27,679 critical problem 574 00:20:25,760 --> 00:20:30,159 because if they don't then 575 00:20:27,679 --> 00:20:32,799 there's going to be customer problems or 576 00:20:30,159 --> 00:20:32,799 that sort of thing 577 00:20:33,120 --> 00:20:36,159 but the other thing 578 00:20:34,559 --> 00:20:39,200 that was a really great 
outcome from 579 00:20:36,159 --> 00:20:41,440 this was was the visibility and 580 00:20:39,200 --> 00:20:43,679 visibility in this case is 581 00:20:41,440 --> 00:20:46,080 uh both an internal to the company but 582 00:20:43,679 --> 00:20:48,960 an external thing to customers 583 00:20:46,080 --> 00:20:51,360 so not only making 584 00:20:48,960 --> 00:20:52,400 say a product manager aware that the 585 00:20:51,360 --> 00:20:54,320 reason 586 00:20:52,400 --> 00:20:56,799 their squad of engineers hasn't been 587 00:20:54,320 --> 00:20:58,720 able to progress is because of the 588 00:20:56,799 --> 00:21:01,200 incidents that have been happening it's 589 00:20:58,720 --> 00:21:02,960 also to communicate with customers to 590 00:21:01,200 --> 00:21:04,799 make them aware of 591 00:21:02,960 --> 00:21:06,480 like the reasons they're unable to use 592 00:21:04,799 --> 00:21:08,080 the platform right now are because of 593 00:21:06,480 --> 00:21:09,200 these issues happening 594 00:21:08,080 --> 00:21:11,840 and 595 00:21:09,200 --> 00:21:13,760 then giving more of an eta and keeping 596 00:21:11,840 --> 00:21:17,120 them in the loop on how everything's 597 00:21:13,760 --> 00:21:19,200 going and that visibility was was really 598 00:21:17,120 --> 00:21:20,880 one of the biggest impactful changes to 599 00:21:19,200 --> 00:21:23,120 come from the incident management 600 00:21:20,880 --> 00:21:25,120 framework 601 00:21:23,120 --> 00:21:26,000 and so the framework itself 602 00:21:25,120 --> 00:21:28,159 um 603 00:21:26,000 --> 00:21:28,159 we 604 00:21:28,240 --> 00:21:33,280 we have it um 605 00:21:30,400 --> 00:21:34,080 in a way that we give a lot of structure 606 00:21:33,280 --> 00:21:36,480 to 607 00:21:34,080 --> 00:21:38,080 i guess live problem solving 608 00:21:36,480 --> 00:21:40,640 so we we have a set of roles and 609 00:21:38,080 --> 00:21:43,840 responsibilities for those roles uh we 610 00:21:40,640 --> 00:21:45,919 have a hierarchy of um 611 00:21:43,840
--> 00:21:49,200 of the way that those roles interact 612 00:21:45,919 --> 00:21:51,039 with each other and we train people 613 00:21:49,200 --> 00:21:52,000 for each of those roles in particular 614 00:21:51,039 --> 00:21:53,840 because 615 00:21:52,000 --> 00:21:55,919 they have different requirements they 616 00:21:53,840 --> 00:21:57,600 they have a different skill set 617 00:21:55,919 --> 00:22:00,320 and 618 00:21:57,600 --> 00:22:01,840 what might work for one role and the way 619 00:22:00,320 --> 00:22:04,720 that you approach 620 00:22:01,840 --> 00:22:07,840 uh working in that role may not work at 621 00:22:04,720 --> 00:22:09,200 all in another role so we we provide 622 00:22:07,840 --> 00:22:10,960 that training 623 00:22:09,200 --> 00:22:12,880 for those individuals when they would 624 00:22:10,960 --> 00:22:16,080 like to or depending on role 625 00:22:12,880 --> 00:22:16,880 uh some roles are more suited to certain 626 00:22:16,080 --> 00:22:18,640 uh 627 00:22:16,880 --> 00:22:21,360 some roles in the business are more 628 00:22:18,640 --> 00:22:23,039 suited to different roles inside the 629 00:22:21,360 --> 00:22:24,720 framework 630 00:22:23,039 --> 00:22:27,840 and that structure is is really 631 00:22:24,720 --> 00:22:29,919 important because it means when there's 632 00:22:27,840 --> 00:22:32,000 like a critical issue where the entire 633 00:22:29,919 --> 00:22:33,600 platform is unavailable 634 00:22:32,000 --> 00:22:37,039 we don't have 635 00:22:33,600 --> 00:22:38,400 a series of executives or or managers or 636 00:22:37,039 --> 00:22:39,679 a group of people 637 00:22:38,400 --> 00:22:42,640 uh 638 00:22:39,679 --> 00:22:46,320 worrying that the world is on fire 639 00:22:42,640 --> 00:22:48,000 we have a structure that gives people a 640 00:22:46,320 --> 00:22:50,799 grounding to 641 00:22:48,000 --> 00:22:52,480 keep them calmer and actually focus on 642 00:22:50,799 --> 00:22:53,520 solving the problem 643 00:22:52,480 --> 00:22:55,200 
without 644 00:22:53,520 --> 00:22:56,400 worrying as much 645 00:22:55,200 --> 00:22:58,000 um 646 00:22:56,400 --> 00:22:59,919 and it's a great way to facilitate the 647 00:22:58,000 --> 00:23:01,200 communication it's it's the back and 648 00:22:59,919 --> 00:23:03,679 forth the 649 00:23:01,200 --> 00:23:07,120 making sure that people are aware making 650 00:23:03,679 --> 00:23:09,600 sure that stakeholders know that 651 00:23:07,120 --> 00:23:12,640 this problem we know about it and we're 652 00:23:09,600 --> 00:23:15,039 actively working on it um and 653 00:23:12,640 --> 00:23:16,640 they can get updates uh through through 654 00:23:15,039 --> 00:23:18,960 that process and 655 00:23:16,640 --> 00:23:21,520 uh it's same for customers like uh for 656 00:23:18,960 --> 00:23:22,880 our incident management we publish to 657 00:23:21,520 --> 00:23:26,559 status page 658 00:23:22,880 --> 00:23:29,440 and people can subscribe to email or 659 00:23:26,559 --> 00:23:30,960 rss updates that sort of stuff 660 00:23:29,440 --> 00:23:33,120 and the framework also means that we 661 00:23:30,960 --> 00:23:35,200 involve the right people so because we 662 00:23:33,120 --> 00:23:36,880 have those roles we 663 00:23:35,200 --> 00:23:38,880 have people who are trained in those 664 00:23:36,880 --> 00:23:41,440 roles we can 665 00:23:38,880 --> 00:23:44,240 have very particular say pagerduty 666 00:23:41,440 --> 00:23:47,279 groups that we can page people 667 00:23:44,240 --> 00:23:48,880 for those roles in particular 668 00:23:47,279 --> 00:23:51,679 and 669 00:23:48,880 --> 00:23:53,600 we also capture actions as a part of the 670 00:23:51,679 --> 00:23:56,799 process and as a part of the 671 00:23:53,600 --> 00:23:58,720 retrospective afterwards we look at 672 00:23:56,799 --> 00:24:01,679 what the problems were that caused the 673 00:23:58,720 --> 00:24:04,400 incident and then we bring those into 674 00:24:01,679 --> 00:24:06,640 our development workflows and we make 675
00:24:04,400 --> 00:24:08,159 sure that we resolve those long term so 676 00:24:06,640 --> 00:24:11,279 that way we don't have those incidents 677 00:24:08,159 --> 00:24:11,279 again in the future 678 00:24:11,360 --> 00:24:15,919 and so 679 00:24:13,120 --> 00:24:16,880 the visibility aspect of this was all 680 00:24:15,919 --> 00:24:19,279 from 681 00:24:16,880 --> 00:24:21,200 the actual implementation 682 00:24:19,279 --> 00:24:24,080 when we first started the incident 683 00:24:21,200 --> 00:24:26,400 management framework it was mayhem and 684 00:24:24,080 --> 00:24:29,679 there was just a lot of confusion 685 00:24:26,400 --> 00:24:29,679 for about three weeks 686 00:24:30,240 --> 00:24:34,960 mostly in the systems team or actually 687 00:24:32,400 --> 00:24:37,279 in a lot of the engineering teams 688 00:24:34,960 --> 00:24:39,679 there was not really a lot of planned 689 00:24:37,279 --> 00:24:40,799 work getting done and that was because 690 00:24:39,679 --> 00:24:42,880 we would 691 00:24:40,799 --> 00:24:43,919 get an incident called we would all jump 692 00:24:42,880 --> 00:24:45,360 in 693 00:24:43,919 --> 00:24:47,760 as we needed to 694 00:24:45,360 --> 00:24:49,520 and as we resolved it there would be 695 00:24:47,760 --> 00:24:51,039 another incident called for something 696 00:24:49,520 --> 00:24:52,080 completely different 697 00:24:51,039 --> 00:24:54,640 and 698 00:24:52,080 --> 00:24:56,559 when you factor in that and writing the 699 00:24:54,640 --> 00:24:58,080 documentation afterwards and then 700 00:24:56,559 --> 00:25:01,279 meeting to discuss 701 00:24:58,080 --> 00:25:03,360 how it went um and then continuing to 702 00:25:01,279 --> 00:25:06,080 train everybody it was it was very 703 00:25:03,360 --> 00:25:07,279 taxing in the beginning um and it also 704 00:25:06,080 --> 00:25:09,679 gave 705 00:25:07,279 --> 00:25:12,400 uh quite a bit of a a hit to the 706 00:25:09,679 --> 00:25:14,400 confidence across the leadership group 707 
00:25:12,400 --> 00:25:15,440 because it was just hey 708 00:25:14,400 --> 00:25:18,080 what's going on there's all these 709 00:25:15,440 --> 00:25:19,919 incidents um it looks really really bad 710 00:25:18,080 --> 00:25:21,919 uh and that was purely because it's 711 00:25:19,919 --> 00:25:23,919 visible now um 712 00:25:21,919 --> 00:25:26,080 originally these things were happening 713 00:25:23,919 --> 00:25:26,799 anyway it's just that nobody knew about 714 00:25:26,080 --> 00:25:29,039 it 715 00:25:26,799 --> 00:25:30,880 and so once we knew about it but we 716 00:25:29,039 --> 00:25:33,360 actually had a plan to do something 717 00:25:30,880 --> 00:25:37,039 about it we could handle those things 718 00:25:33,360 --> 00:25:38,159 and then we we get better we continually 719 00:25:37,039 --> 00:25:40,880 improve 720 00:25:38,159 --> 00:25:42,640 we also update the the framework if we 721 00:25:40,880 --> 00:25:44,400 need to if we find 722 00:25:42,640 --> 00:25:45,840 that we have different types of 723 00:25:44,400 --> 00:25:47,360 incidents that don't quite lend 724 00:25:45,840 --> 00:25:48,080 themselves to the way that we're doing 725 00:25:47,360 --> 00:25:50,640 it 726 00:25:48,080 --> 00:25:54,240 we can change that and that's fine 727 00:25:50,640 --> 00:25:57,360 and it works really well now um it's 728 00:25:54,240 --> 00:25:58,480 not really mayhem at all anymore 729 00:25:57,360 --> 00:26:00,240 and 730 00:25:58,480 --> 00:26:02,400 we have a really good feel for the way 731 00:26:00,240 --> 00:26:05,760 that the different roles interact 732 00:26:02,400 --> 00:26:08,559 and uh we have a pool of people that we 733 00:26:05,760 --> 00:26:12,640 can lean on to be able to help 734 00:26:08,559 --> 00:26:12,640 if and when anything goes wrong 735 00:26:13,279 --> 00:26:16,240 and so i want to 736 00:26:14,799 --> 00:26:19,200 talk a little bit about some takeaways 737 00:26:16,240 --> 00:26:21,039 then from these three uh particular 738 00:26:19,200 
--> 00:26:22,400 topics in general 739 00:26:21,039 --> 00:26:24,159 um 740 00:26:22,400 --> 00:26:26,080 the first side is the the technical side 741 00:26:24,159 --> 00:26:28,080 of things um 742 00:26:26,080 --> 00:26:30,640 the 743 00:26:28,080 --> 00:26:33,200 the key point of visibility is just that 744 00:26:30,640 --> 00:26:36,080 it's being visible uh and it should be 745 00:26:33,200 --> 00:26:37,600 very basic it shouldn't be complex there 746 00:26:36,080 --> 00:26:38,559 shouldn't be 747 00:26:37,600 --> 00:26:41,279 any 748 00:26:38,559 --> 00:26:44,799 rules engine or anything like that 749 00:26:41,279 --> 00:26:46,640 it's just the bare minimum 750 00:26:44,799 --> 00:26:50,720 making sure that something is running 751 00:26:46,640 --> 00:26:52,559 like having a health check on a service 752 00:26:50,720 --> 00:26:54,080 whatever it is wherever it's running it 753 00:26:52,559 --> 00:26:56,720 doesn't really matter but 754 00:26:54,080 --> 00:26:57,679 having a service be able to 755 00:26:56,720 --> 00:26:58,720 to 756 00:26:57,679 --> 00:27:01,039 alert 757 00:26:58,720 --> 00:27:02,799 when it's not able to do what it's 758 00:27:01,039 --> 00:27:05,360 supposed to do 759 00:27:02,799 --> 00:27:07,600 and you can see like a dashboard for 760 00:27:05,360 --> 00:27:11,360 instance to say like the service is 761 00:27:07,600 --> 00:27:11,360 healthy it's it's working fine 762 00:27:11,840 --> 00:27:15,039 the next thing is to think about the way 763 00:27:13,360 --> 00:27:16,000 that things can fail 764 00:27:15,039 --> 00:27:18,559 so 765 00:27:16,000 --> 00:27:22,000 with our 766 00:27:18,559 --> 00:27:24,559 situation we we thought about okay how 767 00:27:22,000 --> 00:27:26,960 do we know that something's running and 768 00:27:24,559 --> 00:27:29,520 we added some some basic visibility for 769 00:27:26,960 --> 00:27:31,600 that and that was fine but 770 00:27:29,520 --> 00:27:32,960 what we didn't consider was what things 771 00:27:31,600 --> 
00:27:35,679 could go wrong 772 00:27:32,960 --> 00:27:36,480 and that was where we missed that aspect 773 00:27:35,679 --> 00:27:38,559 of 774 00:27:36,480 --> 00:27:40,240 the actual message that gets sent to let 775 00:27:38,559 --> 00:27:42,559 us know that something is wrong we 776 00:27:40,240 --> 00:27:43,279 didn't know when that broke 777 00:27:42,559 --> 00:27:45,200 so 778 00:27:43,279 --> 00:27:46,000 being able to send the message was sort 779 00:27:45,200 --> 00:27:49,120 of 780 00:27:46,000 --> 00:27:50,399 the most fundamental part of all of it 781 00:27:49,120 --> 00:27:53,039 if 782 00:27:50,399 --> 00:27:55,039 if the messages can't be sent then 783 00:27:53,039 --> 00:27:56,880 the whole rest of the pipeline doesn't 784 00:27:55,039 --> 00:27:58,480 matter because we will never get any 785 00:27:56,880 --> 00:28:00,559 output from it 786 00:27:58,480 --> 00:28:01,600 so thinking about that is is something 787 00:28:00,559 --> 00:28:03,600 that's 788 00:28:01,600 --> 00:28:06,000 really good at informing 789 00:28:03,600 --> 00:28:08,960 how the rest of the visibility and 790 00:28:06,000 --> 00:28:11,279 beyond journey goes 791 00:28:08,960 --> 00:28:14,240 it's also good to have redundancy 792 00:28:11,279 --> 00:28:16,640 in the visibility side of things so 793 00:28:14,240 --> 00:28:19,039 adding the observability and the metrics 794 00:28:16,640 --> 00:28:21,200 means that you not only say get the 795 00:28:19,039 --> 00:28:22,960 health check for the visibility but then 796 00:28:21,200 --> 00:28:24,880 you can say okay 797 00:28:22,960 --> 00:28:26,159 the service is reporting that it's 798 00:28:24,880 --> 00:28:27,279 working fine 799 00:28:26,159 --> 00:28:29,440 but now 800 00:28:27,279 --> 00:28:31,279 let's check the the business logic to 801 00:28:29,440 --> 00:28:33,200 make sure that it's still 802 00:28:31,279 --> 00:28:36,960 when it takes an action it's actually 803 00:28:33,200 --> 00:28:38,399 doing the action we expect it to
do 804 00:28:36,960 --> 00:28:41,440 and 805 00:28:38,399 --> 00:28:42,720 cost isn't always about money 806 00:28:41,440 --> 00:28:45,520 so 807 00:28:42,720 --> 00:28:47,600 while the tool that we didn't use was 808 00:28:45,520 --> 00:28:49,360 prohibitively expensive 809 00:28:47,600 --> 00:28:52,399 that doesn't mean that building our own 810 00:28:49,360 --> 00:28:55,039 was necessarily a good idea uh when you 811 00:28:52,399 --> 00:28:58,080 factor in things like uh the hours it 812 00:28:55,039 --> 00:28:59,440 takes for people to maintain or the 813 00:28:58,080 --> 00:29:02,159 hours it took to write in the first 814 00:28:59,440 --> 00:29:04,159 place um and then the hours for 815 00:29:02,159 --> 00:29:05,520 the few of us that dug into why it 816 00:29:04,159 --> 00:29:07,520 wasn't working 817 00:29:05,520 --> 00:29:09,919 they all add up and 818 00:29:07,520 --> 00:29:13,200 those hours and 819 00:29:09,919 --> 00:29:15,200 the time can be converted into some 820 00:29:13,200 --> 00:29:18,320 monetary value that would be comparable 821 00:29:15,200 --> 00:29:19,840 to the actual tool cost itself 822 00:29:18,320 --> 00:29:21,919 and and the final technical point is 823 00:29:19,840 --> 00:29:22,799 that sometimes standards can be good 824 00:29:21,919 --> 00:29:25,440 we 825 00:29:22,799 --> 00:29:27,760 have ways of running services and 826 00:29:25,440 --> 00:29:29,760 processes across the platform and they 827 00:29:27,760 --> 00:29:30,880 weren't used for this which meant that 828 00:29:29,760 --> 00:29:33,039 the team 829 00:29:30,880 --> 00:29:35,120 that already had a good understanding of 830 00:29:33,039 --> 00:29:36,880 how things function 831 00:29:35,120 --> 00:29:38,840 were left a bit 832 00:29:36,880 --> 00:29:42,559 uh in the dark 833 00:29:38,840 --> 00:29:43,440 really and generally for visibility 834 00:29:42,559 --> 00:29:44,480 uh 835 00:29:43,440 --> 00:29:47,520 it might 836 00:29:44,480 --> 00:29:48,559 be horrifying to make 
something visible 837 00:29:47,520 --> 00:29:50,480 and 838 00:29:48,559 --> 00:29:52,720 sometimes that's exactly what you need 839 00:29:50,480 --> 00:29:54,559 so with the incidents 840 00:29:52,720 --> 00:29:56,240 making them visible was really important 841 00:29:54,559 --> 00:29:57,919 but it meant that we could actually do 842 00:29:56,240 --> 00:29:59,840 something about it 843 00:29:57,919 --> 00:30:01,360 and with the data that you get from 844 00:29:59,840 --> 00:30:03,360 these types of things 845 00:30:01,360 --> 00:30:04,320 you can make improvements things can get 846 00:30:03,360 --> 00:30:06,240 better 847 00:30:04,320 --> 00:30:09,120 and it's good to understand who the 848 00:30:06,240 --> 00:30:11,360 target audience is and 849 00:30:09,120 --> 00:30:12,640 you can use that visibility then to 850 00:30:11,360 --> 00:30:15,760 build trust 851 00:30:12,640 --> 00:30:15,760 and and get better 852 00:30:16,559 --> 00:30:19,600 cool that so that was 853 00:30:18,399 --> 00:30:21,679 uh 854 00:30:19,600 --> 00:30:22,960 visibility i'm al 855 00:30:21,679 --> 00:30:25,360 thank you 856 00:30:22,960 --> 00:30:27,120 okay thank you very much alan um we 857 00:30:25,360 --> 00:30:28,960 don't have any questions but there's a 858 00:30:27,120 --> 00:30:30,080 few people sharing your stories in the 859 00:30:28,960 --> 00:30:32,399 chat 860 00:30:30,080 --> 00:30:35,559 all right thanks 861 00:30:32,399 --> 00:30:35,559 thank you
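
The technical takeaways in the talk (a bare-minimum health check per service, plus knowing when the alerting message path itself has gone quiet) could be sketched roughly as below. This is a minimal illustration in Python under stated assumptions, not the tooling the speaker's team built: the names HeartbeatMonitor, check_service and max_silence_seconds are hypothetical, and time is passed in explicitly so the logic stays easy to test.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: treats an alert path as broken when it has
    been silent for longer than max_silence_seconds."""

    max_silence_seconds: float
    last_seen: Optional[float] = None  # timestamp of the last message, if any

    def record(self, now: float) -> None:
        """Call whenever a message (even a routine 'still alive' one) arrives."""
        self.last_seen = now

    def is_stale(self, now: float) -> bool:
        """True if nothing has ever arrived, or the last message is too old."""
        if self.last_seen is None:
            return True
        return (now - self.last_seen) > self.max_silence_seconds


def check_service(health_check: Callable[[], bool]) -> str:
    """Bare-minimum visibility: run a health check and report a status.

    A check that raises is reported as unhealthy rather than crashing
    the caller, so a broken check still shows up on a dashboard.
    """
    try:
        return "healthy" if health_check() else "unhealthy"
    except Exception:
        return "unhealthy"
```

The point of the heartbeat is the one made in the talk: if the messages cannot be sent, the rest of the pipeline produces no output, so silence needs to be treated as a signal in its own right rather than as "everything is fine".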