1 00:00:12,080 --> 00:00:16,240 welcome back to the devops track 2 00:00:14,240 --> 00:00:17,279 everyone i'm justin warren your host for 3 00:00:16,240 --> 00:00:20,080 today 4 00:00:17,279 --> 00:00:22,960 and our next talk is called planning for 5 00:00:20,080 --> 00:00:25,119 failure using chaos engineering and i am 6 00:00:22,960 --> 00:00:26,960 pleased to welcome our presenter amit 7 00:00:25,119 --> 00:00:29,039 saha amit 8 00:00:26,960 --> 00:00:31,279 thank you for presenting here at devoops 9 00:00:29,039 --> 00:00:34,239 take it away thank you 10 00:00:31,279 --> 00:00:36,160 hello everyone my name is amit saha and i 11 00:00:34,239 --> 00:00:37,680 work for atlassian as a software 12 00:00:36,160 --> 00:00:39,840 engineer 13 00:00:37,680 --> 00:00:41,760 today i'll be sharing my learnings and 14 00:00:39,840 --> 00:00:43,920 excitement about using chaos engineering 15 00:00:41,760 --> 00:00:46,320 techniques to help plan for failure 16 00:00:43,920 --> 00:00:48,719 scenarios in software systems 17 00:00:46,320 --> 00:00:50,079 um this is the best kind of talk for me 18 00:00:48,719 --> 00:00:52,079 uh because 19 00:00:50,079 --> 00:00:53,520 i'm going to fumble a bit i'm going to 20 00:00:52,079 --> 00:00:55,600 stutter a bit 21 00:00:53,520 --> 00:00:59,640 but we'll get there then 22 00:00:55,600 --> 00:00:59,640 so yeah let's get started 23 00:01:01,600 --> 00:01:04,640 now 24 00:01:02,800 --> 00:01:06,000 now that you're here 25 00:01:04,640 --> 00:01:07,600 why should you continue listening 26 00:01:06,000 --> 00:01:10,400 to me 27 00:01:07,600 --> 00:01:12,640 in the recent past i worked with teams 28 00:01:10,400 --> 00:01:15,439 to assess their production readiness 29 00:01:12,640 --> 00:01:18,320 we were rolling out several new services 30 00:01:15,439 --> 00:01:20,400 we had alerts we had dashboards but we 31 00:01:18,320 --> 00:01:22,560 wanted to test the various 32 00:01:20,400 --> 00:01:25,439 configurations help team members get
33 00:01:22,560 --> 00:01:27,759 familiar with on-call hygiene 34 00:01:25,439 --> 00:01:29,040 so these team members were fairly new to 35 00:01:27,759 --> 00:01:30,960 the company 36 00:01:29,040 --> 00:01:33,119 so they may have had prior experience 37 00:01:30,960 --> 00:01:35,040 with being on call and 38 00:01:33,119 --> 00:01:36,400 for the services but 39 00:01:35,040 --> 00:01:39,040 the tools 40 00:01:36,400 --> 00:01:40,640 the services that we had in-house were 41 00:01:39,040 --> 00:01:42,720 quite different from what they were used 42 00:01:40,640 --> 00:01:45,119 to before 43 00:01:42,720 --> 00:01:47,759 so we wanted to get them uh you know 44 00:01:45,119 --> 00:01:49,840 sort of warming up almost for their 45 00:01:47,759 --> 00:01:51,439 on-call 46 00:01:49,840 --> 00:01:53,200 schedules 47 00:01:51,439 --> 00:01:54,720 atlassian had a very well-developed 48 00:01:53,200 --> 00:01:56,960 culture around chaos engineering 49 00:01:54,720 --> 00:02:00,079 practices we call it war games 50 00:01:56,960 --> 00:02:02,640 internally and i adopted that as my key 51 00:02:00,079 --> 00:02:04,320 strategy for this exercise 52 00:02:02,640 --> 00:02:06,799 and i became a fan 53 00:02:04,320 --> 00:02:09,360 of applying chaos engineering techniques 54 00:02:06,799 --> 00:02:11,440 for this purpose and so i want to spread 55 00:02:09,360 --> 00:02:13,840 my excitement and share my learnings 56 00:02:11,440 --> 00:02:13,840 today 57 00:02:14,400 --> 00:02:19,280 i will start my talk with a hypothesis 58 00:02:17,120 --> 00:02:22,720 and by the end i'll have made enough 59 00:02:19,280 --> 00:02:24,959 points towards proving that hypothesis 60 00:02:22,720 --> 00:02:26,959 and the hypothesis is 61 00:02:24,959 --> 00:02:28,640 chaos engineering techniques can help 62 00:02:26,959 --> 00:02:31,120 continuously assess your production 63 00:02:28,640 --> 00:02:31,120 readiness 64 00:02:31,200 --> 00:02:35,360 but i'll not declare it proved or 65 00:02:33,200 -->
00:02:38,160 validated by the end that's for you to 66 00:02:35,360 --> 00:02:40,720 find out for yourself but i'm positive 67 00:02:38,160 --> 00:02:42,080 you will find that's the case 68 00:02:40,720 --> 00:02:43,360 i also like to think of production 69 00:02:42,080 --> 00:02:47,440 readiness 70 00:02:43,360 --> 00:02:49,680 as failure readiness since in my head um 71 00:02:47,440 --> 00:02:50,800 and probably all of your heads 72 00:02:49,680 --> 00:02:51,599 production 73 00:02:50,800 --> 00:02:53,519 is 74 00:02:51,599 --> 00:02:56,480 pretty much failure in some shape or 75 00:02:53,519 --> 00:02:58,879 form whether it's point zero one percent 76 00:02:56,480 --> 00:03:00,159 or whether it's point uh double zero one 77 00:02:58,879 --> 00:03:03,800 percent or 78 00:03:00,159 --> 00:03:03,800 um ten percent 79 00:03:05,760 --> 00:03:10,159 we have uh 80 00:03:07,440 --> 00:03:12,640 sort of i have divided the talk into 81 00:03:10,159 --> 00:03:14,239 four key areas we'll start off with an 82 00:03:12,640 --> 00:03:15,920 introduction 83 00:03:14,239 --> 00:03:17,519 we'll then go over organizing chaos 84 00:03:15,920 --> 00:03:20,319 engineering experiments 85 00:03:17,519 --> 00:03:22,720 we'll talk a bit about people systems 86 00:03:20,319 --> 00:03:24,239 then we'll talk about chaos engineering 87 00:03:22,720 --> 00:03:26,239 tools 88 00:03:24,239 --> 00:03:29,120 and then finally a summary to really 89 00:03:26,239 --> 00:03:32,319 drill down the chaos into all of you 90 00:03:29,120 --> 00:03:34,319 let's get started 91 00:03:32,319 --> 00:03:35,519 so the best laid plans often go 92 00:03:34,319 --> 00:03:38,480 awry 93 00:03:35,519 --> 00:03:42,080 and i have lived in sydney long enough 94 00:03:38,480 --> 00:03:44,000 um to see enough of these that the next 95 00:03:42,080 --> 00:03:45,680 time i see it which will be after a 96 00:03:44,000 --> 00:03:46,640 pretty long time because of obvious 97 00:03:45,680 --> 00:03:49,599 reasons 98
00:03:46,640 --> 00:03:51,280 i will probably be surprised that okay 99 00:03:49,599 --> 00:03:53,280 yeah we're back again here 100 00:03:51,280 --> 00:03:55,519 um so this is from a new service that 101 00:03:53,280 --> 00:03:57,760 was rolled out for those who are for 102 00:03:55,519 --> 00:04:00,879 those of you who are not from um sydney 103 00:03:57,760 --> 00:04:02,879 uh this is the sydney metro and uh which 104 00:04:00,879 --> 00:04:05,360 was rolled out with a lot of fanfare 105 00:04:02,879 --> 00:04:07,519 after a lot of planning uh with 106 00:04:05,360 --> 00:04:10,000 sufficient testing for that matter but 107 00:04:07,519 --> 00:04:11,439 within days we had issues products so we 108 00:04:10,000 --> 00:04:12,720 rolled out something into production and 109 00:04:11,439 --> 00:04:14,080 we had issues 110 00:04:12,720 --> 00:04:15,760 and that's why we have the train 111 00:04:14,080 --> 00:04:17,840 replacement buses 112 00:04:15,760 --> 00:04:20,079 always ready to go 113 00:04:17,840 --> 00:04:23,120 um so getting closer to the conference 114 00:04:20,079 --> 00:04:27,280 topic systems fail 115 00:04:23,120 --> 00:04:30,240 uh we have hardware failures we have 116 00:04:27,280 --> 00:04:32,560 mercurial cores cores that don't count 117 00:04:30,240 --> 00:04:34,400 so check out this paper called silent 118 00:04:32,560 --> 00:04:36,720 data corruption which is a paper by 119 00:04:34,400 --> 00:04:38,720 facebook and related work by google 120 00:04:36,720 --> 00:04:42,639 where they found that a task 121 00:04:38,720 --> 00:04:44,880 only failed on a specific core 122 00:04:42,639 --> 00:04:47,040 closer to home we perhaps don't 123 00:04:44,880 --> 00:04:50,000 experience not most of us experience 124 00:04:47,040 --> 00:04:52,960 mercurial cores but we do experience 125 00:04:50,000 --> 00:04:55,840 exhaustion of finite resources memory 126 00:04:52,960 --> 00:04:59,120 disk space cpu cycles network capacity 127 00:04:55,840 -->
00:05:01,520 we are all too familiar with this 128 00:04:59,120 --> 00:05:02,479 and then we have software failures 129 00:05:01,520 --> 00:05:04,880 bugs 130 00:05:02,479 --> 00:05:07,680 where we may not have tested a code path 131 00:05:04,880 --> 00:05:10,320 properly where we may not have 132 00:05:07,680 --> 00:05:11,680 foreseen something breaking in a way 133 00:05:10,320 --> 00:05:14,000 that it broke 134 00:05:11,680 --> 00:05:15,520 and then we have real world failures 135 00:05:14,000 --> 00:05:18,000 breaking underlying assumptions in 136 00:05:15,520 --> 00:05:18,000 software 137 00:05:19,039 --> 00:05:23,600 and then people during an incident the 138 00:05:21,680 --> 00:05:24,960 on-call person finding out that they 139 00:05:23,600 --> 00:05:27,680 don't have the necessary access to 140 00:05:24,960 --> 00:05:30,479 deactivate a malicious account 141 00:05:27,680 --> 00:05:32,800 and i can bet that 142 00:05:30,479 --> 00:05:35,199 a lot among you have actually 143 00:05:32,800 --> 00:05:38,479 experienced this 144 00:05:35,199 --> 00:05:40,400 and hence we put in circuit breakers we 145 00:05:38,479 --> 00:05:41,600 have rate limiters 146 00:05:40,400 --> 00:05:44,320 we have 147 00:05:41,600 --> 00:05:46,800 exponential backoffs in retries in our 148 00:05:44,320 --> 00:05:48,320 code we put in redundancy 149 00:05:46,800 --> 00:05:50,320 we know we want to run more than one 150 00:05:48,320 --> 00:05:52,800 copy of our application 151 00:05:50,320 --> 00:05:54,320 we put in disaster recovery plans 152 00:05:52,800 --> 00:05:55,360 and we create runbooks for people to 153 00:05:54,320 --> 00:05:57,759 follow 154 00:05:55,360 --> 00:06:00,080 and we also have our fail safe which is 155 00:05:57,759 --> 00:06:02,000 okay if everything else fails we'll 156 00:06:00,080 --> 00:06:04,400 still um 157 00:06:02,000 --> 00:06:06,639 have we'll say put in you know four 158 00:06:04,400 --> 00:06:08,000 nines and if we don't meet those four 159
00:06:06,639 --> 00:06:10,319 nines we'll 160 00:06:08,000 --> 00:06:12,080 pay we'll sort of compensate our customer 161 00:06:10,319 --> 00:06:14,560 in some way so we essentially throw 162 00:06:12,080 --> 00:06:16,560 money at the problem uh by signing slas 163 00:06:14,560 --> 00:06:19,120 we don't even tell them that yes we are 164 00:06:16,560 --> 00:06:22,000 going to be hundred percent up for you 165 00:06:19,120 --> 00:06:23,039 we factored that in 166 00:06:22,000 --> 00:06:24,880 and yet 167 00:06:23,039 --> 00:06:27,440 the next incident happens 168 00:06:24,880 --> 00:06:29,280 and catches us by surprise 169 00:06:27,440 --> 00:06:31,360 alerts didn't fire 170 00:06:29,280 --> 00:06:33,039 auto scaling didn't happen 171 00:06:31,360 --> 00:06:35,039 with a question mark 172 00:06:33,039 --> 00:06:37,919 i thought we had circuit breakers oh and 173 00:06:35,039 --> 00:06:39,759 we just had a slo breach and if you had 174 00:06:37,919 --> 00:06:43,600 any doubts about which pycon au track 175 00:06:39,759 --> 00:06:43,600 you are in you're in the devoops track 176 00:06:43,680 --> 00:06:48,240 so what can you do um 177 00:06:45,759 --> 00:06:49,919 we can borrow some ideas from the 178 00:06:48,240 --> 00:06:51,520 scientific method 179 00:06:49,919 --> 00:06:53,599 the scientific method 180 00:06:51,520 --> 00:06:57,199 essentially is 181 00:06:53,599 --> 00:07:00,240 it's a process an ongoing process where 182 00:06:57,199 --> 00:07:02,560 we question we ask questions we research 183 00:07:00,240 --> 00:07:05,440 an area we form a hypothesis 184 00:07:02,560 --> 00:07:07,199 we test that hypothesis with experiments 185 00:07:05,440 --> 00:07:09,599 we analyze the data 186 00:07:07,199 --> 00:07:10,479 we report conclusions 187 00:07:09,599 --> 00:07:12,560 uh 188 00:07:10,479 --> 00:07:16,160 come up with action items and then we 189 00:07:12,560 --> 00:07:17,120 repeat rinse and repeat 190 00:07:16,160 --> 00:07:19,199 so 191 00:07:17,120
--> 00:07:22,800 we are going to put our science hat 192 00:07:19,199 --> 00:07:26,800 on and we are going to learn how we can 193 00:07:22,800 --> 00:07:26,800 organize chaos engineering experiments 194 00:07:27,680 --> 00:07:32,479 the basic idea is we want to 195 00:07:30,400 --> 00:07:34,479 intentionally break our systems under 196 00:07:32,479 --> 00:07:37,039 controlled conditions 197 00:07:34,479 --> 00:07:40,240 we break the carefully orchestrated mix 198 00:07:37,039 --> 00:07:42,960 of sync and async workflows and verify 199 00:07:40,240 --> 00:07:45,120 the guardrails 200 00:07:42,960 --> 00:07:49,199 what do we break 201 00:07:45,120 --> 00:07:50,400 we break our finite resources um exhaust 202 00:07:49,199 --> 00:07:52,560 the cpu 203 00:07:50,400 --> 00:07:54,000 exhaust the memory exhaust file 204 00:07:52,560 --> 00:07:55,199 descriptors 205 00:07:54,000 --> 00:07:57,840 we 206 00:07:55,199 --> 00:07:59,440 test the network capacity at 207 00:07:57,840 --> 00:08:00,160 different levels 208 00:07:59,440 --> 00:08:03,199 we 209 00:08:00,160 --> 00:08:05,120 think about breaking our dependencies 210 00:08:03,199 --> 00:08:06,960 our libraries our services the data 211 00:08:05,120 --> 00:08:08,080 stores that our application 212 00:08:06,960 --> 00:08:09,840 depends on 213 00:08:08,080 --> 00:08:11,360 and any resource your software really 214 00:08:09,840 --> 00:08:13,840 depends on 215 00:08:11,360 --> 00:08:16,479 your logging your monitoring your 216 00:08:13,840 --> 00:08:20,000 apm the application monitoring 217 00:08:16,479 --> 00:08:22,000 management agents i believe i got that 218 00:08:20,000 --> 00:08:23,840 abbreviation correct so anything that 219 00:08:22,000 --> 00:08:25,599 you're running on the system 220 00:08:23,840 --> 00:08:26,639 and your application is talking to 221 00:08:25,599 --> 00:08:29,360 directly 222 00:08:26,639 --> 00:08:30,319 should be a potential uh fault injection 223 00:08:29,360 -->
00:08:32,479 point 224 00:08:30,319 --> 00:08:34,640 so it's really about becoming curious 225 00:08:32,479 --> 00:08:35,360 about your system and uh becoming the 226 00:08:34,640 --> 00:08:36,479 bad 227 00:08:35,360 --> 00:08:39,519 actor 228 00:08:36,479 --> 00:08:41,680 and thinking of ways you can break it 229 00:08:39,519 --> 00:08:43,599 so for example let's say you have 230 00:08:41,680 --> 00:08:45,120 implemented an asynchronous workflow in 231 00:08:43,599 --> 00:08:46,240 your application 232 00:08:45,120 --> 00:08:48,160 and 233 00:08:46,240 --> 00:08:51,200 introduce a timeout which is longer than 234 00:08:48,160 --> 00:08:51,200 your polling duration 235 00:08:53,279 --> 00:08:58,080 does your app keep polling forever 236 00:08:56,959 --> 00:08:59,920 let's say you have webhooks in your 237 00:08:58,080 --> 00:09:01,279 application receive the webhook 238 00:08:59,920 --> 00:09:05,519 correctly 239 00:09:01,279 --> 00:09:07,600 but drop the payload what happens 240 00:09:05,519 --> 00:09:09,519 does something tell you that 241 00:09:07,600 --> 00:09:11,120 something like this has happened or you 242 00:09:09,519 --> 00:09:12,160 are basically left scratching your heads 243 00:09:11,120 --> 00:09:15,200 thinking 244 00:09:12,160 --> 00:09:17,839 you know what's going on 245 00:09:15,200 --> 00:09:20,080 and a final example let's say you have 246 00:09:17,839 --> 00:09:22,720 enabled um you are using wsgi 247 00:09:20,080 --> 00:09:27,200 applications you're using something like 248 00:09:22,720 --> 00:09:29,279 gunicorn or uwsgi you have enabled 249 00:09:27,200 --> 00:09:31,440 worker recycling behavior and you have 250 00:09:29,279 --> 00:09:33,839 been careful enough to set up a jitter 251 00:09:31,440 --> 00:09:36,080 so that your workers don't restart 252 00:09:33,839 --> 00:09:37,279 all, that they don't all restart at the same 253 00:09:36,080 --> 00:09:39,680 time 254 00:09:37,279 --> 00:09:42,160 test that send traffic to your service 255
00:09:39,680 --> 00:09:43,760 at an expected requests per second 256 00:09:42,160 --> 00:09:45,040 how does the worker recycling behavior 257 00:09:43,760 --> 00:09:46,480 manifest 258 00:09:45,040 --> 00:09:48,959 um does 259 00:09:46,480 --> 00:09:50,880 with that rps 260 00:09:48,959 --> 00:09:53,760 are there enough workers 261 00:09:50,880 --> 00:09:55,440 um there to serve requests when the 262 00:09:53,760 --> 00:09:58,399 others are recycling are those 263 00:09:55,440 --> 00:10:01,200 sufficient um kill a worker periodically 264 00:09:58,399 --> 00:10:01,200 see what happens 265 00:10:02,560 --> 00:10:08,160 a formal chaos engineering experiment 266 00:10:04,800 --> 00:10:10,320 consists of four key steps 267 00:10:08,160 --> 00:10:12,399 you start off with a steady state 268 00:10:10,320 --> 00:10:14,079 hypothesis 269 00:10:12,399 --> 00:10:16,800 then you 270 00:10:14,079 --> 00:10:18,880 have the fault injection phase 271 00:10:16,800 --> 00:10:21,760 then you verify the hypothesis that you 272 00:10:18,880 --> 00:10:24,160 started off with in step one and then you 273 00:10:21,760 --> 00:10:26,000 identify action items and then you sort 274 00:10:24,160 --> 00:10:27,680 of rinse and repeat 275 00:10:26,000 --> 00:10:30,240 um as an example 276 00:10:27,680 --> 00:10:32,800 let's say you have two services 277 00:10:30,240 --> 00:10:34,560 service one and service two and service 278 00:10:32,800 --> 00:10:37,519 one depends on service two 279 00:10:34,560 --> 00:10:39,680 so we formulate a hypothesis saying if 280 00:10:37,519 --> 00:10:42,560 service two does not respond within 281 00:10:39,680 --> 00:10:44,800 300 milliseconds service one is going to 282 00:10:42,560 --> 00:10:46,399 return an error within 283 00:10:44,800 --> 00:10:48,640 close to 300 milliseconds it won't be 284 00:10:46,399 --> 00:10:51,279 exactly 300 milliseconds because 285 00:10:48,640 --> 00:10:54,000 think bits and bytes take time to travel 286 00:10:51,279 -->
00:10:56,480 so it will be close to 300 milliseconds 287 00:10:54,000 --> 00:10:58,800 and then so that's a hypothesis 288 00:10:56,480 --> 00:11:01,519 and then you inject a fault you say i'm 289 00:10:58,800 --> 00:11:04,079 going to add a latency of five seconds 290 00:11:01,519 --> 00:11:07,040 uh between service two um and the 291 00:11:04,079 --> 00:11:08,320 database so the application i've got the 292 00:11:07,040 --> 00:11:10,560 application text here it should be 293 00:11:08,320 --> 00:11:12,880 service two uh so essentially add a 294 00:11:10,560 --> 00:11:15,120 latency of five seconds between service 295 00:11:12,880 --> 00:11:17,519 two and the database did the application 296 00:11:15,120 --> 00:11:18,320 return an error after 300 milliseconds 297 00:11:17,519 --> 00:11:19,760 and 298 00:11:18,320 --> 00:11:20,720 in this case the application is service 299 00:11:19,760 --> 00:11:22,720 one 300 00:11:20,720 --> 00:11:25,040 so that's your hypothesis verification 301 00:11:22,720 --> 00:11:27,040 phase or did it wait longer 302 00:11:25,040 --> 00:11:28,560 um and depending on the answer to 303 00:11:27,040 --> 00:11:30,880 that question you come up with action 304 00:11:28,560 --> 00:11:32,560 items if it waited longer you should 305 00:11:30,880 --> 00:11:35,839 check out your timeout configurations 306 00:11:32,560 --> 00:11:37,680 and repeat the experiment to ensure that 307 00:11:35,839 --> 00:11:39,600 the timeout has actually solved the 308 00:11:37,680 --> 00:11:41,680 problem and that it validates 309 00:11:39,600 --> 00:11:42,560 your hypothesis 310 00:11:41,680 --> 00:11:45,120 now 311 00:11:42,560 --> 00:11:46,959 in a more production-like scenario 312 00:11:45,120 --> 00:11:49,760 you would want to verify 313 00:11:46,959 --> 00:11:51,040 whether your alerts fired or not as well 314 00:11:49,760 --> 00:11:53,120 telling you that 315 00:11:51,040 --> 00:11:57,279 yes this service one is 316 00:11:53,120 -->
00:11:57,279 certainly experiencing higher latency 317 00:11:57,600 --> 00:12:00,240 that's an example 318 00:11:58,959 --> 00:12:01,680 so 319 00:12:00,240 --> 00:12:04,160 now there are 320 00:12:01,680 --> 00:12:06,800 three key steps before you run a chaos 321 00:12:04,160 --> 00:12:08,880 engineering experiment the first and the most 322 00:12:06,800 --> 00:12:11,760 important step is to brainstorm the 323 00:12:08,880 --> 00:12:13,279 experiment and formulate your hypothesis 324 00:12:11,760 --> 00:12:16,160 that is step one 325 00:12:13,279 --> 00:12:16,160 from the previous slide 326 00:12:16,720 --> 00:12:21,920 what you will really find useful and 327 00:12:19,360 --> 00:12:23,200 what you will really need to design good 328 00:12:21,920 --> 00:12:27,519 experiments 329 00:12:23,200 --> 00:12:29,440 is a very uh good idea of the system 330 00:12:27,519 --> 00:12:30,320 that you are planning to inject faults 331 00:12:29,440 --> 00:12:32,079 in 332 00:12:30,320 --> 00:12:33,920 system design and architecture knowledge 333 00:12:32,079 --> 00:12:36,800 is essential to come up with interesting 334 00:12:33,920 --> 00:12:38,720 experiment ideas because it's easy to 335 00:12:36,800 --> 00:12:39,839 come up with very 336 00:12:38,720 --> 00:12:42,480 generic 337 00:12:39,839 --> 00:12:45,200 uh experimental ideas and those are good 338 00:12:42,480 --> 00:12:48,399 places to start but eventually the 339 00:12:45,200 --> 00:12:51,839 biggest uh bang for your buck almost 340 00:12:48,399 --> 00:12:53,920 will come to say almost will come from 341 00:12:51,839 --> 00:12:56,959 uh if you are designing these key 342 00:12:53,920 --> 00:13:01,200 experiments so for example um let's say 343 00:12:56,959 --> 00:13:04,720 you have an incident so 344 00:13:01,200 --> 00:13:06,480 that incident could give you a good um 345 00:13:04,720 --> 00:13:08,639 that could be your uh 346 00:13:06,480 --> 00:13:10,560 seed to come up with this experiment 347 00:13:08,639
--> 00:13:13,360 perhaps you could inject the same 348 00:13:10,560 --> 00:13:14,480 failure that caused 349 00:13:13,360 --> 00:13:17,360 the incident 350 00:13:14,480 --> 00:13:19,360 and uh for your chaos engineering 351 00:13:17,360 --> 00:13:21,839 experiment 352 00:13:19,360 --> 00:13:23,440 so that this is really a key step 353 00:13:21,839 --> 00:13:26,000 um i will just jump over to the next 354 00:13:23,440 --> 00:13:28,160 slide so 355 00:13:26,000 --> 00:13:29,519 now this is how we do planning at 356 00:13:28,160 --> 00:13:30,959 atlassian 357 00:13:29,519 --> 00:13:32,240 well i'm just kidding this is not how we 358 00:13:30,959 --> 00:13:34,959 do it um 359 00:13:32,240 --> 00:13:36,959 so this is a miniature war gaming 360 00:13:34,959 --> 00:13:39,519 um you know 361 00:13:36,959 --> 00:13:41,839 like a picture and but this is roughly 362 00:13:39,519 --> 00:13:43,600 what you should so this is uh a term 363 00:13:41,839 --> 00:13:46,079 there's a term we use internally called 364 00:13:43,600 --> 00:13:49,279 tabletop wargaming um i learned it from 365 00:13:46,079 --> 00:13:51,360 a colleague and what the term means is 366 00:13:49,279 --> 00:13:53,199 you get together with the team your team 367 00:13:51,360 --> 00:13:55,120 or the maybe the architect if you have 368 00:13:53,199 --> 00:13:58,160 one maybe the senior engineers in the 369 00:13:55,120 --> 00:13:59,839 team and you sort of brainstorm ideas to 370 00:13:58,160 --> 00:14:02,079 break your system 371 00:13:59,839 --> 00:14:03,839 and you do that on a whiteboard you do 372 00:14:02,079 --> 00:14:06,240 it on a table on a virtual whiteboard 373 00:14:03,839 --> 00:14:08,240 these days and so you lay down all the 374 00:14:06,240 --> 00:14:09,760 pieces you lay down all the dependencies 375 00:14:08,240 --> 00:14:11,519 for example you lay down the different 376 00:14:09,760 --> 00:14:13,279 interactions of your system with your 377
external systems and then you think 378 00:14:13,279 --> 00:14:18,720 about breaking them 379 00:14:16,800 --> 00:14:21,279 so once you have got the experiment once 380 00:14:18,720 --> 00:14:24,160 you've got the hypothesis the next step 381 00:14:21,279 --> 00:14:25,760 is how do you inject those failures so 382 00:14:24,160 --> 00:14:28,800 choose your tools 383 00:14:25,760 --> 00:14:32,000 uh we will go over a few tools 384 00:14:28,800 --> 00:14:35,440 later in the presentation 385 00:14:32,000 --> 00:14:37,120 but sometimes these will be quite custom 386 00:14:35,440 --> 00:14:39,279 tools or custom code that you will have 387 00:14:37,120 --> 00:14:41,040 to write maybe perhaps modifying your 388 00:14:39,279 --> 00:14:43,279 application in a way 389 00:14:41,040 --> 00:14:46,160 to inject those failures so it will be a 390 00:14:43,279 --> 00:14:47,600 mix of both generic tools as well as 391 00:14:46,160 --> 00:14:50,560 something custom that you will have to 392 00:14:47,600 --> 00:14:52,560 implement for your application 393 00:14:50,560 --> 00:14:54,399 so once you're done with 394 00:14:52,560 --> 00:14:56,959 the brainstorming you have chosen your 395 00:14:54,399 --> 00:14:59,760 tools the third step is schedule a 396 00:14:56,959 --> 00:15:01,760 meeting with all the key people so for 397 00:14:59,760 --> 00:15:04,240 example people who would be on call the 398 00:15:01,760 --> 00:15:06,320 team leads architects and 399 00:15:04,240 --> 00:15:08,320 typically you'd block around one to one 400 00:15:06,320 --> 00:15:10,800 and a half hours 401 00:15:08,320 --> 00:15:12,800 for conducting the experiment as well as 402 00:15:10,800 --> 00:15:15,199 then coming up with uh you know 403 00:15:12,800 --> 00:15:17,199 analyzing the outcomes validating your 404 00:15:15,199 --> 00:15:20,639 hypothesis coming up with the action 405 00:15:17,199 --> 00:15:22,480 items um and uh just 406 00:15:20,639 --> 00:15:24,639 sort of doing uh 407 00:15:22,480 -->
00:15:28,000 uh like a comparison between expectation 408 00:15:24,639 --> 00:15:28,800 and reality and then you repeat the step 409 00:15:28,000 --> 00:15:32,079 again 410 00:15:28,800 --> 00:15:34,399 so it is a fairly time consuming step it 411 00:15:32,079 --> 00:15:36,560 is because to really get value out of it 412 00:15:34,399 --> 00:15:39,800 you need to devote that time into doing 413 00:15:36,560 --> 00:15:39,800 these experiments 414 00:15:41,040 --> 00:15:47,600 another uh key so to 415 00:15:44,560 --> 00:15:49,360 sort of um you know like uh 416 00:15:47,600 --> 00:15:51,199 have a single person orchestrating all 417 00:15:49,360 --> 00:15:53,519 of these you know arranging meetings 418 00:15:51,199 --> 00:15:57,279 ensuring that people turn up 419 00:15:53,519 --> 00:15:59,040 uh what we follow internally is we 420 00:15:57,279 --> 00:16:00,639 nominate a person called the war game 421 00:15:59,040 --> 00:16:02,800 champion who is in charge of 422 00:16:00,639 --> 00:16:04,720 coordinating the various logistics 423 00:16:02,800 --> 00:16:06,480 associated with the process 424 00:16:04,720 --> 00:16:07,920 you know schedule the experiment and 425 00:16:06,480 --> 00:16:09,360 then orchestrate the experiment along 426 00:16:07,920 --> 00:16:11,680 with other team members so this is the 427 00:16:09,360 --> 00:16:14,560 person who is uh driving this whole 428 00:16:11,680 --> 00:16:16,560 thing um along with and with help from 429 00:16:14,560 --> 00:16:17,759 everybody else who is involved with the 430 00:16:16,560 --> 00:16:19,120 experiment 431 00:16:17,759 --> 00:16:21,759 and of course you should ensure your 432 00:16:19,120 --> 00:16:23,600 failure injection tools and scripts work 433 00:16:21,759 --> 00:16:25,759 before the real experiment 434 00:16:23,600 --> 00:16:28,160 the last thing you want is get somebody 435 00:16:25,759 --> 00:16:28,640 in a room for one and a half hours um 436 00:16:28,160 --> 00:16:29,759 and then
437 00:16:28,640 --> 00:16:31,680 [Music] 438 00:16:29,759 --> 00:16:33,920 figure out uh that your scripts are not 439 00:16:31,680 --> 00:16:37,680 working 440 00:16:33,920 --> 00:16:40,800 during um and there are two other 441 00:16:37,680 --> 00:16:42,800 or perhaps one step that is really vital 442 00:16:40,800 --> 00:16:45,839 is the ability to generate load or 443 00:16:42,800 --> 00:16:48,560 traffic um for your experiment 444 00:16:45,839 --> 00:16:51,759 locust a python community favorite is 445 00:16:48,560 --> 00:16:53,519 great for http services um if you have 446 00:16:51,759 --> 00:16:55,519 integration or synthetic tests they are 447 00:16:53,519 --> 00:16:57,519 useful too uh the idea is that you need 448 00:16:55,519 --> 00:16:59,519 to generate enough traffic so that you 449 00:16:57,519 --> 00:17:00,639 can um you know there's something to 450 00:16:59,519 --> 00:17:02,800 break 451 00:17:00,639 --> 00:17:05,039 in your application 452 00:17:02,800 --> 00:17:06,799 during the experiment get everyone in a 453 00:17:05,039 --> 00:17:08,319 virtual room um 454 00:17:06,799 --> 00:17:10,400 when we are back in office it can be a 455 00:17:08,319 --> 00:17:12,959 single conference room like uh with 456 00:17:10,400 --> 00:17:15,600 dashboards and notification 457 00:17:12,959 --> 00:17:17,600 channels probably uh up on a giant 458 00:17:15,600 --> 00:17:20,400 monitor or screen 459 00:17:17,600 --> 00:17:22,559 um and then there's this other aspect is 460 00:17:20,400 --> 00:17:25,039 do you do blind failure injection or do 461 00:17:22,559 --> 00:17:27,679 you do predefined fault injection blind 462 00:17:25,039 --> 00:17:29,919 failure injections are super useful for 463 00:17:27,679 --> 00:17:32,400 like a as a training day exercise where 464 00:17:29,919 --> 00:17:33,840 you want to you like maybe senior 465 00:17:32,400 --> 00:17:34,640 members want to 466 00:17:33,840 --> 00:17:36,400 do 467 00:17:34,640 --> 00:17:38,000 use it as like a
tool for training of 468 00:17:36,400 --> 00:17:39,360 your junior team members 469 00:17:38,000 --> 00:17:41,919 or it can be 470 00:17:39,360 --> 00:17:42,960 the predefined fault injections where 471 00:17:41,919 --> 00:17:44,400 the faults that are going to be 472 00:17:42,960 --> 00:17:47,840 introduced are well defined and well 473 00:17:44,400 --> 00:17:47,840 communicated before 474 00:17:48,480 --> 00:17:52,559 another fundamental question is 475 00:17:50,559 --> 00:17:54,880 staging or production 476 00:17:52,559 --> 00:17:57,039 uh the key is you should be able to 477 00:17:54,880 --> 00:17:58,640 easily reset your systems to the state 478 00:17:57,039 --> 00:18:00,400 that they were before you started the 479 00:17:58,640 --> 00:18:02,000 experiment 480 00:18:00,400 --> 00:18:03,360 start gradually with the staging and 481 00:18:02,000 --> 00:18:04,559 then when you're confident go for 482 00:18:03,360 --> 00:18:06,400 production 483 00:18:04,559 --> 00:18:09,200 and perhaps if you have automated 484 00:18:06,400 --> 00:18:11,120 failure injection you may want it as an 485 00:18:09,200 --> 00:18:15,559 opt-out model for staging 486 00:18:11,120 --> 00:18:15,559 and an opt-in model for production 487 00:18:15,600 --> 00:18:19,840 a couple of prerequisites you need to be 488 00:18:18,000 --> 00:18:21,919 able to run arbitrary commands or have 489 00:18:19,840 --> 00:18:24,960 admin style privileges in your cloud or 490 00:18:21,919 --> 00:18:25,760 on-prem account and you also need to 491 00:18:24,960 --> 00:18:28,400 have 492 00:18:25,760 --> 00:18:30,320 metrics dashboards logs 493 00:18:28,400 --> 00:18:32,080 otherwise you don't really know what's 494 00:18:30,320 --> 00:18:33,760 going on with your application and you 495 00:18:32,080 --> 00:18:36,080 won't be able to 496 00:18:33,760 --> 00:18:38,720 analyze the result of introducing 497 00:18:36,080 --> 00:18:38,720 those faults 498 00:18:39,440 --> 00:18:44,799 great so people systems
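As a recap of the four-step experiment described earlier in the talk (steady-state hypothesis, fault injection, verification, action items), the service one / service two timeout example could be sketched roughly like this in Python. This is only an illustration under stated assumptions and not code from the talk: `service_two`, `service_one`, `call_with_timeout`, `run_experiment`, and the 0.25 second tolerance are hypothetical names and values, and the injected latency is simulated with `time.sleep` inside service two instead of being added between service two and a real database.

```python
import threading
import time

TIMEOUT_S = 0.3  # the 300 ms budget from the hypothesis


def service_two(injected_latency_s):
    """Hypothetical dependency; the sleep stands in for the injected fault."""
    time.sleep(injected_latency_s)
    return "ok"


def call_with_timeout(fn, timeout_s):
    """Run fn in a daemon thread; raise TimeoutError if it takes too long."""
    result = {}

    def target():
        result["value"] = fn()

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise TimeoutError(f"no response within {timeout_s} s")
    return result["value"]


def service_one(injected_latency_s):
    """Hypothetical service one: returns an error if service two is too slow."""
    try:
        return call_with_timeout(
            lambda: service_two(injected_latency_s), TIMEOUT_S
        )
    except TimeoutError:
        return "error: service two timed out"


def run_experiment(injected_latency_s, tolerance_s=0.25):
    """Inject the fault, time service one, and check the hypothesis."""
    start = time.monotonic()
    response = service_one(injected_latency_s)
    elapsed_s = time.monotonic() - start
    # Hypothesis: service one errors out in close to 300 ms, not after 5 s.
    hypothesis_holds = (
        response.startswith("error") and elapsed_s <= TIMEOUT_S + tolerance_s
    )
    return {
        "response": response,
        "elapsed_s": elapsed_s,
        "hypothesis_holds": hypothesis_holds,
    }


if __name__ == "__main__":
    outcome = run_experiment(injected_latency_s=5.0)
    print(outcome)
    if not outcome["hypothesis_holds"]:
        print("action item: check the timeout configuration and repeat")
```

In a real experiment the fault would be injected with a network-level tool and the verification would come from dashboards and alerts; the sketch only shows how hypothesis, fault, and verification fit together in one repeatable run.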
499 00:18:42,960 --> 00:18:46,320 and i've got a hypothetical scenario 500 00:18:44,799 --> 00:18:48,240 here um 501 00:18:46,320 --> 00:18:50,240 and i've got the hypothetical in quotes 502 00:18:48,240 --> 00:18:52,720 because it's based on something recently 503 00:18:50,240 --> 00:18:54,320 experienced so we get up on call we get 504 00:18:52,720 --> 00:18:55,919 a pager um 505 00:18:54,320 --> 00:18:57,760 and we see that okay we have some spam 506 00:18:55,919 --> 00:18:59,760 requests coming in and 507 00:18:57,760 --> 00:19:02,080 the on-call person jumps in 508 00:18:59,760 --> 00:19:05,360 they realize uh okay yeah let me just 509 00:19:02,080 --> 00:19:07,520 deactivate those accounts and then 510 00:19:05,360 --> 00:19:08,960 jane realizes that oh i don't have the 511 00:19:07,520 --> 00:19:11,919 right permissions let me get this other 512 00:19:08,960 --> 00:19:13,120 person and that person also doesn't have 513 00:19:11,919 --> 00:19:14,160 uh 514 00:19:13,120 --> 00:19:15,919 the 515 00:19:14,160 --> 00:19:19,520 right permissions and then eventually we 516 00:19:15,919 --> 00:19:21,039 get this person who has the right to 517 00:19:19,520 --> 00:19:22,840 access and then they start deactivating 518 00:19:21,039 --> 00:19:26,640 the accounts 519 00:19:22,840 --> 00:19:28,960 um so and that's why we need to treat 520 00:19:26,640 --> 00:19:30,960 our people and the steps where 521 00:19:28,960 --> 00:19:33,520 people are involved as systems as well 522 00:19:30,960 --> 00:19:36,799 we need to practice um we need to inject 523 00:19:33,520 --> 00:19:39,120 a bit of chaos into our people uh 524 00:19:36,799 --> 00:19:40,480 systems as well um 525 00:19:39,120 --> 00:19:42,960 there's this talk called chaos 526 00:19:40,480 --> 00:19:45,760 engineering for people systems um by 527 00:19:42,960 --> 00:19:47,919 dave rensin from google who talks about 528 00:19:45,760 --> 00:19:50,559 companies being distributed systems and 529 00:19:47,919 -->
00:19:52,960 and they share some ideas around how 530 00:19:50,559 --> 00:19:54,400 you could uh 531 00:19:52,960 --> 00:19:57,200 perform chaos engineering experiments 532 00:19:54,400 --> 00:20:00,799 for people so things like um you 533 00:19:57,200 --> 00:20:02,880 know do a staycation where 50% of the 534 00:20:00,799 --> 00:20:04,960 usual staff will 535 00:20:02,880 --> 00:20:07,360 just say okay no we are not present and 536 00:20:04,960 --> 00:20:10,080 you know let everybody else handle uh 537 00:20:07,360 --> 00:20:13,360 any failure any incident that comes up 538 00:20:10,080 --> 00:20:16,000 um the book release 8 chapter 17 539 00:20:13,360 --> 00:20:17,919 talks about zombie apocalypse uh you 540 00:20:16,000 --> 00:20:19,200 know simulating a zombie apocalypse 541 00:20:17,919 --> 00:20:21,919 where you'd 542 00:20:19,200 --> 00:20:24,080 again the same idea is make some of your 543 00:20:21,919 --> 00:20:26,640 people uh not 544 00:20:24,080 --> 00:20:28,080 play any role uh during an incident and 545 00:20:26,640 --> 00:20:29,679 see how it goes 546 00:20:28,080 --> 00:20:31,360 and that is why it's 547 00:20:29,679 --> 00:20:34,400 really useful to perform 548 00:20:31,360 --> 00:20:37,600 these kinds of experiments on uh a staging 549 00:20:34,400 --> 00:20:40,000 environment rather than production 550 00:20:37,600 --> 00:20:42,080 test your playbooks um 551 00:20:40,000 --> 00:20:43,760 test your team and treat it 552 00:20:42,080 --> 00:20:44,880 as a team building exercise like i 553 00:20:43,760 --> 00:20:46,880 have really 554 00:20:44,880 --> 00:20:48,799 found it very useful to get everybody 555 00:20:46,880 --> 00:20:50,400 together and run these experiments and 556 00:20:48,799 --> 00:20:52,240 they're all working with each other 557 00:20:50,400 --> 00:20:54,159 they're talking out loud 558 00:20:52,240 --> 00:20:55,760 discussing you know why this could be 559 00:20:54,159
--> 00:20:58,400 happening and you know what can be done 560 00:20:55,760 --> 00:21:01,120 about it and so on and so forth so it's 561 00:20:58,400 --> 00:21:03,200 really great for that too 562 00:21:01,120 --> 00:21:04,880 um so couple of key points to remember 563 00:21:03,200 --> 00:21:06,559 um 564 00:21:04,880 --> 00:21:08,640 let people know that you're running a 565 00:21:06,559 --> 00:21:10,320 chaos engineering experiment the last thing 566 00:21:08,640 --> 00:21:13,120 you want to happen is 567 00:21:10,320 --> 00:21:15,200 uh the people thinking or waking up 568 00:21:13,120 --> 00:21:16,400 being woken up at 3 a.m 569 00:21:15,200 --> 00:21:18,080 so that's another thing you should be 570 00:21:16,400 --> 00:21:20,159 mindful of the time zones of the key on-call 571 00:21:18,080 --> 00:21:21,600 people you don't want to be waking 572 00:21:20,159 --> 00:21:23,840 people up at 3am because you're running 573 00:21:21,600 --> 00:21:26,400 a chaos engineering experiment 574 00:21:23,840 --> 00:21:28,960 that still is from the movie war games 575 00:21:26,400 --> 00:21:31,440 yeah it was great 576 00:21:28,960 --> 00:21:32,480 um okay um 577 00:21:31,440 --> 00:21:33,760 in our 578 00:21:32,480 --> 00:21:36,480 next section we're going to talk 579 00:21:33,760 --> 00:21:37,679 about chaos engineering tools 580 00:21:36,480 --> 00:21:40,080 and what 581 00:21:37,679 --> 00:21:42,480 there is actually no dearth of tools um 582 00:21:40,080 --> 00:21:44,080 that you will find and 583 00:21:42,480 --> 00:21:45,440 and that is why what i'm going to talk 584 00:21:44,080 --> 00:21:47,360 about is only going to be a 585 00:21:45,440 --> 00:21:49,280 representative sample 586 00:21:47,360 --> 00:21:50,640 um so there's no shortage 587 00:21:49,280 --> 00:21:51,679 of tools in the chaos engineering 588 00:21:50,640 --> 00:21:54,400 community 589 00:21:51,679 --> 00:21:57,039 but you really need to understand 590 00:21:54,400 --> 00:21:59,760 most of these tools probably
are using 591 00:21:57,039 --> 00:22:01,039 these fundamental tools that i have found 592 00:21:59,760 --> 00:22:02,799 so far 593 00:22:01,039 --> 00:22:04,400 iptables 594 00:22:02,799 --> 00:22:06,320 it's like the duct tape of the internet 595 00:22:04,400 --> 00:22:08,799 i heard somebody say many many years 596 00:22:06,320 --> 00:22:10,880 back i think um 597 00:22:08,799 --> 00:22:11,600 that you will find 598 00:22:10,880 --> 00:22:13,200 like 599 00:22:11,600 --> 00:22:15,919 that being used 600 00:22:13,200 --> 00:22:18,559 in many tools underlying tools which 601 00:22:15,919 --> 00:22:20,400 give you a higher level layer tcpkill 602 00:22:18,559 --> 00:22:22,880 um which allows you to break tcp 603 00:22:20,400 --> 00:22:25,280 connections um and iptables allows you 604 00:22:22,880 --> 00:22:27,520 to introduce dns based sinkholes 605 00:22:25,280 --> 00:22:30,559 it allows you to introduce probabilistic 606 00:22:27,520 --> 00:22:32,799 failure scenarios and so on and so forth 607 00:22:30,559 --> 00:22:35,039 toxiproxy and tc 608 00:22:32,799 --> 00:22:37,520 which is traffic control they are super 609 00:22:35,039 --> 00:22:40,400 useful to add delay to network traffic 610 00:22:37,520 --> 00:22:43,039 so you can use these tools to 611 00:22:40,400 --> 00:22:45,280 intercept traffic and add 612 00:22:43,039 --> 00:22:48,240 arbitrary delays 613 00:22:45,280 --> 00:22:50,240 the third program is stress-ng 614 00:22:48,240 --> 00:22:52,880 which is 615 00:22:50,240 --> 00:22:55,840 really useful for simulating cpu memory 616 00:22:52,880 --> 00:22:55,840 and io stress 617 00:22:56,400 --> 00:22:59,280 um 618 00:22:57,360 --> 00:23:01,360 so i've got a 619 00:22:59,280 --> 00:23:03,679 demo um i'll not do it live i'll have a 620 00:23:01,360 --> 00:23:05,360 recording so i'll just show it but i 621 00:23:03,679 --> 00:23:08,640 have these 622 00:23:05,360 --> 00:23:10,320 two sample services service 1 service 2 623 00:23:08,640 --> 00:23:12,320 where
service 2 is talking to a mysql 624 00:23:10,320 --> 00:23:15,520 database and they're all running in 625 00:23:12,320 --> 00:23:16,640 docker containers that repo there has 626 00:23:15,520 --> 00:23:18,559 this 627 00:23:16,640 --> 00:23:20,320 you know complete demo that you can run 628 00:23:18,559 --> 00:23:22,640 on your own 629 00:23:20,320 --> 00:23:24,000 so what i'm going to do is 630 00:23:22,640 --> 00:23:24,320 i'm going to 631 00:23:24,000 --> 00:23:25,440 um 632 00:23:24,320 --> 00:23:29,039 [Music] 633 00:23:25,440 --> 00:23:29,039 play a video 634 00:23:32,720 --> 00:23:37,679 so i've got okay so i've got 635 00:23:35,760 --> 00:23:40,720 these services running and i'm going to 636 00:23:37,679 --> 00:23:43,440 run the curl command uh to make 637 00:23:40,720 --> 00:23:46,640 an http request to service one 638 00:23:43,440 --> 00:23:48,720 and as you can see here 639 00:23:46,640 --> 00:23:51,279 it gets a response 640 00:23:48,720 --> 00:23:53,120 there's this html sort of block that 641 00:23:51,279 --> 00:23:55,919 i've got here and it gets a response in 642 00:23:53,120 --> 00:23:58,799 about 0.6 seconds right okay great so 643 00:23:55,919 --> 00:24:01,200 that's a normal um 644 00:23:58,799 --> 00:24:05,120 default behavior 645 00:24:01,200 --> 00:24:08,720 and i just do it again just to sort of 646 00:24:05,120 --> 00:24:12,080 and then what i do is i inject the fault 647 00:24:08,720 --> 00:24:12,080 so i run this python script 648 00:24:15,360 --> 00:24:19,520 yep so my fault is injected so now i'm 649 00:24:17,600 --> 00:24:22,880 going to run this 650 00:24:19,520 --> 00:24:22,880 curl command again 651 00:24:25,440 --> 00:24:31,880 and you can see that it takes longer 652 00:24:28,640 --> 00:24:31,880 six seconds 653 00:24:34,080 --> 00:24:38,400 okay and now what i'm going to do is i'm 654 00:24:37,120 --> 00:24:39,760 going to 655 00:24:38,400 --> 00:24:42,320 um 656 00:24:39,760 --> 00:24:43,840 add this query parameter so
by the way 657 00:24:42,320 --> 00:24:45,919 this is a wrong query 658 00:24:43,840 --> 00:24:47,520 parameter that i added so that's my 659 00:24:45,919 --> 00:24:52,480 mistake so i'm going to add the right 660 00:24:47,520 --> 00:24:52,480 query parameter and then we'll see that 661 00:24:52,720 --> 00:24:57,840 yes so i've got this query parameter 662 00:24:55,440 --> 00:24:59,279 it's like a 663 00:24:57,840 --> 00:25:02,240 what it's going to do is it's going to 664 00:24:59,279 --> 00:25:02,240 add a timeout 665 00:25:03,279 --> 00:25:05,840 so now 666 00:25:10,080 --> 00:25:16,080 yeah so now you're going to see that we 667 00:25:12,720 --> 00:25:17,919 get the response within 0.3 seconds 668 00:25:16,080 --> 00:25:19,679 remember the latency injection is still 669 00:25:17,919 --> 00:25:22,400 there so the delay is still there 670 00:25:19,679 --> 00:25:25,279 but so we get a response but this is a 671 00:25:22,400 --> 00:25:27,919 500. that's because we have now enforced 672 00:25:25,279 --> 00:25:32,080 the timeout on service one and so that's 673 00:25:27,919 --> 00:25:32,080 how we get a 500 response 674 00:25:33,840 --> 00:25:43,840 yeah so that's the demo so let's go back 675 00:25:40,080 --> 00:25:43,840 to our presentation 676 00:25:44,000 --> 00:25:46,400 yeah 677 00:25:44,799 --> 00:25:48,880 so 678 00:25:46,400 --> 00:25:51,360 and then um yes so i talked about these 679 00:25:48,880 --> 00:25:53,760 fundamental tools and the inject latency 680 00:25:51,360 --> 00:25:55,360 script that i have that uses 681 00:25:53,760 --> 00:25:59,039 the tc 682 00:25:55,360 --> 00:26:01,039 command or program to inject the latency 683 00:25:59,039 --> 00:26:02,240 okay 684 00:26:01,039 --> 00:26:04,080 the next 685 00:26:02,240 --> 00:26:06,159 tool or software i'm going to talk about 686 00:26:04,080 --> 00:26:08,480 that is really useful is called the 687 00:26:06,159 --> 00:26:10,400 chaos toolkit and it allows you to 688 00:26:08,480 -->
00:26:12,080 inject various faults using external 689 00:26:10,400 --> 00:26:15,679 drivers so 690 00:26:12,080 --> 00:26:18,240 they have inbuilt drivers for aws gcp 691 00:26:15,679 --> 00:26:20,320 kubernetes toxiproxy and you can write 692 00:26:18,240 --> 00:26:22,080 your own driver using python which is 693 00:26:20,320 --> 00:26:24,960 great 694 00:26:22,080 --> 00:26:26,799 if your service is hosted on aws aws has 695 00:26:24,960 --> 00:26:29,840 a service called fault injection 696 00:26:26,799 --> 00:26:32,400 simulator and you can do things like rds 697 00:26:29,840 --> 00:26:34,320 failover you can inject errors into the 698 00:26:32,400 --> 00:26:36,400 api calls 699 00:26:34,320 --> 00:26:39,440 so for example you could add a delay to 700 00:26:36,400 --> 00:26:40,240 the auto scaling api calls and 701 00:26:39,440 --> 00:26:41,760 and 702 00:26:40,240 --> 00:26:43,760 you can think of the interesting 703 00:26:41,760 --> 00:26:45,600 experiments you can run using that what 704 00:26:43,760 --> 00:26:46,960 if your auto scaling activity is delayed 705 00:26:45,600 --> 00:26:48,720 by 706 00:26:46,960 --> 00:26:51,679 you know 20 seconds what happens to your 707 00:26:48,720 --> 00:26:52,640 application under load 708 00:26:51,679 --> 00:26:55,039 so 709 00:26:52,640 --> 00:26:57,279 okay great so summary 710 00:26:55,039 --> 00:26:59,030 so the whole point of chaos engineering 711 00:26:57,279 --> 00:27:00,480 and what it's trying to um 712 00:26:59,030 --> 00:27:04,000 [Music] 713 00:27:00,480 --> 00:27:05,840 uh like state is embrace the chaos it's 714 00:27:04,000 --> 00:27:08,080 really about assessing your readiness 715 00:27:05,840 --> 00:27:10,080 when your systems fail because it's 716 00:27:08,080 --> 00:27:12,240 not a matter of if your systems fail 717 00:27:10,080 --> 00:27:15,200 it's a matter of when 718 00:27:12,240 --> 00:27:17,120 and often it happens that known 719 00:27:15,200 --> 00:27:19,039 failure scenarios may 720
00:27:17,120 --> 00:27:20,799 come together to teach you about unknown 721 00:27:19,039 --> 00:27:23,200 failure scenarios 722 00:27:20,799 --> 00:27:25,039 emergent properties 723 00:27:23,200 --> 00:27:27,200 because 724 00:27:25,039 --> 00:27:28,799 two things you know may add up to 725 00:27:27,200 --> 00:27:30,320 something completely different that you 726 00:27:28,799 --> 00:27:32,159 didn't even expect 727 00:27:30,320 --> 00:27:34,399 and that's what we want to get ahead of 728 00:27:32,159 --> 00:27:35,679 almost by simulating this using chaos engineering 729 00:27:34,399 --> 00:27:38,799 experiments 730 00:27:35,679 --> 00:27:41,200 um there's this article uh on coding 731 00:27:38,799 --> 00:27:43,440 horror called working with chaos monkey 732 00:27:41,200 --> 00:27:45,679 and there's this quote from that article 733 00:27:43,440 --> 00:27:47,200 um so it begins with this sometimes you 734 00:27:45,679 --> 00:27:50,240 don't get a choice the chaos monkey 735 00:27:47,200 --> 00:27:52,480 chooses you and it ends with 736 00:27:50,240 --> 00:27:54,240 this which is that's why even though it 737 00:27:52,480 --> 00:27:56,960 sounds crazy the best way to avoid 738 00:27:54,240 --> 00:28:00,480 failure is to fail constantly 739 00:27:56,960 --> 00:28:00,480 and that is very profound 740 00:28:01,679 --> 00:28:04,480 so 741 00:28:02,480 --> 00:28:06,480 there are quite a number of 742 00:28:04,480 --> 00:28:09,600 resources on chaos engineering 743 00:28:06,480 --> 00:28:12,880 um and i have got three resources here 744 00:28:09,600 --> 00:28:14,240 because i find these are the 745 00:28:12,880 --> 00:28:15,679 um 746 00:28:14,240 --> 00:28:18,320 like these are the three resources which 747 00:28:15,679 --> 00:28:20,399 really drill down or at least the 748 00:28:18,320 --> 00:28:22,799 first two resources they really talk about 749 00:28:20,399 --> 00:28:25,120 the principles they talk about the ideas 750 00:28:22,799 -->
they are not about the tools uh which 751 00:28:25,120 --> 00:28:30,320 sometimes the software community um 752 00:28:27,919 --> 00:28:32,720 cannot get enough of um so those are the 753 00:28:30,320 --> 00:28:34,240 first two resources so if you are 754 00:28:32,720 --> 00:28:36,000 getting started with chaos engineering 755 00:28:34,240 --> 00:28:37,200 if you wanted to learn about chaos 756 00:28:36,000 --> 00:28:39,840 engineering and the ideas and the 757 00:28:37,200 --> 00:28:42,559 principles um the first two resources are 758 00:28:39,840 --> 00:28:44,880 the key the third book is great for 759 00:28:42,559 --> 00:28:47,919 hands-on learning the author has done a 760 00:28:44,880 --> 00:28:49,360 great job of using you know basic 761 00:28:47,919 --> 00:28:52,240 tools which you can run in your local 762 00:28:49,360 --> 00:28:53,760 system to just play and familiarize 763 00:28:52,240 --> 00:28:56,159 yourself with the ideas of chaos 764 00:28:53,760 --> 00:28:56,159 engineering 765 00:28:56,960 --> 00:29:02,720 great thanks everybody who chose my 766 00:28:59,600 --> 00:29:04,000 talk today um you know uh over the other 767 00:29:02,720 --> 00:29:06,799 tracks 768 00:29:04,000 --> 00:29:09,440 um thanks to all the organizers who have 769 00:29:06,799 --> 00:29:12,480 pulled off this conference in these 770 00:29:09,440 --> 00:29:14,080 circumstances it's just that's great 771 00:29:12,480 --> 00:29:15,600 um thanks to all the next day video 772 00:29:14,080 --> 00:29:17,679 folks i have had a chance to interact 773 00:29:15,600 --> 00:29:20,240 with they have been super helpful super 774 00:29:17,679 --> 00:29:22,720 you know friendly super nice people 775 00:29:20,240 --> 00:29:25,039 um i am contactable at echorand on 776 00:29:22,720 --> 00:29:28,240 twitter you can email me i've got the 777 00:29:25,039 --> 00:29:31,120 talk materials on my github repo 778 00:29:28,240 --> 00:29:33,520 and i have a website at echorand dot me um 779
00:29:31,120 --> 00:29:35,440 i will not take questions now but i'm 780 00:29:33,520 --> 00:29:38,080 happy to answer any um 781 00:29:35,440 --> 00:29:42,159 in the chat or you know later on 782 00:29:38,080 --> 00:29:42,159 um so great thank you everyone 783 00:29:43,919 --> 00:29:49,039 awesome thanks amit that was 784 00:29:46,159 --> 00:29:50,720 amazing um lots of really really cool 785 00:29:49,039 --> 00:29:52,640 resources in there um there's a few 786 00:29:50,720 --> 00:29:54,640 there that i haven't seen so i'm keen to 787 00:29:52,640 --> 00:29:56,720 go uh and look those up this kind 788 00:29:54,640 --> 00:29:59,440 of stuff of dealing with failure by 789 00:29:56,720 --> 00:30:01,360 embracing the chaos as you put it uh 790 00:29:59,440 --> 00:30:04,320 is close to my heart so i look forward 791 00:30:01,360 --> 00:30:06,799 to digging into those uh yes so as amit 792 00:30:04,320 --> 00:30:08,000 said we're right up on time so any 793 00:30:06,799 --> 00:30:10,240 questions that are there and there 794 00:30:08,000 --> 00:30:11,840 are some there already for you amit uh 795 00:30:10,240 --> 00:30:13,200 if you've got any more just raise them 796 00:30:11,840 --> 00:30:15,440 in the chat or chuck them into the 797 00:30:13,200 --> 00:30:18,480 questions amit will be around to answer 798 00:30:15,440 --> 00:30:20,240 them and with that we're now on a break 799 00:30:18,480 --> 00:30:22,960 for lunch so 800 00:30:20,240 --> 00:30:25,760 do get up leave the computer have some 801 00:30:22,960 --> 00:30:28,640 water um after the questions of course 802 00:30:25,760 --> 00:30:31,200 um grab some food um re-energize 803 00:30:28,640 --> 00:30:33,600 yourselves uh we are back as a 804 00:30:31,200 --> 00:30:35,840 conference we're back at about uh let me 805 00:30:33,600 --> 00:30:38,799 check my notes i think it's about 1 30 806 00:30:35,840 --> 00:30:40,640 yes lunch lunch break is until 1 30.
um 807 00:30:38,799 --> 00:30:42,720 this track doesn't have a talk until 2 808 00:30:40,640 --> 00:30:44,320 15 though so you will have a chance to 809 00:30:42,720 --> 00:30:46,399 go and explore some of the other tracks 810 00:30:44,320 --> 00:30:50,240 and i encourage you to do so but be back 811 00:30:46,399 --> 00:30:52,240 here at 2 15 p.m uh when we have a talk 812 00:30:50,240 --> 00:30:54,240 by peter chu 813 00:30:52,240 --> 00:30:56,240 things might go wrong in a data 814 00:30:54,240 --> 00:30:58,799 intensive application 815 00:30:56,240 --> 00:31:00,799 uh until then enjoy lunch enjoy the 816 00:30:58,799 --> 00:31:04,039 questions enjoy the conference and we'll 817 00:31:00,799 --> 00:31:04,039 see you back 818 00:31:08,559 --> 00:31:10,640 you
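The recorded demo in this talk injects latency between the two services and then enforces a timeout on service 1, so that the caller gets a fast 500 instead of a slow success. A minimal Python sketch of that behaviour, under stated assumptions, might look like the following. It is not the speaker's demo code: the real demo is described as using tc against Docker containers, whereas here the injected delay is simulated with a sleep, and slow_backend and service_one are hypothetical names.

```python
import concurrent.futures
import time


def slow_backend(delay_s: float):
    """Stand-in for service 2 behind an injected network delay (simulated)."""
    time.sleep(delay_s)
    return 200, "hello from service 2"


def service_one(delay_s: float, timeout_s=None):
    """Call the backend; if a timeout is set and exceeded, report a 500."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_backend, delay_s)
    try:
        # With timeout_s=None this waits for as long as the backend takes.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The delay is still there; we just stop waiting for it.
        return 500, "backend timed out"
    finally:
        # Do not block the caller on the still-sleeping worker thread.
        pool.shutdown(wait=False)
```

Calling service_one(0.3) waits out the whole delay and returns a 200, while service_one(0.3, timeout_s=0.05) returns a 500 after roughly the timeout, which mirrors the trade-off shown in the demo: the fault is unchanged, but the caller fails fast.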