And welcome back, everyone. We have our next talk right now by Yakov, which is called Making Swiss Cheese of Ticket Sales. As we were just saying before in the green room, clearly the abstract has done its job, because lots of people are interested in this talk, including me. So without further ado: Yakov, take it away.

Thank you. Hello everyone, and Shana Tova, happy new year, and a special hello to our live chat. I'm Yakov, and in my day job I'm a senior software developer at WiseTech Global, where we build software that acts as the operating system for global logistics. At other times I swap the shipping containers for Docker containers and act as the systems administrator for SMASH, the Sydney Manga and Anime Show, where we host an annual convention for all things anime and manga, at least in years where there isn't an active pandemic.

Like most conventions, we sell tickets to the event. If you want to get in you need to buy a ticket, a ticket gets you a badge, and a badge gets you in the door. We sell tickets in two ways: at the door on the day of the event, or online beforehand at a slight discount. And this year, online sales were quite a bit harder than that might sound.

We've done this many times before; on our current software stack we've done it twice. In 2019 we sold tickets, redeemed tickets and had a whole convention as usual. In 2020 we sold tickets and then had to postpone the convention, but our launch of ticket sales did go off without a hitch.

This year, for our 2021 flagship event, we opened ticket pre-sales at 10 am on Sunday the 2nd of May 2021. That morning started for me pretty much like any other: I got up, I got dressed, I did the morning prayers (perhaps I should have prayed a little bit harder that everything here would go to plan), and then I
logged onto Slack and onto our ticketing website, just to keep an eye on things and watch the ticket sales start to roll in.

As the clock ticked over to 10 am, I refreshed the page, expecting to see our list of tickets for sale go from nothing to our three regular online items plus the early bird special that was available at the time. Instead, all I got was a loading spinner, and reports started coming in that it wasn't just me. Two minutes later, the main SMASH WordPress website was down too. By ten past we were in full battle-stations mode, trying to figure out what on earth was bringing our websites down. Some users even reported seeing SQL-related errors on the front end, and that's usually never a good sign.

Forty minutes later, at 10:50 am, we thought we had identified the problem and deployed a mitigation. It didn't help. Eventually, at 11:40 am, we decided to pull the plug, having only managed to sell one single ticket in our first hundred minutes. Over the next 20 minutes we discovered that our panic button didn't quite work in the current circumstances, and we had to manually push out a new build of the website with that panic mode hard-coded to on.

And yet, somehow, ticket sales started flowing through from that point onwards. We spent the rest of the afternoon identifying inconsistencies in the data from those sales, such as users who had paid but whose payment hadn't been recorded in the system, or users who had been double or triple charged. In fact, one order had been charged 12 times over, the poor fellow. We added the missing links into the data, issued appropriate refunds, and I went to bed that night very confused.

So what on earth went wrong here? Before we get to that, let's set the scene.

SMASH has been running for over 10 years. I first found out about it in
2018, and shortly afterwards joined staff as the new systems administrator and second-in-command of IT. For the IT team that was a very transitional year, with the existing staff leaving and handing over to us newcomers everything that they had built. The old systems were a mixture of appliances, frameworks, languages and some strange, disparate microservices; the ticketing system alone was four or five different services, some written in Ruby and others in Go. It was designed many years earlier, for different requirements and constraints, for a very different environment than what we now faced and where we wanted to go with the IT systems. For example, it was designed to run ticket redemption (that's where you scan everyone in at the convention centre) completely offline. But this meant we would never be able to do online ticket sales during the event, and it made it very hard to try and set up mobile handheld scanners that would work outside over 4G or LTE.

Among the many changes we made was, with the blessing of the old guard, building a brand new ticketing system. We couldn't find any existing open source or low-cost solution that would do what we needed, so again we wrote our own from scratch. SMASH is entirely driven by volunteers, so we needed a stack that volunteers would be able to get the hang of quickly, and we chose a JavaScript and TypeScript stack: React for the front end, Node.js and Express for the back end, and MySQL for data persistence, since we already had MySQL infrastructure for our WordPress sites, including things like automatic backups, and I didn't want to have to manage two database stacks.

To host the sites we decided to use Kubernetes, mostly because I wanted to learn Kubernetes, but also because it made a bunch of things really easy (words that may
never be said again as it's gotten more complex), such as pushing out updates to the software, managing secrets, and automatically managing nginx and Let's Encrypt certificates. It let us establish a really solid infrastructure-as-code foundation, and in the long run it has been a lot easier than manually configuring a handful of VPS boxes.

So this is our rough architecture for the ticketing system. Requests to /api go to the Node.js backend, which talks to a MySQL instance, which is in turn backed by a persistent disk. All other requests go to a tiny nginx container that serves up the static front end. If we add in the WordPress site and then account for replicas, this is basically what our setup looked like at the time.

This setup served us nicely through 2019. In fact, we built the system all through 2019 in a very just-in-time approach; you might even call it agile. For example, when we launched ticket sales that first year we hadn't finished the code to generate PDFs for tickets or receipts, so we later had to backfill all those orders by generating the PDFs and sending out the emails once the code was ready. Another example is the ticket redemption code, which was only finalised and deployed to production the night before exhibitor setup day, when we had to scan in all the exhibitors. But it worked, and it held, and we ran a convention that weekend, including a concert, with almost no technical issues.

Throughout 2020 and early 2021 we built some new features, refactored the code, updated npm packages and had our very short-lived 2020 ticket sales, but there were no major changes from the 2019 code base or architecture. So how come, in 2021, this system fell flat on its face?

There are many different models that you can use to analyse risk and accidents. The
Swiss cheese model is used in the aviation industry, and it's one of my favourites because it is so often accurate. The model works roughly like this: for any event that you don't want to happen, there are a bunch of different barriers that contribute towards protection. Each of them is ultimately fallible, but in different ways. For that event to occur, multiple barriers have to be individually breached; if any single one of those barriers holds up, then the event, usually some kind of accident or disaster, is still prevented. The more barriers that were breached, the closer you've come to total disaster. Over at the Snake Oil Academy track they would probably call this defence in depth, but here we're not trying to fend off an attacker, we're just trying to protect ourselves from failure.

For a worked example, let's look briefly at US Airways Flight 1549, the infamous Hudson River ditching. Big content warning on this one: if you're not comfortable, come back in about two minutes. For the rest of you, I'm going to assume that by now you've heard about what happened, because it's pretty famous at this point. And that whole series of events really should not have occurred.

For a start, the aircraft encountered birds (technically geese) on takeoff. Engines are designed to withstand ingesting birds; in fact, during testing they fire frozen birds at engines from a cannon. According to the official NTSB investigation report, though, if a bird enters the engine near the outer edges it will most likely strike only the fan blades, continue along the bypass duct and get ejected out the back, probably dead, possibly cooked. But because it hit the inner part of the engine, it got ingested into the core.

Even then, a twin-engine aircraft like that can still fly, including a full
takeoff, with only one working engine. We have some basic redundancy; that's great. But unfortunately, in this case, the birds took out both engines. Even then, an aircraft can still glide for quite some time, though if you ask a pilot they will probably call it falling slowly rather than gliding. Regardless, they hadn't gotten high enough yet, so they didn't have very far to glide, and definitely not far enough to reach a runway. And so on and so forth: different layers of protection, redundancy or resiliency, each one breached, until they had no choice but to ditch into the Hudson River.

Fortunately, in this case there were still layers of protection that held up. The pilots used their training to ditch the aircraft without crashing and destroying it. The flight crew evacuated everyone before the plane sank. And the plane was equipped with life rafts, so people actually had somewhere to go. In this case it wasn't a total disaster, and a good amount of those cheese layers still held up.

But in the days that followed the 2nd of May 2021, I was strongly reminded of this model as we worked through the different problems in our ticketing system, none of which alone could be pinpointed as the singular cause of our failed ticket sales launch, but all of which could have helped prevent or mitigate it.

The first issue that we managed to identify and fix was that the main WordPress site went down. It shared a database server with the ticketing system, and we hadn't had a single problem with it in previous years, even on the day of the convention with people hammering the schedule. All we could see was the generic error message "Error establishing a database connection". I suspected this was due to a mixture of increased load on the main site and on the ticketing site at the same time.
I went digging and quickly discovered that the default connection limit in MySQL is 151 concurrent connections. I have no idea why they picked that number; perhaps whoever came up with it was a fan of the original Pokémon games. In any event, we increased the connection limit to 500, and that particular problem went away.

Unfortunately, we were still seeing errors on the ticketing system. The most common one we saw in the server logs was a Knex timeout error: "Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?"

The way most database client software works is that it doesn't create a new SQL connection every time it needs one; these get pooled. When you request one, you either take one from the pool or get handed back a brand new one; that's up to the pool to decide. When you're done, you hand it back to the pool, and it either scrubs it for reuse later or drops it entirely; again, that's up to the pool to decide. In this case I believe the error message is technically wrong, and it really means the pool is probably empty: all the connections it has have already been handed out, and it's not going to create another new one for you.

The first thing we did was go looking for any places in the code where we were requesting a connection but not handing it back, and we found exactly zero instances of that. The code was doing the right thing. So the next question was: how big is this pool, exactly? We did some initial digging and found, buried away in the documentation, that under the default settings the pool has a minimum of two connections available and a maximum of ten.
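For context, here is a minimal sketch of what a Knex connection pool configuration like ours looks like. The host and credentials are placeholders rather than our real infrastructure, but the pool numbers are the documented defaults we were unknowingly running with:

```typescript
import knex from "knex";

// Minimal sketch only: connection details are placeholders, not our real
// infrastructure. The pool section shows Knex's documented defaults.
const db = knex({
  client: "mysql",
  connection: {
    host: "mysql.example.internal",
    user: "tickets",
    password: process.env.DB_PASSWORD,
    database: "tickets",
  },
  pool: {
    min: 2,  // default minimum
    max: 10, // default maximum, and the number we ended up raising
  },
});

export default db;
```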
We had two servers and hundreds of users, so we likely needed far more than 20 concurrent connections. We increased the connection pool limits, deployed it to production, and... still no improvement.

At this point we decided to push the panic button and put the site into maintenance mode. This is a special mode where users can't really do anything, and it's designed for shutting down the site to perform maintenance, such as moving between servers, rebuilding the cluster, or some kind of urgent database operation. In our case we just wanted to pull the plug so we could start seeing what was actually happening. Unfortunately, the site was under such heavy load that we couldn't even get into the admin settings to save our changes. I manually flipped the switch in the database table, only to then discover that the website couldn't even load the setting's value, because the site was crawling so slowly. We ended up going into the code base, changing the client-side isMaintenanceMode function to simply return true, and deploying that straight to production.
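That emergency hack looked roughly like this. It's a hedged sketch with illustrative names (the settings endpoint and response shape are hypothetical); normally the flag comes from the server rather than being hard-coded:

```typescript
// Front-end "panic button", as deployed that morning. Normally this asks the
// API whether maintenance mode is switched on, but under load that request
// never came back, so we shipped a build with the flag hard-coded to true.
export async function isMaintenanceMode(): Promise<boolean> {
  return true;
  // Normal behaviour, roughly:
  // const res = await fetch("/api/settings/maintenance-mode");
  // return (await res.json()).enabled === true;
}
```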
So by now we should have had no users and been able to begin a thorough investigation. But strangely, this is when orders kept coming into the system. These were hitting the payment processor and our emailing provider, and now that we had no new users coming through, we were actually, somehow, selling tickets. The system was slowly working through a backlog. I suspected this was a backlog of users who had already started the ticket flow, or who were spamming the refresh button; towards the end of the flow the isMaintenanceMode function is no longer checked.

But as these orders came through, we started getting complaints to customer service that some customers had been double charged. We spent some time looking through the orders and cross-referencing them with the payment provider, and discovered two surprising things.

One: some customers had indeed been double charged for their order, as they must have been submitting the order form multiple times and retrying on error. We refunded the extra orders and cancelled the duplicate tickets. We hypothesised, based on the screenshots that had been sent in, that what must have happened is this: customers submitted their order, which took so long that the nginx ingress controller, which was running on basically default settings, gave up after 60 seconds and returned an HTTP 504 Gateway Timeout error. But Node.js and Express continued handling the request and ultimately fulfilled the order. The user never saw that, so when they saw the 504 error they simply resubmitted the order, not realising that they were now submitting a second order. Or a third. Or, for that one poor person, a twelfth.
To fix this, we increased the proxy read timeout to a larger value than the application server would ever take, so that the application would time out or fail before nginx gave up on the request. That should also help to reduce our load, because if each user is making multiple requests then our effective user count, and therefore our load, is several times what it should be.

But the even stranger, surprising thing number two was that some customers experienced the exact opposite case: they had been charged for their order, but they had no corresponding ticket in our system. There is logically no way this should be able to happen, because of the way the code was structured: we create a payment record, then we call out to the payment processor's API, and then we update the payment record with the result of the API call. There should be a payment record there at all times.

We went digging through the code and discovered a horrible mistake: the code that created the payment record and the code that called out to the payment processing API were both in the same database transaction. If the container gets shut down in the middle, for example when a new image is deployed, then the payment record never actually gets saved to the database. This could also explain our problem overall: if the payment processor takes a significant amount of time, then we hold our transactions open and exhaust the connection pool.
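To illustrate the shape of that bug and of the fix we deployed, here is a hedged sketch. The table names, columns and the paymentProcessor client are hypothetical stand-ins, not our actual code:

```typescript
import db from "./db"; // the Knex instance from the earlier sketch

// Hypothetical stand-in for the real payment processor SDK.
declare const paymentProcessor: {
  charge(orderId: number, amountCents: number): Promise<{ ok: boolean; reference: string }>;
};

// Before: the slow external API call sits inside the same transaction that
// creates the payment record. The record only exists once the transaction
// commits, so a container shutdown mid-call loses it entirely, and the open
// transaction holds a pooled connection the whole time.
async function takePaymentBroken(orderId: number, amountCents: number) {
  await db.transaction(async (trx) => {
    const [paymentId] = await trx("payments").insert({
      order_id: orderId, amount_cents: amountCents, status: "pending",
    });
    const result = await paymentProcessor.charge(orderId, amountCents);
    await trx("payments").where({ id: paymentId }).update({
      status: result.ok ? "paid" : "failed", processor_ref: result.reference,
    });
  });
}

// After: write the pending record and release the connection first, then call
// the processor, then record the outcome as a separate short write.
async function takePayment(orderId: number, amountCents: number) {
  const [paymentId] = await db("payments").insert({
    order_id: orderId, amount_cents: amountCents, status: "pending",
  });
  const result = await paymentProcessor.charge(orderId, amountCents);
  await db("payments").where({ id: paymentId }).update({
    status: result.ok ? "paid" : "failed", processor_ref: result.reference,
  });
}
```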
That's it, there's our smoking gun, everyone can go home. Clearly not: there's still 15 minutes left in this talk. Since we no longer had any real customers online at this point, we pushed the fix to staging and ran a synthetic load test talking to the payment processor's testing API, to ensure that we didn't spend or take any real money. Waiting with bated breath, we ran our test script, and it still failed spectacularly. Whilst this was clearly a problem, it couldn't be the problem.

So at this point we were pretty much lost. We knew that we had a problem; we knew that it was likely related to the SQL database, since we had WordPress and Knex both complaining about SQL connections; and we knew that it was induced by a massive load of customers trying to buy tickets. But we had no clue what was really wrong.

As far as diagnosing and optimising SQL goes, this is something I've seen many times before at work, and when I get stuck there we have some very good DBAs I can always turn to for help. But I didn't have that here, and I couldn't even turn to my usual tools. At work we rely on Microsoft SQL Server, and the best tool by far for an initial investigation is SQL Server Profiler. This is a wonderful tool that can show you, live, every SQL command being executed: who ran it, how long it took, what machine it came from, and so on. Normally I would look there for long-running queries or blocked queries and start to unravel the mess. But I could not find any equivalent tooling for MySQL.

The only option left was to instrument the application itself. Fortunately (and, well, it's also what got us into this mess) this is our own application and not a third-party proprietary one, so we could do this fairly easily. We did two things here. One: we added a warning if a transaction takes too long to run, with "too long" being an arbitrary amount of time. Two: we ran the application in development mode with an environment variable set so that the ORM would print out every SQL query as it ran it. We ran a few test orders in development and sat down to look through those queries.
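Here is a minimal sketch of the first of those, assuming the Knex setup from the earlier sketch; the threshold and label are arbitrary. For the second, we relied on the ORM's own debug logging, which in Knex's case is switched on with an environment variable (something along the lines of DEBUG=knex:query):

```typescript
import type { Knex } from "knex";
import db from "./db"; // the Knex instance from the earlier sketch

// Arbitrary threshold: anything holding a transaction open longer than this
// gets logged so we can spot it in the server logs.
const SLOW_TRANSACTION_MS = 2000;

export async function tracedTransaction<T>(
  label: string,
  work: (trx: Knex.Transaction) => Promise<T>
): Promise<T> {
  const started = Date.now();
  try {
    return await db.transaction(work);
  } finally {
    const elapsed = Date.now() - started;
    if (elapsed > SLOW_TRANSACTION_MS) {
      console.warn(`transaction "${label}" was held open for ${elapsed}ms`);
    }
  }
}
```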
We noticed a few interesting things.

Firstly, each HTTP request to order some tickets had about five or six transactions running. This greatly increases the chance of failure when the connection pool is exhausted: once a request gets that first connection, there is no guarantee the pool will give it the second one it needs, or the third, or the fourth, and so on. We consolidated some of these transactions together and managed to bring it down to three.

Secondly, we noticed a lot of extra queries that we didn't need. Often, when we created or updated a row, the ORM would subsequently select that row again, but we never did anything with the new result. We changed a few method calls and those select queries disappeared.

Thirdly, we noticed a few of the queries were SELECT ... FOR UPDATE, which locks the row. This was by design, as we keep track of how many sales each item or bundle has. Some definitions: an item is an individual thing, such as a Sunday pass; a bundle is a collection of items, such as a weekend ticket, which would contain both a Saturday pass and a Sunday pass. This is so we can limit certain purchases. For example, our discounted early bird bundle was limited to the first 500 people, and in 2019, when we held the anisong concert, the hall only had a certain number of seats, so we couldn't oversell it or some people would have had nowhere to sit. This year we also have things like square-metre-per-person capacity limits to worry about, but we hadn't programmed any of that into the system.

The row locking is only supposed to be very brief: we lock the rows only for a very small period while we create the pending order and increment the counts for the relevant items, and if the order processing fails we lock again briefly and decrement the counts. We don't keep anything locked for a long time.
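For a picture of what that looks like, here is a minimal sketch of the reservation step; the table and column names are illustrative rather than our actual schema:

```typescript
import db from "./db"; // the Knex instance from the earlier sketch

export async function reserveItem(itemId: number, quantity: number) {
  await db.transaction(async (trx) => {
    // SELECT ... FOR UPDATE: the item row stays locked until this
    // transaction commits or rolls back.
    const item = await trx("items").where({ id: itemId }).forUpdate().first();
    if (!item || item.sold + quantity > item.sales_limit) {
      throw new Error("item sold out");
    }
    await trx("items").where({ id: itemId }).increment("sold", quantity);
  });
}
```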
But if we did, this would also explain a large amount of the performance problems. With the row locked during an order, somebody loading the home page can't even load the list of things available for sale, as we can't determine whether something is out of stock while it's locked. Technically we don't even know if it exists, since the definition of the item and the quantity are stored in the same table, in the same row.

Since we didn't have hard limits this year, we checked with the executive team to see if we needed to stick to the hard-limit spec we had in 2019, or if we could disable the row locks and maybe just go over a little bit. They gave us the go-ahead, we disabled the row locking, we ran the test script again, and this time it actually helped a little bit. But each request was still far too slow, and after a minute everything started timing out again.

Now, around this time, while the team was digging through SQL, I decided to look at a different problem. Something we had noticed while rebooting containers and deploying new ones during the ticket launch was that the server-side container image was taking about three to five minutes from the time the container started until the time it started listening for HTTP requests on its TCP port. I roped in a JavaScript expert from work to give me the rundown on how to do some basic performance analysis and debugging with Node.js, and he showed me how to connect the Chrome developer tools to a Node.js application. From there I extrapolated a bit, and by exposing the correct port number through Docker I managed to attach the Chrome DevTools to Node.js inside the Docker image.

Strangely, while stepping through it on my MacBook, the performance problem just seemed to be in importing types. So I did
some chin scratching, I did some thinking, I did a lot of googling, and I discovered there's an open issue on GitHub that Docker for Mac is apparently just really, really slow at file system operations. Clearly that's not our production problem, as Azure Kubernetes does not run on macOS.

So the next step I took to debug this was to create a new Linux virtual machine. I used Ubuntu, because it's an ancient African word for "slightly easier than Debian". I installed that on my Windows workstation PC, installed Docker in there, and started to debug from there. When running Docker on Linux, our container started up pretty much instantly and listened on port 3000 immediately: no performance problem to be seen whatsoever.

But one of the main selling points of containerisation is that you get perfect reproducibility, since you have the same container image on all your systems. The only things that differ are the hardware performance characteristics, the network, and the underlying Linux kernel, and I didn't think any of those were relevant here. So I went digging through our Kubernetes YAML for any major differences and came across this fun block that I'd written a long time ago and mostly forgotten. Back after the 2019 convention, while we were downscaling everything to minimal business-as-usual operations, the YAML linter in Visual Studio Code suggested adding resource limits, and after reading through the docs I thought, well, that sounds like a really, really good idea; resource limits sound good. So I added some limits based on the peak usage we'd observed during the convention that year.

I re-ran the container image on my Ubuntu VM, but this time limited the CPU to just a quarter of a CPU core. And would you believe it: the image took three minutes to start up.
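For reference, the limits block in question looked roughly like this. This is a hedged reconstruction rather than our exact manifest; the memory figures are placeholders, but the quarter-of-a-core CPU limit is the part that mattered:

```yaml
# Hedged reconstruction of the container resources block, not the exact
# manifest. "250m" means a quarter of one CPU core.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 512Mi
```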
This immediately threw a new light on the whole situation. Was it possible that our SQL was perfectly fine after all, and we were just CPU-constrained? After all, during the 2019 ticket sales we hadn't yet written the PDF generation code, and the 2020 ticket sales were quite lacklustre due to the looming pandemic and just general uncertainty.

We commented out the function call that generates the ticket PDF and attaches it to the email, we pushed the new code to staging only, we re-ran the load test script, and bang: all our requests were handled lightning fast, no performance problems whatsoever.

Unfortunately, we can't actually ship that change, as people need their ticket PDFs if they're going to attend the convention; that's the whole point of buying the ticket. And increasing the amount of CPU allocated to the container isn't hugely beneficial either. We could go from 25% up to 100% and handle maybe four times as many requests, if it scales linearly, but anything more than a single vCPU core won't help, as Node.js is famously single-threaded. It's not going to farm out different incoming requests to different CPU cores by itself. In fact, if you google "how do I make Node.js use more CPU", the top Stack Overflow answer is "run multiple copies of it, duh". So we would somehow need to offload the PDF generation another way.
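Our actual plan was to push that work out of the web process entirely (more on that shortly), but to illustrate the general idea of getting CPU-heavy work off Node's single main thread, here is a hedged sketch using Node's built-in worker_threads module. The names are illustrative and this is not the code we shipped:

```typescript
// Single-file sketch: render ticket PDFs on a worker thread so the CPU-heavy
// work doesn't block the main event loop.
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

interface TicketOrder {
  id: number;
  attendeeName: string;
}

// Stand-in for the real, expensive PDF rendering.
function renderTicketPdf(order: TicketOrder): Buffer {
  return Buffer.from(`PDF for order ${order.id} (${order.attendeeName})`);
}

// Main-thread side: called from the request handler; spawns a worker and
// resolves with the rendered PDF bytes.
export function generateTicketPdf(order: TicketOrder): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData: order });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

// Worker-thread side: do the work and post the result back.
if (!isMainThread && parentPort) {
  parentPort.postMessage(renderTicketPdf(workerData as TicketOrder));
}
```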
Now that we had figured out the technical side, we still had the human side to understand: why did we have so many users trying to buy tickets? We had never seen this level of demand before. I don't know for certain, but I have two hypotheses.

The first is COVID. In 2020 there were almost no events, due to the pandemic and the lockdowns: there was no SMASH, no Supanova, no Oz Comic-Con and no PAX Australia. It's quite possible that people were just hungry for their cultural events after a year of deprivation.

The second hypothesis is marketing. Our early bird special was limited to 500 tickets, but it seems at least a few people misinterpreted the social media posts and thought there were only 500 tickets to the event in total. Some critical thinking might suggest that downscaling the event that much isn't really financially feasible, but you can still get up early, smash that F5 button, and be one of those lucky people.

So now, with all the information gleaned from our investigation and the fixes we put into place, we can explain the original events of the 2nd of May in their entirety. Due to a huge amount of demand after a year of cancelled events, possibly fed by marketing miscommunication, we had way more users trying to order tickets than in previous years. As the first orders started coming through, they held open SQL transactions, which prevented subsequent requests, or the continuations of existing requests, from obtaining the connection they needed. When we restarted and deployed new backend images, this interrupted a couple of requests inside the dodgy payments transaction, so those payments were never recorded. PDF generation used up the full CPU allocation, so incoming requests were added to the backlog, and the server did not have enough CPU to process it, definitely not at 25% of a CPU, with three out of four hands tied behind its back. And so, after 60 seconds, nginx gave up on each request while Node.js continued to process it; when users got Gateway Timeout errors they retried, which only added further to the backlog. After a certain point, new requests would always time out before they could be handled by Node.js, since the backlog was so large.

None of these alone can be pinpointed as the cause
of the outage, but our cheese had some pretty big holes.

This all went down on the Sunday, and the investigation ran for several more days, but we needed to have ticket sales operational again by Thursday. For Thursday we made several key changes to do our best to avoid a repeat of the crash.

On the application side, we had fixed the open transactions and consolidated a few more requests into fewer transactions, we removed the SQL queries from the code that we didn't need, we disabled the row locking, and we also increased the connection pool size. We also planned to offload PDF generation to some Azure Functions (you AWS folk, think Lambdas), but obviously we wouldn't have that ready by Thursday.

On the operations side, we had increased the MySQL connection limit, we increased the CPU limits for the server containers from 25% of a CPU core to 95% of a CPU core, and we added an additional server, for a total of almost six times as much CPU capacity as previously.

The last change we made was to human factors. Instead of hyping everyone up again for a rush on limited ticket sales, we changed the terms so anyone could purchase a discounted weekend ticket across a ten-hour window, from 2 pm until midnight, and then we opened that window a few minutes early to deal with any over-eager customers.

Thursday's ticket sales were an absolute success. Not only did we manage our pre-sales without crashing again, we set a new internal record for the number of pre-sale tickets sold in a single day.

Unfortunately, due to the Delta outbreak across New South Wales and Australia, we had to postpone the event a few weeks ago, through to 2022.
However, the lessons we learned here are quite valuable, and I share them with you all today in the hope that they won't be repeated frequently. The key things I took away from this were: load test regularly, and if you don't think you need a load test for the event, do it anyway, and make sure the test is harder than anything you think you'll encounter in production. Understand your profiling tools before you urgently need them. Make sure you correctly deal with timeouts and request cancellation, something that very often goes overlooked in application development. Make sure that any panic buttons are actually disaster-proof and will work in an emergency. And mostly, do not perform CPU-intensive work in a single-threaded web application; that's literally just asking for trouble. Thank you.

Thanks, Yakov, that was an excellent talk, and yes, lots of great lessons there about how to respond to various kinds of oops. I'm sure everyone enjoyed that talk; I've seen a lot of really positive comments there in the chat. Now, I don't see any other questions, but if you would like to hang out (I know you've been in the chat already, Yakov), people may have questions and want to ask you about anything you've learned, about how to instrument MySQL... actually, I have a question: have you managed to find any profiling tools for MySQL that would be useful?

I still haven't. If anybody knows of something that does what we need there, I would absolutely love to know; please drop me a tweet or an email.

Mind-boggling. I recall having to do this with other database systems many, many years ago when I was technical, and yeah, not having that would be horrendous. Thank you, Yakov, once again for giving a great talk. And
for everyone else, we're going to take a bit of a break, and we'll be back in a few minutes, about 15 minutes from now. Let's see, we're scheduled for about 11 o'clock, so you've got 15 minutes before our next talk with Amit Sahar, which is called Planning for Failure Using Chaos Engineering. We've just had a talk on how to fix things that got broken by accident; now we're going to break them on purpose. We'll see you all back here in about 15 minutes.