And welcome back, everyone. We have our next talk right now by Yakov, which is called Making Swiss Cheese of Ticket Sales. As we were just saying before in the green room, clearly the abstract has done its job, because lots of people are interested in this talk, including me. So without further ado: Yakov, take it away.

Thank you. Hello everyone, and Shana Tova, happy new year, and a special hello to our live chat. I'm Yakov, and in my day job I'm a senior software developer at WiseTech Global, where we build software that acts as the operating system for global logistics. At other times I swap the shipping containers for Docker containers and act as the systems administrator for SMASH, the Sydney Manga and Anime Show, where we host an annual convention for all things anime and manga, at least in years where there isn't an active pandemic.

Like most conventions, we sell tickets to the event. If you want to get in you need to buy a ticket, a ticket gets you a badge, and a badge gets you in the door. We sell tickets in two ways: at the door on the day of the event, or online beforehand at a slight discount. And this year, online sales were quite a bit harder than that might sound.

We've done this many times before; on our current software stack we've done it twice. In 2019 we sold tickets, redeemed tickets and had a whole convention as usual. In 2020 we sold tickets and then had to postpone the convention, but our launch of ticket sales did go off without a hitch.

This year, for our 2021 flagship event, we opened ticket pre-sales at 10 am on Sunday the 2nd of May 2021. That morning started for me pretty much like any other: I got up, I got dressed, I did the morning prayers (perhaps I should have prayed a little bit harder that everything here would go to plan), and then I
logged onto Slack and onto our ticketing website, just to keep an eye on things and watch the ticket sales start to roll in.

As the clock ticked over to 10 am, I refreshed the page, expecting to see our list of tickets for sale go from nothing to our three regular online items plus the early bird special that was available at the time. Instead, all I got was a loading spinner, and reports started coming in that it wasn't just me. Two minutes later, the main SMASH WordPress website was down too. By ten past we were in full battle-stations mode, trying to figure out what on earth was bringing our websites down. Some users even reported seeing SQL-related errors on the front end, and that's usually never a good sign.

Forty minutes later, at 10:50 am, we thought we had identified the problem and deployed a mitigation. It didn't help. Eventually, at 11:40 am, we decided to pull the plug, having only managed to sell one single ticket in our first hundred minutes. Over the next 20 minutes we discovered that our panic button didn't quite work in the current circumstances, and we had to manually push out a new build of the website with that panic mode hard-coded to on.

And yet, somehow, ticket sales started flowing through from that point onwards. We spent the rest of the afternoon identifying inconsistencies in the data from those sales, such as users who had paid but whose payment hadn't been recorded in the system, or users who had been double or triple charged. In fact, one order had been charged 12 times over, the poor fellow. We added the missing links into the data, issued appropriate refunds, and I went to bed that night very confused.

So what on earth went wrong here? Before we get to that, let's set the scene.

SMASH has been running for over 10 years. I first found out about it in
2018, and shortly afterwards joined staff as the new systems administrator and second-in-command of IT. For the IT team that was a very transitional year, with the existing staff leaving and handing over to us newcomers everything that they had built. The old systems were a mixture of appliances, frameworks, languages and some strange, disparate microservices; the ticketing system alone was four or five different services, some written in Ruby and others in Go. It was designed many years earlier, for different requirements and constraints, for a very different environment than what we now faced and where we wanted to go with the IT systems. For example, it was designed to run ticket redemption (that's where you scan everyone in at the convention centre) completely offline. But this meant we would never be able to do online ticket sales during the event, and it made it very hard to try and set up mobile handheld scanners that would work outside over 4G or LTE.

Among the many changes we made was, with the blessing of the old guard, building a brand new ticketing system. We couldn't find any existing open source or low-cost solution that would do what we needed, so again we wrote our own from scratch. SMASH is entirely driven by volunteers, so we needed a stack that volunteers would be able to get the hang of quickly, and we chose a JavaScript and TypeScript stack: React for the front end, Node.js and Express for the back end, and MySQL for data persistence, since we already had MySQL infrastructure for our WordPress sites, including things like automatic backups, and I didn't want to have to manage two database stacks.

To host the sites we decided to use Kubernetes, mostly because I wanted to learn Kubernetes, but also because it made a bunch of things really easy (words that may
never be said again as it's gotten more complex), such as pushing out updates to the software, managing secrets, and automatically managing nginx and Let's Encrypt certificates. It let us establish a really solid infrastructure-as-code foundation, and in the long run it has been a lot easier than manually configuring a handful of VPS boxes.

So this is our rough architecture for the ticketing system. Requests to /api go to the Node.js backend, which talks to a MySQL instance, which is in turn backed by a persistent disk. All other requests go to a tiny nginx container that serves up the static front end. If we add in the WordPress site and then account for replicas, this is basically what our setup looked like at the time.

This setup served us nicely through 2019. In fact, we built the system all through 2019 in a very just-in-time approach; you might even call it agile. For example, when we launched ticket sales that first year we hadn't finished the code to generate PDFs for tickets or receipts, so we later had to backfill all those orders by generating the PDFs and sending out the emails once the code was ready. Another example is the ticket redemption code, which was only finalised and deployed to production the night before exhibitor setup day, when we had to scan in all the exhibitors. But it worked, and it held, and we ran a convention that weekend, including a concert, with almost no technical issues.

Throughout 2020 and early 2021 we built some new features, refactored the code, updated npm packages and had our very short-lived 2020 ticket sales, but there were no major changes from the 2019 code base or architecture. So how come, in 2021, this system fell flat on its face?

There are many different models that you can use to analyse risk and accidents. The
Swiss cheese model is used in the aviation industry, and it's one of my favourites because it is so often accurate. The model works roughly like this: for any event that you don't want to happen, there are a bunch of different barriers that contribute towards protection. Each of them is ultimately fallible, but in different ways. For that event to occur, multiple barriers have to be individually breached; if any single one of those barriers holds up, then the event, usually some kind of accident or disaster, is still prevented. The more barriers that were breached, the closer you've come to total disaster. Over at the Snake Oil Academy track they would probably call this defence in depth, but here we're not trying to fend off an attacker, we're just trying to protect ourselves from failure.

For a worked example, let's look briefly at US Airways Flight 1549, the infamous Hudson River ditching. Big content warning on this one: if you're not comfortable, come back in about two minutes. For the rest of you, I'm going to assume that by now you've heard about what happened, because it's pretty famous at this point. And that whole series of events really should not have occurred.

For a start, the aircraft encountered birds (technically geese) on takeoff. Engines are designed to withstand ingesting birds; in fact, during testing they fire frozen birds at engines from a cannon. According to the official NTSB investigation report, though, if a bird enters the engine near the outer edges it will most likely strike only the fan blades, continue along the bypass duct and get ejected out the back, probably dead, possibly cooked. But because it hit the inner part of the engine, it got ingested into the core.

Even then, a twin-engine aircraft like that can still fly, including a full
takeoff, with only one working engine. We have some basic redundancy; that's great. But unfortunately, in this case, the birds took out both engines. Even then, an aircraft can still glide for quite some time, though if you ask a pilot they will probably call it falling slowly rather than gliding. Regardless, they hadn't gotten high enough yet, so they didn't have very far to glide, and definitely not far enough to reach a runway. And so on and so forth: different layers of protection, redundancy or resiliency, each one breached, until they had no choice but to ditch into the Hudson River.

Fortunately, in this case there were still layers of protection that held up. The pilots used their training to ditch the aircraft without crashing and destroying it. The flight crew evacuated everyone before the plane sank. And the plane was equipped with life rafts, so people actually had somewhere to go. In this case it wasn't a total disaster, and a good amount of those cheese layers still held up.

But in the days that followed the 2nd of May 2021, I was strongly reminded of this model as we worked through the different problems in our ticketing system, none of which alone could be pinpointed as the singular cause of our failed ticket sales launch, but all of which could have helped prevent or mitigate it.

The first issue that we managed to identify and fix was that the main WordPress site went down. It shared a database server with the ticketing system, and we hadn't had a single problem with it in previous years, even on the day of the convention with people hammering the schedule. All we could see was the generic error message "Error establishing a database connection". I suspected this was due to a mixture of increased load on the main site and on the ticketing site at the same time.
I went digging and quickly discovered that the default connection limit in MySQL is 151 concurrent connections. I have no idea why they picked that number; perhaps whoever came up with it was a fan of the original Pokémon games. In any event, we increased the connection limit to 500, and that particular problem went away.

Unfortunately, we were still seeing errors on the ticketing system. The most common one we saw in the server logs was a Knex timeout error: "Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?"

The way most database client software works is that it doesn't create a new SQL connection every time it needs one; these get pooled. When you request one, you either take one from the pool or get handed back a brand new one; that's up to the pool to decide. When you're done, you hand it back to the pool, and it either scrubs it for reuse later or drops it entirely; again, that's up to the pool to decide. In this case I believe the error message is technically wrong, and it really means the pool is probably empty: all the connections it has have already been handed out, and it's not going to create another new one for you.

The first thing we did was go looking for any places in the code where we were requesting a connection but not handing it back, and we found exactly zero instances of that. The code was doing the right thing. So the next question was: how big is this pool, exactly? We did some initial digging and found, buried away in the documentation, that under the default settings the pool has a minimum of two connections available and a maximum of ten.
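For context, here is a minimal sketch of what a Knex connection pool configuration like ours looks like. The host and credentials are placeholders rather than our real infrastructure, but the pool numbers are the documented defaults we were unknowingly running with:

```typescript
import knex from "knex";

// Minimal sketch only: connection details are placeholders, not our real
// infrastructure. The pool section shows Knex's documented defaults.
const db = knex({
  client: "mysql",
  connection: {
    host: "mysql.example.internal",
    user: "tickets",
    password: process.env.DB_PASSWORD,
    database: "tickets",
  },
  pool: {
    min: 2,  // default minimum
    max: 10, // default maximum, and the number we ended up raising
  },
});

export default db;
```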
We had two servers and hundreds of users, so we likely needed far more than 20 concurrent connections. We increased the connection pool limits, deployed it to production, and... still no improvement.

At this point we decided to push the panic button and put the site into maintenance mode. This is a special mode where users can't really do anything, and it's designed for shutting down the site to perform maintenance, such as moving between servers, rebuilding the cluster, or some kind of urgent database operation. In our case we just wanted to pull the plug so we could start seeing what was actually happening. Unfortunately, the site was under such heavy load that we couldn't even get into the admin settings to save our changes. I manually flipped the switch in the database table, only to then discover that the website couldn't even load the setting's value, because the site was crawling so slowly. We ended up going into the code base, changing the client-side isMaintenanceMode function to simply return true, and deploying that straight to production.
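That emergency hack looked roughly like this. It's a hedged sketch with illustrative names (the settings endpoint and response shape are hypothetical); normally the flag comes from the server rather than being hard-coded:

```typescript
// Front-end "panic button", as deployed that morning. Normally this asks the
// API whether maintenance mode is switched on, but under load that request
// never came back, so we shipped a build with the flag hard-coded to true.
export async function isMaintenanceMode(): Promise<boolean> {
  return true;
  // Normal behaviour, roughly:
  // const res = await fetch("/api/settings/maintenance-mode");
  // return (await res.json()).enabled === true;
}
```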
So by now we should have had no users and been able to begin a thorough investigation. But strangely, this is when orders kept coming into the system. These were hitting the payment processor and our emailing provider, and now that we had no new users coming through, we were actually, somehow, selling tickets. The system was slowly working through a backlog. I suspected this was a backlog of users who had already started the ticket flow, or who were spamming the refresh button; towards the end of the flow the isMaintenanceMode function is no longer checked.

But as these orders came through, we started getting complaints to customer service that some customers had been double charged. We spent some time looking through the orders and cross-referencing them with the payment provider, and discovered two surprising things.

One: some customers had indeed been double charged for their order, as they must have been submitting the order form multiple times and retrying on error. We refunded the extra orders and cancelled the duplicate tickets. We hypothesised, based on the screenshots that had been sent in, that what must have happened is this: customers submitted their order, which took so long that the nginx ingress controller, which was running on basically default settings, gave up after 60 seconds and returned an HTTP 504 Gateway Timeout error. But Node.js and Express continued handling the request and ultimately fulfilled the order. The user never saw that, so when they saw the 504 error they simply resubmitted the order, not realising that they were now submitting a second order. Or a third. Or, for that one poor person, a twelfth.
To fix this, we increased the proxy read timeout to a larger value than the application server would ever take, so that the application would time out or fail before nginx gave up on the request. That should also help to reduce our load, because if each user is making multiple requests then our effective user count, and therefore our load, is several times what it should be.

But the even stranger, surprising thing number two was that some customers experienced the exact opposite case: they had been charged for their order, but they had no corresponding ticket in our system. There is logically no way this should be able to happen, because of the way the code was structured: we create a payment record, then we call out to the payment processor's API, and then we update the payment record with the result of the API call. There should be a payment record there at all times.

We went digging through the code and discovered a horrible mistake: the code that created the payment record and the code that called out to the payment processing API were both in the same database transaction. If the container gets shut down in the middle, for example when a new image is deployed, then the payment record never actually gets saved to the database. This could also explain our problem overall: if the payment processor takes a significant amount of time, then we hold our transactions open and exhaust the connection pool.
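To illustrate the shape of that bug and of the fix we deployed, here is a hedged sketch. The table names, columns and the paymentProcessor client are hypothetical stand-ins, not our actual code:

```typescript
import db from "./db"; // the Knex instance from the earlier sketch

// Hypothetical stand-in for the real payment processor SDK.
declare const paymentProcessor: {
  charge(orderId: number, amountCents: number): Promise<{ ok: boolean; reference: string }>;
};

// Before: the slow external API call sits inside the same transaction that
// creates the payment record. The record only exists once the transaction
// commits, so a container shutdown mid-call loses it entirely, and the open
// transaction holds a pooled connection the whole time.
async function takePaymentBroken(orderId: number, amountCents: number) {
  await db.transaction(async (trx) => {
    const [paymentId] = await trx("payments").insert({
      order_id: orderId, amount_cents: amountCents, status: "pending",
    });
    const result = await paymentProcessor.charge(orderId, amountCents);
    await trx("payments").where({ id: paymentId }).update({
      status: result.ok ? "paid" : "failed", processor_ref: result.reference,
    });
  });
}

// After: write the pending record and release the connection first, then call
// the processor, then record the outcome as a separate short write.
async function takePayment(orderId: number, amountCents: number) {
  const [paymentId] = await db("payments").insert({
    order_id: orderId, amount_cents: amountCents, status: "pending",
  });
  const result = await paymentProcessor.charge(orderId, amountCents);
  await db("payments").where({ id: paymentId }).update({
    status: result.ok ? "paid" : "failed", processor_ref: result.reference,
  });
}
```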
That's it, there's our smoking gun, everyone can go home. Clearly not: there's still 15 minutes left in this talk. Since we no longer had any real customers online at this point, we pushed the fix to staging and ran a synthetic load test talking to the payment processor's testing API, to ensure that we didn't spend or take any real money. Waiting with bated breath, we ran our test script, and it still failed spectacularly. Whilst this was clearly a problem, it couldn't be the problem.

So at this point we were pretty much lost. We knew that we had a problem; we knew that it was likely related to the SQL database, since we had WordPress and Knex both complaining about SQL connections; and we knew that it was induced by a massive load of customers trying to buy tickets. But we had no clue what was really wrong.

As far as diagnosing and optimising SQL goes, this is something I've seen many times before at work, and when I get stuck there we have some very good DBAs I can always turn to for help. But I didn't have that here, and I couldn't even turn to my usual tools. At work we rely on Microsoft SQL Server, and the best tool by far for an initial investigation is SQL Server Profiler. This is a wonderful tool that can show you, live, every SQL command being executed: who ran it, how long it took, what machine it came from, and so on. Normally I would look there for long-running queries or blocked queries and start to unravel the mess. But I could not find any equivalent tooling for MySQL.

The only option left was to instrument the application itself. Fortunately (and, well, it's also what got us into this mess) this is our own application and not a third-party proprietary one, so we could do this fairly easily. We did two things here. One: we added a warning if a transaction takes too long to run, with "too long" being an arbitrary amount of time. Two: we ran the application in development mode with an environment variable set so that the ORM would print out every SQL query as it ran it. We ran a few test orders in development and sat down to look through those queries.
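Here is a minimal sketch of the first of those, assuming the Knex setup from the earlier sketch; the threshold and label are arbitrary. For the second, we relied on the ORM's own debug logging, which in Knex's case is switched on with an environment variable (something along the lines of DEBUG=knex:query):

```typescript
import type { Knex } from "knex";
import db from "./db"; // the Knex instance from the earlier sketch

// Arbitrary threshold: anything holding a transaction open longer than this
// gets logged so we can spot it in the server logs.
const SLOW_TRANSACTION_MS = 2000;

export async function tracedTransaction<T>(
  label: string,
  work: (trx: Knex.Transaction) => Promise<T>
): Promise<T> {
  const started = Date.now();
  try {
    return await db.transaction(work);
  } finally {
    const elapsed = Date.now() - started;
    if (elapsed > SLOW_TRANSACTION_MS) {
      console.warn(`transaction "${label}" was held open for ${elapsed}ms`);
    }
  }
}
```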
We noticed a few interesting things.

Firstly, each HTTP request to order some tickets had about five or six transactions running. This greatly increases the chance of failure when the connection pool is exhausted: once a request gets that first connection, there is no guarantee the pool will give it the second one it needs, or the third, or the fourth, and so on. We consolidated some of these transactions together and managed to bring it down to three.

Secondly, we noticed a lot of extra queries that we didn't need. Often, when we created or updated a row, the ORM would subsequently select that row again, but we never did anything with the new result. We changed a few method calls and those select queries disappeared.

Thirdly, we noticed a few of the queries were SELECT ... FOR UPDATE, which locks the row. This was by design, as we keep track of how many sales each item or bundle has. Some definitions: an item is an individual thing, such as a Sunday pass; a bundle is a collection of items, such as a weekend ticket, which would contain both a Saturday pass and a Sunday pass. This is so we can limit certain purchases. For example, our discounted early bird bundle was limited to the first 500 people, and in 2019, when we held the anisong concert, the hall only had a certain number of seats, so we couldn't oversell it or some people would have had nowhere to sit. This year we also have things like square-metre-per-person capacity limits to worry about, but we hadn't programmed any of that into the system.

The row locking is only supposed to be very brief: we lock the rows only for a very small period while we create the pending order and increment the counts for the relevant items, and if the order processing fails we lock again briefly and decrement the counts. We don't keep anything locked for a long time.
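For a picture of what that looks like, here is a minimal sketch of the reservation step; the table and column names are illustrative rather than our actual schema:

```typescript
import db from "./db"; // the Knex instance from the earlier sketch

export async function reserveItem(itemId: number, quantity: number) {
  await db.transaction(async (trx) => {
    // SELECT ... FOR UPDATE: the item row stays locked until this
    // transaction commits or rolls back.
    const item = await trx("items").where({ id: itemId }).forUpdate().first();
    if (!item || item.sold + quantity > item.sales_limit) {
      throw new Error("item sold out");
    }
    await trx("items").where({ id: itemId }).increment("sold", quantity);
  });
}
```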
But if we did, this would also explain a large amount of the performance problems. With the row locked during an order, somebody loading the home page can't even load the list of things available for sale, as we can't determine whether something is out of stock while it's locked. Technically we don't even know if it exists, since the definition of the item and the quantity are stored in the same table, in the same row.

Since we didn't have hard limits this year, we checked with the executive team to see if we needed to stick to the hard-limit spec we had in 2019, or if we could disable the row locks and maybe just go over a little bit. They gave us the go-ahead, we disabled the row locking, we ran the test script again, and this time it actually helped a little bit. But each request was still far too slow, and after a minute everything started timing out again.

Now, around this time, while the team was digging through SQL, I decided to look at a different problem. Something we had noticed while rebooting containers and deploying new ones during the ticket launch was that the server-side container image was taking about three to five minutes from the time the container started until the time it started listening for HTTP requests on its TCP port. I roped in a JavaScript expert from work to give me the rundown on how to do some basic performance analysis and debugging with Node.js, and he showed me how to connect the Chrome developer tools to a Node.js application. From there I extrapolated a bit, and by exposing the correct port number through Docker I managed to attach the Chrome DevTools to Node.js inside the Docker image.

Strangely, while stepping through it on my MacBook, the performance problem just seemed to be in importing types. So I did
some chin scratching, I did some thinking, I did a lot of googling, and I discovered there's an open issue on GitHub that Docker for Mac is apparently just really, really slow at file system operations. Clearly that's not our production problem, as Azure Kubernetes does not run on macOS.

So the next step I took to debug this was to create a new Linux virtual machine. I used Ubuntu, because it's an ancient African word for "slightly easier than Debian". I installed that on my Windows workstation PC, installed Docker in there, and started to debug from there. When running Docker on Linux, our container started up pretty much instantly and listened on port 3000 immediately: no performance problem to be seen whatsoever.

But one of the main selling points of containerisation is that you get perfect reproducibility, since you have the same container image on all your systems. The only things that differ are the hardware performance characteristics, the network, and the underlying Linux kernel, and I didn't think any of those were relevant here. So I went digging through our Kubernetes YAML for any major differences and came across this fun block that I'd written a long time ago and mostly forgotten. Back after the 2019 convention, while we were downscaling everything to minimal business-as-usual operations, the YAML linter in Visual Studio Code suggested adding resource limits, and after reading through the docs I thought, well, that sounds like a really, really good idea; resource limits sound good. So I added some limits based on the peak usage we'd observed during the convention that year.

I re-ran the container image on my Ubuntu VM, but this time limited the CPU to just a quarter of a CPU core. And would you believe it: the image took three minutes to start up.
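For reference, the limits block in question looked roughly like this. This is a hedged reconstruction rather than our exact manifest; the memory figures are placeholders, but the quarter-of-a-core CPU limit is the part that mattered:

```yaml
# Hedged reconstruction of the container resources block, not the exact
# manifest. "250m" means a quarter of one CPU core.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 512Mi
```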
This immediately threw a new light on the whole situation. Was it possible that our SQL was perfectly fine after all, and we were just CPU-constrained? After all, during the 2019 ticket sales we hadn't yet written the PDF generation code, and the 2020 ticket sales were quite lacklustre due to the looming pandemic and just general uncertainty.

We commented out the function call that generates the ticket PDF and attaches it to the email, we pushed the new code to staging only, we re-ran the load test script, and bang: all our requests were handled lightning fast, no performance problems whatsoever.

Unfortunately, we can't actually ship that change, as people need their ticket PDFs if they're going to attend the convention; that's the whole point of buying the ticket. And increasing the amount of CPU allocated to the container isn't hugely beneficial either. We could go from 25% up to 100% and handle maybe four times as many requests, if it scales linearly, but anything more than a single vCPU core won't help, as Node.js is famously single-threaded. It's not going to farm out different incoming requests to different CPU cores by itself. In fact, if you google "how do I make Node.js use more CPU", the top Stack Overflow answer is "run multiple copies of it, duh". So we would somehow need to offload the PDF generation another way.
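Our actual plan was to push that work out of the web process entirely (more on that shortly), but to illustrate the general idea of getting CPU-heavy work off Node's single main thread, here is a hedged sketch using Node's built-in worker_threads module. The names are illustrative and this is not the code we shipped:

```typescript
// Single-file sketch: render ticket PDFs on a worker thread so the CPU-heavy
// work doesn't block the main event loop.
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

interface TicketOrder {
  id: number;
  attendeeName: string;
}

// Stand-in for the real, expensive PDF rendering.
function renderTicketPdf(order: TicketOrder): Buffer {
  return Buffer.from(`PDF for order ${order.id} (${order.attendeeName})`);
}

// Main-thread side: called from the request handler; spawns a worker and
// resolves with the rendered PDF bytes.
export function generateTicketPdf(order: TicketOrder): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData: order });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

// Worker-thread side: do the work and post the result back.
if (!isMainThread && parentPort) {
  parentPort.postMessage(renderTicketPdf(workerData as TicketOrder));
}
```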
Now that we had figured out the technical side, we still had the human side to understand: why did we have so many users trying to buy tickets? We had never seen this level of demand before. I don't know for certain, but I have two hypotheses.

The first is COVID. In 2020 there were almost no events, due to the pandemic and the lockdowns: there was no SMASH, no Supanova, no Oz Comic-Con and no PAX Australia. It's quite possible that people were just hungry for their cultural events after a year of deprivation.

The second hypothesis is marketing. Our early bird special was limited to 500 tickets, but it seems at least a few people misinterpreted the social media posts and thought there were only 500 tickets to the event in total. Some critical thinking might suggest that downscaling the event that much isn't really financially feasible, but you can still get up early, smash that F5 button, and be one of those lucky people.

So now, with all the information gleaned from our investigation and the fixes we put into place, we can explain the original events of the 2nd of May in their entirety. Due to a huge amount of demand after a year of cancelled events, possibly fed by marketing miscommunication, we had way more users trying to order tickets than in previous years. As the first orders started coming through, they held open SQL transactions, which prevented subsequent requests, or the continuations of existing requests, from obtaining the connection they needed. When we restarted and deployed new backend images, this interrupted a couple of requests inside the dodgy payments transaction, so those payments were never recorded. PDF generation used up the full CPU allocation, so incoming requests were added to the backlog, and the server did not have enough CPU to process it, definitely not at 25% of a CPU, with three out of four hands tied behind its back. And so, after 60 seconds, nginx gave up on each request while Node.js continued to process it; when users got Gateway Timeout errors they retried, which only added further to the backlog. After a certain point, new requests would always time out before they could be handled by Node.js, since the backlog was so large.

None of these alone can be pinpointed as the cause
of the outage, but our cheese had some pretty big holes.

This all went down on the Sunday, and the investigation ran for several more days, but we needed to have ticket sales operational again by Thursday. For Thursday we made several key changes to do our best to avoid a repeat of the crash.

On the application side, we had fixed the open transactions and consolidated a few more requests into fewer transactions, we removed the SQL queries from the code that we didn't need, we disabled the row locking, and we also increased the connection pool size. We also planned to offload PDF generation to some Azure Functions (you AWS folk, think Lambdas), but obviously we wouldn't have that ready by Thursday.

On the operations side, we had increased the MySQL connection limit, we increased the CPU limits for the server containers from 25% of a CPU core to 95% of a CPU core, and we added an additional server, for a total of almost six times as much CPU capacity as previously.

The last change we made was to human factors. Instead of hyping everyone up again for a rush on limited ticket sales, we changed the terms so anyone could purchase a discounted weekend ticket across a ten-hour window, from 2 pm until midnight, and then we opened that window a few minutes early to deal with any over-eager customers.

Thursday's ticket sales were an absolute success. Not only did we manage our pre-sales without crashing again, we set a new internal record for the number of pre-sale tickets sold in a single day.

Unfortunately, due to the Delta outbreak across New South Wales and Australia, we had to postpone the event a few weeks ago, through to 2022.
However, the lessons we learned here are quite valuable, and I share them with you all today in the hope that they won't be repeated frequently. The key things I took away from this were: load test regularly, and if you don't think you need a load test for the event, do it anyway, and make sure the test is harder than anything you think you'll encounter in production. Understand your profiling tools before you urgently need them. Make sure you correctly deal with timeouts and request cancellation, something that very often goes overlooked in application development. Make sure that any panic buttons are actually disaster-proof and will work in an emergency. And mostly, do not perform CPU-intensive work in a single-threaded web application; that's literally just asking for trouble. Thank you.

Thanks, Yakov, that was an excellent talk, and yes, lots of great lessons there about how to respond to various kinds of oops. I'm sure everyone enjoyed that talk; I've seen a lot of really positive comments there in the chat. Now, I don't see any other questions, but if you would like to hang out (I know you've been in the chat already, Yakov), people may have questions and want to ask you about anything you've learned, about how to instrument MySQL... actually, I have a question: have you managed to find any profiling tools for MySQL that would be useful?

I still haven't. If anybody knows of something that does what we need there, I would absolutely love to know; please drop me a tweet or an email.

Mind-boggling. I recall having to do this with other database systems many, many years ago when I was technical, and yeah, not having that would be horrendous. Thank you, Yakov, once again for giving a great talk. And
for everyone else, we're going to take a bit of a break, and we'll be back in a few minutes, about 15 minutes from now. Let's see, we're scheduled for about 11 o'clock, so you've got 15 minutes before our next talk with Amit Sahar, which is called Planning for Failure Using Chaos Engineering. We've just had a talk on how to fix things that got broken by accident; now we're going to break them on purpose. We'll see you all back here in about 15 minutes.