[Music]

Welcome back, everybody, from morning tea. I hope you had a great snack time to catch up with people and have some refreshment. Apologies in advance: this is my first time emceeing, and it all happened within the last ten minutes, so this is really agile software development on the fly. Apologies if I put my foot in my mouth. This morning I'd like to introduce Brian, who is coming to us from Ontario in Canada, where it is currently minus 30 outside. For all of those of you in Sydney who are suffering through 30-plus, or on the west coast of Australia where it's even hotter, just think of them in Canada at minus 30.
I think our sunny Australian coasts are a little warmer than Canada at this time. Brian is joining us so we can learn about how Hootsuite used load testing to evaluate the performance of Prometheus- and Grafana-based metrics platforms, developing a framework based on the open source tool Locust that can be used to quickly evaluate the performance impact of configuration and component changes prior to deployment. Now, Brian's talk is pre-recorded in case we have any technical glitches across the Pacific, but Brian will be joining us again for a Q&A at the end. Brian is live right now; that is him there, nodding along at the moment, and those plants in the background are real, not even a virtual background, so we really do have him here. Thanks, and enjoy the session. There will be a few seconds' pause as we switch over to the recording. See you soon.

Welcome to the inaugural voyage of Benchmarking Prometheus Metrics Platforms. My name is Brian Gru, and I'm a senior
software developer at Hootsuite, where I mainly focus on observability.

So, some background: we implemented a Prometheus, Thanos, and Grafana stack to support multi-window, multi-burn-rate alerting for most of our product software, but also for some of our infrastructure as well. After the initial implementation, we received reports from developers that their dashboards were loading slowly, and while we could load these dashboards and see that some were slow, others were performing fine, so we really didn't have an idea of the extent of the problem. We had no empirical data to prove that things were actually running slow, or how many things were running slow. So we set out and thought to ourselves: how can we verify and also tune the performance of the current metrics platform we have, and how could we also benchmark other Prometheus-based solutions? Our answer was, basically, that we wanted to performance-test our infrastructure. So you may be asking yourself: why performance-test your
infrastructure? It's our belief that infrastructure critical to assisting product development should be treated as just about as critical as the product itself. Performance testing also helps you understand baseline performance, validate changes against existing baselines before those changes actually get integrated, and watch for regressions in the system itself. It also allows you to be proactive: if you're running these tests in an automated fashion, or manually on a cadence, you can catch problems before your users actually start complaining.

First, we set out to figure out how to model these tests correctly. Generally speaking, we wanted to understand end-user performance, and we wanted to understand it from the perspective of the client our users are accessing to query the metrics platform. More specifically, we were looking to assess the performance of our Prometheus and Thanos platform, which was accessed primarily through Grafana dashboards. To this end, we thought we would model the performance
tests themselves after actual Grafana dashboard loads.

So the first question became: how is Grafana loading dashboards? Chrome developer tools to the rescue here. We basically loaded a dashboard and watched what was happening. Unsurprisingly, Grafana makes HTTP requests to its own back end, which then proxies these requests over to the Prometheus or Thanos back end, and the request path is basically the same as what you would put on a Prometheus API request, so it's easy to copy and paste. The next question became: what are these requests actually doing? Again using Chrome developer tools, we could see first that the template variable queries were being run, and that there is actually dependency resolution between the variables. If you had multiple template variables, Grafana would figure out the order in which to make those requests and then request them top-down. Next, Grafana was loading the panel queries, and it would start by requesting all visible
panels concurrently; then, as you scrolled, it would request the other panel queries as those panels actually became visible.

So the question became: do we have to replicate this complex behavior precisely? My thought was no; you just need to determine the best way to approximate the client's behavior in a way that your performance tests still make sense. For this use case, we decided to approximate the behavior I just described by first serially loading all the variable queries and then loading all the panel queries in parallel. This would provide us a worst-case scenario for loading a dashboard, in which all panels are loading at the same time because all are visible at the same time.

Next, the question was: what specific requests should we actually be making? As I said before, we wanted to simulate user behavior, and we came to the conclusion that no one knows how to break our stuff better than our own users, our developers. So we should actually model these tests based on
existing dashboards. The alternative would have been to generate some synthetic load, perhaps requests that we know are problematic, but that would not have been as true to form as behaving like a user.

So the last question was: what is Grafana actually sending in these requests? Surprise, surprise, it's PromQL, with the variables evaluated. Wait, what, more variables? Yes. There are the template variables we had already talked about, which are either static or generated dynamically from something like a PromQL query, and there are also built-in variables, things like $__range and $__interval.

So how did we decide to handle variable evaluation? Well, for the built-in variables we decided to just set some sane defaults, like five minutes for the interval, and the range was obviously going to be calculated based off the start and end dates. For the query-based template variables, we decided that we would still make those requests to Prometheus or Thanos, and
that this would help better mimic the dashboard load. However, we also decided that we would use the template variables' default values in the actual panel queries themselves, as these were the default queries, with those default values for those variables, that were actually run when the dashboard came up; besides, we didn't have a great way to say which value we should be substituting anyway, other than the default.

So, to recap our modeling section: we decided to make multiple HTTP requests to our metrics back end to replicate the load. We are going to serially request all template variables first, then move on to concurrently request all the panel queries. We are going to model the request order and content on existing dashboards that we already had. And we are going to set sane built-in and template variable default values, but still make those variable query requests to the back end.
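As a rough sketch of what that variable handling can look like, here is one way to substitute built-in variable defaults into a panel query and build a Prometheus `query_range` request path. The function name and default values are illustrative, not Hootsuite's actual code; only the `/api/v1/query_range` parameters come from the Prometheus HTTP API.

```python
from urllib.parse import urlencode

def build_query_range_path(expr, start, end, interval="5m", step=60):
    """Evaluate Grafana-style built-in variables with sane defaults and
    build a Prometheus /api/v1/query_range request path.

    `interval` defaults to five minutes, and the range is calculated
    from the start and end timestamps, mirroring the talk."""
    range_seconds = int(end - start)
    evaluated = (expr
                 .replace("$__interval", interval)
                 .replace("$__range", f"{range_seconds}s"))
    params = urlencode({"query": evaluated,
                        "start": start, "end": end, "step": step})
    return f"/api/v1/query_range?{params}"
```

For example, `build_query_range_path("rate(http_requests_total[$__interval])", 1_600_000_000, 1_600_003_600)` yields a path whose `query` parameter is the PromQL with the interval default already substituted, which is essentially what you see in the browser's network tab when Grafana proxies a panel query.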
Before we actually set out to do this, we wanted to come up with a testing methodology and also some boundaries for our tests. In the center here you see a standard software development life cycle that everyone should be familiar with, and we thought we could apply this to our performance testing as well. We could come up with a hypothesis, say, "this thing is slow because of this or that," and then come up with a design to prove or disprove that hypothesis: maybe we can throw more resources at it, or maybe we can tweak the configuration in some manner. Then we would actually make those changes once we figured out what to do, then run a standardized test suite to verify and benchmark those changes, and the results from the test suite could feed back into a subsequent hypothesis.

So, to recap some of the tenets of this methodology: it's systematic, in that it provides us a framework to develop and prove out these hypotheses. It's consistent, in that the same tests are run
on every iteration, which allows every run to be compared to previous runs. It's data-driven: the test data actually feeds back into the decision-making process, and in addition to the test data itself, external data can also be integrated into the analysis phase. So if you have metrics from somewhere else, or logs, or traces, or whatever else you may have, you can use that to influence your analysis as well. And lastly, the methodology is iterative: it results in a tight loop where changes can be evaluated quickly and hypotheses can evolve with each iteration, which hopefully leads to an increase in velocity.

Just a quick note about data validity here: a data-driven process like this one is really only as good as its input data. So we started thinking about what happens during one of these tests if a service goes offline, or if it temporarily slows down, say because someone else is using the service in the testing environment.
So we decided to smooth out the data by running the tests multiple times and then using an aggregate of the results, instead of running each test once and using that, in case a test run happened to be botched or influenced by external factors.

Lastly, we needed to decide on some boundaries for our tests. We found our longest common use case for a dashboard was about 30 days, and we noticed Grafana didn't have many concurrent users, so we decided to cap the maximum number of virtual users we simulate at about 10.
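A minimal sketch of that run matrix and the smoothing step, using only the standard library. The numbers mirror the talk; the function names are mine, not the actual test suite's.

```python
import itertools
import statistics

USER_LOADS = (1, 5, 10)        # simulated virtual users
WINDOWS_DAYS = (1, 3, 7, 30)   # dashboard time ranges
REPEATS = 10                   # runs per combination, to smooth the data

def test_matrix():
    """Every (users, window) combination; each is run REPEATS times."""
    return list(itertools.product(USER_LOADS, WINDOWS_DAYS))

def aggregate(run_times_ms):
    """Aggregate repeated runs; the median is robust against a single
    botched run influenced by external factors."""
    return statistics.median(run_times_ms)
```

With these boundaries the matrix is 12 combinations, so 120 test runs per suite; using the median (rather than the mean) means one outlier run, say a 5-second load caused by a noisy neighbor, does not drag the aggregate around.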
We then decided that we would run each test suite for user loads of one, five, and ten, and each of those for one-day, three-day, seven-day, and 30-day windows. Each of these combinations would actually be run ten times to smooth out the data, so that any external factors influencing a given run would be averaged out.

So now we had a methodology; next we wanted to come up with a testing platform and a design for our tests. A little note on technology selection: we needed a load testing tool that could make HTTP requests, and there are lots out there; we also needed to simulate multiple users, and many load testing tools fit those requirements. But Locust had actually been used in the past at Hootsuite for load testing different things, and we'd actually used it for load testing individual Prometheus servers. Time had also been invested in the tool to make sure it ran on Kubernetes for us, plus some other tweaks here and
there. So, given all this, we decided to move forward with Locust.

To bring forward some of our requirements from the modeling and methodology sections and how these are implemented in our tool: we needed to serially make HTTP requests for the template variables, which is met by the default Locust HTTP client. We needed to concurrently request all the panel queries, which is something Locust does not handle. We also needed to execute requests for multiple dashboards, multiple times, and provide an aggregate of the results, which isn't really met by Locust either, but it's something we'll address in our test design in a minute. And we needed to expose data to feed back into the analysis phase of our testing loop, which was met by Locust: it has a web UI that allows you to export CSVs, and while that's not terribly automated, it worked for us. We could just take that data and dump it into another sink, like Google Sheets, which is what we used for doing most of
our comparisons 356 00:13:15,200 --> 00:13:18,639 so the first step was adding concurrency 357 00:13:16,880 --> 00:13:20,880 into locust and 358 00:13:18,639 --> 00:13:22,880 locus itself is python based both the 359 00:13:20,880 --> 00:13:24,720 tool and the authoring and as uh people 360 00:13:22,880 --> 00:13:26,880 who are familiar with python no 361 00:13:24,720 --> 00:13:29,040 concurrency isn't necessarily known to 362 00:13:26,880 --> 00:13:32,639 be one of its strong suits but we did 363 00:13:29,040 --> 00:13:34,639 find this aio async io http client to be 364 00:13:32,639 --> 00:13:36,240 the kind of recommended solution uh for 365 00:13:34,639 --> 00:13:38,399 doing this thing so we proceeded to 366 00:13:36,240 --> 00:13:39,839 implement a locus client based on this 367 00:13:38,399 --> 00:13:41,360 package 368 00:13:39,839 --> 00:13:43,760 it did take some time to kind of figure 369 00:13:41,360 --> 00:13:45,519 out uh the exception and error handling 370 00:13:43,760 --> 00:13:48,000 uh with async io 371 00:13:45,519 --> 00:13:50,560 so we weren't doing things like 372 00:13:48,000 --> 00:13:52,000 exploding the event loop or swallowing 373 00:13:50,560 --> 00:13:53,839 exceptions in the event loop and not 374 00:13:52,000 --> 00:13:55,199 passing those back to locusts so they 375 00:13:53,839 --> 00:13:57,120 could actually be 376 00:13:55,199 --> 00:13:58,079 reported 377 00:13:57,120 --> 00:14:00,160 to the 378 00:13:58,079 --> 00:14:02,399 error statistics that are displayed on 379 00:14:00,160 --> 00:14:05,680 these dashboards as well 380 00:14:02,399 --> 00:14:07,760 also locus that can run n virtual users 381 00:14:05,680 --> 00:14:09,440 per node essentially and we never really 382 00:14:07,760 --> 00:14:10,720 figured out how to get this working with 383 00:14:09,440 --> 00:14:13,279 async io 384 00:14:10,720 --> 00:14:16,079 just by trying things like creating an 385 00:14:13,279 --> 00:14:18,959 event loop per python process or per 386 
asyncio client.

So, an overview of the client: the idea was to expose a method to make a list of requests concurrently, but we thought it would also make sense to encapsulate a base HTTP client, so that consumers can request both serially and concurrently with the same client. As I said before, it creates a new asyncio event loop per client instance, and we're still not sure this is the right approach, as we never quite got it working. The async request mechanism itself passes back a list of custom result objects: if an exception does occur during a request, it is caught and passed back inside this result object, which allows our handling code on the other side of the event loop to tie it back into Locust's error reporting.

Now, a little overview of the test layout. As you can see here, at the top we have an encapsulating class, LocustUserTasks, and inside that there are two of these dashboard tasks, Dashboard1 and Dashboard2;
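Before getting into the layout details, the concurrent request mechanism just described can be sketched with plain asyncio. This is a rough illustration of the pattern, not the actual client: aiohttp and the Locust wiring are omitted, and every name here is mine. The key idea is that each coroutine catches its own exception and wraps it in a small result object, so failures survive the trip across the event loop and can be reported.

```python
import asyncio

class Resolved:
    """Result of one request: either a response or a caught exception."""
    def __init__(self, request, response=None, error=None):
        self.request = request
        self.response = response
        self.error = error

async def _one(request, fetch):
    try:
        return Resolved(request, response=await fetch(request))
    except Exception as exc:
        # Caught here rather than exploding the event loop; the caller
        # can feed it into error reporting (Locust, in the talk's case).
        return Resolved(request, error=exc)

def request_concurrently(requests, fetch):
    """Run all requests concurrently on a fresh event loop and return
    a list of Resolved objects in the original request order."""
    async def gather():
        return await asyncio.gather(*(_one(r, fetch) for r in requests))
    return asyncio.run(gather())
```

With a real HTTP client, `fetch` would be something like an aiohttp GET; the point is only that a failing panel query yields a `Resolved` with `error` set instead of aborting its siblings.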
both are a SequentialTaskSet, which is a construct from the Locust API. You'll notice that the embedded classes, Dashboard1 and Dashboard2, each declare a tasks variable, which is also part of the Locust API; it defines the sequence of the tasks that are executed. In the case of the dashboards themselves, these tasks variables are set to methods: load the variables, load the panels, then stop the dashboard load. Dashboard2 does the same thing: load the variables, load the panels, stop the dashboard load. And you can see that the outer LocustUserTasks class also declares a tasks variable, which basically defers to the inner classes: it loads dashboard one, then dashboard two. To repeat this for the n times you want, you just repeat those entries as many times as you like: dashboard one, dashboard two, dashboard one, dashboard two, and so on, and
then at the end you just say stop test. The stop-test and stop-dashboard-load methods aren't shown here, but they're really just methods that call an interrupt on the Locust test itself, to stop things from executing over and over.

So now we knew what we wanted to do; we actually had to implement these things. To bring forward some of our requirements from modeling, methodology, and the platform itself: the tests had to be written in Python, for Locust; the tests had to be modeled on existing dashboards; and the tests had to be consistent between runs, so that they're fully repeatable. Another consideration was that it would be great to increase velocity for operators if the tests could easily be updated. So what we came up with was to actually generate the tests themselves from Grafana dashboard JSON, and we already had some experience on the team with the Grafana dashboard JSON model, so that lent itself well. This also allowed us to model the tests after the dashboards a little more
precisely and it enables a 475 00:17:27,919 --> 00:17:31,600 quick adjustment to the tests so if you 476 00:17:30,000 --> 00:17:34,080 needed to tweak something you could just 477 00:17:31,600 --> 00:17:35,360 tweak it and rerun the generator and we 478 00:17:34,080 --> 00:17:37,520 have a bunch of new tests coming out the 479 00:17:35,360 --> 00:17:39,280 other side and the nice thing is that it 480 00:17:37,520 --> 00:17:41,200 generates a consistent test suite every 481 00:17:39,280 --> 00:17:43,520 time you don't really have to worry 482 00:17:41,200 --> 00:17:46,720 about user error as long as the input's 483 00:17:43,520 --> 00:17:48,960 good the output should be good as well 484 00:17:46,720 --> 00:17:51,760 so the overview of our implementation of 485 00:17:48,960 --> 00:17:54,559 the generator itself we identified some 486 00:17:51,760 --> 00:17:56,799 key dashboards in grafana 487 00:17:54,559 --> 00:17:58,160 things that were slow things that were fast 488 00:17:56,799 --> 00:18:00,960 things that had different kinds of 489 00:17:58,160 --> 00:18:02,480 panels things that had 490 00:18:00,960 --> 00:18:05,120 lots of template variables things that 491 00:18:02,480 --> 00:18:07,440 had no template variables etc i think 492 00:18:05,120 --> 00:18:10,640 overall we probably identified close to a 493 00:18:07,440 --> 00:18:12,559 dozen or so and we pulled these via 494 00:18:10,640 --> 00:18:15,200 the grafana 495 00:18:12,559 --> 00:18:17,120 rest api and uh we actually ended up 496 00:18:15,200 --> 00:18:19,360 saving a copy of these in the generator 497 00:18:17,120 --> 00:18:21,440 repo and that was mainly due to the fact 498 00:18:19,360 --> 00:18:23,600 that if somebody edited the grafana 499 00:18:21,440 --> 00:18:24,960 dashboard upstream we didn't want to 500 00:18:23,600 --> 00:18:27,440 have our 501 00:18:24,960 --> 00:18:28,880 tests change on us we needed them to 502 00:18:27,440 --> 00:18:30,960 remain consistent 503 00:18:28,880 --> 00:18:33,600 so
we actually saved this copy like the 504 00:18:30,960 --> 00:18:34,960 copy of all these dashboards uh the json 505 00:18:33,600 --> 00:18:37,760 into our 506 00:18:34,960 --> 00:18:40,320 tester repo to kind of codify them 507 00:18:37,760 --> 00:18:42,640 so we wrote this python script to parse 508 00:18:40,320 --> 00:18:45,280 that json and it extracts the template 509 00:18:42,640 --> 00:18:47,520 variables and the panel queries and it 510 00:18:45,280 --> 00:18:49,280 performs that variable substitution 511 00:18:47,520 --> 00:18:51,919 and it actually uses this data to 512 00:18:49,280 --> 00:18:53,919 template the tests uh with jinja which 513 00:18:51,919 --> 00:18:55,840 are just python test templates written 514 00:18:53,919 --> 00:18:57,360 in jinja 2 515 00:18:55,840 --> 00:18:59,280 these templates would really 516 00:18:57,360 --> 00:19:01,919 define the structure for each test right 517 00:18:59,280 --> 00:19:03,679 for each dashboard uh 518 00:19:01,919 --> 00:19:06,640 handle the variable loads handle the 519 00:19:03,679 --> 00:19:09,200 panel loads and uh 520 00:19:06,640 --> 00:19:12,000 this python that was templated 521 00:19:09,200 --> 00:19:14,880 mainly used uh the new async locust 522 00:19:12,000 --> 00:19:15,760 client to make all those requests 523 00:19:14,880 --> 00:19:18,240 and 524 00:19:15,760 --> 00:19:20,720 the output of this whole system was 525 00:19:18,240 --> 00:19:23,600 a bunch of locust tests that are all 526 00:19:20,720 --> 00:19:25,440 valid python and they're ready to run 527 00:19:23,600 --> 00:19:28,240 and we would actually 528 00:19:25,440 --> 00:19:30,320 we ended up generating uh one test per 529 00:19:28,240 --> 00:19:33,520 query window so one day three day seven 530 00:19:30,320 --> 00:19:35,840 day and thirty day and this was really 531 00:19:33,520 --> 00:19:38,160 just for us i mean the multiple tests 532 00:19:35,840 --> 00:19:39,840 allowed us to test each
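the extraction step described above — pulling panel queries out of a grafana dashboard json export and substituting the template variables — can be sketched like this; the real generator then rendered jinja2 templates into locust test files, and here python's stdlib string.Template stands in for the substitution since grafana's $var / ${var} syntax happens to match it (the json shape is the standard grafana export format, but the function name and sample values are illustrative)

```python
# Sketch of the generator's extraction step: parse grafana dashboard
# json, collect each panel's prometheus query, and substitute template
# variables. string.Template stands in for the jinja2 rendering the
# real generator used.
import json
from string import Template

def extract_queries(dashboard_json, var_values):
    dash = json.loads(dashboard_json)
    queries = []
    for panel in dash.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                # substitute $job, $interval, ... leaving unknown vars intact
                queries.append(Template(expr).safe_substitute(var_values))
    return queries

dashboard = json.dumps({
    "panels": [
        {"targets": [{"expr": 'rate(http_requests_total{job="$job"}[5m])'}]},
    ],
})
queries = extract_queries(dashboard, {"job": "dashboard-service"})
```

because the saved dashboard json is the input, rerunning this after a tweak regenerates a consistent set of queries every time, which is the repeatability property the talk calls out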
query window 533 00:19:38,160 --> 00:19:41,280 distinctly 534 00:19:39,840 --> 00:19:43,520 but we could load all the tests at once 535 00:19:41,280 --> 00:19:44,640 so we could rerun a specific query 536 00:19:43,520 --> 00:19:47,200 window or something if you want to 537 00:19:44,640 --> 00:19:49,679 without having to rerun everything and 538 00:19:47,200 --> 00:19:51,600 again for us just logistically this made 539 00:19:49,679 --> 00:19:52,400 more sense 540 00:19:51,600 --> 00:19:54,000 so 541 00:19:52,400 --> 00:19:55,600 uh now we were just about ready to go we 542 00:19:54,000 --> 00:19:57,360 need to figure out what kind of data we 543 00:19:55,600 --> 00:19:59,200 wanted to get out of these tests uh 544 00:19:57,360 --> 00:20:00,799 and kind of what we wanted 545 00:19:59,200 --> 00:20:02,799 to compare on 546 00:20:00,799 --> 00:20:05,360 uh so it came down to choosing some 547 00:20:02,799 --> 00:20:06,960 meaningful indicators so errors to 548 00:20:05,360 --> 00:20:08,640 us are always meaningful they help track 549 00:20:06,960 --> 00:20:10,159 your failures under load more data is 550 00:20:08,640 --> 00:20:11,840 always better than less so we actually 551 00:20:10,159 --> 00:20:14,640 chose to record basically everything 552 00:20:11,840 --> 00:20:17,120 that locust provided in our google sheet 553 00:20:14,640 --> 00:20:20,400 but we'd really only compare on 554 00:20:17,120 --> 00:20:23,440 uh error rates uh your median your 90th 555 00:20:20,400 --> 00:20:25,840 and 99th percentiles so just to dig in 556 00:20:23,440 --> 00:20:27,360 there a little more error rates they're 557 00:20:25,840 --> 00:20:29,120 not only helpful to understand what's 558 00:20:27,360 --> 00:20:30,880 currently failing 559 00:20:29,120 --> 00:20:32,559 but they help you understand the point 560 00:20:30,880 --> 00:20:35,360 at which your system starts to fail so 561 00:20:32,559 --> 00:20:38,320 you can actually increase load uh using 562
00:20:35,360 --> 00:20:40,000 your load testing tool until 563 00:20:38,320 --> 00:20:42,000 uh you're either content with the 564 00:20:40,000 --> 00:20:43,120 performance under load or things start 565 00:20:42,000 --> 00:20:45,039 to fail 566 00:20:43,120 --> 00:20:46,960 the next thing is quantiles these are 567 00:20:45,039 --> 00:20:49,919 extremely useful in the observability 568 00:20:46,960 --> 00:20:52,400 world in that most latency based slos 569 00:20:49,919 --> 00:20:53,919 are already expressed as a percentage so 570 00:20:52,400 --> 00:20:57,039 for example you might have something 571 00:20:53,919 --> 00:20:58,640 like 99 percent of your requests complete within 572 00:20:57,039 --> 00:21:00,559 x seconds 573 00:20:58,640 --> 00:21:02,320 and like the quantiles 574 00:21:00,559 --> 00:21:03,360 chosen they should be meaningful to that 575 00:21:02,320 --> 00:21:04,960 slo 576 00:21:03,360 --> 00:21:06,880 but they should also be 577 00:21:04,960 --> 00:21:08,000 wide enough and varied to kind of help 578 00:21:06,880 --> 00:21:09,600 understand 579 00:21:08,000 --> 00:21:10,880 the spread or how frequent those slow 580 00:21:09,600 --> 00:21:14,000 requests are 581 00:21:10,880 --> 00:21:15,679 so for example your p50 or your median i 582 00:21:14,000 --> 00:21:17,200 mean half your requests are basically 583 00:21:15,679 --> 00:21:19,840 completing in this time 584 00:21:17,200 --> 00:21:21,919 so if this is really good or acceptable 585 00:21:19,840 --> 00:21:23,039 performance then you're kind of already 586 00:21:21,919 --> 00:21:26,240 doing okay 587 00:21:23,039 --> 00:21:27,600 uh at your 90th percentile only 10 percent of 588 00:21:26,240 --> 00:21:30,960 the requests are actually slower than 589 00:21:27,600 --> 00:21:32,400 this so is this performance 590 00:21:30,960 --> 00:21:34,880 acceptable for 591 00:21:32,400 --> 00:21:37,600 the majority of users who are accessing 592 00:21:34,880 --> 00:21:39,039 your system and you know uh
the 99th 593 00:21:37,600 --> 00:21:40,400 percentile just one percent of your 594 00:21:39,039 --> 00:21:42,480 requests will actually be slower than 595 00:21:40,400 --> 00:21:43,919 this so is this a tolerable edge case 596 00:21:42,480 --> 00:21:46,880 and kind of how does this align to your 597 00:21:43,919 --> 00:21:49,679 slo if your slo was 99 percent 598 00:21:46,880 --> 00:21:52,720 uh for that latency 599 00:21:49,679 --> 00:21:55,039 um so now we built this thing and we 600 00:21:52,720 --> 00:21:57,679 needed to turn it loose so our first 601 00:21:55,039 --> 00:22:00,080 step was to identify 602 00:21:57,679 --> 00:22:02,720 the current performance of the system we 603 00:22:00,080 --> 00:22:03,679 already had and this is what it looked 604 00:22:02,720 --> 00:22:06,080 like 605 00:22:03,679 --> 00:22:09,360 so 30 day windows 606 00:22:06,080 --> 00:22:11,520 you could see were pretty horrendous and 607 00:22:09,360 --> 00:22:12,640 added load really only exacerbated the 608 00:22:11,520 --> 00:22:16,240 issue 609 00:22:12,640 --> 00:22:18,480 and uh 10 user 30 day windows as you can 610 00:22:16,240 --> 00:22:19,919 see here there's actually no data for 611 00:22:18,480 --> 00:22:21,919 those 612 00:22:19,919 --> 00:22:23,919 they are timing out 613 00:22:21,919 --> 00:22:25,760 except for the p50 614 00:22:23,919 --> 00:22:27,200 so clearly things 615 00:22:25,760 --> 00:22:29,360 weren't great 616 00:22:27,200 --> 00:22:30,559 and just note here the y-axis in all 617 00:22:29,360 --> 00:22:32,240 these graphs 618 00:22:30,559 --> 00:22:35,600 this is the time in milliseconds for the 619 00:22:32,240 --> 00:22:37,600 entire test suite to run 620 00:22:35,600 --> 00:22:39,360 so iteration 2 621 00:22:37,600 --> 00:22:41,120 we thought hey maybe other metrics 622 00:22:39,360 --> 00:22:43,679 platforms are faster and thanos is the 623 00:22:41,120 --> 00:22:45,520 problem it couldn't possibly be us 624 00:22:43,679 --> 00:22:48,000 so this is what happened we
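the median / 90th / 99th percentile comparison described above can be computed from a set of raw response times with the python stdlib — locust reports these numbers itself, so this is just a sketch of what they mean, and the sample latencies are made up

```python
# Computing the three comparison numbers (median, p90, p99) from raw
# response times using only the stdlib. locust reports these itself;
# this just shows what the numbers mean.
import statistics

def latency_summary(samples_ms):
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),  # half the requests are faster
        "p90": cuts[89],                       # only 10% of requests are slower
        "p99": cuts[98],                       # the 1% edge case against the slo
    }

summary = latency_summary(list(range(1, 101)))  # 1..100 ms of fake latencies
```

picking quantiles that match the slo (e.g. a "99% within x seconds" target maps straight onto p99) is what makes these three numbers comparable between test runs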
implemented 625 00:22:45,520 --> 00:22:50,159 a proof of concept for amazon managed 626 00:22:48,000 --> 00:22:51,919 prometheus which you see here as amp and 627 00:22:50,159 --> 00:22:54,880 we implemented a proof of concept for 628 00:22:51,919 --> 00:22:57,360 promscale and we ran those against our 629 00:22:54,880 --> 00:22:58,720 current uh thanos system 630 00:22:57,360 --> 00:23:01,679 so as you can see here from the error 631 00:22:58,720 --> 00:23:03,120 rates uh slide as the user range 632 00:23:01,679 --> 00:23:06,080 increased so did the error rates in 633 00:23:03,120 --> 00:23:08,000 general uh most of the solutions saw 634 00:23:06,080 --> 00:23:11,039 no or little error rates at the 635 00:23:08,000 --> 00:23:14,000 shorter ranges and smaller numbers of virtual 636 00:23:11,039 --> 00:23:15,280 users and promscale had error rates 637 00:23:14,000 --> 00:23:16,960 that kind of drifted 638 00:23:15,280 --> 00:23:20,080 and uh 639 00:23:16,960 --> 00:23:22,400 amazon managed prometheus and thanos 640 00:23:20,080 --> 00:23:24,640 tracked very similarly although 641 00:23:22,400 --> 00:23:25,840 uh the managed solution was slightly 642 00:23:24,640 --> 00:23:28,000 ahead 643 00:23:25,840 --> 00:23:30,159 uh in terms of performance we found 644 00:23:28,000 --> 00:23:32,320 extremely similar trends uh between the 645 00:23:30,159 --> 00:23:34,799 platforms for the most part as user 646 00:23:32,320 --> 00:23:36,960 range increased uh performance basically 647 00:23:34,799 --> 00:23:38,960 became unacceptable again 648 00:23:36,960 --> 00:23:40,880 so the question really became why do 649 00:23:38,960 --> 00:23:43,039 none of these platforms actually work 650 00:23:40,880 --> 00:23:45,360 well for us 651 00:23:43,039 --> 00:23:47,919 um so the conclusion we came to is 652 00:23:45,360 --> 00:23:50,240 basically that given that amp which i 653 00:23:47,919 --> 00:23:52,480 believe is cortex-based 654 00:23:50,240 --> 00:23:54,400 and promscale and thanos all
degraded 655 00:23:52,480 --> 00:23:55,440 pretty severely approaching 30-day query 656 00:23:54,400 --> 00:23:56,720 windows 657 00:23:55,440 --> 00:23:58,799 uh 658 00:23:56,720 --> 00:24:01,679 we came to that conclusion that our data 659 00:23:58,799 --> 00:24:04,960 was likely suspect and that 660 00:24:01,679 --> 00:24:07,600 the way we were 661 00:24:04,960 --> 00:24:10,080 recording metrics uh was not lending 662 00:24:07,600 --> 00:24:12,799 itself well to the way prometheus tsdb 663 00:24:10,080 --> 00:24:14,559 or thanos or something else worked and a 664 00:24:12,799 --> 00:24:16,000 future investigation would actually 665 00:24:14,559 --> 00:24:18,480 confirm this 666 00:24:16,000 --> 00:24:21,360 in that certain of our series had major 667 00:24:18,480 --> 00:24:23,520 cardinality issues and it was something 668 00:24:21,360 --> 00:24:26,880 to the tune of less than one percent of 669 00:24:23,520 --> 00:24:29,840 our metric series by name accounted for 670 00:24:26,880 --> 00:24:33,120 almost 50 percent of the total samples 671 00:24:29,840 --> 00:24:34,320 uh per scrape basically um but in the 672 00:24:33,120 --> 00:24:36,559 meantime until we got to that 673 00:24:34,320 --> 00:24:39,440 investigation uh we decided to actually 674 00:24:36,559 --> 00:24:40,799 use this testing platform to tune thanos 675 00:24:39,440 --> 00:24:43,279 itself 676 00:24:40,799 --> 00:24:45,760 um so this was really iterations three 677 00:24:43,279 --> 00:24:48,240 and on uh we were benchmarking changes 678 00:24:45,760 --> 00:24:51,760 to the existing platform 679 00:24:48,240 --> 00:24:54,559 um so uh the tuning we focused on 10 680 00:24:51,760 --> 00:24:57,520 user and 30-day query ranges as this was 681 00:24:54,559 --> 00:25:00,320 our worst case and you know if we could 682 00:24:57,520 --> 00:25:02,559 make our worst case better then our best 683 00:25:00,320 --> 00:25:05,120 cases only got even better right 684 00:25:02,559 --> 00:25:07,360 so we decided on a
potential bottleneck 685 00:25:05,120 --> 00:25:08,880 we investigated how to solve that via 686 00:25:07,360 --> 00:25:11,039 configuration or resource options and 687 00:25:08,880 --> 00:25:12,640 implemented it and this is the tight loop 688 00:25:11,039 --> 00:25:14,960 again that we talked about and we went 689 00:25:12,640 --> 00:25:16,000 through about 17 iterations of these in 690 00:25:14,960 --> 00:25:17,120 total 691 00:25:16,000 --> 00:25:19,360 and every time we went through an 692 00:25:17,120 --> 00:25:20,480 iteration if any increase to the 693 00:25:19,360 --> 00:25:23,440 performance 694 00:25:20,480 --> 00:25:25,120 or decrease in error rate was found 695 00:25:23,440 --> 00:25:26,960 we actually implemented that as the 696 00:25:25,120 --> 00:25:29,440 iterations continued 697 00:25:26,960 --> 00:25:31,200 and if you look at the graph here uh 698 00:25:29,440 --> 00:25:33,760 what you can see is that against our 699 00:25:31,200 --> 00:25:36,320 baseline our tuned 700 00:25:33,760 --> 00:25:37,600 thanos actually performed 701 00:25:36,320 --> 00:25:39,760 much better 702 00:25:37,600 --> 00:25:42,880 than what we had seen previously 703 00:25:39,760 --> 00:25:44,400 and not only was performance up uh we 704 00:25:42,880 --> 00:25:46,000 were able to get error rates down from 705 00:25:44,400 --> 00:25:49,039 about 21 percent 706 00:25:46,000 --> 00:25:51,360 to one percent 707 00:25:49,039 --> 00:25:52,960 so uh summary of the changes uh for 708 00:25:51,360 --> 00:25:55,760 those that are interested things that 709 00:25:52,960 --> 00:25:57,679 had a positive impact were increasing 710 00:25:55,760 --> 00:26:00,320 resources surprise surprise like memory 711 00:25:57,679 --> 00:26:02,880 and cpu specifically for uh store and 712 00:26:00,320 --> 00:26:04,480 query pods uh thanos store we actually 713 00:26:02,880 --> 00:26:06,400 changed that from being a time based uh 714 00:26:04,480 --> 00:26:08,480 partitioning scheme to hash based which 715 00:26:06,400 -->
00:26:10,480 is just a configuration option and this 716 00:26:08,480 --> 00:26:12,559 led to some better query distribution so 717 00:26:10,480 --> 00:26:14,320 the thanos store or store instances that 718 00:26:12,559 --> 00:26:16,159 were responsible for the most commonly 719 00:26:14,320 --> 00:26:18,480 queried ranges didn't always get 720 00:26:16,159 --> 00:26:20,559 obliterated uh we also increased the 721 00:26:18,480 --> 00:26:21,679 thanos store grpc concurrency just so 722 00:26:20,559 --> 00:26:24,720 more things could get through the door 723 00:26:21,679 --> 00:26:27,279 at once we moved from an in cluster 724 00:26:24,720 --> 00:26:29,600 memcached to an elasticache managed 725 00:26:27,279 --> 00:26:31,200 memcached instance and you know this 726 00:26:29,600 --> 00:26:33,840 gave us kind of benefits of a managed 727 00:26:31,200 --> 00:26:36,000 service but also it gave us access to 728 00:26:33,840 --> 00:26:37,279 much larger nodes without 729 00:26:36,000 --> 00:26:40,400 having to worry about resource 730 00:26:37,279 --> 00:26:41,760 constraints inside kubernetes and 731 00:26:40,400 --> 00:26:44,640 working on nodes where everybody 732 00:26:41,760 --> 00:26:47,120 else is also running workloads 733 00:26:44,640 --> 00:26:49,279 so we also tuned thanos store index 734 00:26:47,120 --> 00:26:51,200 cache to increase the timeout increase 735 00:26:49,279 --> 00:26:53,840 async concurrency and increase the 736 00:26:51,200 --> 00:26:55,919 buffer size but we actually decreased 737 00:26:53,840 --> 00:26:57,600 the max number of idle connections just 738 00:26:55,919 --> 00:27:00,880 so things would get booted a little more 739 00:26:57,600 --> 00:27:02,799 aggressively we also enabled thanos 740 00:27:00,880 --> 00:27:05,120 store caching bucket which actually just 741 00:27:02,799 --> 00:27:06,720 speeds up chunk loading 742 00:27:05,120 --> 00:27:08,480 and things that we had tried that really 743 00:27:06,720 --> 00:27:11,679
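the time-based to hash-based store partitioning change mentioned above can be illustrated with a toy sketch — thanos implements this with hashmod relabeling over block labels, and here md5 simply stands in for the actual hash; the point is that hashing block ids spreads the commonly queried (recent) blocks across all store shards instead of piling them onto whichever shard owns the newest time range, and the names and shard count below are illustrative

```python
# Toy illustration of hash-based store partitioning: each shard serves
# the blocks whose hash lands in its bucket, so hot recent blocks end up
# spread across shards rather than concentrated in one time range.
# md5 stands in for thanos' actual hashmod implementation.
import hashlib

def shard_for(block_id, n_shards):
    digest = hashlib.md5(block_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# a batch of hypothetical block ids spreads over all four shards,
# instead of all belonging to the single shard that owns "today"
assignments = {b: shard_for(b, 4) for b in (f"block-{i:03d}" for i in range(100))}
```

the assignment is deterministic, so every querier agrees on which shard serves which block, while the load from the most recent blocks is no longer focused on one store instance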
didn't work out for us 744 00:27:08,480 --> 00:27:13,279 we tried to overly shard thanos store 745 00:27:11,679 --> 00:27:15,919 and we basically found it had 746 00:27:13,279 --> 00:27:19,120 diminishing returns uh after a certain 747 00:27:15,919 --> 00:27:21,200 point uh we tried thanos 748 00:27:19,120 --> 00:27:23,679 store shard replicas 749 00:27:21,200 --> 00:27:26,320 but we really found no benefit running 750 00:27:23,679 --> 00:27:28,159 that with kubernetes in just that 751 00:27:26,320 --> 00:27:30,240 like if a 752 00:27:28,159 --> 00:27:33,840 store shard goes down kubernetes 753 00:27:30,240 --> 00:27:35,279 replaces that pod pretty quickly and the 754 00:27:33,840 --> 00:27:37,360 way this worked was that when you 755 00:27:35,279 --> 00:27:39,440 queried it would 756 00:27:37,360 --> 00:27:41,039 actually query across both replicas at 757 00:27:39,440 --> 00:27:43,120 the same time so there wasn't any real 758 00:27:41,039 --> 00:27:44,960 performance gained from this 759 00:27:43,120 --> 00:27:47,440 and also we had tried thanos store in 760 00:27:44,960 --> 00:27:49,679 memory index caching but it just put too 761 00:27:47,440 --> 00:27:51,279 much memory pressure on our kubernetes 762 00:27:49,679 --> 00:27:54,240 nodes 763 00:27:51,279 --> 00:27:56,559 so wrapping it all up the tldr slide is 764 00:27:54,240 --> 00:27:58,799 that we identified a lack of visibility 765 00:27:56,559 --> 00:28:01,039 into the performance of our systems we 766 00:27:58,799 --> 00:28:02,960 ended up designing and implementing this 767 00:28:01,039 --> 00:28:05,600 performance testing framework 768 00:28:02,960 --> 00:28:07,679 and we established baseline performance 769 00:28:05,600 --> 00:28:09,279 and we also evaluated other solutions 770 00:28:07,679 --> 00:28:11,279 with that framework 771 00:28:09,279 --> 00:28:12,960 this allowed us to draw then later prove 772 00:28:11,279 --> 00:28:15,440 out some conclusions from our test 773
00:28:12,960 --> 00:28:18,399 results and using that same framework in 774 00:28:15,440 --> 00:28:20,559 a tight loop fashion we were able to 775 00:28:18,399 --> 00:28:23,120 successfully tune our thanos and 776 00:28:20,559 --> 00:28:24,640 prometheus instances 777 00:28:23,120 --> 00:28:27,279 so for more information i'm going to 778 00:28:24,640 --> 00:28:29,520 make the slides the locust aiohttp 779 00:28:27,279 --> 00:28:31,760 client and the grafana test generator 780 00:28:29,520 --> 00:28:32,559 available at the github repo you can see 781 00:28:31,760 --> 00:28:34,720 there 782 00:28:32,559 --> 00:28:36,159 and if you want to contact me or if you 783 00:28:34,720 --> 00:28:37,440 have questions or you want to share some 784 00:28:36,159 --> 00:28:39,360 opinions 785 00:28:37,440 --> 00:28:41,679 my email is on the screen there and also 786 00:28:39,360 --> 00:28:43,840 you can hit me up at the cncf or 787 00:28:41,679 --> 00:28:45,760 grafana slack and my handle is just 788 00:28:43,840 --> 00:28:48,559 brian grew 789 00:28:45,760 --> 00:28:51,120 so uh thank you everybody for 790 00:28:48,559 --> 00:28:53,360 sitting through the presentation and now 791 00:28:51,120 --> 00:28:54,480 we will move on to 792 00:28:53,360 --> 00:28:56,000 q a 793 00:28:54,480 --> 00:28:57,760 thanks so much 794 00:28:56,000 --> 00:28:59,360 hey welcome back everybody brian thanks 795 00:28:57,760 --> 00:29:01,600 for an amazing talk and the wonderful 796 00:28:59,360 --> 00:29:03,039 memes um i didn't know what hootsuite was 797 00:29:01,600 --> 00:29:04,880 before your recording so thanks so much 798 00:29:03,039 --> 00:29:06,399 for doing that for us very 799 00:29:04,880 --> 00:29:08,000 enjoyable 800 00:29:06,399 --> 00:29:10,000 so brian we've got a couple of questions 801 00:29:08,000 --> 00:29:12,240 that have come through in the chat um 802 00:29:10,000 --> 00:29:14,640 the first one's a double question the 803 00:29:12,240 --> 00:29:16,960 user behavior was
codified as python and 804 00:29:14,640 --> 00:29:16,960 locust 805 00:29:17,360 --> 00:29:20,559 go ahead sorry yeah no 806 00:29:19,440 --> 00:29:22,159 if you want to finish reading the second 807 00:29:20,559 --> 00:29:23,840 half that's fine sure and the second 808 00:29:22,159 --> 00:29:26,320 half is do the number of developers 809 00:29:23,840 --> 00:29:29,679 affect the number of load test users 810 00:29:26,320 --> 00:29:32,640 were implemented right uh so yeah uh the 811 00:29:29,679 --> 00:29:35,600 first half we did choose uh several 812 00:29:32,640 --> 00:29:36,960 dashboards and uh the user behavior of 813 00:29:35,600 --> 00:29:39,360 loading those dashboards or what it 814 00:29:36,960 --> 00:29:42,240 would look like coming from grafana 815 00:29:39,360 --> 00:29:45,120 that was definitely codified into uh the 816 00:29:42,240 --> 00:29:46,880 tests we wrote and it did impact uh the 817 00:29:45,120 --> 00:29:49,039 number of users right i mean we 818 00:29:46,880 --> 00:29:51,679 found we had relatively low uh 819 00:29:49,039 --> 00:29:52,960 simultaneous users on grafana at any one 820 00:29:51,679 --> 00:29:56,240 given time 821 00:29:52,960 --> 00:29:59,120 so we decided to uh max out at about uh 822 00:29:56,240 --> 00:30:00,320 10 virtual users for locust 823 00:29:59,120 --> 00:30:02,399 um 824 00:30:00,320 --> 00:30:04,240 and then uh 825 00:30:02,399 --> 00:30:06,159 and then i think yeah like we were 826 00:30:04,240 --> 00:30:08,399 trying to determine i mean why the 827 00:30:06,159 --> 00:30:09,679 performance was slow and not try to find 828 00:30:08,399 --> 00:30:11,440 the breaking point of the system so i 829 00:30:09,679 --> 00:30:13,279 mean that's really the reason we didn't 830 00:30:11,440 --> 00:30:16,159 uh ratchet up the number of virtual 831 00:30:13,279 --> 00:30:16,159 users too high 832 00:30:17,120 --> 00:30:21,279 excellent thanks brian and team please 833 00:30:19,520 --> 00:30:23,520 do continue to put your
questions 834 00:30:21,279 --> 00:30:24,880 through into the chat the great av 835 00:30:23,520 --> 00:30:26,399 guys etc in the background and the 836 00:30:24,880 --> 00:30:28,240 moderators will post them through to us 837 00:30:26,399 --> 00:30:30,399 so please keep more questions coming 838 00:30:28,240 --> 00:30:31,120 we've got about 15 minutes we'll finish 839 00:30:30,399 --> 00:30:32,240 at 840 00:30:31,120 --> 00:30:35,279 uh 841 00:30:32,240 --> 00:30:37,919 12 25 australian eastern standard time 842 00:30:35,279 --> 00:30:39,600 on the east coast for sydney apologies i 843 00:30:37,919 --> 00:30:41,520 can't work out the rest of the six or 844 00:30:39,600 --> 00:30:43,279 seven time zones in australia on the fly 845 00:30:41,520 --> 00:30:46,000 but hopefully you can so we've got about 846 00:30:43,279 --> 00:30:48,159 15 minutes left before we go to lunch 847 00:30:46,000 --> 00:30:49,039 the second question is 848 00:30:48,159 --> 00:30:51,919 what's 849 00:30:49,039 --> 00:30:53,279 sorry there should be a what's there 850 00:30:51,919 --> 00:30:55,600 what's the most common 851 00:30:53,279 --> 00:30:57,200 type of metrics you have seen people 852 00:30:55,600 --> 00:30:59,279 overlook 853 00:30:57,200 --> 00:31:01,679 that would add value to their 854 00:30:59,279 --> 00:31:03,360 polling and graphing 855 00:31:01,679 --> 00:31:05,519 uh yeah for sure so i've been racking my 856 00:31:03,360 --> 00:31:07,760 brain over this one a bit uh 857 00:31:05,519 --> 00:31:09,919 but i mean in terms of general metrics 858 00:31:07,760 --> 00:31:11,760 for services or applications i mean it's 859 00:31:09,919 --> 00:31:13,840 kind of hard to say because everyone's 860 00:31:11,760 --> 00:31:16,320 stack is so different right 861 00:31:13,840 --> 00:31:19,600 but in general i do find a lot of people 862 00:31:16,320 --> 00:31:21,279 they tend to focus on uh resources right 863 00:31:19,600 --> 00:31:24,000 where's my memory at
where's my cpu and 864 00:31:21,279 --> 00:31:25,600 i mean that is very important uh but i 865 00:31:24,000 --> 00:31:28,799 always like to think of things in terms 866 00:31:25,600 --> 00:31:30,559 of user experience and uh whether it's a 867 00:31:28,799 --> 00:31:31,679 real user out there in the world that's 868 00:31:30,559 --> 00:31:32,559 external 869 00:31:31,679 --> 00:31:34,880 or 870 00:31:32,559 --> 00:31:37,120 whether it's like another service that's 871 00:31:34,880 --> 00:31:41,039 using it or like for our use case where 872 00:31:37,120 --> 00:31:43,679 we were uh providing grafana to our 873 00:31:41,039 --> 00:31:44,880 users and i mean in a lot of 874 00:31:43,679 --> 00:31:47,039 these cases 875 00:31:44,880 --> 00:31:49,039 like latency does come in handy and as i 876 00:31:47,039 --> 00:31:50,799 said in the talk as well 877 00:31:49,039 --> 00:31:53,039 not just running with an average or 878 00:31:50,799 --> 00:31:55,440 maxes but actually understanding 879 00:31:53,039 --> 00:31:57,600 uh the quantiles and kind of how that 880 00:31:55,440 --> 00:31:59,360 aligns to the slo and how many users are 881 00:31:57,600 --> 00:32:00,960 being impacted uh through those 882 00:31:59,360 --> 00:32:02,080 quantiles 883 00:32:00,960 --> 00:32:04,240 but i think something that's also 884 00:32:02,080 --> 00:32:06,960 generally overlooked is like the 885 00:32:04,240 --> 00:32:09,120 impact of downstream or managed uh 886 00:32:06,960 --> 00:32:11,919 services your service may be dependent 887 00:32:09,120 --> 00:32:14,000 on and you know remembering to take 888 00:32:11,919 --> 00:32:15,200 their performance and their status into 889 00:32:14,000 --> 00:32:16,720 account 890 00:32:15,200 --> 00:32:18,880 both when you're performing your 891 00:32:16,720 --> 00:32:21,360 monitoring or writing dashboards but 892 00:32:18,880 --> 00:32:25,720 also when you're coming up with 893 00:32:21,360 --> 00:32:25,720 slos for your
service as well 894 00:32:32,880 --> 00:32:36,159 i'm not getting any audio from you 895 00:32:34,240 --> 00:32:37,679 michael 896 00:32:36,159 --> 00:32:39,039 thank you i fell for the oldest trap in 897 00:32:37,679 --> 00:32:41,039 the book 898 00:32:39,039 --> 00:32:43,279 sorry the next question is and apologies 899 00:32:41,039 --> 00:32:45,679 for my pronunciations here greek was not 900 00:32:43,279 --> 00:32:47,840 my strength what is the benefit of 901 00:32:45,679 --> 00:32:50,480 thanos over prometheus 902 00:32:47,840 --> 00:32:53,519 uh yeah for sure um so the reason why 903 00:32:50,480 --> 00:32:55,200 hootsuite went with uh thanos is that 904 00:32:53,519 --> 00:32:58,960 we had a requirement that we wanted to 905 00:32:55,200 --> 00:33:02,000 provide our users with uh up to a year 906 00:32:58,960 --> 00:33:03,440 or so of metrics and 907 00:33:02,000 --> 00:33:05,200 we didn't necessarily want to keep all 908 00:33:03,440 --> 00:33:06,720 those metrics live in prometheus and 909 00:33:05,200 --> 00:33:08,480 have a massive 910 00:33:06,720 --> 00:33:10,960 tsdb 911 00:33:08,480 --> 00:33:13,519 so the benefit for us uh was really 912 00:33:10,960 --> 00:33:16,399 implementing thanos as that uh long-term 913 00:33:13,519 --> 00:33:19,600 store uh but thanos uh and the other 914 00:33:16,399 --> 00:33:22,320 side of it is also uh the uh high 915 00:33:19,600 --> 00:33:24,960 availability side right uh we run uh 916 00:33:22,320 --> 00:33:27,760 multiple uh prometheus or promethei 917 00:33:24,960 --> 00:33:30,720 pairs uh and thanos uh provides the 918 00:33:27,760 --> 00:33:33,120 query across that pair uh the ha pair of 919 00:33:30,720 --> 00:33:34,640 prometheus instances and it will also uh 920 00:33:33,120 --> 00:33:36,320 dedupe the results on the other side 921 00:33:34,640 --> 00:33:37,120 through the query mechanism 922 00:33:36,320 --> 00:33:39,600 so 923 00:33:37,120 --> 00:33:41,519 it kind of bulks out prometheus where 924
00:33:39,600 --> 00:33:44,080 maybe vanilla prometheus falls a bit 925 00:33:41,519 --> 00:33:45,840 short in terms of ha or uh long-term 926 00:33:44,080 --> 00:33:48,640 storage and uh long-term storage on 927 00:33:45,840 --> 00:33:50,640 thanos uh can be downsampled as 928 00:33:48,640 --> 00:33:52,480 well 929 00:33:50,640 --> 00:33:53,919 excellent thank you 930 00:33:52,480 --> 00:33:56,000 well that's the last formal question 931 00:33:53,919 --> 00:33:57,840 please do ask more questions they did 932 00:33:56,000 --> 00:34:00,080 foolishly say that i'm allowed to ad-lib 933 00:33:57,840 --> 00:34:00,799 which is always very dangerous 934 00:34:00,080 --> 00:34:02,640 so 935 00:34:00,799 --> 00:34:04,799 my ad-lib question is what was the 936 00:34:02,640 --> 00:34:07,279 biggest surprise you found when running 937 00:34:04,799 --> 00:34:09,200 your first test 938 00:34:07,279 --> 00:34:10,159 uh yeah so the biggest surprise was just 939 00:34:09,200 --> 00:34:12,320 uh 940 00:34:10,159 --> 00:34:14,720 how absolutely horrific the performance 941 00:34:12,320 --> 00:34:18,480 on querying some of these series was 942 00:34:14,720 --> 00:34:19,679 and uh it became pretty obvious uh 943 00:34:18,480 --> 00:34:22,560 when we started running these tests 944 00:34:19,679 --> 00:34:25,200 manually to begin with uh just like 945 00:34:22,560 --> 00:34:27,040 just testing if stuff worked uh 946 00:34:25,200 --> 00:34:31,040 it became pretty obvious who the 947 00:34:27,040 --> 00:34:32,639 worst offenders were and uh 948 00:34:31,040 --> 00:34:34,320 like in terms of the series right like 949 00:34:32,639 --> 00:34:36,079 not the people but yeah it became 950 00:34:34,320 --> 00:34:38,079 pretty obvious which the worst uh metric 951 00:34:36,079 --> 00:34:39,440 series were so when we did perform 952 00:34:38,079 --> 00:34:41,919 subsequent investigations it kind of 953 00:34:39,440 --> 00:34:43,359 gave us the okay i kind of know where 954
00:34:41,919 --> 00:34:45,200 the bodies are buried i know where to 955 00:34:43,359 --> 00:34:47,679 look 956 00:34:45,200 --> 00:34:49,280 excellent and what was the biggest waste 957 00:34:47,679 --> 00:34:50,720 of time like you spent all of this time 958 00:34:49,280 --> 00:34:52,240 coding something up and it didn't 959 00:34:50,720 --> 00:34:54,720 actually show any improvement in performance and 960 00:34:52,240 --> 00:34:57,760 what could you learn from that 961 00:34:54,720 --> 00:34:59,359 right uh for sure so i mean 962 00:34:57,760 --> 00:35:01,599 uh like are you asking the biggest waste 963 00:34:59,359 --> 00:35:02,960 of time that we've hit or 964 00:35:01,599 --> 00:35:04,720 yeah where you thought you would 965 00:35:02,960 --> 00:35:07,280 find value in a space and you didn't 966 00:35:04,720 --> 00:35:09,040 actually find any value uh yeah that's a 967 00:35:07,280 --> 00:35:12,000 good question i mean in terms of the 968 00:35:09,040 --> 00:35:14,240 overall uh system and the approach which 969 00:35:12,000 --> 00:35:15,280 is using locust uh for performance and load 970 00:35:14,240 --> 00:35:17,040 testing 971 00:35:15,280 --> 00:35:18,880 we've found value there i don't think 972 00:35:17,040 --> 00:35:19,839 we've fallen down anywhere 973 00:35:18,880 --> 00:35:22,320 um 974 00:35:19,839 --> 00:35:24,880 i think the only thing uh 975 00:35:22,320 --> 00:35:26,160 that's maybe fallen slightly short is 976 00:35:24,880 --> 00:35:27,920 just uh 977 00:35:26,160 --> 00:35:29,040 now that this has worked so well for us 978 00:35:27,920 --> 00:35:31,680 we've been trying to figure out how to 979 00:35:29,040 --> 00:35:32,960 scale it uh to the rest of the team and 980 00:35:31,680 --> 00:35:35,040 you know we're trying to figure out if 981 00:35:32,960 --> 00:35:37,200 locust is the tool to do that going 982 00:35:35,040 --> 00:35:38,800 forward and we're actually not sure um 983 00:35:37,200 --> 00:35:41,040 so if you want to talk about perhaps 984
00:35:38,800 --> 00:35:43,760 burning time then yeah we put a lot of 985 00:35:41,040 --> 00:35:45,359 time into this and into locust and you 986 00:35:43,760 --> 00:35:46,560 know we may be starting over with 987 00:35:45,359 --> 00:35:48,000 something that's slightly more 988 00:35:46,560 --> 00:35:49,680 scalable or that we can roll out to the 989 00:35:48,000 --> 00:35:51,680 whole org 990 00:35:49,680 --> 00:35:53,440 excellent thank you as an ex-software 991 00:35:51,680 --> 00:35:55,359 tester i never found testing or 992 00:35:53,440 --> 00:35:56,720 performance testing was a waste of time 993 00:35:55,359 --> 00:35:58,160 even though the people on the 994 00:35:56,720 --> 00:36:01,680 other side of the quad sometimes 995 00:35:58,160 --> 00:36:01,680 wondered what we were doing with ourselves 996 00:36:02,000 --> 00:36:05,760 excellent and what would be your one key 997 00:36:03,839 --> 00:36:08,320 takeaway if someone walked in and 998 00:36:05,760 --> 00:36:09,599 said what should i take away from 999 00:36:08,320 --> 00:36:11,520 this talk right 1000 00:36:09,599 --> 00:36:13,280 yeah i think the 1001 00:36:11,520 --> 00:36:14,160 best thing we've got out of the system 1002 00:36:13,280 --> 00:36:15,920 1003 00:36:14,160 --> 00:36:19,680 was definitely 1004 00:36:15,920 --> 00:36:22,720 tuning thanos and just how easy it was 1005 00:36:19,680 --> 00:36:23,920 to make a change and run the 1006 00:36:22,720 --> 00:36:25,280 performance test and make a change 1007 00:36:23,920 --> 00:36:27,359 and run the test again and then actually see 1008 00:36:25,280 --> 00:36:30,800 those benefits or 1009 00:36:27,359 --> 00:36:33,520 detriments that 1010 00:36:30,800 --> 00:36:35,920 sometimes happened as we went 1011 00:36:33,520 --> 00:36:38,720 along so i mean the one takeaway is just 1012 00:36:35,920 --> 00:36:40,560 that yes performance testing is useful 1013 00:36:38,720 -->
00:36:43,920 even in infrastructure even if it's not 1014 00:36:40,560 --> 00:36:44,960 an end user facing product your users 1015 00:36:43,920 --> 00:36:46,400 are the developers at the end of the day 1016 00:36:44,960 --> 00:36:48,160 well thank you 1017 00:36:46,400 --> 00:36:49,920 good good excellent 1018 00:36:48,160 --> 00:36:52,560 that line between the 1019 00:36:49,920 --> 00:36:53,920 developer and the end user is sometimes 1020 00:36:52,560 --> 00:36:55,440 a little blurred with management and 1021 00:36:53,920 --> 00:36:57,200 the cost of performance testing and 1022 00:36:55,440 --> 00:36:59,200 continuous integration so yeah i'm a 1023 00:36:57,200 --> 00:37:00,960 great believer in it and as a manager 1024 00:36:59,200 --> 00:37:01,839 i'm encouraging it so yes wonderful 1025 00:37:00,960 --> 00:37:03,119 excellent 1026 00:37:01,839 --> 00:37:04,480 good good 1027 00:37:03,119 --> 00:37:06,400 i think that's all the questions 1028 00:37:04,480 --> 00:37:08,160 we've had come through 1029 00:37:06,400 --> 00:37:10,079 we might give people an early mark for 1030 00:37:08,160 --> 00:37:11,920 lunch if they like you'll get a few 1031 00:37:10,079 --> 00:37:16,320 minutes early but brian will be heading 1032 00:37:11,920 --> 00:37:18,480 across to the chat in venueless 1033 00:37:16,320 --> 00:37:20,960 if you haven't found the channels yet 1034 00:37:18,480 --> 00:37:23,119 this is my second lca took me a while to 1035 00:37:20,960 --> 00:37:25,040 find them scroll down on the left go to 1036 00:37:23,119 --> 00:37:27,040 browse channels and then you'll find a 1037 00:37:25,040 --> 00:37:28,320 whole pile of channels to join and if 1038 00:37:27,040 --> 00:37:29,680 that's why you missed out on some other 1039 00:37:28,320 --> 00:37:31,280 things sorry that you didn't hear 1040 00:37:29,680 --> 00:37:33,440 about them earlier but please do join 1041 00:37:31,280 --> 00:37:35,040 brian there and a great thank you thank 1042
00:37:33,440 --> 00:37:36,880 you for taking your time this evening 1043 00:37:35,040 --> 00:37:38,800 thanks to your family and plants as 1044 00:37:36,880 --> 00:37:40,160 well for the after hours work you're 1045 00:37:38,800 --> 00:37:42,000 having to do in canada and please do 1046 00:37:40,160 --> 00:37:43,280 stay warm over there 1047 00:37:42,000 --> 00:37:47,800 great thanks so much 1048 00:37:43,280 --> 00:37:47,800 thanks bye everyone enjoy lunch