[Music]

Welcome back, everybody, from morning tea. I hope you had a great snack time to catch up with people and have some refreshment. Apologies in advance: this is my first time emceeing, and it all happened within the last ten minutes, so this is really agile software development on the fly. Apologies if I put my foot in my mouth. This morning I'd like to introduce Brian, who is coming to us from Ontario in Canada, where it is currently minus 30 outside. For all of those of you in Sydney who are suffering through 30-plus, or on the west coast of Australia where it's even hotter, just think of them in Canada at minus 30.
I think our sunny Australian coasts are a little warmer than Canada at this time. Brian is joining us so we can learn about how Hootsuite used load testing to evaluate the performance of Prometheus- and Grafana-based metrics platforms, developing a framework based on the open source tool Locust that can be used to quickly evaluate the performance impact of configuration and component changes prior to deployment. Now, Brian's talk is pre-recorded in case we have any technical glitches across the Pacific, but Brian will be joining us again for a Q&A at the end. Brian is live right now; that is him there, nodding along at the moment, and those plants in the background are real, not even a virtual background, so we really do have him here. Thanks, and enjoy the session. There will be a few seconds' pause as we switch over to the recording. See you soon.

Welcome to the inaugural voyage of Benchmarking Prometheus Metrics Platforms. My name is Brian Gru, and I'm a senior
software developer at Hootsuite, where I mainly focus on observability.

So, some background: we implemented a Prometheus, Thanos, and Grafana stack to support multi-window, multi-burn-rate alerting for most of our product software, but also for some of our infrastructure as well. After the initial implementation, we received reports from developers that their dashboards were loading slowly, and while we could load these dashboards and see that some were slow, others were performing fine, so we really didn't have an idea of the extent of the problem. We had no empirical data to prove that things were actually running slow, or how many things were running slow. So we set out and thought to ourselves: how can we verify and also tune the performance of the current metrics platform we have, and how could we also benchmark other Prometheus-based solutions? Our answer was, basically, that we wanted to performance-test our infrastructure. So you may be asking yourself: why performance-test your
infrastructure? It's our belief that infrastructure critical to assisting product development should be treated as just about as critical as the product itself. Performance testing also helps you understand baseline performance, validate changes against existing baselines before those changes actually get integrated, and watch for regressions in the system itself. It also allows you to be proactive: if you're running these tests in an automated fashion, or manually on a cadence, you can catch problems before your users actually start complaining.

First, we set out to figure out how to model these tests correctly. Generally speaking, we wanted to understand end-user performance, and we wanted to understand it from the perspective of the client our users are accessing to query the metrics platform. More specifically, we were looking to assess the performance of our Prometheus and Thanos platform, which was accessed primarily through Grafana dashboards. To this end, we thought we would model the performance
tests themselves after actual Grafana dashboard loads.

So the first question became: how is Grafana loading dashboards? Chrome developer tools to the rescue here. We basically loaded a dashboard and watched what was happening. Unsurprisingly, Grafana makes HTTP requests to its own back end, which then proxies these requests over to the Prometheus or Thanos back end, and the request path is basically the same as what you would put on a Prometheus API request, so it's easy to copy and paste. The next question became: what are these requests actually doing? Again using Chrome developer tools, we could see first that the template variable queries were being run, and that there is actually dependency resolution between the variables. If you had multiple template variables, Grafana would figure out the order in which to make those requests and then request them top-down. Next, Grafana was loading the panel queries, and it would start by requesting all visible
panels concurrently; then, as you scrolled, it would request the other panel queries as those panels actually became visible.

So the question became: do we have to replicate this complex behavior precisely? My thought was no; you just need to determine the best way to approximate the client's behavior in a way that your performance tests still make sense. For this use case, we decided to approximate the behavior I just described by first serially loading all the variable queries and then loading all the panel queries in parallel. This would provide us a worst-case scenario for loading a dashboard, in which all panels are loading at the same time because all are visible at the same time.

Next, the question was: what specific requests should we actually be making? As I said before, we wanted to simulate user behavior, and we came to the conclusion that no one knows how to break our stuff better than our own users, our developers. So we should actually model these tests based on
existing dashboards. The alternative would have been to generate some synthetic load, perhaps requests that we know are problematic, but that would not have been as true to form as behaving like a user.

So the last question was: what is Grafana actually sending in these requests? Surprise, surprise, it's PromQL, with the variables evaluated. Wait, what, more variables? Yes. There are the template variables we had already talked about, which are either static or generated dynamically from something like a PromQL query, and there are also built-in variables, things like $__range and $__interval.

So how did we decide to handle variable evaluation? Well, for the built-in variables we decided to just set some sane defaults, like five minutes for the interval, and the range was obviously going to be calculated based off the start and end dates. For the query-based template variables, we decided that we would still make those requests to Prometheus or Thanos, and
that this would help better mimic the dashboard load. However, we also decided that we would use the template variables' default values in the actual panel queries themselves, as these were the default queries, with those default values for those variables, that were actually run when the dashboard came up; besides, we didn't have a great way to say which value we should be substituting anyway, other than the default.

So, to recap our modeling section: we decided to make multiple HTTP requests to our metrics back end to replicate the load. We are going to serially request all template variables first, then move on to concurrently request all the panel queries. We are going to model the request order and content on existing dashboards that we already had. And we are going to set sane built-in and template variable default values, but still make those variable query requests to the back end.
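As a rough sketch of what that variable handling can look like, here is one way to substitute built-in variable defaults into a panel query and build a Prometheus `query_range` request path. The function name and default values are illustrative, not Hootsuite's actual code; only the `/api/v1/query_range` parameters come from the Prometheus HTTP API.

```python
from urllib.parse import urlencode

def build_query_range_path(expr, start, end, interval="5m", step=60):
    """Evaluate Grafana-style built-in variables with sane defaults and
    build a Prometheus /api/v1/query_range request path.

    `interval` defaults to five minutes, and the range is calculated
    from the start and end timestamps, mirroring the talk."""
    range_seconds = int(end - start)
    evaluated = (expr
                 .replace("$__interval", interval)
                 .replace("$__range", f"{range_seconds}s"))
    params = urlencode({"query": evaluated,
                        "start": start, "end": end, "step": step})
    return f"/api/v1/query_range?{params}"
```

For example, `build_query_range_path("rate(http_requests_total[$__interval])", 1_600_000_000, 1_600_003_600)` yields a path whose `query` parameter is the PromQL with the interval default already substituted, which is essentially what you see in the browser's network tab when Grafana proxies a panel query.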
Before we actually set out to do this, we wanted to come up with a testing methodology and also some boundaries for our tests. In the center here you see a standard software development life cycle that everyone should be familiar with, and we thought we could apply this to our performance testing as well. We could come up with a hypothesis, say, "this thing is slow because of this or that," and then come up with a design to prove or disprove that hypothesis: maybe we can throw more resources at it, or maybe we can tweak the configuration in some manner. Then we would actually make those changes once we figured out what to do, then run a standardized test suite to verify and benchmark those changes, and the results from the test suite could feed back into a subsequent hypothesis.

So, to recap some of the tenets of this methodology: it's systematic, in that it provides us a framework to develop and prove out these hypotheses. It's consistent, in that the same tests are run
on every iteration, which allows every run to be compared to previous runs. It's data-driven: the test data actually feeds back into the decision-making process, and in addition to the test data itself, external data can also be integrated into the analysis phase. So if you have metrics from somewhere else, or logs, or traces, or whatever else you may have, you can use that to influence your analysis as well. And lastly, the methodology is iterative: it results in a tight loop where changes can be evaluated quickly and hypotheses can evolve with each iteration, which hopefully leads to an increase in velocity.

Just a quick note about data validity here: a data-driven process like this one is really only as good as its input data. So we started thinking about what happens during one of these tests if a service goes offline, or if it temporarily slows down, say because someone else is using the service in the testing environment.
So we decided to smooth out the data by running the tests multiple times and then using an aggregate of the results, instead of running each test once and using that, in case a test run happened to be botched or influenced by external factors.

Lastly, we needed to decide on some boundaries for our tests. We found our longest common use case for a dashboard was about 30 days, and we noticed Grafana didn't have many concurrent users, so we decided to cap the maximum number of virtual users we simulate at about 10.
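A minimal sketch of that run matrix and the smoothing step, using only the standard library. The numbers mirror the talk; the function names are mine, not the actual test suite's.

```python
import itertools
import statistics

USER_LOADS = (1, 5, 10)        # simulated virtual users
WINDOWS_DAYS = (1, 3, 7, 30)   # dashboard time ranges
REPEATS = 10                   # runs per combination, to smooth the data

def test_matrix():
    """Every (users, window) combination; each is run REPEATS times."""
    return list(itertools.product(USER_LOADS, WINDOWS_DAYS))

def aggregate(run_times_ms):
    """Aggregate repeated runs; the median is robust against a single
    botched run influenced by external factors."""
    return statistics.median(run_times_ms)
```

With these boundaries the matrix is 12 combinations, so 120 test runs per suite; using the median (rather than the mean) means one outlier run, say a 5-second load caused by a noisy neighbor, does not drag the aggregate around.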
We then decided that we would run each test suite for user loads of one, five, and ten, and each of those for one-day, three-day, seven-day, and 30-day windows. Each of these combinations would actually be run ten times to smooth out the data, so that any external factors influencing a given run would be averaged out.

So now we had a methodology; next we wanted to come up with a testing platform and a design for our tests. A little note on technology selection: we needed a load testing tool that could make HTTP requests, and there are lots out there; we also needed to simulate multiple users, and many load testing tools fit those requirements. But Locust had actually been used in the past at Hootsuite for load testing different things, and we'd actually used it for load testing individual Prometheus servers. Time had also been invested in the tool to make sure it ran on Kubernetes for us, plus some other tweaks here and
there. So, given all this, we decided to move forward with Locust.

To bring forward some of our requirements from the modeling and methodology sections and how these are implemented in our tool: we needed to serially make HTTP requests for the template variables, which is met by the default Locust HTTP client. We needed to concurrently request all the panel queries, which is something Locust does not handle. We also needed to execute requests for multiple dashboards, multiple times, and provide an aggregate of the results, which isn't really met by Locust either, but it's something we'll address in our test design in a minute. And we needed to expose data to feed back into the analysis phase of our testing loop, which was met by Locust: it has a web UI that allows you to export CSVs, and while that's not terribly automated, it worked for us. We could just take that data and dump it into another sink, like Google Sheets, which is what we used for doing most of
our comparisons 356 00:13:15,200 --> 00:13:18,639 so the first step was adding concurrency 357 00:13:16,880 --> 00:13:20,880 into locust and 358 00:13:18,639 --> 00:13:22,880 locus itself is python based both the 359 00:13:20,880 --> 00:13:24,720 tool and the authoring and as uh people 360 00:13:22,880 --> 00:13:26,880 who are familiar with python no 361 00:13:24,720 --> 00:13:29,040 concurrency isn't necessarily known to 362 00:13:26,880 --> 00:13:32,639 be one of its strong suits but we did 363 00:13:29,040 --> 00:13:34,639 find this aio async io http client to be 364 00:13:32,639 --> 00:13:36,240 the kind of recommended solution uh for 365 00:13:34,639 --> 00:13:38,399 doing this thing so we proceeded to 366 00:13:36,240 --> 00:13:39,839 implement a locus client based on this 367 00:13:38,399 --> 00:13:41,360 package 368 00:13:39,839 --> 00:13:43,760 it did take some time to kind of figure 369 00:13:41,360 --> 00:13:45,519 out uh the exception and error handling 370 00:13:43,760 --> 00:13:48,000 uh with async io 371 00:13:45,519 --> 00:13:50,560 so we weren't doing things like 372 00:13:48,000 --> 00:13:52,000 exploding the event loop or swallowing 373 00:13:50,560 --> 00:13:53,839 exceptions in the event loop and not 374 00:13:52,000 --> 00:13:55,199 passing those back to locusts so they 375 00:13:53,839 --> 00:13:57,120 could actually be 376 00:13:55,199 --> 00:13:58,079 reported 377 00:13:57,120 --> 00:14:00,160 to the 378 00:13:58,079 --> 00:14:02,399 error statistics that are displayed on 379 00:14:00,160 --> 00:14:05,680 these dashboards as well 380 00:14:02,399 --> 00:14:07,760 also locus that can run n virtual users 381 00:14:05,680 --> 00:14:09,440 per node essentially and we never really 382 00:14:07,760 --> 00:14:10,720 figured out how to get this working with 383 00:14:09,440 --> 00:14:13,279 async io 384 00:14:10,720 --> 00:14:16,079 just by trying things like creating an 385 00:14:13,279 --> 00:14:18,959 event loop per python process or per 386 
asyncio client.

So, an overview of the client: the idea was to expose a method to make a list of requests concurrently, but we thought it would also make sense to encapsulate a base HTTP client, so that consumers can request both serially and concurrently with the same client. As I said before, it creates a new asyncio event loop per client instance, and we're still not sure this is the right approach, as we never quite got it working. The async request mechanism itself passes back a list of custom result objects: if an exception does occur during a request, it is caught and passed back inside this result object, which allows our handling code on the other side of the event loop to tie it back into Locust's error reporting.

Now, a little overview of the test layout. As you can see here, at the top we have an encapsulating class, LocustUserTasks, and inside that there are two of these dashboard tasks, Dashboard1 and Dashboard2;
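Before getting into the layout details, the concurrent request mechanism just described can be sketched with plain asyncio. This is a rough illustration of the pattern, not the actual client: aiohttp and the Locust wiring are omitted, and every name here is mine. The key idea is that each coroutine catches its own exception and wraps it in a small result object, so failures survive the trip across the event loop and can be reported.

```python
import asyncio

class Resolved:
    """Result of one request: either a response or a caught exception."""
    def __init__(self, request, response=None, error=None):
        self.request = request
        self.response = response
        self.error = error

async def _one(request, fetch):
    try:
        return Resolved(request, response=await fetch(request))
    except Exception as exc:
        # Caught here rather than exploding the event loop; the caller
        # can feed it into error reporting (Locust, in the talk's case).
        return Resolved(request, error=exc)

def request_concurrently(requests, fetch):
    """Run all requests concurrently on a fresh event loop and return
    a list of Resolved objects in the original request order."""
    async def gather():
        return await asyncio.gather(*(_one(r, fetch) for r in requests))
    return asyncio.run(gather())
```

With a real HTTP client, `fetch` would be something like an aiohttp GET; the point is only that a failing panel query yields a `Resolved` with `error` set instead of aborting its siblings.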
both are a SequentialTaskSet, which is a construct from the Locust API. You'll notice that the embedded classes, Dashboard1 and Dashboard2, each declare a tasks variable, which is also part of the Locust API; it defines the sequence of the tasks that are executed. In the case of the dashboards themselves, these tasks variables are set to methods: load the variables, load the panels, then stop the dashboard load. Dashboard2 does the same thing: load the variables, load the panels, stop the dashboard load. And you can see that the outer LocustUserTasks class also declares a tasks variable, which basically defers to the inner classes: it loads dashboard one, then dashboard two. To repeat this for the n times you want, you just repeat those entries as many times as you like: dashboard one, dashboard two, dashboard one, dashboard two, and so on, and
then at the end you just say stop test. The stop-test and stop-dashboard-load methods aren't shown here, but they're really just methods that call an interrupt on the Locust test itself, to stop things from executing over and over.

So now we knew what we wanted to do; we actually had to implement these things. To bring forward some of our requirements from modeling, methodology, and the platform itself: the tests had to be written in Python, for Locust; the tests had to be modeled on existing dashboards; and the tests had to be consistent between runs, so that they're fully repeatable. Another consideration was that it would be great to increase velocity for operators if the tests could easily be updated. So what we came up with was to actually generate the tests themselves from Grafana dashboard JSON, and we already had some experience on the team with the Grafana dashboard JSON model, so that lent itself well. This also allowed us to model the tests after the dashboards a little more
precisely and it enables a 475 00:17:27,919 --> 00:17:31,600 quick adjustment to the tests so if you 476 00:17:30,000 --> 00:17:34,080 needed to tweak something you could just 477 00:17:31,600 --> 00:17:35,360 tweak it and rerun the generator and we 478 00:17:34,080 --> 00:17:37,520 have a bunch of new tests coming out the 479 00:17:35,360 --> 00:17:39,280 other side and the nice thing is that it 480 00:17:37,520 --> 00:17:41,200 generates a consistent test suite every 481 00:17:39,280 --> 00:17:43,520 time you don't really have to worry 482 00:17:41,200 --> 00:17:46,720 about user error as long as the input's 483 00:17:43,520 --> 00:17:48,960 good the output should be good as well 484 00:17:46,720 --> 00:17:51,760 so the overview of our implementation of 485 00:17:48,960 --> 00:17:54,559 the generator itself we identified some 486 00:17:51,760 --> 00:17:56,799 key dashboards in grafana 487 00:17:54,559 --> 00:17:58,160 things that were slow things that were fast 488 00:17:56,799 --> 00:18:00,960 things that had different kinds of 489 00:17:58,160 --> 00:18:02,480 panels things that had 490 00:18:00,960 --> 00:18:05,120 lots of template variables things that 491 00:18:02,480 --> 00:18:07,440 had no template variables etc i think 492 00:18:05,120 --> 00:18:10,640 overall we probably identified close to a 493 00:18:07,440 --> 00:18:12,559 dozen or so and we pulled these via 494 00:18:10,640 --> 00:18:15,200 the grafana 495 00:18:12,559 --> 00:18:17,120 rest api and uh we actually ended up 496 00:18:15,200 --> 00:18:19,360 saving a copy of these in the generator 497 00:18:17,120 --> 00:18:21,440 repo and that was mainly due to the fact 498 00:18:19,360 --> 00:18:23,600 that if somebody edited the grafana 499 00:18:21,440 --> 00:18:24,960 dashboard upstream we didn't want to 500 00:18:23,600 --> 00:18:27,440 have our 501 00:18:24,960 --> 00:18:28,880 tests change on us we needed them to 502 00:18:27,440 --> 00:18:30,960 remain consistent 503 00:18:28,880 --> 00:18:33,600 so
we actually saved this copy like the 504 00:18:30,960 --> 00:18:34,960 copy of all these dashboards uh the json 505 00:18:33,600 --> 00:18:37,760 into our 506 00:18:34,960 --> 00:18:40,320 tester repo to kind of codify them 507 00:18:37,760 --> 00:18:42,640 so we wrote this python script to parse 508 00:18:40,320 --> 00:18:45,280 that json and it extracts the template 509 00:18:42,640 --> 00:18:47,520 variables and the panel queries and it 510 00:18:45,280 --> 00:18:49,280 performs that variable substitution 511 00:18:47,520 --> 00:18:51,919 and it actually uses this data to 512 00:18:49,280 --> 00:18:53,919 template the tests uh with jinja which 513 00:18:51,919 --> 00:18:55,840 are just python test templates written 514 00:18:53,919 --> 00:18:57,360 in jinja 2 515 00:18:55,840 --> 00:18:59,280 these templates would really 516 00:18:57,360 --> 00:19:01,919 define the structure for each test right 517 00:18:59,280 --> 00:19:03,679 for each dashboard uh 518 00:19:01,919 --> 00:19:06,640 handle the variable loads handle the 519 00:19:03,679 --> 00:19:09,200 panel loads and uh 520 00:19:06,640 --> 00:19:12,000 this python that was templated 521 00:19:09,200 --> 00:19:14,880 mainly used uh the new async locust 522 00:19:12,000 --> 00:19:15,760 client to make all those requests 523 00:19:14,880 --> 00:19:18,240 and 524 00:19:15,760 --> 00:19:20,720 the output of this whole system was 525 00:19:18,240 --> 00:19:23,600 a bunch of locust tests that are all 526 00:19:20,720 --> 00:19:25,440 valid python and they're ready to run 527 00:19:23,600 --> 00:19:28,240 and we would actually 528 00:19:25,440 --> 00:19:30,320 we ended up generating uh one test per 529 00:19:28,240 --> 00:19:33,520 query window so one day three day seven 530 00:19:30,320 --> 00:19:35,840 day and thirty day and this was really 531 00:19:33,520 --> 00:19:38,160 just for us i mean the multiple tests 532 00:19:35,840 --> 00:19:39,840 allowed us to test each
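the extraction step described above — pulling panel queries out of a grafana dashboard json export and substituting the template variables — can be sketched like this; the real generator then rendered jinja2 templates into locust test files, and here python's stdlib string.Template stands in for the substitution since grafana's $var / ${var} syntax happens to match it (the json shape is the standard grafana export format, but the function name and sample values are illustrative)

```python
# Sketch of the generator's extraction step: parse grafana dashboard
# json, collect each panel's prometheus query, and substitute template
# variables. string.Template stands in for the jinja2 rendering the
# real generator used.
import json
from string import Template

def extract_queries(dashboard_json, var_values):
    dash = json.loads(dashboard_json)
    queries = []
    for panel in dash.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                # substitute $job, $interval, ... leaving unknown vars intact
                queries.append(Template(expr).safe_substitute(var_values))
    return queries

dashboard = json.dumps({
    "panels": [
        {"targets": [{"expr": 'rate(http_requests_total{job="$job"}[5m])'}]},
    ],
})
queries = extract_queries(dashboard, {"job": "dashboard-service"})
```

because the saved dashboard json is the input, rerunning this after a tweak regenerates a consistent set of queries every time, which is the repeatability property the talk calls out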
query window 533 00:19:38,160 --> 00:19:41,280 distinctly 534 00:19:39,840 --> 00:19:43,520 but we could load all the tests at once 535 00:19:41,280 --> 00:19:44,640 so we could rerun a specific query 536 00:19:43,520 --> 00:19:47,200 window or something if you want to 537 00:19:44,640 --> 00:19:49,679 without having to rerun everything and 538 00:19:47,200 --> 00:19:51,600 again for us just logistically this made 539 00:19:49,679 --> 00:19:52,400 more sense 540 00:19:51,600 --> 00:19:54,000 so 541 00:19:52,400 --> 00:19:55,600 uh now we were just about ready to go we 542 00:19:54,000 --> 00:19:57,360 need to figure out what kind of data we 543 00:19:55,600 --> 00:19:59,200 wanted to get out of these tests uh 544 00:19:57,360 --> 00:20:00,799 and kind of what we wanted 545 00:19:59,200 --> 00:20:02,799 to compare on 546 00:20:00,799 --> 00:20:05,360 uh so it came down to choosing some 547 00:20:02,799 --> 00:20:06,960 meaningful indicators so errors to 548 00:20:05,360 --> 00:20:08,640 us are always meaningful they help track 549 00:20:06,960 --> 00:20:10,159 your failures under load more data is 550 00:20:08,640 --> 00:20:11,840 always better than less so we actually 551 00:20:10,159 --> 00:20:14,640 chose to record basically everything 552 00:20:11,840 --> 00:20:17,120 that locust provided in our google sheet 553 00:20:14,640 --> 00:20:20,400 but we'd really only compare on 554 00:20:17,120 --> 00:20:23,440 uh error rates uh your median your 90th 555 00:20:20,400 --> 00:20:25,840 and 99th percentiles so just to dig in 556 00:20:23,440 --> 00:20:27,360 there a little more error rates they're 557 00:20:25,840 --> 00:20:29,120 not only helpful to understand what's 558 00:20:27,360 --> 00:20:30,880 currently failing 559 00:20:29,120 --> 00:20:32,559 but they help you understand the point 560 00:20:30,880 --> 00:20:35,360 at which your system starts to fail so 561 00:20:32,559 --> 00:20:38,320 you can actually increase load uh using 562
00:20:35,360 --> 00:20:40,000 your load testing tool until 563 00:20:38,320 --> 00:20:42,000 uh you're either content with the 564 00:20:40,000 --> 00:20:43,120 performance under load or things start 565 00:20:42,000 --> 00:20:45,039 to fail 566 00:20:43,120 --> 00:20:46,960 the next thing is quantiles these are 567 00:20:45,039 --> 00:20:49,919 extremely useful in the observability 568 00:20:46,960 --> 00:20:52,400 world in that most latency based slos 569 00:20:49,919 --> 00:20:53,919 are already expressed as a percentage so 570 00:20:52,400 --> 00:20:57,039 for example you might have something 571 00:20:53,919 --> 00:20:58,640 like 99 percent of your requests complete within 572 00:20:57,039 --> 00:21:00,559 x seconds 573 00:20:58,640 --> 00:21:02,320 and like the quantiles 574 00:21:00,559 --> 00:21:03,360 chosen they should be meaningful to that 575 00:21:02,320 --> 00:21:04,960 slo 576 00:21:03,360 --> 00:21:06,880 but they should also be 577 00:21:04,960 --> 00:21:08,000 wide enough and varied to kind of help 578 00:21:06,880 --> 00:21:09,600 understand 579 00:21:08,000 --> 00:21:10,880 the spread or how frequent those slow 580 00:21:09,600 --> 00:21:14,000 requests are 581 00:21:10,880 --> 00:21:15,679 so for example your p50 or your median i 582 00:21:14,000 --> 00:21:17,200 mean half your requests are basically 583 00:21:15,679 --> 00:21:19,840 completing in this time 584 00:21:17,200 --> 00:21:21,919 so if this is really good or acceptable 585 00:21:19,840 --> 00:21:23,039 performance then you're kind of already 586 00:21:21,919 --> 00:21:26,240 doing okay 587 00:21:23,039 --> 00:21:27,600 uh at your 90th percentile only 10 percent of 588 00:21:26,240 --> 00:21:30,960 the requests are actually slower than 589 00:21:27,600 --> 00:21:32,400 this so is this performance 590 00:21:30,960 --> 00:21:34,880 acceptable for 591 00:21:32,400 --> 00:21:37,600 the majority of users who are accessing 592 00:21:34,880 --> 00:21:39,039 your system and you know uh
the 99th 593 00:21:37,600 --> 00:21:40,400 percentile just one percent of your 594 00:21:39,039 --> 00:21:42,480 requests will actually be slower than 595 00:21:40,400 --> 00:21:43,919 this so is this a tolerable edge case 596 00:21:42,480 --> 00:21:46,880 and kind of how does this align to your 597 00:21:43,919 --> 00:21:49,679 slo if your slo was 99 percent 598 00:21:46,880 --> 00:21:52,720 uh for that latency 599 00:21:49,679 --> 00:21:55,039 um so now we built this thing and we 600 00:21:52,720 --> 00:21:57,679 needed to turn it loose so our first 601 00:21:55,039 --> 00:22:00,080 step was to identify 602 00:21:57,679 --> 00:22:02,720 the current performance of the system we 603 00:22:00,080 --> 00:22:03,679 already had and this is what it looked 604 00:22:02,720 --> 00:22:06,080 like 605 00:22:03,679 --> 00:22:09,360 so 30 day windows 606 00:22:06,080 --> 00:22:11,520 you could see were pretty horrendous and 607 00:22:09,360 --> 00:22:12,640 added load really only exacerbated the 608 00:22:11,520 --> 00:22:16,240 issue 609 00:22:12,640 --> 00:22:18,480 and uh 10 user 30 day windows as you can 610 00:22:16,240 --> 00:22:19,919 see here there's actually no data for 611 00:22:18,480 --> 00:22:21,919 those 612 00:22:19,919 --> 00:22:23,919 they are timing out 613 00:22:21,919 --> 00:22:25,760 except for the p50 614 00:22:23,919 --> 00:22:27,200 so clearly things 615 00:22:25,760 --> 00:22:29,360 weren't great 616 00:22:27,200 --> 00:22:30,559 and just note here the y-axis in all 617 00:22:29,360 --> 00:22:32,240 these graphs 618 00:22:30,559 --> 00:22:35,600 this is the time in milliseconds for the 619 00:22:32,240 --> 00:22:37,600 entire test suite to run 620 00:22:35,600 --> 00:22:39,360 so iteration 2 621 00:22:37,600 --> 00:22:41,120 we thought hey maybe other metrics 622 00:22:39,360 --> 00:22:43,679 platforms are faster and thanos is the 623 00:22:41,120 --> 00:22:45,520 problem it couldn't possibly be us 624 00:22:43,679 --> 00:22:48,000 so this is what happened we
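the median / 90th / 99th percentile comparison described above can be computed from a set of raw response times with the python stdlib — locust reports these numbers itself, so this is just a sketch of what they mean, and the sample latencies are made up

```python
# Computing the three comparison numbers (median, p90, p99) from raw
# response times using only the stdlib. locust reports these itself;
# this just shows what the numbers mean.
import statistics

def latency_summary(samples_ms):
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),  # half the requests are faster
        "p90": cuts[89],                       # only 10% of requests are slower
        "p99": cuts[98],                       # the 1% edge case against the slo
    }

summary = latency_summary(list(range(1, 101)))  # 1..100 ms of fake latencies
```

picking quantiles that match the slo (e.g. a "99% within x seconds" target maps straight onto p99) is what makes these three numbers comparable between test runs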
implemented 625 00:22:45,520 --> 00:22:50,159 a proof of concept for amazon managed 626 00:22:48,000 --> 00:22:51,919 prometheus which you see here as amp and 627 00:22:50,159 --> 00:22:54,880 we implemented a proof of concept for 628 00:22:51,919 --> 00:22:57,360 promscale and we ran those against our 629 00:22:54,880 --> 00:22:58,720 current uh thanos system 630 00:22:57,360 --> 00:23:01,679 so as you can see here from the error 631 00:22:58,720 --> 00:23:03,120 rates uh slide as the user range 632 00:23:01,679 --> 00:23:06,080 increased so did the error rates in 633 00:23:03,120 --> 00:23:08,000 general uh most of the solutions saw 634 00:23:06,080 --> 00:23:11,039 no or little error rates at the 635 00:23:08,000 --> 00:23:14,000 shorter ranges and smaller numbers of virtual 636 00:23:11,039 --> 00:23:15,280 users and promscale had error rates 637 00:23:14,000 --> 00:23:16,960 that kind of drifted 638 00:23:15,280 --> 00:23:20,080 and uh 639 00:23:16,960 --> 00:23:22,400 amazon managed prometheus and thanos 640 00:23:20,080 --> 00:23:24,640 tracked very similarly although 641 00:23:22,400 --> 00:23:25,840 uh the managed solution was slightly 642 00:23:24,640 --> 00:23:28,000 ahead 643 00:23:25,840 --> 00:23:30,159 uh in terms of performance we found 644 00:23:28,000 --> 00:23:32,320 extremely similar trends uh between the 645 00:23:30,159 --> 00:23:34,799 platforms for the most part as user 646 00:23:32,320 --> 00:23:36,960 range increased uh performance basically 647 00:23:34,799 --> 00:23:38,960 became unacceptable again 648 00:23:36,960 --> 00:23:40,880 so the question really became why do 649 00:23:38,960 --> 00:23:43,039 none of these platforms actually work 650 00:23:40,880 --> 00:23:45,360 well for us 651 00:23:43,039 --> 00:23:47,919 um so the conclusion we came to is 652 00:23:45,360 --> 00:23:50,240 basically that given that amp which i 653 00:23:47,919 --> 00:23:52,480 believe is cortex-based 654 00:23:50,240 --> 00:23:54,400 and promscale and thanos all
degraded 655 00:23:52,480 --> 00:23:55,440 pretty severely approaching 30-day query 656 00:23:54,400 --> 00:23:56,720 windows 657 00:23:55,440 --> 00:23:58,799 uh 658 00:23:56,720 --> 00:24:01,679 we came to that conclusion that our data 659 00:23:58,799 --> 00:24:04,960 was likely suspect and that 660 00:24:01,679 --> 00:24:07,600 the way we were 661 00:24:04,960 --> 00:24:10,080 recording metrics uh was not lending 662 00:24:07,600 --> 00:24:12,799 itself well to the way prometheus tsdb 663 00:24:10,080 --> 00:24:14,559 or thanos or something else worked and a 664 00:24:12,799 --> 00:24:16,000 future investigation would actually 665 00:24:14,559 --> 00:24:18,480 confirm this 666 00:24:16,000 --> 00:24:21,360 in that certain of our series had major 667 00:24:18,480 --> 00:24:23,520 cardinality issues and it was something 668 00:24:21,360 --> 00:24:26,880 to the tune of less than one percent of 669 00:24:23,520 --> 00:24:29,840 our metric series by name accounted for 670 00:24:26,880 --> 00:24:33,120 almost 50 percent of the total samples 671 00:24:29,840 --> 00:24:34,320 uh per scrape basically um but in the 672 00:24:33,120 --> 00:24:36,559 meantime until we got to that 673 00:24:34,320 --> 00:24:39,440 investigation uh we decided to actually 674 00:24:36,559 --> 00:24:40,799 use this testing platform to tune thanos 675 00:24:39,440 --> 00:24:43,279 itself 676 00:24:40,799 --> 00:24:45,760 um so this was really iterations three 677 00:24:43,279 --> 00:24:48,240 and on uh we were benchmarking changes 678 00:24:45,760 --> 00:24:51,760 to the existing platform 679 00:24:48,240 --> 00:24:54,559 um so uh the tuning we focused on 10 680 00:24:51,760 --> 00:24:57,520 user and 30-day query ranges as this was 681 00:24:54,559 --> 00:25:00,320 our worst case and you know if we could 682 00:24:57,520 --> 00:25:02,559 make our worst case better then our best 683 00:25:00,320 --> 00:25:05,120 cases only got even better right 684 00:25:02,559 --> 00:25:07,360 so we decided on a
potential bottleneck 685 00:25:05,120 --> 00:25:08,880 we investigated how to solve that via 686 00:25:07,360 --> 00:25:11,039 configuration or resource options and 687 00:25:08,880 --> 00:25:12,640 implemented it and this is the tight loop 688 00:25:11,039 --> 00:25:14,960 again that we talked about and we went 689 00:25:12,640 --> 00:25:16,000 through about 17 iterations of these in 690 00:25:14,960 --> 00:25:17,120 total 691 00:25:16,000 --> 00:25:19,360 and every time we went through an 692 00:25:17,120 --> 00:25:20,480 iteration if any increase to the 693 00:25:19,360 --> 00:25:23,440 performance 694 00:25:20,480 --> 00:25:25,120 or decrease in error rate was found 695 00:25:23,440 --> 00:25:26,960 we actually implemented that as the 696 00:25:25,120 --> 00:25:29,440 iterations continued 697 00:25:26,960 --> 00:25:31,200 and if you look at the graph here uh 698 00:25:29,440 --> 00:25:33,760 what you can see is that against our 699 00:25:31,200 --> 00:25:36,320 baseline our tuned 700 00:25:33,760 --> 00:25:37,600 thanos actually performed 701 00:25:36,320 --> 00:25:39,760 much better 702 00:25:37,600 --> 00:25:42,880 than what we had seen previously 703 00:25:39,760 --> 00:25:44,400 and not only was performance up uh we 704 00:25:42,880 --> 00:25:46,000 were able to get error rates down from 705 00:25:44,400 --> 00:25:49,039 about 21 percent 706 00:25:46,000 --> 00:25:51,360 to one percent 707 00:25:49,039 --> 00:25:52,960 so uh summary of the changes uh for 708 00:25:51,360 --> 00:25:55,760 those that are interested things that 709 00:25:52,960 --> 00:25:57,679 had a positive impact were increasing 710 00:25:55,760 --> 00:26:00,320 resources surprise surprise like memory 711 00:25:57,679 --> 00:26:02,880 and cpu specifically for uh store and 712 00:26:00,320 --> 00:26:04,480 query pods uh thanos store we actually 713 00:26:02,880 --> 00:26:06,400 changed that from being a time based uh 714 00:26:04,480 --> 00:26:08,480 partitioning scheme to hash based which 715 00:26:06,400 -->
00:26:10,480 is just a configuration option and this 716 00:26:08,480 --> 00:26:12,559 led to some better query distribution so 717 00:26:10,480 --> 00:26:14,320 the thanos store or store instances that 718 00:26:12,559 --> 00:26:16,159 were responsible for the most commonly 719 00:26:14,320 --> 00:26:18,480 queried ranges didn't always get 720 00:26:16,159 --> 00:26:20,559 obliterated uh we also increased the 721 00:26:18,480 --> 00:26:21,679 thanos store grpc concurrency just so 722 00:26:20,559 --> 00:26:24,720 more things could get through the door 723 00:26:21,679 --> 00:26:27,279 at once we moved from an in cluster 724 00:26:24,720 --> 00:26:29,600 memcached to an elasticache managed 725 00:26:27,279 --> 00:26:31,200 memcached instance and you know this 726 00:26:29,600 --> 00:26:33,840 gave us kind of benefits of a managed 727 00:26:31,200 --> 00:26:36,000 service but also it gave us access to 728 00:26:33,840 --> 00:26:37,279 much larger nodes without 729 00:26:36,000 --> 00:26:40,400 having to worry about resource 730 00:26:37,279 --> 00:26:41,760 constraints inside kubernetes and 731 00:26:40,400 --> 00:26:44,640 working on nodes where everybody 732 00:26:41,760 --> 00:26:47,120 else is also running workloads 733 00:26:44,640 --> 00:26:49,279 so we also tuned thanos store index 734 00:26:47,120 --> 00:26:51,200 cache to increase the timeout increase 735 00:26:49,279 --> 00:26:53,840 async concurrency and increase the 736 00:26:51,200 --> 00:26:55,919 buffer size but we actually decreased 737 00:26:53,840 --> 00:26:57,600 the max number of idle connections just 738 00:26:55,919 --> 00:27:00,880 so things would get booted a little more 739 00:26:57,600 --> 00:27:02,799 aggressively we also enabled thanos 740 00:27:00,880 --> 00:27:05,120 store caching bucket which actually just 741 00:27:02,799 --> 00:27:06,720 speeds up chunk loading 742 00:27:05,120 --> 00:27:08,480 and things that we had tried that really 743 00:27:06,720 --> 00:27:11,679
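the time-based to hash-based store partitioning change mentioned above can be illustrated with a toy sketch — thanos implements this with hashmod relabeling over block labels, and here md5 simply stands in for the actual hash; the point is that hashing block ids spreads the commonly queried (recent) blocks across all store shards instead of piling them onto whichever shard owns the newest time range, and the names and shard count below are illustrative

```python
# Toy illustration of hash-based store partitioning: each shard serves
# the blocks whose hash lands in its bucket, so hot recent blocks end up
# spread across shards rather than concentrated in one time range.
# md5 stands in for thanos' actual hashmod implementation.
import hashlib

def shard_for(block_id, n_shards):
    digest = hashlib.md5(block_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# a batch of hypothetical block ids spreads over all four shards,
# instead of all belonging to the single shard that owns "today"
assignments = {b: shard_for(b, 4) for b in (f"block-{i:03d}" for i in range(100))}
```

the assignment is deterministic, so every querier agrees on which shard serves which block, while the load from the most recent blocks is no longer focused on one store instance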
didn't work out for us 744 00:27:08,480 --> 00:27:13,279 we tried to overly shard thanos store 745 00:27:11,679 --> 00:27:15,919 and we basically found it had 746 00:27:13,279 --> 00:27:19,120 diminishing returns uh after a certain 747 00:27:15,919 --> 00:27:21,200 point uh we tried thanos 748 00:27:19,120 --> 00:27:23,679 store shard replicas 749 00:27:21,200 --> 00:27:26,320 but we really found no benefit running 750 00:27:23,679 --> 00:27:28,159 that with kubernetes in just that 751 00:27:26,320 --> 00:27:30,240 like if a 752 00:27:28,159 --> 00:27:33,840 store shard goes down kubernetes 753 00:27:30,240 --> 00:27:35,279 replaces that pod pretty quickly and the 754 00:27:33,840 --> 00:27:37,360 way this worked was that when you 755 00:27:35,279 --> 00:27:39,440 queried it would 756 00:27:37,360 --> 00:27:41,039 actually query across both replicas at 757 00:27:39,440 --> 00:27:43,120 the same time so there wasn't any real 758 00:27:41,039 --> 00:27:44,960 performance gained from this 759 00:27:43,120 --> 00:27:47,440 and also we had tried thanos store in 760 00:27:44,960 --> 00:27:49,679 memory index caching but it just put too 761 00:27:47,440 --> 00:27:51,279 much memory pressure on our kubernetes 762 00:27:49,679 --> 00:27:54,240 nodes 763 00:27:51,279 --> 00:27:56,559 so wrapping it all up the tldr slide is 764 00:27:54,240 --> 00:27:58,799 that we identified a lack of visibility 765 00:27:56,559 --> 00:28:01,039 into the performance of our systems we 766 00:27:58,799 --> 00:28:02,960 ended up designing and implementing this 767 00:28:01,039 --> 00:28:05,600 performance testing framework 768 00:28:02,960 --> 00:28:07,679 and we established baseline performance 769 00:28:05,600 --> 00:28:09,279 and we also evaluated other solutions 770 00:28:07,679 --> 00:28:11,279 with that framework 771 00:28:09,279 --> 00:28:12,960 this allowed us to draw then later prove 772 00:28:11,279 --> 00:28:15,440 out some conclusions from our test 773
00:28:12,960 --> 00:28:18,399 results and using that same framework in 774 00:28:15,440 --> 00:28:20,559 a tight loop fashion we were able to 775 00:28:18,399 --> 00:28:23,120 successfully tune our thanos and 776 00:28:20,559 --> 00:28:24,640 prometheus instances 777 00:28:23,120 --> 00:28:27,279 so for more information i'm going to 778 00:28:24,640 --> 00:28:29,520 make the slides the locust aiohttp 779 00:28:27,279 --> 00:28:31,760 client and the grafana test generator 780 00:28:29,520 --> 00:28:32,559 available at the github repo you can see 781 00:28:31,760 --> 00:28:34,720 there 782 00:28:32,559 --> 00:28:36,159 and if you want to contact me or if you 783 00:28:34,720 --> 00:28:37,440 have questions or you want to share some 784 00:28:36,159 --> 00:28:39,360 opinions 785 00:28:37,440 --> 00:28:41,679 my email is on the screen there and also 786 00:28:39,360 --> 00:28:43,840 you can hit me up at the cncf or 787 00:28:41,679 --> 00:28:45,760 grafana slack and my handle is just 788 00:28:43,840 --> 00:28:48,559 brian grew 789 00:28:45,760 --> 00:28:51,120 so uh thank you everybody for 790 00:28:48,559 --> 00:28:53,360 sitting through the presentation and now 791 00:28:51,120 --> 00:28:54,480 we will move on to 792 00:28:53,360 --> 00:28:56,000 q a 793 00:28:54,480 --> 00:28:57,760 thanks so much 794 00:28:56,000 --> 00:28:59,360 hey welcome back everybody brian thanks 795 00:28:57,760 --> 00:29:01,600 for an amazing talk and the wonderful 796 00:28:59,360 --> 00:29:03,039 memes um i didn't know what hootsuite was 797 00:29:01,600 --> 00:29:04,880 before your recording so thanks so much 798 00:29:03,039 --> 00:29:06,399 for doing that for us very 799 00:29:04,880 --> 00:29:08,000 enjoyable 800 00:29:06,399 --> 00:29:10,000 so brian we've got a couple of questions 801 00:29:08,000 --> 00:29:12,240 that have come through in the chat um 802 00:29:10,000 --> 00:29:14,640 the first one's a double question the 803 00:29:12,240 --> 00:29:16,960 user behavior was
codified as python and 804 00:29:14,640 --> 00:29:16,960 locust 805 00:29:17,360 --> 00:29:20,559 go ahead sorry yeah no 806 00:29:19,440 --> 00:29:22,159 if you want to finish reading the second 807 00:29:20,559 --> 00:29:23,840 half that's fine sure and the second 808 00:29:22,159 --> 00:29:26,320 half is do the number of developers 809 00:29:23,840 --> 00:29:29,679 affect the number of load test users 810 00:29:26,320 --> 00:29:32,640 were implemented right uh so yeah uh the 811 00:29:29,679 --> 00:29:35,600 first half we did choose uh several 812 00:29:32,640 --> 00:29:36,960 dashboards and uh the user behavior of 813 00:29:35,600 --> 00:29:39,360 loading those dashboards or what it 814 00:29:36,960 --> 00:29:42,240 would look like coming from grafana 815 00:29:39,360 --> 00:29:45,120 that was definitely codified into uh the 816 00:29:42,240 --> 00:29:46,880 tests we wrote and it did impact uh the 817 00:29:45,120 --> 00:29:49,039 number of users right i mean we 818 00:29:46,880 --> 00:29:51,679 found we had relatively low uh 819 00:29:49,039 --> 00:29:52,960 simultaneous users on grafana at any one 820 00:29:51,679 --> 00:29:56,240 given time 821 00:29:52,960 --> 00:29:59,120 so we decided to uh max out at about uh 822 00:29:56,240 --> 00:30:00,320 10 virtual users for locust 823 00:29:59,120 --> 00:30:02,399 um 824 00:30:00,320 --> 00:30:04,240 and then uh 825 00:30:02,399 --> 00:30:06,159 and then i think yeah like we were 826 00:30:04,240 --> 00:30:08,399 trying to determine i mean why the 827 00:30:06,159 --> 00:30:09,679 performance was slow and not try to find 828 00:30:08,399 --> 00:30:11,440 the breaking point of the system so i 829 00:30:09,679 --> 00:30:13,279 mean that's really the reason we didn't 830 00:30:11,440 --> 00:30:16,159 uh ratchet up the number of virtual 831 00:30:13,279 --> 00:30:16,159 users too high 832 00:30:17,120 --> 00:30:21,279 excellent thanks brian and team please 833 00:30:19,520 --> 00:30:23,520 do continue to put your
questions 834 00:30:21,279 --> 00:30:24,880 through into the chat the great av 835 00:30:23,520 --> 00:30:26,399 guys etc in the background and the 836 00:30:24,880 --> 00:30:28,240 moderators will post them through to us 837 00:30:26,399 --> 00:30:30,399 so please keep more questions coming 838 00:30:28,240 --> 00:30:31,120 we've got about 15 minutes we'll finish 839 00:30:30,399 --> 00:30:32,240 at 840 00:30:31,120 --> 00:30:35,279 uh 841 00:30:32,240 --> 00:30:37,919 12 25 australian eastern standard time 842 00:30:35,279 --> 00:30:39,600 on the east coast for sydney apologies i 843 00:30:37,919 --> 00:30:41,520 can't work out the rest of the six or 844 00:30:39,600 --> 00:30:43,279 seven time zones in australia on the fly 845 00:30:41,520 --> 00:30:46,000 but hopefully you can so we've got about 846 00:30:43,279 --> 00:30:48,159 15 minutes left before we go to lunch 847 00:30:46,000 --> 00:30:49,039 the second question is 848 00:30:48,159 --> 00:30:51,919 what's 849 00:30:49,039 --> 00:30:53,279 sorry there should be a what's there 850 00:30:51,919 --> 00:30:55,600 what's the most common 851 00:30:53,279 --> 00:30:57,200 type of metrics you have seen people 852 00:30:55,600 --> 00:30:59,279 overlook 853 00:30:57,200 --> 00:31:01,679 that would add value to their 854 00:30:59,279 --> 00:31:03,360 polling and graphing 855 00:31:01,679 --> 00:31:05,519 uh yeah for sure so i've been racking my 856 00:31:03,360 --> 00:31:07,760 brain over this one a bit uh 857 00:31:05,519 --> 00:31:09,919 but i mean in terms of general metrics 858 00:31:07,760 --> 00:31:11,760 for services or applications i mean it's 859 00:31:09,919 --> 00:31:13,840 kind of hard to say because everyone's 860 00:31:11,760 --> 00:31:16,320 stack is so different right 861 00:31:13,840 --> 00:31:19,600 but in general i do find a lot of people 862 00:31:16,320 --> 00:31:21,279 they tend to focus on uh resources right 863 00:31:19,600 --> 00:31:24,000 where's my memory at
where's my cpu and 864 00:31:21,279 --> 00:31:25,600 i mean that is very important uh but i 865 00:31:24,000 --> 00:31:28,799 always like to think of things in terms 866 00:31:25,600 --> 00:31:30,559 of user experience and uh whether it's a 867 00:31:28,799 --> 00:31:31,679 real user out there in the world that's 868 00:31:30,559 --> 00:31:32,559 external 869 00:31:31,679 --> 00:31:34,880 or 870 00:31:32,559 --> 00:31:37,120 whether it's like another service that's 871 00:31:34,880 --> 00:31:41,039 using it or like for our use case where 872 00:31:37,120 --> 00:31:43,679 we were uh providing grafana to our 873 00:31:41,039 --> 00:31:44,880 users and i mean in a lot of 874 00:31:43,679 --> 00:31:47,039 these cases 875 00:31:44,880 --> 00:31:49,039 like latency does come in handy and as i 876 00:31:47,039 --> 00:31:50,799 said in the talk as well 877 00:31:49,039 --> 00:31:53,039 not just running with an average or 878 00:31:50,799 --> 00:31:55,440 maxes but actually understanding 879 00:31:53,039 --> 00:31:57,600 uh the quantiles and kind of how that 880 00:31:55,440 --> 00:31:59,360 aligns to the slo and how many users are 881 00:31:57,600 --> 00:32:00,960 being impacted uh through those 882 00:31:59,360 --> 00:32:02,080 quantiles 883 00:32:00,960 --> 00:32:04,240 but i think something that's also 884 00:32:02,080 --> 00:32:06,960 generally overlooked is like the 885 00:32:04,240 --> 00:32:09,120 impact of downstream or managed uh 886 00:32:06,960 --> 00:32:11,919 services your service may be dependent 887 00:32:09,120 --> 00:32:14,000 on and you know remembering to take 888 00:32:11,919 --> 00:32:15,200 their performance and their status into 889 00:32:14,000 --> 00:32:16,720 account 890 00:32:15,200 --> 00:32:18,880 both when you're performing your 891 00:32:16,720 --> 00:32:21,360 monitoring or writing dashboards but 892 00:32:18,880 --> 00:32:25,720 also when you're coming up with 893 00:32:21,360 --> 00:32:25,720 slos for your
service as well 894 00:32:32,880 --> 00:32:36,159 i'm not getting any audio from you 895 00:32:34,240 --> 00:32:37,679 michael 896 00:32:36,159 --> 00:32:39,039 thank you i fell for the oldest trap in 897 00:32:37,679 --> 00:32:41,039 the book 898 00:32:39,039 --> 00:32:43,279 sorry the next question is and apologies 899 00:32:41,039 --> 00:32:45,679 for my pronunciations here greek was not 900 00:32:43,279 --> 00:32:47,840 my strength what is the benefit of 901 00:32:45,679 --> 00:32:50,480 thanos over prometheus 902 00:32:47,840 --> 00:32:53,519 uh yeah for sure um so the reason why 903 00:32:50,480 --> 00:32:55,200 hootsuite went with uh thanos is that 904 00:32:53,519 --> 00:32:58,960 we had a requirement that we wanted to 905 00:32:55,200 --> 00:33:02,000 provide our users with uh up to a year 906 00:32:58,960 --> 00:33:03,440 or so of metrics and 907 00:33:02,000 --> 00:33:05,200 we didn't necessarily want to keep all 908 00:33:03,440 --> 00:33:06,720 those metrics live in prometheus and 909 00:33:05,200 --> 00:33:08,480 have a massive 910 00:33:06,720 --> 00:33:10,960 tsdb 911 00:33:08,480 --> 00:33:13,519 so the benefit for us uh was really 912 00:33:10,960 --> 00:33:16,399 implementing thanos as that uh long-term 913 00:33:13,519 --> 00:33:19,600 store uh but thanos uh and the other 914 00:33:16,399 --> 00:33:22,320 side of it is also uh the uh high 915 00:33:19,600 --> 00:33:24,960 availability side right uh we run uh 916 00:33:22,320 --> 00:33:27,760 multiple uh prometheus or promethei 917 00:33:24,960 --> 00:33:30,720 pairs uh and thanos uh provides the 918 00:33:27,760 --> 00:33:33,120 query across that pair uh the ha pair of 919 00:33:30,720 --> 00:33:34,640 prometheus instances and it will also uh 920 00:33:33,120 --> 00:33:36,320 dedupe the results on the other side 921 00:33:34,640 --> 00:33:37,120 through the query mechanism 922 00:33:36,320 --> 00:33:39,600 so 923 00:33:37,120 --> 00:33:41,519 it kind of bulks out prometheus where 924
00:33:39,600 --> 00:33:44,080 maybe vanilla prometheus falls a bit 925 00:33:41,519 --> 00:33:45,840 short in terms of ha or uh long-term 926 00:33:44,080 --> 00:33:48,640 storage and uh long-term storage on 927 00:33:45,840 --> 00:33:50,640 thanos uh can be downsampled as 928 00:33:48,640 --> 00:33:52,480 well 929 00:33:50,640 --> 00:33:53,919 excellent thank you 930 00:33:52,480 --> 00:33:56,000 well that's the last formal question 931 00:33:53,919 --> 00:33:57,840 please do ask more questions they did 932 00:33:56,000 --> 00:34:00,080 foolishly say that i'm allowed to ad-lib 933 00:33:57,840 --> 00:34:00,799 which is always very dangerous 934 00:34:00,080 --> 00:34:02,640 so 935 00:34:00,799 --> 00:34:04,799 my ad-lib question is what was the 936 00:34:02,640 --> 00:34:07,279 biggest surprise you found when running 937 00:34:04,799 --> 00:34:09,200 your first test 938 00:34:07,279 --> 00:34:10,159 uh yeah so the biggest surprise was just 939 00:34:09,200 --> 00:34:12,320 uh 940 00:34:10,159 --> 00:34:14,720 how absolutely horrific the performance 941 00:34:12,320 --> 00:34:18,480 on querying some of these series was 942 00:34:14,720 --> 00:34:19,679 and uh it became pretty obvious uh 943 00:34:18,480 --> 00:34:22,560 when we started running these tests 944 00:34:19,679 --> 00:34:25,200 manually to begin with uh just like 945 00:34:22,560 --> 00:34:27,040 just testing if stuff worked uh 946 00:34:25,200 --> 00:34:31,040 it became pretty obvious who the 947 00:34:27,040 --> 00:34:32,639 worst offenders were and uh 948 00:34:31,040 --> 00:34:34,320 like in terms of the series right like 949 00:34:32,639 --> 00:34:36,079 not the people but yeah it became 950 00:34:34,320 --> 00:34:38,079 pretty obvious which the worst uh metric 951 00:34:36,079 --> 00:34:39,440 series were so when we did perform 952 00:34:38,079 --> 00:34:41,919 subsequent investigations it kind of 953 00:34:39,440 --> 00:34:43,359 gave us the okay i kind of know where 954
00:34:41,919 --> 00:34:45,200 the bodies are buried i know where to 955 00:34:43,359 --> 00:34:47,679 look 956 00:34:45,200 --> 00:34:49,280 excellent and what was the biggest waste 957 00:34:47,679 --> 00:34:50,720 of time like you spent all of this time 958 00:34:49,280 --> 00:34:52,240 coding something up and it didn't 959 00:34:50,720 --> 00:34:54,720 actually show any improvement in performance and 960 00:34:52,240 --> 00:34:57,760 what could you learn from that 961 00:34:54,720 --> 00:34:59,359 right uh for sure so i mean 962 00:34:57,760 --> 00:35:01,599 uh like are you asking the biggest waste 963 00:34:59,359 --> 00:35:02,960 of time that we've hit or 964 00:35:01,599 --> 00:35:04,720 yeah where you thought you would 965 00:35:02,960 --> 00:35:07,280 find value in a space and you didn't 966 00:35:04,720 --> 00:35:09,040 actually find any value uh yeah that's a 967 00:35:07,280 --> 00:35:12,000 good question i mean in terms of the 968 00:35:09,040 --> 00:35:14,240 overall uh system and the approach which 969 00:35:12,000 --> 00:35:15,280 is using locust uh for performance and load 970 00:35:14,240 --> 00:35:17,040 testing 971 00:35:15,280 --> 00:35:18,880 we've found value there i don't think 972 00:35:17,040 --> 00:35:19,839 we've fallen down anywhere 973 00:35:18,880 --> 00:35:22,320 um 974 00:35:19,839 --> 00:35:24,880 i think the only thing uh 975 00:35:22,320 --> 00:35:26,160 that's maybe fallen slightly short is 976 00:35:24,880 --> 00:35:27,920 just uh 977 00:35:26,160 --> 00:35:29,040 now that this has worked so well for us 978 00:35:27,920 --> 00:35:31,680 we've been trying to figure out how to 979 00:35:29,040 --> 00:35:32,960 scale it uh to the rest of the team and 980 00:35:31,680 --> 00:35:35,040 you know we're trying to figure out if 981 00:35:32,960 --> 00:35:37,200 locust is the tool to do that going 982 00:35:35,040 --> 00:35:38,800 forward and we're actually not sure um 983 00:35:37,200 --> 00:35:41,040 so if you want to talk about perhaps 984
00:35:38,800 --> 00:35:43,760 burning time then yeah we put a lot of 985 00:35:41,040 --> 00:35:45,359 time into this and into locust and you 986 00:35:43,760 --> 00:35:46,560 know we may be starting over with 987 00:35:45,359 --> 00:35:48,000 something that's slightly more 988 00:35:46,560 --> 00:35:49,680 scalable or that we can roll out to the 989 00:35:48,000 --> 00:35:51,680 whole org 990 00:35:49,680 --> 00:35:53,440 excellent thank you as an ex-software 991 00:35:51,680 --> 00:35:55,359 tester i never found testing or 992 00:35:53,440 --> 00:35:56,720 performance testing was a waste of time 993 00:35:55,359 --> 00:35:58,160 even though the people on the 994 00:35:56,720 --> 00:36:01,680 other side of the quad sometimes 995 00:35:58,160 --> 00:36:01,680 wondered what we were doing with ourselves 996 00:36:02,000 --> 00:36:05,760 excellent and what would be your one key 997 00:36:03,839 --> 00:36:08,320 takeaway if someone walked in and 998 00:36:05,760 --> 00:36:09,599 said what should i take away from 999 00:36:08,320 --> 00:36:11,520 this talk right 1000 00:36:09,599 --> 00:36:13,280 yeah i think the 1001 00:36:11,520 --> 00:36:14,160 best thing we've got out of the system 1002 00:36:13,280 --> 00:36:15,920 1003 00:36:14,160 --> 00:36:19,680 was definitely 1004 00:36:15,920 --> 00:36:22,720 tuning thanos and just how easy it was 1005 00:36:19,680 --> 00:36:23,920 to make a change and run the 1006 00:36:22,720 --> 00:36:25,280 performance test and make a change 1007 00:36:23,920 --> 00:36:27,359 and run the test again and then actually see 1008 00:36:25,280 --> 00:36:30,800 those benefits or 1009 00:36:27,359 --> 00:36:33,520 detriments that 1010 00:36:30,800 --> 00:36:35,920 sometimes happened as we went 1011 00:36:33,520 --> 00:36:38,720 along so i mean the one takeaway is just 1012 00:36:35,920 --> 00:36:40,560 that yes performance testing is useful 1013 00:36:38,720 -->
00:36:43,920 even in infrastructure even if it's not 1014 00:36:40,560 --> 00:36:44,960 an end user facing product your users 1015 00:36:43,920 --> 00:36:46,400 are the developers at the end of the day 1016 00:36:44,960 --> 00:36:48,160 well thank you 1017 00:36:46,400 --> 00:36:49,920 good good excellent 1018 00:36:48,160 --> 00:36:52,560 that line between the 1019 00:36:49,920 --> 00:36:53,920 developer and the end user is sometimes 1020 00:36:52,560 --> 00:36:55,440 a little blurred with management and 1021 00:36:53,920 --> 00:36:57,200 the cost of performance testing and 1022 00:36:55,440 --> 00:36:59,200 continuous integration so yeah i'm a 1023 00:36:57,200 --> 00:37:00,960 great believer in it and as a manager 1024 00:36:59,200 --> 00:37:01,839 i'm encouraging it so yes wonderful 1025 00:37:00,960 --> 00:37:03,119 excellent 1026 00:37:01,839 --> 00:37:04,480 good good 1027 00:37:03,119 --> 00:37:06,400 i think that's all the questions 1028 00:37:04,480 --> 00:37:08,160 we've had come through 1029 00:37:06,400 --> 00:37:10,079 we might give people an early mark for 1030 00:37:08,160 --> 00:37:11,920 lunch if they like you'll get a few 1031 00:37:10,079 --> 00:37:16,320 minutes early but brian will be heading 1032 00:37:11,920 --> 00:37:18,480 across to the chat in venueless 1033 00:37:16,320 --> 00:37:20,960 if you haven't found the channels yet 1034 00:37:18,480 --> 00:37:23,119 this is my second lca took me a while to 1035 00:37:20,960 --> 00:37:25,040 find them scroll down on the left go to 1036 00:37:23,119 --> 00:37:27,040 browse channels and then you'll find a 1037 00:37:25,040 --> 00:37:28,320 whole pile of channels to join and if 1038 00:37:27,040 --> 00:37:29,680 that's why you missed out on some other 1039 00:37:28,320 --> 00:37:31,280 things sorry that you didn't hear 1040 00:37:29,680 --> 00:37:33,440 about them earlier but please do join 1041 00:37:31,280 --> 00:37:35,040 brian there and a great thank you thank 1042
00:37:33,440 --> 00:37:36,880 you for taking your time this evening 1043 00:37:35,040 --> 00:37:38,800 thanks to your family and plants as 1044 00:37:36,880 --> 00:37:40,160 well for the after hours work you're 1045 00:37:38,800 --> 00:37:42,000 having to do in canada and please do 1046 00:37:40,160 --> 00:37:43,280 stay warm over there 1047 00:37:42,000 --> 00:37:47,800 great thanks so much 1048 00:37:43,280 --> 00:37:47,800 thanks bye everyone enjoy lunch