Thank you very much, good afternoon. Today I want to talk about some of the tools that I use in solving problems. There are two kinds of problems that I'm going to talk about solving, or at least trying to diagnose on the way to solving; the solving part is usually the hard part, but we'd like to figure out exactly what's going on. The first category of tools I use for this I would put as performance analysis: tools that give us some kind of information about latency, or operations per second, or bytes per second, that kind of thing. The second category is what I sort of call behavior analysis, and that's more about looking at exactly what a program is doing and what steps it is taking, maybe trying to figure out whether, rather than being too slow, it's doing the completely wrong thing.

In my day-to-day role I've been a systems administrator or SRE for more than 15 years, but currently I work in the Ubuntu support team, and we support a wide array of customer problems. It's not just one product; it could be anything that runs on Ubuntu. We spend quite a bit of time on OpenStack and Ceph, which themselves have lots and lots of components, but really someone can open a case about anything, so we don't always have the best tools or the best metrics built in for finding out what we want to know, because the problems are always different. And I need to do this at runtime, in production, without crashing or slowing down the system in the process. So that's what we're going to talk about today, and my hope is that I'll introduce this tool, which you may already be a little bit familiar with, or maybe give you some ideas about what you can do with it to help solve your problems.
In the performance analysis bucket we have some traditional tools that you might be familiar with: things like iostat, top, vmstat, nfsstat, bwm-ng, all those kinds of tools. What most of these tools are doing is very fast and doesn't typically impact system performance, because they're looking at some kind of performance counter that the kernel is already taking care of and counting for us: how many transactions are going through a network card or a disk, or something to do with a process.

These are super great for a top-down view, but they have some limitations. Number one, they make heavy use of averages, and averages are really good at hiding outliers; we'll talk in a minute about why that can be a problem. They also often have a predefined and limited kind of breakdown, so we might be able to see disk statistics for a disk but without correlation to a process, or we may be able to see I/O information for a process but without correlation to which disk that I/O is going to.

Another subcategory of these tools, and top and netstat are really good examples of this, take an instant snapshot of the system every second, and if something happened in, say, 0.1 of a second between the last snapshot and this snapshot, they might miss it completely. In the case of top, if you're launching lots and lots of processes that take a very short time to execute, those may not show up in top at all and you might miss the CPU usage from them. And netstat will show you all the live TCP and UDP connections on a system, but only at the instant you run the tool, and you won't see all the ones that were short-lived in between those executions.
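As a rough illustration of where these numbers come from (this example is mine, not from the talk), most of these tools simply sample counters the kernel already maintains under /proc and print the difference between two samples:

```bash
# Counter files the traditional tools sample; the kernel keeps these up to date anyway
cat /proc/diskstats    # per-disk I/O counters, the data iostat works from
cat /proc/net/dev      # per-interface byte and packet counters
grep ctxt /proc/stat   # e.g. total context switches, one of the counters vmstat reports
```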
So let's look at some examples of how averages can hurt us. This is a tool that I really like to use called iostat, and it gives us a breakdown of I/O performance statistics for each disk on the system. Here we've got multiple disks, some hard drives and an NVMe device, and some columns of data: reads per second and writes per second (that's IOPS), read kilobytes and write kilobytes per second (that's the bandwidth going through the disk), and also the average latency, where r_await and w_await are the average latencies of the read and write operations.

This is good as an overall view, but the latency, and latency can be a really important thing to look at, is an average over all 27,000 of the operations that happened in that second. It could be that all 27,000 of them took 0.03 of a second each. It could also be that almost all of them took 0.01 of a second and one of them took three seconds, and if you've got a disk operation that's taking three seconds, that's really interesting, but you can entirely miss it in tools like this.

If you've been around I/O and storage you might appreciate that, but if it's confusing, one example any kind of computer enthusiast can relate to is CPU usage. In this example of top output I'm running the fio benchmark tool, and I've set it to use only one thread to stress the point. Top is telling us that our CPU usage is only about six percent and 93 percent of our CPU is idle, but at the same time it's telling us that fio is using 100 percent of the CPU, and those numbers don't seem to add up. If you haven't used this tool before it might surprise you: the per-process CPU percentage doesn't top out at 100, it goes up to however many CPUs you have times 100, so if this was using my whole system it would say something like 3200 percent.
But we can ask top to break it down: if you hit the number one key in top, it will show all the CPUs on the system and the utilization of each one, and in this view we can see that one CPU is actually running at full steam, because fio is single-threaded and can only run on a single CPU, while the other seven CPUs are idle. If we only looked at that average, six percent used and 93 percent idle, we might think our CPU is not a problem, when in reality one CPU is the bottleneck and we were missing it because of the average.

These are simple examples, but these kinds of problems show up everywhere, in the latency or utilization of any kind of system resource. Outliers and averages hide things from us all the time, and that ends up making it difficult to solve problems that I hit quite often.
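Roughly how you could reproduce both demos; the exact fio parameters from the talk aren't shown, so the flags here are illustrative:

```bash
# Extended per-disk statistics every second: r/s, w/s, rkB/s, wkB/s, r_await, w_await
iostat -x 1

# A single-threaded fio job, in the spirit of the demo
fio --name=onethread --rw=randread --bs=4k --size=1G --numjobs=1 --time_based --runtime=60

# Watch it in top; press "1" to toggle the per-CPU breakdown instead of the averaged summary
top
```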
Another category of tools, my behavior analysis category, are the ones that, rather than looking at very fast performance counters, try to look at everything a program is doing. strace is probably my favourite: strace will attach to every system call a program makes and report exactly what that system call was and how much data was written; it can even tell you what data was written. But this tool transfers every one of those operations from the other process into user space, to strace, to then print out, and that's very heavyweight. If you're tracing a very small program that might be a small amount of data, but if you're trying to analyse something doing gigabits or gigabytes a second of disk or network, number one it won't be able to keep up with processing that amount of data, and it will also slow down the application.

blktrace is a pretty awesome tool; it's a kernel facility that will dump out every block operation that happens on the whole system: every request queued, when it comes out of the queue, when it goes to the disk, when it comes back. But again, it pushes all of that information out into user space and into a file, and that's very heavyweight.

iptraf looks quite innocent: its first mode will tell you every connection and the number of bytes going through each of those connections, but to do that it's actually packet-dumping the entire network card into iptraf to work out those values. At a few megabits that's not a problem; if you're doing 10 or 20 gigabits a second, that can be a problem. Even plain debug logging, if you've ever turned the debug level in something up to 15 or 20, produces so much output that it usually slows things down.

So those are the kinds of problems we have with these behavior analysis tools, but they can often tell us things that the performance counters can't, because there are only so many performance counters. If none of them covers the situation we're looking at, this detailed output might let us find what we're looking for, at the cost of a big slowdown.

To demonstrate how big that slowdown is, I've got an example case here of running dd, which is just a disk copying tool we've probably all used with at least a Raspberry Pi or something. I'm reading from /dev/zero, which is a stream of null bytes, and I'm writing to /dev/null, so I'm reading a bunch of zeros and throwing them away straight away. To make the example a little contrived and make the point stand out, I'm doing that one byte at a time, not batching it into four kilobytes or a megabyte, and we're going to do it 500,000 times. If I do that with dd on its own, it takes 0.2 of a second, which is pretty fast. If I use strace, and I don't even ask strace to trace anything dd actually does, I say with -e accept that I only want to look at the accept system call, which accepts a connection from the network, and dd doesn't even make network connections, that same task will now take 18 seconds. That's nearly a 100 times slowdown.
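A sketch of that comparison; the exact command lines aren't shown in the transcript, so treat these as an approximation (note that dd treats the k suffix as 1024, so count=500k is 512,000 one-byte copies):

```bash
# Baseline: copy single bytes from /dev/zero to /dev/null
time dd if=/dev/zero of=/dev/null bs=1 count=500k

# The same workload under strace, filtering on a syscall dd never makes;
# the ptrace round trips alone make it dramatically slower
time strace -e trace=accept dd if=/dev/zero of=/dev/null bs=1 count=500k
```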
In a production system that's very busy, that's a penalty we may not be able to incur. On a development or test system it might be reasonable, or on a quiet system; I've used it a lot, but there are cases where it's not going to work for us.

The reason it's 100 times slower is the way strace works: it uses a kernel facility called ptrace, which is the debugger interface, the same interface GDB uses. It's remote-controlling dd, literally like a little remote control. It says to the kernel, via the ptrace syscall, "run this program until the next system call". The kernel says okay, dd runs, and it makes its write system call. It goes into the kernel to do that system call, and the kernel goes "oh, I've got to talk to strace, it wanted to know about this", so dd is paused and doing nothing while control goes over to strace. strace says "cool, we had a system call, I want to know what it was". To find that out it needs the CPU registers in the context of dd, so it has to ask the kernel for the registers; it goes back to the kernel, and the kernel sends them back over the ptrace interface. strace then says "great, now we need to actually run the system call", so control goes back to the kernel, which finishes the work of that write system call, but then we go back to strace again. Why? Well, at the start the registers tell us the parameters to the system call, write this data to this file, this many bytes, but we don't know the result yet, so after it runs we have to fetch the registers again to get the result; and at that point we can no longer see all the parameters that were passed in.
So we have to interrupt it twice, once at the start and once at the end, to get those registers. After it fetches the registers at the end it tells the kernel "okay, you can continue", and the result goes back to the process. That's a context switch from dd to strace through the kernel eight times per system call, and that's why it's 100 times slower. And dd has to wait the whole time: all the time we've spent in strace-land, dd is doing nothing, and that's why it slows down.

So it would be really great if we could write a custom tool to gather performance metrics on anything we wanted. It would also be super great if we could group them in any way we wanted that wasn't predetermined: by device, network card or disk, by which process, by which cgroup, anything we can think of. It would be awesome if we could do that, or even just get the same information we had before but with a much smaller impact on system performance. The good news is we can, and there are a number of ways to do it now. So here are a few quick demos.

These are bpftrace tools, and I'll explain how they work in a minute, but they can do some really interesting things. Here's one called gethostlatency, and it will trace not just one process (strace only attaches to one process at a time) but the entire system, every process. We hook into the gethostbyname call in libc and say: every time that happens, I want to collect it and print out which process was looking up which hostname and how long it took. That latency is one of the cool things we can get from bpftrace that other tools often won't show us: a per-request latency.
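gethostlatency ships with the bcc and bpftrace tool collections; the sketch below is a simplified version of the same idea using uprobes on libc. The libc path and the output formatting are assumptions, so adjust them for your system:

```bash
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname {
  // remember when the lookup started and which name was asked for
  @start[tid] = nsecs;
  @host[tid] = str(arg0);
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname /@start[tid]/ {
  // print process, per-request latency in milliseconds, and the hostname
  printf("%-8d %-16s %6d ms  %s\n", pid, comm,
         (nsecs - @start[tid]) / 1000000, @host[tid]);
  delete(@start[tid]);
  delete(@host[tid]);
}'
```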
And I've solved real problems like this. When I was doing web hosting, before my current job, customer applications would be written in WordPress, and the wonderful thing about WordPress is that there are like 100 million plugins written by 100 million people who sort of know how to write plugins, and then people install 100 of them, so you get all those bad decisions amplified 100 times. Often they're written by people who maybe live in the US and have a 20 or 30 millisecond latency to the entire internet. Most of us in Australia know that our latency is ten times that, more like 300 milliseconds to most things, so if you make 20 network calls or 20 DNS calls at 0.3 of a second each, that really adds up, and suddenly it takes 10 seconds to load your web page. This is an example of a tool where we can see how many of those requests we made and how long they took.

Another tool we have, and these are all in the bpftrace toolkit, and we can write custom ones, is udplife. It shows us all the connections that happened on the system, not just TCP but UDP, and how long they took, how long traffic was going back and forward. It's a more generic version of the last one: there we were looking only at DNS requests, now we're looking at latency by program for all UDP connections, how long they took and where they were going. In this output we can see the same DNS requests and their latency (the latency is in the milliseconds column, and port 53 is DNS), and we can also see an NTP request that had a 15 millisecond latency. Unlike netstat, this shows every connection in real time as they happen, and again for the entire system.
Another tool is opensnoop: it shows, for every program, which path it opened and whether the open returned an error. So far these three tools are still in the category where they transfer a lot of data to user space, essentially every operation. One of the other nice things we can do with bpftrace is start to summarize information: instead of transferring every operation out of the kernel into the user-space analysis tool, we can send only a summary. Here's an example of that. We're looking at the I/O latency of every I/O request to a disk on the entire system, for every process, but rather than just an average we're now bucketing it into a histogram. In this example the vast majority, about 8,000 operations, took 8 to 16 microseconds, another eight thousand took 16 to 32, and you can see a bar graph of what proportion of the total requests fell into each bucket. Most of them are very fast, a few tens of microseconds, but for some reason one request out of 30,000-plus took over a million microseconds, which is one to two seconds. These are the outliers I was talking about: if one program made all these requests sequentially, it doesn't matter that every single one except one took about 10 microseconds; one of them took a second, so the whole process takes a second to complete. Those are the outliers we need to understand when we're trying to solve some of these weird problems. What's unique here is that we're getting a histogram, not just an average.
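The histogram tool described here is in the same family as biolatency from the bcc and bpftrace tool collections. A simplified sketch of the idea, keyed on the block-layer tracepoints; this is my approximation and it ignores merged requests and other corner cases:

```bash
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  // remember when each request was issued, keyed by device and sector
  @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  // latency in microseconds, bucketed into a power-of-two histogram
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'
```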
Here's another tool doing summary analysis, called syscount. It shows the top system calls by ID (this tool unfortunately doesn't print the friendly text name, but it is possible to do that) and also the top ten processes by process name, so which processes are making the most system calls. And we can customize all of these things; these are just example programs. Whatever problem you're trying to solve, you can change that summary or analysis to be whatever you want it to be.

So what makes this possible? That was a little taster of what we can do with this tool; let's understand how it does it. The first technology it makes use of comes from the ftrace subsystem, which was merged way back in 2008, so it's really not that new a technology. The first thing ftrace brought in was the ability to have events in the kernel that we could attach to or record in some way, and there are a few different types of those; there are more, which I'll explain in a minute, but here are a couple of the main ones. The first are static tracepoints. The kernel developer of, say, the disk subsystem or the networking subsystem has defined some tracepoints that aren't going to change, along with some predetermined information that gets recorded with them: it might be the network device name, it might be the file name, that kind of thing. Those are ABI-stable, so we can rely on them; even if the internals of the kernel change, that tracepoint will always have the same name and output the same information.

We also have the ability to attach to any function in the kernel. If you go looking through the kernel source and there's something that interests you, you can attach to any one of those functions and trace it too. But in that case, on different kernel versions those functions might change, and you also have to figure out how to get the extra context yourself: if you want the file name or the disk sector or something like that, you have to start poking around in the kernel's data structures. That's great once you get really good at it, but for the more typical case it's nice to have these predefined tracepoints that already know what information to record.
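For example, you can list both kinds of probe points and inspect what a tracepoint will hand you; the particular patterns here are just illustrations:

```bash
sudo bpftrace -l 'tracepoint:block:*'                  # ABI-stable tracepoints for the block layer
sudo bpftrace -l 'kprobe:blk_*'                        # kernel functions; names can change between versions
sudo bpftrace -lv 'tracepoint:block:block_rq_issue'    # show the arguments this tracepoint provides
```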
This infrastructure is also super high performance, and the way that works is that when the kernel is compiled (there are a couple of different mechanisms in different places, but basically) it emits a string of no-op opcodes for the CPU to execute, and CPUs are really fast at doing nothing; that's been optimized to death by the CPU vendors. So it's just these five bytes in memory that do nothing, and only when we turn on tracing for a particular function does that memory get overwritten with instructions that jump out to the code that handles the tracing. Every function in the kernel is ready to be traced like this, but there's no performance impact until you turn it on. Incidentally, this is similar to how livepatching works: livepatch patches the start of the function to jump off to the new version rather than the existing one, so the two are related.

The second technology that brings this together and makes it really work is the ring buffer. In the strace case we had a remote control for the process: the kernel would hand control to strace, strace would tell the kernel what it wanted to do, back and forth, back and forth. With the ftrace system there's a ring buffer in memory in the kernel, and the kernel itself handles the trace event, so it doesn't break out to the user-space analysis application at all; it writes the trace event into that ring buffer entirely in kernel context.
The user-space tool then reads from that buffer asynchronously, at the same time, and there's no lock between them. What that allows is that the recording is now entirely in kernel space, so we're not wasting all that time context switching back and forth, and our process isn't paused either; that little bit of recording work is just poking a few bits in memory, which is very fast compared with all the jumping back and forth.

One of the disadvantages, however, is that for the predefined static tracepoints the kernel knows how to gather the context and put it into that in-memory buffer so the user-space application can read it. If we don't have a predefined tracepoint, which happens when we want to trace some arbitrary kernel function instead, it doesn't know how to gather its context, and the only thing we can record is the fact that the function was called; we can't really store the context like the disk sector, the length of the operation, and that kind of thing.

This has some safety elements built in as well. If the buffer fills up, those events simply get lost, so rather than the kernel stalling and maybe making applications slow because the user-space side can't read fast enough, it just drops events. And if the processing of these events takes too much time, the kernel also has mechanisms for throttling it back to make sure it stays at maybe a few percent of your system's time at most.
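This is the same machinery you can drive by hand through tracefs. A rough illustration, run as root; on older systems the path may be /sys/kernel/debug/tracing instead:

```bash
cd /sys/kernel/tracing
echo 1 > events/syscalls/sys_enter_write/enable   # arm a static tracepoint
cat trace_pipe                                    # stream events out of the in-kernel ring buffer
echo 0 > events/syscalls/sys_enter_write/enable   # disarm it again
```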
So let's look at the effect of this. Before, we saw that the same dd operation took 0.2 of a second on its own and 18 seconds under strace, even though strace wasn't tracing anything dd actually did. If we use perf trace instead, which takes advantage of that in-kernel tracing infrastructure and the ring buffer, so the applications aren't waiting on each other, we can now trace the same information in three seconds. That's still 15 times slower than our baseline, but it's a huge improvement over 18 seconds. And we can improve it again. That run was tracing every system call: reads, writes, connects, everything. If we narrow it down to only the write system call (and remember, strace took 18 seconds to observe nothing, whereas here we're observing half the work the process did, because it reads 500,000 bytes and writes 500,000 bytes, so half the work is reads and half is writes) we can get it down to one second, which is only five times slower than the baseline case.

But we're still copying a lot of data to user space: every one of those operations goes into the ring buffer, and user space reads out all 500,000 of them. So we're way faster, but we're still spending resources that could be saved.
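A sketch of the perf trace comparison, again approximating the exact commands from the demo:

```bash
# Trace every syscall dd makes, using the in-kernel tracing infrastructure and ring buffer
sudo perf trace dd if=/dev/zero of=/dev/null bs=1 count=500k

# Narrow it down to just the write syscall
sudo perf trace -e write dd if=/dev/zero of=/dev/null bs=1 count=500k
```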
to run it has all these 663 00:21:45,659 --> 00:21:49,020 safety checks that make it safe to 664 00:21:47,100 --> 00:21:51,299 actually put user code into the kernel 665 00:21:49,020 --> 00:21:52,799 which sounds a little crazy and uses 666 00:21:51,299 --> 00:21:54,419 that to filter the packets and so now 667 00:21:52,799 --> 00:21:55,860 only the packets we're interested in are 668 00:21:54,419 --> 00:21:58,020 actually sent out to T speed dump which 669 00:21:55,860 --> 00:22:01,260 had a huge performance win but it was 670 00:21:58,020 --> 00:22:03,600 very specific to to packet capture 671 00:22:01,260 --> 00:22:05,840 and this was in 1992 so very long time 672 00:22:03,600 --> 00:22:05,840 ago 673 00:22:06,480 --> 00:22:12,179 in about 2013 uh some developers took 674 00:22:10,260 --> 00:22:13,679 that BPF and they enhanced it they added 675 00:22:12,179 --> 00:22:15,840 that little virtual machine they've 676 00:22:13,679 --> 00:22:18,240 added more registers more storage space 677 00:22:15,840 --> 00:22:19,860 they made it jit compiled so rather than 678 00:22:18,240 --> 00:22:21,360 kind of interpreting that virtual 679 00:22:19,860 --> 00:22:24,240 machine that's now converted to machine 680 00:22:21,360 --> 00:22:25,799 code kind of like JavaScript might do 681 00:22:24,240 --> 00:22:27,659 and they added the ability to drive 682 00:22:25,799 --> 00:22:28,620 these programs not just from packets but 683 00:22:27,659 --> 00:22:30,360 from all sorts of different event 684 00:22:28,620 --> 00:22:32,640 sources 685 00:22:30,360 --> 00:22:34,140 and one of those event sources is these 686 00:22:32,640 --> 00:22:36,480 same events that we're tracing so those 687 00:22:34,140 --> 00:22:37,679 those static Trace points those K probes 688 00:22:36,480 --> 00:22:39,840 all these event sources can now be 689 00:22:37,679 --> 00:22:42,000 connected to a BBF program 690 00:22:39,840 --> 00:22:43,380 and we have all that same safety the 691 00:22:42,000 --> 00:22:44,700 kernel is going to verify there's 692 00:22:43,380 --> 00:22:46,380 actually reads and parcels and 693 00:22:44,700 --> 00:22:48,120 understands the entire program before it 694 00:22:46,380 --> 00:22:50,039 loads it to make sure it won't crash 695 00:22:48,120 --> 00:22:51,659 that it won't take too long to run but 696 00:22:50,039 --> 00:22:53,700 it can't access unsafe memory in that 697 00:22:51,659 --> 00:22:55,260 and that kind of thing 698 00:22:53,700 --> 00:22:57,179 that does put some limits on the program 699 00:22:55,260 --> 00:22:58,799 and so to get a lot of the work done we 700 00:22:57,179 --> 00:23:01,200 need to do the kernel has little helper 701 00:22:58,799 --> 00:23:02,460 programs the BPF program basically make 702 00:23:01,200 --> 00:23:04,620 some function calls out to certain 703 00:23:02,460 --> 00:23:05,880 kernel programs that will maybe make 704 00:23:04,620 --> 00:23:08,299 sure that whatever you're getting back 705 00:23:05,880 --> 00:23:08,299 is safe 706 00:23:08,880 --> 00:23:12,720 the win this gives us is now we can take 707 00:23:11,159 --> 00:23:14,340 any tracing event and instead of just 708 00:23:12,720 --> 00:23:15,720 putting the full output of that into the 709 00:23:14,340 --> 00:23:17,400 ring buffer and sending it to user space 710 00:23:15,720 --> 00:23:19,140 and now we can process that in kernel 711 00:23:17,400 --> 00:23:22,020 and we can actually use code just like 712 00:23:19,140 --> 00:23:24,059 the packet dump to summarize or extract 713 00:23:22,020 --> 00:23:25,559 
In about 2013 some developers took that BPF and enhanced it. They took the little virtual machine and added more registers and more storage space, they made it JIT-compiled, so rather than the virtual machine being interpreted it's converted to machine code, a bit like JavaScript engines do, and they added the ability to drive these programs not just from packets but from all sorts of different event sources. One of those event sources is the same events we've been tracing: the static tracepoints, the kprobes, all of these can now be connected to a BPF program. And we keep all the same safety: the kernel verifies the program, it actually reads, parses and understands the entire program before loading it, to make sure it won't crash, won't take too long to run, and can't access unsafe memory, that kind of thing. That does put some limits on the program, so to get a lot of the work done the kernel provides little helper functions; the BPF program makes calls out to those kernel helpers, which make sure that whatever you get back is safe.

The win this gives us is that we can now take any tracing event and, instead of putting its full output into the ring buffer and sending it to user space, process it in the kernel. Just like the packet filter, we can use code to summarize or extract data in the kernel and then put only the results into the ring buffer that goes out to user space. Now we're only transmitting a very small amount of data, so we're even faster again.

Looking at this example, I've written a tool in bpftrace using this technology to do the same thing we did before. Previously it took us one second to trace every write call, because we were transferring every write call out to user space, but maybe I just wanted to know how many there were. So I wrote a program that counts the write calls, does all the counting in the kernel, and the only thing it sends to user space is the number 512,000. That takes 0.3 of a second, which is only 50 percent slower than our baseline case. And this is a really contrived example, because the workload is purely system calls; if in reality the program spent a lot of CPU time doing the actual work programs do, and not just shuttling data back and forth, the impact would be even less, often maybe five or ten percent at most. As an example of that, with fio, which is a really good disk benchmarking tool, running an actual benchmark: I get 37,000 IOPS with no analysis running, and it only drops to 34,000 when I start analyzing the latency of every I/O request on the system. That's a much, much smaller drop than we had with all those other methods.
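The counting tool described a moment ago isn't shown in the transcript, but it would look something like this one-liner; treat it as my approximation:

```bash
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /comm == "dd"/ {
  @writes = count();   // counted entirely in the kernel; only the total is printed at exit
}' -c 'dd if=/dev/zero of=/dev/null bs=1 count=500k'
```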
Incidentally, eBPF was first added to the kernel not for strace as such, but for a thing called seccomp BPF. That was a little program we could send to the kernel: when we used that ptrace mechanism and said "intercept every system call", we could send a small program to the kernel to filter which system calls we wanted to trace, so strace wouldn't be woken up for all the unrelated system calls, only the ones we wanted. This is used for a few things, but when we use it in strace it halves the time we take; it still takes 10 seconds, versus the one second or 0.3 of a second we managed with in-kernel collection and in-kernel analysis of the data, because it only cuts out the events we didn't care about; every event we do care about is still transferred to user space by strace.

So let's have a quick look at the language of bpftrace. You can actually write BPF programs in Python or C or Go, but doing that is quite involved; if you're a software engineer that's quite possible, but for more of a sysadmin or SRE type of person, who maybe knows just enough to be dangerous when it comes to coding or how the kernel works, it's a fairly forbidding task. bpftrace's much nicer custom language lets us write these programs in a much friendlier way, so here's an example of what the language looks like. The bpftrace language is inspired by the DTrace language, a tool on Solaris from 2005 that did very much the same thing, and in the same way DTrace was inspired by awk; the reason it looks like awk is that it was heavily inspired by awk.
And that's why we have a BEGIN and an END, so we can run some code at the start and end of our script; in this case we just print out a nice little string so the person knows we're actually doing something and we've started our work. We can specify a probe type, and an example of a probe we might use is a system call, the write syscall, but there are a lot of different probes and we'll look at all of them in a second. We then have a predicate, so we can have a condition, again like awk: we can say I only want to match this probe if the process is a python process. Then we can run some actual code, and in this case we store into a global map called start, indexed by the thread ID, the system time when that probe fired.

Here's another probe with a more complicated predicate: first we check that we actually got a non-error return value, we also check that we're python, and we check that we matched that other probe first; then we do some more logic, in this case summing the latency, the time it took from the first probe to the second. We calculate that latency and sum it per process ID, so for whatever we're probing we add up the total time spent in that period, and then every 30 seconds we print out all the data we've stored, and at the end we print it too. So, a bit like iostat and top and those tools do, it can print at intervals and then print the result at the end.

[Audience question about the count() in the slide.] So the question is that there seem to be some magic variables in there, like count. Yes, there are a bunch of those, although count is actually not a variable, it's more of a function that counts up... oh, I see what you're saying: I've actually made a mistake when I was preparing this slide. My mistake, I put count there; that should probably be @time. That was just an error in my slide preparation; this is not a real program, I was just trying to show the shape of it in text. Thank you for catching that.
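The slide itself isn't reproduced in the transcript, but a program matching that description (including the @time correction) would look roughly like this; the probe names, the "python" filter and the 30-second interval follow the description above, so treat it as a sketch:

```bash
sudo bpftrace -e '
BEGIN { printf("Tracing write() latency for python... hit Ctrl-C to end.\n"); }

// first probe: remember when the syscall started, keyed by thread ID
tracepoint:syscalls:sys_enter_write /comm == "python"/ {
  @start[tid] = nsecs;
}

// second probe: only if the call succeeded, we are python, and the first probe matched
tracepoint:syscalls:sys_exit_write /args->ret >= 0 && comm == "python" && @start[tid]/ {
  @time[pid] = sum(nsecs - @start[tid]);   // total nanoseconds spent, per process ID
  delete(@start[tid]);
}

interval:s:30 { print(@time); }   // print the summary every 30 seconds

END { clear(@start); }            // remaining maps such as @time are printed automatically at exit
'
```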
I mentioned before that there are a bunch of different probe events we can attach to. The two we've talked about so far: one was tracepoints, the predefined ones in the kernel that kernel developers have given us as a convenience, where bpftrace already knows how to extract some context. We can also use a kprobe, which attaches to any kernel function, and kprobes come as a kprobe and a kretprobe: as I described earlier in the strace example, at the start of a function we need to intercept it to capture the arguments of the function, and then we need to intercept it again at the end to get the return value, or perhaps collect the results of that function running, so we probe those separately.

We can also probe user-space programs: a uprobe will probe a function not in the kernel but in a program or a shared library, and I'll show in a bit how, for example, we can probe the SSL library in every process on the system and capture information from that. User-space programs can also, much like tracepoints in the kernel, have statically defined user-space tracepoints called USDT, and a bunch of pieces of software ship some of those. And we can do a bunch of periodic things: we can run on an interval, and we can run based on performance counters from the CPU. Those last ones get a bit more advanced, so we're not going to talk about them today, but there's a whole range of them you can look at. The user-space probes, the kernel probes and the tracepoints are the most interesting ones.
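For reference, the attach-point syntax for these probe types looks roughly like this (the library paths, the usdt example and the counter settings are placeholders, not taken from the talk):

    tracepoint:syscalls:sys_enter_openat        # static kernel tracepoints
    kprobe:vfs_read                             # entry of (almost) any kernel function
    kretprobe:vfs_read                          # return of a kernel function
    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write    # user-space function entry
    uretprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read  # user-space function return
    usdt:/path/to/binary:provider:probe_name    # statically defined user-space tracepoints
    interval:s:30                               # run periodically on a timer
    profile:hz:99                               # sample across CPUs at 99 Hz
    hardware:cache-misses:1000000               # fire per N CPU performance-counter events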
So I promised five recipes, so I'm going to go through five solutions I've actually used when solving problems and show the program and what it looks like.

The first one: I was trying to profile some Python applications. Now, bpftrace actually has the ability to get the kernel stack and the user-space stack, the actual call stack like a debugger would, but it can only do that for programs with traditional frame pointers and debug info. That doesn't work with Python right now; it might get fixed, because there is some ability to add a bit of magic so that it can understand what's going on inside Python, but I was using a different Python tool called Austin, which will profile the Python application. The catch was that I wasn't launching python myself, something else was, so what I wanted to do was catch every time python was started and then immediately launch the Austin profiler against that python process.

This is a simple program where we capture every time we execute something (it isn't called exec, it's called execve), check whether it was successful, and, if it's a Python 3 process, output the process ID; I then process that output to launch the profiler. I was using this because at Canonical we deploy software with a bunch of Python code that's orchestrated by a tool called Juju, and Juju was launching it; I just wanted to profile everything it launched without having to modify Juju, and this was a super easy way to do that.
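A minimal sketch of that idea (the comm match is illustrative and may need adjusting on your system, for example to "python3.10"):

    #!/usr/bin/env bpftrace
    // Print the PID of every successful execve() of a python3 binary.
    tracepoint:syscalls:sys_exit_execve
    /args->ret == 0 && strncmp(comm, "python3", 7) == 0/
    {
        printf("%d\n", pid);
    }

The printed PIDs could then be fed from a small shell loop into something like "austin -p <pid>" (assuming Austin's attach-to-PID flag), skipping bpftrace's initial "Attaching..." banner line.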
Here's another example. A version of this tool ships with bpftrace, but it doesn't tell you the offset into the disk that it was reading from or the length it was reading. I was debugging a Ceph case and I wanted to understand where on the disk we were reading, at what length, and what the latency was, so I customized that tool and enhanced it to understand the disk offset and the disk length and output those.

In doing so I discovered why tracepoints are great: kernel developers who know much more about that kernel subsystem have predefined all that information and how to get it. I was using kprobes, so I had to look through the data structures and follow several structures myself to find things like the length and the offset, and it turns out I did that wrong, in such a way that the length and offset looked right most of the time but were sometimes entirely wrong. So I do advise some caution if you're a kernel developer who knows just enough to be dangerous (that's me): verify the output of anything you've come up with if it's a kprobe, or rely on someone who knows a bit more. But I've only made that mistake once, so...
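A sketch of that kind of tool, but built on the block-layer tracepoints rather than the kprobes the speaker used (field names as defined by the block:block_rq_issue and block:block_rq_complete tracepoints on recent kernels; matching issue to complete by device and sector is approximate when requests are merged or split):

    #!/usr/bin/env bpftrace
    // Per-I/O device, sector offset, length and latency, plus a histogram.

    tracepoint:block:block_rq_issue
    {
        @start[args->dev, args->sector] = nsecs;
    }

    tracepoint:block:block_rq_complete
    /@start[args->dev, args->sector]/
    {
        $us = (nsecs - @start[args->dev, args->sector]) / 1000;
        printf("dev %d sector %llu len %u sectors: %d us\n",
            args->dev, args->sector, args->nr_sector, $us);
        @latency_us = hist($us);
        delete(@start[args->dev, args->sector]);
    }

    END
    {
        clear(@start);   // @latency_us is printed automatically on exit
    }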
Another tool I used in a real case: I did a whole talk on this at LCA in 2020, called "Fixing Windows guests on Ceph", with the whole story. In that case, what was happening is that Linux machines do all their I/O in four-kilobyte blocks, while Windows machines do it in 512-byte blocks, and most modern storage devices want all their I/O done in four-kilobyte blocks, not 512-byte blocks. So Windows was sending a 512-byte write, and internally Ceph was sending that to the kernel as a buffered I/O request; the kernel only got 512 bytes, but it needed four kilobytes, because it can't send any less than that to the disk. So instead of the write call just being buffered in memory, with the kernel not even waiting before sending it on to the disk, it had to go out to the disk, read the rest of the four kilobytes, pull that in, change only those 512 bytes, and then return to the program. That's called a read-modify-write penalty.

We didn't realize that was happening at the time, but I had already narrowed down that the pwritev system call was taking a long time and I didn't know why. So with this program I broke down the latency of each pwritev call and which process made it, and from that I determined that the tp_fstore_op thread, which is an internal Ceph thread, had these bad latencies, and that was the process I was interested in.
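A minimal sketch of that breakdown, keying the latency histogram by thread name and PID (Ceph names its threads, so the comm key separates threads such as tp_fstore_op from everything else):

    #!/usr/bin/env bpftrace
    // Per-thread pwritev() latency in microseconds.

    tracepoint:syscalls:sys_enter_pwritev
    {
        @start[tid] = nsecs;
    }

    tracepoint:syscalls:sys_exit_pwritev
    /@start[tid]/
    {
        @usecs[comm, pid] = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
    }

    END
    {
        clear(@start);   // the @usecs histograms are printed automatically on exit
    }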
I had another case that was a murder mystery: something was randomly killing processes on the system and I didn't know what, and that's a really hard problem to debug; traditionally there's no easy way to find out who killed what. So in this case I attached to every point where a signal is generated in the kernel and printed out which process sent which signal to which other process. From that I was able to figure out it was actually systemd killing my process, for a very complicated reason to do with session scope management, but that's another excellent bpftrace example.
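A sketch of that signal tracing (field names follow the signal:signal_generate tracepoint; a predicate such as /args->sig == 9/ could restrict it to SIGKILL only):

    #!/usr/bin/env bpftrace
    // Log every signal generated in the kernel: who sent what to whom.
    tracepoint:signal:signal_generate
    {
        printf("%s (pid %d) sent signal %d to %s (pid %d)\n",
            comm, pid, args->sig, str(args->comm), args->pid);
    }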
And this was kind of a fun one that I like. All the stuff before was looking at the kernel; this one uses uprobes. What we're doing here is probing the SSL library, again across the entire system, and hooking into the SSL_read and SSL_write calls. When there's a call, we read the buffer that was passed into SSL_read or SSL_write and print out the data the program is sending. If you've ever tried to debug an HTTPS connection with tcpdump, all you get is the encrypted data, and sometimes you can convince your program to write the encryption key out so you can try to decrypt it, but we can just bypass all of that: hook into the SSL calls and look at the unencrypted data before it's encrypted or after it's decrypted.

This does unfortunately highlight a current limitation of bpftrace, which is that we can only print or capture strings of up to 200 bytes at most. That's a fixable limitation, but no one has fixed it yet. There are tools that use a full language such as Go to do the same thing, one called eCapture, and they can bypass that limitation and print out all the data, but if you use bpftrace you'll only get 200 bytes at a time.
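A sketch of the SSL hooking (the library path is an assumption and will vary by distro and libssl version; the str() output is capped by the string-length limit just mentioned):

    #!/usr/bin/env bpftrace
    // Dump plaintext passing through OpenSSL, system-wide.

    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write
    {
        // SSL_write(ssl, buf, num): arg1 is the buffer, arg2 its length
        printf("SSL_write %s(%d): %s\n", comm, pid, str(arg1, arg2));
    }

    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read
    {
        @buf[tid] = arg1;    // remember the destination buffer until return
    }

    uretprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read
    /@buf[tid] && retval > 0/
    {
        printf("SSL_read  %s(%d): %s\n", comm, pid, str(@buf[tid], retval));
        delete(@buf[tid]);
    }

    END
    {
        clear(@buf);
    }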
And as a bonus, because I wanted to throw in an extra one, number six out of five: I wanted to trace something in the kernel and then correlate it with the stack of whatever called it. In this case I was trying to understand why systemd was, at some point, changing the CPU set of a process, the list of CPUs it can run on, and I didn't know which area of systemd was doing it. You can actually wildcard kprobes (unfortunately it's not on the slide right now), so I said cpuset-star: probe every cpuset function in the kernel and print me the user-space stack of whatever ultimately called it. That way I could see all the code paths in systemd that were calling into those functions, and that let me figure out where to go looking for why it was changing the CPU set in a way I didn't expect.
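A minimal sketch of that wildcard probe (readable stacks need frame pointers and symbols in the target binary, as noted earlier; the maps are printed when the tool exits):

    #!/usr/bin/env bpftrace
    // Fire on every kernel function whose name starts with "cpuset_",
    // but only when systemd is the caller, and aggregate the user-space
    // stacks that led there.
    kprobe:cpuset_*
    /comm == "systemd"/
    {
        @calls[probe, ustack] = count();
    }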
So ultimately there are many different methods of doing these things. Some of the limitations I talked about, like top missing things, you can solve in other ways, but the nice thing about bpftrace is that it lets you solve a lot of those problems in a language that's very sysadmin friendly; you don't necessarily need to be a kernel engineer. And you're always using the same tool: there is, for example, a version of top called atop which uses process accounting so it doesn't miss processes, but it's quite complicated. So bpftrace lets you solve lots of problems in a quick and easy way, but there are a bunch of limitations, like that string-length one, and if you want to do more advanced things you can still use the same technology from a real language like Python or Go, which gives you more flexibility.

There are a couple of other related tools I often use for this. The perf tool itself: if you don't want to get into bpftrace, you can do a lot with just perf, it just doesn't do the summarizing part for you. And FlameGraph, if you've never heard of it, is a really great tool for showing you graphically where you're spending CPU time, which can help with these problems.

Last of all, as a quick note, this concept of putting tiny user-defined, event-driven programs into the kernel has really taken off, which some of you may or may not have seen. We're not just doing packet captures or tracing any more: you can now put entire network load balancers in the kernel, custom CPU schedulers, firewall processing, profiling, auditing. It's become an industry; BPF has become a technology in its own right, and every company doing interesting things, often in Kubernetes, is putting all sorts of stuff into it. If that sounds interesting, or horrifying, I'd recommend reading up on it and seeing what you can do.

If you want to find out more, there are some resources here: there are a couple of good books by Brendan Gregg, who is a very good resource on this topic, and that's where to find bpftrace; if you search YouTube for bpftrace there are a lot more talks that go into much more depth. And that is all I have.

So I believe we have maybe a couple of minutes for questions, a little bit of time. Anyone got any questions?

An audience member asks: my go-to tool for this is SystemTap; how does bpftrace compare to SystemTap? Great question. So the question is how SystemTap compares to bpftrace. SystemTap lets you do a bunch of the same things that bpftrace does, and has done for a long time. The way SystemTap worked internally is that it would take a similar kind of language (different, but similar), turn it into C, compile it, and load it as a kernel module. So it didn't have the same implicit safety guarantee of the generic BPF system; it tried to provide its own safety guarantees, but not in the same way. It also required you to have a compiler and kernel headers and to load a module, and that has become more of a problem now with Secure Boot, because you then need to sign that module and so on, which you potentially avoid with the BPF method. But you can solve a lot of the same problems with SystemTap.

One thing I didn't talk about very much: right now, when you probe a user-space program, that's actually a much bigger performance penalty than the kernel side, because it traps to the kernel and back every time one of those probes is hit. SystemTap had a solution for that, where it would push some of that work into user space; the same thing could be done with bpftrace, but no one has really done the work to make that happen yet, so there might be an advantage there in some cases.

Okay, thank you very much, Trent, for your time and effort; here's a little gift from us, thank you very much. Just a little thing, people: the key card for the Oaks accommodation hasn't been claimed, so if you're staying at one of the Oaks venues, please check that you've got your key card, because finding out when you get there this evening is not going to be a lot of fun. All right, let me go...