1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:15,280 --> 00:00:19,760 welcome back everyone 3 00:00:16,800 --> 00:00:21,480 um next up uh we have mohamed for luck 4 00:00:19,760 --> 00:00:24,800 talking about uh 5 00:00:21,480 --> 00:00:26,960 ebpf 101 mohamed verlak is a software 6 00:00:24,800 --> 00:00:29,039 engineer who is interested in how the 7 00:00:26,960 --> 00:00:31,760 ebpf subsystem 8 00:00:29,039 --> 00:00:33,600 can be leveraged in novel ways to enable 9 00:00:31,760 --> 00:00:37,120 new use cases 10 00:00:33,600 --> 00:00:37,120 uh please welcome muhammad vlad 11 00:00:38,160 --> 00:00:45,680 hello so this is uh the ebpf 101 talk i hope 12 00:00:43,120 --> 00:00:48,160 everybody can hear me all right so the 13 00:00:45,680 --> 00:00:51,280 idea is that we only have 20 minutes 20 14 00:00:48,160 --> 00:00:52,480 or 25 odd minutes plus questions so 15 00:00:51,280 --> 00:00:54,480 what we're going to do is we're not 16 00:00:52,480 --> 00:00:55,440 going to do any programming sort of 17 00:00:54,480 --> 00:00:57,680 learning 18 00:00:55,440 --> 00:00:59,280 how to program the ebpf subsystem 19 00:00:57,680 --> 00:01:01,440 but it's an overview from the 20 00:00:59,280 --> 00:01:03,359 perspective of a non-kernel programmer of 21 00:01:01,440 --> 00:01:04,879 what the ebpf subsystem means because 22 00:01:03,359 --> 00:01:06,799 there's been lately a lot of buzz around 23 00:01:04,879 --> 00:01:08,880 the ebpf subsystem 24 00:01:06,799 --> 00:01:11,600 so 25 00:01:08,880 --> 00:01:13,280 a little bit about me who am i i 26 00:01:11,600 --> 00:01:15,200 recently finished this school book 27 00:01:13,280 --> 00:01:16,799 pretending to be a linux kernel expert 28 00:01:15,200 --> 00:01:18,720 pun intended 29 00:01:16,799 --> 00:01:20,400 i come from a very beautiful part of the 30 00:01:18,720 --> 00:01:21,680 world called srinagar 31 00:01:20,400 --> 00:01:22,560 kashmir 32 00:01:21,680 --> 00:01:25,840 and 33 00:01:22,560 --> 00:01:27,759 yeah
i am in no way an expert in all of 34 00:01:25,840 --> 00:01:30,159 this i am just reading it for fun and 35 00:01:27,759 --> 00:01:34,000 trying to you know leverage it and use 36 00:01:30,159 --> 00:01:36,400 it in my own day job and try and do 37 00:01:34,000 --> 00:01:37,759 more experiments with it 38 00:01:36,400 --> 00:01:40,400 so 39 00:01:37,759 --> 00:01:41,360 yeah let's look at the agenda so first 40 00:01:40,400 --> 00:01:44,720 of all 41 00:01:41,360 --> 00:01:47,680 we are going to look at the history of 42 00:01:44,720 --> 00:01:49,040 what this bpf thingy is 43 00:01:47,680 --> 00:01:51,600 and then we are going to move to the 44 00:01:49,040 --> 00:01:54,880 more recent the current parts of 45 00:01:51,600 --> 00:01:56,159 ebpf and where the e comes from 46 00:01:54,880 --> 00:01:58,880 so 47 00:01:56,159 --> 00:02:02,159 without further ado let's get started 48 00:01:58,880 --> 00:02:05,119 so let's go back in time probably to the 49 00:02:02,159 --> 00:02:08,160 90s and assume that there's no 50 00:02:05,119 --> 00:02:10,399 tcpdump and we want to design a packet 51 00:02:08,160 --> 00:02:12,959 filter so 52 00:02:10,399 --> 00:02:13,920 what could be our sort of 53 00:02:12,959 --> 00:02:16,640 design 54 00:02:13,920 --> 00:02:18,959 goals we want to design a packet filter 55 00:02:16,640 --> 00:02:20,720 which copies or 56 00:02:18,959 --> 00:02:23,840 gives us a way to look at every packet 57 00:02:20,720 --> 00:02:25,200 that goes out through the wire and 58 00:02:23,840 --> 00:02:27,840 inspect it 59 00:02:25,200 --> 00:02:29,040 so how could it be implemented there are 60 00:02:27,840 --> 00:02:29,840 generally 61 00:02:29,040 --> 00:02:31,519 two 62 00:02:29,840 --> 00:02:33,440 ways these are not exhaustive ways but 63 00:02:31,519 --> 00:02:34,879 generally there are two ways 64 00:02:33,440 --> 00:02:36,959 one could be 65 00:02:34,879 --> 00:02:40,239 you copy everything that goes through 66 00:02:36,959 --> 00:02:41,840 the wire to user
space and then apply a 67 00:02:40,239 --> 00:02:44,160 filter on that for whatever is 68 00:02:41,840 --> 00:02:46,640 interesting to you and whatever is not 69 00:02:44,160 --> 00:02:48,400 the other way could be more optimal 70 00:02:46,640 --> 00:02:52,879 where you write a kernel module and you 71 00:02:48,400 --> 00:02:55,599 say if x is the destination port 72 00:02:52,879 --> 00:02:57,599 y is the source port 73 00:02:55,599 --> 00:03:00,720 blah blah blah you load that module in 74 00:02:57,599 --> 00:03:02,400 the kernel it only copies the packets 75 00:03:00,720 --> 00:03:05,599 that you wanted to look at that were 76 00:03:02,400 --> 00:03:07,440 interesting to you but again we are 77 00:03:05,599 --> 00:03:10,879 going to talk about the trade-offs 78 00:03:07,440 --> 00:03:13,120 involved so what's the problem here 79 00:03:10,879 --> 00:03:15,120 if we have the user space implementation 80 00:03:13,120 --> 00:03:15,840 we copy everything it's 81 00:03:15,120 --> 00:03:18,080 like 82 00:03:15,840 --> 00:03:21,760 a no-brainer you copy everything that 83 00:03:18,080 --> 00:03:22,959 goes on to the wire to the user space 84 00:03:21,760 --> 00:03:24,720 and then 85 00:03:22,959 --> 00:03:26,319 you do what you need to do with that in 86 00:03:24,720 --> 00:03:28,400 the kernel space if you implemented that 87 00:03:26,319 --> 00:03:31,440 module thing what would have happened 88 00:03:28,400 --> 00:03:33,120 you hard code whatever you wanted 89 00:03:31,440 --> 00:03:35,599 to look at what was interesting for your 90 00:03:33,120 --> 00:03:37,040 use case and then only that thing gets 91 00:03:35,599 --> 00:03:38,480 copied to the user space and you take a 92 00:03:37,040 --> 00:03:40,400 look at it at the end of the day you 93 00:03:38,480 --> 00:03:42,159 have to copy stuff to the user space the 94 00:03:40,400 --> 00:03:45,120 only optimization that we are looking at 95 00:03:42,159 --> 00:03:47,280 is how much do you copy 96
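A rough sketch of the two designs just described, written in Python rather than as real kernel code. Everything in it (the `Packet` type, the two function names, the made-up traffic of 100 packets) is hypothetical and only there for illustration; the one thing it tries to show is how many packets cross the kernel/user space boundary in each design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_port: int
    dst_port: int


def copy_all_then_filter(wire, dst_port):
    """Design 1: copy every packet to user space, then filter there."""
    copied = list(wire)  # every packet crosses the boundary
    wanted = [p for p in copied if p.dst_port == dst_port]
    return wanted, len(copied)


def filter_in_kernel(wire, dst_port):
    """Design 2: a hard-coded kernel module filters first and copies only matches."""
    wanted = [p for p in wire if p.dst_port == dst_port]
    return wanted, len(wanted)


# Hypothetical traffic: 100 packets, 3 of them headed to port 443.
wire = [
    Packet(src_port=40000 + i, dst_port=443 if i % 40 == 0 else 80)
    for i in range(100)
]

wanted_a, copied_a = copy_all_then_filter(wire, 443)
wanted_b, copied_b = filter_in_kernel(wire, 443)
print("copied to user space:", copied_a, "vs", copied_b)
```

Both designs end up with the same interesting packets; the difference is only in how much had to be copied to get them, which is the optimization the talk is pointing at.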
00:03:45,120 --> 00:03:49,920 and uh if you look at the 97 00:03:47,280 --> 00:03:52,159 trade-offs here the user space thingy 98 00:03:49,920 --> 00:03:53,680 is not optimal 99 00:03:52,159 --> 00:03:54,959 and the kernel 100 00:03:53,680 --> 00:03:56,720 module thing 101 00:03:54,959 --> 00:03:59,200 is pretty optimal because it just 102 00:03:56,720 --> 00:04:00,560 copies what you want and you're not 103 00:03:59,200 --> 00:04:02,959 doing 104 00:04:00,560 --> 00:04:04,720 more work than is required the user 105 00:04:02,959 --> 00:04:07,040 space implementation is a generic 106 00:04:04,720 --> 00:04:08,720 solution you can implement it once sort 107 00:04:07,040 --> 00:04:10,239 of have that switch in the driver or 108 00:04:08,720 --> 00:04:12,879 whatever 109 00:04:10,239 --> 00:04:14,959 and just be done with it whenever 110 00:04:12,879 --> 00:04:17,120 you want to do some tcpdump thingy you 111 00:04:14,959 --> 00:04:19,519 just nudge the driver it copies 112 00:04:17,120 --> 00:04:21,120 everything to user space and then 113 00:04:19,519 --> 00:04:22,639 you do the packet processing and then 114 00:04:21,120 --> 00:04:24,160 apply filters for example there are 100 115 00:04:22,639 --> 00:04:25,919 packets copied 116 00:04:24,160 --> 00:04:28,479 and the interesting 117 00:04:25,919 --> 00:04:30,400 ones could be only three or four on the 118 00:04:28,479 --> 00:04:32,320 kernel side you only copy the three 119 00:04:30,400 --> 00:04:33,520 or four but this solution is not generic 120 00:04:32,320 --> 00:04:36,800 because you had 121 00:04:33,520 --> 00:04:37,840 hard coded or you had written the module 122 00:04:36,800 --> 00:04:40,960 and 123 00:04:37,840 --> 00:04:43,120 it would only apply for that destination 124 00:04:40,960 --> 00:04:44,960 and that destination port 125 00:04:43,120 --> 00:04:46,400 and whatever protocol you're 126 00:04:44,960 -->
00:04:48,240 looking at 127 00:04:46,400 --> 00:04:50,479 there's one more interesting case here 128 00:04:48,240 --> 00:04:52,960 the user space solution is a little 129 00:04:50,479 --> 00:04:55,040 safer when we say safe i mean if there's 130 00:04:52,960 --> 00:04:56,880 a bug in your code what's the worst i 131 00:04:55,040 --> 00:04:58,400 mean what could go wrong 132 00:04:56,880 --> 00:05:00,639 you basically could have a segfault in 133 00:04:58,400 --> 00:05:03,199 user space that's not too bad 134 00:05:00,639 --> 00:05:06,240 nothing is going to go too badly but if 135 00:05:03,199 --> 00:05:08,320 the kernel module had a bug 136 00:05:06,240 --> 00:05:10,000 the whole system is going down 137 00:05:08,320 --> 00:05:11,680 probably i mean the sanity of the system 138 00:05:10,000 --> 00:05:12,720 is in question 139 00:05:11,680 --> 00:05:14,400 so 140 00:05:12,720 --> 00:05:15,520 what would be the right way to do things 141 00:05:14,400 --> 00:05:18,000 here 142 00:05:15,520 --> 00:05:19,919 yeah i mean what if we had the best of 143 00:05:18,000 --> 00:05:21,600 both worlds like we had a 144 00:05:19,919 --> 00:05:24,320 generic solution 145 00:05:21,600 --> 00:05:26,479 and optimal performance so what would 146 00:05:24,320 --> 00:05:29,440 that kind of a design be i mean how 147 00:05:26,479 --> 00:05:30,160 would you do it 148 00:05:29,440 --> 00:05:33,280 so 149 00:05:30,160 --> 00:05:35,919 in 1992 the bsd packet filter this paper 150 00:05:33,280 --> 00:05:36,800 it's a seminal paper that was published 151 00:05:35,919 --> 00:05:39,120 where 152 00:05:36,800 --> 00:05:41,680 the folks designed a novel architecture 153 00:05:39,120 --> 00:05:44,400 to do exactly this thing of how do 154 00:05:41,680 --> 00:05:45,840 we implement tcpdump and do it 155 00:05:44,400 --> 00:05:46,639 optimally 156 00:05:45,840 --> 00:05:48,639 so 157 00:05:46,639 --> 00:05:50,160 they had an interesting design
158 00:05:48,639 --> 00:05:54,080 choice what they did is 159 00:05:50,160 --> 00:05:56,400 they implemented a vm a simple vm which 160 00:05:54,080 --> 00:05:58,880 resides in the kernel and 161 00:05:56,400 --> 00:06:01,840 it just could not do much 162 00:05:58,880 --> 00:06:03,360 it just does a bunch of loads 163 00:06:01,840 --> 00:06:05,199 a bunch of stores 164 00:06:03,360 --> 00:06:07,199 a little bit of jumping 165 00:06:05,199 --> 00:06:10,160 around very basic arithmetic 166 00:06:07,199 --> 00:06:12,560 operations returns and it had an 167 00:06:10,160 --> 00:06:15,440 accumulator register and one more the x 168 00:06:12,560 --> 00:06:17,520 register so you could transfer 169 00:06:15,440 --> 00:06:20,240 instructions 170 00:06:17,520 --> 00:06:22,400 i mean data from the accumulator to the 171 00:06:20,240 --> 00:06:24,240 x register or maybe from the x register 172 00:06:22,400 --> 00:06:27,360 to the accumulator register nothing 173 00:06:24,240 --> 00:06:30,000 fancy at all so you have 174 00:06:27,360 --> 00:06:31,039 basically a vm that resides in the 175 00:06:30,000 --> 00:06:33,919 kernel 176 00:06:31,039 --> 00:06:35,600 and you operate that and you do 177 00:06:33,919 --> 00:06:37,600 something with that which we'll come to 178 00:06:35,600 --> 00:06:38,720 shortly 179 00:06:37,600 --> 00:06:41,280 so 180 00:06:38,720 --> 00:06:43,120 okay now what you've done is 181 00:06:41,280 --> 00:06:45,840 you've created a vm you've dumped it 182 00:06:43,120 --> 00:06:47,360 inside the kernel but how do you run 183 00:06:45,840 --> 00:06:49,039 a bpf program 184 00:06:47,360 --> 00:06:50,800 okay let's take a digression first i 185 00:06:49,039 --> 00:06:53,840 mean how 186 00:06:50,800 --> 00:06:55,520 do user space programs work 187 00:06:53,840 --> 00:06:57,360 you could have a compiled program you 188 00:06:55,520 --> 00:06:59,759 could have an interpreted program for 189 00:06:57,360 --> 00:07:01,360 compiled
programs you write the piece of 190 00:06:59,759 --> 00:07:03,599 code 191 00:07:01,360 --> 00:07:05,759 then you give it to the compiler 192 00:07:03,599 --> 00:07:07,599 plus the linker you get a binary out of 193 00:07:05,759 --> 00:07:09,039 it and then you go ahead and you run the 194 00:07:07,599 --> 00:07:11,039 binary 195 00:07:09,039 --> 00:07:12,800 and you get the 196 00:07:11,039 --> 00:07:15,520 output say you wanted to write a hello world 197 00:07:12,800 --> 00:07:18,319 program you write the source code 198 00:07:15,520 --> 00:07:21,039 compile plus link it and you get an a.out 199 00:07:18,319 --> 00:07:22,560 run ./a.out or whatever and it just 200 00:07:21,039 --> 00:07:24,720 prints it on the screen 201 00:07:22,560 --> 00:07:27,039 you as a programmer or whoever wants to 202 00:07:24,720 --> 00:07:28,560 use the program has control over when 203 00:07:27,039 --> 00:07:30,880 the program runs 204 00:07:28,560 --> 00:07:32,400 in the interpreted case similar thing 205 00:07:30,880 --> 00:07:34,479 you write the code you hand it over to 206 00:07:32,400 --> 00:07:37,199 the interpreter it spits out the 207 00:07:34,479 --> 00:07:40,319 instructions and then executes them 208 00:07:37,199 --> 00:07:42,960 but how do you do it on the bpf side 209 00:07:40,319 --> 00:07:45,120 because the bpf vm is inside the kernel i 210 00:07:42,960 --> 00:07:46,879 mean you don't have control over the 211 00:07:45,120 --> 00:07:49,840 kernel you can't really just go 212 00:07:46,879 --> 00:07:51,759 and run whatever you want in the kernel 213 00:07:49,840 --> 00:07:52,639 the only interface you have with the 214 00:07:51,759 --> 00:07:55,360 kernel 215 00:07:52,639 --> 00:07:57,520 is the syscall or maybe some interrupts 216 00:07:55,360 --> 00:07:58,960 but generally it's just the syscall so 217 00:07:57,520 --> 00:08:00,800 do you have a direct way of nudging the 218 00:07:58,960 --> 00:08:01,840 kernel to run a program 219
00:08:00,800 --> 00:08:03,919 probably 220 00:08:01,840 --> 00:08:05,599 i mean would that make sense 221 00:08:03,919 --> 00:08:06,639 we'll look at it 222 00:08:05,599 --> 00:08:09,199 so 223 00:08:06,639 --> 00:08:10,639 how does a bpf program 224 00:08:09,199 --> 00:08:13,360 run 225 00:08:10,639 --> 00:08:15,840 bpf programs it's very important to note 226 00:08:13,360 --> 00:08:18,639 that bpf programs unlike normal 227 00:08:15,840 --> 00:08:21,039 programs are not 228 00:08:18,639 --> 00:08:24,560 dependent upon the programmer or whoever 229 00:08:21,039 --> 00:08:25,759 wants to use them they do not run at their 230 00:08:24,560 --> 00:08:28,639 wish 231 00:08:25,759 --> 00:08:31,199 they are mostly event driven so there 232 00:08:28,639 --> 00:08:33,039 are a bunch of events in the kernel that 233 00:08:31,199 --> 00:08:34,880 are placed or a bunch of hook points in 234 00:08:33,039 --> 00:08:37,599 the kernel and 235 00:08:34,880 --> 00:08:39,680 whenever you write an ebpf program or 236 00:08:37,599 --> 00:08:41,839 a bpf program 237 00:08:39,680 --> 00:08:43,680 you write that in the instruction set 238 00:08:41,839 --> 00:08:45,519 that simple instruction set you 239 00:08:43,680 --> 00:08:47,920 target for that vm 240 00:08:45,519 --> 00:08:51,440 you load it in the kernel you load the 241 00:08:47,920 --> 00:08:53,360 program via a syscall into the kernel 242 00:08:51,440 --> 00:08:55,279 it still will not run because you have 243 00:08:53,360 --> 00:08:56,160 just loaded the program 244 00:08:55,279 --> 00:08:59,200 then 245 00:08:56,160 --> 00:09:00,240 you attach it to a particular hook point 246 00:08:59,200 --> 00:09:02,640 so 247 00:09:00,240 --> 00:09:05,040 let's actually look at it the bpf 248 00:09:02,640 --> 00:09:07,600 programs are stateless that's an 249 00:09:05,040 --> 00:09:11,360 interesting point to note that these 250 00:09:07,600 --> 00:09:14,320 programs are very simple they run to 251 00:09:11,360 -->
00:09:15,200 completion and you load them into the 252 00:09:14,320 --> 00:09:16,800 kernel 253 00:09:15,200 --> 00:09:18,399 you first of all write the filter 254 00:09:16,800 --> 00:09:21,120 expression the filter expression is 255 00:09:18,399 --> 00:09:22,399 pretty simple it's like here we are 256 00:09:21,120 --> 00:09:24,959 going to go through this in a short 257 00:09:22,399 --> 00:09:27,279 while but you write it so that the 258 00:09:24,959 --> 00:09:29,279 eventual answer should be true or false 259 00:09:27,279 --> 00:09:31,920 you load the byte code in the kernel you 260 00:09:29,279 --> 00:09:33,279 attach the loaded program to a hook 261 00:09:31,920 --> 00:09:35,519 a hook could be for example every 262 00:09:33,279 --> 00:09:37,839 received packet is a hook whenever the 263 00:09:35,519 --> 00:09:39,200 kernel receives a packet run this bpf 264 00:09:37,839 --> 00:09:41,440 program 265 00:09:39,200 --> 00:09:43,680 and the programs are event driven 266 00:09:41,440 --> 00:09:45,360 and they run to completion there's no 267 00:09:43,680 --> 00:09:49,760 sort of preemption or anything that 268 00:09:45,360 --> 00:09:52,160 happens with it a bpf event occurs the 269 00:09:49,760 --> 00:09:55,600 bpf program runs to completion and at 270 00:09:52,160 --> 00:09:57,760 the end it tells you yes or no i mean 271 00:09:55,600 --> 00:10:00,399 and that boolean instruction i mean that 272 00:09:57,760 --> 00:10:02,720 boolean return value can be used to do 273 00:10:00,399 --> 00:10:05,279 very interesting sort of things 274 00:10:02,720 --> 00:10:08,640 so let's just take a small example of 275 00:10:05,279 --> 00:10:10,480 a very very simple bpf program 276 00:10:08,640 --> 00:10:12,720 that comes from the paper i've just 277 00:10:10,480 --> 00:10:14,160 copied it from the paper and this 278 00:10:12,720 --> 00:10:15,839 program here 279 00:10:14,160 --> 00:10:19,040 is basically 280 00:10:15,839 --> 00:10:20,800 a program that just filters out 281
00:10:19,040 --> 00:10:23,200 ip packets 282 00:10:20,800 --> 00:10:25,200 which go to a particular destination 283 00:10:23,200 --> 00:10:28,640 port so if you see 284 00:10:25,200 --> 00:10:31,040 i'm doing a load half word 285 00:10:28,640 --> 00:10:32,480 at offset 12 in the packet so 286 00:10:31,040 --> 00:10:34,480 every packet when it comes it has a 287 00:10:32,480 --> 00:10:37,600 particular like per the standard it has a 288 00:10:34,480 --> 00:10:39,200 particular format and what you do is you 289 00:10:37,600 --> 00:10:41,440 start at the beginning of the packet and then 290 00:10:39,200 --> 00:10:43,760 go to offset 12 and see 291 00:10:41,440 --> 00:10:46,880 is the ethertype ip 292 00:10:43,760 --> 00:10:48,399 and if it is then jump to l1 otherwise 293 00:10:46,880 --> 00:10:50,240 l5 294 00:10:48,399 --> 00:10:51,680 return 0 i mean i don't want to do anything 295 00:10:50,240 --> 00:10:54,240 with it because it's not ip i 296 00:10:51,680 --> 00:10:56,959 don't know what it is so get out 297 00:10:54,240 --> 00:10:58,880 then you do a load byte at offset 23 298 00:10:56,959 --> 00:11:01,120 of the 299 00:10:58,880 --> 00:11:04,399 whatever packet you had and then see if 300 00:11:01,120 --> 00:11:05,920 it's tcp if it is tcp you go to 301 00:11:04,399 --> 00:11:07,519 this part of the 302 00:11:05,920 --> 00:11:09,519 program otherwise you just drop the 303 00:11:07,519 --> 00:11:12,880 packet on the floor and then you keep on 304 00:11:09,519 --> 00:11:15,440 munching a bunch of these small 305 00:11:12,880 --> 00:11:17,360 instructions that you do with it so here 306 00:11:15,440 --> 00:11:18,399 for example the interesting one is you 307 00:11:17,360 --> 00:11:20,880 load 308 00:11:18,399 --> 00:11:23,279 if this is an ip packet like the ip 309 00:11:20,880 --> 00:11:24,959 header the length of the ip header here 310 00:11:23,279 --> 00:11:27,040 this part is the length of the ip 311 00:11:24,959 -->
00:11:30,160 header i mean it's not too interesting 312 00:11:27,040 --> 00:11:32,000 how exactly this happens but it's 313 00:11:30,160 --> 00:11:33,519 just as a 314 00:11:32,000 --> 00:11:35,519 notion here that these are very simple 315 00:11:33,519 --> 00:11:37,680 instructions that you can run on a packet 316 00:11:35,519 --> 00:11:39,200 receive and then 317 00:11:37,680 --> 00:11:41,360 you know you return true or false and 318 00:11:39,200 --> 00:11:43,760 depending on that result you could have 319 00:11:41,360 --> 00:11:46,079 that if it's true i'm going to copy it 320 00:11:43,760 --> 00:11:48,000 to user space if it's not i'm not going 321 00:11:46,079 --> 00:11:51,600 to copy it to user space so this is 322 00:11:48,000 --> 00:11:53,360 basically what the vm is and 323 00:11:51,600 --> 00:11:54,800 how you write a bpf 324 00:11:53,360 --> 00:11:58,240 program and then load it inside the 325 00:11:54,800 --> 00:11:58,240 kernel and then make it run 326 00:11:58,399 --> 00:12:03,040 uh before we move any further to the e 327 00:12:01,040 --> 00:12:04,959 part of the bpf where does the e come 328 00:12:03,040 --> 00:12:07,680 from we need to look at some of the 329 00:12:04,959 --> 00:12:10,399 ideas that are similar to bpf 330 00:12:07,680 --> 00:12:12,800 this is interesting and of course it's 331 00:12:10,399 --> 00:12:15,760 a little orthogonal but it helps me 332 00:12:12,800 --> 00:12:17,200 to make sense of why we need bpf 333 00:12:15,760 --> 00:12:19,839 in the first place 334 00:12:17,200 --> 00:12:21,760 so think of it as the embedded lua vm in 335 00:12:19,839 --> 00:12:23,680 nginx to modify behavior for example to 336 00:12:21,760 --> 00:12:26,800 check certain headers if there is a 337 00:12:23,680 --> 00:12:29,760 certain header in the http request 338 00:12:26,800 --> 00:12:31,120 uh allow it and if there is none just 339 00:12:29,760 --> 00:12:33,839 drop it on the floor 340 00:12:31,120 -->
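Going back to the packet filter walked through a moment ago (load a half word at offset 12, check the protocol byte at offset 23, work out the IP header length, compare the destination port), here is a toy interpreter for that kind of program. It is a minimal sketch, not the kernel's implementation: the tuple encoding, the instruction names such as `ldh_ind` and `ldx_msh`, and the helpers `run_filter`, `tcp_dst_port_filter` and `make_packet` are all invented for illustration, and jump targets are absolute instruction indices, whereas real BPF uses relative offsets in fixed-size instructions.

```python
ETHERTYPE_IP = 0x0800
IPPROTO_TCP = 6


def load(packet, offset, size):
    """Read a big-endian value; loads outside the packet raise IndexError."""
    if offset < 0 or offset + size > len(packet):
        raise IndexError("load outside the packet")
    return int.from_bytes(packet[offset:offset + size], "big")


def run_filter(program, packet):
    """Run a filter to completion: True means copy the packet, False means drop it."""
    a = 0   # accumulator register
    x = 0   # index register
    pc = 0  # program counter
    try:
        while True:
            op, *args = program[pc]
            pc += 1
            if op == "ldh":        # A = half word at a fixed offset
                a = load(packet, args[0], 2)
            elif op == "ldb":      # A = byte at a fixed offset
                a = load(packet, args[0], 1)
            elif op == "ldh_ind":  # A = half word at X + offset
                a = load(packet, x + args[0], 2)
            elif op == "ldx_msh":  # X = 4 * low nibble of a byte (IP header length)
                x = 4 * (load(packet, args[0], 1) & 0x0F)
            elif op == "jeq":      # branch on A == k
                k, if_true, if_false = args
                pc = if_true if a == k else if_false
            elif op == "jset":     # branch on whether A shares any set bit with k
                k, if_true, if_false = args
                pc = if_true if a & k else if_false
            elif op == "ret":
                return args[0]
            else:
                raise ValueError(f"unknown instruction {op!r}")
    except IndexError:
        return False  # a load past the end of the packet rejects it


def tcp_dst_port_filter(port):
    """Accept only unfragmented IPv4/TCP packets going to `port`."""
    return [
        ("ldh", 12),                   # 0: EtherType field
        ("jeq", ETHERTYPE_IP, 2, 10),  # 1: not IP -> reject
        ("ldb", 23),                   # 2: IP protocol field
        ("jeq", IPPROTO_TCP, 4, 10),   # 3: not TCP -> reject
        ("ldh", 20),                   # 4: flags and fragment offset
        ("jset", 0x1FFF, 10, 6),       # 5: later fragment -> reject
        ("ldx_msh", 14),               # 6: X = IP header length in bytes
        ("ldh_ind", 16),               # 7: TCP destination port at 14 + X + 2
        ("jeq", port, 9, 10),          # 8
        ("ret", True),                 # 9: accept
        ("ret", False),                # 10: reject
    ]


def make_packet(dst_port, proto=IPPROTO_TCP, ethertype=ETHERTYPE_IP):
    """Build a minimal Ethernet + IPv4 + TCP-ports frame for trying the filter."""
    eth = bytes(12) + ethertype.to_bytes(2, "big")
    ip = bytes([0x45, 0]) + bytes(6) + bytes([64, proto]) + bytes(10)
    ports = (12345).to_bytes(2, "big") + dst_port.to_bytes(2, "big")
    return eth + ip + ports


http_filter = tcp_dst_port_filter(80)
print(run_filter(http_filter, make_packet(dst_port=80)))
print(run_filter(http_filter, make_packet(dst_port=22)))
```

The filter itself is just data handed to the interpreter, which is the point of the design: changing what gets copied to user space means loading a different instruction list, not rebuilding the kernel.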
00:12:36,720 now there were two ways of doing this 341 00:12:33,839 --> 00:12:40,000 kind of thing either you could pull the 342 00:12:36,720 --> 00:12:42,240 nginx source code and then 343 00:12:40,000 --> 00:12:44,639 add that piece of code in the c language 344 00:12:42,240 --> 00:12:47,279 whatever language nginx is written in 345 00:12:44,639 --> 00:12:48,880 and then compile it and whenever you 346 00:12:47,279 --> 00:12:52,160 have to do anything you have to modify 347 00:12:48,880 --> 00:12:54,320 that rule you always have to recompile 348 00:12:52,160 --> 00:12:57,279 change and recompile 349 00:12:54,320 --> 00:13:00,000 i think that's a little unwieldy if 350 00:12:57,279 --> 00:13:02,560 you want to do these kinds of things and 351 00:13:00,000 --> 00:13:05,360 having a lua vm embedded inside the 352 00:13:02,560 --> 00:13:08,959 nginx module it helps you a lot 353 00:13:05,360 --> 00:13:11,120 or maybe neovim i don't think if i 354 00:13:08,959 --> 00:13:12,320 want to extend the functionality of my 355 00:13:11,120 --> 00:13:15,120 editor 356 00:13:12,320 --> 00:13:18,079 i would want to pull the source code in 357 00:13:15,120 --> 00:13:20,240 add that bunch of functionality inside 358 00:13:18,079 --> 00:13:23,360 the source code i would rather write a 359 00:13:20,240 --> 00:13:24,880 lua plug-in and then load it in the 360 00:13:23,360 --> 00:13:26,399 so these are some of the 361 00:13:24,880 --> 00:13:29,040 things that tell us that it's 362 00:13:26,399 --> 00:13:30,079 probably easier to use the vm based 363 00:13:29,040 --> 00:13:32,240 approach where you don't have to 364 00:13:30,079 --> 00:13:33,120 recompile the whole program and start from 365 00:13:32,240 --> 00:13:34,560 zero 366 00:13:33,120 --> 00:13:36,240 and 367 00:13:34,560 --> 00:13:38,160 it's working out 368 00:13:36,240 --> 00:13:40,320 great i mean i haven't forgotten about 369 00:13:38,160 --> 00:13:42,240 emacs folks emacs 370 00:13:40,320 -->
00:13:43,199 there are two ways to modify emacs 371 00:13:42,240 --> 00:13:45,360 either 372 00:13:43,199 --> 00:13:49,440 you modify the source code or you write 373 00:13:45,360 --> 00:13:51,279 emacs lisp your init.el whatever 374 00:13:49,440 --> 00:13:52,880 so this sort of notion 375 00:13:51,279 --> 00:13:55,199 is pretty 376 00:13:52,880 --> 00:13:57,519 prevalent these days and it's 377 00:13:55,199 --> 00:13:59,199 pretty easy to use uh 378 00:13:57,519 --> 00:14:00,800 it sometimes becomes a little difficult 379 00:13:59,199 --> 00:14:02,240 to think in the context of the kernel that 380 00:14:00,800 --> 00:14:03,920 okay why are we not changing the kernel 381 00:14:02,240 --> 00:14:06,320 but we are 382 00:14:03,920 --> 00:14:07,519 introducing a certain vm inside the 383 00:14:06,320 --> 00:14:09,360 kernel and 384 00:14:07,519 --> 00:14:11,680 increasing complexity but if you look at 385 00:14:09,360 --> 00:14:14,160 it if you compare it with other things 386 00:14:11,680 --> 00:14:16,399 it's pretty normal that you have your 387 00:14:14,160 --> 00:14:18,560 own plugins and everything written in a 388 00:14:16,399 --> 00:14:20,399 different language than what the 389 00:14:18,560 --> 00:14:22,079 original editor or whatever target you 390 00:14:20,399 --> 00:14:24,079 were planning to use it on 391 00:14:22,079 --> 00:14:25,920 was and it gives you a lot of 392 00:14:24,079 --> 00:14:29,480 flexibility and the turnaround 393 00:14:25,920 --> 00:14:29,480 time is pretty quick 394 00:14:30,639 --> 00:14:35,680 so now let's move on to ebpf so 395 00:14:34,320 --> 00:14:37,760 absolutely we need to introduce 396 00:14:35,680 --> 00:14:40,480 ourselves to the ebpf mascot which is 397 00:14:37,760 --> 00:14:40,480 the cute bee 398 00:14:41,600 --> 00:14:48,560 so where did this extended or e in the 399 00:14:45,839 --> 00:14:52,160 bpf thingy come from alexey 400 00:14:48,560 --> 00:14:54,079 sent a patch close to 2013 2014-ish 401
00:14:52,160 --> 00:14:56,079 where he improved the existing bpf 402 00:14:54,079 --> 00:14:57,440 infrastructure in the kernel the bpf 403 00:14:56,079 --> 00:15:00,160 infrastructure was already in the kernel 404 00:14:57,440 --> 00:15:02,880 and the prime user of that was tcpdump 405 00:15:00,160 --> 00:15:04,320 because that's why it had sort of 406 00:15:02,880 --> 00:15:06,800 gotten into the kernel it started off 407 00:15:04,320 --> 00:15:09,120 from bsd but then it was very soon 408 00:15:06,800 --> 00:15:11,279 ported to linux 409 00:15:09,120 --> 00:15:12,639 and 410 00:15:11,279 --> 00:15:15,360 if you recall 411 00:15:12,639 --> 00:15:18,959 the bsd packet filter 412 00:15:15,360 --> 00:15:21,279 just had hook points inside the network 413 00:15:18,959 --> 00:15:23,360 stack only i mean all the hook points 414 00:15:21,279 --> 00:15:24,480 were embedded there just for the network 415 00:15:23,360 --> 00:15:25,680 stack 416 00:15:24,480 --> 00:15:28,560 nothing else 417 00:15:25,680 --> 00:15:30,720 what alexey's patch did is first of all 418 00:15:28,560 --> 00:15:32,320 improve the vm quality i mean earlier 419 00:15:30,720 --> 00:15:34,000 if you remember we looked at the 420 00:15:32,320 --> 00:15:36,480 instructions there were just a few 421 00:15:34,000 --> 00:15:38,320 instructions a very small set of 422 00:15:36,480 --> 00:15:43,120 registers to work with 423 00:15:38,320 --> 00:15:44,480 this patch vastly improved on that 424 00:15:43,120 --> 00:15:46,800 by making 425 00:15:44,480 --> 00:15:48,399 uh improvements in the number of 426 00:15:46,800 --> 00:15:51,120 registers you have the number of 427 00:15:48,399 --> 00:15:53,839 instructions you could use and 428 00:15:51,120 --> 00:15:56,800 sort of write and overall performance so 429 00:15:53,839 --> 00:15:59,600 this makes it e this adds the 430 00:15:56,800 --> 00:16:02,399 e and then 431 00:15:59,600 --> 00:16:04,560 you have hook points spread throughout 432
00:16:02,399 --> 00:16:06,880 the linux kernel it is not only the 433 00:16:04,560 --> 00:16:10,079 network stack that has the hook points 434 00:16:06,880 --> 00:16:12,160 there is a bunch of other places where 435 00:16:10,079 --> 00:16:14,399 the hook points are 436 00:16:12,160 --> 00:16:14,399 so 437 00:16:14,800 --> 00:16:18,480 yeah by the way if i was not clear 438 00:16:16,959 --> 00:16:21,120 if there are any questions please 439 00:16:18,480 --> 00:16:23,120 feel free to ask them 440 00:16:21,120 --> 00:16:26,000 uh and 441 00:16:23,120 --> 00:16:26,800 now we have to talk a little about 442 00:16:26,000 --> 00:16:29,839 uh 443 00:16:26,800 --> 00:16:32,399 cbpf or the old style bpf which is 444 00:16:29,839 --> 00:16:34,079 also called classical bpf and 445 00:16:32,399 --> 00:16:36,240 extended bpf so there are 446 00:16:34,079 --> 00:16:37,360 differences between cbpf the 447 00:16:36,240 --> 00:16:39,920 classical 448 00:16:37,360 --> 00:16:42,720 style of bpf and the extended style of 449 00:16:39,920 --> 00:16:46,079 bpf cbpf is typically a very 450 00:16:42,720 --> 00:16:48,240 constrained vm which will not let you do 451 00:16:46,079 --> 00:16:50,959 very interesting things 452 00:16:48,240 --> 00:16:52,720 whereas ebpf is the extended one which 453 00:16:50,959 --> 00:16:54,399 has hook points 454 00:16:52,720 --> 00:16:56,639 throughout the kernel you could do a 455 00:16:54,399 --> 00:16:58,320 bunch of things with it earlier you 456 00:16:56,639 --> 00:16:59,759 could only do sort of packet processing 457 00:16:58,320 --> 00:17:02,320 decisions 458 00:16:59,759 --> 00:17:04,559 with the classical bpf but with ebpf you 459 00:17:02,320 --> 00:17:06,319 could do much more for example 460 00:17:04,559 --> 00:17:08,880 you could have a 461 00:17:06,319 --> 00:17:12,400 hook point on every system call that's 462 00:17:08,880 --> 00:17:15,199 executed and what one could do is 463 00:17:12,400 --> 00:17:17,120
whenever a certain program does a system 464 00:17:15,199 --> 00:17:19,439 call you could have a hook point and a 465 00:17:17,120 --> 00:17:21,919 bpf program attached to that hook point 466 00:17:19,439 --> 00:17:23,520 a program did a syscall 467 00:17:21,919 --> 00:17:26,400 since the hook point is attached to the 468 00:17:23,520 --> 00:17:28,640 syscall event it gets fired you check 469 00:17:26,400 --> 00:17:30,400 whether this particular program 470 00:17:28,640 --> 00:17:33,600 is allowed to make the syscall for 471 00:17:30,400 --> 00:17:36,640 example a small application like a cat 472 00:17:33,600 --> 00:17:37,679 like application whose only job is to 473 00:17:36,640 --> 00:17:40,080 get 474 00:17:37,679 --> 00:17:41,520 you know file data which is the read 475 00:17:40,080 --> 00:17:43,039 system call 476 00:17:41,520 --> 00:17:45,280 and dump it on the screen or probably 477 00:17:43,039 --> 00:17:47,440 redirect to a file does not have to do 478 00:17:45,280 --> 00:17:48,480 anything with the network socket i/o 479 00:17:47,440 --> 00:17:49,360 calls 480 00:17:48,480 --> 00:17:51,360 and 481 00:17:49,360 --> 00:17:53,360 if for example for some reason you could 482 00:17:51,360 --> 00:17:55,039 see malicious behavior you could deny 483 00:17:53,360 --> 00:17:57,280 that so 484 00:17:55,039 --> 00:18:00,320 the security aspect of it is very 485 00:17:57,280 --> 00:18:02,000 interesting and you could also have hook 486 00:18:00,320 --> 00:18:04,160 points in 487 00:18:02,000 --> 00:18:06,160 there are certain 488 00:18:04,160 --> 00:18:08,160 points in the kernel called kprobes 489 00:18:06,160 --> 00:18:09,039 kretprobes tracepoints these are 490 00:18:08,160 --> 00:18:11,919 again 491 00:18:09,039 --> 00:18:14,160 uh hooks where when a certain function 492 00:18:11,919 --> 00:18:16,000 executes there's an event 493 00:18:14,160 --> 00:18:18,720 attached to that so you could get a lot 494 00:18:16,000 --> 00:18:19,840 of
telemetry data from it and if 495 00:18:18,720 --> 00:18:22,240 you look very 496 00:18:19,840 --> 00:18:25,679 closely you could see there's this 497 00:18:22,240 --> 00:18:27,440 penguin as well as the windows sign so 498 00:18:25,679 --> 00:18:30,400 ebpf nowadays 499 00:18:27,440 --> 00:18:32,400 runs generally on most of the platforms 500 00:18:30,400 --> 00:18:36,000 that are available and windows was the 501 00:18:32,400 --> 00:18:37,440 most recent addition like 2020ish 502 00:18:36,000 --> 00:18:39,840 was 503 00:18:37,440 --> 00:18:41,600 when it started to come up and 504 00:18:39,840 --> 00:18:43,919 it can run on windows of course you 505 00:18:41,600 --> 00:18:47,200 can't run linux specific things 506 00:18:43,919 --> 00:18:49,520 like linux specific ebpf things on windows but 507 00:18:47,200 --> 00:18:51,760 generally the idea is there the vm is 508 00:18:49,520 --> 00:18:51,760 there 509 00:18:52,320 --> 00:18:57,039 so again the capabilities networking 510 00:18:55,200 --> 00:18:58,400 absolutely it's where it started 511 00:18:57,039 --> 00:18:59,440 from you could do a lot of networking 512 00:18:58,400 --> 00:19:00,640 stuff with 513 00:18:59,440 --> 00:19:02,480 ebpf 514 00:19:00,640 --> 00:19:04,000 you could do a lot of security stuff 515 00:19:02,480 --> 00:19:04,880 like we talked about you could 516 00:19:04,000 --> 00:19:07,679 have a 517 00:19:04,880 --> 00:19:10,400 bpf program which has a hook point and 518 00:19:07,679 --> 00:19:12,320 just looks at a policy of whether a 519 00:19:10,400 --> 00:19:13,919 program is allowed to do a certain 520 00:19:12,320 --> 00:19:15,760 syscall or not 521 00:19:13,919 --> 00:19:18,400 you could do a lot of observability like 522 00:19:15,760 --> 00:19:20,160 if you have those hook points 523 00:19:18,400 --> 00:19:22,400 the tracepoints and the kprobes you 524 00:19:20,160 --> 00:19:24,720 could gather a bunch of 525 00:19:22,400 --> 00:19:26,960
information from the system dynamically 526 00:19:24,720 --> 00:19:28,799 by attaching the ebpf program whenever 527 00:19:26,960 --> 00:19:31,120 the hook fires you get that information 528 00:19:28,799 --> 00:19:32,720 all the processing is done inside the 529 00:19:31,120 --> 00:19:34,240 kernel and then finally you get the 530 00:19:32,720 --> 00:19:35,520 answer now 531 00:19:34,240 --> 00:19:38,240 why this is 532 00:19:35,520 --> 00:19:40,400 fast because since you have a vm inside 533 00:19:38,240 --> 00:19:44,000 the kernel you want to count how many 534 00:19:40,400 --> 00:19:46,960 syscalls happened at a point x or i mean 535 00:19:44,000 --> 00:19:49,600 from a to b time what you could do is 536 00:19:46,960 --> 00:19:52,480 you could write that code and then have 537 00:19:49,600 --> 00:19:54,320 internally the ebpf program gather all 538 00:19:52,480 --> 00:19:56,160 the details do all the number crunching 539 00:19:54,320 --> 00:19:57,919 and then finally when you're done it 540 00:19:56,160 --> 00:19:59,280 just spits out the answer back to the 541 00:19:57,919 --> 00:20:01,360 user space 542 00:19:59,280 --> 00:20:03,919 uh copying data from 543 00:20:01,360 --> 00:20:06,640 across the user space and kernel space 544 00:20:03,919 --> 00:20:08,799 boundaries is costly so we want to do it 545 00:20:06,640 --> 00:20:11,440 as little as possible and want to keep 546 00:20:08,799 --> 00:20:13,200 the number crunching as local to where 547 00:20:11,440 --> 00:20:16,000 the data actually is and we just only 548 00:20:13,200 --> 00:20:16,000 want to look at the data 549 00:20:16,559 --> 00:20:19,840 so 550 00:20:17,520 --> 00:20:21,200 an interesting thing of the ebpf 551 00:20:19,840 --> 00:20:22,880 verifier 552 00:20:21,200 --> 00:20:24,799 and jit is 553 00:20:22,880 --> 00:20:25,840 you can't really willy nilly run any 554 00:20:24,799 --> 00:20:27,360 program 555 00:20:25,840 --> 00:20:30,320 inside the 556 00:20:27,360 --> 00:20:32,000 ebpf vm or the 
linux kernel i mean when 557 00:20:30,320 --> 00:20:32,960 we talk about arbitrary program we talk 558 00:20:32,000 --> 00:20:36,480 about like 559 00:20:32,960 --> 00:20:38,880 we do that that very restricted set of 560 00:20:36,480 --> 00:20:41,200 instructions that were given 561 00:20:38,880 --> 00:20:43,520 that the ebpf vm runs but you really 562 00:20:41,200 --> 00:20:46,000 cannot do anything because 563 00:20:43,520 --> 00:20:48,640 the ebpf program runs to completion now 564 00:20:46,000 --> 00:20:51,120 what if a malicious user put an infinite 565 00:20:48,640 --> 00:20:53,520 loop in that whenever an event fires the 566 00:20:51,120 --> 00:20:55,440 ebpf program is going to go and then 567 00:20:53,520 --> 00:20:58,080 just infinitely loop and since the 568 00:20:55,440 --> 00:21:00,480 program runs to completion you basically 569 00:20:58,080 --> 00:21:03,039 just starve the cpu you did a denial of 570 00:21:00,480 --> 00:21:05,360 service because now there's no way for 571 00:21:03,039 --> 00:21:09,600 the system to yield the program it just 572 00:21:05,360 --> 00:21:10,799 runs in a loop so bpf verifier looks at 573 00:21:09,600 --> 00:21:12,480 whenever you 574 00:21:10,799 --> 00:21:14,159 load 575 00:21:12,480 --> 00:21:16,240 the 576 00:21:14,159 --> 00:21:19,600 ebpf program that you've written inside 577 00:21:16,240 --> 00:21:22,080 the kernel via the bpf system call 578 00:21:19,600 --> 00:21:24,960 it first of all goes to the verifier 579 00:21:22,080 --> 00:21:26,799 the verifier looks at all possible 580 00:21:24,960 --> 00:21:28,799 branches and whatever you have done in 581 00:21:26,799 --> 00:21:30,799 your code and 582 00:21:28,799 --> 00:21:33,200 first of all verifies the sanity of the 583 00:21:30,799 --> 00:21:35,440 program if the program according to the 584 00:21:33,200 --> 00:21:37,520 ebpf verifier is sane then only it 585 00:21:35,440 --> 00:21:40,159 gets handed over to the jit compiler 586 00:21:37,520 --> 00:21:42,480 
which then emits out 587 00:21:40,159 --> 00:21:44,400 native instructions for whatever 588 00:21:42,480 --> 00:21:46,080 architecture you're running on and then 589 00:21:44,400 --> 00:21:47,200 it moves along and does whatever it 590 00:21:46,080 --> 00:21:48,640 needs to do 591 00:21:47,200 --> 00:21:50,000 then you properly attach it to a 592 00:21:48,640 --> 00:21:51,520 particular hook point because just 593 00:21:50,000 --> 00:21:53,440 loading the program in the kernel is not 594 00:21:51,520 --> 00:21:55,120 going to do anything when you load a 595 00:21:53,440 --> 00:21:57,440 program in the kernel you have to attach 596 00:21:55,120 --> 00:21:58,480 it to a certain 597 00:21:57,440 --> 00:22:01,440 point 598 00:21:58,480 --> 00:22:02,799 and uh then you do whatever you want to 599 00:22:01,440 --> 00:22:03,760 do with it so let me look at the 600 00:22:02,799 --> 00:22:06,000 questions 601 00:22:03,760 --> 00:22:07,840 since bpf verifier evolves can we expect 602 00:22:06,000 --> 00:22:12,880 that a bpf program written today for 603 00:22:07,840 --> 00:22:15,760 5.16 will work in a few years oh okay 604 00:22:12,880 --> 00:22:18,720 so i think 605 00:22:15,760 --> 00:22:20,880 if we uh if if we look at it if 606 00:22:18,720 --> 00:22:23,840 we look at the architecture of the bpf 607 00:22:20,880 --> 00:22:24,840 vm it's generally very simple 608 00:22:23,840 --> 00:22:27,760 so 609 00:22:24,840 --> 00:22:29,280 if by what you mean is if you've written 610 00:22:27,760 --> 00:22:31,919 a program today 611 00:22:29,280 --> 00:22:34,159 and you have used 612 00:22:31,919 --> 00:22:36,080 tracepoints and you have not you have 613 00:22:34,159 --> 00:22:37,919 not relied yourself on 614 00:22:36,080 --> 00:22:39,039 api in the kernel that changes for 615 00:22:37,919 --> 00:22:40,960 example 616 00:22:39,039 --> 00:22:42,799 we talked about kprobes and trace 617 00:22:40,960 --> 00:22:44,880 points unfortunately i do not have 618 00:22:42,799 --> 
00:22:46,799 enough time to talk about what those are 619 00:22:44,880 --> 00:22:49,280 but tracepoints have a guarantee of 620 00:22:46,799 --> 00:22:50,880 being more rigid like system calls 621 00:22:49,280 --> 00:22:53,840 they're not going to they're going to 622 00:22:50,880 --> 00:22:55,280 survive multiple kernel versions but 623 00:22:53,840 --> 00:22:57,200 kprobes don't give you that because 624 00:22:55,280 --> 00:22:58,159 that's the internal kernel functions and 625 00:22:57,200 --> 00:23:00,159 whenever the name of the function 626 00:22:58,159 --> 00:23:01,360 changes the kprobe changes so if you if 627 00:23:00,159 --> 00:23:04,240 you you 628 00:23:01,360 --> 00:23:06,080 use or leverage those kind of techniques 629 00:23:04,240 --> 00:23:09,280 where you use tracepoints instead of 630 00:23:06,080 --> 00:23:12,400 kprobes probably uh the ebpf program 631 00:23:09,280 --> 00:23:14,960 should survive multiple kernel versions 632 00:23:12,400 --> 00:23:17,440 can ebpf store state between system 633 00:23:14,960 --> 00:23:20,320 calls okay yeah sure that's an excellent 634 00:23:17,440 --> 00:23:21,360 question i'm going to come to it 635 00:23:20,320 --> 00:23:23,520 so 636 00:23:21,360 --> 00:23:27,039 what are the different types of ebpf 637 00:23:23,520 --> 00:23:30,320 programs recall in the classical bpf 638 00:23:27,039 --> 00:23:32,799 sense it was only sort of relatively 639 00:23:30,320 --> 00:23:34,480 constrained towards the socket filter 640 00:23:32,799 --> 00:23:36,640 or the network stack 641 00:23:34,480 --> 00:23:37,840 but in the extended sense 642 00:23:36,640 --> 00:23:39,520 it's just 643 00:23:37,840 --> 00:23:41,600 spread throughout the kernel so we're 644 00:23:39,520 --> 00:23:43,440 going to talk about a little 645 00:23:41,600 --> 00:23:44,640 interesting of those 646 00:23:43,440 --> 00:23:46,159 some of the interesting this is not an 647 00:23:44,640 --> 00:23:49,120 exhaustive list 648 00:23:46,159 --> 
00:23:50,720 so this BPF_PROG_TYPE_SOCKET_FILTER what 649 00:23:49,120 --> 00:23:51,600 is this this is basically a packet 650 00:23:50,720 --> 00:23:54,559 filter 651 00:23:51,600 --> 00:23:57,120 the thing that the original bpf vm 652 00:23:54,559 --> 00:23:59,120 started with you apply a filter at a 653 00:23:57,120 --> 00:24:01,600 particular socket that okay if this is 654 00:23:59,120 --> 00:24:03,200 this destination port this source 655 00:24:01,600 --> 00:24:05,200 address this 656 00:24:03,200 --> 00:24:06,480 source port blah blah blah 657 00:24:05,200 --> 00:24:08,320 do something 658 00:24:06,480 --> 00:24:11,120 there's another one called 659 00:24:08,320 --> 00:24:13,360 xdp the BPF_PROG_TYPE_XDP this is an 660 00:24:11,120 --> 00:24:15,120 interesting one 661 00:24:13,360 --> 00:24:16,640 i'm sorry this is an interesting one 662 00:24:15,120 --> 00:24:19,919 because 663 00:24:16,640 --> 00:24:22,159 xdp is express data path and this is a 664 00:24:19,919 --> 00:24:24,240 bpf program which is attached at a hook 665 00:24:22,159 --> 00:24:26,400 point which is as close to the device 666 00:24:24,240 --> 00:24:28,000 driver as possible now 667 00:24:26,400 --> 00:24:29,919 we have to talk a little about it so i'm 668 00:24:28,000 --> 00:24:32,000 going to give one minute to it so what 669 00:24:29,919 --> 00:24:34,080 happens here is whenever the packet 670 00:24:32,000 --> 00:24:37,919 comes to your nic card 671 00:24:34,080 --> 00:24:40,960 it first of all gets stored on your nic's 672 00:24:37,919 --> 00:24:42,960 memory and then from the nic and nic is 673 00:24:40,960 --> 00:24:45,120 network interface card and from the 674 00:24:42,960 --> 00:24:47,840 network interface card it gets dma'd 675 00:24:45,120 --> 00:24:50,720 inside the linux kernel main memory and 676 00:24:47,840 --> 00:24:52,480 then you raise a softirq an interrupt 677 00:24:50,720 --> 00:24:54,480 and then you you 678 00:24:52,480 --> 00:24:56,480 tell the linux kernel that hey i 
got a 679 00:24:54,480 --> 00:24:59,120 packet for you start processing it you 680 00:24:56,480 --> 00:25:01,600 know take it into your levels of 681 00:24:59,120 --> 00:25:04,400 bureaucracy of the network stack where 682 00:25:01,600 --> 00:25:06,640 you first of all you know plug the 683 00:25:04,400 --> 00:25:08,799 l2 header then the l3 header and then 684 00:25:06,640 --> 00:25:12,159 you keep on moving it up the stack and 685 00:25:08,799 --> 00:25:13,600 finally you do all sorts of you know 686 00:25:12,159 --> 00:25:15,360 netfilter 687 00:25:13,600 --> 00:25:17,120 that kind of table manipulations and if 688 00:25:15,360 --> 00:25:20,559 all is green you 689 00:25:17,120 --> 00:25:22,640 give it to if it's destined to your 690 00:25:20,559 --> 00:25:23,679 process or if it was destined for some 691 00:25:22,640 --> 00:25:26,000 routing 692 00:25:23,679 --> 00:25:27,039 this xdp is interesting because the 693 00:25:26,000 --> 00:25:28,880 moment 694 00:25:27,039 --> 00:25:30,799 a packet arrives on your nic card and 695 00:25:28,880 --> 00:25:32,799 it's dma'd 696 00:25:30,799 --> 00:25:34,720 you raise a softirq request with the 697 00:25:32,799 --> 00:25:36,640 napi interface like the internal 698 00:25:34,720 --> 00:25:38,799 network stack 699 00:25:36,640 --> 00:25:40,799 you have a capability to run a program 700 00:25:38,799 --> 00:25:42,880 now at that point in time 701 00:25:40,799 --> 00:25:46,559 your packet is just a buffer it's not 702 00:25:42,880 --> 00:25:48,159 even an sk_buff yet it's just a buffer 703 00:25:46,559 --> 00:25:49,840 you can do all sorts of crazy things 704 00:25:48,159 --> 00:25:50,640 there you could redirect a program you 705 00:25:49,840 --> 00:25:53,200 could 706 00:25:50,640 --> 00:25:54,960 drop it on the floor and you'd say i 707 00:25:53,200 --> 00:25:56,480 could do that with iptables why do i 708 00:25:54,960 --> 00:25:58,080 need to do something there 709 00:25:56,480 --> 00:26:01,200 well if you do that 
with iptables it 710 00:25:58,080 --> 00:26:03,279 happens much higher up the lane i mean 711 00:26:01,200 --> 00:26:05,600 it happens like after you've done the l2 712 00:26:03,279 --> 00:26:07,760 and the l3 things so you have to 713 00:26:05,600 --> 00:26:10,240 allocate a lot of space for it for a 714 00:26:07,760 --> 00:26:11,679 packet that you probably did not want so 715 00:26:10,240 --> 00:26:13,679 you could do that 716 00:26:11,679 --> 00:26:15,679 right when you receive that packet and 717 00:26:13,679 --> 00:26:17,840 if you were acting as a router or a 718 00:26:15,679 --> 00:26:19,840 forwarder you could straight away ask 719 00:26:17,840 --> 00:26:22,720 the bpf program 720 00:26:19,840 --> 00:26:24,559 in this case xdp to you know 721 00:26:22,720 --> 00:26:26,240 do something with it forward it via some 722 00:26:24,559 --> 00:26:27,840 other interface 723 00:26:26,240 --> 00:26:30,320 one interesting other thing that could 724 00:26:27,840 --> 00:26:32,159 be done here at this point is you could 725 00:26:30,320 --> 00:26:34,400 do 726 00:26:32,159 --> 00:26:35,919 a kernel bypass like you've got the 727 00:26:34,400 --> 00:26:37,200 buffer just 728 00:26:35,919 --> 00:26:38,320 leave the kernel i don't want to go 729 00:26:37,200 --> 00:26:40,080 through the network stack just straight 730 00:26:38,320 --> 00:26:41,600 away deliver it to the user space and i 731 00:26:40,080 --> 00:26:43,840 want to do whatever i want to do with it 732 00:26:41,600 --> 00:26:46,000 like the raw socket thingy 733 00:26:43,840 --> 00:26:48,720 the kprobe tracepoint sock ops are 734 00:26:46,000 --> 00:26:49,440 similar and there are many more 735 00:26:48,720 --> 00:26:51,840 now 736 00:26:49,440 --> 00:26:54,480 recall the classical bpf was entirely 737 00:26:51,840 --> 00:26:56,480 stateless ebpf as well is stateless you 738 00:26:54,480 --> 00:26:58,880 can't really store state but it has the 739 00:26:56,480 --> 00:27:00,880 capability to access storage 
which is 740 00:26:58,880 --> 00:27:03,039 called bpf maps now 741 00:27:00,880 --> 00:27:05,279 these maps are not not like actually a 742 00:27:03,039 --> 00:27:08,159 key value pair but whenever we say ebpf 743 00:27:05,279 --> 00:27:10,640 map think of it as storage ebpf storage 744 00:27:08,159 --> 00:27:12,880 so a bpf map is basically a generic data 745 00:27:10,640 --> 00:27:15,440 structure that allows you 746 00:27:12,880 --> 00:27:18,000 to pass data to and fro from the user 747 00:27:15,440 --> 00:27:20,799 to the kernel and inside the kernel so 748 00:27:18,000 --> 00:27:22,880 you create a bpf map by using the same 749 00:27:20,799 --> 00:27:24,720 bpf syscall which is a multi-tool which 750 00:27:22,880 --> 00:27:26,480 lets you do a lot of things lets you 751 00:27:24,720 --> 00:27:29,039 load a program attach a program to a 752 00:27:26,480 --> 00:27:31,279 particular hook point create a map 753 00:27:29,039 --> 00:27:32,960 attach a map to a certain place do all 754 00:27:31,279 --> 00:27:34,720 sorts of things with the map and a few 755 00:27:32,960 --> 00:27:36,960 interesting map types are 756 00:27:34,720 --> 00:27:39,679 a map type hash which is actually like a 757 00:27:36,960 --> 00:27:41,840 key value store a map type array which 758 00:27:39,679 --> 00:27:44,320 is just like a normal array a map type 759 00:27:41,840 --> 00:27:46,000 prog array which stores file descriptors 760 00:27:44,320 --> 00:27:48,480 of a bunch of ebpf programs that you've 761 00:27:46,000 --> 00:27:50,240 loaded and a bunch of other maps 762 00:27:48,480 --> 00:27:52,640 recently there was a map type bloom 763 00:27:50,240 --> 00:27:54,799 filter that was added so you could do a 764 00:27:52,640 --> 00:27:56,799 bunch of state state 765 00:27:54,799 --> 00:27:58,000 stuff inside the kernel but this state 766 00:27:56,799 --> 00:28:00,480 is global like 767 00:27:58,000 --> 00:28:02,640 since the bpf program comes in it can 768 00:28:00,480 --> 00:28:04,480 access 
that map do whatever it needs to 769 00:28:02,640 --> 00:28:05,919 do it does not store any state of its 770 00:28:04,480 --> 00:28:07,600 own but it can do whatever it needs to 771 00:28:05,919 --> 00:28:09,360 do in the state and then finally just 772 00:28:07,600 --> 00:28:12,960 die 773 00:28:09,360 --> 00:28:12,960 so i hope that answers the question 774 00:28:14,399 --> 00:28:20,720 can ebpf store state between calls 775 00:28:16,399 --> 00:28:20,720 smoothing the input from pen uh i 776 00:28:20,880 --> 00:28:24,799 i think i i talked about how how does it 777 00:28:23,279 --> 00:28:26,080 store state but 778 00:28:24,799 --> 00:28:29,279 probably we can take that offline 779 00:28:26,080 --> 00:28:32,640 because we're pretty short on time 780 00:28:29,279 --> 00:28:33,679 and yeah what's the conclusion ebpf 781 00:28:32,640 --> 00:28:36,000 programs 782 00:28:33,679 --> 00:28:39,200 are not controlled by the programmer but 783 00:28:36,000 --> 00:28:41,679 they run in response to events ebpf 784 00:28:39,200 --> 00:28:43,600 programs run to completion now an 785 00:28:41,679 --> 00:28:46,080 interesting thing here was i had put a 786 00:28:43,600 --> 00:28:48,399 bracket here and i'd said that they're 787 00:28:46,080 --> 00:28:50,240 not preemptive but then one friend of 788 00:28:48,399 --> 00:28:52,559 mine his name is kartik he's one of the 789 00:28:50,240 --> 00:28:55,200 ebpf developers he sort of sends patches 790 00:28:52,559 --> 00:28:57,279 regularly he corrected me and said that 791 00:28:55,200 --> 00:28:59,919 probably they can be preempted but not 792 00:28:57,279 --> 00:29:01,600 migrated so i i just left that 793 00:28:59,919 --> 00:29:03,120 and that's food for thought i myself 794 00:29:01,600 --> 00:29:04,159 don't know much about it but just just 795 00:29:03,120 --> 00:29:05,919 saying it 796 00:29:04,159 --> 00:29:07,360 running an ebpf program is much safer 797 00:29:05,919 --> 00:29:08,640 than running and maintaining a kernel 798 
00:29:07,360 --> 00:29:11,279 module now 799 00:29:08,640 --> 00:29:12,000 what does that mean if somebody gave me 800 00:29:11,279 --> 00:29:14,399 a 801 00:29:12,000 --> 00:29:16,720 kernel module and said that yeah this 802 00:29:14,399 --> 00:29:18,240 does an amazing solves an amazing 803 00:29:16,720 --> 00:29:19,440 problem 804 00:29:18,240 --> 00:29:22,000 can you run it in your production 805 00:29:19,440 --> 00:29:24,559 environment i'd be very skeptical 806 00:29:22,000 --> 00:29:27,039 because me running a kernel module in 807 00:29:24,559 --> 00:29:29,279 production given by somebody it's it's 808 00:29:27,039 --> 00:29:31,440 it's a little dangerous but on the other 809 00:29:29,279 --> 00:29:33,600 hand if somebody gave me an ebf program 810 00:29:31,440 --> 00:29:35,120 i would very well just try it out and 811 00:29:33,600 --> 00:29:37,360 probably not in production but at least 812 00:29:35,120 --> 00:29:39,279 i'll not be that hesitant because i know 813 00:29:37,360 --> 00:29:42,240 the epp verifier is going to help me and 814 00:29:39,279 --> 00:29:43,760 not cause any problems the entry bar to 815 00:29:42,240 --> 00:29:45,679 get useful information from the kernel 816 00:29:43,760 --> 00:29:47,520 is significantly reduced people like me 817 00:29:45,679 --> 00:29:49,760 who don't know anything about the kernel 818 00:29:47,520 --> 00:29:52,000 or probably are pretending to be kernel 819 00:29:49,760 --> 00:29:53,919 experts can can sort of 820 00:29:52,000 --> 00:29:56,159 know a lot about how the kernel works 821 00:29:53,919 --> 00:29:57,840 and the overhead is just pay as you go 822 00:29:56,159 --> 00:30:00,480 zero cost abstraction style i mean if 823 00:29:57,840 --> 00:30:02,480 you're using it that's the only time 824 00:30:00,480 --> 00:30:04,880 you pay for it it's it's minimal but 825 00:30:02,480 --> 00:30:07,440 still you have to pay for it and 826 00:30:04,880 --> 00:30:08,399 the bpf vm is already there 827 00:30:07,440 --> 
00:30:10,640 so 828 00:30:08,399 --> 00:30:13,840 yeah 829 00:30:10,640 --> 00:30:16,240 thank you and i think we are right on 830 00:30:13,840 --> 00:30:16,240 time 831 00:30:17,600 --> 00:30:21,039 thank you very much 832 00:30:19,039 --> 00:30:22,320 um 833 00:30:21,039 --> 00:30:24,559 luck 834 00:30:22,320 --> 00:30:26,399 for that excellent introduction um 835 00:30:24,559 --> 00:30:28,240 unfortunately we are at the end of the 836 00:30:26,399 --> 00:30:30,399 time slot so we don't have time for more 837 00:30:28,240 --> 00:30:32,480 questions uh if you do have more 838 00:30:30,399 --> 00:30:34,399 questions feel free to uh contact for 839 00:30:32,480 --> 00:30:36,240 luck offline 840 00:30:34,399 --> 00:30:37,200 outside of the session and i'm sure 841 00:30:36,240 --> 00:30:39,919 he'll be 842 00:30:37,200 --> 00:30:41,919 more than happy to keep talking about 843 00:30:39,919 --> 00:30:44,919 ebpf 844 00:30:41,919 --> 00:30:44,919 um