1 00:00:00,000 --> 00:00:08,469 foreign 2 00:00:00,500 --> 00:00:08,469 [Music] 3 00:00:11,900 --> 00:00:15,360 Peter is a senior software engineer at 4 00:00:14,519 --> 00:00:18,800 Jump 5 00:00:15,360 --> 00:00:21,840 um Trading's core engineering division 6 00:00:18,800 --> 00:00:25,140 focusing on the Linux kernel and device 7 00:00:21,840 --> 00:00:29,099 driver development uh in embedded 8 00:00:25,140 --> 00:00:31,380 systems today Peter will uh discuss the 9 00:00:29,099 --> 00:00:33,840 changes being made to the Linux kernel 10 00:00:31,380 --> 00:00:36,860 to handle hardware interrupts on 11 00:00:33,840 --> 00:00:36,860 non-traditional hardware 12 00:00:41,340 --> 00:00:47,040 thank you very much 13 00:00:43,440 --> 00:00:48,300 you guys hear me okay all right 14 00:00:47,040 --> 00:00:50,640 so 15 00:00:48,300 --> 00:00:52,260 um like you said I wanted to give this 16 00:00:50,640 --> 00:00:54,180 talk today to talk about interrupt 17 00:00:52,260 --> 00:00:56,539 balancing it's something that I do kind 18 00:00:54,180 --> 00:00:58,579 of outside of my day job 19 00:00:56,539 --> 00:01:01,079 for fun 20 00:00:58,579 --> 00:01:02,640 and there have been some changes since 21 00:01:01,079 --> 00:01:03,920 the last time I talked about it in New 22 00:01:02,640 --> 00:01:06,240 Zealand 23 00:01:03,920 --> 00:01:08,400 particularly the year-round changes that 24 00:01:06,240 --> 00:01:11,840 Arm and RISC-V and some other 25 00:01:08,400 --> 00:01:11,840 architectures have been driving so 26 00:01:11,939 --> 00:01:15,720 um so first and foremost 27 00:01:13,979 --> 00:01:17,159 um I am not going to attempt to 28 00:01:15,720 --> 00:01:18,479 pronounce 29 00:01:17,159 --> 00:01:19,920 um the names because I don't want to 30 00:01:18,479 --> 00:01:21,840 offend anyone but I did want to 31 00:01:19,920 --> 00:01:24,960 acknowledge 32 00:01:21,840 --> 00:01:27,360 um the the native folks um in this land 33 00:01:24,960 --> 00:01:28,860 and uh want to extend 
that 34 00:01:27,360 --> 00:01:29,939 respect 35 00:01:28,860 --> 00:01:33,080 um 36 00:01:29,939 --> 00:01:35,700 and I also I would really like to 37 00:01:33,080 --> 00:01:37,920 mention the Everything Open 2023 38 00:01:35,700 --> 00:01:39,780 conference the organizers the volunteers 39 00:01:37,920 --> 00:01:41,759 everyone that's working to bring this 40 00:01:39,780 --> 00:01:43,320 back just like it was mentioned this 41 00:01:41,759 --> 00:01:44,820 morning it's really exciting to be back 42 00:01:43,320 --> 00:01:47,579 in person and actually see people in 43 00:01:44,820 --> 00:01:49,140 person get to talk to people have people 44 00:01:47,579 --> 00:01:50,240 throw questions at me that I don't know 45 00:01:49,140 --> 00:01:52,500 how to answer 46 00:01:50,240 --> 00:01:55,140 so it's really really nice to be back in 47 00:01:52,500 --> 00:01:56,700 person and it's also a really big honor 48 00:01:55,140 --> 00:01:59,040 for the program committee to have 49 00:01:56,700 --> 00:02:00,840 invited me to come and speak so thank 50 00:01:59,040 --> 00:02:01,619 you very much to all of you who made 51 00:02:00,840 --> 00:02:04,079 this 52 00:02:01,619 --> 00:02:06,060 go and hopefully make this go well for 53 00:02:04,079 --> 00:02:07,799 the rest of the time 54 00:02:06,060 --> 00:02:10,140 all right so 55 00:02:07,799 --> 00:02:11,420 first and foremost why do I hope that 56 00:02:10,140 --> 00:02:15,599 you're here 57 00:02:11,420 --> 00:02:16,620 so hopefully that first bullet you do 58 00:02:15,599 --> 00:02:18,540 know that this is the interrupt 59 00:02:16,620 --> 00:02:20,640 balancing talk right so 60 00:02:18,540 --> 00:02:22,800 um hopefully it is 61 00:02:20,640 --> 00:02:25,080 um what is interrupt balancing why is it 62 00:02:22,800 --> 00:02:27,599 important why is it something that we 63 00:02:25,080 --> 00:02:29,040 have something that does this in Linux 64 00:02:27,599 --> 00:02:32,819 systems 65 00:02:29,040 --> 00:02:34,680 and also 
get some background as to why 66 00:02:32,819 --> 00:02:35,879 has this only been typically an x86 67 00:02:34,680 --> 00:02:37,980 Focus 68 00:02:35,879 --> 00:02:40,500 right it's really been something that 69 00:02:37,980 --> 00:02:43,379 has targeted those systems it really 70 00:02:40,500 --> 00:02:47,099 hasn't been either a thought for other 71 00:02:43,379 --> 00:02:49,019 architectures or a need and if that was 72 00:02:47,099 --> 00:02:50,340 still the same case today then I 73 00:02:49,019 --> 00:02:52,800 wouldn't be here talking to you about 74 00:02:50,340 --> 00:02:53,760 this so 75 00:02:52,800 --> 00:02:55,620 um 76 00:02:53,760 --> 00:02:56,459 and so yeah understanding why some of 77 00:02:55,620 --> 00:02:58,980 these other instruction set 78 00:02:56,459 --> 00:03:01,940 architectures are driving changes and 79 00:02:58,980 --> 00:03:04,680 kind of causing some some 80 00:03:01,940 --> 00:03:06,000 issues that have to be addressed in 81 00:03:04,680 --> 00:03:09,000 order to make this a more robust 82 00:03:06,000 --> 00:03:09,000 subsystem 83 00:03:09,239 --> 00:03:15,720 and I I figured a really good place to 84 00:03:11,580 --> 00:03:17,940 start is like I mentioned at LCA 2019 I 85 00:03:15,720 --> 00:03:19,500 gave a talk on what exactly is interrupt 86 00:03:17,940 --> 00:03:21,360 balancing so it's called demystifying 87 00:03:19,500 --> 00:03:23,760 interrupt balancing 88 00:03:21,360 --> 00:03:25,200 um so I wanted to kind of recap a little 89 00:03:23,760 --> 00:03:26,700 bit just so that people are kind of up 90 00:03:25,200 --> 00:03:28,860 to speed as to what exactly is this 91 00:03:26,700 --> 00:03:30,540 thing how does it work 92 00:03:28,860 --> 00:03:32,519 um but it'll be a very condensed version 93 00:03:30,540 --> 00:03:35,340 I do have a link in the end of this 94 00:03:32,519 --> 00:03:37,500 presentation to that talk in case you 95 00:03:35,340 --> 00:03:40,379 are interested in actually knowing a 96 00:03:37,500 --> 
00:03:42,239 little bit more about that so 97 00:03:40,379 --> 00:03:43,920 um and then I wanted to kind of end with 98 00:03:42,239 --> 00:03:46,200 what does our current roadmap look like 99 00:03:43,920 --> 00:03:47,700 in the interrupt balancing space 100 00:03:46,200 --> 00:03:50,519 um you know we do have some other things 101 00:03:47,700 --> 00:03:52,080 that are still on the roadmap to finish 102 00:03:50,519 --> 00:03:53,760 um I'm sure that that will change as 103 00:03:52,080 --> 00:03:56,040 more and more platforms come online but 104 00:03:53,760 --> 00:03:58,440 we'll get to that in a little bit 105 00:03:56,040 --> 00:04:00,180 I also wanted to mention 106 00:03:58,440 --> 00:04:02,700 um we'll try to encourage questions to 107 00:04:00,180 --> 00:04:05,159 be at the end just so that we can get 108 00:04:02,700 --> 00:04:06,540 the microphone around but if there is 109 00:04:05,159 --> 00:04:08,519 something that I talk about that you 110 00:04:06,540 --> 00:04:10,319 have no idea what I said I'd rather you 111 00:04:08,519 --> 00:04:11,760 guys let me know so I can address it 112 00:04:10,319 --> 00:04:14,700 please feel free to shout the question 113 00:04:11,760 --> 00:04:16,199 out I'll try to repeat it because if you 114 00:04:14,700 --> 00:04:18,239 don't know something on like slide five 115 00:04:16,199 --> 00:04:21,419 and then it doesn't get explained until 116 00:04:18,239 --> 00:04:23,160 slide 23 then you're lost for what the 117 00:04:21,419 --> 00:04:25,560 18 slides or so 118 00:04:23,160 --> 00:04:28,500 so I'd prefer not to do that 119 00:04:25,560 --> 00:04:30,960 okay so brief who am I 120 00:04:28,500 --> 00:04:33,000 um I'm a kernel developer who maintains 121 00:04:30,960 --> 00:04:34,680 some subsystems I've been working in the 122 00:04:33,000 --> 00:04:37,500 kernel for about 17 years 123 00:04:34,680 --> 00:04:40,440 through various companies most notably 124 00:04:37,500 --> 00:04:42,120 for the length of time in Intel my 
125 00:04:40,440 --> 00:04:43,979 background primarily is in the network 126 00:04:42,120 --> 00:04:47,040 stack so I've worked on 127 00:04:43,979 --> 00:04:49,340 the core stack protocols also the 128 00:04:47,040 --> 00:04:52,020 netdev layer implemented multi-queue 129 00:04:49,340 --> 00:04:54,540 support in the kernel many years ago 130 00:04:52,020 --> 00:04:57,660 worked on device drivers and 131 00:04:54,540 --> 00:05:00,900 specifically more for more importantly 132 00:04:57,660 --> 00:05:03,000 for this talk is the scalability of 133 00:05:00,900 --> 00:05:04,680 network devices let me see if I can turn 134 00:05:03,000 --> 00:05:06,419 on a little laser pointer thing there we 135 00:05:04,680 --> 00:05:09,060 go 136 00:05:06,419 --> 00:05:12,240 um and so this scalability push is what 137 00:05:09,060 --> 00:05:14,699 was really got me into the world of IRQ 138 00:05:12,240 --> 00:05:17,100 balancing so when we have devices that 139 00:05:14,699 --> 00:05:18,600 can generate massive amounts of I/O you 140 00:05:17,100 --> 00:05:20,400 have lots and lots of cores in a system 141 00:05:18,600 --> 00:05:22,740 it's really really bad when all of those 142 00:05:20,400 --> 00:05:24,840 interrupts from all of those queues and 143 00:05:22,740 --> 00:05:26,280 everything fall on one CPU right that 144 00:05:24,840 --> 00:05:27,300 that kind of defeats the whole purpose 145 00:05:26,280 --> 00:05:29,759 so 146 00:05:27,300 --> 00:05:31,919 that was how I got into this and then my 147 00:05:29,759 --> 00:05:34,380 day job when I'm not out fighting crime 148 00:05:31,919 --> 00:05:35,940 with irqbalance I do kernel and 149 00:05:34,380 --> 00:05:39,080 driver work in the high frequency 150 00:05:35,940 --> 00:05:39,080 trading space at Jump 151 00:05:39,240 --> 00:05:43,199 Okay so 152 00:05:40,919 --> 00:05:48,419 long time ago feels like a long time ago 153 00:05:43,199 --> 00:05:49,820 since uh pre-COVID but at LCA 2019 in 154 00:05:48,419 --> 
00:05:52,199 Christchurch we had this presentation 155 00:05:49,820 --> 00:05:55,080 talked about what irqbalance is 156 00:05:52,199 --> 00:05:55,860 how it works why it exists 157 00:05:55,080 --> 00:05:57,840 um 158 00:05:55,860 --> 00:05:59,940 and really what it kind of came down to 159 00:05:57,840 --> 00:06:02,100 was what are the real challenges that 160 00:05:59,940 --> 00:06:03,960 irqbalance has and I'll have a 161 00:06:02,100 --> 00:06:05,100 couple more slides here in a bit I'm 162 00:06:03,960 --> 00:06:07,500 trying to highlight some of the 163 00:06:05,100 --> 00:06:10,020 challenges but namely it's trying to 164 00:06:07,500 --> 00:06:12,419 take a bunch of disjoint information so 165 00:06:10,020 --> 00:06:13,620 information about where is your card 166 00:06:12,419 --> 00:06:15,180 physically plugged in like which 167 00:06:13,620 --> 00:06:16,680 NUMA node is it plugged into so that 168 00:06:15,180 --> 00:06:18,419 means which PCI Express slot it's 169 00:06:16,680 --> 00:06:20,820 plugged into to which CPU socket it's 170 00:06:18,419 --> 00:06:22,199 plugged into that's one aspect and then 171 00:06:20,820 --> 00:06:24,120 the other aspect is where is your 172 00:06:22,199 --> 00:06:26,759 application running which core is it on 173 00:06:24,120 --> 00:06:28,020 which you know NUMA node is it on and 174 00:06:26,759 --> 00:06:29,460 then you have this pesky thing called 175 00:06:28,020 --> 00:06:30,780 the device driver that has an interrupt 176 00:06:29,460 --> 00:06:32,039 that's tied to a queue and maybe you 177 00:06:30,780 --> 00:06:33,419 have multiple queues and where are those 178 00:06:32,039 --> 00:06:35,460 interrupts pinned in terms of their 179 00:06:33,419 --> 00:06:36,960 affinity and it's trying to take all of 180 00:06:35,460 --> 00:06:38,940 this disjoint information from the 181 00:06:36,960 --> 00:06:41,160 kernel that's exposed in various ways 182 00:06:38,940 --> 00:06:44,039 and then trying to make a smart 
decision 183 00:06:41,160 --> 00:06:45,660 as to how to balance that interrupt 184 00:06:44,039 --> 00:06:47,400 and then you have this other pesky thing 185 00:06:45,660 --> 00:06:49,080 of there are other devices in the system 186 00:06:47,400 --> 00:06:51,479 that also have interrupts and they're 187 00:06:49,080 --> 00:06:54,120 also getting routed to CPUs so it's it's 188 00:06:51,479 --> 00:06:55,860 really a very hard problem to solve to 189 00:06:54,120 --> 00:06:57,600 make everyone happy and so usually 190 00:06:55,860 --> 00:06:59,940 anyone that has irqbalance running 191 00:06:57,600 --> 00:07:02,280 on their system knows they're not happy 192 00:06:59,940 --> 00:07:04,280 and many often times that they kill it 193 00:07:02,280 --> 00:07:07,139 and then they run a script that manually 194 00:07:04,280 --> 00:07:08,940 affinitizes interrupts but there are 195 00:07:07,139 --> 00:07:12,180 ways that we have actually extended 196 00:07:08,940 --> 00:07:13,800 irqbalance to try to solve some of that 197 00:07:12,180 --> 00:07:17,400 um and much of that detail is actually 198 00:07:13,800 --> 00:07:18,539 in that Christchurch implementation 199 00:07:17,400 --> 00:07:20,880 um 200 00:07:18,539 --> 00:07:23,099 namely policy scripts where you can 201 00:07:20,880 --> 00:07:25,139 actually inject some user control into 202 00:07:23,099 --> 00:07:27,180 irqbalance saying I want to balance 203 00:07:25,139 --> 00:07:28,620 these interrupts myself I know how 204 00:07:27,180 --> 00:07:30,060 they're supposed to be laid out and then 205 00:07:28,620 --> 00:07:32,039 just do whatever you want with the other 206 00:07:30,060 --> 00:07:34,319 system level interrupts so there is some 207 00:07:32,039 --> 00:07:36,900 flexibility there that we talked about 208 00:07:34,319 --> 00:07:38,220 um but the main focus has been on x86 209 00:07:36,900 --> 00:07:41,160 right 210 00:07:38,220 --> 00:07:43,500 so think multi-core systems that are 211 00:07:41,160 --> 
00:07:46,380 like more enterprise cloud big 212 00:07:43,500 --> 00:07:50,940 workstations lots of I/O big network 213 00:07:46,380 --> 00:07:53,220 cards you know 10 25 40 100 gig cards 214 00:07:50,940 --> 00:07:55,620 um it was really found mostly in the x86 215 00:07:53,220 --> 00:07:57,319 world right this is really the bread and 216 00:07:55,620 --> 00:08:00,120 butter there 217 00:07:57,319 --> 00:08:02,240 and this last part I want you to kind of 218 00:08:00,120 --> 00:08:05,099 file this one in the back your mind 219 00:08:02,240 --> 00:08:07,860 interrupt discovery in terms of figuring 220 00:08:05,099 --> 00:08:10,620 out from the system at the PCI level and 221 00:08:07,860 --> 00:08:13,440 even in some non-PCI level situations 222 00:08:10,620 --> 00:08:15,599 has typically been through ACPI right so 223 00:08:13,440 --> 00:08:18,120 we can go and query you know the device 224 00:08:15,599 --> 00:08:19,560 firmware through the PCI systems and we 225 00:08:18,120 --> 00:08:22,199 can figure out where interrupts are how 226 00:08:19,560 --> 00:08:25,680 many do we have all of that is very 227 00:08:22,199 --> 00:08:28,319 standard and kind of auto-discoverable 228 00:08:25,680 --> 00:08:30,120 and I noticed that my speaker notes here 229 00:08:28,319 --> 00:08:31,919 are super tiny 230 00:08:30,120 --> 00:08:33,719 so I'm trying to think of what's coming 231 00:08:31,919 --> 00:08:34,919 up next without speaking to it before I 232 00:08:33,719 --> 00:08:36,360 actually get there 233 00:08:34,919 --> 00:08:37,979 okay 234 00:08:36,360 --> 00:08:40,020 so let's try to recap some of those 235 00:08:37,979 --> 00:08:43,320 those challenges that I kind of brought 236 00:08:40,020 --> 00:08:44,940 up so I already mentioned that they um 237 00:08:43,320 --> 00:08:47,880 that there's all these unrelated sources 238 00:08:44,940 --> 00:08:49,380 so one is C-state and P-state so 239 00:08:47,880 --> 00:08:52,260 different processor state or different 240 
00:08:49,380 --> 00:08:53,880 socket states in terms of power right 241 00:08:52,260 --> 00:08:56,160 one of the things that irqbalance 242 00:08:53,880 --> 00:08:58,200 tries to take into account is did a 243 00:08:56,160 --> 00:09:00,000 certain CPU core drop into a like a 244 00:08:58,200 --> 00:09:01,920 deeper C-state it wants to be turned 245 00:09:00,000 --> 00:09:03,839 off for whatever reason 246 00:09:01,920 --> 00:09:05,220 you don't want to route an interrupt to 247 00:09:03,839 --> 00:09:07,320 that core because you're going to 248 00:09:05,220 --> 00:09:08,640 immediately wake it up and 249 00:09:07,320 --> 00:09:09,959 if you're also concerned about 250 00:09:08,640 --> 00:09:11,580 performance then you're going to have a 251 00:09:09,959 --> 00:09:13,740 latency hit waiting for that core to 252 00:09:11,580 --> 00:09:16,320 spin back up and blah blah blah so 253 00:09:13,740 --> 00:09:18,540 so we try to get that information out of 254 00:09:16,320 --> 00:09:22,100 um out of sysfs and some of the CPU 255 00:09:18,540 --> 00:09:25,800 governor stuff of the scaling bits 256 00:09:22,100 --> 00:09:27,779 CPU load right this is one of those like 257 00:09:25,800 --> 00:09:29,459 how do you measure CPU load is it you 258 00:09:27,779 --> 00:09:32,100 know wait time is it user time system 259 00:09:29,459 --> 00:09:34,800 time so we take some aspects of CPU load 260 00:09:32,100 --> 00:09:36,240 how busy is this CPU 261 00:09:34,800 --> 00:09:38,399 um and also part of that is how many 262 00:09:36,240 --> 00:09:41,339 interrupts are actually firing on that 263 00:09:38,399 --> 00:09:43,140 um on that particular core 264 00:09:41,339 --> 00:09:45,660 ah I just figured out how to make this 265 00:09:43,140 --> 00:09:47,160 bigger excellent 266 00:09:45,660 --> 00:09:48,480 um so cache locality this is something 267 00:09:47,160 --> 00:09:51,420 else that it tries to take into account 268 00:09:48,480 --> 00:09:52,920 right so if I am trying to 
balance out 269 00:09:51,420 --> 00:09:55,019 to a core 270 00:09:52,920 --> 00:09:57,420 and if I'm on a system that has a 271 00:09:55,019 --> 00:09:59,279 certain hierarchy say like an L1 is you 272 00:09:57,420 --> 00:10:01,500 know per CPU and then an L2 is maybe 273 00:09:59,279 --> 00:10:03,839 shared across you know very tight cores 274 00:10:01,500 --> 00:10:05,580 and then we have an LLC or an L3 cache 275 00:10:03,839 --> 00:10:08,640 that I'm trying to land these things 276 00:10:05,580 --> 00:10:11,580 kind of in that same domain so I have to 277 00:10:08,640 --> 00:10:13,860 take that topology into account as well 278 00:10:11,580 --> 00:10:15,300 NUMA locality so if you are on a 279 00:10:13,860 --> 00:10:16,620 multi-socket system 280 00:10:15,300 --> 00:10:18,839 you know you want to make sure that the 281 00:10:16,620 --> 00:10:20,580 interrupts for say 282 00:10:18,839 --> 00:10:22,500 um like a NIC plugged into network 283 00:10:20,580 --> 00:10:24,180 interface card plugged into a slot in 284 00:10:22,500 --> 00:10:25,500 NUMA node zero you don't want to route 285 00:10:24,180 --> 00:10:26,760 all of its interrupts to NUMA node one 286 00:10:25,500 --> 00:10:29,640 unless you have a really really darn 287 00:10:26,760 --> 00:10:32,640 good reason to do that 288 00:10:29,640 --> 00:10:34,860 um now so there's a new one we'll talk 289 00:10:32,640 --> 00:10:36,660 about it a little bit more later this is 290 00:10:34,860 --> 00:10:38,240 actually since 2019 there's a new 291 00:10:36,660 --> 00:10:40,800 thermal 292 00:10:38,240 --> 00:10:42,959 integration where we take some thermal 293 00:10:40,800 --> 00:10:44,399 characteristics and also make decisions 294 00:10:42,959 --> 00:10:47,100 based on do we want to route an 295 00:10:44,399 --> 00:10:48,779 interrupt to a CPU that's overheating 296 00:10:47,100 --> 00:10:51,140 the answer to that is no we don't want 297 00:10:48,779 --> 00:10:51,140 to do that 298 00:10:51,540 --> 
00:10:55,560 um so we try to steer interrupts towards 299 00:10:53,339 --> 00:10:58,320 or away from CPUs based on all of these 300 00:10:55,560 --> 00:11:00,300 above sources of information now the 301 00:10:58,320 --> 00:11:02,640 problem is uh one thing that I don't 302 00:11:00,300 --> 00:11:04,740 have up here and it's still an issue and 303 00:11:02,640 --> 00:11:06,899 I kind of alluded to it on one of the 304 00:11:04,740 --> 00:11:09,300 previous slides is application-level 305 00:11:06,899 --> 00:11:11,000 stuff we have no insight into what the 306 00:11:09,300 --> 00:11:13,740 scheduler is doing how it's dropping 307 00:11:11,000 --> 00:11:16,320 applications down on CPU cores we have 308 00:11:13,740 --> 00:11:18,360 no idea what they're doing so this is 309 00:11:16,320 --> 00:11:20,459 really like a best effort at the 310 00:11:18,360 --> 00:11:22,500 platform level to make it look like 311 00:11:20,459 --> 00:11:24,779 everything is happy and interrupts are 312 00:11:22,500 --> 00:11:27,540 firing in a spread out way 313 00:11:24,779 --> 00:11:29,160 and if anyone like I had mentioned and I 314 00:11:27,540 --> 00:11:31,200 heard some laughs so I have a feeling 315 00:11:29,160 --> 00:11:32,279 that people have run into this that 316 00:11:31,200 --> 00:11:34,200 doesn't translate into that your 317 00:11:32,279 --> 00:11:36,060 applications actually work better when 318 00:11:34,200 --> 00:11:38,779 this is in in play without any knowledge 319 00:11:36,060 --> 00:11:38,779 about what's going on 320 00:11:38,940 --> 00:11:42,300 okay so I promised some crudely drawn 321 00:11:41,220 --> 00:11:45,360 pictures 322 00:11:42,300 --> 00:11:48,060 um so consider this uh let's see this 323 00:11:45,360 --> 00:11:50,399 you know two CPU systems so two two 324 00:11:48,060 --> 00:11:52,140 NUMA nodes we'll say our first four 325 00:11:50,399 --> 00:11:53,519 physical CPUs are in one socket the 326 00:11:52,140 --> 00:11:55,560 other are in the other socket we 
have 327 00:11:53,519 --> 00:11:56,579 two DDR controllers we have an I/O 328 00:11:55,560 --> 00:11:59,100 controller 329 00:11:56,579 --> 00:11:59,880 we'll just call it PCI Express for now 330 00:11:59,100 --> 00:12:02,339 um 331 00:11:59,880 --> 00:12:05,220 that we have a like a high speed you 332 00:12:02,339 --> 00:12:07,560 know super high-tech card down here 333 00:12:05,220 --> 00:12:09,180 um has a physical PCI connection and 334 00:12:07,560 --> 00:12:11,220 then I'm trying to help illustrate like 335 00:12:09,180 --> 00:12:12,959 this is like worst case scenario kind of 336 00:12:11,220 --> 00:12:15,060 thing and this is what 337 00:12:12,959 --> 00:12:16,920 um is like the bane of irqbalance right 338 00:12:15,060 --> 00:12:18,360 so I've got a NIC 339 00:12:16,920 --> 00:12:21,360 um I have a driver running on it in the 340 00:12:18,360 --> 00:12:24,180 kernel and let's say the kernel driver 341 00:12:21,360 --> 00:12:25,680 happens to allocate memory for its data 342 00:12:24,180 --> 00:12:28,260 structures and its buffers its DMA 343 00:12:25,680 --> 00:12:29,820 regions off of the other NUMA node at 344 00:12:28,260 --> 00:12:31,320 this point we've already lost like in 345 00:12:29,820 --> 00:12:33,000 terms of the performance battle because 346 00:12:31,320 --> 00:12:35,160 every memory access is going to go 347 00:12:33,000 --> 00:12:38,399 across either the QPI or the Infinity 348 00:12:35,160 --> 00:12:41,040 Fabric and that just sucks 349 00:12:38,399 --> 00:12:43,079 so let's make it suck even more so now 350 00:12:41,040 --> 00:12:47,300 I've got an application maybe running up 351 00:12:43,079 --> 00:12:50,459 on what CPU one or no I'm sorry in CPU 2 352 00:12:47,300 --> 00:12:52,620 and maybe the interrupt for this is 353 00:12:50,459 --> 00:12:53,820 actually routed to CPU one and so the 354 00:12:52,620 --> 00:12:56,700 application is actually going to be 355 00:12:53,820 --> 00:12:57,839 woken up by CPU one's interrupt 356 
00:12:56,700 --> 00:13:00,540 um 357 00:12:57,839 --> 00:13:02,700 the exit of the interrupt and then if 358 00:13:00,540 --> 00:13:04,800 the application has its memory malloc'd 359 00:13:02,700 --> 00:13:06,000 off of NUMA node zero so you can see 360 00:13:04,800 --> 00:13:08,760 how this is starting to get very very 361 00:13:06,000 --> 00:13:10,440 bad right this this is bad don't don't 362 00:13:08,760 --> 00:13:10,980 do this 363 00:13:10,440 --> 00:13:13,079 um 364 00:13:10,980 --> 00:13:15,420 so ideally what we're trying to do with 365 00:13:13,079 --> 00:13:17,519 irqbalance and this is again where we 366 00:13:15,420 --> 00:13:19,260 don't have as much insight as we need to 367 00:13:17,519 --> 00:13:20,880 do some of these things but this is 368 00:13:19,260 --> 00:13:22,560 ideally what we want we have an 369 00:13:20,880 --> 00:13:23,880 application running on a CPU let's say 370 00:13:22,560 --> 00:13:26,339 CPU 5 371 00:13:23,880 --> 00:13:28,019 I have a NIC that is on that same 372 00:13:26,339 --> 00:13:29,700 NUMA node it has its memory allocated 373 00:13:28,019 --> 00:13:31,260 from that NUMA node I have an 374 00:13:29,700 --> 00:13:33,060 application that's also running its 375 00:13:31,260 --> 00:13:34,800 buffers on that NUMA node this is all 376 00:13:33,060 --> 00:13:38,100 trying to keep memory bandwidth from 377 00:13:34,800 --> 00:13:40,380 having to burn QPI or Infinity Fabric so 378 00:13:38,100 --> 00:13:42,899 Intel AMD 379 00:13:40,380 --> 00:13:45,440 um resources going between the CPUs so 380 00:13:42,899 --> 00:13:48,000 your latency on memory accesses is lower 381 00:13:45,440 --> 00:13:49,500 and then I want that same interrupt that 382 00:13:48,000 --> 00:13:51,660 is servicing the data coming out of the 383 00:13:49,500 --> 00:13:53,760 NIC to go to that same CPU that that 384 00:13:51,660 --> 00:13:56,040 application is already running on this 385 00:13:53,760 --> 00:13:57,440 is the ideal situation you wanted 
a very 386 00:13:56,040 --> 00:13:59,639 vertical pillar 387 00:13:57,440 --> 00:14:02,160 because this also maintains cache 388 00:13:59,639 --> 00:14:05,040 locality right so that application's 389 00:14:02,160 --> 00:14:06,779 buffers are sitting in CPU 5's cache you 390 00:14:05,040 --> 00:14:08,279 want to wake up that application and 391 00:14:06,779 --> 00:14:10,860 then not have to immediately take a 392 00:14:08,279 --> 00:14:12,959 cache miss block you know take the 393 00:14:10,860 --> 00:14:15,180 exception page fault swap it in and then 394 00:14:12,959 --> 00:14:16,440 at that point again if you're worried 395 00:14:15,180 --> 00:14:18,360 about performance then you've already 396 00:14:16,440 --> 00:14:21,120 lost 397 00:14:18,360 --> 00:14:23,100 right so this sound reasonable 398 00:14:21,120 --> 00:14:23,940 everyone's still kind of awake all right 399 00:14:23,100 --> 00:14:25,980 good 400 00:14:23,940 --> 00:14:27,779 I'm really glad that this wasn't uh like 401 00:14:25,980 --> 00:14:30,600 after lunch right after because that's 402 00:14:27,779 --> 00:14:32,519 usually the food coma hits so 403 00:14:30,600 --> 00:14:35,519 it's I know this is like edge of the 404 00:14:32,519 --> 00:14:37,500 seat topics here so all right so um what 405 00:14:35,519 --> 00:14:39,060 are some other things uh that are worth 406 00:14:37,500 --> 00:14:41,519 noting 407 00:14:39,060 --> 00:14:44,639 um irqbalance uh only balances hardware 408 00:14:41,519 --> 00:14:48,360 interrupts uh not software interrupts so 409 00:14:44,639 --> 00:14:50,160 this is something that I think some 410 00:14:48,360 --> 00:14:52,320 people forget and this was actually a 411 00:14:50,160 --> 00:14:55,620 question that Paul McKenney who just 412 00:14:52,320 --> 00:14:57,779 spoke prior to this in a different room 413 00:14:55,620 --> 00:14:59,940 um Paul actually asked me this in 414 00:14:57,779 --> 00:15:01,860 Christchurch and I thought about it I 415 00:14:59,940 --> 00:15:05,240 
was like ah crap you're right 416 00:15:01,860 --> 00:15:08,880 um but irqbalance does influence 417 00:15:05,240 --> 00:15:11,339 how softirqs can run right so one of 418 00:15:08,880 --> 00:15:14,339 the main uses for like heavy duty 419 00:15:11,339 --> 00:15:17,160 software interrupt for data processing 420 00:15:14,339 --> 00:15:18,800 is in the networking context or the 421 00:15:17,160 --> 00:15:21,540 networking core 422 00:15:18,800 --> 00:15:23,959 so all of networking processing these 423 00:15:21,540 --> 00:15:26,519 days is done out of softirq context 424 00:15:23,959 --> 00:15:29,160 basically it's polling it's known as 425 00:15:26,519 --> 00:15:30,360 NAPI how many of you have heard of 426 00:15:29,160 --> 00:15:32,820 NAPI 427 00:15:30,360 --> 00:15:34,680 okay good good so it's still known as 428 00:15:32,820 --> 00:15:36,060 the new API even though it's what 18 429 00:15:34,680 --> 00:15:38,040 years old 430 00:15:36,060 --> 00:15:39,779 um but it's effectively we take a 431 00:15:38,040 --> 00:15:41,100 hardware interrupt and this is where 432 00:15:39,779 --> 00:15:42,660 irqbalance comes into play we take a 433 00:15:41,100 --> 00:15:44,760 hardware interrupt on a particular 434 00:15:42,660 --> 00:15:48,360 particular CPU 435 00:15:44,760 --> 00:15:51,600 and the NAPI model is to not do any 436 00:15:48,360 --> 00:15:54,120 work in the NIC in processing uh 437 00:15:51,600 --> 00:15:55,680 incoming queues incoming data and we 438 00:15:54,120 --> 00:15:59,579 immediately disable the interrupt on 439 00:15:55,680 --> 00:16:01,440 that CPU and we enter software softirq 440 00:15:59,579 --> 00:16:03,300 polling with NAPI and then the kernel 441 00:16:01,440 --> 00:16:06,060 goes ahead and decides when to start 442 00:16:03,300 --> 00:16:07,800 doling out work well the CPU that we 443 00:16:06,060 --> 00:16:10,260 actually go into softirq context is 444 00:16:07,800 --> 00:16:12,779 the CPU that that software softirq 445 
00:16:10,260 --> 00:16:15,959 interrupt will fire out of the kernel so 446 00:16:12,779 --> 00:16:17,339 if we take the interrupt on CPU 5 and we 447 00:16:15,959 --> 00:16:19,680 go into NAPI context the 448 00:16:17,339 --> 00:16:21,899 softirqs fire on CPU 5 449 00:16:19,680 --> 00:16:23,760 but at that point it's now out of 450 00:16:21,899 --> 00:16:26,760 irqbalance's ability to migrate it to a 451 00:16:23,760 --> 00:16:28,560 different CPU the only way is NAPI to 452 00:16:26,760 --> 00:16:31,019 not have any work to do falls back in 453 00:16:28,560 --> 00:16:33,180 hardware interrupt context we move 454 00:16:31,019 --> 00:16:36,380 things away to a different CPU and then 455 00:16:33,180 --> 00:16:36,380 we start the process over again 456 00:16:38,699 --> 00:16:42,480 um 457 00:16:40,259 --> 00:16:44,459 yeah so I already already mentioned that 458 00:16:42,480 --> 00:16:47,820 irqbalance isn't application 459 00:16:44,459 --> 00:16:49,380 aware to prevent bad bad decisions from 460 00:16:47,820 --> 00:16:50,820 being made 461 00:16:49,380 --> 00:16:52,680 um 462 00:16:50,820 --> 00:16:54,180 it tries to balance the system right so 463 00:16:52,680 --> 00:16:55,920 we're trying to balance 464 00:16:54,180 --> 00:16:58,259 um between CPU load we're trying to 465 00:16:55,920 --> 00:17:00,959 balance between you know one one CPU in 466 00:16:58,259 --> 00:17:02,880 particular isn't taking you know 10x 467 00:17:00,959 --> 00:17:04,079 um the number of interrupts from other 468 00:17:02,880 --> 00:17:07,740 CPUs 469 00:17:04,079 --> 00:17:09,059 so that's really its goal 470 00:17:07,740 --> 00:17:11,339 like I said some applications 471 00:17:09,059 --> 00:17:13,620 performance can still suffer and more 472 00:17:11,339 --> 00:17:14,280 often than not they do 473 00:17:13,620 --> 00:17:15,839 um 474 00:17:14,280 --> 00:17:17,160 and then I had mentioned these policy 475 00:17:15,839 --> 00:17:19,620 scripts this is a way that you can feed 
476 00:17:17,160 --> 00:17:21,480 a script to irqbalance has anyone 477 00:17:19,620 --> 00:17:23,880 used policy scripts in here 478 00:17:21,480 --> 00:17:25,500 even know what they are 479 00:17:23,880 --> 00:17:28,260 it's actually not a terrible option to 480 00:17:25,500 --> 00:17:30,240 irqbalance but it is so I'm sure that 481 00:17:28,260 --> 00:17:32,340 everyone has seen the venerable 482 00:17:30,240 --> 00:17:33,720 set_irq_affinity.sh script out there that you 483 00:17:32,340 --> 00:17:36,179 kill irqbalance then you run 484 00:17:33,720 --> 00:17:37,919 set_irq_affinity and then set all of your 485 00:17:36,179 --> 00:17:40,980 interrupts well this is basically taking 486 00:17:37,919 --> 00:17:43,580 that logic for set_irq_affinity allowing 487 00:17:40,980 --> 00:17:45,960 you to generate a script 488 00:17:43,580 --> 00:17:48,179 in a certain format that you hand 489 00:17:45,960 --> 00:17:50,640 irqbalance and say apply this when you're 490 00:17:48,179 --> 00:17:51,900 actually doing these other things unless 491 00:17:50,640 --> 00:17:55,260 there are some certain things like 492 00:17:51,900 --> 00:17:57,299 thermal events or like CPU hotplug then 493 00:17:55,260 --> 00:18:00,140 that you can't balance onto a socket 494 00:17:57,299 --> 00:18:00,140 that doesn't exist anymore 495 00:18:00,179 --> 00:18:04,500 so I would encourage people to 496 00:18:02,580 --> 00:18:05,820 uh take a peek at that 497 00:18:04,500 --> 00:18:07,740 um 498 00:18:05,820 --> 00:18:08,539 now this is one other point that I want 499 00:18:07,740 --> 00:18:11,340 to make 500 00:18:08,539 --> 00:18:13,919 and this is actually something I I have 501 00:18:11,340 --> 00:18:15,960 contended and advocated for this for a 502 00:18:13,919 --> 00:18:18,059 number of years this last point of 503 00:18:15,960 --> 00:18:20,280 letting drivers control their IRQ 504 00:18:18,059 --> 00:18:21,900 affinity in fact I had a patch set many 505 00:18:20,280 --> 
00:18:25,440 many years ago 506 00:18:21,900 --> 00:18:27,480 that was met with open arms and 507 00:18:25,440 --> 00:18:29,820 wonderful conversation 508 00:18:27,480 --> 00:18:30,720 um that was a pun 509 00:18:29,820 --> 00:18:34,080 um 510 00:18:30,720 --> 00:18:36,419 it was kind of dead in the water and um 511 00:18:34,080 --> 00:18:38,640 this is actually if you have ever seen 512 00:18:36,419 --> 00:18:42,360 in looking at how irqbalance 513 00:18:38,640 --> 00:18:46,020 works with the IRQ affinity hint exposed 514 00:18:42,360 --> 00:18:48,960 in /proc that was the alternative to 515 00:18:46,020 --> 00:18:50,880 this the upstream maintainers said 516 00:18:48,960 --> 00:18:52,559 no you can't have a driver go ahead and 517 00:18:50,880 --> 00:18:55,500 set its own IRQ affinity we don't want 518 00:18:52,559 --> 00:18:57,480 to expose that kind of policy level 519 00:18:55,500 --> 00:18:59,160 stuff in the kernel 520 00:18:57,480 --> 00:19:00,960 um irqbalance should do it so 521 00:18:59,160 --> 00:19:03,299 we went through and exposed a different 522 00:19:00,960 --> 00:19:05,940 thing in the interrupt core up to user 523 00:19:03,299 --> 00:19:08,820 space allowing a driver to say I want 524 00:19:05,940 --> 00:19:10,740 this affinity to be this give it a CPU 525 00:19:08,820 --> 00:19:13,620 map and then irqbalance decides to 526 00:19:10,740 --> 00:19:15,660 either use it or throw it away 527 00:19:13,620 --> 00:19:17,220 um I thought about this a long time and 528 00:19:15,660 --> 00:19:19,260 I always contended that drivers had a 529 00:19:17,220 --> 00:19:22,020 better idea of where am I allocating my 530 00:19:19,260 --> 00:19:23,580 memory uh from what NUMA nodes I should 531 00:19:22,020 --> 00:19:25,799 have a better you know I in terms of the 532 00:19:23,580 --> 00:19:27,299 driver should have a better view on the 533 00:19:25,799 --> 00:19:28,260 world as to where I want these things to 534 00:19:27,299 -->
00:19:29,160 land 535 00:19:28,260 --> 00:19:30,960 um 536 00:19:29,160 --> 00:19:34,320 I don't think that's the case 537 00:19:30,960 --> 00:19:35,940 I think these days that whether we 538 00:19:34,320 --> 00:19:37,140 balance in the driver or from 539 00:19:35,940 --> 00:19:38,280 irqbalance we're going to get the same 540 00:19:37,140 --> 00:19:40,080 result 541 00:19:38,280 --> 00:19:42,720 um a driver can still do a really good job 542 00:19:40,080 --> 00:19:45,000 of managing its own interrupts but that 543 00:19:42,720 --> 00:19:47,280 can totally screw everything else on the 544 00:19:45,000 --> 00:19:49,740 system so 545 00:19:47,280 --> 00:19:51,720 I'll go ahead and uh 546 00:19:49,740 --> 00:19:54,179 admit that that was uh probably not the 547 00:19:51,720 --> 00:19:55,440 right direction so 548 00:19:54,179 --> 00:19:57,059 okay 549 00:19:55,440 --> 00:20:01,260 so what you're probably here to 550 00:19:57,059 --> 00:20:03,120 hear about is what are these uh two ISAs 551 00:20:01,260 --> 00:20:04,860 and what is their role in kind of 552 00:20:03,120 --> 00:20:07,400 driving some of the irqbalance changes 553 00:20:04,860 --> 00:20:09,600 so ARM and RISC-V 554 00:20:07,400 --> 00:20:11,640 obviously ARM has been around for 555 00:20:09,600 --> 00:20:16,200 quite a long time RISC-V 556 00:20:11,640 --> 00:20:18,360 as well but it has certainly heated up 557 00:20:16,200 --> 00:20:20,940 quite a bit in the last few years in the 558 00:20:18,360 --> 00:20:23,460 Linux space itself I've also got some 559 00:20:20,940 --> 00:20:26,039 links I'll talk about them in a bit to 560 00:20:23,460 --> 00:20:27,600 some talks from Drew Fustini who's very 561 00:20:26,039 --> 00:20:29,179 very active he's one of the RISC-V 562 00:20:27,600 --> 00:20:31,320 ambassadors he's a good friend of mine 563 00:20:29,179 --> 00:20:34,640 I've linked to a lot of the talks on 564 00:20:31,320 --> 00:20:34,640 RISC-V and Linux in
general 565 00:20:34,860 --> 00:20:38,220 um 566 00:20:35,880 --> 00:20:40,020 so 567 00:20:38,220 --> 00:20:42,780 when we talk about some ARM and RISC-V 568 00:20:40,020 --> 00:20:46,400 SoCs they do have similar challenges 569 00:20:42,780 --> 00:20:46,400 with irqbalance in terms of 570 00:20:47,340 --> 00:20:51,120 um they themselves have similar 571 00:20:49,500 --> 00:20:53,280 challenges with irqbalance so I 572 00:20:51,120 --> 00:20:55,400 wanted to make sure that 573 00:20:53,280 --> 00:20:58,160 um I call this out just for the sake of 574 00:20:55,400 --> 00:21:01,679 less complexity in the slides 575 00:20:58,160 --> 00:21:02,820 that each ISA has similar issues but 576 00:21:01,679 --> 00:21:05,220 they're presented differently and so 577 00:21:02,820 --> 00:21:08,880 specifically around interrupts the way 578 00:21:05,220 --> 00:21:11,100 that an ARM system actually exposes 579 00:21:08,880 --> 00:21:13,080 interrupts through like the PPIs GIC SPI 580 00:21:11,100 --> 00:21:15,960 type interrupts is very differently 581 00:21:13,080 --> 00:21:18,360 presented than the MCAS from the RISC-V 582 00:21:15,960 --> 00:21:20,880 side but the net effect is the same 583 00:21:18,360 --> 00:21:22,320 right so they present things to the 584 00:21:20,880 --> 00:21:24,240 kernel the kernel implements them down 585 00:21:22,320 --> 00:21:26,580 in the system level and you know how it 586 00:21:24,240 --> 00:21:28,980 discovers things through device tree and 587 00:21:26,580 --> 00:21:32,280 then irqbalance then has the same view 588 00:21:28,980 --> 00:21:34,140 of the world and it breaks in the same 589 00:21:32,280 --> 00:21:36,240 way on both of them 590 00:21:34,140 --> 00:21:38,100 surprise surprise 591 00:21:36,240 --> 00:21:41,520 um so just uh you know if I'm saying 592 00:21:38,100 --> 00:21:43,620 like ARM I mean ARM and RISC-V right 593 00:21:41,520 --> 00:21:46,020 okay so why haven't we seen these 594 00:21:43,620 --> 00:21:48,900
problems up until recently well 595 00:21:46,020 --> 00:21:50,880 hobby boards have had typically really 596 00:21:48,900 --> 00:21:53,640 really wimpy I/O right they didn't need 597 00:21:50,880 --> 00:21:55,919 to balance interrupts well one hobby 598 00:21:53,640 --> 00:21:59,220 boards either had SoCs that might have 599 00:21:55,919 --> 00:22:01,200 had one CPU which I didn't call this out 600 00:21:59,220 --> 00:22:02,700 in this presentation but recently we did 601 00:22:01,200 --> 00:22:04,559 have a patch come into irqbalance 602 00:22:02,700 --> 00:22:06,299 to support irqbalance on single 603 00:22:04,559 --> 00:22:09,480 CPU systems 604 00:22:06,299 --> 00:22:11,520 which means more to detect that we're on 605 00:22:09,480 --> 00:22:13,700 a single CPU system and decide not to 606 00:22:11,520 --> 00:22:13,700 run 607 00:22:14,460 --> 00:22:18,419 but so really 608 00:22:16,919 --> 00:22:20,220 um the hobby boards you know the 609 00:22:18,419 --> 00:22:22,380 Raspberry Pis 610 00:22:20,220 --> 00:22:23,580 um you know the BeagleBones of the 611 00:22:22,380 --> 00:22:26,280 world really just didn't have anything 612 00:22:23,580 --> 00:22:28,200 compelling to worry about the interrupts 613 00:22:26,280 --> 00:22:30,539 from generating enough load 614 00:22:28,200 --> 00:22:31,980 that they actually caused a problem 615 00:22:30,539 --> 00:22:33,600 you know typically 616 00:22:31,980 --> 00:22:37,620 um these things had like a one gigabit 617 00:22:33,600 --> 00:22:39,419 NIC one gigabit Ethernet NIC 618 00:22:37,620 --> 00:22:41,820 one gigabit NICs on those like even 619 00:22:39,419 --> 00:22:43,320 the Realteks you know very very um 620 00:22:41,820 --> 00:22:45,299 inexpensive ones still had enough 621 00:22:43,320 --> 00:22:46,799 hardware offloads in the chip that they 622 00:22:45,299 --> 00:22:49,320 could do things like TCP segmentation 623 00:22:46,799 --> 00:22:51,419 offload or checksum offload and it gave
624 00:22:49,320 --> 00:22:54,240 it plenty of offload capabilities for 625 00:22:51,419 --> 00:22:56,940 one single CPU core for like an ARM or 626 00:22:54,240 --> 00:22:57,659 RISC-V to be able to drive gigabit 627 00:22:56,940 --> 00:23:00,419 um 628 00:22:57,659 --> 00:23:01,860 two a lot of these one gigabit NICs were 629 00:23:00,419 --> 00:23:04,140 also hung off of like the USB bus 630 00:23:01,860 --> 00:23:05,340 internally so there's no way you're ever 631 00:23:04,140 --> 00:23:06,659 going to get to one gigabit on these 632 00:23:05,340 --> 00:23:08,640 things anyways 633 00:23:06,659 --> 00:23:10,860 um so it just wasn't a problem 634 00:23:08,640 --> 00:23:13,200 and then really the other things that 635 00:23:10,860 --> 00:23:14,820 people bought these boards for was all 636 00:23:13,200 --> 00:23:17,640 of these other you know GPIOs the 637 00:23:14,820 --> 00:23:20,340 I2C the SPI interfaces and these 638 00:23:17,640 --> 00:23:22,559 either don't have an interrupt or 639 00:23:20,340 --> 00:23:24,960 you don't care about this interrupt it's 640 00:23:22,559 --> 00:23:26,700 going to fire and you can get to it 641 00:23:24,960 --> 00:23:27,960 milliseconds later and it's not going to 642 00:23:26,700 --> 00:23:30,059 be a big deal 643 00:23:27,960 --> 00:23:32,520 whereas like a 40 gig or 100 gig NIC 644 00:23:30,059 --> 00:23:34,799 you got to get there in like you know a 645 00:23:32,520 --> 00:23:38,880 few microseconds at you know the latest 646 00:23:34,799 --> 00:23:41,580 before you overrun packet buffers 647 00:23:38,880 --> 00:23:43,919 okay so no real need to balance 648 00:23:41,580 --> 00:23:46,500 interrupts well enter 649 00:23:43,919 --> 00:23:48,120 today why are we here 650 00:23:46,500 --> 00:23:49,860 um these SoC boards are getting more 651 00:23:48,120 --> 00:23:52,200 interesting 652 00:23:49,860 --> 00:23:53,100 um uh some of them there was a recent 653 00:23:52,200 --> 00:23:54,780 one 654 00:23:53,100
--> 00:23:55,980 um it's actually an active discussion 655 00:23:54,780 --> 00:23:58,679 right now 656 00:23:55,980 --> 00:24:00,419 um on the irqbalance mailing lists on 657 00:23:58,679 --> 00:24:02,340 some wireless routers that have some 10 658 00:24:00,419 --> 00:24:04,200 gig ports 659 00:24:02,340 --> 00:24:07,559 um well 10 gig definitely needs more 660 00:24:04,200 --> 00:24:09,240 than one CPU even in x86 to drive at 661 00:24:07,559 --> 00:24:12,059 least on the receive side 662 00:24:09,240 --> 00:24:14,400 so um we're now getting into where we 663 00:24:12,059 --> 00:24:16,679 actually have to spread some load 664 00:24:14,400 --> 00:24:18,960 um new Wi-Fi protocols Wi-Fi 5 Wi-Fi 665 00:24:16,679 --> 00:24:20,340 6 are definitely causing 666 00:24:18,960 --> 00:24:23,820 um issues 667 00:24:20,340 --> 00:24:25,860 NVMe support so high speed disk traffic 668 00:24:23,820 --> 00:24:28,559 is coming in 669 00:24:25,860 --> 00:24:30,720 we also have laptops right so people 670 00:24:28,559 --> 00:24:32,700 that have like an M1 or an M2 MacBook in 671 00:24:30,720 --> 00:24:35,520 the room or somewhere here that are 672 00:24:32,700 --> 00:24:37,919 running Linux you know Linus famously 673 00:24:35,520 --> 00:24:40,200 runs Linux on a MacBook with an ARM 674 00:24:37,919 --> 00:24:41,880 chip right that has an NVMe disk that 675 00:24:40,200 --> 00:24:44,159 has Wi-Fi 6 that has the ability 676 00:24:41,880 --> 00:24:46,740 through Thunderbolt to run 40 gigabits 677 00:24:44,159 --> 00:24:49,440 so we are starting to see some things 678 00:24:46,740 --> 00:24:51,960 that are demanding that kind of um that 679 00:24:49,440 --> 00:24:54,059 kind of load and then the other aspect 680 00:24:51,960 --> 00:24:56,700 is the server side right so the server 681 00:24:54,059 --> 00:25:00,480 side has always been very disjoint 682 00:24:56,700 --> 00:25:03,539 um but with people like Ampere you know 683 00:25:00,480 --> 00:25:04,740 Cavium with all the
ThunderX parts 684 00:25:03,539 --> 00:25:07,039 um 685 00:25:04,740 --> 00:25:09,480 you know Amazon with their Graviton 686 00:25:07,039 --> 00:25:12,000 Ampere with Altra I just looked it up Altra 687 00:25:09,480 --> 00:25:14,100 has 128 cores per socket that's 688 00:25:12,000 --> 00:25:15,900 pretty dense and you have a peripheral 689 00:25:14,100 --> 00:25:17,760 that's trying to load up all of those if 690 00:25:15,900 --> 00:25:20,580 you have you know thousands of network 691 00:25:17,760 --> 00:25:21,840 queues attached to an Altra with 128 cores 692 00:25:20,580 --> 00:25:23,820 and you dropped all the interrupts onto 693 00:25:21,840 --> 00:25:25,620 CPU zero that really sucks 694 00:25:23,820 --> 00:25:28,080 right so this is something that we now 695 00:25:25,620 --> 00:25:30,500 have to take pretty seriously 696 00:25:28,080 --> 00:25:30,500 okay 697 00:25:30,779 --> 00:25:34,620 okay so the architecture differences can 698 00:25:33,419 --> 00:25:36,600 be subtle but they're enough to break 699 00:25:34,620 --> 00:25:37,740 things right 700 00:25:36,600 --> 00:25:40,740 um the interrupt types are different 701 00:25:37,740 --> 00:25:41,700 from x86 right so we have these PPIs for 702 00:25:40,740 --> 00:25:43,200 example the private peripheral 703 00:25:41,700 --> 00:25:44,640 interrupts 704 00:25:43,200 --> 00:25:46,559 um they get presented to Linux 705 00:25:44,640 --> 00:25:49,500 differently than say some of the other 706 00:25:46,559 --> 00:25:51,480 like system level interrupts in x86 and 707 00:25:49,500 --> 00:25:53,640 surprise surprise they don't get handled 708 00:25:51,480 --> 00:25:55,980 by irqbalance at all or we handle them 709 00:25:53,640 --> 00:25:58,039 improperly 710 00:25:55,980 --> 00:25:58,039 um 711 00:25:58,980 --> 00:26:03,960 so this is another part and this is 712 00:26:01,620 --> 00:26:05,640 where I said like file away the ACPI 713 00:26:03,960 --> 00:26:07,380 thing in your head 714 00:26:05,640 --> 00:26:08,700 um
device tree and how device tree 715 00:26:07,380 --> 00:26:10,260 actually presents things and how things 716 00:26:08,700 --> 00:26:12,059 can get named and how they actually get 717 00:26:10,260 --> 00:26:14,340 sorted into the tree 718 00:26:12,059 --> 00:26:16,679 can cause a problem with layout or 719 00:26:14,340 --> 00:26:18,840 naming naming has actually been the 720 00:26:16,679 --> 00:26:21,840 bigger problem 721 00:26:18,840 --> 00:26:23,279 or some other BSP difference in how the 722 00:26:21,840 --> 00:26:24,900 blob is actually presented to the kernel 723 00:26:23,279 --> 00:26:25,500 device 724 00:26:24,900 --> 00:26:28,380 um 725 00:26:25,500 --> 00:26:31,020 discovery is much different 726 00:26:28,380 --> 00:26:33,120 with ARM and RISC-V than it is in x86 727 00:26:31,020 --> 00:26:34,919 and so in terms of the like the 728 00:26:33,120 --> 00:26:36,960 topologies and the trees what is 729 00:26:34,919 --> 00:26:40,320 exposed to us to irqbalance from 730 00:26:36,960 --> 00:26:41,640 the kernel is very different 731 00:26:40,320 --> 00:26:43,559 um we also could have different platform 732 00:26:41,640 --> 00:26:45,240 drivers that just don't exist 733 00:26:43,559 --> 00:26:47,460 um right there's the x86 platform 734 00:26:45,240 --> 00:26:51,120 drivers there's the ARM platform drivers 735 00:26:47,460 --> 00:26:53,460 they have their own class of stuff 736 00:26:51,120 --> 00:26:54,960 so much more interesting stuff what 737 00:26:53,460 --> 00:26:56,700 sorts of issues and I'm just watching 738 00:26:54,960 --> 00:26:58,919 the time here 739 00:26:56,700 --> 00:27:00,000 okay so these are actual issues that 740 00:26:58,919 --> 00:27:02,279 came up 741 00:27:00,000 --> 00:27:04,799 um over the last few years it's just a 742 00:27:02,279 --> 00:27:07,140 sample but um handling platform 743 00:27:04,799 --> 00:27:08,640 interrupts so the architecture 744 00:27:07,140 --> 00:27:12,840 differences can be subtle but they
can 745 00:27:08,640 --> 00:27:15,720 be painful and so this was the first 746 00:27:12,840 --> 00:27:16,919 um kind of I'll say like important fix 747 00:27:15,720 --> 00:27:19,140 um 748 00:27:16,919 --> 00:27:19,980 so this commit came in 749 00:27:19,140 --> 00:27:22,559 um 750 00:27:19,980 --> 00:27:25,140 so hung actually works for Huawei and 751 00:27:22,559 --> 00:27:26,940 I'll mention a number of times here in 752 00:27:25,140 --> 00:27:29,100 this presentation actually 753 00:27:26,940 --> 00:27:31,080 so basically what happened was the 754 00:27:29,100 --> 00:27:33,000 PPI interrupts these private peripheral 755 00:27:31,080 --> 00:27:34,500 interrupts on ARM get presented to /proc/interrupts 756 00:27:33,000 --> 00:27:36,720 /proc/interrupts is the 757 00:27:34,500 --> 00:27:38,340 exposure of all of the interrupts and 758 00:27:36,720 --> 00:27:40,500 how they are laid out across the CPUs 759 00:27:38,340 --> 00:27:42,720 that is one source of where we scrape 760 00:27:40,500 --> 00:27:44,279 information with irqbalance to see 761 00:27:42,720 --> 00:27:46,380 what is the kernel doing with interrupts 762 00:27:44,279 --> 00:27:48,179 where are they at 763 00:27:46,380 --> 00:27:50,760 um and so what really ended up 764 00:27:48,179 --> 00:27:52,860 happening was so here on x86 and kind of 765 00:27:50,760 --> 00:27:55,580 trimmed down we have you know these 766 00:27:52,860 --> 00:27:58,799 system level interrupts that are per CPU 767 00:27:55,580 --> 00:28:01,080 like local timer IRQ work 768 00:27:58,799 --> 00:28:02,400 etc etc TLB shootdowns these are 769 00:28:01,080 --> 00:28:04,620 interrupts that irqbalance 770 00:28:02,400 --> 00:28:06,059 ignores because these are per CPU you 771 00:28:04,620 --> 00:28:07,919 don't screw with them they just 772 00:28:06,059 --> 00:28:11,760 kind of hang out there 773 00:28:07,919 --> 00:28:12,779 well in ARM we have similar ones these 774 00:28:11,760 --> 00:28:14,460 IPIs 775
00:28:12,779 --> 00:28:17,640 um inter-processor interrupts these are 776 00:28:14,460 --> 00:28:19,380 PPIs these are also things that we don't 777 00:28:17,640 --> 00:28:20,760 need to mess with the affinity well 778 00:28:19,380 --> 00:28:22,500 irqbalance grabbed them threw them into the 779 00:28:20,760 --> 00:28:26,820 IRQ balancing database and attempted to 780 00:28:22,500 --> 00:28:28,559 do stuff with them and that was bad so 781 00:28:26,820 --> 00:28:31,740 um you know seems like a very simple 782 00:28:28,559 --> 00:28:33,299 thing but it was something that we had 783 00:28:31,740 --> 00:28:36,419 never run into before because no one 784 00:28:33,299 --> 00:28:39,000 cared and so someone all of a sudden had 785 00:28:36,419 --> 00:28:43,100 a board that they cared about and it became an 786 00:28:39,000 --> 00:28:43,100 issue so we got a fix that was great 787 00:28:43,919 --> 00:28:46,860 okay 788 00:28:45,000 --> 00:28:49,080 interrupt classification this is one 789 00:28:46,860 --> 00:28:50,400 that I don't think is going to be put to 790 00:28:49,080 --> 00:28:51,059 bed 791 00:28:50,400 --> 00:28:52,860 um 792 00:28:51,059 --> 00:28:54,620 yet I think we're going to see more 793 00:28:52,860 --> 00:28:57,659 manifestations of this 794 00:28:54,620 --> 00:28:59,760 as more and more systems come online 795 00:28:57,659 --> 00:29:01,919 but one of the things that irqbalance 796 00:28:59,760 --> 00:29:03,840 tries to do is as it's going through 797 00:29:01,919 --> 00:29:05,760 and parsing all of the interrupt 798 00:29:03,840 --> 00:29:07,919 information and how interrupts are 799 00:29:05,760 --> 00:29:09,779 actually presented by devices for which 800 00:29:07,919 --> 00:29:11,340 there is no standard right so if I do 801 00:29:09,779 --> 00:29:13,620 request_irq in the kernel and I 802 00:29:11,340 --> 00:29:16,200 give it an interrupt name 803 00:29:13,620 --> 00:29:17,580 I could name it anything I want more or 804 00:29:16,200 -->
less 805 00:29:17,580 --> 00:29:21,620 but there are certain things that 806 00:29:19,620 --> 00:29:23,700 certain devices export certain 807 00:29:21,620 --> 00:29:26,100 interrupts in certain ways so that 808 00:29:23,700 --> 00:29:27,720 irqbalance traditionally on x86 809 00:29:26,100 --> 00:29:29,760 could figure out what exactly is this 810 00:29:27,720 --> 00:29:31,440 device and what kind of interrupt is it 811 00:29:29,760 --> 00:29:32,039 so the types 812 00:29:31,440 --> 00:29:33,659 um 813 00:29:32,039 --> 00:29:35,120 you know so we have type legacy we have 814 00:29:33,659 --> 00:29:37,679 MSI MSI-X 815 00:29:35,120 --> 00:29:39,240 gigabit Ethernet 816 00:29:37,679 --> 00:29:40,320 yeah we didn't plan ahead on 817 00:29:39,240 --> 00:29:43,320 that one 818 00:29:40,320 --> 00:29:45,720 um eth SCSI which also covers ATA 819 00:29:43,320 --> 00:29:48,179 Serial ATA NVMe so again we didn't 820 00:29:45,720 --> 00:29:49,559 really plan ahead on that one uh virtual 821 00:29:48,179 --> 00:29:51,840 event so if it's like some 822 00:29:49,559 --> 00:29:55,200 interrupt firing off of like a virtual 823 00:29:51,840 --> 00:29:57,960 function routed maybe to a virtual 824 00:29:55,200 --> 00:30:00,360 machine and then the venerable IRQ_OTHER 825 00:29:57,960 --> 00:30:02,159 we couldn't detect what it was 826 00:30:00,360 --> 00:30:03,179 well maybe you can see where this is 827 00:30:02,159 --> 00:30:03,779 going 828 00:30:03,179 --> 00:30:05,940 um 829 00:30:03,779 --> 00:30:08,580 so we had another 830 00:30:05,940 --> 00:30:10,679 um actually this was 831 00:30:08,580 --> 00:30:15,419 yes this was also from Huawei 832 00:30:10,679 --> 00:30:16,799 we had a fix to guess_arm_irq_hints 833 00:30:15,419 --> 00:30:19,380 and 834 00:30:16,799 --> 00:30:21,779 um so what happened was we were going 835 00:30:19,380 --> 00:30:23,159 through and we were trying to match 836 00:30:21,779 --> 00:30:24,360 something we matched it once we
went 837 00:30:23,159 --> 00:30:25,919 through the loop again said we still 838 00:30:24,360 --> 00:30:28,320 don't know what this is and then it got 839 00:30:25,919 --> 00:30:29,760 thrown into IRQ_OTHER and so it was 840 00:30:28,320 --> 00:30:31,320 getting handled like a system level 841 00:30:29,760 --> 00:30:32,880 interrupt and not an Ethernet level 842 00:30:31,320 --> 00:30:34,679 interrupt 843 00:30:32,880 --> 00:30:36,419 um so this was becoming a 844 00:30:34,679 --> 00:30:37,919 problem 845 00:30:36,419 --> 00:30:39,480 and 846 00:30:37,919 --> 00:30:42,600 then we had another one that was very 847 00:30:39,480 --> 00:30:45,539 similar and this one came in from TI 848 00:30:42,600 --> 00:30:47,340 um and so again I think that you might 849 00:30:45,539 --> 00:30:48,779 be sensing a trend of companies that are 850 00:30:47,340 --> 00:30:51,480 actually like more and more interested 851 00:30:48,779 --> 00:30:54,179 in ARM SoCs and they're getting bigger 852 00:30:51,480 --> 00:30:56,159 um so this one was basically another one 853 00:30:54,179 --> 00:30:57,659 where we were parsing some of the things 854 00:30:56,159 --> 00:31:00,419 and we weren't guessing the right class 855 00:30:57,659 --> 00:31:03,840 based on how they were being presented 856 00:31:00,419 --> 00:31:08,100 so we got some fixes in for that 857 00:31:03,840 --> 00:31:10,679 which took us to a third one 858 00:31:08,100 --> 00:31:12,299 and this really goes to like the device 859 00:31:10,679 --> 00:31:13,260 tree level thing we got foiled by a 860 00:31:12,299 --> 00:31:14,279 string 861 00:31:13,260 --> 00:31:15,840 um 862 00:31:14,279 --> 00:31:17,820 and this is not going to be limited to 863 00:31:15,840 --> 00:31:19,200 ARM or RISC-V it just happened to be 864 00:31:17,820 --> 00:31:22,260 that we didn't run into this before in 865 00:31:19,200 --> 00:31:24,960 x86 because typically it's ACPI based 866 00:31:22,260 --> 00:31:26,340 there's now device tree
support and 867 00:31:24,960 --> 00:31:29,460 there's more 868 00:31:26,340 --> 00:31:32,880 momentum behind that for supporting 869 00:31:29,460 --> 00:31:34,860 device tree on x86 so I suspect that we 870 00:31:32,880 --> 00:31:37,440 will see something like this but 871 00:31:34,860 --> 00:31:39,120 something really simple and stupid as to 872 00:31:37,440 --> 00:31:42,659 the way that we parsed it out there was 873 00:31:39,120 --> 00:31:44,640 an extra space and so when irqbalance 874 00:31:42,659 --> 00:31:45,960 went looking for this platform driver it 875 00:31:44,640 --> 00:31:47,940 couldn't find it because there was a 876 00:31:45,960 --> 00:31:50,880 space in the name 877 00:31:47,940 --> 00:31:53,100 or we parsed it out that way so 878 00:31:50,880 --> 00:31:54,360 oops 879 00:31:53,100 --> 00:31:56,220 and then 880 00:31:54,360 --> 00:31:59,399 um this is the last one I'll share and 881 00:31:56,220 --> 00:32:01,980 this is a simple CI/CD issue where if we 882 00:31:59,399 --> 00:32:05,120 remember two slides ago 883 00:32:01,980 --> 00:32:08,279 um there was a fix to 884 00:32:05,120 --> 00:32:11,700 fix the guess IRQ affinity 885 00:32:08,279 --> 00:32:13,860 uh and uh I'm sorry get IRQ class and 886 00:32:11,700 --> 00:32:17,039 try to classify the interrupts the fix 887 00:32:13,860 --> 00:32:19,380 went in and we happily merged it and 888 00:32:17,039 --> 00:32:20,880 then we shortly after had a fix that 889 00:32:19,380 --> 00:32:24,799 says hey you broke everything else 890 00:32:20,880 --> 00:32:24,799 that's non-ARM it won't even compile 891 00:32:25,500 --> 00:32:30,419 so this really highlights uh like a 892 00:32:28,200 --> 00:32:32,220 platform issue right on the 893 00:32:30,419 --> 00:32:33,299 irqbalance side this is now maintained by 894 00:32:32,220 --> 00:32:35,640 three people 895 00:32:33,299 --> 00:32:38,820 in our spare time we don't have formal 896 00:32:35,640 --> 00:32:40,380 CI/CD to do you know pipeline
builds and 897 00:32:38,820 --> 00:32:42,299 testing on multiple architectures this 898 00:32:40,380 --> 00:32:44,600 is a gap this is something that we need 899 00:32:42,299 --> 00:32:44,600 to address 900 00:32:45,299 --> 00:32:48,600 so why all the issues all of a sudden 901 00:32:46,860 --> 00:32:51,000 right so 902 00:32:48,600 --> 00:32:52,559 arguably irqbalance was kind of 903 00:32:51,000 --> 00:32:54,899 in maintenance mode we were not really 904 00:32:52,559 --> 00:32:56,580 doing a lot with it it just worked I 905 00:32:54,899 --> 00:32:58,980 mean for varying degrees of how you want 906 00:32:56,580 --> 00:33:01,919 to say it works it worked 907 00:32:58,980 --> 00:33:04,140 um but uh really the ARM stuff that was 908 00:33:01,919 --> 00:33:04,799 coming in ARM is getting bigger 909 00:33:04,140 --> 00:33:07,380 um 910 00:33:04,799 --> 00:33:08,820 and this is something that I'll make the 911 00:33:07,380 --> 00:33:11,159 statement I'm sure that I'll get 912 00:33:08,820 --> 00:33:12,899 some hate mail from it but um ARM 913 00:33:11,159 --> 00:33:14,640 has struggled right over the years with 914 00:33:12,899 --> 00:33:15,240 a fractured ecosystem 915 00:33:14,640 --> 00:33:18,000 um 916 00:33:15,240 --> 00:33:19,980 so slightly different CPU designs you 917 00:33:18,000 --> 00:33:22,080 know vendor A wants to put in you know 918 00:33:19,980 --> 00:33:24,000 floating point vendor B wants to put in 919 00:33:22,080 --> 00:33:26,399 like this instruction vendor C wants to 920 00:33:24,000 --> 00:33:27,659 not put that instruction in it's put an 921 00:33:26,399 --> 00:33:29,760 immense amount of pressure on the 922 00:33:27,659 --> 00:33:32,640 toolchains right so the compilers if I'm 923 00:33:29,760 --> 00:33:34,320 using like a TI you know BeagleBone blah 924 00:33:32,640 --> 00:33:36,360 blah blah oh did you get GCC version 925 00:33:34,320 --> 00:33:38,760 blah blah blah blah dash blah blah blah 926 00:33:36,360 -->
00:33:40,559 blah oh no you have the not blah blah 927 00:33:38,760 --> 00:33:42,120 blah you have the other blah then yeah 928 00:33:40,559 --> 00:33:45,419 that's not going to work so that's been 929 00:33:42,120 --> 00:33:46,679 something that has really hurt ARM over 930 00:33:45,419 --> 00:33:48,059 the years 931 00:33:46,679 --> 00:33:49,620 and there's been a lot of 932 00:33:48,059 --> 00:33:52,200 improvements to fix this right I'm not 933 00:33:49,620 --> 00:33:54,840 saying that this is an unsolved issue 934 00:33:52,200 --> 00:33:57,179 um but also the BSP support right so you 935 00:33:54,840 --> 00:33:59,039 get a BSP to support a certain board 936 00:33:57,179 --> 00:34:00,480 and everyone has a variant of a board 937 00:33:59,039 --> 00:34:01,980 and then you have a different BSP blob 938 00:34:00,480 --> 00:34:03,659 for that and the kernel has to deal with 939 00:34:01,980 --> 00:34:06,480 that that's been a maintenance 940 00:34:03,659 --> 00:34:09,119 issue right 941 00:34:06,480 --> 00:34:11,040 um so hobby boards have really you know 942 00:34:09,119 --> 00:34:14,639 kind of made this a challenging issue 943 00:34:11,040 --> 00:34:16,139 RISC-V is not immune to this 944 00:34:14,639 --> 00:34:18,480 um 945 00:34:16,139 --> 00:34:20,339 yeah so same vendor different cores 946 00:34:18,480 --> 00:34:21,540 um look at that RISC-V not immune to 947 00:34:20,339 --> 00:34:22,619 this I should remember what I wrote on 948 00:34:21,540 --> 00:34:24,419 the slides 949 00:34:22,619 --> 00:34:27,419 so why all of a sudden did this become 950 00:34:24,419 --> 00:34:29,580 an issue um so server CPUs so Ampere 951 00:34:27,419 --> 00:34:32,280 came in Ampere hired a huge open source 952 00:34:29,580 --> 00:34:34,560 team they're a great group of folks um I 953 00:34:32,280 --> 00:34:36,359 know quite a few of them and one of 954 00:34:34,560 --> 00:34:39,300 their first jobs that they really wanted 955 00:34:36,359 --> 00:34:42,060 to focus on was
cleaning up the 956 00:34:39,300 --> 00:34:44,339 ecosystem for ARM-based CPUs 957 00:34:42,060 --> 00:34:46,379 very very big task I think they've done 958 00:34:44,339 --> 00:34:47,460 a fantastic job helping like shepherd 959 00:34:46,379 --> 00:34:49,800 that along 960 00:34:47,460 --> 00:34:51,780 ARM desktop chips have become more 961 00:34:49,800 --> 00:34:52,379 powerful and 962 00:34:51,780 --> 00:34:54,000 um 963 00:34:52,379 --> 00:34:56,099 yeah so I mentioned this active thread 964 00:34:54,000 --> 00:34:59,760 around OpenWrt this is actually the 965 00:34:56,099 --> 00:35:01,680 thing with the um got it thank you this 966 00:34:59,760 --> 00:35:04,140 is the thing with the 10 gig ports and 967 00:35:01,680 --> 00:35:06,200 Wi-Fi 6 issues 968 00:35:04,140 --> 00:35:06,200 um 969 00:35:06,839 --> 00:35:10,380 so what really happened is all of these 970 00:35:08,820 --> 00:35:11,460 things started showing up and then they 971 00:35:10,380 --> 00:35:13,140 started running into problems with 972 00:35:11,460 --> 00:35:14,700 irqbalance because they kind of 973 00:35:13,140 --> 00:35:16,740 needed it to help the system level stuff 974 00:35:14,700 --> 00:35:19,200 and they found out everything's broken 975 00:35:16,740 --> 00:35:20,700 that's why this all of a sudden caused 976 00:35:19,200 --> 00:35:22,200 some renewed interest 977 00:35:20,700 --> 00:35:24,960 so 978 00:35:22,200 --> 00:35:27,420 um that's a really great thing so one of 979 00:35:24,960 --> 00:35:30,900 the really nice things to come out of 980 00:35:27,420 --> 00:35:32,460 this was Pung Zhao from Huawei who 981 00:35:30,900 --> 00:35:35,099 authored a number of those patches that 982 00:35:32,460 --> 00:35:37,200 I showed on previous slides kind 983 00:35:35,099 --> 00:35:39,359 of came in and said hey to Neil 984 00:35:37,200 --> 00:35:41,099 Horman and myself who co-maintain 985 00:35:39,359 --> 00:35:44,099 this said what do you think about having 986
00:35:41,099 --> 00:35:45,359 me as well we're like that sounds great 987 00:35:44,099 --> 00:35:46,859 um 988 00:35:45,359 --> 00:35:49,500 yeah 989 00:35:46,859 --> 00:35:51,420 um he's been very very active 990 00:35:49,500 --> 00:35:54,180 it's been a really nice addition 991 00:35:51,420 --> 00:35:55,680 because then this is also uh some really 992 00:35:54,180 --> 00:35:57,240 strong ARM expertise coming in and 993 00:35:55,680 --> 00:35:59,599 helping out with the direction of 994 00:35:57,240 --> 00:35:59,599 the project 995 00:35:59,760 --> 00:36:02,820 um and what ended up happening is as 996 00:36:01,500 --> 00:36:04,020 more and more people started like 997 00:36:02,820 --> 00:36:05,640 messing with this and they're like hey 998 00:36:04,020 --> 00:36:07,560 there's this irqbalance-ui thing 999 00:36:05,640 --> 00:36:09,359 too and 1000 00:36:07,560 --> 00:36:11,280 um hey I'll fire it up and holy crap it 1001 00:36:09,359 --> 00:36:13,260 doesn't work it breaks so we got some 1002 00:36:11,280 --> 00:36:14,700 patches so as people were starting to 1003 00:36:13,260 --> 00:36:16,680 come into this from the ARM and the 1004 00:36:14,700 --> 00:36:17,880 RISC-V ecosystems they started realizing 1005 00:36:16,680 --> 00:36:19,560 that there was a lot of stuff that was 1006 00:36:17,880 --> 00:36:20,940 wrong and so all of a sudden this 1007 00:36:19,560 --> 00:36:23,700 renewed interest started and people 1008 00:36:20,940 --> 00:36:26,579 started looking even deeper 1009 00:36:23,700 --> 00:36:28,560 um we also added improved support for 1010 00:36:26,579 --> 00:36:31,140 PCI hotplug so if you hotplug a device 1011 00:36:28,560 --> 00:36:33,180 you want to remove the interrupts from 1012 00:36:31,140 --> 00:36:35,040 the balancing database that was 1013 00:36:33,180 --> 00:36:37,619 something that wasn't there before 1014 00:36:35,040 --> 00:36:39,660 thermal events so this actually came in 1015 00:36:37,619 --> 00:36:40,859 from
Intel 1016 00:36:39,660 --> 00:36:43,760 um 1017 00:36:40,859 --> 00:36:45,900 so they actually hooked into thermald to 1018 00:36:43,760 --> 00:36:47,400 basically feed information to irqbalance 1019 00:36:45,900 --> 00:36:49,560 saying the CPU went offline 1020 00:36:47,400 --> 00:36:51,000 because or the CPU is having problems 1021 00:36:49,560 --> 00:36:53,720 because of thermal issues please route 1022 00:36:51,000 --> 00:36:53,720 interrupts away 1023 00:36:54,060 --> 00:36:57,960 um so the limitation here and this is a 1024 00:36:55,980 --> 00:37:01,859 call to action is that this is only 1025 00:36:57,960 --> 00:37:03,960 available on x86_64 right now but you 1026 00:37:01,859 --> 00:37:07,740 know not to fall back on Paul's thing 1027 00:37:03,960 --> 00:37:10,079 before uh today patches are welcome 1028 00:37:07,740 --> 00:37:12,000 um but this is I think a good place that 1029 00:37:10,079 --> 00:37:14,960 we could probably improve on in the Arm 1030 00:37:12,000 --> 00:37:14,960 and RISC-V space 1031 00:37:15,540 --> 00:37:20,760 okay so looking to the future what does 1032 00:37:18,420 --> 00:37:22,440 the roadmap look like 1033 00:37:20,760 --> 00:37:24,900 um irqbalance I think still has a lot of 1034 00:37:22,440 --> 00:37:26,940 work to 1035 00:37:24,900 --> 00:37:28,500 have better flexibility for device-tree- 1036 00:37:26,940 --> 00:37:30,240 defined resources 1037 00:37:28,500 --> 00:37:32,760 right so this is something that 1038 00:37:30,240 --> 00:37:35,880 has been baked into the irqbalance core 1039 00:37:32,760 --> 00:37:38,220 is we can go through the ACPI mechanisms 1040 00:37:35,880 --> 00:37:39,359 that are in the kernel having something 1041 00:37:38,220 --> 00:37:41,160 that we can actually hook into device 1042 00:37:39,359 --> 00:37:43,040 tree and be a little bit more aware of 1043 00:37:41,160 --> 00:37:45,119 how things are laid out 1044 00:37:43,040 --> 00:37:48,599 because one thing that we have not run 1045
00:37:45,119 --> 00:37:50,040 into yet but I do fear is MSI-X 1046 00:37:48,599 --> 00:37:51,960 resources so these are things that are 1047 00:37:50,040 --> 00:37:55,380 typically dynamically allocated on 1048 00:37:51,960 --> 00:37:57,540 a NIC or an NVMe drive on some kind 1049 00:37:55,380 --> 00:37:58,859 of peripheral these things get auto- 1050 00:37:57,540 --> 00:38:00,480 discovered and then we try to bring them 1051 00:37:58,859 --> 00:38:03,240 up well these things are actually tied 1052 00:38:00,480 --> 00:38:06,119 to a device through a device tree should 1053 00:38:03,240 --> 00:38:10,339 we be balancing them I don't 1054 00:38:06,119 --> 00:38:10,339 know so we'll I guess time will tell 1055 00:38:10,619 --> 00:38:13,380 um 1056 00:38:11,520 --> 00:38:14,880 yeah the 1057 00:38:13,380 --> 00:38:17,760 um RISC-V platform interrupts that are 1058 00:38:14,880 --> 00:38:19,859 similar to the Arm PPI SPI SGI things 1059 00:38:17,760 --> 00:38:20,940 you know we need to comprehend that for 1060 00:38:19,859 --> 00:38:25,079 RISC-V 1061 00:38:20,940 --> 00:38:27,720 CXL devices so um if any of you are 1062 00:38:25,079 --> 00:38:31,079 familiar with Compute Express Link it's 1063 00:38:27,720 --> 00:38:33,839 a bus architecture thing that's 1064 00:38:31,079 --> 00:38:35,400 basically on top of physical PCI Express 1065 00:38:33,839 --> 00:38:38,579 um just started showing up in 1066 00:38:35,400 --> 00:38:39,420 Sapphire Rapids and AMD Genoa 1067 00:38:38,579 --> 00:38:41,940 um 1068 00:38:39,420 --> 00:38:43,920 there's a whole class of things and I 1069 00:38:41,940 --> 00:38:46,320 kind of also do a lot of stuff with the 1070 00:38:43,920 --> 00:38:48,420 CXL Consortium so if you have no idea 1071 00:38:46,320 --> 00:38:51,000 what type 1 type 2 type 3 devices 1072 00:38:48,420 --> 00:38:53,460 in CXL are see me outside I'll be happy 1073 00:38:51,000 --> 00:38:54,359 to give you a brain dump on CXL but it's
1074 00:38:53,460 --> 00:38:56,579 basically 1075 00:38:54,359 --> 00:38:59,280 um like an I/O memory and cache-level 1076 00:38:56,579 --> 00:39:00,180 protocol we have no idea how to deal 1077 00:38:59,280 --> 00:39:04,760 with that 1078 00:39:00,180 --> 00:39:04,760 and the Arm server proliferation 1079 00:39:05,220 --> 00:39:10,380 so Linux development I mentioned that 1080 00:39:07,800 --> 00:39:12,180 Drew Fustini has a lot of talks on the 1081 00:39:10,380 --> 00:39:13,980 state of RISC-V they're just marching 1082 00:39:12,180 --> 00:39:16,680 along really fast I have a feeling that 1083 00:39:13,980 --> 00:39:19,680 we're going to start seeing some like 1084 00:39:16,680 --> 00:39:21,420 public server-class RISC-V CPUs here before 1085 00:39:19,680 --> 00:39:23,880 too long it's probably going to put some 1086 00:39:21,420 --> 00:39:25,079 pressure on irqbalance 1087 00:39:23,880 --> 00:39:28,980 um 1088 00:39:25,079 --> 00:39:31,200 so here's our GitHub project page please 1089 00:39:28,980 --> 00:39:33,180 go ahead and feel free to poke around 1090 00:39:31,200 --> 00:39:35,339 there that's where the code lives there 1091 00:39:33,180 --> 00:39:36,839 is an old PR for CI/CD that just kind of 1092 00:39:35,339 --> 00:39:38,820 rotted 1093 00:39:36,839 --> 00:39:41,220 um you know we need to kind of resurrect 1094 00:39:38,820 --> 00:39:43,680 that and get it going please feel free 1095 00:39:41,220 --> 00:39:45,300 to contribute always happy to 1096 00:39:43,680 --> 00:39:46,020 bring in some people 1097 00:39:45,300 --> 00:39:48,260 um 1098 00:39:46,020 --> 00:39:51,180 so recap if you fell asleep 1099 00:39:48,260 --> 00:39:52,619 balancing is still hard it's hard to get 1100 00:39:51,180 --> 00:39:54,599 it right 1101 00:39:52,619 --> 00:39:56,940 um but Arm and RISC-V 1102 00:39:54,599 --> 00:39:58,920 uh getting into the mix is 1103 00:39:56,940 --> 00:40:00,720 really helping us out 1104 00:39:58,920 --> 00:40:02,220 um we're back
to pretty much active 1105 00:40:00,720 --> 00:40:03,420 development but we have a lot of work to 1106 00:40:02,220 --> 00:40:04,619 do 1107 00:40:03,420 --> 00:40:06,300 um and so just 1108 00:40:04,619 --> 00:40:07,920 I promise there are some links here 1109 00:40:06,300 --> 00:40:10,560 in the presentation they'll be posted 1110 00:40:07,920 --> 00:40:12,900 later but uh yeah go ahead and put me on 1111 00:40:10,560 --> 00:40:15,119 the spot and see if I can not answer 1112 00:40:12,900 --> 00:40:17,420 something that you guys have appreciate 1113 00:40:15,119 --> 00:40:17,420 your time 1114 00:40:19,400 --> 00:40:23,240 does anyone have any questions 1115 00:40:23,579 --> 00:40:28,020 oh boy here we go 1116 00:40:26,520 --> 00:40:29,460 um do you talk to the device tree people 1117 00:40:28,020 --> 00:40:31,200 much about putting things in the device 1118 00:40:29,460 --> 00:40:32,820 tree to help you know what sort of 1119 00:40:31,200 --> 00:40:36,140 interrupts are do we talk to the device 1120 00:40:32,820 --> 00:40:38,220 tree people no that they love 1121 00:40:36,140 --> 00:40:39,900 standardizing things these days so I'm 1122 00:40:38,220 --> 00:40:42,119 sure they'd love some um some more 1123 00:40:39,900 --> 00:40:44,760 standardization you know if it had 1124 00:40:42,119 --> 00:40:47,520 helped um it could be absolutely uh used 1125 00:40:44,760 --> 00:40:48,839 to be a lot better at being useful no 1126 00:40:47,520 --> 00:40:50,880 that's a pretty thing that's a 1127 00:40:48,839 --> 00:40:52,800 great idea I you know we 1128 00:40:50,880 --> 00:40:53,940 haven't really thought about that but 1129 00:40:52,800 --> 00:40:56,040 that's a really great idea especially 1130 00:40:53,940 --> 00:40:59,780 with x86 now kind of getting more into 1131 00:40:56,040 --> 00:40:59,780 the formal device tree area 1132 00:41:01,040 --> 00:41:06,180 so the question is if the x86 device 1133 00:41:04,560 --> 00:41:08,640 tree stuff is server-class
chips the 1134 00:41:06,180 --> 00:41:10,980 short answer is yes 1135 00:41:08,640 --> 00:41:13,500 um really for like appliance-level stuff 1136 00:41:10,980 --> 00:41:15,720 so Red Hat is looking to add device tree 1137 00:41:13,500 --> 00:41:19,280 support for x86 into RHEL into 1138 00:41:15,720 --> 00:41:19,280 pre-boot and all that stuff 1139 00:41:20,220 --> 00:41:24,720 yeah yeah device tree did exist for like 1140 00:41:22,560 --> 00:41:25,440 uh what's the the MinnowBoard 1141 00:41:24,720 --> 00:41:28,740 um 1142 00:41:25,440 --> 00:41:32,220 x86-based hobby board and um 1143 00:41:28,740 --> 00:41:35,540 yeah yeah AMD yep yep 1144 00:41:32,220 --> 00:41:35,540 other questions 1145 00:41:35,820 --> 00:41:39,599 I will be here all day today I 1146 00:41:38,160 --> 00:41:41,880 unfortunately have to go back to the 1147 00:41:39,599 --> 00:41:42,780 States tomorrow so if you need me grab 1148 00:41:41,880 --> 00:41:45,119 me 1149 00:41:42,780 --> 00:41:46,800 um wandering around or contact me 1150 00:41:45,119 --> 00:41:48,300 through 1151 00:41:46,800 --> 00:41:51,839 whatever contact info is 1152 00:41:48,300 --> 00:41:52,980 actually exposed on the site so 1153 00:41:51,839 --> 00:41:54,420 all right well thank you for your time 1154 00:41:52,980 --> 00:41:57,840 enjoy the rest of the conference 1155 00:41:54,420 --> 00:41:57,840 [Applause]