1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:15,759 --> 00:00:20,960 good afternoon welcome back 3 00:00:17,920 --> 00:00:22,320 um next up we have uh 4 00:00:20,960 --> 00:00:25,680 project uh 5 00:00:22,320 --> 00:00:28,560 pratik rajesh sampat and gautham shenoy 6 00:00:25,680 --> 00:00:31,359 presenting on cpu namespaces a mechanism 7 00:00:28,560 --> 00:00:33,040 to isolate cpu topology information 8 00:00:31,359 --> 00:00:35,120 in the linux kernel 9 00:00:33,040 --> 00:00:38,079 pratik is a linux kernel developer at 10 00:00:35,120 --> 00:00:39,680 ibm who works primarily with schedulers 11 00:00:38,079 --> 00:00:42,079 and energy management but also on 12 00:00:39,680 --> 00:00:43,520 container primitives and gautham is a 13 00:00:42,079 --> 00:00:45,440 kernel programmer who has been working 14 00:00:43,520 --> 00:00:47,200 on the kernel since 2006 and has 15 00:00:45,440 --> 00:00:49,840 contributed to hotplug process 16 00:00:47,200 --> 00:00:52,079 scheduling rcu lockdep cpuidle and 17 00:00:49,840 --> 00:00:54,000 other things 18 00:00:52,079 --> 00:00:57,559 so please welcome them as they present 19 00:00:54,000 --> 00:00:57,559 on cpu namespaces 20 00:00:58,000 --> 00:01:02,079 thank you thank you uh hello everyone i 21 00:01:00,320 --> 00:01:04,479 am pratik sampat and i work for the 22 00:01:02,079 --> 00:01:06,479 linux technology center at ibm uh with 23 00:01:04,479 --> 00:01:08,560 me i have gautham shenoy who works with 24 00:01:06,479 --> 00:01:10,479 the kernel team at amd and today we're 25 00:01:08,560 --> 00:01:13,040 here to talk about the isolation of cpu 26 00:01:10,479 --> 00:01:15,439 information in the linux kernel 27 00:01:13,040 --> 00:01:17,680 the agenda for our talk really is uh 28 00:01:15,439 --> 00:01:19,759 that we'll first highlight the question 29 00:01:17,680 --> 00:01:21,600 of what is the purpose of sys and proc 30 00:01:19,759 --> 00:01:23,200 in the world of containers and are there 31 00:01:21,600 --> 
00:01:25,520 any implications of exposing this 32 00:01:23,200 --> 00:01:27,280 information next we'll talk about some 33 00:01:25,520 --> 00:01:29,280 existing solutions that help mitigate 34 00:01:27,280 --> 00:01:31,119 this problem we then present our stab 35 00:01:29,280 --> 00:01:33,360 at a solution uh called the cpu 36 00:01:31,119 --> 00:01:34,960 namespace uh some experiments around it 37 00:01:33,360 --> 00:01:36,799 and finally we pose questions around the 38 00:01:34,960 --> 00:01:38,320 challenges that exist in this 39 00:01:36,799 --> 00:01:40,400 space 40 00:01:38,320 --> 00:01:43,040 all right motivation 41 00:01:40,400 --> 00:01:44,960 so a short introduction to sysfs is that 42 00:01:43,040 --> 00:01:46,880 it's a file system that is used to 43 00:01:44,960 --> 00:01:48,799 expose kernel information to user space 44 00:01:46,880 --> 00:01:50,720 and often it's about kernel subsystems 45 00:01:48,799 --> 00:01:52,079 and the hardware that runs it 46 00:01:50,720 --> 00:01:54,079 applications today can look at this 47 00:01:52,079 --> 00:01:55,680 interface to determine system resources 48 00:01:54,079 --> 00:01:57,439 and they can make decisions based on 49 00:01:55,680 --> 00:01:59,439 that such as allocating resources like 50 00:01:57,439 --> 00:02:01,200 memory and spawning threads 51 00:01:59,439 --> 00:02:02,719 take an example of containers 52 00:02:01,200 --> 00:02:06,000 containerized applications can be 53 00:02:02,719 --> 00:02:07,119 restricted by cgroups cpuset however 54 00:02:06,000 --> 00:02:08,560 they can be unaware of these 55 00:02:07,119 --> 00:02:10,879 restrictions and can still look at 56 00:02:08,560 --> 00:02:13,120 traditional interfaces of sys and proc 57 00:02:10,879 --> 00:02:14,319 to make decisions uh about their 58 00:02:13,120 --> 00:02:16,959 applications 59 00:02:14,319 --> 00:02:18,879 and now this problem is not only uh 60 00:02:16,959 --> 00:02:21,280 constrained to the realm of containers 61 
00:02:18,879 --> 00:02:23,360 but uh but outside containers as well if 62 00:02:21,280 --> 00:02:25,920 you take say taskset like the system 63 00:02:23,360 --> 00:02:28,319 call sched_setaffinity you can use it 64 00:02:25,920 --> 00:02:30,160 to set cpu restrictions on applications 65 00:02:28,319 --> 00:02:32,800 but applications may still choose to 66 00:02:30,160 --> 00:02:34,400 make their decisions based on looking at 67 00:02:32,800 --> 00:02:36,400 sys and proc 68 00:02:34,400 --> 00:02:37,920 so the question that arises from all of 69 00:02:36,400 --> 00:02:39,599 this is that 70 00:02:37,920 --> 00:02:41,840 what does sys and proc really mean in 71 00:02:39,599 --> 00:02:43,920 the context of container restriction and 72 00:02:41,840 --> 00:02:45,599 second what are really the implications 73 00:02:43,920 --> 00:02:47,440 of exposing this information when 74 00:02:45,599 --> 00:02:49,120 applications can only use a very small 75 00:02:47,440 --> 00:02:50,959 subset of it 76 00:02:49,120 --> 00:02:52,640 for the scope of this discussion we will 77 00:02:50,959 --> 00:02:54,480 stick to the implications of cpu 78 00:02:52,640 --> 00:02:56,400 resources however we will also 79 00:02:54,480 --> 00:02:58,480 periodically call out other potential 80 00:02:56,400 --> 00:03:01,120 problems such as memory as well 81 00:02:58,480 --> 00:03:02,480 so coming to the first implication uh 82 00:03:01,120 --> 00:03:04,800 restrictions can be set through 83 00:03:02,480 --> 00:03:06,480 interfaces like cgroup cpuset as i 84 00:03:04,800 --> 00:03:08,480 have already said however there are 85 00:03:06,480 --> 00:03:10,400 multiple interfaces which display cpu 86 00:03:08,480 --> 00:03:12,159 information and these control and 87 00:03:10,400 --> 00:03:13,280 display interfaces may be disjoint from 88 00:03:12,159 --> 00:03:15,840 one another 89 00:03:13,280 --> 00:03:18,800 for example uh you can see that if a 90 00:03:15,840 --> 00:03:20,800 host system has 
about 128 cpus and it 91 00:03:18,800 --> 00:03:23,200 spawns a container with its restriction 92 00:03:20,800 --> 00:03:24,799 set to 32 to 35 cpus now this 93 00:03:23,200 --> 00:03:27,519 restriction is set through the cgroupfs 94 00:03:24,799 --> 00:03:29,920 itself and uh and viewing this 95 00:03:27,519 --> 00:03:33,760 interface within the container uh yields 96 00:03:29,920 --> 00:03:36,319 uh the right 32 to 35 cpus a task within 97 00:03:33,760 --> 00:03:38,959 this container if it calls sched_getaffinity 98 00:03:36,319 --> 00:03:41,040 on it uh also gets the right 99 00:03:38,959 --> 00:03:42,799 uh view or the right view of the 100 00:03:41,040 --> 00:03:45,519 restriction that has been applied to it 101 00:03:42,799 --> 00:03:46,879 however if it looks at /proc/stat 102 00:03:45,519 --> 00:03:49,360 which is generally for the load 103 00:03:46,879 --> 00:03:51,519 statistics and applications like uh top 104 00:03:49,360 --> 00:03:54,560 and htop use it you will see uh you 105 00:03:51,519 --> 00:03:56,640 will see data about all the uh cpus and 106 00:03:54,560 --> 00:03:58,319 similarly for sysfs as well uh you if 107 00:03:56,640 --> 00:04:00,959 you look at /sys/devices/system/cpu 108 00:03:58,319 --> 00:04:03,840 which is normally used by nproc and lscpu 109 00:04:00,959 --> 00:04:06,799 kind of utilities and they also show 110 00:04:03,840 --> 00:04:09,439 that uh that you know 128 cpus 111 00:04:06,799 --> 00:04:11,920 exist on this system 112 00:04:09,439 --> 00:04:14,000 another uh you know we will talk about 113 00:04:11,920 --> 00:04:16,959 the potential impact in terms of 114 00:04:14,000 --> 00:04:18,799 performance of of what this really means 115 00:04:16,959 --> 00:04:19,919 in that in that context in in the coming 116 00:04:18,799 --> 00:04:22,400 few slides 117 00:04:19,919 --> 00:04:24,720 uh you know next coming to the to another 118 00:04:22,400 --> 00:04:25,520 implication of fair use 119 00:04:24,720 --> 
00:04:27,600 so 120 00:04:25,520 --> 00:04:29,600 when an application that is running 121 00:04:27,600 --> 00:04:31,680 within a container is restricted to some 122 00:04:29,600 --> 00:04:34,400 resources should it still be able to 123 00:04:31,680 --> 00:04:35,280 see the entire system resources 124 00:04:34,400 --> 00:04:38,080 and 125 00:04:35,280 --> 00:04:40,000 if so can this knowledge be potentially 126 00:04:38,080 --> 00:04:42,720 misused in any way 127 00:04:40,000 --> 00:04:44,320 now could there be a you know user that can 128 00:04:42,720 --> 00:04:46,880 schedule workloads across sockets in 129 00:04:44,320 --> 00:04:48,479 such a way that the bus is now flooded 130 00:04:46,880 --> 00:04:51,199 and other container tenants now 131 00:04:48,479 --> 00:04:53,280 experience a slowdown or could a user 132 00:04:51,199 --> 00:04:56,000 now identify its vicinity from a 133 00:04:53,280 --> 00:04:57,680 peripheral such as a gpu and schedule 134 00:04:56,000 --> 00:05:00,080 themselves closer to get an undue 135 00:04:57,680 --> 00:05:02,400 latency advantage you know compared to the 136 00:05:00,080 --> 00:05:04,240 rest of the you know uh compared to the 137 00:05:02,400 --> 00:05:05,840 rest of the workloads 138 00:05:04,240 --> 00:05:09,280 so 139 00:05:05,840 --> 00:05:11,120 uh so are there any solutions that exist 140 00:05:09,280 --> 00:05:13,840 today that can help mitigate this 141 00:05:11,120 --> 00:05:16,320 problem of inconsistency of information 142 00:05:13,840 --> 00:05:17,919 well turns out there are there are a few 143 00:05:16,320 --> 00:05:20,320 and we've highlighted about 144 00:05:17,919 --> 00:05:22,880 three of them so one of the the most 145 00:05:20,320 --> 00:05:25,680 obvious solutions uh out there is hey 146 00:05:22,880 --> 00:05:27,440 just look at cgroupfs so if you need 147 00:05:25,680 --> 00:05:29,280 information about the restrictions that 148 00:05:27,440 --> 00:05:30,639 are imposed on you just look at the 149 
00:05:29,280 --> 00:05:32,320 interface that imposes those 150 00:05:30,639 --> 00:05:34,800 restrictions in the first place and 151 00:05:32,320 --> 00:05:36,639 that's a very strong argument to make 152 00:05:34,800 --> 00:05:39,280 however a lot of these applications 153 00:05:36,639 --> 00:05:41,680 legacy or otherwise rely on traditional 154 00:05:39,280 --> 00:05:43,919 interfaces like sys and proc and uh and 155 00:05:41,680 --> 00:05:46,479 asking all these players to 156 00:05:43,919 --> 00:05:48,720 really change the way they uh they 157 00:05:46,479 --> 00:05:50,479 they interpret information it may be a 158 00:05:48,720 --> 00:05:53,360 difficult task 159 00:05:50,479 --> 00:05:55,440 another problem is also that you know uh 160 00:05:53,360 --> 00:05:57,919 now these applications also need to 161 00:05:55,440 --> 00:06:00,240 interpret newer concepts like uh period 162 00:05:57,919 --> 00:06:02,639 and quota which are cpu restrictions in 163 00:06:00,240 --> 00:06:04,880 time uh and but they're they're used to 164 00:06:02,639 --> 00:06:07,280 interpreting the information in terms of 165 00:06:04,880 --> 00:06:09,120 space and in terms of cpus and threads 166 00:06:07,280 --> 00:06:11,280 uh and how how does this information 167 00:06:09,120 --> 00:06:14,240 really need to be interpreted is is is a 168 00:06:11,280 --> 00:06:15,120 is a difficult uh uh thing to 169 00:06:14,240 --> 00:06:17,600 say 170 00:06:15,120 --> 00:06:19,360 uh and lastly while cgroups can be used 171 00:06:17,600 --> 00:06:21,840 to extract information 172 00:06:19,360 --> 00:06:23,840 in principle they are a control 173 00:06:21,840 --> 00:06:25,919 mechanism for the host rather than a 174 00:06:23,840 --> 00:06:28,639 display interface inside the container 175 00:06:25,919 --> 00:06:31,680 and and there really doesn't uh 176 00:06:28,639 --> 00:06:33,199 there really isn't anything that stops 177 00:06:31,680 --> 00:06:35,600 them from changing this interface in the 178 
00:06:33,199 --> 00:06:37,440 future and uh and maybe like a cgroup 179 00:06:35,600 --> 00:06:39,680 v3 comes out and then the applications 180 00:06:37,440 --> 00:06:42,160 have to go uh change the way they look 181 00:06:39,680 --> 00:06:45,120 at information all over again 182 00:06:42,160 --> 00:06:46,720 um uh so there are some user space 183 00:06:45,120 --> 00:06:48,319 innovations in this area as well uh 184 00:06:46,720 --> 00:06:51,440 there's a user space solution called lxcfs 185 00:06:48,319 --> 00:06:54,080 uh by the by the lxc uh linux 186 00:06:51,440 --> 00:06:55,039 containers and basically what they do is 187 00:06:54,080 --> 00:06:56,800 uh 188 00:06:55,039 --> 00:06:59,199 it's a user space file system that bind 189 00:06:56,800 --> 00:07:01,199 mounts over the existing sys and procfs 190 00:06:59,199 --> 00:07:02,639 and they basically provide consistent 191 00:07:01,199 --> 00:07:03,840 information in accordance to whatever 192 00:07:02,639 --> 00:07:05,520 restrictions that were set on these 193 00:07:03,840 --> 00:07:07,360 applications they're essentially trying 194 00:07:05,520 --> 00:07:08,479 to fake this information in in a way 195 00:07:07,360 --> 00:07:10,240 right 196 00:07:08,479 --> 00:07:12,080 the advantage of a user space 197 00:07:10,240 --> 00:07:15,199 innovation like this is that it's a very 198 00:07:12,080 --> 00:07:16,880 light easy to use user space tool and we 199 00:07:15,199 --> 00:07:19,039 have we have seen a few articles uh 200 00:07:16,880 --> 00:07:20,960 where it's currently being used uh or 201 00:07:19,039 --> 00:07:23,520 currently being described by uh by 202 00:07:20,960 --> 00:07:26,720 google and uh alibaba as well 203 00:07:23,520 --> 00:07:28,720 uh and and if a user space innovation uh 204 00:07:26,720 --> 00:07:30,400 exists out there for solving this 205 00:07:28,720 --> 00:07:32,560 problem this kind of bolsters our 206 00:07:30,400 --> 00:07:34,639 confidence that this problem uh you 
know 207 00:07:32,560 --> 00:07:36,639 exists in the first place 208 00:07:34,639 --> 00:07:38,639 but a problem with user space 209 00:07:36,639 --> 00:07:40,639 innovations is that they need explicit 210 00:07:38,639 --> 00:07:42,160 setup for applications and they need 211 00:07:40,639 --> 00:07:43,919 explicit setup for applications that 212 00:07:42,160 --> 00:07:47,120 experience this effect of incorrect 213 00:07:43,919 --> 00:07:48,879 information in the first place so uh so 214 00:07:47,120 --> 00:07:50,000 a lot of times inconsistent information 215 00:07:48,879 --> 00:07:51,680 is not going to really crash your 216 00:07:50,000 --> 00:07:54,319 application it's rather going to give 217 00:07:51,680 --> 00:07:55,599 you a a performance hit or it's going to 218 00:07:54,319 --> 00:07:57,680 give you 219 00:07:55,599 --> 00:07:59,520 give you a problem that that is somewhat 220 00:07:57,680 --> 00:08:01,039 of a silent failure and first 221 00:07:59,520 --> 00:08:02,319 identifying that you are facing this 222 00:08:01,039 --> 00:08:04,639 problem in the first place and then 223 00:08:02,319 --> 00:08:06,879 identifying that lxcfs is the right 224 00:08:04,639 --> 00:08:08,479 solution for you can be a bit of a 225 00:08:06,879 --> 00:08:11,440 hassle 226 00:08:08,479 --> 00:08:13,680 uh lastly uh there is 227 00:08:11,440 --> 00:08:15,680 an effort uh for an rfc patch set which 228 00:08:13,680 --> 00:08:18,400 was posted a few months ago which added 229 00:08:15,680 --> 00:08:20,720 uh a proc slash self slash meminfo as a 230 00:08:18,400 --> 00:08:22,639 new interface which respects the cgroup 231 00:08:20,720 --> 00:08:25,280 restrictions and provides 232 00:08:22,639 --> 00:08:27,199 this consistent information uh 233 00:08:25,280 --> 00:08:30,080 for for applications to see 234 00:08:27,199 --> 00:08:32,240 and this is a very good solution as it 235 00:08:30,080 --> 00:08:34,320 introduces standards for exposing and 236 00:08:32,240 
--> 00:08:36,000 interpreting this information it is also 237 00:08:34,320 --> 00:08:37,919 a very clean interface as it does not 238 00:08:36,000 --> 00:08:40,800 meddle with the current established 239 00:08:37,919 --> 00:08:42,800 sys and proc interfaces and 240 00:08:40,800 --> 00:08:45,120 and it kind of 241 00:08:42,800 --> 00:08:46,000 keeps the sanity of of those interfaces 242 00:08:45,120 --> 00:08:48,560 intact 243 00:08:46,000 --> 00:08:51,040 however um just like cgroupfs it also 244 00:08:48,560 --> 00:08:52,640 faces the problem that a 245 00:08:51,040 --> 00:08:53,600 lot of applications still look at sys 246 00:08:52,640 --> 00:08:55,839 and proc 247 00:08:53,600 --> 00:08:57,680 instead of cgroup and the motivation to 248 00:08:55,839 --> 00:08:59,920 use yet another interface may be a 249 00:08:57,680 --> 00:09:01,760 little bit low uh there was a comment 250 00:08:59,920 --> 00:09:03,680 which kind of highlighted the same in 251 00:09:01,760 --> 00:09:05,839 the same patch set as well uh which which 252 00:09:03,680 --> 00:09:08,000 i have you know linked down 253 00:09:05,839 --> 00:09:09,120 on these things 254 00:09:08,000 --> 00:09:11,519 so 255 00:09:09,120 --> 00:09:13,760 what if we could have a solution that 256 00:09:11,519 --> 00:09:15,680 kind of took some good points from all 257 00:09:13,760 --> 00:09:18,399 the three solutions and and built 258 00:09:15,680 --> 00:09:20,000 something around it uh so what if we 259 00:09:18,399 --> 00:09:22,320 could present information about our 260 00:09:20,000 --> 00:09:24,000 restrictions uh we could present them 261 00:09:22,320 --> 00:09:25,760 consistently with all these existing 262 00:09:24,000 --> 00:09:28,080 interfaces of sys and proc 263 00:09:25,760 --> 00:09:29,760 and we could introduce standardization 264 00:09:28,080 --> 00:09:32,880 of how to expose and interpret the 265 00:09:29,760 --> 00:09:35,600 solution by an in-kernel solution 266 00:09:32,880 --> 
00:09:39,279 introducing cpu namespace 267 00:09:35,600 --> 00:09:41,279 so we basically try to isolate cpu 268 00:09:39,279 --> 00:09:42,800 information for each task based on 269 00:09:41,279 --> 00:09:44,880 whatever restrictions that have been 270 00:09:42,800 --> 00:09:46,880 applied to it via 271 00:09:44,880 --> 00:09:49,360 the control and 272 00:09:46,880 --> 00:09:51,680 display interfaces 273 00:09:49,360 --> 00:09:53,600 and and and we make that consistent with 274 00:09:51,680 --> 00:09:55,600 the rest of the interfaces as well so 275 00:09:53,600 --> 00:09:56,959 basically we isolate the cpu information 276 00:09:55,600 --> 00:09:58,640 by maintaining a 277 00:09:56,959 --> 00:10:01,040 translation of these cpus within the 278 00:09:58,640 --> 00:10:03,200 namespace and from the namespace cpu 279 00:10:01,040 --> 00:10:04,959 to a logical cpu we have scrambled the 280 00:10:03,200 --> 00:10:06,399 cpus to help mitigate the problems of 281 00:10:04,959 --> 00:10:07,760 the knowledge of topology that we have 282 00:10:06,399 --> 00:10:10,079 highlighted in one of our previous 283 00:10:07,760 --> 00:10:12,720 slides i'm not an expert here but if it 284 00:10:10,079 --> 00:10:14,000 helps that uh then then that's great uh 285 00:10:12,720 --> 00:10:15,920 in our proof of concept we have 286 00:10:14,000 --> 00:10:18,399 scrambled this map just to show that a 287 00:10:15,920 --> 00:10:20,399 discontiguous cpu numbering works just 288 00:10:18,399 --> 00:10:22,000 right out of the box and if the map if 289 00:10:20,399 --> 00:10:23,360 the mapping does matter we could 290 00:10:22,000 --> 00:10:25,040 change it to something like a zero to 291 00:10:23,360 --> 00:10:26,480 n map or even make it a one to one 292 00:10:25,040 --> 00:10:28,399 mapping right 293 00:10:26,480 --> 00:10:31,040 but in summary just like a pid 294 00:10:28,399 --> 00:10:33,360 namespace when you view a task's cpu resources 295 00:10:31,040 --> 00:10:35,279 within a 
cpu namespace we can get an 296 00:10:33,360 --> 00:10:36,640 isolated view of the restrictions that 297 00:10:35,279 --> 00:10:38,800 it is bound by 298 00:10:36,640 --> 00:10:40,079 and viewing the task's resources outside 299 00:10:38,800 --> 00:10:42,000 this namespace will yield the 300 00:10:40,079 --> 00:10:44,800 translations of these 301 00:10:42,000 --> 00:10:46,399 so for for example 302 00:10:44,800 --> 00:10:48,160 if you look if you look at the diagram 303 00:10:46,399 --> 00:10:50,240 which is without the cpu namespace which is 304 00:10:48,160 --> 00:10:52,160 basically uh the one that the diagram 305 00:10:50,240 --> 00:10:54,560 that we saw uh in one of our earlier 306 00:10:52,160 --> 00:10:57,040 slides you can clearly see that uh you 307 00:10:54,560 --> 00:10:59,040 know proc and sys uh you know were very 308 00:10:57,040 --> 00:11:01,519 inconsistent in information when there was 309 00:10:59,040 --> 00:11:03,600 a cpuset restriction applied to it uh 310 00:11:01,519 --> 00:11:05,519 when we look at it from a cpu namespace 311 00:11:03,600 --> 00:11:07,519 point of view when we add the cpu 312 00:11:05,519 --> 00:11:10,000 namespace layer to it we can now see 313 00:11:07,519 --> 00:11:13,120 that a scrambled map is first generated 314 00:11:10,000 --> 00:11:15,519 and cgroupfs viewing cgroupfs within 315 00:11:13,120 --> 00:11:18,720 this container now yields a translation 316 00:11:15,519 --> 00:11:21,440 of 5 12 21 23 and similarly all the all 317 00:11:18,720 --> 00:11:24,240 the system calls procfs as well as sysfs 318 00:11:21,440 --> 00:11:25,600 give this exact view to the system and 319 00:11:24,240 --> 00:11:27,440 which is in accordance to whatever 320 00:11:25,600 --> 00:11:29,839 restrictions that were applied to it of 321 00:11:27,440 --> 00:11:32,399 course uh these namespace cpus are 322 00:11:29,839 --> 00:11:35,839 going to translate to these real cpus so 323 00:11:32,399 --> 00:11:38,399 where 5 12 21 23 is going to 
translate to 324 00:11:35,839 --> 00:11:40,560 32 33 34 and 35. 325 00:11:38,399 --> 00:11:43,440 so basically this is the design of what 326 00:11:40,560 --> 00:11:45,360 a cpu namespace is uh the reference link 327 00:11:43,440 --> 00:11:47,040 is also uh is also on the top of this 328 00:11:45,360 --> 00:11:49,040 slide uh where this is where we have 329 00:11:47,040 --> 00:11:50,959 posted the patches and we have had quite 330 00:11:49,040 --> 00:11:53,040 a few interesting discussions uh on on 331 00:11:50,959 --> 00:11:56,560 this as well which we will discuss 332 00:11:53,040 --> 00:11:58,560 further uh in the in the coming slides 333 00:11:56,560 --> 00:12:00,720 so um 334 00:11:58,560 --> 00:12:02,639 well we we showed that there is a 335 00:12:00,720 --> 00:12:04,240 problem of inconsistency of information 336 00:12:02,639 --> 00:12:05,680 and we kind of showed with our solution 337 00:12:04,240 --> 00:12:08,639 that we kind of 338 00:12:05,680 --> 00:12:10,480 have a a a solution that elegantly 339 00:12:08,639 --> 00:12:12,160 brings together all these interfaces to 340 00:12:10,480 --> 00:12:14,320 solve this problem 341 00:12:12,160 --> 00:12:17,680 but is there any performance benefit to 342 00:12:14,320 --> 00:12:19,360 doing this as a spoiler alert yes uh 343 00:12:17,680 --> 00:12:22,240 in in our experiment we have done this 344 00:12:19,360 --> 00:12:25,680 on an ibm power 9 machine with 44 smt4 345 00:12:22,240 --> 00:12:28,000 cores so 176 cpus uh but these 346 00:12:25,680 --> 00:12:28,720 experiments are architecture agnostic 347 00:12:28,000 --> 00:12:31,440 uh 348 00:12:28,720 --> 00:12:34,240 as well so our experiment is as follows 349 00:12:31,440 --> 00:12:36,639 we benchmark nginx which is a http 350 00:12:34,240 --> 00:12:38,959 server and which is a fairly modern http 351 00:12:36,639 --> 00:12:40,560 server family at and we benchmark it 352 00:12:38,959 --> 00:12:43,120 with a multi-threaded workload called 353 
00:12:40,560 --> 00:12:44,160 wrk which is a very simple http load 354 00:12:43,120 --> 00:12:46,320 generator 355 00:12:44,160 --> 00:12:48,320 now nginx is configured with 356 00:12:46,320 --> 00:12:50,079 worker_processes auto and this auto really 357 00:12:48,320 --> 00:12:51,600 helps us to enable this application to 358 00:12:50,079 --> 00:12:54,480 manage resources based on the system 359 00:12:51,600 --> 00:12:56,560 configuration that it sees uh the nginx 360 00:12:54,480 --> 00:12:59,279 container is configured with a cpuset of 361 00:12:56,560 --> 00:13:01,440 four cpus and the wrk benchmark spawns 362 00:12:59,279 --> 00:13:03,760 about 500 requests in 30 seconds over four 363 00:13:01,440 --> 00:13:04,480 threads which is enough to saturate uh 364 00:13:03,760 --> 00:13:06,800 the 365 00:13:04,480 --> 00:13:09,120 the resources of uh of four cpus that we 366 00:13:06,800 --> 00:13:10,720 have bound our nginx container by 367 00:13:09,120 --> 00:13:12,800 uh on the right hand side you can see 368 00:13:10,720 --> 00:13:15,200 that there is a small summary graph of 369 00:13:12,800 --> 00:13:17,680 of the percentage of improvement that uh 370 00:13:15,200 --> 00:13:19,519 you really see uh with this experiment 371 00:13:17,680 --> 00:13:22,079 so we have we have a few metrics of 372 00:13:19,519 --> 00:13:25,200 measurements uh first is the is the memory 373 00:13:22,079 --> 00:13:27,440 usage at initialization and 374 00:13:25,200 --> 00:13:29,360 when you compare it with a vanilla 5.14 375 00:13:27,440 --> 00:13:30,880 kernel you can see that it has dropped by 376 00:13:29,360 --> 00:13:33,360 about 91 percent 377 00:13:30,880 --> 00:13:35,360 uh similarly memory at peak drops by 378 00:13:33,360 --> 00:13:37,200 about eighty nine percent uh throttle 379 00:13:35,360 --> 00:13:39,360 drops by about seventy four percent 380 00:13:37,200 --> 00:13:42,079 latency drops by about thirteen percent 381 00:13:39,360 --> 00:13:43,839 and requests per second uh 
improves and 382 00:13:42,079 --> 00:13:46,560 or increases by about twenty point seven 383 00:13:43,839 --> 00:13:49,279 three percent um we can clearly see that 384 00:13:46,560 --> 00:13:52,560 there is a net net uh improvement in in 385 00:13:49,279 --> 00:13:54,320 providing consistent information 386 00:13:52,560 --> 00:13:56,800 to applications 387 00:13:54,320 --> 00:14:01,120 but uh why is that uh in the next slide 388 00:13:56,800 --> 00:14:04,480 i aim to answer just that and uh 389 00:14:01,120 --> 00:14:07,040 and basically uh if you look at the top 390 00:14:04,480 --> 00:14:09,600 most left side of it which is the pids 391 00:14:07,040 --> 00:14:12,800 you can see that uh uh 392 00:14:09,600 --> 00:14:15,600 on a vanilla kernel it spawns about 177 393 00:14:12,800 --> 00:14:18,560 uh processes whereas on a cpu namespace 394 00:14:15,600 --> 00:14:20,160 kernel it just spawns about five 177 is not 395 00:14:18,560 --> 00:14:22,000 a random number uh as we have seen 396 00:14:20,160 --> 00:14:24,800 before uh the system is configured to 397 00:14:22,000 --> 00:14:26,880 have about 176 cpus so basically it 398 00:14:24,800 --> 00:14:28,959 spawns 176 worker threads plus one 399 00:14:26,880 --> 00:14:30,639 master thread and in the in the case of 400 00:14:28,959 --> 00:14:33,839 cpu namespace it spawns four worker 401 00:14:30,639 --> 00:14:35,920 threads plus uh uh one master thread and 402 00:14:33,839 --> 00:14:38,639 uh as a result of that or as a 403 00:14:35,920 --> 00:14:40,480 consequence of that uh the memory usage 404 00:14:38,639 --> 00:14:42,480 in the in the vanilla kernel is pretty 405 00:14:40,480 --> 00:14:44,959 high because now it needs to allocate 406 00:14:42,480 --> 00:14:46,639 memory to to keep track of all these uh 407 00:14:44,959 --> 00:14:49,440 extra pids 408 00:14:46,639 --> 00:14:51,120 uh because of that also the throttle is 409 00:14:49,440 --> 00:14:54,399 pretty high and it's it's as high as 410 00:14:51,120 
--> 00:14:56,959 about 97 percent of throttling and this is also 411 00:14:54,399 --> 00:14:59,199 because now we are trying to run a 412 00:14:56,959 --> 00:15:01,920 very trying to uh there are a lot of 413 00:14:59,199 --> 00:15:03,279 these resources we're trying to contend a 414 00:15:01,920 --> 00:15:05,120 lot of these pids that are trying to 415 00:15:03,279 --> 00:15:06,480 contend for the same exact resource and 416 00:15:05,120 --> 00:15:08,560 that this resource is quite 417 00:15:06,480 --> 00:15:10,880 constrained and therefore there's there 418 00:15:08,560 --> 00:15:12,720 is going to be quite a bit of throttling 419 00:15:10,880 --> 00:15:14,720 but even though there is throttling they 420 00:15:12,720 --> 00:15:17,279 are essentially trying to do the same 421 00:15:14,720 --> 00:15:19,279 thing uh do the thing of the same task 422 00:15:17,279 --> 00:15:21,040 and uh and there shouldn't be a lot of 423 00:15:19,279 --> 00:15:24,399 hit on the performance in in that case 424 00:15:21,040 --> 00:15:26,959 right uh but that is not true uh 425 00:15:24,399 --> 00:15:28,480 in that case you're going to be hit by 426 00:15:26,959 --> 00:15:31,199 scheduler overhead such as context 427 00:15:28,480 --> 00:15:34,160 switches and and that basically kind of 428 00:15:31,199 --> 00:15:35,680 shows us that the requests per second or 429 00:15:34,160 --> 00:15:37,680 the throughput 430 00:15:35,680 --> 00:15:40,800 is quite a bit higher in terms of our cpu 431 00:15:37,680 --> 00:15:42,800 namespace and and our latency also uh 432 00:15:40,800 --> 00:15:44,800 latency goes lower which is uh where 433 00:15:42,800 --> 00:15:46,639 lower is better uh 434 00:15:44,800 --> 00:15:48,160 in our cpu namespace as well and this is 435 00:15:46,639 --> 00:15:50,160 this is all because of these extra 436 00:15:48,160 --> 00:15:52,160 overheads that now that the application 437 00:15:50,160 --> 00:15:54,000 needs to really uh uh 438 00:15:52,160 --> 
or the kernel really needs to uh you 439 00:15:54,000 --> 00:15:56,959 know handle 440 00:15:56,160 --> 00:15:59,360 so 441 00:15:56,959 --> 00:16:02,079 we we kind of showed that there there is 442 00:15:59,360 --> 00:16:03,920 some benefit of doing uh uh of running 443 00:16:02,079 --> 00:16:05,680 these uh of giving consistent 444 00:16:03,920 --> 00:16:07,920 information uh now the proof of the 445 00:16:05,680 --> 00:16:09,600 pudding is you know really eating it so 446 00:16:07,920 --> 00:16:12,320 let's let's show you our implementation 447 00:16:09,600 --> 00:16:14,320 of how we uh really do cpu namespace 448 00:16:12,320 --> 00:16:16,639 and uh just so that you have an idea 449 00:16:14,320 --> 00:16:19,279 that uh uh the idea that how it really 450 00:16:16,639 --> 00:16:22,560 works in the linux kernel today so as 451 00:16:19,279 --> 00:16:24,240 you can see there are two tabs um or in 452 00:16:22,560 --> 00:16:26,079 the terminal uh the right hand side is 453 00:16:24,240 --> 00:16:28,560 basically the initial cpu namespace 454 00:16:26,079 --> 00:16:30,079 which is the host outside a container 455 00:16:28,560 --> 00:16:32,560 and the left hand side is the cpu 456 00:16:30,079 --> 00:16:33,600 namespace a um just a is an as an 457 00:16:32,560 --> 00:16:36,480 acronym 458 00:16:33,600 --> 00:16:38,079 within a container like docker right 459 00:16:36,480 --> 00:16:39,920 on the left hand side we basically start 460 00:16:38,079 --> 00:16:42,800 a very simple ubuntu container uh with 461 00:16:39,920 --> 00:16:44,000 a bash prompt and we we name it say p 462 00:16:42,800 --> 00:16:47,279 example 463 00:16:44,000 --> 00:16:49,360 and uh and we we run it unconstrained so 464 00:16:47,279 --> 00:16:51,360 when we do an lscpu we should see you 465 00:16:49,360 --> 00:16:52,880 should see all these cpus that that 466 00:16:51,360 --> 00:16:55,839 exist on this system 467 00:16:52,880 --> 00:16:58,560 um similarly if you should do a cat of 468 
00:16:55,839 --> 00:17:00,720 you know cpuset.cpus in the cgroupfs 469 00:16:58,560 --> 00:17:03,519 directory you will also see the entire 470 00:17:00,720 --> 00:17:05,760 list of cpus now when we try to restrict 471 00:17:03,519 --> 00:17:07,120 this container's cpuset and we will try 472 00:17:05,760 --> 00:17:09,360 to restrict it with the docker update 473 00:17:07,120 --> 00:17:11,679 command and we'll restrict it to cpuset 474 00:17:09,360 --> 00:17:13,760 zero to three uh in this case now when 475 00:17:11,679 --> 00:17:15,600 we do an lscpu now you can see that 476 00:17:13,760 --> 00:17:17,839 there are only four cpus that uh that 477 00:17:15,600 --> 00:17:21,120 exist on the system and uh and the 478 00:17:17,839 --> 00:17:24,559 online cpu is a scrambled map of uh of 479 00:17:21,120 --> 00:17:26,079 four uh you know randomized uh cpus of course 480 00:17:24,559 --> 00:17:27,679 uh there are there are a few things like 481 00:17:26,079 --> 00:17:29,280 the numa node zero and numa 482 00:17:27,679 --> 00:17:32,000 node eight which is not uh really 483 00:17:29,280 --> 00:17:34,240 virtualized uh in information or or or 484 00:17:32,000 --> 00:17:35,840 it's not really uh 485 00:17:34,240 --> 00:17:37,840 shown in in its 486 00:17:35,840 --> 00:17:40,160 true scrambled map but uh this is a 487 00:17:37,840 --> 00:17:43,840 proof of concept and you know we aim to 488 00:17:40,160 --> 00:17:46,799 have a a more fully fledged uh uh 489 00:17:43,840 --> 00:17:49,360 we aim to fully flesh this out 490 00:17:46,799 --> 00:17:50,960 um so next you know same thing if you 491 00:17:49,360 --> 00:17:52,640 try to look at the cgroup cpuset 492 00:17:50,960 --> 00:17:54,240 interface basically it shows us that 493 00:17:52,640 --> 00:17:56,880 this information is also consistent with 494 00:17:54,240 --> 00:17:58,799 whatever lscpu saw and it is uh it is 495 00:17:56,880 --> 00:18:00,000 basically just these four cpus in a 496 00:17:58,799 --> 
00:18:00,880 scrambled 497 00:18:00,000 --> 00:18:02,720 way 498 00:18:00,880 --> 00:18:04,720 uh next we try to spawn 499 00:18:02,720 --> 00:18:06,960 a stress on one of these available cpus 500 00:18:04,720 --> 00:18:09,360 and we'll taskset it to uh one of our 501 00:18:06,960 --> 00:18:11,200 available cpus that is cpu 17 and we 502 00:18:09,360 --> 00:18:14,480 stress it to minus c 1 which is just the 503 00:18:11,200 --> 00:18:16,480 number of cpus is to one cpu and we can do a 504 00:18:14,480 --> 00:18:18,080 taskset minus cp to that task to see 505 00:18:16,480 --> 00:18:20,000 whatever the current affinity of that 506 00:18:18,080 --> 00:18:22,160 task is and we can clearly see that it 507 00:18:20,000 --> 00:18:24,480 is uh now 17 which means that the system 508 00:18:22,160 --> 00:18:27,039 call is also coherent 509 00:18:24,480 --> 00:18:30,640 coherently showing this information 510 00:18:27,039 --> 00:18:33,120 if we do a top on to on top of this uh 511 00:18:30,640 --> 00:18:36,240 you can see that now it according to 512 00:18:33,120 --> 00:18:39,360 this container it has only four cpus and 513 00:18:36,240 --> 00:18:43,120 cpu 17 is what is uh consuming a 100 514 00:18:39,360 --> 00:18:45,840 percent of you know utilization 515 00:18:43,120 --> 00:18:47,600 uh on on the similar side if you if you 516 00:18:45,840 --> 00:18:49,440 try to look at the same information from 517 00:18:47,600 --> 00:18:50,960 outside this container if you try to 518 00:18:49,440 --> 00:18:53,039 view this task's affinity from outside 519 00:18:50,960 --> 00:18:54,720 this container we can do a top and you 520 00:18:53,039 --> 00:18:56,720 know we can see what what consumes 521 00:18:54,720 --> 00:18:59,679 hundred percent of cpu time that is cpu 522 00:18:56,720 --> 00:19:01,840 zero uh and if you try to do this with 523 00:18:59,679 --> 00:19:04,960 this with the taskset minus cp by 524 00:19:01,840 --> 00:19:06,720 getting the uh you know ps minus ef of grep 525 00:19:04,960 -->
00:19:08,240 stress uh where of course you're now 526 00:19:06,720 --> 00:19:10,240 gonna get like two tasks one is the 527 00:19:08,240 --> 00:19:12,000 parent and the child so parent really 528 00:19:10,240 --> 00:19:13,760 spawns the stressor and we can really 529 00:19:12,000 --> 00:19:15,919 look at either the parent or the child 530 00:19:13,760 --> 00:19:18,000 to see where it is really uh 531 00:19:15,919 --> 00:19:20,559 bounded by uh you can clearly see that 532 00:19:18,000 --> 00:19:22,400 the taskset minus cp is really 533 00:19:20,559 --> 00:19:25,679 showing us to be that it is bounded to 534 00:19:22,400 --> 00:19:27,679 cpu zero next 535 00:19:25,679 --> 00:19:29,440 if we try to change this affinity to say 536 00:19:27,679 --> 00:19:31,360 cpu 2 which is which is again in the 537 00:19:29,440 --> 00:19:33,919 permissible limit of whatever cpuset 538 00:19:31,360 --> 00:19:36,960 restrictions we have applied to it uh so 539 00:19:33,919 --> 00:19:39,600 if we taskset it to you know cpu 2 540 00:19:36,960 --> 00:19:42,080 we can we can try to see how this uh 541 00:19:39,600 --> 00:19:44,640 you know varies this information uh in 542 00:19:42,080 --> 00:19:46,240 in the top command uh on both uh within 543 00:19:44,640 --> 00:19:47,919 the container and outside the container 544 00:19:46,240 --> 00:19:50,640 so with this outside the container you 545 00:19:47,919 --> 00:19:53,039 can now see that cpu 2 shows 100 percent 546 00:19:50,640 --> 00:19:55,440 utilization and within the container it 547 00:19:53,039 --> 00:19:58,720 shows it has now migrated from uh cpu 548 00:19:55,440 --> 00:20:01,919 17 uh to cpu 83 549 00:19:58,720 --> 00:20:04,320 so so that is pretty much uh you know 550 00:20:01,919 --> 00:20:05,200 what it is and uh 551 00:20:04,320 --> 00:20:07,520 in 552 00:20:05,200 --> 00:20:09,760 in the next slide we will talk about a 553 00:20:07,520 --> 00:20:13,360 few challenges and you know what is the 554 00:20:09,760 --> 00:20:14,960 future
of of isolation of of information 555 00:20:13,360 --> 00:20:18,880 so while 556 00:20:14,960 --> 00:20:20,480 the solution uh works in a way uh but it 557 00:20:18,880 --> 00:20:22,799 is not perfect and and there are a few 558 00:20:20,480 --> 00:20:24,960 challenges associated with it one of the 559 00:20:22,799 --> 00:20:27,039 most uh foremost challenges that that 560 00:20:24,960 --> 00:20:28,880 exists with this is that until now 561 00:20:27,039 --> 00:20:31,520 namespaces and cgroups have been fairly 562 00:20:28,880 --> 00:20:34,400 disjoint from one another uh cpu 563 00:20:31,520 --> 00:20:37,039 namespace kind of breaks that and without 564 00:20:34,400 --> 00:20:39,600 cpu or cpuset cgroups the cpu 565 00:20:37,039 --> 00:20:41,760 namespace itself loses its meaning and 566 00:20:39,600 --> 00:20:42,640 that brings up the question really that 567 00:20:41,760 --> 00:20:45,120 if 568 00:20:42,640 --> 00:20:47,840 it is time to now 569 00:20:45,120 --> 00:20:50,960 define interactions between namespaces and 570 00:20:47,840 --> 00:20:53,440 cgroups uh in in a in a you know 571 00:20:50,960 --> 00:20:55,039 reasonable amount of way and and what 572 00:20:53,440 --> 00:20:57,039 does you know what do containers really 573 00:20:55,039 --> 00:20:59,679 mean from that point onwards 574 00:20:57,039 --> 00:21:01,120 um another uh challenge that that exists 575 00:20:59,679 --> 00:21:02,400 with our current design is that the 576 00:21:01,120 --> 00:21:04,400 current design only addresses 577 00:21:02,400 --> 00:21:07,039 restrictions in space which is you know 578 00:21:04,400 --> 00:21:10,240 cpus and threads and pids and so on but 579 00:21:07,039 --> 00:21:12,960 not uh time and not pids by the way uh 580 00:21:10,240 --> 00:21:15,120 the containers also frequently use cfs 581 00:21:12,960 --> 00:21:16,559 periods and quotas and then it's you 582 00:21:15,120 --> 00:21:18,720 know fondly called millicores in the 583 00:21:16,559 -->
00:21:19,520 kubernetes world and in the cloud world 584 00:21:18,720 --> 00:21:21,440 uh 585 00:21:19,520 --> 00:21:24,240 so how does this information now need to 586 00:21:21,440 --> 00:21:26,159 be exposed for this these restrictions 587 00:21:24,240 --> 00:21:28,320 it can be as simple as some defining 588 00:21:26,159 --> 00:21:30,559 some standards that say that okay if 589 00:21:28,320 --> 00:21:32,320 this is the ratio of period and quota 590 00:21:30,559 --> 00:21:36,000 this is the this is the cpus worth of 591 00:21:32,320 --> 00:21:37,039 runtime but then uh is ratios the only 592 00:21:36,000 --> 00:21:38,559 factor 593 00:21:37,039 --> 00:21:39,520 for that or not 594 00:21:38,559 --> 00:21:42,480 um 595 00:21:39,520 --> 00:21:44,080 lastly while cpu namespace mitigates 596 00:21:42,480 --> 00:21:45,520 potential misuse stemming from knowledge 597 00:21:44,080 --> 00:21:47,679 of topology by obfuscation of 598 00:21:45,520 --> 00:21:49,840 information the topology can still be 599 00:21:47,679 --> 00:21:51,840 roughly figured out uh you know with ipi 600 00:21:49,840 --> 00:21:53,919 latencies to determine who's your 601 00:21:51,840 --> 00:21:55,600 sibling or who's uh or which core is 602 00:21:53,919 --> 00:21:56,840 really far away uh 603 00:21:55,600 --> 00:21:59,039 from 604 00:21:56,840 --> 00:22:01,039 you so 605 00:21:59,039 --> 00:22:03,679 that's that brings us uh to our last 606 00:22:01,039 --> 00:22:06,559 slide of future uh where uh you know the 607 00:22:03,679 --> 00:22:09,120 intention of of these uh of of this 608 00:22:06,559 --> 00:22:11,280 presentation is to spark a discussion on 609 00:22:09,120 --> 00:22:13,840 the problem rather than be the be all and 610 00:22:11,280 --> 00:22:16,559 end all of all solutions 611 00:22:13,840 --> 00:22:18,320 if the solution is for applications to 612 00:22:16,559 --> 00:22:20,559 change and look at cgroupfs or any 613 00:22:18,320 --> 00:22:22,159 other interface there are a few exciting 614
00:22:20,559 --> 00:22:24,640 discussions that are happening around 615 00:22:22,159 --> 00:22:26,080 exporting more useful metrics to entice 616 00:22:24,640 --> 00:22:29,120 applications to change and these 617 00:22:26,080 --> 00:22:32,159 discussions were happening on the uh uh 618 00:22:29,120 --> 00:22:34,080 uh on the patch set that i had posted um 619 00:22:32,159 --> 00:22:35,840 if the solution is an external user 620 00:22:34,080 --> 00:22:37,919 space program bind mounting like custom 621 00:22:35,840 --> 00:22:40,159 sys and proc fs then should that be the 622 00:22:37,919 --> 00:22:42,640 norm for the future as well now should 623 00:22:40,159 --> 00:22:45,440 uh should user space innovations uh 624 00:22:42,640 --> 00:22:47,840 be encouraged further or should we start 625 00:22:45,440 --> 00:22:50,480 looking at uh you know defining and 626 00:22:47,840 --> 00:22:52,880 standardizing uh a lot of these things 627 00:22:50,480 --> 00:22:56,000 uh for you know within the linux kernel 628 00:22:52,880 --> 00:22:56,720 itself and finally is it a time to you 629 00:22:56,000 --> 00:22:58,000 know 630 00:22:56,720 --> 00:23:00,720 finally define 631 00:22:58,000 --> 00:23:02,640 uh a container as a first class object 632 00:23:00,720 --> 00:23:04,320 uh in linux 633 00:23:02,640 --> 00:23:07,200 so that was pretty much all the 634 00:23:04,320 --> 00:23:09,360 questions i had this is a legal slide 635 00:23:07,200 --> 00:23:11,679 for uh 636 00:23:09,360 --> 00:23:12,960 for attributions and finally some 637 00:23:11,679 --> 00:23:15,840 references 638 00:23:12,960 --> 00:23:17,679 so thank you for uh for this i will look 639 00:23:15,840 --> 00:23:18,880 at if there are any questions around it 640 00:23:17,679 --> 00:23:20,880 and uh 641 00:23:18,880 --> 00:23:23,840 and i will try to i'll try my best to 642 00:23:20,880 --> 00:23:23,840 answer them 643 00:23:25,360 --> 00:23:30,559 yes the first question yeah the first 644 00:23:27,520 -->
00:23:32,960 question is does the numa map of the 645 00:23:30,559 --> 00:23:34,559 namespaced cpu still correspond to the 646 00:23:32,960 --> 00:23:36,640 hardware so in the current 647 00:23:34,559 --> 00:23:38,240 implementation that pratik has it does 648 00:23:36,640 --> 00:23:40,000 not so we have not taken that into 649 00:23:38,240 --> 00:23:41,360 consideration but the idea is if we 650 00:23:40,000 --> 00:23:42,960 still want to you know do this whole 651 00:23:41,360 --> 00:23:46,080 permutation thing then we can restrict 652 00:23:42,960 --> 00:23:48,000 the permutation to uh the numa uh node so 653 00:23:46,080 --> 00:23:50,720 that uh you know we are consistent at 654 00:23:48,000 --> 00:23:52,159 least with respect to numa but but 655 00:23:50,720 --> 00:23:54,000 if we are going for a random permutation 656 00:23:52,159 --> 00:23:56,159 we will still not be consistent with 657 00:23:54,000 --> 00:23:58,720 other topological information such as 658 00:23:56,159 --> 00:24:01,679 last level caches and 659 00:23:58,720 --> 00:24:04,640 smt siblings for instance 660 00:24:01,679 --> 00:24:06,799 uh yeah the next question is uh why 661 00:24:04,640 --> 00:24:09,039 scramble the cpus rather than just 662 00:24:06,799 --> 00:24:10,640 showing you know zero to four or in 663 00:24:09,039 --> 00:24:12,240 general i think zero to three is what 664 00:24:10,640 --> 00:24:15,120 meant what was meant here since there 665 00:24:12,240 --> 00:24:18,480 are four cpus now we can we can easily 666 00:24:15,120 --> 00:24:20,320 do that i mean we pick the most uh 667 00:24:18,480 --> 00:24:22,799 generic permutation that one could think 668 00:24:20,320 --> 00:24:24,960 of but it is very easily possible to you 669 00:24:22,799 --> 00:24:27,120 know redefine this map to be just you 670 00:24:24,960 --> 00:24:28,799 know zero to n or or if if that is not 671 00:24:27,120 --> 00:24:30,400 preferable we can just have a one to one 672 00:24:28,799 --> 00:24:32,320 map
where you know whatever the host 673 00:24:30,400 --> 00:24:34,720 sees it's the same cpus that you know 674 00:24:32,320 --> 00:24:38,000 you see inside the container as well so 675 00:24:34,720 --> 00:24:41,200 it's just an implementational detail 676 00:24:38,000 --> 00:24:42,720 it's a it's a matter of choice 677 00:24:41,200 --> 00:24:44,080 so i think you want to add anything to 678 00:24:42,720 --> 00:24:46,000 that 679 00:24:44,080 --> 00:24:47,440 no no i think you're absolutely right so 680 00:24:46,000 --> 00:24:49,600 it's just uh it's just a matter of 681 00:24:47,440 --> 00:24:51,039 implementation details and uh and that 682 00:24:49,600 --> 00:24:53,200 was pretty much the first thing that i 683 00:24:51,039 --> 00:24:55,520 implemented uh so like i just also said 684 00:24:53,200 --> 00:24:57,600 before right this could easily be as a 685 00:24:55,520 --> 00:24:59,440 zero to three map or or a one-to-one map 686 00:24:57,600 --> 00:25:03,120 as well right it should show you 687 00:24:59,440 --> 00:25:03,120 whatever cpus that you really have 688 00:25:04,320 --> 00:25:08,000 okay uh 689 00:25:06,159 --> 00:25:10,640 there's one more question 690 00:25:08,000 --> 00:25:13,360 which asks is the idea that you would 691 00:25:10,640 --> 00:25:15,679 scramble the cpu ids 692 00:25:13,360 --> 00:25:18,000 and node ids together 693 00:25:15,679 --> 00:25:20,240 in some way where the cpus on the same 694 00:25:18,000 --> 00:25:23,200 numa node will still appear on the same 695 00:25:20,240 --> 00:25:26,880 numa node inside the container 696 00:25:23,200 --> 00:25:28,880 yes that that that is the eventual idea 697 00:25:26,880 --> 00:25:30,480 but but it is not present in the current 698 00:25:28,880 --> 00:25:34,880 implementation so current implementation 699 00:25:30,480 --> 00:25:38,240 for instance if you say take two cpus 700 00:25:34,880 --> 00:25:39,679 five and say 130 which happen to be 701 00:25:38,240 --> 00:25:41,919 these are real cpu numbers
which happen 702 00:25:39,679 --> 00:25:44,159 to be different numa ids when you view it 703 00:25:41,919 --> 00:25:46,559 inside the container they may still 704 00:25:44,159 --> 00:25:48,400 get you know some numbers like 705 00:25:46,559 --> 00:25:49,679 10 and 11 706 00:25:48,400 --> 00:25:51,520 which 707 00:25:49,679 --> 00:25:52,720 inside the container you know map to 708 00:25:51,520 --> 00:25:54,159 the same 709 00:25:52,720 --> 00:25:55,200 same numa 710 00:25:54,159 --> 00:25:57,120 id so 711 00:25:55,200 --> 00:25:58,720 that is something that we need to fix 712 00:25:57,120 --> 00:26:00,159 when we are setting this permutation so 713 00:25:58,720 --> 00:26:01,919 currently i think we are taking all the 714 00:26:00,159 --> 00:26:03,440 cpus and then we are 715 00:26:01,919 --> 00:26:05,279 defining a permutation at the start of 716 00:26:03,440 --> 00:26:06,960 the container that could be anything to 717 00:26:05,279 --> 00:26:09,039 anything but then we can partition these 718 00:26:06,960 --> 00:26:12,559 and have these permutations within those 719 00:26:09,039 --> 00:26:12,559 numa nodes that's that's possible 720 00:26:13,760 --> 00:26:17,760 there's one more question cpu scrambling 721 00:26:15,919 --> 00:26:20,159 hides topology information from the 722 00:26:17,760 --> 00:26:22,320 container but doesn't that mean that 723 00:26:20,159 --> 00:26:24,720 apps that try and optimize their access 724 00:26:22,320 --> 00:26:26,880 patterns for numa will actually be 725 00:26:24,720 --> 00:26:29,039 anti-optimized for memory access that is 726 00:26:26,880 --> 00:26:30,559 true that is true and like like i said 727 00:26:29,039 --> 00:26:32,080 and it's it's something that we have not 728 00:26:30,559 --> 00:26:36,440 taken care of in our current implementation 729 00:26:32,080 --> 00:26:36,440 but it's not very hard to do that 730 00:26:42,080 --> 00:26:47,279 any other questions 731 00:26:45,039 --> 00:26:49,679 any comments because what we are
really 732 00:26:47,279 --> 00:26:51,279 looking for is feedback since there are 733 00:26:49,679 --> 00:26:53,840 user space solutions there are 734 00:26:51,279 --> 00:26:56,400 alternative you know kernel uh solutions 735 00:26:53,840 --> 00:26:58,159 what would be the right way forward uh 736 00:26:56,400 --> 00:26:59,840 you know to provide a consistent 737 00:26:58,159 --> 00:27:03,279 information to 738 00:26:59,840 --> 00:27:03,279 applications running inside containers 739 00:27:03,840 --> 00:27:09,440 what user space pieces are responsible 740 00:27:06,799 --> 00:27:10,480 for setting up the cpu mappings 741 00:27:09,440 --> 00:27:13,279 i 742 00:27:10,480 --> 00:27:14,799 am assuming that this question 743 00:27:13,279 --> 00:27:16,159 you know is restricted to our 744 00:27:14,799 --> 00:27:17,679 implementation of 745 00:27:16,159 --> 00:27:19,279 cpu namespace 746 00:27:17,679 --> 00:27:21,200 so pratik you want to take that what 747 00:27:19,279 --> 00:27:23,520 user space pieces are responsible for 748 00:27:21,200 --> 00:27:26,320 setting up the cpu mapping 749 00:27:23,520 --> 00:27:29,200 um so responsible for cpu mappings is 750 00:27:26,320 --> 00:27:32,559 nothing much really we we expose this by 751 00:27:29,200 --> 00:27:34,960 uh by a clone system call and uh we have 752 00:27:32,559 --> 00:27:36,559 just defined a new system call uh or 753 00:27:34,960 --> 00:27:38,720 we just defined a new flag in this clone 754 00:27:36,559 --> 00:27:41,440 system call where if you if you call 755 00:27:38,720 --> 00:27:44,080 clone new cpu uh you're you're basically 756 00:27:41,440 --> 00:27:45,919 going to get a new cpu namespace and uh 757 00:27:44,080 --> 00:27:47,200 these these are these are automatically 758 00:27:45,919 --> 00:27:49,600 going to be mapped at the start of 759 00:27:47,200 --> 00:27:51,679 creating this cpu namespace 760 00:27:49,600 --> 00:27:53,279 so yeah in addition to i mean that that 761 00:27:51,679 --> 00:27:54,880
clone is something that in our current 762 00:27:53,279 --> 00:27:57,440 implementation 763 00:27:54,880 --> 00:27:59,760 it is set by default whenever uh pid 764 00:27:57,440 --> 00:28:03,120 namespace is asked for so right so 765 00:27:59,760 --> 00:28:04,640 that's a hack uh but apart from that 766 00:28:03,120 --> 00:28:05,919 there is nothing that the user space 767 00:28:04,640 --> 00:28:07,679 currently needs to do for our 768 00:28:05,919 --> 00:28:09,440 implementation in order to get you know 769 00:28:07,679 --> 00:28:12,640 this restricted information because that 770 00:28:09,440 --> 00:28:14,640 is exposed through proc and sys and 771 00:28:12,640 --> 00:28:17,120 utilities such as top anyway read 772 00:28:14,640 --> 00:28:20,399 this information so the idea is to 773 00:28:17,120 --> 00:28:20,399 present consistent information 774 00:28:22,880 --> 00:28:26,000 proc sys 775 00:28:24,000 --> 00:28:27,520 cgroupfs and 776 00:28:26,000 --> 00:28:30,159 system calls such as set and get 777 00:28:27,520 --> 00:28:32,240 affinity 778 00:28:30,159 --> 00:28:34,159 yeah it's a hack only because that you 779 00:28:32,240 --> 00:28:36,240 know we want to use all these 780 00:28:34,159 --> 00:28:37,039 pre-made utilities like docker uh to 781 00:28:36,240 --> 00:28:39,440 really 782 00:28:37,039 --> 00:28:40,960 get going of course you you you can you 783 00:28:39,440 --> 00:28:43,120 can write your own c programs and call 784 00:28:40,960 --> 00:28:44,640 your clone system calls and 785 00:28:43,120 --> 00:28:47,640 get the same thing up and running as 786 00:28:44,640 --> 00:28:47,640 well 787 00:28:59,679 --> 00:29:02,320 all right uh 788 00:29:04,720 --> 00:29:09,919 yeah we still have a minute we can we 789 00:29:06,720 --> 00:29:12,880 can take any questions or comments 790 00:29:09,919 --> 00:29:14,720 yep we've still got uh about one minute 791 00:29:12,880 --> 00:29:18,360 so if anyone has any has one last 792 00:29:14,720 --> 00:29:18,360
question to put in 793 00:29:27,760 --> 00:29:31,799 now we've got one coming in 794 00:29:40,159 --> 00:29:45,120 the question is 795 00:29:41,520 --> 00:29:48,080 once the cpu namespace is unshared 796 00:29:45,120 --> 00:29:51,520 how are the additions or removals or 797 00:29:48,080 --> 00:29:52,840 renumberings of the cpus controlled 798 00:29:51,520 --> 00:29:56,159 okay 799 00:29:52,840 --> 00:29:59,039 uh i think you want to take that so 800 00:29:56,159 --> 00:30:00,720 yeah sure uh so basically uh uh when 801 00:29:59,039 --> 00:30:03,200 where so these are these are basically 802 00:30:00,720 --> 00:30:05,360 virtual sort of mappings right uh they 803 00:30:03,200 --> 00:30:07,679 don't really matter uh 804 00:30:05,360 --> 00:30:10,000 they only matter in the sense of that 805 00:30:07,679 --> 00:30:12,240 namespace itself and when you when you 806 00:30:10,000 --> 00:30:15,679 unshare it uh those those mappings just 807 00:30:12,240 --> 00:30:17,039 go away and uh and i and uh of course 808 00:30:15,679 --> 00:30:18,480 the app the 809 00:30:17,039 --> 00:30:20,480 the the tasks that are that have been 810 00:30:18,480 --> 00:30:22,640 running in those mappings now are mapped 811 00:30:20,480 --> 00:30:24,960 to the translations or the or the real 812 00:30:22,640 --> 00:30:27,200 numberings uh of whatever uh physical 813 00:30:24,960 --> 00:30:31,200 cpus or or logical cpus in terms of 814 00:30:27,200 --> 00:30:31,200 linux is really really mapped to 815 00:30:31,360 --> 00:30:36,320 so so they will see the entire system i 816 00:30:33,760 --> 00:30:38,159 mean if you add a new cpu uh you will 817 00:30:36,320 --> 00:30:39,760 they'll see the you know the permutation 818 00:30:38,159 --> 00:30:43,440 that has been assigned to that new cpu 819 00:30:39,760 --> 00:30:46,159 when you remove a cpu for instance uh uh 820 00:30:43,440 --> 00:30:47,919 then that number goes away and if you if 821 00:30:46,159 --> 00:30:49,120 you unshare the entire
namespace i think 822 00:30:47,919 --> 00:30:51,279 the current implementation will give 823 00:30:49,120 --> 00:30:54,880 whatever the host will see so 824 00:30:51,279 --> 00:30:54,880 just get rid of that permutation 825 00:30:56,960 --> 00:30:59,679 well 826 00:30:57,760 --> 00:31:02,240 i think that brings us to the end of our 827 00:30:59,679 --> 00:31:05,760 time slot so thank you very much to uh 828 00:31:02,240 --> 00:31:06,640 pratik and gautham for their talk 829 00:31:05,760 --> 00:31:09,600 um 830 00:31:06,640 --> 00:31:11,760 and that brings us to afternoon tea uh 831 00:31:09,600 --> 00:31:13,519 so we'll be taking a 30 minute break 832 00:31:11,760 --> 00:31:16,159 until the next talk which will be uh 833 00:31:13,519 --> 00:31:18,559 alice ferrazzi uh talking about merging an 834 00:31:16,159 --> 00:31:22,000 existing framework into kernelci 835 00:31:18,559 --> 00:31:23,440 that will be coming up at 3 40 pm 836 00:31:22,000 --> 00:31:25,440 see you all then have a good afternoon 837 00:31:23,440 --> 00:31:27,760 tea 838 00:31:25,440 --> 00:31:31,000 thank you thank you 839 00:31:27,760 --> 00:31:31,000 thank you
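
The demo and the Q&A both come back to one property: sysfs, cgroupfs and the affinity system calls should report the same set of cpus to a task inside a container. A minimal sketch of how that could be checked from inside a container follows. It is not the presenters' code. The cgroup file paths are assumptions (they differ between cgroup v1 and v2 and depend on where the hierarchy is mounted), and `parse_cpu_list`, `read_cpu_list` and `collect_views` are hypothetical helper names chosen for illustration.

```python
import os
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_cpu_list(path):
    """Return the CPU set stored in a file, or None if it cannot be read or parsed."""
    try:
        return parse_cpu_list(Path(path).read_text())
    except (OSError, ValueError):
        return None


def collect_views():
    """Gather the CPU sets reported by each interface that is available."""
    views = {
        # sysfs view of online cpus (what tools like lscpu typically read)
        "sysfs online": read_cpu_list("/sys/devices/system/cpu/online"),
        # Assumed cgroup v2 location of the effective cpuset
        "cgroup v2 cpuset": read_cpu_list("/sys/fs/cgroup/cpuset.cpus.effective"),
        # Assumed cgroup v1 location of the cpuset
        "cgroup v1 cpuset": read_cpu_list("/sys/fs/cgroup/cpuset/cpuset.cpus"),
    }
    # System call view: the affinity mask of the current process (Linux only)
    if hasattr(os, "sched_getaffinity"):
        views["sched_getaffinity"] = set(os.sched_getaffinity(0))
    return {name: cpus for name, cpus in views.items() if cpus is not None}


if __name__ == "__main__":
    views = collect_views()
    for name, cpus in views.items():
        print(f"{name:20s} {sorted(cpus)}")
    distinct = {frozenset(cpus) for cpus in views.values()}
    print("views agree" if len(distinct) <= 1 else "views disagree")
```

On a stock kernel, a container restricted with a cpuset would usually be expected to show a wider sysfs view than its cpuset and affinity views, which is the inconsistency the talk sets out to remove.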