1
00:00:06,320 --> 00:00:11,499
[Music]

2
00:00:15,759 --> 00:00:20,960
good afternoon welcome back

3
00:00:17,920 --> 00:00:22,320
um next up we have uh

4
00:00:20,960 --> 00:00:25,680
project uh

5
00:00:22,320 --> 00:00:28,560
pratik rajesh sampat and gautham shenoy

6
00:00:25,680 --> 00:00:31,359
presenting on cpu namespaces a mechanism

7
00:00:28,560 --> 00:00:33,040
to isolate cpu topology information

8
00:00:31,359 --> 00:00:35,120
in the linux kernel

9
00:00:33,040 --> 00:00:38,079
pratik is a linux kernel developer at

10
00:00:35,120 --> 00:00:39,680
ibm who works primarily with schedulers

11
00:00:38,079 --> 00:00:42,079
and energy management but also on

12
00:00:39,680 --> 00:00:43,520
container primitives and gautham is a

13
00:00:42,079 --> 00:00:45,440
kernel programmer who has been working

14
00:00:43,520 --> 00:00:47,200
on the kernel since 2006 and has

15
00:00:45,440 --> 00:00:49,840
contributed to hot plug process

16
00:00:47,200 --> 00:00:52,079
scheduling rcu lockdep cpuidle and

17
00:00:49,840 --> 00:00:54,000
other things

18
00:00:52,079 --> 00:00:57,559
so please welcome them as they present

19
00:00:54,000 --> 00:00:57,559
on cpu namespaces

20
00:00:58,000 --> 00:01:02,079
thank you thank you uh hello everyone i

21
00:01:00,320 --> 00:01:04,479
am pratik sampat and i work for the

22
00:01:02,079 --> 00:01:06,479
linux technology center at ibm uh with

23
00:01:04,479 --> 00:01:08,560
me i have gautham shenoy who works with

24
00:01:06,479 --> 00:01:10,479
the kernel team at amd and today we're

25
00:01:08,560 --> 00:01:13,040
here to talk about the isolation of cpu

26
00:01:10,479 --> 00:01:15,439
information in the linux kernel

27
00:01:13,040 --> 00:01:17,680
the agenda for our talk really is uh

28
00:01:15,439 --> 00:01:19,759
that we'll first highlight the question

29
00:01:17,680 --> 00:01:21,600
of what is the purpose of sys and proc

30
00:01:19,759 --> 00:01:23,200
in the world of containers and are there

31
00:01:21,600 --> 00:01:25,520
any implications of exposing this

32
00:01:23,200 --> 00:01:27,280
information next we'll talk about some

33
00:01:25,520 --> 00:01:29,280
existing solutions that help mitigate

34
00:01:27,280 --> 00:01:31,119
this problem we then present our stab

35
00:01:29,280 --> 00:01:33,360
at a solution uh called the cpu

36
00:01:31,119 --> 00:01:34,960
namespace uh some experiments around it

37
00:01:33,360 --> 00:01:36,799
and finally we pose questions around the

38
00:01:34,960 --> 00:01:38,320
challenges that exist in this

39
00:01:36,799 --> 00:01:40,400
space

40
00:01:38,320 --> 00:01:43,040
all right motivation

41
00:01:40,400 --> 00:01:44,960
so a short introduction to sysfs is that

42
00:01:43,040 --> 00:01:46,880
it's a file system that is used to

43
00:01:44,960 --> 00:01:48,799
expose kernel information to user space

44
00:01:46,880 --> 00:01:50,720
and often it's about kernel subsystems

45
00:01:48,799 --> 00:01:52,079
and the hardware that runs it

46
00:01:50,720 --> 00:01:54,079
applications today can look at this

47
00:01:52,079 --> 00:01:55,680
interface to determine system resources

48
00:01:54,079 --> 00:01:57,439
and they can make decisions based on

49
00:01:55,680 --> 00:01:59,439
that such as allocating resources like

50
00:01:57,439 --> 00:02:01,200
memory and spawning threads

51
00:01:59,439 --> 00:02:02,719
take an example of containers

52
00:02:01,200 --> 00:02:06,000
containerized applications can be

53
00:02:02,719 --> 00:02:07,119
restricted by cgroups cpuset however

54
00:02:06,000 --> 00:02:08,560
they can be unaware of these

55
00:02:07,119 --> 00:02:10,879
restrictions and can still look at

56
00:02:08,560 --> 00:02:13,120
traditional interfaces of sys and proc

57
00:02:10,879 --> 00:02:14,319
to make decisions uh about their

58
00:02:13,120 --> 00:02:16,959
applications

59
00:02:14,319 --> 00:02:18,879
and now this problem is not only uh

60
00:02:16,959 --> 00:02:21,280
constrained to the realm of containers

61
00:02:18,879 --> 00:02:23,360
but uh but outside containers as well if

62
00:02:21,280 --> 00:02:25,920
you take say taskset or the system

63
00:02:23,360 --> 00:02:28,319
call sched_setaffinity you can use it

64
00:02:25,920 --> 00:02:30,160
to set cpu restrictions on applications

65
00:02:28,319 --> 00:02:32,800
but applications may still choose to

66
00:02:30,160 --> 00:02:34,400
make their decisions based on looking at

67
00:02:32,800 --> 00:02:36,400
sys and proc

68
00:02:34,400 --> 00:02:37,920
so the question that arises from all of

69
00:02:36,400 --> 00:02:39,599
this is that

70
00:02:37,920 --> 00:02:41,840
what do sys and proc really mean in

71
00:02:39,599 --> 00:02:43,920
the context of container restriction and

72
00:02:41,840 --> 00:02:45,599
second what are really the implications

73
00:02:43,920 --> 00:02:47,440
of exposing this information when

74
00:02:45,599 --> 00:02:49,120
applications can only use a very small

75
00:02:47,440 --> 00:02:50,959
subset of it

76
00:02:49,120 --> 00:02:52,640
for the scope of this discussion we will

77
00:02:50,959 --> 00:02:54,480
stick to the implications of cpu

78
00:02:52,640 --> 00:02:56,400
resources however we will also

79
00:02:54,480 --> 00:02:58,480
periodically call out other potential

80
00:02:56,400 --> 00:03:01,120
problems such as memory as well

81
00:02:58,480 --> 00:03:02,480
so coming to the first implication uh

82
00:03:01,120 --> 00:03:04,800
restrictions can be set through

83
00:03:02,480 --> 00:03:06,480
interfaces like cgroup cpuset as i

84
00:03:04,800 --> 00:03:08,480
have already said however there are

85
00:03:06,480 --> 00:03:10,400
multiple interfaces which display cpu

86
00:03:08,480 --> 00:03:12,159
information and these control and

87
00:03:10,400 --> 00:03:13,280
display interfaces may be disjoint from

88
00:03:12,159 --> 00:03:15,840
one another

89
00:03:13,280 --> 00:03:18,800
for example uh you can see that if a

90
00:03:15,840 --> 00:03:20,800
host system has about 128 cpus and it

91
00:03:18,800 --> 00:03:23,200
spawns a container with its restriction

92
00:03:20,800 --> 00:03:24,799
set to cpus 32 to 35 now this

93
00:03:23,200 --> 00:03:27,519
restriction is set through the cgroup

94
00:03:24,799 --> 00:03:29,920
fs itself and viewing this

95
00:03:27,519 --> 00:03:33,760
interface within the container yields

96
00:03:29,920 --> 00:03:36,319
the right 32 to 35 cpus a task within

97
00:03:33,760 --> 00:03:38,959
this container if it calls

98
00:03:36,319 --> 00:03:41,040
sched_getaffinity on itself also gets the right

99
00:03:38,959 --> 00:03:42,799
view of the

100
00:03:41,040 --> 00:03:45,519
restriction that has been applied to it

101
00:03:42,799 --> 00:03:46,879
however if it looks at /proc/stat

102
00:03:45,519 --> 00:03:49,360
which is generally for the load

103
00:03:46,879 --> 00:03:51,519
statistics and applications like top

104
00:03:49,360 --> 00:03:54,560
and htop use it you

105
00:03:51,519 --> 00:03:56,640
will see data about all the cpus and

106
00:03:54,560 --> 00:03:58,319
similarly for sysfs as well if

107
00:03:56,640 --> 00:04:00,959
you look at /sys/devices/system/cpu

108
00:03:58,319 --> 00:04:03,840
which is normally used by nproc and lscpu

109
00:04:00,959 --> 00:04:06,799
kind of utilities they also show

110
00:04:03,840 --> 00:04:09,439
that 128 cpus

111
00:04:06,799 --> 00:04:11,920
exist on this system
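One rough way to observe this mismatch from inside a container, sketched in Python and assuming a Linux host: compare the affinity mask with what /proc/stat and /sys/devices/system/cpu/online report. The helper names parse_cpu_list and visible_cpus are made up for this illustration and are not from the talk.
```python
import os
def parse_cpu_list(text):
    """Expand a kernel cpu list such as '0-3,8' into a set of cpu numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus
def visible_cpus():
    """Collect the cpu view of each interface (hypothetical helper, Linux only)."""
    views = {"sched_getaffinity": set(os.sched_getaffinity(0))}
    with open("/sys/devices/system/cpu/online") as f:
        views["sysfs online"] = parse_cpu_list(f.read())
    with open("/proc/stat") as f:
        # per-cpu lines look like 'cpu0 ...'; the aggregate 'cpu ...' line is skipped
        views["/proc/stat"] = {int(line.split()[0][3:]) for line in f
                               if line.startswith("cpu") and line[3].isdigit()}
    return views
if __name__ == "__main__":
    for name, cpus in visible_cpus().items():
        print(f"{name}: {len(cpus)} cpus")
```
In a cpuset-restricted container on a kernel without cpu namespaces, the first count would be expected to be smaller than the other two.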

112
00:04:09,439 --> 00:04:14,000
we will talk about

113
00:04:11,920 --> 00:04:16,959
the potential impact in terms of

114
00:04:14,000 --> 00:04:18,799
performance of what this really means

115
00:04:16,959 --> 00:04:19,919
in that context in the coming

116
00:04:18,799 --> 00:04:22,400
few slides

117
00:04:19,919 --> 00:04:24,720
next coming to another

118
00:04:22,400 --> 00:04:25,520
implication of fair use

119
00:04:24,720 --> 00:04:27,600
so

120
00:04:25,520 --> 00:04:29,600
when an application that is running

121
00:04:27,600 --> 00:04:31,680
within a container is restricted to some

122
00:04:29,600 --> 00:04:34,400
resources should it still be able to

123
00:04:31,680 --> 00:04:35,280
see the entire system resources

124
00:04:34,400 --> 00:04:38,080
and

125
00:04:35,280 --> 00:04:40,000
if so can this knowledge be potentially

126
00:04:38,080 --> 00:04:42,720
misused in any way

127
00:04:40,000 --> 00:04:44,320
now could there be a user that can

128
00:04:42,720 --> 00:04:46,880
schedule workloads across sockets in

129
00:04:44,320 --> 00:04:48,479
such a way that the bus is now flooded

130
00:04:46,880 --> 00:04:51,199
and other container tenants now

131
00:04:48,479 --> 00:04:53,280
experience a slowdown or could a user

132
00:04:51,199 --> 00:04:56,000
now identify its vicinity from a

133
00:04:53,280 --> 00:04:57,680
peripheral such as a gpu and schedule

134
00:04:56,000 --> 00:05:00,080
themselves closer to get an undue

135
00:04:57,680 --> 00:05:02,400
latency advantage

136
00:05:00,080 --> 00:05:04,240
compared to the

137
00:05:02,400 --> 00:05:05,840
rest of the workloads

138
00:05:04,240 --> 00:05:09,280
so

139
00:05:05,840 --> 00:05:11,120
so are there any solutions that exist

140
00:05:09,280 --> 00:05:13,840
today that can help mitigate this

141
00:05:11,120 --> 00:05:16,320
problem of inconsistency of information

142
00:05:13,840 --> 00:05:17,919
well turns out there are a few

143
00:05:16,320 --> 00:05:20,320
and we've highlighted about

144
00:05:17,919 --> 00:05:22,880
three of them so one of the most

145
00:05:20,320 --> 00:05:25,680
obvious solutions out there is hey

146
00:05:22,880 --> 00:05:27,440
just look at cgroupfs so if you need

147
00:05:25,680 --> 00:05:29,280
information about the restrictions that

148
00:05:27,440 --> 00:05:30,639
are imposed on you just look at the

149
00:05:29,280 --> 00:05:32,320
interface that imposes those

150
00:05:30,639 --> 00:05:34,800
restrictions in the first place and

151
00:05:32,320 --> 00:05:36,639
that's a very strong argument to make

152
00:05:34,800 --> 00:05:39,280
however a lot of these applications

153
00:05:36,639 --> 00:05:41,680
legacy or otherwise rely on traditional

154
00:05:39,280 --> 00:05:43,919
interfaces like sys and proc and

155
00:05:41,680 --> 00:05:46,479
asking all these players to

156
00:05:43,919 --> 00:05:48,720
really change the way they

157
00:05:46,479 --> 00:05:50,479
interpret information may be a

158
00:05:48,720 --> 00:05:53,360
difficult task

159
00:05:50,479 --> 00:05:55,440
another problem is that

160
00:05:53,360 --> 00:05:57,919
now these applications also need to

161
00:05:55,440 --> 00:06:00,240
interpret newer concepts like period

162
00:05:57,919 --> 00:06:02,639
and quota which are cpu restrictions in

163
00:06:00,240 --> 00:06:04,880
time but they're used to

164
00:06:02,639 --> 00:06:07,280
interpreting the information in terms of

165
00:06:04,880 --> 00:06:09,120
space and in terms of cpus and threads

166
00:06:07,280 --> 00:06:11,280
and how this information

167
00:06:09,120 --> 00:06:14,240
really needs to be interpreted is a

168
00:06:11,280 --> 00:06:15,120
difficult thing to

169
00:06:14,240 --> 00:06:17,600
say
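As a small illustration of this time versus space mismatch: cgroup v2 exposes the bandwidth limit in cpu.max as a quota and a period in microseconds, and an application has to divide one by the other to get something resembling a cpu count. The sketch below assumes that file format; effective_cpus is a hypothetical helper, and rounding up is only one possible policy.
```python
import math
def effective_cpus(cpu_max, online_cpus):
    """Turn a cgroup v2 cpu.max string ('<quota> <period>' or 'max <period>') into a cpu count."""
    quota, period = cpu_max.split()
    if quota == "max":  # no bandwidth limit is set
        return online_cpus
    # quota and period are in microseconds; round up so 1.5 cpus of time means 2 threads
    return min(online_cpus, max(1, math.ceil(int(quota) / int(period))))
print(effective_cpus("200000 100000", 128))
print(effective_cpus("max 100000", 128))
```
Whether to round up, round down, or keep the fraction is exactly the kind of interpretation each application would have to settle for itself.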

170
00:06:15,120 --> 00:06:19,360
and lastly while cgroups can be used

171
00:06:17,600 --> 00:06:21,840
to extract information

172
00:06:19,360 --> 00:06:23,840
in principle they are a control

173
00:06:21,840 --> 00:06:25,919
mechanism for the host rather than a

174
00:06:23,840 --> 00:06:28,639
display interface inside the container

175
00:06:25,919 --> 00:06:31,680
and

176
00:06:28,639 --> 00:06:33,199
there really isn't anything that stops

177
00:06:31,680 --> 00:06:35,600
them from changing this interface in the

178
00:06:33,199 --> 00:06:37,440
future and maybe a cgroup

179
00:06:35,600 --> 00:06:39,680
v3 comes out and then the applications

180
00:06:37,440 --> 00:06:42,160
have to go change the way they look

181
00:06:39,680 --> 00:06:45,120
at information all over again

182
00:06:42,160 --> 00:06:46,720
so there are some user space

183
00:06:45,120 --> 00:06:48,319
innovations in this area as well

184
00:06:46,720 --> 00:06:51,440
there's a user space solution called lxcfs

185
00:06:48,319 --> 00:06:54,080
by the lxc linux

186
00:06:51,440 --> 00:06:55,039
containers and basically what they do is

187
00:06:54,080 --> 00:06:56,800
uh

188
00:06:55,039 --> 00:06:59,199
it's a user space file system that bind

189
00:06:56,800 --> 00:07:01,199
mounts over the existing sys and proc fs

190
00:06:59,199 --> 00:07:02,639
and they basically provide consistent

191
00:07:01,199 --> 00:07:03,840
information in accordance to whatever

192
00:07:02,639 --> 00:07:05,520
restrictions that were set on these

193
00:07:03,840 --> 00:07:07,360
applications they're essentially trying

194
00:07:05,520 --> 00:07:08,479
to fake this information in in a way

195
00:07:07,360 --> 00:07:10,240
right

196
00:07:08,479 --> 00:07:12,080
the advantage of a user space

197
00:07:10,240 --> 00:07:15,199
innovation like this is that it's a very

198
00:07:12,080 --> 00:07:16,880
light easy to use user space tool and we

199
00:07:15,199 --> 00:07:19,039
have seen a few articles

200
00:07:16,880 --> 00:07:20,960
where it's currently being used or

201
00:07:19,039 --> 00:07:23,520
currently being described by

202
00:07:20,960 --> 00:07:26,720
google and alibaba as well

203
00:07:23,520 --> 00:07:28,720
and if a user space innovation

204
00:07:26,720 --> 00:07:30,400
exists out there to solve this

205
00:07:28,720 --> 00:07:32,560
problem this kind of bolsters our

206
00:07:30,400 --> 00:07:34,639
confidence that this problem

207
00:07:32,560 --> 00:07:36,639
exists in the first place

208
00:07:34,639 --> 00:07:38,639
but a problem with user space

209
00:07:36,639 --> 00:07:40,639
innovations is that they need explicit

210
00:07:38,639 --> 00:07:42,160
setup for applications and they need

211
00:07:40,639 --> 00:07:43,919
explicit setup for applications that

212
00:07:42,160 --> 00:07:47,120
experience this effect of incorrect

213
00:07:43,919 --> 00:07:48,879
information in the first place so

214
00:07:47,120 --> 00:07:50,000
a lot of times inconsistent information

215
00:07:48,879 --> 00:07:51,680
is not going to really crash your

216
00:07:50,000 --> 00:07:54,319
application it's rather going to give

217
00:07:51,680 --> 00:07:55,599
you a performance hit or it's going to

218
00:07:54,319 --> 00:07:57,680
give you

219
00:07:55,599 --> 00:07:59,520
a problem that is somewhat

220
00:07:57,680 --> 00:08:01,039
of a silent failure and first

221
00:07:59,520 --> 00:08:02,319
identifying that you are facing this

222
00:08:01,039 --> 00:08:04,639
problem in the first place and then

223
00:08:02,319 --> 00:08:06,879
identifying that lxcfs is the right

224
00:08:04,639 --> 00:08:08,479
solution for you can be a bit of a

225
00:08:06,879 --> 00:08:11,440
hassle

226
00:08:08,479 --> 00:08:13,680
lastly there is

227
00:08:11,440 --> 00:08:15,680
an effort for an rfc patch set which

228
00:08:13,680 --> 00:08:18,400
was posted a few months ago which added

229
00:08:15,680 --> 00:08:20,720
/proc/self/meminfo as a

230
00:08:18,400 --> 00:08:22,639
new interface which respects the cgroup

231
00:08:20,720 --> 00:08:25,280
restrictions and provides

232
00:08:22,639 --> 00:08:27,199
this consistent information

233
00:08:25,280 --> 00:08:30,080
for applications to see

234
00:08:27,199 --> 00:08:32,240
and this is a very good solution as it

235
00:08:30,080 --> 00:08:34,320
introduces standards for exposing and

236
00:08:32,240 --> 00:08:36,000
interpreting this information it is also

237
00:08:34,320 --> 00:08:37,919
a very clean interface as it does not

238
00:08:36,000 --> 00:08:40,800
meddle with the current established

239
00:08:37,919 --> 00:08:42,800
sys and proc interfaces

240
00:08:40,800 --> 00:08:45,120
and it kind of

241
00:08:42,800 --> 00:08:46,000
keeps the sanity of those interfaces

242
00:08:45,120 --> 00:08:48,560
intact

243
00:08:46,000 --> 00:08:51,040
however just like cgroupfs it also

244
00:08:48,560 --> 00:08:52,640
faces the problem that a

245
00:08:51,040 --> 00:08:53,600
lot of applications still look at sys

246
00:08:52,640 --> 00:08:55,839
and proc

247
00:08:53,600 --> 00:08:57,680
instead of cgroup and the motivation to

248
00:08:55,839 --> 00:08:59,920
use yet another interface may be a

249
00:08:57,680 --> 00:09:01,760
little bit low there was a comment

250
00:08:59,920 --> 00:09:03,680
which kind of highlighted the same in

251
00:09:01,760 --> 00:09:05,839
the same patch set as well which

252
00:09:03,680 --> 00:09:08,000
i have linked down

253
00:09:05,839 --> 00:09:09,120
on these things

254
00:09:08,000 --> 00:09:11,519
so

255
00:09:09,120 --> 00:09:13,760
what if we could have a solution that

256
00:09:11,519 --> 00:09:15,680
kind of took some good points from all

257
00:09:13,760 --> 00:09:18,399
the three solutions and built

258
00:09:15,680 --> 00:09:20,000
something around it so what if we

259
00:09:18,399 --> 00:09:22,320
could present information about our

260
00:09:20,000 --> 00:09:24,000
restrictions we could present them

261
00:09:22,320 --> 00:09:25,760
consistently with all these existing

262
00:09:24,000 --> 00:09:28,080
interfaces of sys and proc

263
00:09:25,760 --> 00:09:29,760
and we could introduce standardization

264
00:09:28,080 --> 00:09:32,880
of how to expose and interpret the

265
00:09:29,760 --> 00:09:35,600
solution by an in-kernel solution

266
00:09:32,880 --> 00:09:39,279
introducing cpu namespace

267
00:09:35,600 --> 00:09:41,279
so we basically try to isolate cpu

268
00:09:39,279 --> 00:09:42,800
information for each task based on

269
00:09:41,279 --> 00:09:44,880
whatever restrictions that have been

270
00:09:42,800 --> 00:09:46,880
applied to it via

271
00:09:44,880 --> 00:09:49,360
the control and

272
00:09:46,880 --> 00:09:51,680
display interfaces

273
00:09:49,360 --> 00:09:53,600
and we make that consistent with

274
00:09:51,680 --> 00:09:55,600
the rest of the interfaces as well so

275
00:09:53,600 --> 00:09:56,959
basically we isolate the cpu information

276
00:09:55,600 --> 00:09:58,640
by maintaining a

277
00:09:56,959 --> 00:10:01,040
translation of these cpus within the

278
00:09:58,640 --> 00:10:03,200
namespace and from the namespace cpu

279
00:10:01,040 --> 00:10:04,959
to a logical cpu we have scrambled the

280
00:10:03,200 --> 00:10:06,399
cpus to help mitigate the problems of

281
00:10:04,959 --> 00:10:07,760
the knowledge of topology that we have

282
00:10:06,399 --> 00:10:10,079
highlighted in one of our previous

283
00:10:07,760 --> 00:10:12,720
slides i'm not an expert here but if it

284
00:10:10,079 --> 00:10:14,000
helps that then that's great

285
00:10:12,720 --> 00:10:15,920
in our proof of concept we have

286
00:10:14,000 --> 00:10:18,399
scrambled this map just to show that a

287
00:10:15,920 --> 00:10:20,399
discontiguous cpu numbering works just

288
00:10:18,399 --> 00:10:22,000
right out of the box and if the map if

289
00:10:20,399 --> 00:10:23,360
the mapping does matter we could

290
00:10:22,000 --> 00:10:25,040
change it to something like a zero to

291
00:10:23,360 --> 00:10:26,480
n map or even make it a one to one

292
00:10:25,040 --> 00:10:28,399
mapping right

293
00:10:26,480 --> 00:10:31,040
but in summary just like a pid name

294
00:10:28,399 --> 00:10:33,360
space when you view a task's cpu resources

295
00:10:31,040 --> 00:10:35,279
within a cpu namespace we can get an

296
00:10:33,360 --> 00:10:36,640
isolated view of the restrictions that

297
00:10:35,279 --> 00:10:38,800
it is bound by

298
00:10:36,640 --> 00:10:40,079
and viewing the task's resources outside

299
00:10:38,800 --> 00:10:42,000
this namespace will yield the

300
00:10:40,079 --> 00:10:44,800
translations of these

301
00:10:42,000 --> 00:10:46,399
so for example

302
00:10:44,800 --> 00:10:48,160
if you look at the diagram

303
00:10:46,399 --> 00:10:50,240
which is without the cpu namespace which is

304
00:10:48,160 --> 00:10:52,160
basically the diagram

305
00:10:50,240 --> 00:10:54,560
that we saw in one of our earlier

306
00:10:52,160 --> 00:10:57,040
slides you can clearly see that

307
00:10:54,560 --> 00:10:59,040
proc and sys were

308
00:10:57,040 --> 00:11:01,519
inconsistent in information when there was

309
00:10:59,040 --> 00:11:03,600
a cpuset restriction applied to it

310
00:11:01,519 --> 00:11:05,519
when we look at it from a cpu namespace

311
00:11:03,600 --> 00:11:07,519
point of view when we add the cpu

312
00:11:05,519 --> 00:11:10,000
namespace layer to it we can now see

313
00:11:07,519 --> 00:11:13,120
that a scrambled map is first generated

314
00:11:10,000 --> 00:11:15,519
and viewing cgroupfs within

315
00:11:13,120 --> 00:11:18,720
this container now yields a translation

316
00:11:15,519 --> 00:11:21,440
of 5 12 21 23 and similarly all the

317
00:11:18,720 --> 00:11:24,240
system calls procfs as well as sysfs

318
00:11:21,440 --> 00:11:25,600
give this exact view to the system and

319
00:11:24,240 --> 00:11:27,440
which is in accordance to whatever

320
00:11:25,600 --> 00:11:29,839
restrictions that were applied to it of

321
00:11:27,440 --> 00:11:32,399
course these namespace cpus are

322
00:11:29,839 --> 00:11:35,839
going to translate to these real cpus so

323
00:11:32,399 --> 00:11:38,399
where 5 12 21 23 is going to translate to

324
00:11:35,839 --> 00:11:40,560
32 33 34 and 35.
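The translation described here can be pictured as a per-namespace lookup table. This is only a user space sketch in Python, not the kernel implementation from the patches; the class name CpuNamespaceMap and its methods are invented for illustration, and the exact pairing of each namespace cpu with a host cpu is an assumption, since only the two sets of numbers are given.
```python
class CpuNamespaceMap:
    """Toy model of a namespace cpu to host logical cpu translation (hypothetical)."""
    def __init__(self, ns_to_host):
        self.ns_to_host = dict(ns_to_host)
        self.host_to_ns = {host: ns for ns, host in self.ns_to_host.items()}
    def to_host(self, ns_cpus):
        """What the host sees for cpus named inside the namespace."""
        return sorted(self.ns_to_host[c] for c in ns_cpus)
    def to_namespace(self, host_cpus):
        """What a task inside the namespace sees for the given host cpus."""
        return sorted(self.host_to_ns[c] for c in host_cpus)
# assumed pairing: the scrambled namespace numbers mapped onto host cpus 32 to 35
mapping = CpuNamespaceMap({5: 32, 12: 33, 21: 34, 23: 35})
print(mapping.to_namespace([32, 33, 34, 35]))
print(mapping.to_host([5, 12, 21, 23]))
```
Every interface inside the container would answer with the namespace numbers, while the host keeps working with 32 to 35.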

325
00:11:38,399 --> 00:11:43,440
so basically this is the design of what

326
00:11:40,560 --> 00:11:45,360
a cpu namespace is the reference link

327
00:11:43,440 --> 00:11:47,040
is also on the top of this

328
00:11:45,360 --> 00:11:49,040
slide this is where we have

329
00:11:47,040 --> 00:11:50,959
posted the patches and we have had quite

330
00:11:49,040 --> 00:11:53,040
a few interesting discussions on

331
00:11:50,959 --> 00:11:56,560
this as well which we will discuss

332
00:11:53,040 --> 00:11:58,560
further in the coming slides

333
00:11:56,560 --> 00:12:00,720
so um

334
00:11:58,560 --> 00:12:02,639
well we showed that there is a

335
00:12:00,720 --> 00:12:04,240
problem of inconsistency of information

336
00:12:02,639 --> 00:12:05,680
and we kind of showed with our solution

337
00:12:04,240 --> 00:12:08,639
that we kind of

338
00:12:05,680 --> 00:12:10,480
have a solution that elegantly

339
00:12:08,639 --> 00:12:12,160
brings together all these interfaces to

340
00:12:10,480 --> 00:12:14,320
solve this problem

341
00:12:12,160 --> 00:12:17,680
but is there any performance benefit to

342
00:12:14,320 --> 00:12:19,360
doing this spoiler alert yes

343
00:12:17,680 --> 00:12:22,240
in our experiment we have done this

344
00:12:19,360 --> 00:12:25,680
on an ibm power9 machine with 44 smt4

345
00:12:22,240 --> 00:12:28,000
cores and 176 cpus but these

346
00:12:25,680 --> 00:12:28,720
experiments are architecture agnostic

347
00:12:28,000 --> 00:12:31,440
uh

348
00:12:28,720 --> 00:12:34,240
as well so our experiment is as follows

349
00:12:31,440 --> 00:12:36,639
we benchmark nginx which is an http

350
00:12:34,240 --> 00:12:38,959
server and which is a fairly modern http

351
00:12:36,639 --> 00:12:40,560
server and we benchmark it

352
00:12:38,959 --> 00:12:43,120
with a multi-threaded workload called

353
00:12:40,560 --> 00:12:44,160
wrk which is a very simple http load

354
00:12:43,120 --> 00:12:46,320
generator

355
00:12:44,160 --> 00:12:48,320
now nginx is configured with

356
00:12:46,320 --> 00:12:50,079
worker_processes auto and this auto really

357
00:12:48,320 --> 00:12:51,600
helps us to enable this application to

358
00:12:50,079 --> 00:12:54,480
manage resources based on the system

359
00:12:51,600 --> 00:12:56,560
configuration that it sees the nginx

360
00:12:54,480 --> 00:12:59,279
container is configured with a cpuset of

361
00:12:56,560 --> 00:13:01,440
four cpus and the wrk benchmark spawns

362
00:12:59,279 --> 00:13:03,760
about 500 requests in 30 seconds over four

363
00:13:01,440 --> 00:13:04,480
threads which is enough to saturate

364
00:13:03,760 --> 00:13:06,800
the

365
00:13:04,480 --> 00:13:09,120
resources of the four cpus that we

366
00:13:06,800 --> 00:13:10,720
have bound our nginx container by

367
00:13:09,120 --> 00:13:12,800
on the right hand side you can see

368
00:13:10,720 --> 00:13:15,200
that there is a small summary graph of

369
00:13:12,800 --> 00:13:17,680
the percentage of improvement that

370
00:13:15,200 --> 00:13:19,519
you really see with this experiment

371
00:13:17,680 --> 00:13:22,079
so we have a few metrics of

372
00:13:19,519 --> 00:13:25,200
measurement first is the memory

373
00:13:22,079 --> 00:13:27,440
usage at the initialization and that

374
00:13:25,200 --> 00:13:29,360
when you compare it with a vanilla 5.14

375
00:13:27,440 --> 00:13:30,880
kernel you can see that it has dropped by

376
00:13:29,360 --> 00:13:33,360
about 91 percent

377
00:13:30,880 --> 00:13:35,360
similarly memory at peak drops by

378
00:13:33,360 --> 00:13:37,200
about eighty nine percent throttle

379
00:13:35,360 --> 00:13:39,360
drops by about seventy four percent

380
00:13:37,200 --> 00:13:42,079
latency drops by about thirteen percent

381
00:13:39,360 --> 00:13:43,839
and requests per second improves

382
00:13:42,079 --> 00:13:46,560
or increases by about twenty point seven

383
00:13:43,839 --> 00:13:49,279
three percent we can clearly see that

384
00:13:46,560 --> 00:13:52,560
there is a net improvement in

385
00:13:49,279 --> 00:13:54,320
providing consistent information

386
00:13:52,560 --> 00:13:56,800
to applications

387
00:13:54,320 --> 00:14:01,120
but why is that in the next slide

388
00:13:56,800 --> 00:14:04,480
i aim to answer just that

389
00:14:01,120 --> 00:14:07,040
and basically if you look at the top

390
00:14:04,480 --> 00:14:09,600
most left side of it which is the pids

391
00:14:07,040 --> 00:14:12,800
you can see that

392
00:14:09,600 --> 00:14:15,600
on a vanilla kernel it spawns about 177

393
00:14:12,800 --> 00:14:18,560
processes whereas on a cpu namespace

394
00:14:15,600 --> 00:14:20,160
kernel it just spawns about five 177 is not

395
00:14:18,560 --> 00:14:22,000
a random number as we have seen

396
00:14:20,160 --> 00:14:24,800
before the system is configured to

397
00:14:22,000 --> 00:14:26,880
have about 176 cpus so basically it

398
00:14:24,800 --> 00:14:28,959
spawns 176 worker threads plus one

399
00:14:26,880 --> 00:14:30,639
master thread and in the case of

400
00:14:28,959 --> 00:14:33,839
cpu namespace it spawns four worker

401
00:14:30,639 --> 00:14:35,920
threads plus one master thread and

402
00:14:33,839 --> 00:14:38,639
as a result of that or as a

403
00:14:35,920 --> 00:14:40,480
consequence of that the memory usage

404
00:14:38,639 --> 00:14:42,480
in the vanilla kernel is pretty

405
00:14:40,480 --> 00:14:44,959
high because now it needs to allocate

406
00:14:42,480 --> 00:14:46,639
memory to keep track of all these

407
00:14:44,959 --> 00:14:49,440
extra pids
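The two process counts follow directly from the cpu count the server believes it has: one worker per visible cpu plus one master. A back-of-the-envelope sketch in Python, with expected_processes as a made-up helper name:
```python
def expected_processes(visible_cpus, masters=1):
    """One worker per cpu the application can see, plus the master process."""
    return visible_cpus + masters
print(expected_processes(176))  # vanilla kernel: all host cpus are visible
print(expected_processes(4))    # cpu namespace: only the cpuset's cpus are visible
```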

408
00:14:46,639 --> 00:14:51,120
because of that also the throttle is

409
00:14:49,440 --> 00:14:54,399
pretty high and it's as high as

410
00:14:51,120 --> 00:14:56,959
about 97 percent throttling and this is also

411
00:14:54,399 --> 00:14:59,199
because now we are trying to run a

412
00:14:56,959 --> 00:15:01,920
lot of these pids

413
00:14:59,199 --> 00:15:03,279
and there are a

414
00:15:01,920 --> 00:15:05,120
lot of these pids that are trying to

415
00:15:03,279 --> 00:15:06,480
contend for the same exact resource and

416
00:15:05,120 --> 00:15:08,560
this resource is quite

417
00:15:06,480 --> 00:15:10,880
constrained and therefore there

418
00:15:08,560 --> 00:15:12,720
is going to be quite a bit of throttling

419
00:15:10,880 --> 00:15:14,720
but even though there is throttling they

420
00:15:12,720 --> 00:15:17,279
are essentially trying to do the same

421
00:15:14,720 --> 00:15:19,279
thing the same task

422
00:15:17,279 --> 00:15:21,040
and there shouldn't be a lot of

423
00:15:19,279 --> 00:15:24,399
hit on the performance in that case

424
00:15:21,040 --> 00:15:26,959
right but that is not true

425
00:15:24,399 --> 00:15:28,480
in that case you're going to be hit by a

426
00:15:26,959 --> 00:15:31,199
scheduler overhead such as context

427
00:15:28,480 --> 00:15:34,160
switches and that basically kind of

428
00:15:31,199 --> 00:15:35,680
shows us that the requests per second or

429
00:15:34,160 --> 00:15:37,680
the throughput

430
00:15:35,680 --> 00:15:40,800
is quite a bit higher with our cpu name

431
00:15:37,680 --> 00:15:42,800
space and our latency also

432
00:15:40,800 --> 00:15:44,800
goes lower where

433
00:15:42,800 --> 00:15:46,639
lower is better

434
00:15:44,800 --> 00:15:48,160
in our cpu namespace as well and this is

435
00:15:46,639 --> 00:15:50,160
all because of these extra

436
00:15:48,160 --> 00:15:52,160
overheads that the application

437
00:15:50,160 --> 00:15:54,000
needs to really

438
00:15:52,160 --> 00:15:56,160
or the kernel really needs to

439
00:15:54,000 --> 00:15:56,959
handle

440
00:15:56,160 --> 00:15:59,360
so

441
00:15:56,959 --> 00:16:02,079
we kind of showed that there is

442
00:15:59,360 --> 00:16:03,920
some benefit of

443
00:16:02,079 --> 00:16:05,680
giving consistent

444
00:16:03,920 --> 00:16:07,920
information now the proof of the

445
00:16:05,680 --> 00:16:09,600
pudding is really in eating it so

446
00:16:07,920 --> 00:16:12,320
let's show you our implementation

447
00:16:09,600 --> 00:16:14,320
of how we really do cpu namespace

448
00:16:12,320 --> 00:16:16,639
and just so that you have an idea

449
00:16:14,320 --> 00:16:19,279
of how it really

450
00:16:16,639 --> 00:16:22,560
works in the linux kernel today so as

451
00:16:19,279 --> 00:16:24,240
you can see there are two tabs in

452
00:16:22,560 --> 00:16:26,079
the terminal the right hand side is

453
00:16:24,240 --> 00:16:28,560
basically the initial cpu namespace

454
00:16:26,079 --> 00:16:30,079
which is the host outside a container

455
00:16:28,560 --> 00:16:32,560
and the left hand side is the cpu

456
00:16:30,079 --> 00:16:33,600
namespace a where a is just an

457
00:16:32,560 --> 00:16:36,480
acronym

458
00:16:33,600 --> 00:16:38,079
within a container like docker right

459
00:16:36,480 --> 00:16:39,920
on the left hand side we basically start

460
00:16:38,079 --> 00:16:42,800
a very simple ubuntu container with

461
00:16:39,920 --> 00:16:44,000
a bash prompt and we name it say p

462
00:16:42,800 --> 00:16:47,279
example

463
00:16:44,000 --> 00:16:49,360
and uh and we we run it unconstrained so

464
00:16:47,279 --> 00:16:51,360
when we do an lscpu we should see you

465
00:16:49,360 --> 00:16:52,880
should see all these cpus that that

466
00:16:51,360 --> 00:16:55,839
exist on this system

467
00:16:52,880 --> 00:16:58,560
um similarly if you should do a cat of

468
00:16:55,839 --> 00:17:00,720
you know cpuset.cpus in the cgroupfs

469
00:16:58,560 --> 00:17:03,519
directory you will also see the entire

470
00:17:00,720 --> 00:17:05,760
list of cpus now when we try to restrict

471
00:17:03,519 --> 00:17:07,120
this container's cpuset and we will try

472
00:17:05,760 --> 00:17:09,360
to restrict it with the docker update

473
00:17:07,120 --> 00:17:11,679
command and we'll restrict it to cpuset

474
00:17:09,360 --> 00:17:13,760
zero to three uh in this case now when

475
00:17:11,679 --> 00:17:15,600
we do an lscpu now you can see that

476
00:17:13,760 --> 00:17:17,839
there are only four cpus that uh that

477
00:17:15,600 --> 00:17:21,120
exist on the system and uh and the

478
00:17:17,839 --> 00:17:24,559
online cpu is a scrambled map of uh of

479
00:17:21,120 --> 00:17:26,079
four uh you know randomized uh cpus of course

480
00:17:24,559 --> 00:17:27,679
uh there are there are a few things like

481
00:17:26,079 --> 00:17:29,280
the numa node zero and numa

482
00:17:27,679 --> 00:17:32,000
node eight which is not uh really

483
00:17:29,280 --> 00:17:34,240
virtualized uh in information or or or

484
00:17:32,000 --> 00:17:35,840
it's not really uh

485
00:17:34,240 --> 00:17:37,840
shown in in its

486
00:17:35,840 --> 00:17:40,160
true scrambled map but uh this is a

487
00:17:37,840 --> 00:17:43,840
proof of concept and you know we aim to

488
00:17:40,160 --> 00:17:46,799
have a a more fully fledged uh uh

489
00:17:43,840 --> 00:17:49,360
we aim to fully flesh this out

490
00:17:46,799 --> 00:17:50,960
um so next you know same thing if you

491
00:17:49,360 --> 00:17:52,640
try to look at the cgroup cpuset

492
00:17:50,960 --> 00:17:54,240
interface basically it shows us that

493
00:17:52,640 --> 00:17:56,880
this information is also consistent with

494
00:17:54,240 --> 00:17:58,799
whatever lscpu saw and it is uh it is

495
00:17:56,880 --> 00:18:00,000
basically just these four cpus in a

496
00:17:58,799 --> 00:18:00,880
scrambled

497
00:18:00,000 --> 00:18:02,720
way

498
00:18:00,880 --> 00:18:04,720
uh next we try to spawn

499
00:18:02,720 --> 00:18:06,960
a stress on one of these available cpus

500
00:18:04,720 --> 00:18:09,360
and we'll taskset it to uh one of our

501
00:18:06,960 --> 00:18:11,200
available cpus that is cpu 17 and we

502
00:18:09,360 --> 00:18:14,480
stress it with -c 1 which is just to

503
00:18:11,200 --> 00:18:16,480
you know stress one cpu and we can do a

504
00:18:14,480 --> 00:18:18,080
taskset -cp on that task to see

505
00:18:16,480 --> 00:18:20,000
whatever the current affinity of that

506
00:18:18,080 --> 00:18:22,160
task is and we can clearly see that it

507
00:18:20,000 --> 00:18:24,480
is uh now 17 which means that the system

508
00:18:22,160 --> 00:18:27,039
call is also coherent

509
00:18:24,480 --> 00:18:30,640
coherently showing this information

510
00:18:27,039 --> 00:18:33,120
if we do a top on top of this uh

511
00:18:30,640 --> 00:18:36,240
you can see that now it according to

512
00:18:33,120 --> 00:18:39,360
this container it has only four cpus and

513
00:18:36,240 --> 00:18:43,120
cpu 17 is what is uh consuming a 100

514
00:18:39,360 --> 00:18:45,840
percent of you know utilization

515
00:18:43,120 --> 00:18:47,600
uh on on the similar side if you if you

516
00:18:45,840 --> 00:18:49,440
try to look at the same information from

517
00:18:47,600 --> 00:18:50,960
outside this container if you try to

518
00:18:49,440 --> 00:18:53,039
view this task's affinity from outside

519
00:18:50,960 --> 00:18:54,720
this container we can do a top and you

520
00:18:53,039 --> 00:18:56,720
know we can see what what consumes

521
00:18:54,720 --> 00:18:59,679
hundred percent of cpu time that is cpu

522
00:18:56,720 --> 00:19:01,840
zero uh and if you try to do this with

523
00:18:59,679 --> 00:19:04,960
this with the taskset -cp by

524
00:19:01,840 --> 00:19:06,720
getting the uh you know ps -ef grep

525
00:19:04,960 --> 00:19:08,240
stress uh where of course you're now

526
00:19:06,720 --> 00:19:10,240
gonna get like two tasks one is the

527
00:19:08,240 --> 00:19:12,000
parent and the child so parent really

528
00:19:10,240 --> 00:19:13,760
spawns the stressor and we can really

529
00:19:12,000 --> 00:19:15,919
look at either the parent or the child

530
00:19:13,760 --> 00:19:18,000
to see where it is really uh

531
00:19:15,919 --> 00:19:20,559
bound to uh you can clearly see that

532
00:19:18,000 --> 00:19:22,400
the taskset -cp is really

533
00:19:20,559 --> 00:19:25,679
showing us that it is bound to

534
00:19:22,400 --> 00:19:27,679
cpu zero next

535
00:19:25,679 --> 00:19:29,440
if we try to change this affinity to say

536
00:19:27,679 --> 00:19:31,360
cpu 2 which is which is again in the

537
00:19:29,440 --> 00:19:33,919
permissible limit of whatever cpuset

538
00:19:31,360 --> 00:19:36,960
restrictions we have applied to it uh so

539
00:19:33,919 --> 00:19:39,600
if we taskset it to you know cpu 2

540
00:19:36,960 --> 00:19:42,080
we can we can try to see how this uh

541
00:19:39,600 --> 00:19:44,640
you know varies this information uh in

542
00:19:42,080 --> 00:19:46,240
in the top command uh on both uh within

543
00:19:44,640 --> 00:19:47,919
the container and outside the container

544
00:19:46,240 --> 00:19:50,640
so with this outside the container you

545
00:19:47,919 --> 00:19:53,039
can now see that cpu 2 shows 100 percent

546
00:19:50,640 --> 00:19:55,440
utilization and within the container it

547
00:19:53,039 --> 00:19:58,720
shows it has now migrated from cpu

548
00:19:55,440 --> 00:20:01,919
17 uh to cpu 83

549
00:19:58,720 --> 00:20:04,320
so so that is pretty much uh you know

550
00:20:01,919 --> 00:20:05,200
what it is and uh

551
00:20:04,320 --> 00:20:07,520
in

552
00:20:05,200 --> 00:20:09,760
in the next slide we will talk about a

553
00:20:07,520 --> 00:20:13,360
few challenges and and you know what is the

554
00:20:09,760 --> 00:20:14,960
future of of isolation of of information

555
00:20:13,360 --> 00:20:18,880
so while

556
00:20:14,960 --> 00:20:20,480
the solution uh works in a way uh but it

557
00:20:18,880 --> 00:20:22,799
is not perfect and and there are a few

558
00:20:20,480 --> 00:20:24,960
challenges associated with it one of the

559
00:20:22,799 --> 00:20:27,039
most uh foremost challenges that that

560
00:20:24,960 --> 00:20:28,880
exists with this is that until now

561
00:20:27,039 --> 00:20:31,520
namespaces and cgroups have been fairly

562
00:20:28,880 --> 00:20:34,400
disjoint from one another uh cpu

563
00:20:31,520 --> 00:20:34,400
namespace kind of breaks that and without

564
00:20:34,400 --> 00:20:39,600
cpu or cpuset cgroups the cpu

565
00:20:37,039 --> 00:20:41,760
namespace itself loses its meaning and

566
00:20:39,600 --> 00:20:42,640
that brings up the question really that

567
00:20:41,760 --> 00:20:45,120
if

568
00:20:42,640 --> 00:20:47,840
it is time to now

569
00:20:45,120 --> 00:20:50,960
define interactions between namespaces and

570
00:20:47,840 --> 00:20:53,440
cgroups uh in in a in a you know

571
00:20:50,960 --> 00:20:55,039
reasonable way and and what

572
00:20:53,440 --> 00:20:57,039
does you know what do containers really

573
00:20:55,039 --> 00:20:59,679
mean from that point onwards

574
00:20:57,039 --> 00:21:01,120
um another uh challenge that that exists

575
00:20:59,679 --> 00:21:02,400
with our current design is that the

576
00:21:01,120 --> 00:21:04,400
current design only addresses

577
00:21:02,400 --> 00:21:07,039
restrictions in space which is you know

578
00:21:04,400 --> 00:21:10,240
cpus and threads and pids and so on but

579
00:21:07,039 --> 00:21:12,960
not uh time and not pids by the way uh

580
00:21:10,240 --> 00:21:15,120
the containers also frequently use cfs

581
00:21:12,960 --> 00:21:16,559
periods and quotas and then it's you

582
00:21:15,120 --> 00:21:18,720
know fondly called millicores in the

583
00:21:16,559 --> 00:21:19,520
kubernetes world and in the cloud world

584
00:21:18,720 --> 00:21:21,440
uh

585
00:21:19,520 --> 00:21:24,240
so how does this information now need to

586
00:21:21,440 --> 00:21:26,159
be exposed for this these restrictions

587
00:21:24,240 --> 00:21:28,320
it can be as simple as some defining

588
00:21:26,159 --> 00:21:30,559
some standards that say that okay if

589
00:21:28,320 --> 00:21:32,320
this is the ratio of period and quota

590
00:21:30,559 --> 00:21:36,000
this is the this is the cpu's worth of

591
00:21:32,320 --> 00:21:37,039
runtime but then uh is ratios the only

592
00:21:36,000 --> 00:21:38,559
factor

593
00:21:37,039 --> 00:21:39,520
for that or not

594
00:21:38,559 --> 00:21:42,480
um

595
00:21:39,520 --> 00:21:44,080
lastly while cpu namespace mitigates

596
00:21:42,480 --> 00:21:45,520
potential misuse stemming from knowledge

597
00:21:44,080 --> 00:21:47,679
of topology by obfuscation of

598
00:21:45,520 --> 00:21:49,840
information the topology can still be

599
00:21:47,679 --> 00:21:51,840
roughly figured out if you know with ipi

600
00:21:49,840 --> 00:21:53,919
latencies to determine who's your

601
00:21:51,840 --> 00:21:55,600
sibling or who's uh or which core is

602
00:21:53,919 --> 00:21:56,840
really far away uh

603
00:21:55,600 --> 00:21:59,039
from

604
00:21:56,840 --> 00:22:01,039
you so

605
00:21:59,039 --> 00:22:03,679
that's that brings us uh to our last

606
00:22:01,039 --> 00:22:06,559
slide of future uh where uh you know the

607
00:22:03,679 --> 00:22:09,120
intention of of these uh of of this

608
00:22:06,559 --> 00:22:11,280
presentation is to spark a discussion on

609
00:22:09,120 --> 00:22:13,840
the problem rather than be the be-all and

610
00:22:11,280 --> 00:22:16,559
end all of all solutions

611
00:22:13,840 --> 00:22:18,320
if the solution is for applications to

612
00:22:16,559 --> 00:22:20,559
change and look at cgroupfs or any

613
00:22:18,320 --> 00:22:22,159
other interface there are a few exciting

614
00:22:20,559 --> 00:22:24,640
discussions that are happening around

615
00:22:22,159 --> 00:22:26,080
exporting more useful metrics to entice

616
00:22:24,640 --> 00:22:29,120
applications to change and these were

617
00:22:26,080 --> 00:22:32,159
discussions were happening on the uh uh

618
00:22:29,120 --> 00:22:34,080
uh on the patch set that i had posted um

619
00:22:32,159 --> 00:22:35,840
if the solution is an external user

620
00:22:34,080 --> 00:22:37,919
space program bind mounting like custom

621
00:22:35,840 --> 00:22:40,159
sysfs and procfs then should that be the

622
00:22:37,919 --> 00:22:42,640
norm for the future as well now should

623
00:22:40,159 --> 00:22:45,440
uh should user space innovations uh

624
00:22:42,640 --> 00:22:47,840
be encouraged further or should we start

625
00:22:45,440 --> 00:22:50,480
looking at uh you know defining and

626
00:22:47,840 --> 00:22:52,880
standardizing uh a lot of these things

627
00:22:50,480 --> 00:22:56,000
uh for you know within the linux kernel

628
00:22:52,880 --> 00:22:56,720
itself and finally is it a time to you

629
00:22:56,000 --> 00:22:58,000
know

630
00:22:56,720 --> 00:23:00,720
finally define

631
00:22:58,000 --> 00:23:02,640
uh a container as a first class object

632
00:23:00,720 --> 00:23:04,320
uh in linux

633
00:23:02,640 --> 00:23:07,200
so that was pretty much all the

634
00:23:04,320 --> 00:23:09,360
questions i had this is a legal slide

635
00:23:07,200 --> 00:23:11,679
for uh

636
00:23:09,360 --> 00:23:12,960
for attributions and finally some

637
00:23:11,679 --> 00:23:15,840
references

638
00:23:12,960 --> 00:23:17,679
so thank you for uh for this i will look

639
00:23:15,840 --> 00:23:18,880
at if there are any questions around it

640
00:23:17,679 --> 00:23:20,880
and uh

641
00:23:18,880 --> 00:23:23,840
and i will try to i'll try my best to

642
00:23:20,880 --> 00:23:23,840
answer them

643
00:23:25,360 --> 00:23:30,559
yes the first question yeah the first

644
00:23:27,520 --> 00:23:32,960
question is does the numa map of the

645
00:23:30,559 --> 00:23:34,559
namespaced cpus still correspond to the

646
00:23:32,960 --> 00:23:36,640
hardware so in the current

647
00:23:34,559 --> 00:23:38,240
implementation that pratik has it does

648
00:23:36,640 --> 00:23:40,000
not so we have not taken that into

649
00:23:38,240 --> 00:23:41,360
consideration but the idea is if we

650
00:23:40,000 --> 00:23:42,960
still want to you know do this whole

651
00:23:41,360 --> 00:23:46,080
permutation thing then we can restrict

652
00:23:42,960 --> 00:23:48,000
the permutation to uh the numa node so

653
00:23:46,080 --> 00:23:50,720
that uh you know we are consistent at

654
00:23:48,000 --> 00:23:52,159
least with respect to numa but but

655
00:23:50,720 --> 00:23:54,000
if we are going for a random permutation

656
00:23:52,159 --> 00:23:56,159
we will still not be consistent with

657
00:23:54,000 --> 00:23:58,720
other topological information such as

658
00:23:56,159 --> 00:24:01,679
last level caches and

659
00:23:58,720 --> 00:24:04,640
smt siblings for instance

660
00:24:01,679 --> 00:24:06,799
uh yeah the next question is uh why

661
00:24:04,640 --> 00:24:09,039
scramble the cpus rather than just

662
00:24:06,799 --> 00:24:10,640
showing you know zero to four or in

663
00:24:09,039 --> 00:24:12,240
general i think zero to three is what

664
00:24:10,640 --> 00:24:15,120
meant what was meant here since there

665
00:24:12,240 --> 00:24:18,480
are four cpus now we can we can easily

666
00:24:15,120 --> 00:24:20,320
do that i mean we pick the most uh

667
00:24:18,480 --> 00:24:22,799
generic permutation that one could think

668
00:24:20,320 --> 00:24:24,960
of but it is very easily possible to you

669
00:24:22,799 --> 00:24:27,120
know redefine this map to be just you

670
00:24:24,960 --> 00:24:28,799
know zero to n or or if if that is not

671
00:24:27,120 --> 00:24:30,400
preferable we can just have a one to one

672
00:24:28,799 --> 00:24:32,320
map where you know whatever the host

673
00:24:30,400 --> 00:24:34,720
sees it's the same cpus that you know

674
00:24:32,320 --> 00:24:38,000
you see inside the container as well so

675
00:24:34,720 --> 00:24:41,200
it's just an implementational detail

676
00:24:38,000 --> 00:24:42,720
it's a it's a matter of choice

677
00:24:41,200 --> 00:24:44,080
so pratik you want to add anything to

678
00:24:42,720 --> 00:24:46,000
that

679
00:24:44,080 --> 00:24:47,440
no no i think you're absolutely right so

680
00:24:46,000 --> 00:24:49,600
it's just uh it's just a matter of

681
00:24:47,440 --> 00:24:51,039
implementation details and uh and that

682
00:24:49,600 --> 00:24:53,200
was pretty much the first thing that i

683
00:24:51,039 --> 00:24:55,520
implemented uh so like i just also said

684
00:24:53,200 --> 00:24:57,600
before right this could easily be as a

685
00:24:55,520 --> 00:24:59,440
zero to three map or or a one-to-one map

686
00:24:57,600 --> 00:25:03,120
as well right it should show you

687
00:24:59,440 --> 00:25:03,120
whatever cpus that you really have

688
00:25:04,320 --> 00:25:08,000
okay uh

689
00:25:06,159 --> 00:25:10,640
there's one more question

690
00:25:08,000 --> 00:25:13,360
which asks is the idea that you would

691
00:25:10,640 --> 00:25:15,679
scramble the cpu ids

692
00:25:13,360 --> 00:25:18,000
node ids together

693
00:25:15,679 --> 00:25:20,240
in some way where the cpus on the same

694
00:25:18,000 --> 00:25:23,200
numa node will still appear on the same

695
00:25:20,240 --> 00:25:26,880
numa node inside the container

696
00:25:23,200 --> 00:25:28,880
yes that that that is the eventual idea

697
00:25:26,880 --> 00:25:30,480
but but it is not present in the current

698
00:25:28,880 --> 00:25:34,880
implementation so current implementation

699
00:25:30,480 --> 00:25:38,240
for instance if you say take two cpus

700
00:25:34,880 --> 00:25:39,679
five and say 130 which happen to be

701
00:25:38,240 --> 00:25:41,919
these are real cpu numbers which happen

702
00:25:39,679 --> 00:25:44,159
to be different numa ids when you view it

703
00:25:41,919 --> 00:25:46,559
inside the container they may still

704
00:25:44,159 --> 00:25:48,400
get you know some numbers like

705
00:25:46,559 --> 00:25:49,679
10 and 11

706
00:25:48,400 --> 00:25:51,520
which

707
00:25:49,679 --> 00:25:52,720
inside the container you know mapped to

708
00:25:51,520 --> 00:25:54,159
the same

709
00:25:52,720 --> 00:25:55,200
same numa

710
00:25:54,159 --> 00:25:57,120
id so

711
00:25:55,200 --> 00:25:58,720
that is something that we need to fix

712
00:25:57,120 --> 00:26:00,159
when we are setting this permutation so

713
00:25:58,720 --> 00:26:01,919
currently i think we are taking all the

714
00:26:00,159 --> 00:26:03,440
cpus and then we are

715
00:26:01,919 --> 00:26:05,279
defining a permutation at the start of

716
00:26:03,440 --> 00:26:06,960
the container that could be anything to

717
00:26:05,279 --> 00:26:09,039
anything but then we can partition these

718
00:26:06,960 --> 00:26:12,559
and have this permutations within those

719
00:26:09,039 --> 00:26:12,559
numa nodes that's that's possible

720
00:26:13,760 --> 00:26:17,760
there's one more question cpu scrambling

721
00:26:15,919 --> 00:26:20,159
hides topology information from the

722
00:26:17,760 --> 00:26:22,320
container but doesn't that mean that

723
00:26:20,159 --> 00:26:24,720
apps that try and optimize their access

724
00:26:22,320 --> 00:26:26,880
patterns for numa will actually be

725
00:26:24,720 --> 00:26:29,039
anti-optimized for memory access that is

726
00:26:26,880 --> 00:26:30,559
true that is true and like like i said

727
00:26:29,039 --> 00:26:32,080
and it's it's something that we have not

728
00:26:30,559 --> 00:26:36,440
taken care of in our current application

729
00:26:32,080 --> 00:26:36,440
but it's not very hard to do that

730
00:26:42,080 --> 00:26:47,279
any other questions

731
00:26:45,039 --> 00:26:49,679
any comments because what we are really

732
00:26:47,279 --> 00:26:51,279
looking for is feedback since there are

733
00:26:49,679 --> 00:26:53,840
user space solutions there are

734
00:26:51,279 --> 00:26:56,400
alternative you know kernel uh solutions

735
00:26:53,840 --> 00:26:58,159
what would be the right way forward uh

736
00:26:56,400 --> 00:26:59,840
you know to provide a consistent

737
00:26:58,159 --> 00:27:03,279
information to

738
00:26:59,840 --> 00:27:03,279
applications running inside containers

739
00:27:03,840 --> 00:27:09,440
what user space pieces are responsible

740
00:27:06,799 --> 00:27:10,480
for setting up the cpu mappings

741
00:27:09,440 --> 00:27:13,279
i

742
00:27:10,480 --> 00:27:14,799
am assuming that this question

743
00:27:13,279 --> 00:27:16,159
you know is restricted to our

744
00:27:14,799 --> 00:27:17,679
implementation of

745
00:27:16,159 --> 00:27:19,279
cpu name space

746
00:27:17,679 --> 00:27:21,200
so pratik you want to take that what

747
00:27:19,279 --> 00:27:23,520
user space pieces are responsible for

748
00:27:21,200 --> 00:27:26,320
setting up the cpu mapping

749
00:27:23,520 --> 00:27:29,200
um so responsible for cpu mappings is

750
00:27:26,320 --> 00:27:32,559
nothing much really we we expose this by

751
00:27:29,200 --> 00:27:34,960
uh by a clone system call and uh we have

752
00:27:32,559 --> 00:27:36,559
just defined a new system call uh or

753
00:27:34,960 --> 00:27:38,720
we just defined a new flag in this clone

754
00:27:36,559 --> 00:27:41,440
system call where if you if you call

755
00:27:38,720 --> 00:27:44,080
CLONE_NEWCPU uh you're you're basically

756
00:27:41,440 --> 00:27:45,919
going to get a new cpu namespace and uh

757
00:27:44,080 --> 00:27:47,200
these these are these are automatically

758
00:27:45,919 --> 00:27:49,600
going to be mapped at the start of

759
00:27:47,200 --> 00:27:51,679
creating this cpu namespace

760
00:27:49,600 --> 00:27:53,279
so yeah in addition to i mean that that

761
00:27:51,679 --> 00:27:54,880
clone is something that in our current

762
00:27:53,279 --> 00:27:57,440
implementation

763
00:27:54,880 --> 00:27:59,760
it is set by default whenever uh pid

764
00:27:57,440 --> 00:28:03,120
namespace is asked for so right so

765
00:27:59,760 --> 00:28:04,640
that's a hack uh but apart from that

766
00:28:03,120 --> 00:28:05,919
there is nothing that the user space

767
00:28:04,640 --> 00:28:07,679
currently needs to do for our

768
00:28:05,919 --> 00:28:09,440
implementation in order to get you know

769
00:28:07,679 --> 00:28:12,640
this restricted information because that

770
00:28:09,440 --> 00:28:14,640
is exposed through proc and sys and

771
00:28:12,640 --> 00:28:17,120
utilities such as top anyway read

772
00:28:14,640 --> 00:28:20,399
this information so the idea is to

773
00:28:17,120 --> 00:28:20,399
present consistent information

774
00:28:22,880 --> 00:28:26,000
proc sys

775
00:28:24,000 --> 00:28:27,520
cgroupfs and

776
00:28:26,000 --> 00:28:30,159
system calls such as set and get

777
00:28:27,520 --> 00:28:32,240
affinity

778
00:28:30,159 --> 00:28:34,159
yeah it's a hack only because that you

779
00:28:32,240 --> 00:28:36,240
know we want to use all these

780
00:28:34,159 --> 00:28:37,039
pre-made utilities like docker uh to

781
00:28:36,240 --> 00:28:39,440
really

782
00:28:37,039 --> 00:28:40,960
get going of course you you you can you

783
00:28:39,440 --> 00:28:43,120
can write your own c programs and call

784
00:28:40,960 --> 00:28:44,640
your clone system calls and

785
00:28:43,120 --> 00:28:47,640
get the same thing up and running as

786
00:28:44,640 --> 00:28:47,640
well

787
00:28:59,679 --> 00:29:02,320
all right uh

788
00:29:04,720 --> 00:29:09,919
yeah we still have a minute we can we

789
00:29:06,720 --> 00:29:12,880
can take any questions or comments

790
00:29:09,919 --> 00:29:14,720
yep we've still got uh about one minute

791
00:29:12,880 --> 00:29:18,360
so if anyone has any has one last

792
00:29:14,720 --> 00:29:18,360
question to put in

793
00:29:27,760 --> 00:29:31,799
now we've got one coming in

794
00:29:40,159 --> 00:29:45,120
the question is

795
00:29:41,520 --> 00:29:48,080
once the cpu name space is unshared

796
00:29:45,120 --> 00:29:51,520
how are the additions or removals or

797
00:29:48,080 --> 00:29:52,840
renumberings of the cpus controlled

798
00:29:51,520 --> 00:29:56,159
okay

799
00:29:52,840 --> 00:29:59,039
uh pratik you want to take that so

800
00:29:56,159 --> 00:30:00,720
yeah sure uh so basically uh uh when

801
00:29:59,039 --> 00:30:03,200
where so these are these are basically

802
00:30:00,720 --> 00:30:05,360
virtual sort of mappings right uh they

803
00:30:03,200 --> 00:30:07,679
don't really matter uh

804
00:30:05,360 --> 00:30:10,000
they only matter in the sense of that

805
00:30:07,679 --> 00:30:12,240
namespace itself and when you when you

806
00:30:10,000 --> 00:30:15,679
unshare it uh those those mappings just

807
00:30:12,240 --> 00:30:17,039
go away and uh and i and uh of course

808
00:30:15,679 --> 00:30:18,480
the app the

809
00:30:17,039 --> 00:30:20,480
the the tasks that are that have been

810
00:30:18,480 --> 00:30:22,640
running in those mappings now are mapped

811
00:30:20,480 --> 00:30:24,960
to the translations or the or the real

812
00:30:22,640 --> 00:30:27,200
numberings uh of whatever uh physical

813
00:30:24,960 --> 00:30:31,200
cpus or or logical cpus in terms of

814
00:30:27,200 --> 00:30:31,200
linux is really really mapped to

815
00:30:31,360 --> 00:30:36,320
so so they will see the entire system i

816
00:30:33,760 --> 00:30:38,159
mean if you add a new cpu uh you will

817
00:30:36,320 --> 00:30:39,760
they'll see the you know the permutation

818
00:30:38,159 --> 00:30:43,440
that has been assigned to that new cpu

819
00:30:39,760 --> 00:30:46,159
when you remove a cpu for instance uh uh

820
00:30:43,440 --> 00:30:47,919
then that number goes away and if you if

821
00:30:46,159 --> 00:30:49,120
you unshare the entire namespace i think

822
00:30:47,919 --> 00:30:51,279
the current implementation will give

823
00:30:49,120 --> 00:30:54,880
whatever the host will see so

824
00:30:51,279 --> 00:30:54,880
just get rid of that permutation

825
00:30:56,960 --> 00:30:59,679
well

826
00:30:57,760 --> 00:31:02,240
i think that brings us to the end of our

827
00:30:59,679 --> 00:31:05,760
time slot so thank you very much to uh

828
00:31:02,240 --> 00:31:06,640
pratik and gautham for their talk

829
00:31:05,760 --> 00:31:09,600
um

830
00:31:06,640 --> 00:31:11,760
and that brings us to afternoon tea uh

831
00:31:09,600 --> 00:31:13,519
so we'll be taking a 30 minute break

832
00:31:11,760 --> 00:31:16,159
until the next talk which will be uh

833
00:31:13,519 --> 00:31:18,559
alice farazi uh talking about merging an

834
00:31:16,159 --> 00:31:22,000
existing framework into kernel ci

835
00:31:18,559 --> 00:31:23,440
that will be coming up at 3 40 pm

836
00:31:22,000 --> 00:31:25,440
see you all then have a good afternoon

837
00:31:23,440 --> 00:31:27,760
tea

838
00:31:25,440 --> 00:31:31,000
thank you thank you

839
00:31:27,760 --> 00:31:31,000
thank you