1
00:00:02,240 --> 00:00:05,240
foreign

2
00:00:06,320 --> 00:00:11,499
[Music]

3
00:00:16,240 --> 00:00:21,119
welcome back to our next talk

4
00:00:18,560 --> 00:00:24,160
uh which will be on what c and c plus

5
00:00:21,119 --> 00:00:26,640
can do and when do you need assembly

6
00:00:24,160 --> 00:00:29,840
this talk will be given by alexander

7
00:00:26,640 --> 00:00:32,800
krizanovsky who is the ceo of tempesta

8
00:00:29,840 --> 00:00:35,280
technologies and the architect of

9
00:00:32,800 --> 00:00:37,760
tempesta fw a high performance open

10
00:00:35,280 --> 00:00:38,879
source linux application delivery

11
00:00:37,760 --> 00:00:41,120
controller

12
00:00:38,879 --> 00:00:43,040
alexander is responsible for the design

13
00:00:41,120 --> 00:00:45,520
and performance of several products in

14
00:00:43,040 --> 00:00:47,280
the areas of network traffic processing

15
00:00:45,520 --> 00:00:49,520
and databases

16
00:00:47,280 --> 00:00:51,760
he designed the core architecture of a

17
00:00:49,520 --> 00:00:54,480
web application firewall

18
00:00:51,760 --> 00:00:58,800
mentioned in the gartner magic quadrant

19
00:00:54,480 --> 00:01:01,359
maria db temporal data tables uh with

20
00:00:58,800 --> 00:01:04,879
that over to you alexander

21
00:01:01,359 --> 00:01:05,760
yeah uh do you see my screen shell

22
00:01:04,879 --> 00:01:08,640
yes

23
00:01:05,760 --> 00:01:09,920
i can go here um

24
00:01:08,640 --> 00:01:11,200
okay okay

25
00:01:09,920 --> 00:01:12,640
so

26
00:01:11,200 --> 00:01:15,520
first first of all

27
00:01:12,640 --> 00:01:18,320
we do a custom software development in

28
00:01:15,520 --> 00:01:21,600
network security and that storage areas

29
00:01:18,320 --> 00:01:24,000
in particular we um from time to time we

30
00:01:21,600 --> 00:01:27,360
face tasks like to develop very fast

31
00:01:24,000 --> 00:01:30,240
lego expressions engine about scaling

32
00:01:27,360 --> 00:01:32,159
data structure on multi-core systems

33
00:01:30,240 --> 00:01:34,560
various performance optimization and so

34
00:01:32,159 --> 00:01:36,880
on so typically we work in

35
00:01:34,560 --> 00:01:39,439
high performance area in network

36
00:01:36,880 --> 00:01:42,159
security and data storages

37
00:01:39,439 --> 00:01:44,960
also we develop our own product investor

38
00:01:42,159 --> 00:01:47,759
w which is a hybrid of gp accelerate and

39
00:01:44,960 --> 00:01:51,200
the firewall one of the main

40
00:01:47,759 --> 00:01:53,439
targets for the project is to be able to

41
00:01:51,200 --> 00:01:56,799
filter a massive

42
00:01:53,439 --> 00:01:58,320
malicious traffic like s70 dos attacks

43
00:01:56,799 --> 00:01:59,200
or web attacks

44
00:01:58,320 --> 00:02:00,880
and

45
00:01:59,200 --> 00:02:04,479
we have

46
00:02:00,880 --> 00:02:09,200
one of the fastest or maybe the fastest

47
00:02:04,479 --> 00:02:12,480
uh hpm parcel and uh so this talk will

48
00:02:09,200 --> 00:02:16,160
be most uh reference to this to

49
00:02:12,480 --> 00:02:18,560
suggestions about hp passing and uh

50
00:02:16,160 --> 00:02:20,239
cryptography interior less handshakes

51
00:02:18,560 --> 00:02:23,760
in our

52
00:02:20,239 --> 00:02:26,400
tasks usually uh performances about most

53
00:02:23,760 --> 00:02:29,760
about algorithms mathematics and so on

54
00:02:26,400 --> 00:02:32,480
but also very important stuff is to uh

55
00:02:29,760 --> 00:02:36,800
creatively and accurately uh

56
00:02:32,480 --> 00:02:39,760
code on the program to make it fast and

57
00:02:36,800 --> 00:02:44,239
optimize it so we do a lot of cnc plus

58
00:02:39,760 --> 00:02:47,599
plus and sometimes we deal with assembly

59
00:02:44,239 --> 00:02:50,480
uh for example if we um if you do a quad

60
00:02:47,599 --> 00:02:53,200
review and you see a functions like is

61
00:02:50,480 --> 00:02:55,840
digit uh you see that there are three

62
00:02:53,200 --> 00:02:58,560
operations involved um

63
00:02:55,840 --> 00:02:59,680
in the function and there's a hackers

64
00:02:58,560 --> 00:03:02,480
delight

65
00:02:59,680 --> 00:03:05,760
uh tricks when you can replace the

66
00:03:02,480 --> 00:03:06,800
condition with only two operations so

67
00:03:05,760 --> 00:03:09,519
uh

68
00:03:06,800 --> 00:03:13,280
it doesn't make sense uh to use a more

69
00:03:09,519 --> 00:03:16,159
sophisticated code uh than this um

70
00:03:13,280 --> 00:03:19,280
when a straightforward code actually not

71
00:03:16,159 --> 00:03:22,879
because modern compilers like gcc on

72
00:03:19,280 --> 00:03:26,159
both the optimization levels are good uh

73
00:03:22,879 --> 00:03:26,959
with such kind of optimization so we can

74
00:03:26,159 --> 00:03:28,560
just

75
00:03:26,959 --> 00:03:30,720
stick on

76
00:03:28,560 --> 00:03:34,000
the simplest version of the code

77
00:03:30,720 --> 00:03:36,720
the next example is let's consider this

78
00:03:34,000 --> 00:03:38,400
uh two variants of

79
00:03:36,720 --> 00:03:39,920
uh function

80
00:03:38,400 --> 00:03:41,920
and um

81
00:03:39,920 --> 00:03:44,720
you might be thinking

82
00:03:41,920 --> 00:03:48,560
that uh both the function are equal or

83
00:03:44,720 --> 00:03:50,959
not and in both of the

84
00:03:48,560 --> 00:03:54,720
cases you are actually right

85
00:03:50,959 --> 00:03:56,319
uh so let's um

86
00:03:54,720 --> 00:03:57,599
let's uh

87
00:03:56,319 --> 00:04:00,159
load

88
00:03:57,599 --> 00:04:01,280
the program into

89
00:04:00,159 --> 00:04:02,560
uh

90
00:04:01,280 --> 00:04:04,480
into

91
00:04:02,560 --> 00:04:05,680
uh compile explorer

92
00:04:04,480 --> 00:04:07,040
and

93
00:04:05,680 --> 00:04:09,360
one

94
00:04:07,040 --> 00:04:12,799
is for clunk

95
00:04:09,360 --> 00:04:14,959
as we see that both the

96
00:04:12,799 --> 00:04:18,400
versions are pretty the same

97
00:04:14,959 --> 00:04:21,199
uh function uh in assembly however if we

98
00:04:18,400 --> 00:04:23,919
compare it with gcc

99
00:04:21,199 --> 00:04:26,840
um let me do this

100
00:04:23,919 --> 00:04:30,000
we see that rgc

101
00:04:26,840 --> 00:04:34,080
um generates a bit different

102
00:04:30,000 --> 00:04:36,320
uh code and it involves rbp

103
00:04:34,080 --> 00:04:39,199
so actually

104
00:04:36,320 --> 00:04:40,840
not always um

105
00:04:39,199 --> 00:04:43,759
not not always

106
00:04:40,840 --> 00:04:45,600
um this is the same code not always

107
00:04:43,759 --> 00:04:47,280
generates the same results for different

108
00:04:45,600 --> 00:04:48,479
compilers

109
00:04:47,280 --> 00:04:50,160
uh

110
00:04:48,479 --> 00:04:52,440
uh let's now

111
00:04:50,160 --> 00:04:54,080
uh see what the

112
00:04:52,440 --> 00:04:57,520
x8664

113
00:04:54,080 --> 00:04:59,600
uh looks like uh first of all this

114
00:04:57,520 --> 00:05:03,280
surprise color architecture means that

115
00:04:59,600 --> 00:05:04,320
we can run multiple instructions in

116
00:05:03,280 --> 00:05:09,440
parallel

117
00:05:04,320 --> 00:05:11,520
uh this out for the executions and also

118
00:05:09,440 --> 00:05:14,479
each instruction is

119
00:05:11,520 --> 00:05:17,520
like if you see assembly code then this

120
00:05:14,479 --> 00:05:20,400
is not the lowest level of the code and

121
00:05:17,520 --> 00:05:23,280
the microphone processor splits

122
00:05:20,400 --> 00:05:25,680
instruction into my corporations

123
00:05:23,280 --> 00:05:28,720
and different instructions have

124
00:05:25,680 --> 00:05:29,680
different latency and throughput latency

125
00:05:28,720 --> 00:05:33,280
means

126
00:05:29,680 --> 00:05:34,960
how how many cycles instruction takes in

127
00:05:33,280 --> 00:05:37,600
dependency chain

128
00:05:34,960 --> 00:05:40,240
and the throughput is how many

129
00:05:37,600 --> 00:05:41,759
selections of the same kind can be run

130
00:05:40,240 --> 00:05:45,840
in parallel

131
00:05:41,759 --> 00:05:45,840
uh to visualize um

132
00:05:46,000 --> 00:05:48,639
the ideas about

133
00:05:47,840 --> 00:05:53,600
uh

134
00:05:48,639 --> 00:05:57,039
modern x 8664 let's uh have a look on

135
00:05:53,600 --> 00:06:02,560
lvm mca

136
00:05:57,039 --> 00:06:07,280
um so this is a call uh why vm uh mca is

137
00:06:02,560 --> 00:06:10,720
levm 2 it works just fine with gcc

138
00:06:07,280 --> 00:06:13,919
and you actually can pass any assembly

139
00:06:10,720 --> 00:06:15,440
code as input from levm

140
00:06:13,919 --> 00:06:17,919
mc

141
00:06:15,440 --> 00:06:22,000
so let's have a look what what it is

142
00:06:17,919 --> 00:06:24,080
uh this report about uh the program

143
00:06:22,000 --> 00:06:26,240
and uh first of all we see that the

144
00:06:24,080 --> 00:06:29,520
dispatch vis for current architecture

145
00:06:26,240 --> 00:06:33,120
this uh maximum six uh micro operations

146
00:06:29,520 --> 00:06:35,039
per cycle and this is problem we see

147
00:06:33,120 --> 00:06:38,720
almost uh six

148
00:06:35,039 --> 00:06:42,160
microbes per cycle this good for

149
00:06:38,720 --> 00:06:44,560
our program also this section

150
00:06:42,160 --> 00:06:47,120
please how many sections per second

151
00:06:44,560 --> 00:06:49,759
leverage are gone

152
00:06:47,120 --> 00:06:50,800
uh this is also not so

153
00:06:49,759 --> 00:06:53,280
bad

154
00:06:50,800 --> 00:06:56,319
uh next we see the

155
00:06:53,280 --> 00:06:59,039
assembly for this particular program uh

156
00:06:56,319 --> 00:07:01,360
doesn't my um this

157
00:06:59,039 --> 00:07:03,919
um it isn't important for now what was

158
00:07:01,360 --> 00:07:05,440
the problem is but now let's observe

159
00:07:03,919 --> 00:07:08,400
some

160
00:07:05,440 --> 00:07:12,160
key characteristics of the program in

161
00:07:08,400 --> 00:07:14,880
this report we see on first

162
00:07:12,160 --> 00:07:17,840
column is how many micro operations are

163
00:07:14,880 --> 00:07:17,840
involved in

164
00:07:18,639 --> 00:07:22,160
in

165
00:07:19,520 --> 00:07:24,080
particular instruction uh instruction

166
00:07:22,160 --> 00:07:27,840
latency and

167
00:07:24,080 --> 00:07:29,360
it's work of food

168
00:07:27,840 --> 00:07:30,400
frankly i don't

169
00:07:29,360 --> 00:07:31,199
trust

170
00:07:30,400 --> 00:07:32,800
that

171
00:07:31,199 --> 00:07:34,479
all the

172
00:07:32,800 --> 00:07:39,039
instruction i have only one mac

173
00:07:34,479 --> 00:07:41,680
operations but uh let's have a look on

174
00:07:39,039 --> 00:07:44,639
latency we see that uh vector

175
00:07:41,680 --> 00:07:46,800
instructions xml this uh

176
00:07:44,639 --> 00:07:46,800
uh

177
00:07:46,879 --> 00:07:49,680
this

178
00:07:48,000 --> 00:07:51,199
uh 16 byte

179
00:07:49,680 --> 00:07:54,639
vector operations

180
00:07:51,199 --> 00:07:55,599
have very large uh latency

181
00:07:54,639 --> 00:07:56,879
and

182
00:07:55,599 --> 00:07:59,039
throughput

183
00:07:56,879 --> 00:08:01,599
isn't bad this is a reciprocal

184
00:07:59,039 --> 00:08:05,840
throughput it means that our transaction

185
00:08:01,599 --> 00:08:07,120
of this kind can be run in parallel

186
00:08:05,840 --> 00:08:10,000
um

187
00:08:07,120 --> 00:08:12,720
okay it's uh

188
00:08:10,000 --> 00:08:14,879
this instruction have more my mic

189
00:08:12,720 --> 00:08:16,560
operations um

190
00:08:14,879 --> 00:08:18,319
anyway

191
00:08:16,560 --> 00:08:20,560
um

192
00:08:18,319 --> 00:08:21,680
let's uh see a

193
00:08:20,560 --> 00:08:23,360
timeline

194
00:08:21,680 --> 00:08:26,639
for the

195
00:08:23,360 --> 00:08:28,000
program let me make a

196
00:08:26,639 --> 00:08:28,879
little

197
00:08:28,000 --> 00:08:29,919
uh

198
00:08:28,879 --> 00:08:34,959
smaller

199
00:08:29,919 --> 00:08:36,640
uh this uh how the instructions are

200
00:08:34,959 --> 00:08:37,680
executed by

201
00:08:36,640 --> 00:08:40,399
uh

202
00:08:37,680 --> 00:08:42,719
which is

203
00:08:40,399 --> 00:08:45,200
are provided by mca

204
00:08:42,719 --> 00:08:46,399
first of all we see that the first 7th

205
00:08:45,200 --> 00:08:48,720
section

206
00:08:46,399 --> 00:08:51,839
starts decoding at the same

207
00:08:48,720 --> 00:08:55,120
time next is time to execute waiting for

208
00:08:51,839 --> 00:08:56,399
execution and execute

209
00:08:55,120 --> 00:08:58,640
here we

210
00:08:56,399 --> 00:09:01,440
see that

211
00:08:58,640 --> 00:09:04,560
this instruction this instruction

212
00:09:01,440 --> 00:09:06,080
as executed are earlier than previous

213
00:09:04,560 --> 00:09:09,120
instructions this

214
00:09:06,080 --> 00:09:10,640
autofocus execution in action

215
00:09:09,120 --> 00:09:11,920
uh however

216
00:09:10,640 --> 00:09:13,760
uh

217
00:09:11,920 --> 00:09:16,720
the instruction retired

218
00:09:13,760 --> 00:09:19,200
return means that uh results of the

219
00:09:16,720 --> 00:09:22,160
instruction are fixed in

220
00:09:19,200 --> 00:09:24,959
final registers the final memory and we

221
00:09:22,160 --> 00:09:26,240
see that all the instructions are always

222
00:09:24,959 --> 00:09:28,000
retired

223
00:09:26,240 --> 00:09:30,880
in

224
00:09:28,000 --> 00:09:33,839
in sequence as instruction appeared in

225
00:09:30,880 --> 00:09:33,839
the program

226
00:09:34,560 --> 00:09:43,360
so uh let's back uh to the slides

227
00:09:38,800 --> 00:09:44,720
uh thanks to out of vote execution we uh

228
00:09:43,360 --> 00:09:46,480
have um

229
00:09:44,720 --> 00:09:50,080
a spectacle attack

230
00:09:46,480 --> 00:09:52,399
and we suspect an attack um

231
00:09:50,080 --> 00:09:52,399
we

232
00:09:53,360 --> 00:09:59,600
the spectral attacks use with um out of

233
00:09:56,240 --> 00:10:00,720
forward execution let's uh uh let's

234
00:09:59,600 --> 00:10:02,880
explore

235
00:10:00,720 --> 00:10:07,360
uh one more example

236
00:10:02,880 --> 00:10:07,360
let me award it in compiler explorer

237
00:10:07,440 --> 00:10:10,240
here it is

238
00:10:12,720 --> 00:10:17,760
let me choose a clank for this

239
00:10:15,279 --> 00:10:20,880
experiment

240
00:10:17,760 --> 00:10:23,279
and uh here we go this uh basic uh clan

241
00:10:20,880 --> 00:10:26,480
captain uh compilation for this problem

242
00:10:23,279 --> 00:10:30,480
this uh what we would expect

243
00:10:26,480 --> 00:10:32,399
uh however in um out of order execution

244
00:10:30,480 --> 00:10:35,040
array lengths might be

245
00:10:32,399 --> 00:10:37,279
not fetched from the memory and the

246
00:10:35,040 --> 00:10:40,320
while uh experience waiting for the

247
00:10:37,279 --> 00:10:44,000
value of a length it might uh try to

248
00:10:40,320 --> 00:10:45,519
speculatively try to execute uh the next

249
00:10:44,000 --> 00:10:46,399
uh statement

250
00:10:45,519 --> 00:10:50,240
so

251
00:10:46,399 --> 00:10:52,959
uh if we long and actually offset is

252
00:10:50,240 --> 00:10:55,040
larger than array lengths then cpu may

253
00:10:52,959 --> 00:10:57,120
might uh store the

254
00:10:55,040 --> 00:11:00,160
uh results of this profession's

255
00:10:57,120 --> 00:11:02,880
instruction in the cpu cache and this um

256
00:11:00,160 --> 00:11:07,279
data may be repeated by nataka

257
00:11:02,880 --> 00:11:09,440
uh so uh to prevent such kind of attacks

258
00:11:07,279 --> 00:11:11,519
we can add

259
00:11:09,440 --> 00:11:14,920
command line options

260
00:11:11,519 --> 00:11:14,920
it's um

261
00:11:15,839 --> 00:11:19,040
line

262
00:11:17,160 --> 00:11:20,959
[Music]

263
00:11:19,040 --> 00:11:22,399
and

264
00:11:20,959 --> 00:11:24,320
now now we have

265
00:11:22,399 --> 00:11:27,279
legend epilogue

266
00:11:24,320 --> 00:11:30,000
and this actually the trick

267
00:11:27,279 --> 00:11:31,120
from the compiler to trick the

268
00:11:30,000 --> 00:11:32,880
cpu

269
00:11:31,120 --> 00:11:36,399
so if it goes

270
00:11:32,880 --> 00:11:39,920
uh to the wrong uh branch then it sees

271
00:11:36,399 --> 00:11:45,279
uh pause and defense which actually

272
00:11:39,920 --> 00:11:48,000
uh clears the all the uh data crashed

273
00:11:45,279 --> 00:11:52,000
uh so such

274
00:11:48,000 --> 00:11:54,839
epiworks uh have a performance cost and

275
00:11:52,000 --> 00:11:58,320
it might cost up to 15

276
00:11:54,839 --> 00:12:01,120
performance this next uh attack which

277
00:11:58,320 --> 00:12:04,720
were revealed couple of years ago is uh

278
00:12:01,120 --> 00:12:07,600
main down and this is about um

279
00:12:04,720 --> 00:12:10,560
page table isolation on the operating

280
00:12:07,600 --> 00:12:14,480
system and this might cost up to 30

281
00:12:10,560 --> 00:12:15,519
percent of uh performance

282
00:12:14,480 --> 00:12:20,959
um

283
00:12:15,519 --> 00:12:23,200
so if you do a code review uh and uh

284
00:12:20,959 --> 00:12:27,279
suppose that performance very important

285
00:12:23,200 --> 00:12:29,680
for you then you might be asking um

286
00:12:27,279 --> 00:12:30,880
as on the early side in this

287
00:12:29,680 --> 00:12:33,920
presentation

288
00:12:30,880 --> 00:12:37,360
uh with the compiler is smart enough to

289
00:12:33,920 --> 00:12:38,800
make uh optimization or not

290
00:12:37,360 --> 00:12:41,680
typically

291
00:12:38,800 --> 00:12:41,680
we can first

292
00:12:41,920 --> 00:12:46,079
check

293
00:12:43,600 --> 00:12:50,320
compiling documents this

294
00:12:46,079 --> 00:12:52,720
document from levm about optimization

295
00:12:50,320 --> 00:12:53,600
passes transformation phases

296
00:12:52,720 --> 00:12:55,279
and

297
00:12:53,600 --> 00:12:57,200
this documentation might give you an

298
00:12:55,279 --> 00:12:59,839
idea what the compiler can do in

299
00:12:57,200 --> 00:13:01,320
principle however

300
00:12:59,839 --> 00:13:03,920
one compiler can do a lot of

301
00:13:01,320 --> 00:13:06,399
optimizations not all of them can be

302
00:13:03,920 --> 00:13:08,560
applied in particular case

303
00:13:06,399 --> 00:13:12,000
and the compiler have

304
00:13:08,560 --> 00:13:14,270
common line options uh providing

305
00:13:12,000 --> 00:13:15,760
optimization information this um

306
00:13:14,270 --> 00:13:18,160
[Music]

307
00:13:15,760 --> 00:13:23,360
subtract of

308
00:13:18,160 --> 00:13:26,320
gcm output about optimized vectorized um

309
00:13:23,360 --> 00:13:27,920
optimizations and several missed uh

310
00:13:26,320 --> 00:13:31,519
optimizations

311
00:13:27,920 --> 00:13:34,639
uh once you have uh report you also

312
00:13:31,519 --> 00:13:37,200
have a source file you have lines and

313
00:13:34,639 --> 00:13:39,920
socks fast so you can have an ideas

314
00:13:37,200 --> 00:13:43,040
which part of your sources are optimized

315
00:13:39,920 --> 00:13:44,560
and not and with which uh strategies

316
00:13:43,040 --> 00:13:46,399
uh also

317
00:13:44,560 --> 00:13:49,360
uh reviewing the

318
00:13:46,399 --> 00:13:52,399
assembly it's always helpful but usually

319
00:13:49,360 --> 00:13:54,240
it's better to do this on last

320
00:13:52,399 --> 00:13:56,079
as the last

321
00:13:54,240 --> 00:13:59,120
try

322
00:13:56,079 --> 00:14:01,600
uh so let's start about particle with

323
00:13:59,120 --> 00:14:04,560
practical examples

324
00:14:01,600 --> 00:14:06,240
about http um the first example is about

325
00:14:04,560 --> 00:14:09,680
http parser

326
00:14:06,240 --> 00:14:11,360
uh http passwords are typically encoded

327
00:14:09,680 --> 00:14:12,240
as a

328
00:14:11,360 --> 00:14:15,440
loop

329
00:14:12,240 --> 00:14:18,720
uh in this slide there's this

330
00:14:15,440 --> 00:14:20,800
while loop on input data and inside the

331
00:14:18,720 --> 00:14:22,800
loop you have a bunch of switch

332
00:14:20,800 --> 00:14:25,040
statements and

333
00:14:22,800 --> 00:14:28,079
uh switch

334
00:14:25,040 --> 00:14:29,519
guns on current state machine state and

335
00:14:28,079 --> 00:14:33,279
input data

336
00:14:29,519 --> 00:14:37,040
uh for nginx we have relatively small

337
00:14:33,279 --> 00:14:41,600
state machine about 84

338
00:14:37,040 --> 00:14:44,720
states however 14 psfw we need much

339
00:14:41,600 --> 00:14:45,760
larger state machine and this magnitude

340
00:14:44,720 --> 00:14:48,560
larger

341
00:14:45,760 --> 00:14:50,880
uh the reason for larger set machine is

342
00:14:48,560 --> 00:14:52,000
because we do a lot of security checks

343
00:14:50,880 --> 00:14:54,160
in our

344
00:14:52,000 --> 00:14:55,519
hp passes so

345
00:14:54,160 --> 00:14:56,320
we

346
00:14:55,519 --> 00:14:58,480
have

347
00:14:56,320 --> 00:15:02,079
more than 900

348
00:14:58,480 --> 00:15:02,079
states now can't pass

349
00:15:02,880 --> 00:15:07,279
so

350
00:15:04,079 --> 00:15:09,519
what compilers can do for such

351
00:15:07,279 --> 00:15:12,320
large state machines

352
00:15:09,519 --> 00:15:15,360
the first one is that we have

353
00:15:12,320 --> 00:15:18,880
optimization of switch this well-known

354
00:15:15,360 --> 00:15:22,800
optimization uh first of all is lookup

355
00:15:18,880 --> 00:15:25,440
table so if your states uh which you use

356
00:15:22,800 --> 00:15:26,560
in switch statements are arranged and

357
00:15:25,440 --> 00:15:28,639
sequential

358
00:15:26,560 --> 00:15:29,920
then your compiler can generate the

359
00:15:28,639 --> 00:15:31,519
hookup

360
00:15:29,920 --> 00:15:33,600
table

361
00:15:31,519 --> 00:15:35,519
so we have a

362
00:15:33,600 --> 00:15:36,560
jump bias

363
00:15:35,519 --> 00:15:39,360
offset

364
00:15:36,560 --> 00:15:43,199
and by the job we define the label where

365
00:15:39,360 --> 00:15:46,480
we need to go so this like a double jump

366
00:15:43,199 --> 00:15:49,839
however since we use uh input data as

367
00:15:46,480 --> 00:15:53,519
i've said uh for jump this code is

368
00:15:49,839 --> 00:15:56,639
vulnerable for spectral attack

369
00:15:53,519 --> 00:16:00,320
uh the next optimization next variant of

370
00:15:56,639 --> 00:16:02,399
switch table optimization is uh binary

371
00:16:00,320 --> 00:16:04,480
search if your

372
00:16:02,399 --> 00:16:05,600
states and switch statements aren't

373
00:16:04,480 --> 00:16:06,959
arranged

374
00:16:05,600 --> 00:16:10,320
and

375
00:16:06,959 --> 00:16:11,839
have a non-consequent values then

376
00:16:10,320 --> 00:16:13,360
compiler have to

377
00:16:11,839 --> 00:16:16,000
fall back to

378
00:16:13,360 --> 00:16:17,120
a binary search between the

379
00:16:16,000 --> 00:16:18,560
labels

380
00:16:17,120 --> 00:16:21,519
actually

381
00:16:18,560 --> 00:16:24,000
modern compilers are smart enough to be

382
00:16:21,519 --> 00:16:26,320
able to recognize the same pattern for

383
00:16:24,000 --> 00:16:30,160
chains of yourself

384
00:16:26,320 --> 00:16:31,519
use f's ifs and we have the same uh

385
00:16:30,160 --> 00:16:34,639
pretty the same

386
00:16:31,519 --> 00:16:37,120
uh code generated for

387
00:16:34,639 --> 00:16:41,279
uh yourself

388
00:16:37,120 --> 00:16:43,279
uh the next thing which we can uh do is

389
00:16:41,279 --> 00:16:46,160
uh just to

390
00:16:43,279 --> 00:16:47,839
replace uh switch statement by

391
00:16:46,160 --> 00:16:49,440
uh go to

392
00:16:47,839 --> 00:16:50,320
uh jumps

393
00:16:49,440 --> 00:16:53,120
um

394
00:16:50,320 --> 00:16:57,279
jcbo

395
00:16:53,120 --> 00:16:59,600
will copy some uh code inside the

396
00:16:57,279 --> 00:17:02,000
uh state machine

397
00:16:59,600 --> 00:17:04,880
and we still need some

398
00:17:02,000 --> 00:17:07,839
uh starting um

399
00:17:04,880 --> 00:17:10,079
starting switch statement just to start

400
00:17:07,839 --> 00:17:11,520
uh the state machine

401
00:17:10,079 --> 00:17:13,280
and we

402
00:17:11,520 --> 00:17:14,880
actually can go

403
00:17:13,280 --> 00:17:15,919
uh faster

404
00:17:14,880 --> 00:17:19,199
uh

405
00:17:15,919 --> 00:17:22,640
she provides us um

406
00:17:19,199 --> 00:17:23,839
labels as values so essentially you can

407
00:17:22,640 --> 00:17:26,480
draw

408
00:17:23,839 --> 00:17:27,600
in c just the same as you do in assembly

409
00:17:26,480 --> 00:17:30,160
just

410
00:17:27,600 --> 00:17:32,799
jump directly to particular

411
00:17:30,160 --> 00:17:35,919
others in your code

412
00:17:32,799 --> 00:17:38,080
so with this optimization we can move uh

413
00:17:35,919 --> 00:17:38,880
significantly faster

414
00:17:38,080 --> 00:17:39,760
so

415
00:17:38,880 --> 00:17:42,720
uh

416
00:17:39,760 --> 00:17:44,240
for pretty large uh

417
00:17:42,720 --> 00:17:46,799
state machines

418
00:17:44,240 --> 00:17:48,480
uh this link to

419
00:17:46,799 --> 00:17:50,960
uh benchmark

420
00:17:48,480 --> 00:17:54,320
uh if we compare uh

421
00:17:50,960 --> 00:17:58,559
state state machines for or for seven or

422
00:17:54,320 --> 00:18:01,600
twenty uh seven states then uh times are

423
00:17:58,559 --> 00:18:02,400
more or less the same for very small

424
00:18:01,600 --> 00:18:06,559
uh

425
00:18:02,400 --> 00:18:10,000
state machines we uh go to given um

426
00:18:06,559 --> 00:18:11,840
after automaton may might be even a bit

427
00:18:10,000 --> 00:18:15,200
slower than

428
00:18:11,840 --> 00:18:17,600
switch driving however for a very large

429
00:18:15,200 --> 00:18:20,240
uh state machine or in this particular

430
00:18:17,600 --> 00:18:24,080
case uh it's twice

431
00:18:20,240 --> 00:18:25,200
uh smallest on our actual state machine

432
00:18:24,080 --> 00:18:29,360
you can

433
00:18:25,200 --> 00:18:32,320
move uh twice faster than a basic

434
00:18:29,360 --> 00:18:33,200
switch driving state machine

435
00:18:32,320 --> 00:18:35,600
uh

436
00:18:33,200 --> 00:18:39,280
the main reason for this is

437
00:18:35,600 --> 00:18:41,440
low um branch misses because we we drew

438
00:18:39,280 --> 00:18:43,440
much lower

439
00:18:41,440 --> 00:18:45,039
code transition

440
00:18:43,440 --> 00:18:48,960
in the loop

441
00:18:45,039 --> 00:18:48,960
then we have a much better

442
00:18:49,760 --> 00:18:54,480
cash

443
00:18:51,360 --> 00:18:58,000
much better characteristics of

444
00:18:54,480 --> 00:18:59,919
cash missiles even uh we have

445
00:18:58,000 --> 00:19:00,960
almost twice

446
00:18:59,919 --> 00:19:04,160
bigger

447
00:19:00,960 --> 00:19:06,160
uh code for go to driving aftermath on

448
00:19:04,160 --> 00:19:09,039
we have uh

449
00:19:06,160 --> 00:19:11,840
twice small number of

450
00:19:09,039 --> 00:19:11,840
cash misses

451
00:19:13,600 --> 00:19:19,280
next thing uh is

452
00:19:17,200 --> 00:19:21,200
how to

453
00:19:19,280 --> 00:19:23,440
how to properly code

454
00:19:21,200 --> 00:19:24,799
a state machine this

455
00:19:23,440 --> 00:19:26,000
clear for

456
00:19:24,799 --> 00:19:29,679
http

457
00:19:26,000 --> 00:19:34,240
uh semantics then uh first we have

458
00:19:29,679 --> 00:19:37,039
hp method next we have uh some

459
00:19:34,240 --> 00:19:41,520
hp headers and after that we have a http

460
00:19:37,039 --> 00:19:42,480
body and we would expect that um

461
00:19:41,520 --> 00:19:43,440
we

462
00:19:42,480 --> 00:19:46,320
well

463
00:19:43,440 --> 00:19:47,120
in in any case we always need to

464
00:19:46,320 --> 00:19:48,480
uh

465
00:19:47,120 --> 00:19:52,400
pass

466
00:19:48,480 --> 00:19:56,640
hpm uh method and after that move to hp

467
00:19:52,400 --> 00:20:00,880
headers so this no sense uh to place

468
00:19:56,640 --> 00:20:04,320
hp um uh method of python after hp

469
00:20:00,880 --> 00:20:07,200
headers so we have uh

470
00:20:04,320 --> 00:20:08,400
we have a fixed sequence of states

471
00:20:07,200 --> 00:20:10,720
however

472
00:20:08,400 --> 00:20:13,520
uh compilers doesn't know about these

473
00:20:10,720 --> 00:20:18,000
requirements about our semantics of

474
00:20:13,520 --> 00:20:20,880
input data so it's um free to reorder

475
00:20:18,000 --> 00:20:23,120
um code in its understanding of

476
00:20:20,880 --> 00:20:26,320
optimization and in our

477
00:20:23,120 --> 00:20:29,360
uh case we saw that

478
00:20:26,320 --> 00:20:31,840
very improbable states uh are placed by

479
00:20:29,360 --> 00:20:34,559
the compiler on the top of the state

480
00:20:31,840 --> 00:20:37,120
machine the top of the set machine is a

481
00:20:34,559 --> 00:20:40,559
interest point for the function and it's

482
00:20:37,120 --> 00:20:41,360
very very hot so we we need to place the

483
00:20:40,559 --> 00:20:43,440
hot

484
00:20:41,360 --> 00:20:45,440
states closer to the beginning of the

485
00:20:43,440 --> 00:20:48,799
function

486
00:20:45,440 --> 00:20:50,000
and the first thing which we can do is

487
00:20:48,799 --> 00:20:53,679
to use

488
00:20:50,000 --> 00:20:54,960
compiler barrier which

489
00:20:53,679 --> 00:20:58,559
prevents

490
00:20:54,960 --> 00:21:00,480
compiler from reordering

491
00:20:58,559 --> 00:21:03,840
however we can

492
00:21:00,480 --> 00:21:03,840
it's not optimal

493
00:21:06,480 --> 00:21:12,320
sorry uh it's not optimal to place

494
00:21:09,200 --> 00:21:18,080
compiler barriers everywhere we want to

495
00:21:12,320 --> 00:21:19,679
let compiler do optimization better and

496
00:21:18,080 --> 00:21:21,600
actually we can

497
00:21:19,679 --> 00:21:23,919
do this

498
00:21:21,600 --> 00:21:23,919
using

499
00:21:24,080 --> 00:21:28,559
using hot

500
00:21:27,039 --> 00:21:29,280
hot labels

501
00:21:28,559 --> 00:21:31,039
so

502
00:21:29,280 --> 00:21:34,000
with um

503
00:21:31,039 --> 00:21:36,799
with the uh hot and cold uh labels for

504
00:21:34,000 --> 00:21:39,039
state machines we can arrange the the

505
00:21:36,799 --> 00:21:41,679
parcel uh to

506
00:21:39,039 --> 00:21:44,640
to make a compiler place particular

507
00:21:41,679 --> 00:21:47,200
labels uh closer to the top of the

508
00:21:44,640 --> 00:21:49,360
function or closer to the bottom of the

509
00:21:47,200 --> 00:21:50,320
function

510
00:21:49,360 --> 00:21:52,240
in this

511
00:21:50,320 --> 00:21:55,919
case we found that

512
00:21:52,240 --> 00:21:58,480
pgo isn't helpful because

513
00:21:55,919 --> 00:22:01,840
profile guided optimization that

514
00:21:58,480 --> 00:22:05,039
arranges your code according to the

515
00:22:01,840 --> 00:22:07,039
profile profile data so if you spend

516
00:22:05,039 --> 00:22:08,960
very little time in

517
00:22:07,039 --> 00:22:11,120
hpm

518
00:22:08,960 --> 00:22:13,840
method passing

519
00:22:11,120 --> 00:22:17,200
and spend more time on your rain passing

520
00:22:13,840 --> 00:22:20,400
than um profile guy that optimization

521
00:22:17,200 --> 00:22:23,120
will place uh your rain passing uh

522
00:22:20,400 --> 00:22:27,120
before the method by passing and that's

523
00:22:23,120 --> 00:22:30,080
wrong because we always have to pass um

524
00:22:27,120 --> 00:22:33,280
hp method before the url

525
00:22:30,080 --> 00:22:37,120
and this um case when uh profile guide

526
00:22:33,280 --> 00:22:40,320
optimization also not helpful because um

527
00:22:37,120 --> 00:22:42,799
it's not aware about the input data

528
00:22:40,320 --> 00:22:43,919
semantics

529
00:22:42,799 --> 00:22:46,720
um

530
00:22:43,919 --> 00:22:49,919
this uh skip it

531
00:22:46,720 --> 00:22:51,760
says about alignment

532
00:22:49,919 --> 00:22:52,840
so

533
00:22:51,760 --> 00:22:54,320
actually

534
00:22:52,840 --> 00:22:57,600
we uh

535
00:22:54,320 --> 00:23:01,440
tried to experiment with uh data

536
00:22:57,600 --> 00:23:02,480
alignment for example this uh checked

537
00:23:01,440 --> 00:23:05,600
access

538
00:23:02,480 --> 00:23:09,679
for example if we

539
00:23:05,600 --> 00:23:12,880
convert a string to

540
00:23:09,679 --> 00:23:16,640
integer this known penalty to access

541
00:23:12,880 --> 00:23:18,960
integer uh non-aggregate integer and in

542
00:23:16,640 --> 00:23:20,640
a very micro benchmark you might see

543
00:23:18,960 --> 00:23:22,880
that almost

544
00:23:20,640 --> 00:23:25,919
more than two times on regular

545
00:23:22,880 --> 00:23:29,280
accessories slower than lignin however

546
00:23:25,919 --> 00:23:30,480
if we try to encode this uh directly in

547
00:23:29,280 --> 00:23:33,520
um

548
00:23:30,480 --> 00:23:37,200
in the past we won't see any performance

549
00:23:33,520 --> 00:23:40,320
improvement and the reason uh for that

550
00:23:37,200 --> 00:23:43,520
uh this actually is that uh anaerobic

551
00:23:40,320 --> 00:23:45,200
access is bit uh faster than alignment

552
00:23:43,520 --> 00:23:47,200
the reason that

553
00:23:45,200 --> 00:23:49,039
the additional code makes the

554
00:23:47,200 --> 00:23:53,600
uh code

555
00:23:49,039 --> 00:23:56,080
larger more sophisticated and

556
00:23:53,600 --> 00:23:58,000
since you need to

557
00:23:56,080 --> 00:23:59,360
align the diet in

558
00:23:58,000 --> 00:24:01,679
every

559
00:23:59,360 --> 00:24:04,720
string comparison the award of such

560
00:24:01,679 --> 00:24:08,240
places around the past and the password

561
00:24:04,720 --> 00:24:10,400
is significant so um the

562
00:24:08,240 --> 00:24:12,799
big bigger code and more

563
00:24:10,400 --> 00:24:16,880
complex code

564
00:24:12,799 --> 00:24:16,880
may make the password slower

565
00:24:17,279 --> 00:24:22,559
next thing is about

566
00:24:19,679 --> 00:24:25,919
optimization uh compiler optimization

567
00:24:22,559 --> 00:24:27,520
this size shows uh difference between o2

568
00:24:25,919 --> 00:24:31,039
and o3

569
00:24:27,520 --> 00:24:33,919
and those three uh adds um

570
00:24:31,039 --> 00:24:36,159
particular flags for

571
00:24:33,919 --> 00:24:38,640
uh optimization level

572
00:24:36,159 --> 00:24:41,440
and not all the flux

573
00:24:38,640 --> 00:24:42,320
uh improves uh performance in that there

574
00:24:41,440 --> 00:24:44,000
are

575
00:24:42,320 --> 00:24:46,960
um

576
00:24:44,000 --> 00:24:49,760
the shown facts which makes the

577
00:24:46,960 --> 00:24:50,799
performance works actually done

578
00:24:49,760 --> 00:24:53,840
better

579
00:24:50,799 --> 00:24:56,919
and we found that for http passer

580
00:24:53,840 --> 00:25:00,960
all three uh generates

581
00:24:56,919 --> 00:25:03,120
volkswagen o2

582
00:25:00,960 --> 00:25:04,400
the

583
00:25:03,120 --> 00:25:05,520
problem

584
00:25:04,400 --> 00:25:08,559
uh

585
00:25:05,520 --> 00:25:12,559
problem options are green and they're

586
00:25:08,559 --> 00:25:14,880
about after vectorization let's

587
00:25:12,559 --> 00:25:17,679
let's dig deeper into after

588
00:25:14,880 --> 00:25:20,320
vectorization after vectorization is

589
00:25:17,679 --> 00:25:23,039
about single instruction multiple data

590
00:25:20,320 --> 00:25:25,679
so if you have uh

591
00:25:23,039 --> 00:25:28,480
computation for example it's on the

592
00:25:25,679 --> 00:25:31,440
uh picture on the slide if we have

593
00:25:28,480 --> 00:25:32,640
uh two arrays of form elements and we

594
00:25:31,440 --> 00:25:35,360
need to

595
00:25:32,640 --> 00:25:38,480
add one race to another we can do this

596
00:25:35,360 --> 00:25:41,440
in a single instruction with multiple

597
00:25:38,480 --> 00:25:44,480
data however instructions while

598
00:25:41,440 --> 00:25:47,440
instruction can operate like up to 32

599
00:25:44,480 --> 00:25:50,080
appearance uh in the time

600
00:25:47,440 --> 00:25:53,039
they're not for free first of all they

601
00:25:50,080 --> 00:25:56,640
require a ligament they also have

602
00:25:53,039 --> 00:26:00,000
additional latency as we saw in mci

603
00:25:56,640 --> 00:26:06,000
output and actually not everything is

604
00:26:00,000 --> 00:26:08,480
after vitalized this good talk from um

605
00:26:06,000 --> 00:26:12,320
an ibm developer about this after

606
00:26:08,480 --> 00:26:14,799
virtualization internals in lvm so if

607
00:26:12,320 --> 00:26:16,400
you're interested about thanos i highly

608
00:26:14,799 --> 00:26:19,600
recommend you to

609
00:26:16,400 --> 00:26:20,480
watch the presentation

610
00:26:19,600 --> 00:26:23,120
um

611
00:26:20,480 --> 00:26:24,000
linux kernel

612
00:26:23,120 --> 00:26:26,960
it's

613
00:26:24,000 --> 00:26:28,799
known as uh optimized

614
00:26:26,960 --> 00:26:31,360
part of

615
00:26:28,799 --> 00:26:34,159
the optimizer software however it does

616
00:26:31,360 --> 00:26:35,600
not use after vectorization

617
00:26:34,159 --> 00:26:36,840
and uh

618
00:26:35,600 --> 00:26:40,799
actually

619
00:26:36,840 --> 00:26:42,480
the canon doesn't uh tries to avoid

620
00:26:40,799 --> 00:26:43,600
single instruction multiple data as

621
00:26:42,480 --> 00:26:46,880
possible

622
00:26:43,600 --> 00:26:49,279
uh actually some code uh which ldm

623
00:26:46,880 --> 00:26:53,600
benefits from uh single search multiple

624
00:26:49,279 --> 00:26:55,200
data does uh use it like crypto api

625
00:26:53,600 --> 00:26:57,600
uh however

626
00:26:55,200 --> 00:26:59,679
uh single search multiple data is

627
00:26:57,600 --> 00:27:04,799
managed by separate

628
00:26:59,679 --> 00:27:06,640
cpu unit as a 14 point unit and it has a

629
00:27:04,799 --> 00:27:08,240
separate contacts

630
00:27:06,640 --> 00:27:12,159
so um

631
00:27:08,240 --> 00:27:15,360
to be able to use uh fpu inside the

632
00:27:12,159 --> 00:27:18,000
kernel uh we need to store

633
00:27:15,360 --> 00:27:22,240
fpu contacts and restore this

634
00:27:18,000 --> 00:27:26,399
optimization for the linux kernel so if

635
00:27:22,240 --> 00:27:29,039
well um the modern user space code

636
00:27:26,399 --> 00:27:31,360
is almost every uh user space poker

637
00:27:29,039 --> 00:27:32,640
which you run is after with the after

638
00:27:31,360 --> 00:27:33,520
vectorizer

639
00:27:32,640 --> 00:27:36,880
so

640
00:27:33,520 --> 00:27:38,320
having that uh cannot trust optimizes

641
00:27:36,880 --> 00:27:41,279
and not to

642
00:27:38,320 --> 00:27:45,919
uh spend resources on saving and

643
00:27:41,279 --> 00:27:47,840
restoring uh fpu state for each uh

644
00:27:45,919 --> 00:27:50,159
contact switch between the user space

645
00:27:47,840 --> 00:27:51,120
and the kernel space and only when you

646
00:27:50,159 --> 00:27:53,120
need to

647
00:27:51,120 --> 00:27:56,159
to use a single structure multiple data

648
00:27:53,120 --> 00:27:59,360
inside the kernel uh we call the kind of

649
00:27:56,159 --> 00:28:02,240
you begin and cannot be you end and they

650
00:27:59,360 --> 00:28:05,600
are actually very costly this

651
00:28:02,240 --> 00:28:09,039
comparison uh between uh

652
00:28:05,600 --> 00:28:10,640
when mem cpi and the meme cpi protected

653
00:28:09,039 --> 00:28:13,840
by kind of pu

654
00:28:10,640 --> 00:28:17,520
are beginning kind of premium end

655
00:28:13,840 --> 00:28:21,279
uh however in team we tried to

656
00:28:17,520 --> 00:28:24,000
play with after vectorization and we had

657
00:28:21,279 --> 00:28:25,520
an idea uh that probably we might

658
00:28:24,000 --> 00:28:30,399
benefit from

659
00:28:25,520 --> 00:28:32,080
um after equalization of hp passer and

660
00:28:30,399 --> 00:28:33,200
you have a look what

661
00:28:32,080 --> 00:28:34,799
was

662
00:28:33,200 --> 00:28:35,600
going on

663
00:28:34,799 --> 00:28:37,679
uh

664
00:28:35,600 --> 00:28:38,880
there are

665
00:28:37,679 --> 00:28:40,720
several

666
00:28:38,880 --> 00:28:44,799
uh documentation about after

667
00:28:40,720 --> 00:28:47,279
vectorization the first one is about gcf

668
00:28:44,799 --> 00:28:49,919
vectorization and also this very good

669
00:28:47,279 --> 00:28:54,240
docs for real vm

670
00:28:49,919 --> 00:28:56,640
uh vm docs are much more readable about

671
00:28:54,240 --> 00:28:59,360
after week civilization stuff

672
00:28:56,640 --> 00:29:01,919
basically we have two types of after

673
00:28:59,360 --> 00:29:04,720
vectorization the first one is loop

674
00:29:01,919 --> 00:29:07,279
after vectorization and the second one

675
00:29:04,720 --> 00:29:09,760
is so for basic block after

676
00:29:07,279 --> 00:29:11,360
vectorization or

677
00:29:09,760 --> 00:29:15,120
slp

678
00:29:11,360 --> 00:29:18,559
also jc provides a different

679
00:29:15,120 --> 00:29:20,399
cost model for after vectorization and

680
00:29:18,559 --> 00:29:22,159
this is

681
00:29:20,399 --> 00:29:24,559
well since

682
00:29:22,159 --> 00:29:27,279
single structure multiple data isn't for

683
00:29:24,559 --> 00:29:29,440
free then compiler uh

684
00:29:27,279 --> 00:29:30,320
lets you choose pc5

685
00:29:29,440 --> 00:29:34,000
uh

686
00:29:30,320 --> 00:29:36,640
how how um whether do you want

687
00:29:34,000 --> 00:29:40,080
um single instruction multiple data in

688
00:29:36,640 --> 00:29:42,080
many cases or not like

689
00:29:40,080 --> 00:29:43,360
how aggressive a single instruction

690
00:29:42,080 --> 00:29:47,200
multiple

691
00:29:43,360 --> 00:29:48,559
after vectorization should behave

692
00:29:47,200 --> 00:29:50,480
um

693
00:29:48,559 --> 00:29:52,960
there are several

694
00:29:50,480 --> 00:29:55,679
requirements for code to be artificially

695
00:29:52,960 --> 00:29:57,600
realizable uh so

696
00:29:55,679 --> 00:29:59,360
first you need uh to

697
00:29:57,600 --> 00:30:02,480
have instructions which are after

698
00:29:59,360 --> 00:30:05,200
vectorizable unique diligent uh you need

699
00:30:02,480 --> 00:30:06,840
no uh data dependence uh between the

700
00:30:05,200 --> 00:30:10,799
code

701
00:30:06,840 --> 00:30:14,399
and preferably to have countable on long

702
00:30:10,799 --> 00:30:16,559
groups however modern compilers can um

703
00:30:14,399 --> 00:30:19,600
kind of after victories actually

704
00:30:16,559 --> 00:30:19,600
uncountable loops

705
00:30:20,080 --> 00:30:23,760
let's

706
00:30:21,120 --> 00:30:27,120
see some examples

707
00:30:23,760 --> 00:30:29,840
uh first of all i have

708
00:30:27,120 --> 00:30:31,039
examples of two programs which are good

709
00:30:29,840 --> 00:30:32,960
to

710
00:30:31,039 --> 00:30:35,200
consider

711
00:30:32,960 --> 00:30:39,520
the first one is

712
00:30:35,200 --> 00:30:42,480
for basic box after exercise this um

713
00:30:39,520 --> 00:30:46,720
simple pattern so if you see that there

714
00:30:42,480 --> 00:30:49,039
are arrays of uniform um uniform

715
00:30:46,720 --> 00:30:52,559
operation so in this case we have

716
00:30:49,039 --> 00:30:55,760
h equal operations on almost equal

717
00:30:52,559 --> 00:30:57,440
operation on the way this almost no uh

718
00:30:55,760 --> 00:31:00,080
data dependency

719
00:30:57,440 --> 00:31:02,480
uh well this um

720
00:31:00,080 --> 00:31:04,880
this actual dependency between

721
00:31:02,480 --> 00:31:06,880
uh different values

722
00:31:04,880 --> 00:31:10,159
but um

723
00:31:06,880 --> 00:31:11,919
in this one we have almost

724
00:31:10,159 --> 00:31:12,720
no dependency

725
00:31:11,919 --> 00:31:13,840
uh

726
00:31:12,720 --> 00:31:16,840
anyway

727
00:31:13,840 --> 00:31:21,120
uh one once which is that you do very

728
00:31:16,840 --> 00:31:24,559
very very similar operations uh in

729
00:31:21,120 --> 00:31:26,159
basic works then you might have ideas as

730
00:31:24,559 --> 00:31:28,640
this code could be

731
00:31:26,159 --> 00:31:31,360
uh after victorianized

732
00:31:28,640 --> 00:31:33,200
let's see um

733
00:31:31,360 --> 00:31:36,480
another example

734
00:31:33,200 --> 00:31:37,519
uh this about loop also in this loop we

735
00:31:36,480 --> 00:31:39,519
have

736
00:31:37,519 --> 00:31:41,279
very similar

737
00:31:39,519 --> 00:31:42,960
uh operations

738
00:31:41,279 --> 00:31:44,799
uh on

739
00:31:42,960 --> 00:31:46,880
number of

740
00:31:44,799 --> 00:31:50,480
appearance so that's also a good

741
00:31:46,880 --> 00:31:53,360
signature for to be artificialized

742
00:31:50,480 --> 00:31:53,360
uh well

743
00:31:54,320 --> 00:31:56,640
so

744
00:31:57,279 --> 00:32:02,000
let's um

745
00:31:59,600 --> 00:32:04,640
let's see or what we want to have with

746
00:32:02,000 --> 00:32:08,799
um after vectorization

747
00:32:04,640 --> 00:32:10,559
uh first of all uh let me to

748
00:32:08,799 --> 00:32:11,600
compile the

749
00:32:10,559 --> 00:32:15,000
basic

750
00:32:11,600 --> 00:32:15,000
basic box

751
00:32:17,600 --> 00:32:20,600
um

752
00:32:33,679 --> 00:32:36,679
uh

753
00:32:46,960 --> 00:32:51,679
so

754
00:32:47,840 --> 00:32:55,440
uh in this um in this comment we i used

755
00:32:51,679 --> 00:32:57,440
uh offline optimization for gc i also

756
00:32:55,440 --> 00:32:59,760
adoptions to see

757
00:32:57,440 --> 00:33:03,039
whether something is optimized and we

758
00:32:59,760 --> 00:33:06,159
see that compiler says that we

759
00:33:03,039 --> 00:33:09,200
have optimization and also i use the

760
00:33:06,159 --> 00:33:10,960
strongest uh consistent uh course model

761
00:33:09,200 --> 00:33:13,279
is unlimited

762
00:33:10,960 --> 00:33:17,120
um in

763
00:33:13,279 --> 00:33:18,000
song let's have a look what we

764
00:33:17,120 --> 00:33:19,679
uh

765
00:33:18,000 --> 00:33:21,840
what um

766
00:33:19,679 --> 00:33:21,840
no

767
00:33:22,880 --> 00:33:25,840
let's um

768
00:33:26,240 --> 00:33:31,760
let's uh compile

769
00:33:29,039 --> 00:33:32,799
first let's compile with uh

770
00:33:31,760 --> 00:33:34,799
um

771
00:33:32,799 --> 00:33:36,799
bb in

772
00:33:34,799 --> 00:33:39,799
different versions

773
00:33:36,799 --> 00:33:39,799
um

774
00:33:42,399 --> 00:33:48,480
so i have two versions of the program

775
00:33:45,519 --> 00:33:50,399
uh the first one is

776
00:33:48,480 --> 00:33:53,760
plain

777
00:33:50,399 --> 00:33:56,080
plain program so we have uh basic

778
00:33:53,760 --> 00:33:59,519
function uh this by the way the

779
00:33:56,080 --> 00:34:01,919
cryptography function and the next uh

780
00:33:59,519 --> 00:34:05,039
program i have a restrict

781
00:34:01,919 --> 00:34:06,640
access and i have a lignite uh hints for

782
00:34:05,039 --> 00:34:08,399
the compiler that

783
00:34:06,640 --> 00:34:11,040
uh the data

784
00:34:08,399 --> 00:34:13,359
passed to the function will be aligned

785
00:34:11,040 --> 00:34:16,159
and let's compare the assembly which

786
00:34:13,359 --> 00:34:19,159
will be generated for both of them

787
00:34:16,159 --> 00:34:19,159
functions

788
00:34:38,639 --> 00:34:43,200
here we go

789
00:34:40,000 --> 00:34:44,720
uh so the first thing which we might

790
00:34:43,200 --> 00:34:45,839
notice is that

791
00:34:44,720 --> 00:34:47,280
um

792
00:34:45,839 --> 00:34:51,200
non-malignant

793
00:34:47,280 --> 00:34:54,560
version generates code with a suffix hue

794
00:34:51,200 --> 00:34:55,679
this move the q u is for an alignment

795
00:34:54,560 --> 00:34:58,960
while

796
00:34:55,679 --> 00:35:01,200
qa is for lignite access

797
00:34:58,960 --> 00:35:03,920
what it means

798
00:35:01,200 --> 00:35:05,119
to understand whether we have some

799
00:35:03,920 --> 00:35:07,280
performance

800
00:35:05,119 --> 00:35:08,640
uh improvement for for the function

801
00:35:07,280 --> 00:35:10,800
let's compare

802
00:35:08,640 --> 00:35:12,560
let's um

803
00:35:10,800 --> 00:35:16,720
see on uh

804
00:35:12,560 --> 00:35:20,079
internal uh documentation from both the

805
00:35:16,720 --> 00:35:22,480
instruction this um made the qa

806
00:35:20,079 --> 00:35:25,200
instruction and we have that on my

807
00:35:22,480 --> 00:35:28,000
architecture skyway we have a latency

808
00:35:25,200 --> 00:35:29,760
six and four put uh

809
00:35:28,000 --> 00:35:31,119
0.5

810
00:35:29,760 --> 00:35:33,680
for

811
00:35:31,119 --> 00:35:35,520
alignment access we have the same data

812
00:35:33,680 --> 00:35:38,000
so basically

813
00:35:35,520 --> 00:35:40,560
this instruction are pretty the same and

814
00:35:38,000 --> 00:35:42,560
we don't win any performance for

815
00:35:40,560 --> 00:35:45,680
from using colligate data

816
00:35:42,560 --> 00:35:48,240
i also checked the different uh

817
00:35:45,680 --> 00:35:51,119
alignment instruction and

818
00:35:48,240 --> 00:35:52,240
basically modern uh inter architectures

819
00:35:51,119 --> 00:35:55,040
um

820
00:35:52,240 --> 00:35:56,880
in most of the instruction uh provides

821
00:35:55,040 --> 00:36:00,160
the same performance

822
00:35:56,880 --> 00:36:02,400
also this almost um

823
00:36:00,160 --> 00:36:06,000
equal instructions for

824
00:36:02,400 --> 00:36:08,640
um research success and in this

825
00:36:06,000 --> 00:36:11,920
particular case it seems we don't

826
00:36:08,640 --> 00:36:13,359
have much uh performance benefit from a

827
00:36:11,920 --> 00:36:16,000
restrict

828
00:36:13,359 --> 00:36:16,000
keyword

829
00:36:17,440 --> 00:36:21,599
that's

830
00:36:18,720 --> 00:36:21,599
back to them

831
00:36:21,839 --> 00:36:27,760
well let's have a look

832
00:36:24,320 --> 00:36:30,880
uh one a couple of more examples

833
00:36:27,760 --> 00:36:34,000
uh fortune to the um

834
00:36:30,880 --> 00:36:35,040
modern couple compiler is also able to

835
00:36:34,000 --> 00:36:36,400
um

836
00:36:35,040 --> 00:36:41,280
uh

837
00:36:36,400 --> 00:36:44,160
uh to optimize uh ship was plus sgl

838
00:36:41,280 --> 00:36:47,520
uh algorithms and in this

839
00:36:44,160 --> 00:36:50,720
particular case i used um

840
00:36:47,520 --> 00:36:52,240
uh i used programs from

841
00:36:50,720 --> 00:36:55,760
this blog post

842
00:36:52,240 --> 00:36:57,280
which is that uh actually

843
00:36:55,760 --> 00:36:58,880
we

844
00:36:57,280 --> 00:37:04,160
we see some

845
00:36:58,880 --> 00:37:06,960
uh some vectorization if we add um

846
00:37:04,160 --> 00:37:09,119
if we want to compile them to

847
00:37:06,960 --> 00:37:12,000
generate vxr2

848
00:37:09,119 --> 00:37:16,000
then we see that um

849
00:37:12,000 --> 00:37:17,599
compiler might generate even uh 32 uh

850
00:37:16,000 --> 00:37:21,200
byte

851
00:37:17,599 --> 00:37:24,480
code from the after vectorization

852
00:37:21,200 --> 00:37:27,040
uh more algorithms of foster which are

853
00:37:24,480 --> 00:37:29,200
vectorizable you can find on this uh

854
00:37:27,040 --> 00:37:30,400
blog post and that's not

855
00:37:29,200 --> 00:37:32,560
all them

856
00:37:30,400 --> 00:37:35,440
uh not all the algorithms are actually

857
00:37:32,560 --> 00:37:37,440
artifactorizable

858
00:37:35,440 --> 00:37:41,280
however

859
00:37:37,440 --> 00:37:44,400
uh with uh clunk this um

860
00:37:41,280 --> 00:37:45,839
uh well i didn't show um

861
00:37:44,400 --> 00:37:47,040
performance for

862
00:37:45,839 --> 00:37:48,800
uh

863
00:37:47,040 --> 00:37:54,000
for for loop

864
00:37:48,800 --> 00:37:57,680
uh but um clunk actually are not able to

865
00:37:54,000 --> 00:37:58,560
uh read uh to make reduce reduction

866
00:37:57,680 --> 00:38:01,200
for

867
00:37:58,560 --> 00:38:04,720
uh memory appearance

868
00:38:01,200 --> 00:38:05,839
um oh no um we

869
00:38:04,720 --> 00:38:09,040
end up

870
00:38:05,839 --> 00:38:12,320
not to use after vectorization for hpm

871
00:38:09,040 --> 00:38:16,079
passer and elliptic of cryptography

872
00:38:12,320 --> 00:38:16,079
for hp pulse we

873
00:38:16,400 --> 00:38:23,599
we have only a small optimization for

874
00:38:20,240 --> 00:38:24,800
uh basic blocks because as we saw the

875
00:38:23,599 --> 00:38:28,079
um

876
00:38:24,800 --> 00:38:31,040
state machine as with uh switch uh

877
00:38:28,079 --> 00:38:35,359
statements or go to drive and don't have

878
00:38:31,040 --> 00:38:37,760
much uh vectorizable uh operations like

879
00:38:35,359 --> 00:38:40,320
operation of the same kind also there

880
00:38:37,760 --> 00:38:43,200
are no small loops which are

881
00:38:40,320 --> 00:38:44,240
vectorizable for

882
00:38:43,200 --> 00:38:47,280
uh

883
00:38:44,240 --> 00:38:49,760
which are vectorizable so hp pulsar is

884
00:38:47,280 --> 00:38:52,960
uh just not the kind of code which can

885
00:38:49,760 --> 00:38:54,720
be utilizable elliptic

886
00:38:52,960 --> 00:38:57,520
cryptography is actually a good

887
00:38:54,720 --> 00:39:00,960
candidate for after utilization and if

888
00:38:57,520 --> 00:39:01,599
you program math in here she postpones

889
00:39:00,960 --> 00:39:03,520
and

890
00:39:01,599 --> 00:39:06,800
typically benefit from

891
00:39:03,520 --> 00:39:09,839
art vectorization however in our case as

892
00:39:06,800 --> 00:39:13,040
well as for other cryptography libraries

893
00:39:09,839 --> 00:39:15,359
the most of the mathematics is written

894
00:39:13,040 --> 00:39:18,480
in already assembly

895
00:39:15,359 --> 00:39:20,160
and the things is that um

896
00:39:18,480 --> 00:39:23,599
we also didn't

897
00:39:20,160 --> 00:39:25,599
uh go with after exterization form of

898
00:39:23,599 --> 00:39:29,280
mathematics because

899
00:39:25,599 --> 00:39:31,680
there's additional latency introduced by

900
00:39:29,280 --> 00:39:35,680
uh some of this single search multiple

901
00:39:31,680 --> 00:39:38,079
data and also we need to uh care about

902
00:39:35,680 --> 00:39:41,040
uh alignment because scanner by default

903
00:39:38,079 --> 00:39:42,800
is a rigorous for eight bytes while for

904
00:39:41,040 --> 00:39:46,000
single selection multiple data we need a

905
00:39:42,800 --> 00:39:47,280
ligament for 52 bytes

906
00:39:46,000 --> 00:39:49,359
um

907
00:39:47,280 --> 00:39:51,280
also the

908
00:39:49,359 --> 00:39:53,119
specific

909
00:39:51,280 --> 00:39:55,520
uh specific

910
00:39:53,119 --> 00:39:56,960
considerations about legacy code for

911
00:39:55,520 --> 00:39:58,960
example if you

912
00:39:56,960 --> 00:40:02,079
might have a old

913
00:39:58,960 --> 00:40:04,160
code which is sc only enable you need to

914
00:40:02,079 --> 00:40:06,319
use with zero update by default

915
00:40:04,160 --> 00:40:07,599
compilers are used with zero

916
00:40:06,319 --> 00:40:10,880
uh

917
00:40:07,599 --> 00:40:15,359
instructions and we can make the code

918
00:40:10,880 --> 00:40:18,960
faster if we uh remove uh visible we can

919
00:40:15,359 --> 00:40:19,839
get up to five performance uh better

920
00:40:18,960 --> 00:40:21,839
uh

921
00:40:19,839 --> 00:40:24,079
now we have

922
00:40:21,839 --> 00:40:25,520
explanation why don't we use after

923
00:40:24,079 --> 00:40:28,400
vectorization

924
00:40:25,520 --> 00:40:29,599
and why we record mathematics in

925
00:40:28,400 --> 00:40:31,280
assembly

926
00:40:29,599 --> 00:40:33,440
uh actually

927
00:40:31,280 --> 00:40:36,240
elliptic off and those are cryptography

928
00:40:33,440 --> 00:40:38,480
used with very long integers

929
00:40:36,240 --> 00:40:42,800
like integers of

930
00:40:38,480 --> 00:40:47,599
two four six maybe more limbs limp is

931
00:40:42,800 --> 00:40:50,640
usually unsigned wonk so for our 256 uh

932
00:40:47,599 --> 00:40:53,280
bits arithmetic we have a volumes uh

933
00:40:50,640 --> 00:40:53,280
four enzymes

934
00:40:59,960 --> 00:41:05,599
foreign from linux but

935
00:41:03,680 --> 00:41:08,400
all the libraries

936
00:41:05,599 --> 00:41:10,720
move from the abstract

937
00:41:08,400 --> 00:41:11,440
mpi representation to

938
00:41:10,720 --> 00:41:15,359
go

939
00:41:11,440 --> 00:41:18,800
always so if you can see the

940
00:41:15,359 --> 00:41:21,839
c function we use with two limb

941
00:41:18,800 --> 00:41:21,839
arithmetic so we have

942
00:41:22,160 --> 00:41:25,359
number a and number b

943
00:41:24,400 --> 00:41:28,880
ah

944
00:41:25,359 --> 00:41:30,560
each of size two limbs and we need to

945
00:41:28,880 --> 00:41:33,119
sum them

946
00:41:30,560 --> 00:41:36,880
you might spend

947
00:41:33,119 --> 00:41:39,280
some time to realize whether the smp add

948
00:41:36,880 --> 00:41:43,599
function is correct because we need to

949
00:41:39,280 --> 00:41:48,000
do uh some uh code with uh carry

950
00:41:43,599 --> 00:41:49,280
uh however if we use assembly we can not

951
00:41:48,000 --> 00:41:51,920
only

952
00:41:49,280 --> 00:41:53,119
make the code

953
00:41:51,920 --> 00:41:54,160
easier

954
00:41:53,119 --> 00:41:55,760
using

955
00:41:54,160 --> 00:41:58,960
carrier

956
00:41:55,760 --> 00:42:00,400
carrier where additional but also we can

957
00:41:58,960 --> 00:42:04,240
avoid

958
00:42:00,400 --> 00:42:05,760
branching in the code also cpus provide

959
00:42:04,240 --> 00:42:07,440
uh additional

960
00:42:05,760 --> 00:42:10,640
uh

961
00:42:07,440 --> 00:42:12,079
additional instruction to parallelize uh

962
00:42:10,640 --> 00:42:12,800
two uh

963
00:42:12,079 --> 00:42:16,720
two

964
00:42:12,800 --> 00:42:18,400
additional chains uh thanks to uh

965
00:42:16,720 --> 00:42:20,000
edx

966
00:42:18,400 --> 00:42:23,440
instruction set

967
00:42:20,000 --> 00:42:26,160
so in assembly we can uh

968
00:42:23,440 --> 00:42:29,599
do much more efficient

969
00:42:26,160 --> 00:42:32,560
uh efficient uh programming the last

970
00:42:29,599 --> 00:42:36,160
thing which i want to to mention

971
00:42:32,560 --> 00:42:37,680
is a problem which we recently faced on

972
00:42:36,160 --> 00:42:40,079
linux system

973
00:42:37,680 --> 00:42:41,440
and if your

974
00:42:40,079 --> 00:42:45,440
program uses

975
00:42:41,440 --> 00:42:46,880
spin locks in user space then you might

976
00:42:45,440 --> 00:42:47,920
see that

977
00:42:46,880 --> 00:42:49,119
if

978
00:42:47,920 --> 00:42:51,680
one thread

979
00:42:49,119 --> 00:42:55,440
acquires the spin lock and another

980
00:42:51,680 --> 00:42:58,960
thread is spinning and waiting for this

981
00:42:55,440 --> 00:43:00,480
spin walk so the waiting thread

982
00:42:58,960 --> 00:43:04,560
uh quickly

983
00:43:00,480 --> 00:43:05,440
uh just um spins in a bathroom loop

984
00:43:04,560 --> 00:43:08,000
and

985
00:43:05,440 --> 00:43:10,800
such processes which just

986
00:43:08,000 --> 00:43:12,400
balance cpu are considered as batch

987
00:43:10,800 --> 00:43:16,400
process buzz

988
00:43:12,400 --> 00:43:18,880
linux cpu scheduler i suppose there are

989
00:43:16,400 --> 00:43:21,440
interactive processors which are needed

990
00:43:18,880 --> 00:43:23,119
sometime just periodically and your

991
00:43:21,440 --> 00:43:25,599
monitoring system

992
00:43:23,119 --> 00:43:28,160
is typically

993
00:43:25,599 --> 00:43:30,720
considered by the system schedule as an

994
00:43:28,160 --> 00:43:34,079
interactive uh process because it's uh

995
00:43:30,720 --> 00:43:35,520
needs a speed timely only from time to

996
00:43:34,079 --> 00:43:36,400
time

997
00:43:35,520 --> 00:43:37,839
so

998
00:43:36,400 --> 00:43:39,200
in this

999
00:43:37,839 --> 00:43:41,599
case we

1000
00:43:39,200 --> 00:43:43,359
can use a perscad

1001
00:43:41,599 --> 00:43:45,280
time hist

1002
00:43:43,359 --> 00:43:49,839
to to

1003
00:43:45,280 --> 00:43:53,200
profile the system schedule and in this

1004
00:43:49,839 --> 00:43:55,599
in this case we saw that this

1005
00:43:53,200 --> 00:43:57,599
performance critical waiting process

1006
00:43:55,599 --> 00:43:58,800
which is spinning in

1007
00:43:57,599 --> 00:44:01,680
spin lock

1008
00:43:58,800 --> 00:44:05,440
and it's uh some point

1009
00:44:01,680 --> 00:44:08,960
uh while monitoring uh took the

1010
00:44:05,440 --> 00:44:12,319
cpu uh time and wake awakened the

1011
00:44:08,960 --> 00:44:15,680
process the process actually uh

1012
00:44:12,319 --> 00:44:19,520
received that the cpu time almost

1013
00:44:15,680 --> 00:44:22,079
100 milliseconds when it was awakened

1014
00:44:19,520 --> 00:44:23,359
and during that time another processes

1015
00:44:22,079 --> 00:44:27,599
uh

1016
00:44:23,359 --> 00:44:29,119
took place on the cpu uh so if you use

1017
00:44:27,599 --> 00:44:32,000
um

1018
00:44:29,119 --> 00:44:34,480
uh spin locks in user space in user

1019
00:44:32,000 --> 00:44:39,119
space we don't have primition control as

1020
00:44:34,480 --> 00:44:44,560
we have on in the candle space then uh

1021
00:44:39,119 --> 00:44:46,640
you you can face a huge tail latencies

1022
00:44:44,560 --> 00:44:48,240
well um that's

1023
00:44:46,640 --> 00:44:49,520
all um

1024
00:44:48,240 --> 00:44:52,560
what i had

1025
00:44:49,520 --> 00:44:53,760
so probably we have some time for

1026
00:44:52,560 --> 00:44:55,760
questions

1027
00:44:53,760 --> 00:44:58,000
so

1028
00:44:55,760 --> 00:44:58,000
our

1029
00:44:58,880 --> 00:45:04,880
in fact we're already over time

1030
00:45:01,119 --> 00:45:07,440
alexander there are two questions in the

1031
00:45:04,880 --> 00:45:09,440
question section on venulis

1032
00:45:07,440 --> 00:45:12,319
it would be good to

1033
00:45:09,440 --> 00:45:14,079
address those maybe in the chat

1034
00:45:12,319 --> 00:45:15,920
okay i will

1035
00:45:14,079 --> 00:45:18,560
thank you very much okay thank you very

1036
00:45:15,920 --> 00:45:20,640
much thank you okay everybody so now we

1037
00:45:18,560 --> 00:45:23,520
will have the tea break afternoon tea

1038
00:45:20,640 --> 00:45:24,720
break and back here

1039
00:45:23,520 --> 00:45:29,400
shortly

1040
00:45:24,720 --> 00:45:29,400
okay thank you bye thank you