1 00:00:02,240 --> 00:00:05,240 foreign 2 00:00:06,320 --> 00:00:11,499 [Music] 3 00:00:16,240 --> 00:00:21,119 welcome back to our next talk 4 00:00:18,560 --> 00:00:24,160 uh which will be on what c and c plus 5 00:00:21,119 --> 00:00:26,640 can do and when do you need assembly 6 00:00:24,160 --> 00:00:29,840 this talk will be given by alexander 7 00:00:26,640 --> 00:00:32,800 krizanovsky who is the ceo of tempesta 8 00:00:29,840 --> 00:00:35,280 technologies and the architect of 9 00:00:32,800 --> 00:00:37,760 tempesta fw a high performance open 10 00:00:35,280 --> 00:00:38,879 source linux application delivery 11 00:00:37,760 --> 00:00:41,120 controller 12 00:00:38,879 --> 00:00:43,040 alexander is responsible for the design 13 00:00:41,120 --> 00:00:45,520 and performance of several products in 14 00:00:43,040 --> 00:00:47,280 the areas of network traffic processing 15 00:00:45,520 --> 00:00:49,520 and databases 16 00:00:47,280 --> 00:00:51,760 he designed the core architecture of a 17 00:00:49,520 --> 00:00:54,480 web application firewall 18 00:00:51,760 --> 00:00:58,800 mentioned in the gartner magic quadrant 19 00:00:54,480 --> 00:01:01,359 maria db temporal data tables uh with 20 00:00:58,800 --> 00:01:04,879 that over to you alexander 21 00:01:01,359 --> 00:01:05,760 yeah uh do you see my screen shell 22 00:01:04,879 --> 00:01:08,640 yes 23 00:01:05,760 --> 00:01:09,920 i can go here um 24 00:01:08,640 --> 00:01:11,200 okay okay 25 00:01:09,920 --> 00:01:12,640 so 26 00:01:11,200 --> 00:01:15,520 first first of all 27 00:01:12,640 --> 00:01:18,320 we do a custom software development in 28 00:01:15,520 --> 00:01:21,600 network security and that storage areas 29 00:01:18,320 --> 00:01:24,000 in particular we um from time to time we 30 00:01:21,600 --> 00:01:27,360 face tasks like to develop very fast 31 00:01:24,000 --> 00:01:30,240 lego expressions engine about scaling 32 00:01:27,360 --> 00:01:32,159 data structure on multi-core systems 33 00:01:30,240 --> 00:01:34,560 various performance optimization and so 34 00:01:32,159 --> 00:01:36,880 on so typically we work in 35 00:01:34,560 --> 00:01:39,439 high performance area in network 36 00:01:36,880 --> 00:01:42,159 security and data storages 37 00:01:39,439 --> 00:01:44,960 also we develop our own product investor 38 00:01:42,159 --> 00:01:47,759 w which is a hybrid of gp accelerate and 39 00:01:44,960 --> 00:01:51,200 the firewall one of the main 40 00:01:47,759 --> 00:01:53,439 targets for the project is to be able to 41 00:01:51,200 --> 00:01:56,799 filter a massive 42 00:01:53,439 --> 00:01:58,320 malicious traffic like s70 dos attacks 43 00:01:56,799 --> 00:01:59,200 or web attacks 44 00:01:58,320 --> 00:02:00,880 and 45 00:01:59,200 --> 00:02:04,479 we have 46 00:02:00,880 --> 00:02:09,200 one of the fastest or maybe the fastest 47 00:02:04,479 --> 00:02:12,480 uh hpm parcel and uh so this talk will 48 00:02:09,200 --> 00:02:16,160 be most uh reference to this to 49 00:02:12,480 --> 00:02:18,560 suggestions about hp passing and uh 50 00:02:16,160 --> 00:02:20,239 cryptography interior less handshakes 51 00:02:18,560 --> 00:02:23,760 in our 52 00:02:20,239 --> 00:02:26,400 tasks usually uh performances about most 53 00:02:23,760 --> 00:02:29,760 about algorithms mathematics and so on 54 00:02:26,400 --> 00:02:32,480 but also very important stuff is to uh 55 00:02:29,760 --> 00:02:36,800 creatively and accurately uh 56 00:02:32,480 --> 00:02:39,760 code on the program to make it fast and 57 00:02:36,800 --> 00:02:44,239 optimize it so we do a lot of cnc plus 58 00:02:39,760 --> 00:02:47,599 plus and sometimes we deal with assembly 59 00:02:44,239 --> 00:02:50,480 uh for example if we um if you do a quad 60 00:02:47,599 --> 00:02:53,200 review and you see a functions like is 61 00:02:50,480 --> 00:02:55,840 digit uh you see that there are three 62 00:02:53,200 --> 00:02:58,560 operations involved um 63 00:02:55,840 --> 00:02:59,680 in the function and there's a hackers 64 00:02:58,560 --> 00:03:02,480 delight 65 00:02:59,680 --> 00:03:05,760 uh tricks when you can replace the 66 00:03:02,480 --> 00:03:06,800 condition with only two operations so 67 00:03:05,760 --> 00:03:09,519 uh 68 00:03:06,800 --> 00:03:13,280 it doesn't make sense uh to use a more 69 00:03:09,519 --> 00:03:16,159 sophisticated code uh than this um 70 00:03:13,280 --> 00:03:19,280 when a straightforward code actually not 71 00:03:16,159 --> 00:03:22,879 because modern compilers like gcc on 72 00:03:19,280 --> 00:03:26,159 both the optimization levels are good uh 73 00:03:22,879 --> 00:03:26,959 with such kind of optimization so we can 74 00:03:26,159 --> 00:03:28,560 just 75 00:03:26,959 --> 00:03:30,720 stick on 76 00:03:28,560 --> 00:03:34,000 the simplest version of the code 77 00:03:30,720 --> 00:03:36,720 the next example is let's consider this 78 00:03:34,000 --> 00:03:38,400 uh two variants of 79 00:03:36,720 --> 00:03:39,920 uh function 80 00:03:38,400 --> 00:03:41,920 and um 81 00:03:39,920 --> 00:03:44,720 you might be thinking 82 00:03:41,920 --> 00:03:48,560 that uh both the function are equal or 83 00:03:44,720 --> 00:03:50,959 not and in both of the 84 00:03:48,560 --> 00:03:54,720 cases you are actually right 85 00:03:50,959 --> 00:03:56,319 uh so let's um 86 00:03:54,720 --> 00:03:57,599 let's uh 87 00:03:56,319 --> 00:04:00,159 load 88 00:03:57,599 --> 00:04:01,280 the program into 89 00:04:00,159 --> 00:04:02,560 uh 90 00:04:01,280 --> 00:04:04,480 into 91 00:04:02,560 --> 00:04:05,680 uh compile explorer 92 00:04:04,480 --> 00:04:07,040 and 93 00:04:05,680 --> 00:04:09,360 one 94 00:04:07,040 --> 00:04:12,799 is for clunk 95 00:04:09,360 --> 00:04:14,959 as we see that both the 96 00:04:12,799 --> 00:04:18,400 versions are pretty the same 97 00:04:14,959 --> 00:04:21,199 uh function uh in assembly however if we 98 00:04:18,400 --> 00:04:23,919 compare it with gcc 99 00:04:21,199 --> 00:04:26,840 um let me do this 100 00:04:23,919 --> 00:04:30,000 we see that rgc 101 00:04:26,840 --> 00:04:34,080 um generates a bit different 102 00:04:30,000 --> 00:04:36,320 uh code and it involves rbp 103 00:04:34,080 --> 00:04:39,199 so actually 104 00:04:36,320 --> 00:04:40,840 not always um 105 00:04:39,199 --> 00:04:43,759 not not always 106 00:04:40,840 --> 00:04:45,600 um this is the same code not always 107 00:04:43,759 --> 00:04:47,280 generates the same results for different 108 00:04:45,600 --> 00:04:48,479 compilers 109 00:04:47,280 --> 00:04:50,160 uh 110 00:04:48,479 --> 00:04:52,440 uh let's now 111 00:04:50,160 --> 00:04:54,080 uh see what the 112 00:04:52,440 --> 00:04:57,520 x8664 113 00:04:54,080 --> 00:04:59,600 uh looks like uh first of all this 114 00:04:57,520 --> 00:05:03,280 surprise color architecture means that 115 00:04:59,600 --> 00:05:04,320 we can run multiple instructions in 116 00:05:03,280 --> 00:05:09,440 parallel 117 00:05:04,320 --> 00:05:11,520 uh this out for the executions and also 118 00:05:09,440 --> 00:05:14,479 each instruction is 119 00:05:11,520 --> 00:05:17,520 like if you see assembly code then this 120 00:05:14,479 --> 00:05:20,400 is not the lowest level of the code and 121 00:05:17,520 --> 00:05:23,280 the microphone processor splits 122 00:05:20,400 --> 00:05:25,680 instruction into my corporations 123 00:05:23,280 --> 00:05:28,720 and different instructions have 124 00:05:25,680 --> 00:05:29,680 different latency and throughput latency 125 00:05:28,720 --> 00:05:33,280 means 126 00:05:29,680 --> 00:05:34,960 how how many cycles instruction takes in 127 00:05:33,280 --> 00:05:37,600 dependency chain 128 00:05:34,960 --> 00:05:40,240 and the throughput is how many 129 00:05:37,600 --> 00:05:41,759 selections of the same kind can be run 130 00:05:40,240 --> 00:05:45,840 in parallel 131 00:05:41,759 --> 00:05:45,840 uh to visualize um 132 00:05:46,000 --> 00:05:48,639 the ideas about 133 00:05:47,840 --> 00:05:53,600 uh 134 00:05:48,639 --> 00:05:57,039 modern x 8664 let's uh have a look on 135 00:05:53,600 --> 00:06:02,560 lvm mca 136 00:05:57,039 --> 00:06:07,280 um so this is a call uh why vm uh mca is 137 00:06:02,560 --> 00:06:10,720 levm 2 it works just fine with gcc 138 00:06:07,280 --> 00:06:13,919 and you actually can pass any assembly 139 00:06:10,720 --> 00:06:15,440 code as input from levm 140 00:06:13,919 --> 00:06:17,919 mc 141 00:06:15,440 --> 00:06:22,000 so let's have a look what what it is 142 00:06:17,919 --> 00:06:24,080 uh this report about uh the program 143 00:06:22,000 --> 00:06:26,240 and uh first of all we see that the 144 00:06:24,080 --> 00:06:29,520 dispatch vis for current architecture 145 00:06:26,240 --> 00:06:33,120 this uh maximum six uh micro operations 146 00:06:29,520 --> 00:06:35,039 per cycle and this is problem we see 147 00:06:33,120 --> 00:06:38,720 almost uh six 148 00:06:35,039 --> 00:06:42,160 microbes per cycle this good for 149 00:06:38,720 --> 00:06:44,560 our program also this section 150 00:06:42,160 --> 00:06:47,120 please how many sections per second 151 00:06:44,560 --> 00:06:49,759 leverage are gone 152 00:06:47,120 --> 00:06:50,800 uh this is also not so 153 00:06:49,759 --> 00:06:53,280 bad 154 00:06:50,800 --> 00:06:56,319 uh next we see the 155 00:06:53,280 --> 00:06:59,039 assembly for this particular program uh 156 00:06:56,319 --> 00:07:01,360 doesn't my um this 157 00:06:59,039 --> 00:07:03,919 um it isn't important for now what was 158 00:07:01,360 --> 00:07:05,440 the problem is but now let's observe 159 00:07:03,919 --> 00:07:08,400 some 160 00:07:05,440 --> 00:07:12,160 key characteristics of the program in 161 00:07:08,400 --> 00:07:14,880 this report we see on first 162 00:07:12,160 --> 00:07:17,840 column is how many micro operations are 163 00:07:14,880 --> 00:07:17,840 involved in 164 00:07:18,639 --> 00:07:22,160 in 165 00:07:19,520 --> 00:07:24,080 particular instruction uh instruction 166 00:07:22,160 --> 00:07:27,840 latency and 167 00:07:24,080 --> 00:07:29,360 it's work of food 168 00:07:27,840 --> 00:07:30,400 frankly i don't 169 00:07:29,360 --> 00:07:31,199 trust 170 00:07:30,400 --> 00:07:32,800 that 171 00:07:31,199 --> 00:07:34,479 all the 172 00:07:32,800 --> 00:07:39,039 instruction i have only one mac 173 00:07:34,479 --> 00:07:41,680 operations but uh let's have a look on 174 00:07:39,039 --> 00:07:44,639 latency we see that uh vector 175 00:07:41,680 --> 00:07:46,800 instructions xml this uh 176 00:07:44,639 --> 00:07:46,800 uh 177 00:07:46,879 --> 00:07:49,680 this 178 00:07:48,000 --> 00:07:51,199 uh 16 byte 179 00:07:49,680 --> 00:07:54,639 vector operations 180 00:07:51,199 --> 00:07:55,599 have very large uh latency 181 00:07:54,639 --> 00:07:56,879 and 182 00:07:55,599 --> 00:07:59,039 throughput 183 00:07:56,879 --> 00:08:01,599 isn't bad this is a reciprocal 184 00:07:59,039 --> 00:08:05,840 throughput it means that our transaction 185 00:08:01,599 --> 00:08:07,120 of this kind can be run in parallel 186 00:08:05,840 --> 00:08:10,000 um 187 00:08:07,120 --> 00:08:12,720 okay it's uh 188 00:08:10,000 --> 00:08:14,879 this instruction have more my mic 189 00:08:12,720 --> 00:08:16,560 operations um 190 00:08:14,879 --> 00:08:18,319 anyway 191 00:08:16,560 --> 00:08:20,560 um 192 00:08:18,319 --> 00:08:21,680 let's uh see a 193 00:08:20,560 --> 00:08:23,360 timeline 194 00:08:21,680 --> 00:08:26,639 for the 195 00:08:23,360 --> 00:08:28,000 program let me make a 196 00:08:26,639 --> 00:08:28,879 little 197 00:08:28,000 --> 00:08:29,919 uh 198 00:08:28,879 --> 00:08:34,959 smaller 199 00:08:29,919 --> 00:08:36,640 uh this uh how the instructions are 200 00:08:34,959 --> 00:08:37,680 executed by 201 00:08:36,640 --> 00:08:40,399 uh 202 00:08:37,680 --> 00:08:42,719 which is 203 00:08:40,399 --> 00:08:45,200 are provided by mca 204 00:08:42,719 --> 00:08:46,399 first of all we see that the first 7th 205 00:08:45,200 --> 00:08:48,720 section 206 00:08:46,399 --> 00:08:51,839 starts decoding at the same 207 00:08:48,720 --> 00:08:55,120 time next is time to execute waiting for 208 00:08:51,839 --> 00:08:56,399 execution and execute 209 00:08:55,120 --> 00:08:58,640 here we 210 00:08:56,399 --> 00:09:01,440 see that 211 00:08:58,640 --> 00:09:04,560 this instruction this instruction 212 00:09:01,440 --> 00:09:06,080 as executed are earlier than previous 213 00:09:04,560 --> 00:09:09,120 instructions this 214 00:09:06,080 --> 00:09:10,640 autofocus execution in action 215 00:09:09,120 --> 00:09:11,920 uh however 216 00:09:10,640 --> 00:09:13,760 uh 217 00:09:11,920 --> 00:09:16,720 the instruction retired 218 00:09:13,760 --> 00:09:19,200 return means that uh results of the 219 00:09:16,720 --> 00:09:22,160 instruction are fixed in 220 00:09:19,200 --> 00:09:24,959 final registers the final memory and we 221 00:09:22,160 --> 00:09:26,240 see that all the instructions are always 222 00:09:24,959 --> 00:09:28,000 retired 223 00:09:26,240 --> 00:09:30,880 in 224 00:09:28,000 --> 00:09:33,839 in sequence as instruction appeared in 225 00:09:30,880 --> 00:09:33,839 the program 226 00:09:34,560 --> 00:09:43,360 so uh let's back uh to the slides 227 00:09:38,800 --> 00:09:44,720 uh thanks to out of vote execution we uh 228 00:09:43,360 --> 00:09:46,480 have um 229 00:09:44,720 --> 00:09:50,080 a spectacle attack 230 00:09:46,480 --> 00:09:52,399 and we suspect an attack um 231 00:09:50,080 --> 00:09:52,399 we 232 00:09:53,360 --> 00:09:59,600 the spectral attacks use with um out of 233 00:09:56,240 --> 00:10:00,720 forward execution let's uh uh let's 234 00:09:59,600 --> 00:10:02,880 explore 235 00:10:00,720 --> 00:10:07,360 uh one more example 236 00:10:02,880 --> 00:10:07,360 let me award it in compiler explorer 237 00:10:07,440 --> 00:10:10,240 here it is 238 00:10:12,720 --> 00:10:17,760 let me choose a clank for this 239 00:10:15,279 --> 00:10:20,880 experiment 240 00:10:17,760 --> 00:10:23,279 and uh here we go this uh basic uh clan 241 00:10:20,880 --> 00:10:26,480 captain uh compilation for this problem 242 00:10:23,279 --> 00:10:30,480 this uh what we would expect 243 00:10:26,480 --> 00:10:32,399 uh however in um out of order execution 244 00:10:30,480 --> 00:10:35,040 array lengths might be 245 00:10:32,399 --> 00:10:37,279 not fetched from the memory and the 246 00:10:35,040 --> 00:10:40,320 while uh experience waiting for the 247 00:10:37,279 --> 00:10:44,000 value of a length it might uh try to 248 00:10:40,320 --> 00:10:45,519 speculatively try to execute uh the next 249 00:10:44,000 --> 00:10:46,399 uh statement 250 00:10:45,519 --> 00:10:50,240 so 251 00:10:46,399 --> 00:10:52,959 uh if we long and actually offset is 252 00:10:50,240 --> 00:10:55,040 larger than array lengths then cpu may 253 00:10:52,959 --> 00:10:57,120 might uh store the 254 00:10:55,040 --> 00:11:00,160 uh results of this profession's 255 00:10:57,120 --> 00:11:02,880 instruction in the cpu cache and this um 256 00:11:00,160 --> 00:11:07,279 data may be repeated by nataka 257 00:11:02,880 --> 00:11:09,440 uh so uh to prevent such kind of attacks 258 00:11:07,279 --> 00:11:11,519 we can add 259 00:11:09,440 --> 00:11:14,920 command line options 260 00:11:11,519 --> 00:11:14,920 it's um 261 00:11:15,839 --> 00:11:19,040 line 262 00:11:17,160 --> 00:11:20,959 [Music] 263 00:11:19,040 --> 00:11:22,399 and 264 00:11:20,959 --> 00:11:24,320 now now we have 265 00:11:22,399 --> 00:11:27,279 legend epilogue 266 00:11:24,320 --> 00:11:30,000 and this actually the trick 267 00:11:27,279 --> 00:11:31,120 from the compiler to trick the 268 00:11:30,000 --> 00:11:32,880 cpu 269 00:11:31,120 --> 00:11:36,399 so if it goes 270 00:11:32,880 --> 00:11:39,920 uh to the wrong uh branch then it sees 271 00:11:36,399 --> 00:11:45,279 uh pause and defense which actually 272 00:11:39,920 --> 00:11:48,000 uh clears the all the uh data crashed 273 00:11:45,279 --> 00:11:52,000 uh so such 274 00:11:48,000 --> 00:11:54,839 epiworks uh have a performance cost and 275 00:11:52,000 --> 00:11:58,320 it might cost up to 15 276 00:11:54,839 --> 00:12:01,120 performance this next uh attack which 277 00:11:58,320 --> 00:12:04,720 were revealed couple of years ago is uh 278 00:12:01,120 --> 00:12:07,600 main down and this is about um 279 00:12:04,720 --> 00:12:10,560 page table isolation on the operating 280 00:12:07,600 --> 00:12:14,480 system and this might cost up to 30 281 00:12:10,560 --> 00:12:15,519 percent of uh performance 282 00:12:14,480 --> 00:12:20,959 um 283 00:12:15,519 --> 00:12:23,200 so if you do a code review uh and uh 284 00:12:20,959 --> 00:12:27,279 suppose that performance very important 285 00:12:23,200 --> 00:12:29,680 for you then you might be asking um 286 00:12:27,279 --> 00:12:30,880 as on the early side in this 287 00:12:29,680 --> 00:12:33,920 presentation 288 00:12:30,880 --> 00:12:37,360 uh with the compiler is smart enough to 289 00:12:33,920 --> 00:12:38,800 make uh optimization or not 290 00:12:37,360 --> 00:12:41,680 typically 291 00:12:38,800 --> 00:12:41,680 we can first 292 00:12:41,920 --> 00:12:46,079 check 293 00:12:43,600 --> 00:12:50,320 compiling documents this 294 00:12:46,079 --> 00:12:52,720 document from levm about optimization 295 00:12:50,320 --> 00:12:53,600 passes transformation phases 296 00:12:52,720 --> 00:12:55,279 and 297 00:12:53,600 --> 00:12:57,200 this documentation might give you an 298 00:12:55,279 --> 00:12:59,839 idea what the compiler can do in 299 00:12:57,200 --> 00:13:01,320 principle however 300 00:12:59,839 --> 00:13:03,920 one compiler can do a lot of 301 00:13:01,320 --> 00:13:06,399 optimizations not all of them can be 302 00:13:03,920 --> 00:13:08,560 applied in particular case 303 00:13:06,399 --> 00:13:12,000 and the compiler have 304 00:13:08,560 --> 00:13:14,270 common line options uh providing 305 00:13:12,000 --> 00:13:15,760 optimization information this um 306 00:13:14,270 --> 00:13:18,160 [Music] 307 00:13:15,760 --> 00:13:23,360 subtract of 308 00:13:18,160 --> 00:13:26,320 gcm output about optimized vectorized um 309 00:13:23,360 --> 00:13:27,920 optimizations and several missed uh 310 00:13:26,320 --> 00:13:31,519 optimizations 311 00:13:27,920 --> 00:13:34,639 uh once you have uh report you also 312 00:13:31,519 --> 00:13:37,200 have a source file you have lines and 313 00:13:34,639 --> 00:13:39,920 socks fast so you can have an ideas 314 00:13:37,200 --> 00:13:43,040 which part of your sources are optimized 315 00:13:39,920 --> 00:13:44,560 and not and with which uh strategies 316 00:13:43,040 --> 00:13:46,399 uh also 317 00:13:44,560 --> 00:13:49,360 uh reviewing the 318 00:13:46,399 --> 00:13:52,399 assembly it's always helpful but usually 319 00:13:49,360 --> 00:13:54,240 it's better to do this on last 320 00:13:52,399 --> 00:13:56,079 as the last 321 00:13:54,240 --> 00:13:59,120 try 322 00:13:56,079 --> 00:14:01,600 uh so let's start about particle with 323 00:13:59,120 --> 00:14:04,560 practical examples 324 00:14:01,600 --> 00:14:06,240 about http um the first example is about 325 00:14:04,560 --> 00:14:09,680 http parser 326 00:14:06,240 --> 00:14:11,360 uh http passwords are typically encoded 327 00:14:09,680 --> 00:14:12,240 as a 328 00:14:11,360 --> 00:14:15,440 loop 329 00:14:12,240 --> 00:14:18,720 uh in this slide there's this 330 00:14:15,440 --> 00:14:20,800 while loop on input data and inside the 331 00:14:18,720 --> 00:14:22,800 loop you have a bunch of switch 332 00:14:20,800 --> 00:14:25,040 statements and 333 00:14:22,800 --> 00:14:28,079 uh switch 334 00:14:25,040 --> 00:14:29,519 guns on current state machine state and 335 00:14:28,079 --> 00:14:33,279 input data 336 00:14:29,519 --> 00:14:37,040 uh for nginx we have relatively small 337 00:14:33,279 --> 00:14:41,600 state machine about 84 338 00:14:37,040 --> 00:14:44,720 states however 14 psfw we need much 339 00:14:41,600 --> 00:14:45,760 larger state machine and this magnitude 340 00:14:44,720 --> 00:14:48,560 larger 341 00:14:45,760 --> 00:14:50,880 uh the reason for larger set machine is 342 00:14:48,560 --> 00:14:52,000 because we do a lot of security checks 343 00:14:50,880 --> 00:14:54,160 in our 344 00:14:52,000 --> 00:14:55,519 hp passes so 345 00:14:54,160 --> 00:14:56,320 we 346 00:14:55,519 --> 00:14:58,480 have 347 00:14:56,320 --> 00:15:02,079 more than 900 348 00:14:58,480 --> 00:15:02,079 states now can't pass 349 00:15:02,880 --> 00:15:07,279 so 350 00:15:04,079 --> 00:15:09,519 what compilers can do for such 351 00:15:07,279 --> 00:15:12,320 large state machines 352 00:15:09,519 --> 00:15:15,360 the first one is that we have 353 00:15:12,320 --> 00:15:18,880 optimization of switch this well-known 354 00:15:15,360 --> 00:15:22,800 optimization uh first of all is lookup 355 00:15:18,880 --> 00:15:25,440 table so if your states uh which you use 356 00:15:22,800 --> 00:15:26,560 in switch statements are arranged and 357 00:15:25,440 --> 00:15:28,639 sequential 358 00:15:26,560 --> 00:15:29,920 then your compiler can generate the 359 00:15:28,639 --> 00:15:31,519 hookup 360 00:15:29,920 --> 00:15:33,600 table 361 00:15:31,519 --> 00:15:35,519 so we have a 362 00:15:33,600 --> 00:15:36,560 jump bias 363 00:15:35,519 --> 00:15:39,360 offset 364 00:15:36,560 --> 00:15:43,199 and by the job we define the label where 365 00:15:39,360 --> 00:15:46,480 we need to go so this like a double jump 366 00:15:43,199 --> 00:15:49,839 however since we use uh input data as 367 00:15:46,480 --> 00:15:53,519 i've said uh for jump this code is 368 00:15:49,839 --> 00:15:56,639 vulnerable for spectral attack 369 00:15:53,519 --> 00:16:00,320 uh the next optimization next variant of 370 00:15:56,639 --> 00:16:02,399 switch table optimization is uh binary 371 00:16:00,320 --> 00:16:04,480 search if your 372 00:16:02,399 --> 00:16:05,600 states and switch statements aren't 373 00:16:04,480 --> 00:16:06,959 arranged 374 00:16:05,600 --> 00:16:10,320 and 375 00:16:06,959 --> 00:16:11,839 have a non-consequent values then 376 00:16:10,320 --> 00:16:13,360 compiler have to 377 00:16:11,839 --> 00:16:16,000 fall back to 378 00:16:13,360 --> 00:16:17,120 a binary search between the 379 00:16:16,000 --> 00:16:18,560 labels 380 00:16:17,120 --> 00:16:21,519 actually 381 00:16:18,560 --> 00:16:24,000 modern compilers are smart enough to be 382 00:16:21,519 --> 00:16:26,320 able to recognize the same pattern for 383 00:16:24,000 --> 00:16:30,160 chains of yourself 384 00:16:26,320 --> 00:16:31,519 use f's ifs and we have the same uh 385 00:16:30,160 --> 00:16:34,639 pretty the same 386 00:16:31,519 --> 00:16:37,120 uh code generated for 387 00:16:34,639 --> 00:16:41,279 uh yourself 388 00:16:37,120 --> 00:16:43,279 uh the next thing which we can uh do is 389 00:16:41,279 --> 00:16:46,160 uh just to 390 00:16:43,279 --> 00:16:47,839 replace uh switch statement by 391 00:16:46,160 --> 00:16:49,440 uh go to 392 00:16:47,839 --> 00:16:50,320 uh jumps 393 00:16:49,440 --> 00:16:53,120 um 394 00:16:50,320 --> 00:16:57,279 jcbo 395 00:16:53,120 --> 00:16:59,600 will copy some uh code inside the 396 00:16:57,279 --> 00:17:02,000 uh state machine 397 00:16:59,600 --> 00:17:04,880 and we still need some 398 00:17:02,000 --> 00:17:07,839 uh starting um 399 00:17:04,880 --> 00:17:10,079 starting switch statement just to start 400 00:17:07,839 --> 00:17:11,520 uh the state machine 401 00:17:10,079 --> 00:17:13,280 and we 402 00:17:11,520 --> 00:17:14,880 actually can go 403 00:17:13,280 --> 00:17:15,919 uh faster 404 00:17:14,880 --> 00:17:19,199 uh 405 00:17:15,919 --> 00:17:22,640 she provides us um 406 00:17:19,199 --> 00:17:23,839 labels as values so essentially you can 407 00:17:22,640 --> 00:17:26,480 draw 408 00:17:23,839 --> 00:17:27,600 in c just the same as you do in assembly 409 00:17:26,480 --> 00:17:30,160 just 410 00:17:27,600 --> 00:17:32,799 jump directly to particular 411 00:17:30,160 --> 00:17:35,919 others in your code 412 00:17:32,799 --> 00:17:38,080 so with this optimization we can move uh 413 00:17:35,919 --> 00:17:38,880 significantly faster 414 00:17:38,080 --> 00:17:39,760 so 415 00:17:38,880 --> 00:17:42,720 uh 416 00:17:39,760 --> 00:17:44,240 for pretty large uh 417 00:17:42,720 --> 00:17:46,799 state machines 418 00:17:44,240 --> 00:17:48,480 uh this link to 419 00:17:46,799 --> 00:17:50,960 uh benchmark 420 00:17:48,480 --> 00:17:54,320 uh if we compare uh 421 00:17:50,960 --> 00:17:58,559 state state machines for or for seven or 422 00:17:54,320 --> 00:18:01,600 twenty uh seven states then uh times are 423 00:17:58,559 --> 00:18:02,400 more or less the same for very small 424 00:18:01,600 --> 00:18:06,559 uh 425 00:18:02,400 --> 00:18:10,000 state machines we uh go to given um 426 00:18:06,559 --> 00:18:11,840 after automaton may might be even a bit 427 00:18:10,000 --> 00:18:15,200 slower than 428 00:18:11,840 --> 00:18:17,600 switch driving however for a very large 429 00:18:15,200 --> 00:18:20,240 uh state machine or in this particular 430 00:18:17,600 --> 00:18:24,080 case uh it's twice 431 00:18:20,240 --> 00:18:25,200 uh smallest on our actual state machine 432 00:18:24,080 --> 00:18:29,360 you can 433 00:18:25,200 --> 00:18:32,320 move uh twice faster than a basic 434 00:18:29,360 --> 00:18:33,200 switch driving state machine 435 00:18:32,320 --> 00:18:35,600 uh 436 00:18:33,200 --> 00:18:39,280 the main reason for this is 437 00:18:35,600 --> 00:18:41,440 low um branch misses because we we drew 438 00:18:39,280 --> 00:18:43,440 much lower 439 00:18:41,440 --> 00:18:45,039 code transition 440 00:18:43,440 --> 00:18:48,960 in the loop 441 00:18:45,039 --> 00:18:48,960 then we have a much better 442 00:18:49,760 --> 00:18:54,480 cash 443 00:18:51,360 --> 00:18:58,000 much better characteristics of 444 00:18:54,480 --> 00:18:59,919 cash missiles even uh we have 445 00:18:58,000 --> 00:19:00,960 almost twice 446 00:18:59,919 --> 00:19:04,160 bigger 447 00:19:00,960 --> 00:19:06,160 uh code for go to driving aftermath on 448 00:19:04,160 --> 00:19:09,039 we have uh 449 00:19:06,160 --> 00:19:11,840 twice small number of 450 00:19:09,039 --> 00:19:11,840 cash misses 451 00:19:13,600 --> 00:19:19,280 next thing uh is 452 00:19:17,200 --> 00:19:21,200 how to 453 00:19:19,280 --> 00:19:23,440 how to properly code 454 00:19:21,200 --> 00:19:24,799 a state machine this 455 00:19:23,440 --> 00:19:26,000 clear for 456 00:19:24,799 --> 00:19:29,679 http 457 00:19:26,000 --> 00:19:34,240 uh semantics then uh first we have 458 00:19:29,679 --> 00:19:37,039 hp method next we have uh some 459 00:19:34,240 --> 00:19:41,520 hp headers and after that we have a http 460 00:19:37,039 --> 00:19:42,480 body and we would expect that um 461 00:19:41,520 --> 00:19:43,440 we 462 00:19:42,480 --> 00:19:46,320 well 463 00:19:43,440 --> 00:19:47,120 in in any case we always need to 464 00:19:46,320 --> 00:19:48,480 uh 465 00:19:47,120 --> 00:19:52,400 pass 466 00:19:48,480 --> 00:19:56,640 hpm uh method and after that move to hp 467 00:19:52,400 --> 00:20:00,880 headers so this no sense uh to place 468 00:19:56,640 --> 00:20:04,320 hp um uh method of python after hp 469 00:20:00,880 --> 00:20:07,200 headers so we have uh 470 00:20:04,320 --> 00:20:08,400 we have a fixed sequence of states 471 00:20:07,200 --> 00:20:10,720 however 472 00:20:08,400 --> 00:20:13,520 uh compilers doesn't know about these 473 00:20:10,720 --> 00:20:18,000 requirements about our semantics of 474 00:20:13,520 --> 00:20:20,880 input data so it's um free to reorder 475 00:20:18,000 --> 00:20:23,120 um code in its understanding of 476 00:20:20,880 --> 00:20:26,320 optimization and in our 477 00:20:23,120 --> 00:20:29,360 uh case we saw that 478 00:20:26,320 --> 00:20:31,840 very improbable states uh are placed by 479 00:20:29,360 --> 00:20:34,559 the compiler on the top of the state 480 00:20:31,840 --> 00:20:37,120 machine the top of the set machine is a 481 00:20:34,559 --> 00:20:40,559 interest point for the function and it's 482 00:20:37,120 --> 00:20:41,360 very very hot so we we need to place the 483 00:20:40,559 --> 00:20:43,440 hot 484 00:20:41,360 --> 00:20:45,440 states closer to the beginning of the 485 00:20:43,440 --> 00:20:48,799 function 486 00:20:45,440 --> 00:20:50,000 and the first thing which we can do is 487 00:20:48,799 --> 00:20:53,679 to use 488 00:20:50,000 --> 00:20:54,960 compiler barrier which 489 00:20:53,679 --> 00:20:58,559 prevents 490 00:20:54,960 --> 00:21:00,480 compiler from reordering 491 00:20:58,559 --> 00:21:03,840 however we can 492 00:21:00,480 --> 00:21:03,840 it's not optimal 493 00:21:06,480 --> 00:21:12,320 sorry uh it's not optimal to place 494 00:21:09,200 --> 00:21:18,080 compiler barriers everywhere we want to 495 00:21:12,320 --> 00:21:19,679 let compiler do optimization better and 496 00:21:18,080 --> 00:21:21,600 actually we can 497 00:21:19,679 --> 00:21:23,919 do this 498 00:21:21,600 --> 00:21:23,919 using 499 00:21:24,080 --> 00:21:28,559 using hot 500 00:21:27,039 --> 00:21:29,280 hot labels 501 00:21:28,559 --> 00:21:31,039 so 502 00:21:29,280 --> 00:21:34,000 with um 503 00:21:31,039 --> 00:21:36,799 with the uh hot and cold uh labels for 504 00:21:34,000 --> 00:21:39,039 state machines we can arrange the the 505 00:21:36,799 --> 00:21:41,679 parcel uh to 506 00:21:39,039 --> 00:21:44,640 to make a compiler place particular 507 00:21:41,679 --> 00:21:47,200 labels uh closer to the top of the 508 00:21:44,640 --> 00:21:49,360 function or closer to the bottom of the 509 00:21:47,200 --> 00:21:50,320 function 510 00:21:49,360 --> 00:21:52,240 in this 511 00:21:50,320 --> 00:21:55,919 case we found that 512 00:21:52,240 --> 00:21:58,480 pgo isn't helpful because 513 00:21:55,919 --> 00:22:01,840 profile guided optimization that 514 00:21:58,480 --> 00:22:05,039 arranges your code according to the 515 00:22:01,840 --> 00:22:07,039 profile profile data so if you spend 516 00:22:05,039 --> 00:22:08,960 very little time in 517 00:22:07,039 --> 00:22:11,120 hpm 518 00:22:08,960 --> 00:22:13,840 method passing 519 00:22:11,120 --> 00:22:17,200 and spend more time on your rain passing 520 00:22:13,840 --> 00:22:20,400 than um profile guy that optimization 521 00:22:17,200 --> 00:22:23,120 will place uh your rain passing uh 522 00:22:20,400 --> 00:22:27,120 before the method by passing and that's 523 00:22:23,120 --> 00:22:30,080 wrong because we always have to pass um 524 00:22:27,120 --> 00:22:33,280 hp method before the url 525 00:22:30,080 --> 00:22:37,120 and this um case when uh profile guide 526 00:22:33,280 --> 00:22:40,320 optimization also not helpful because um 527 00:22:37,120 --> 00:22:42,799 it's not aware about the input data 528 00:22:40,320 --> 00:22:43,919 semantics 529 00:22:42,799 --> 00:22:46,720 um 530 00:22:43,919 --> 00:22:49,919 this uh skip it 531 00:22:46,720 --> 00:22:51,760 says about alignment 532 00:22:49,919 --> 00:22:52,840 so 533 00:22:51,760 --> 00:22:54,320 actually 534 00:22:52,840 --> 00:22:57,600 we uh 535 00:22:54,320 --> 00:23:01,440 tried to experiment with uh data 536 00:22:57,600 --> 00:23:02,480 alignment for example this uh checked 537 00:23:01,440 --> 00:23:05,600 access 538 00:23:02,480 --> 00:23:09,679 for example if we 539 00:23:05,600 --> 00:23:12,880 convert a string to 540 00:23:09,679 --> 00:23:16,640 integer this known penalty to access 541 00:23:12,880 --> 00:23:18,960 integer uh non-aggregate integer and in 542 00:23:16,640 --> 00:23:20,640 a very micro benchmark you might see 543 00:23:18,960 --> 00:23:22,880 that almost 544 00:23:20,640 --> 00:23:25,919 more than two times on regular 545 00:23:22,880 --> 00:23:29,280 accessories slower than lignin however 546 00:23:25,919 --> 00:23:30,480 if we try to encode this uh directly in 547 00:23:29,280 --> 00:23:33,520 um 548 00:23:30,480 --> 00:23:37,200 in the past we won't see any performance 549 00:23:33,520 --> 00:23:40,320 improvement and the reason uh for that 550 00:23:37,200 --> 00:23:43,520 uh this actually is that uh anaerobic 551 00:23:40,320 --> 00:23:45,200 access is bit uh faster than alignment 552 00:23:43,520 --> 00:23:47,200 the reason that 553 00:23:45,200 --> 00:23:49,039 the additional code makes the 554 00:23:47,200 --> 00:23:53,600 uh code 555 00:23:49,039 --> 00:23:56,080 larger more sophisticated and 556 00:23:53,600 --> 00:23:58,000 since you need to 557 00:23:56,080 --> 00:23:59,360 align the diet in 558 00:23:58,000 --> 00:24:01,679 every 559 00:23:59,360 --> 00:24:04,720 string comparison the award of such 560 00:24:01,679 --> 00:24:08,240 places around the past and the password 561 00:24:04,720 --> 00:24:10,400 is significant so um the 562 00:24:08,240 --> 00:24:12,799 big bigger code and more 563 00:24:10,400 --> 00:24:16,880 complex code 564 00:24:12,799 --> 00:24:16,880 may make the password slower 565 00:24:17,279 --> 00:24:22,559 next thing is about 566 00:24:19,679 --> 00:24:25,919 optimization uh compiler optimization 567 00:24:22,559 --> 00:24:27,520 this size shows uh difference between o2 568 00:24:25,919 --> 00:24:31,039 and o3 569 00:24:27,520 --> 00:24:33,919 and those three uh adds um 570 00:24:31,039 --> 00:24:36,159 particular flags for 571 00:24:33,919 --> 00:24:38,640 uh optimization level 572 00:24:36,159 --> 00:24:41,440 and not all the flux 573 00:24:38,640 --> 00:24:42,320 uh improves uh performance in that there 574 00:24:41,440 --> 00:24:44,000 are 575 00:24:42,320 --> 00:24:46,960 um 576 00:24:44,000 --> 00:24:49,760 the shown facts which makes the 577 00:24:46,960 --> 00:24:50,799 performance works actually done 578 00:24:49,760 --> 00:24:53,840 better 579 00:24:50,799 --> 00:24:56,919 and we found that for http passer 580 00:24:53,840 --> 00:25:00,960 all three uh generates 581 00:24:56,919 --> 00:25:03,120 volkswagen o2 582 00:25:00,960 --> 00:25:04,400 the 583 00:25:03,120 --> 00:25:05,520 problem 584 00:25:04,400 --> 00:25:08,559 uh 585 00:25:05,520 --> 00:25:12,559 problem options are green and they're 586 00:25:08,559 --> 00:25:14,880 about after vectorization let's 587 00:25:12,559 --> 00:25:17,679 let's dig deeper into after 588 00:25:14,880 --> 00:25:20,320 vectorization after vectorization is 589 00:25:17,679 --> 00:25:23,039 about single instruction multiple data 590 00:25:20,320 --> 00:25:25,679 so if you have uh 591 00:25:23,039 --> 00:25:28,480 computation for example it's on the 592 00:25:25,679 --> 00:25:31,440 uh picture on the slide if we have 593 00:25:28,480 --> 00:25:32,640 uh two arrays of form elements and we 594 00:25:31,440 --> 00:25:35,360 need to 595 00:25:32,640 --> 00:25:38,480 add one race to another we can do this 596 00:25:35,360 --> 00:25:41,440 in a single instruction with multiple 597 00:25:38,480 --> 00:25:44,480 data however instructions while 598 00:25:41,440 --> 00:25:47,440 instruction can operate like up to 32 599 00:25:44,480 --> 00:25:50,080 appearance uh in the time 600 00:25:47,440 --> 00:25:53,039 they're not for free first of all they 601 00:25:50,080 --> 00:25:56,640 require a ligament they also have 602 00:25:53,039 --> 00:26:00,000 additional latency as we saw in mci 603 00:25:56,640 --> 00:26:06,000 output and actually not everything is 604 00:26:00,000 --> 00:26:08,480 after vitalized this good talk from um 605 00:26:06,000 --> 00:26:12,320 an ibm developer about this after 606 00:26:08,480 --> 00:26:14,799 virtualization internals in lvm so if 607 00:26:12,320 --> 00:26:16,400 you're interested about thanos i highly 608 00:26:14,799 --> 00:26:19,600 recommend you to 609 00:26:16,400 --> 00:26:20,480 watch the presentation 610 00:26:19,600 --> 00:26:23,120 um 611 00:26:20,480 --> 00:26:24,000 linux kernel 612 00:26:23,120 --> 00:26:26,960 it's 613 00:26:24,000 --> 00:26:28,799 known as uh optimized 614 00:26:26,960 --> 00:26:31,360 part of 615 00:26:28,799 --> 00:26:34,159 the optimizer software however it does 616 00:26:31,360 --> 00:26:35,600 not use after vectorization 617 00:26:34,159 --> 00:26:36,840 and uh 618 00:26:35,600 --> 00:26:40,799 actually 619 00:26:36,840 --> 00:26:42,480 the canon doesn't uh tries to avoid 620 00:26:40,799 --> 00:26:43,600 single instruction multiple data as 621 00:26:42,480 --> 00:26:46,880 possible 622 00:26:43,600 --> 00:26:49,279 uh actually some code uh which ldm 623 00:26:46,880 --> 00:26:53,600 benefits from uh single search multiple 624 00:26:49,279 --> 00:26:55,200 data does uh use it like crypto api 625 00:26:53,600 --> 00:26:57,600 uh however 626 00:26:55,200 --> 00:26:59,679 uh single search multiple data is 627 00:26:57,600 --> 00:27:04,799 managed by separate 628 00:26:59,679 --> 00:27:06,640 cpu unit as a 14 point unit and it has a 629 00:27:04,799 --> 00:27:08,240 separate contacts 630 00:27:06,640 --> 00:27:12,159 so um 631 00:27:08,240 --> 00:27:15,360 to be able to use uh fpu inside the 632 00:27:12,159 --> 00:27:18,000 kernel uh we need to store 633 00:27:15,360 --> 00:27:22,240 fpu contacts and restore this 634 00:27:18,000 --> 00:27:26,399 optimization for the linux kernel so if 635 00:27:22,240 --> 00:27:29,039 well um the modern user space code 636 00:27:26,399 --> 00:27:31,360 is almost every uh user space poker 637 00:27:29,039 --> 00:27:32,640 which you run is after with the after 638 00:27:31,360 --> 00:27:33,520 vectorizer 639 00:27:32,640 --> 00:27:36,880 so 640 00:27:33,520 --> 00:27:38,320 having that uh cannot trust optimizes 641 00:27:36,880 --> 00:27:41,279 and not to 642 00:27:38,320 --> 00:27:45,919 uh spend resources on saving and 643 00:27:41,279 --> 00:27:47,840 restoring uh fpu state for each uh 644 00:27:45,919 --> 00:27:50,159 contact switch between the user space 645 00:27:47,840 --> 00:27:51,120 and the kernel space and only when you 646 00:27:50,159 --> 00:27:53,120 need to 647 00:27:51,120 --> 00:27:56,159 to use a single structure multiple data 648 00:27:53,120 --> 00:27:59,360 inside the kernel uh we call the kind of 649 00:27:56,159 --> 00:28:02,240 you begin and cannot be you end and they 650 00:27:59,360 --> 00:28:05,600 are actually very costly this 651 00:28:02,240 --> 00:28:09,039 comparison uh between uh 652 00:28:05,600 --> 00:28:10,640 when mem cpi and the meme cpi protected 653 00:28:09,039 --> 00:28:13,840 by kind of pu 654 00:28:10,640 --> 00:28:17,520 are beginning kind of premium end 655 00:28:13,840 --> 00:28:21,279 uh however in team we tried to 656 00:28:17,520 --> 00:28:24,000 play with after vectorization and we had 657 00:28:21,279 --> 00:28:25,520 an idea uh that probably we might 658 00:28:24,000 --> 00:28:30,399 benefit from 659 00:28:25,520 --> 00:28:32,080 um after equalization of hp passer and 660 00:28:30,399 --> 00:28:33,200 you have a look what 661 00:28:32,080 --> 00:28:34,799 was 662 00:28:33,200 --> 00:28:35,600 going on 663 00:28:34,799 --> 00:28:37,679 uh 664 00:28:35,600 --> 00:28:38,880 there are 665 00:28:37,679 --> 00:28:40,720 several 666 00:28:38,880 --> 00:28:44,799 uh documentation about after 667 00:28:40,720 --> 00:28:47,279 vectorization the first one is about gcf 668 00:28:44,799 --> 00:28:49,919 vectorization and also this very good 669 00:28:47,279 --> 00:28:54,240 docs for real vm 670 00:28:49,919 --> 00:28:56,640 uh vm docs are much more readable about 671 00:28:54,240 --> 00:28:59,360 after week civilization stuff 672 00:28:56,640 --> 00:29:01,919 basically we have two types of after 673 00:28:59,360 --> 00:29:04,720 vectorization the first one is loop 674 00:29:01,919 --> 00:29:07,279 after vectorization and the second one 675 00:29:04,720 --> 00:29:09,760 is so for basic block after 676 00:29:07,279 --> 00:29:11,360 vectorization or 677 00:29:09,760 --> 00:29:15,120 slp 678 00:29:11,360 --> 00:29:18,559 also jc provides a different 679 00:29:15,120 --> 00:29:20,399 cost model for after vectorization and 680 00:29:18,559 --> 00:29:22,159 this is 681 00:29:20,399 --> 00:29:24,559 well since 682 00:29:22,159 --> 00:29:27,279 single structure multiple data isn't for 683 00:29:24,559 --> 00:29:29,440 free then compiler uh 684 00:29:27,279 --> 00:29:30,320 lets you choose pc5 685 00:29:29,440 --> 00:29:34,000 uh 686 00:29:30,320 --> 00:29:36,640 how how um whether do you want 687 00:29:34,000 --> 00:29:40,080 um single instruction multiple data in 688 00:29:36,640 --> 00:29:42,080 many cases or not like 689 00:29:40,080 --> 00:29:43,360 how aggressive a single instruction 690 00:29:42,080 --> 00:29:47,200 multiple 691 00:29:43,360 --> 00:29:48,559 after vectorization should behave 692 00:29:47,200 --> 00:29:50,480 um 693 00:29:48,559 --> 00:29:52,960 there are several 694 00:29:50,480 --> 00:29:55,679 requirements for code to be artificially 695 00:29:52,960 --> 00:29:57,600 realizable uh so 696 00:29:55,679 --> 00:29:59,360 first you need uh to 697 00:29:57,600 --> 00:30:02,480 have instructions which are after 698 00:29:59,360 --> 00:30:05,200 vectorizable unique diligent uh you need 699 00:30:02,480 --> 00:30:06,840 no uh data dependence uh between the 700 00:30:05,200 --> 00:30:10,799 code 701 00:30:06,840 --> 00:30:14,399 and preferably to have countable on long 702 00:30:10,799 --> 00:30:16,559 groups however modern compilers can um 703 00:30:14,399 --> 00:30:19,600 kind of after victories actually 704 00:30:16,559 --> 00:30:19,600 uncountable loops 705 00:30:20,080 --> 00:30:23,760 let's 706 00:30:21,120 --> 00:30:27,120 see some examples 707 00:30:23,760 --> 00:30:29,840 uh first of all i have 708 00:30:27,120 --> 00:30:31,039 examples of two programs which are good 709 00:30:29,840 --> 00:30:32,960 to 710 00:30:31,039 --> 00:30:35,200 consider 711 00:30:32,960 --> 00:30:39,520 the first one is 712 00:30:35,200 --> 00:30:42,480 for basic box after exercise this um 713 00:30:39,520 --> 00:30:46,720 simple pattern so if you see that there 714 00:30:42,480 --> 00:30:49,039 are arrays of uniform um uniform 715 00:30:46,720 --> 00:30:52,559 operation so in this case we have 716 00:30:49,039 --> 00:30:55,760 h equal operations on almost equal 717 00:30:52,559 --> 00:30:57,440 operation on the way this almost no uh 718 00:30:55,760 --> 00:31:00,080 data dependency 719 00:30:57,440 --> 00:31:02,480 uh well this um 720 00:31:00,080 --> 00:31:04,880 this actual dependency between 721 00:31:02,480 --> 00:31:06,880 uh different values 722 00:31:04,880 --> 00:31:10,159 but um 723 00:31:06,880 --> 00:31:11,919 in this one we have almost 724 00:31:10,159 --> 00:31:12,720 no dependency 725 00:31:11,919 --> 00:31:13,840 uh 726 00:31:12,720 --> 00:31:16,840 anyway 727 00:31:13,840 --> 00:31:21,120 uh one once which is that you do very 728 00:31:16,840 --> 00:31:24,559 very very similar operations uh in 729 00:31:21,120 --> 00:31:26,159 basic works then you might have ideas as 730 00:31:24,559 --> 00:31:28,640 this code could be 731 00:31:26,159 --> 00:31:31,360 uh after victorianized 732 00:31:28,640 --> 00:31:33,200 let's see um 733 00:31:31,360 --> 00:31:36,480 another example 734 00:31:33,200 --> 00:31:37,519 uh this about loop also in this loop we 735 00:31:36,480 --> 00:31:39,519 have 736 00:31:37,519 --> 00:31:41,279 very similar 737 00:31:39,519 --> 00:31:42,960 uh operations 738 00:31:41,279 --> 00:31:44,799 uh on 739 00:31:42,960 --> 00:31:46,880 number of 740 00:31:44,799 --> 00:31:50,480 appearance so that's also a good 741 00:31:46,880 --> 00:31:53,360 signature for to be artificialized 742 00:31:50,480 --> 00:31:53,360 uh well 743 00:31:54,320 --> 00:31:56,640 so 744 00:31:57,279 --> 00:32:02,000 let's um 745 00:31:59,600 --> 00:32:04,640 let's see or what we want to have with 746 00:32:02,000 --> 00:32:08,799 um after vectorization 747 00:32:04,640 --> 00:32:10,559 uh first of all uh let me to 748 00:32:08,799 --> 00:32:11,600 compile the 749 00:32:10,559 --> 00:32:15,000 basic 750 00:32:11,600 --> 00:32:15,000 basic box 751 00:32:17,600 --> 00:32:20,600 um 752 00:32:33,679 --> 00:32:36,679 uh 753 00:32:46,960 --> 00:32:51,679 so 754 00:32:47,840 --> 00:32:55,440 uh in this um in this comment we i used 755 00:32:51,679 --> 00:32:57,440 uh offline optimization for gc i also 756 00:32:55,440 --> 00:32:59,760 adoptions to see 757 00:32:57,440 --> 00:33:03,039 whether something is optimized and we 758 00:32:59,760 --> 00:33:06,159 see that compiler says that we 759 00:33:03,039 --> 00:33:09,200 have optimization and also i use the 760 00:33:06,159 --> 00:33:10,960 strongest uh consistent uh course model 761 00:33:09,200 --> 00:33:13,279 is unlimited 762 00:33:10,960 --> 00:33:17,120 um in 763 00:33:13,279 --> 00:33:18,000 song let's have a look what we 764 00:33:17,120 --> 00:33:19,679 uh 765 00:33:18,000 --> 00:33:21,840 what um 766 00:33:19,679 --> 00:33:21,840 no 767 00:33:22,880 --> 00:33:25,840 let's um 768 00:33:26,240 --> 00:33:31,760 let's uh compile 769 00:33:29,039 --> 00:33:32,799 first let's compile with uh 770 00:33:31,760 --> 00:33:34,799 um 771 00:33:32,799 --> 00:33:36,799 bb in 772 00:33:34,799 --> 00:33:39,799 different versions 773 00:33:36,799 --> 00:33:39,799 um 774 00:33:42,399 --> 00:33:48,480 so i have two versions of the program 775 00:33:45,519 --> 00:33:50,399 uh the first one is 776 00:33:48,480 --> 00:33:53,760 plain 777 00:33:50,399 --> 00:33:56,080 plain program so we have uh basic 778 00:33:53,760 --> 00:33:59,519 function uh this by the way the 779 00:33:56,080 --> 00:34:01,919 cryptography function and the next uh 780 00:33:59,519 --> 00:34:05,039 program i have a restrict 781 00:34:01,919 --> 00:34:06,640 access and i have a lignite uh hints for 782 00:34:05,039 --> 00:34:08,399 the compiler that 783 00:34:06,640 --> 00:34:11,040 uh the data 784 00:34:08,399 --> 00:34:13,359 passed to the function will be aligned 785 00:34:11,040 --> 00:34:16,159 and let's compare the assembly which 786 00:34:13,359 --> 00:34:19,159 will be generated for both of them 787 00:34:16,159 --> 00:34:19,159 functions 788 00:34:38,639 --> 00:34:43,200 here we go 789 00:34:40,000 --> 00:34:44,720 uh so the first thing which we might 790 00:34:43,200 --> 00:34:45,839 notice is that 791 00:34:44,720 --> 00:34:47,280 um 792 00:34:45,839 --> 00:34:51,200 non-malignant 793 00:34:47,280 --> 00:34:54,560 version generates code with a suffix hue 794 00:34:51,200 --> 00:34:55,679 this move the q u is for an alignment 795 00:34:54,560 --> 00:34:58,960 while 796 00:34:55,679 --> 00:35:01,200 qa is for lignite access 797 00:34:58,960 --> 00:35:03,920 what it means 798 00:35:01,200 --> 00:35:05,119 to understand whether we have some 799 00:35:03,920 --> 00:35:07,280 performance 800 00:35:05,119 --> 00:35:08,640 uh improvement for for the function 801 00:35:07,280 --> 00:35:10,800 let's compare 802 00:35:08,640 --> 00:35:12,560 let's um 803 00:35:10,800 --> 00:35:16,720 see on uh 804 00:35:12,560 --> 00:35:20,079 internal uh documentation from both the 805 00:35:16,720 --> 00:35:22,480 instruction this um made the qa 806 00:35:20,079 --> 00:35:25,200 instruction and we have that on my 807 00:35:22,480 --> 00:35:28,000 architecture skyway we have a latency 808 00:35:25,200 --> 00:35:29,760 six and four put uh 809 00:35:28,000 --> 00:35:31,119 0.5 810 00:35:29,760 --> 00:35:33,680 for 811 00:35:31,119 --> 00:35:35,520 alignment access we have the same data 812 00:35:33,680 --> 00:35:38,000 so basically 813 00:35:35,520 --> 00:35:40,560 this instruction are pretty the same and 814 00:35:38,000 --> 00:35:42,560 we don't win any performance for 815 00:35:40,560 --> 00:35:45,680 from using colligate data 816 00:35:42,560 --> 00:35:48,240 i also checked the different uh 817 00:35:45,680 --> 00:35:51,119 alignment instruction and 818 00:35:48,240 --> 00:35:52,240 basically modern uh inter architectures 819 00:35:51,119 --> 00:35:55,040 um 820 00:35:52,240 --> 00:35:56,880 in most of the instruction uh provides 821 00:35:55,040 --> 00:36:00,160 the same performance 822 00:35:56,880 --> 00:36:02,400 also this almost um 823 00:36:00,160 --> 00:36:06,000 equal instructions for 824 00:36:02,400 --> 00:36:08,640 um research success and in this 825 00:36:06,000 --> 00:36:11,920 particular case it seems we don't 826 00:36:08,640 --> 00:36:13,359 have much uh performance benefit from a 827 00:36:11,920 --> 00:36:16,000 restrict 828 00:36:13,359 --> 00:36:16,000 keyword 829 00:36:17,440 --> 00:36:21,599 that's 830 00:36:18,720 --> 00:36:21,599 back to them 831 00:36:21,839 --> 00:36:27,760 well let's have a look 832 00:36:24,320 --> 00:36:30,880 uh one a couple of more examples 833 00:36:27,760 --> 00:36:34,000 uh fortune to the um 834 00:36:30,880 --> 00:36:35,040 modern couple compiler is also able to 835 00:36:34,000 --> 00:36:36,400 um 836 00:36:35,040 --> 00:36:41,280 uh 837 00:36:36,400 --> 00:36:44,160 uh to optimize uh ship was plus sgl 838 00:36:41,280 --> 00:36:47,520 uh algorithms and in this 839 00:36:44,160 --> 00:36:50,720 particular case i used um 840 00:36:47,520 --> 00:36:52,240 uh i used programs from 841 00:36:50,720 --> 00:36:55,760 this blog post 842 00:36:52,240 --> 00:36:57,280 which is that uh actually 843 00:36:55,760 --> 00:36:58,880 we 844 00:36:57,280 --> 00:37:04,160 we see some 845 00:36:58,880 --> 00:37:06,960 uh some vectorization if we add um 846 00:37:04,160 --> 00:37:09,119 if we want to compile them to 847 00:37:06,960 --> 00:37:12,000 generate vxr2 848 00:37:09,119 --> 00:37:16,000 then we see that um 849 00:37:12,000 --> 00:37:17,599 compiler might generate even uh 32 uh 850 00:37:16,000 --> 00:37:21,200 byte 851 00:37:17,599 --> 00:37:24,480 code from the after vectorization 852 00:37:21,200 --> 00:37:27,040 uh more algorithms of foster which are 853 00:37:24,480 --> 00:37:29,200 vectorizable you can find on this uh 854 00:37:27,040 --> 00:37:30,400 blog post and that's not 855 00:37:29,200 --> 00:37:32,560 all them 856 00:37:30,400 --> 00:37:35,440 uh not all the algorithms are actually 857 00:37:32,560 --> 00:37:37,440 artifactorizable 858 00:37:35,440 --> 00:37:41,280 however 859 00:37:37,440 --> 00:37:44,400 uh with uh clunk this um 860 00:37:41,280 --> 00:37:45,839 uh well i didn't show um 861 00:37:44,400 --> 00:37:47,040 performance for 862 00:37:45,839 --> 00:37:48,800 uh 863 00:37:47,040 --> 00:37:54,000 for for loop 864 00:37:48,800 --> 00:37:57,680 uh but um clunk actually are not able to 865 00:37:54,000 --> 00:37:58,560 uh read uh to make reduce reduction 866 00:37:57,680 --> 00:38:01,200 for 867 00:37:58,560 --> 00:38:04,720 uh memory appearance 868 00:38:01,200 --> 00:38:05,839 um oh no um we 869 00:38:04,720 --> 00:38:09,040 end up 870 00:38:05,839 --> 00:38:12,320 not to use after vectorization for hpm 871 00:38:09,040 --> 00:38:16,079 passer and elliptic of cryptography 872 00:38:12,320 --> 00:38:16,079 for hp pulse we 873 00:38:16,400 --> 00:38:23,599 we have only a small optimization for 874 00:38:20,240 --> 00:38:24,800 uh basic blocks because as we saw the 875 00:38:23,599 --> 00:38:28,079 um 876 00:38:24,800 --> 00:38:31,040 state machine as with uh switch uh 877 00:38:28,079 --> 00:38:35,359 statements or go to drive and don't have 878 00:38:31,040 --> 00:38:37,760 much uh vectorizable uh operations like 879 00:38:35,359 --> 00:38:40,320 operation of the same kind also there 880 00:38:37,760 --> 00:38:43,200 are no small loops which are 881 00:38:40,320 --> 00:38:44,240 vectorizable for 882 00:38:43,200 --> 00:38:47,280 uh 883 00:38:44,240 --> 00:38:49,760 which are vectorizable so hp pulsar is 884 00:38:47,280 --> 00:38:52,960 uh just not the kind of code which can 885 00:38:49,760 --> 00:38:54,720 be utilizable elliptic 886 00:38:52,960 --> 00:38:57,520 cryptography is actually a good 887 00:38:54,720 --> 00:39:00,960 candidate for after utilization and if 888 00:38:57,520 --> 00:39:01,599 you program math in here she postpones 889 00:39:00,960 --> 00:39:03,520 and 890 00:39:01,599 --> 00:39:06,800 typically benefit from 891 00:39:03,520 --> 00:39:09,839 art vectorization however in our case as 892 00:39:06,800 --> 00:39:13,040 well as for other cryptography libraries 893 00:39:09,839 --> 00:39:15,359 the most of the mathematics is written 894 00:39:13,040 --> 00:39:18,480 in already assembly 895 00:39:15,359 --> 00:39:20,160 and the things is that um 896 00:39:18,480 --> 00:39:23,599 we also didn't 897 00:39:20,160 --> 00:39:25,599 uh go with after exterization form of 898 00:39:23,599 --> 00:39:29,280 mathematics because 899 00:39:25,599 --> 00:39:31,680 there's additional latency introduced by 900 00:39:29,280 --> 00:39:35,680 uh some of this single search multiple 901 00:39:31,680 --> 00:39:38,079 data and also we need to uh care about 902 00:39:35,680 --> 00:39:41,040 uh alignment because scanner by default 903 00:39:38,079 --> 00:39:42,800 is a rigorous for eight bytes while for 904 00:39:41,040 --> 00:39:46,000 single selection multiple data we need a 905 00:39:42,800 --> 00:39:47,280 ligament for 52 bytes 906 00:39:46,000 --> 00:39:49,359 um 907 00:39:47,280 --> 00:39:51,280 also the 908 00:39:49,359 --> 00:39:53,119 specific 909 00:39:51,280 --> 00:39:55,520 uh specific 910 00:39:53,119 --> 00:39:56,960 considerations about legacy code for 911 00:39:55,520 --> 00:39:58,960 example if you 912 00:39:56,960 --> 00:40:02,079 might have a old 913 00:39:58,960 --> 00:40:04,160 code which is sc only enable you need to 914 00:40:02,079 --> 00:40:06,319 use with zero update by default 915 00:40:04,160 --> 00:40:07,599 compilers are used with zero 916 00:40:06,319 --> 00:40:10,880 uh 917 00:40:07,599 --> 00:40:15,359 instructions and we can make the code 918 00:40:10,880 --> 00:40:18,960 faster if we uh remove uh visible we can 919 00:40:15,359 --> 00:40:19,839 get up to five performance uh better 920 00:40:18,960 --> 00:40:21,839 uh 921 00:40:19,839 --> 00:40:24,079 now we have 922 00:40:21,839 --> 00:40:25,520 explanation why don't we use after 923 00:40:24,079 --> 00:40:28,400 vectorization 924 00:40:25,520 --> 00:40:29,599 and why we record mathematics in 925 00:40:28,400 --> 00:40:31,280 assembly 926 00:40:29,599 --> 00:40:33,440 uh actually 927 00:40:31,280 --> 00:40:36,240 elliptic off and those are cryptography 928 00:40:33,440 --> 00:40:38,480 used with very long integers 929 00:40:36,240 --> 00:40:42,800 like integers of 930 00:40:38,480 --> 00:40:47,599 two four six maybe more limbs limp is 931 00:40:42,800 --> 00:40:50,640 usually unsigned wonk so for our 256 uh 932 00:40:47,599 --> 00:40:53,280 bits arithmetic we have a volumes uh 933 00:40:50,640 --> 00:40:53,280 four enzymes 934 00:40:59,960 --> 00:41:05,599 foreign from linux but 935 00:41:03,680 --> 00:41:08,400 all the libraries 936 00:41:05,599 --> 00:41:10,720 move from the abstract 937 00:41:08,400 --> 00:41:11,440 mpi representation to 938 00:41:10,720 --> 00:41:15,359 go 939 00:41:11,440 --> 00:41:18,800 always so if you can see the 940 00:41:15,359 --> 00:41:21,839 c function we use with two limb 941 00:41:18,800 --> 00:41:21,839 arithmetic so we have 942 00:41:22,160 --> 00:41:25,359 number a and number b 943 00:41:24,400 --> 00:41:28,880 ah 944 00:41:25,359 --> 00:41:30,560 each of size two limbs and we need to 945 00:41:28,880 --> 00:41:33,119 sum them 946 00:41:30,560 --> 00:41:36,880 you might spend 947 00:41:33,119 --> 00:41:39,280 some time to realize whether the smp add 948 00:41:36,880 --> 00:41:43,599 function is correct because we need to 949 00:41:39,280 --> 00:41:48,000 do uh some uh code with uh carry 950 00:41:43,599 --> 00:41:49,280 uh however if we use assembly we can not 951 00:41:48,000 --> 00:41:51,920 only 952 00:41:49,280 --> 00:41:53,119 make the code 953 00:41:51,920 --> 00:41:54,160 easier 954 00:41:53,119 --> 00:41:55,760 using 955 00:41:54,160 --> 00:41:58,960 carrier 956 00:41:55,760 --> 00:42:00,400 carrier where additional but also we can 957 00:41:58,960 --> 00:42:04,240 avoid 958 00:42:00,400 --> 00:42:05,760 branching in the code also cpus provide 959 00:42:04,240 --> 00:42:07,440 uh additional 960 00:42:05,760 --> 00:42:10,640 uh 961 00:42:07,440 --> 00:42:12,079 additional instruction to parallelize uh 962 00:42:10,640 --> 00:42:12,800 two uh 963 00:42:12,079 --> 00:42:16,720 two 964 00:42:12,800 --> 00:42:18,400 additional chains uh thanks to uh 965 00:42:16,720 --> 00:42:20,000 edx 966 00:42:18,400 --> 00:42:23,440 instruction set 967 00:42:20,000 --> 00:42:26,160 so in assembly we can uh 968 00:42:23,440 --> 00:42:29,599 do much more efficient 969 00:42:26,160 --> 00:42:32,560 uh efficient uh programming the last 970 00:42:29,599 --> 00:42:36,160 thing which i want to to mention 971 00:42:32,560 --> 00:42:37,680 is a problem which we recently faced on 972 00:42:36,160 --> 00:42:40,079 linux system 973 00:42:37,680 --> 00:42:41,440 and if your 974 00:42:40,079 --> 00:42:45,440 program uses 975 00:42:41,440 --> 00:42:46,880 spin locks in user space then you might 976 00:42:45,440 --> 00:42:47,920 see that 977 00:42:46,880 --> 00:42:49,119 if 978 00:42:47,920 --> 00:42:51,680 one thread 979 00:42:49,119 --> 00:42:55,440 acquires the spin lock and another 980 00:42:51,680 --> 00:42:58,960 thread is spinning and waiting for this 981 00:42:55,440 --> 00:43:00,480 spin walk so the waiting thread 982 00:42:58,960 --> 00:43:04,560 uh quickly 983 00:43:00,480 --> 00:43:05,440 uh just um spins in a bathroom loop 984 00:43:04,560 --> 00:43:08,000 and 985 00:43:05,440 --> 00:43:10,800 such processes which just 986 00:43:08,000 --> 00:43:12,400 balance cpu are considered as batch 987 00:43:10,800 --> 00:43:16,400 process buzz 988 00:43:12,400 --> 00:43:18,880 linux cpu scheduler i suppose there are 989 00:43:16,400 --> 00:43:21,440 interactive processors which are needed 990 00:43:18,880 --> 00:43:23,119 sometime just periodically and your 991 00:43:21,440 --> 00:43:25,599 monitoring system 992 00:43:23,119 --> 00:43:28,160 is typically 993 00:43:25,599 --> 00:43:30,720 considered by the system schedule as an 994 00:43:28,160 --> 00:43:34,079 interactive uh process because it's uh 995 00:43:30,720 --> 00:43:35,520 needs a speed timely only from time to 996 00:43:34,079 --> 00:43:36,400 time 997 00:43:35,520 --> 00:43:37,839 so 998 00:43:36,400 --> 00:43:39,200 in this 999 00:43:37,839 --> 00:43:41,599 case we 1000 00:43:39,200 --> 00:43:43,359 can use a perscad 1001 00:43:41,599 --> 00:43:45,280 time hist 1002 00:43:43,359 --> 00:43:49,839 to to 1003 00:43:45,280 --> 00:43:53,200 profile the system schedule and in this 1004 00:43:49,839 --> 00:43:55,599 in this case we saw that this 1005 00:43:53,200 --> 00:43:57,599 performance critical waiting process 1006 00:43:55,599 --> 00:43:58,800 which is spinning in 1007 00:43:57,599 --> 00:44:01,680 spin lock 1008 00:43:58,800 --> 00:44:05,440 and it's uh some point 1009 00:44:01,680 --> 00:44:08,960 uh while monitoring uh took the 1010 00:44:05,440 --> 00:44:12,319 cpu uh time and wake awakened the 1011 00:44:08,960 --> 00:44:15,680 process the process actually uh 1012 00:44:12,319 --> 00:44:19,520 received that the cpu time almost 1013 00:44:15,680 --> 00:44:22,079 100 milliseconds when it was awakened 1014 00:44:19,520 --> 00:44:23,359 and during that time another processes 1015 00:44:22,079 --> 00:44:27,599 uh 1016 00:44:23,359 --> 00:44:29,119 took place on the cpu uh so if you use 1017 00:44:27,599 --> 00:44:32,000 um 1018 00:44:29,119 --> 00:44:34,480 uh spin locks in user space in user 1019 00:44:32,000 --> 00:44:39,119 space we don't have primition control as 1020 00:44:34,480 --> 00:44:44,560 we have on in the candle space then uh 1021 00:44:39,119 --> 00:44:46,640 you you can face a huge tail latencies 1022 00:44:44,560 --> 00:44:48,240 well um that's 1023 00:44:46,640 --> 00:44:49,520 all um 1024 00:44:48,240 --> 00:44:52,560 what i had 1025 00:44:49,520 --> 00:44:53,760 so probably we have some time for 1026 00:44:52,560 --> 00:44:55,760 questions 1027 00:44:53,760 --> 00:44:58,000 so 1028 00:44:55,760 --> 00:44:58,000 our 1029 00:44:58,880 --> 00:45:04,880 in fact we're already over time 1030 00:45:01,119 --> 00:45:07,440 alexander there are two questions in the 1031 00:45:04,880 --> 00:45:09,440 question section on venulis 1032 00:45:07,440 --> 00:45:12,319 it would be good to 1033 00:45:09,440 --> 00:45:14,079 address those maybe in the chat 1034 00:45:12,319 --> 00:45:15,920 okay i will 1035 00:45:14,079 --> 00:45:18,560 thank you very much okay thank you very 1036 00:45:15,920 --> 00:45:20,640 much thank you okay everybody so now we 1037 00:45:18,560 --> 00:45:23,520 will have the tea break afternoon tea 1038 00:45:20,640 --> 00:45:24,720 break and back here 1039 00:45:23,520 --> 00:45:29,400 shortly 1040 00:45:24,720 --> 00:45:29,400 okay thank you bye thank you