1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:17,039 --> 00:00:23,760 hi everyone today i'd like to talk about 3 00:00:20,000 --> 00:00:26,720 the persistent memory plus rdma new age 4 00:00:23,760 --> 00:00:26,720 remote device 5 00:00:28,640 --> 00:00:34,480 my name is yangxiang i'm a software 6 00:00:30,960 --> 00:00:37,040 engineer at fujitsu nanda in china 7 00:00:34,480 --> 00:00:39,920 and have been working on linux and 8 00:00:37,040 --> 00:00:42,559 related oss for six years 9 00:00:39,920 --> 00:00:45,840 i became a maintainer of linux test 10 00:00:42,559 --> 00:00:48,800 project at the end of 2018 11 00:00:45,840 --> 00:00:52,399 currently i'm focusing on persistent 12 00:00:48,800 --> 00:00:55,440 memory and rdma 13 00:00:52,399 --> 00:00:57,600 this is the agenda of my presentation it 14 00:00:55,440 --> 00:01:00,320 includes five parts 15 00:00:57,600 --> 00:01:02,320 i will explain why persistent memory 16 00:01:00,320 --> 00:01:05,199 with rdma 17 00:01:02,320 --> 00:01:08,000 and then show the new specification of rdma 18 00:01:05,199 --> 00:01:10,479 for remote pmem then 19 00:01:08,000 --> 00:01:13,600 i will share how to implement the new 20 00:01:10,479 --> 00:01:15,119 specification on softroce and 21 00:01:13,600 --> 00:01:17,680 libibverbs 22 00:01:15,119 --> 00:01:20,880 and then i will introduce remote 23 00:01:17,680 --> 00:01:24,000 persistent memory access library 24 00:01:20,880 --> 00:01:27,360 finally i will give a conclusion and share 25 00:01:24,000 --> 00:01:28,479 our future work 26 00:01:27,360 --> 00:01:31,520 okay 27 00:01:28,479 --> 00:01:34,560 let me start from why persistent memory 28 00:01:31,520 --> 00:01:34,560 with rdma 29 00:01:34,720 --> 00:01:40,960 what is persistent memory persistent 30 00:01:37,920 --> 00:01:43,920 memory is a high performance and 31 00:01:40,960 --> 00:01:46,720 byte addressable memory device 32 00:01:43,920 --> 00:01:49,520 which resides on the memory bus 33 00:01:46,720 --> 
00:01:50,640 pmem is the short form of persistent 34 00:01:49,520 --> 00:01:54,320 memory 35 00:01:50,640 --> 00:01:56,960 it has many advantages 36 00:01:54,320 --> 00:01:59,040 for example data is non-volatile 37 00:01:56,960 --> 00:02:01,920 after power interruption 38 00:01:59,040 --> 00:02:03,360 it has nearly the same speed and latency 39 00:02:01,920 --> 00:02:06,240 as dram 40 00:02:03,360 --> 00:02:09,440 it is cheaper than dram and provides 41 00:02:06,240 --> 00:02:12,560 larger capacity like ssd 42 00:02:09,440 --> 00:02:13,680 user processes can access pmem in four 43 00:02:12,560 --> 00:02:17,920 modes 44 00:02:13,680 --> 00:02:20,800 fsdax devdax sector and raw 45 00:02:17,920 --> 00:02:22,959 fsdax and devdax are good for 46 00:02:20,800 --> 00:02:25,040 improving the performance 47 00:02:22,959 --> 00:02:28,480 because they are designed to access 48 00:02:25,040 --> 00:02:28,480 pmem directly 49 00:02:29,520 --> 00:02:36,319 certainly it's faster to access local 50 00:02:32,319 --> 00:02:39,599 pmem in either fsdax or devdax mode 51 00:02:36,319 --> 00:02:42,080 however modern it systems and services 52 00:02:39,599 --> 00:02:45,280 need to transport data from or to 53 00:02:42,080 --> 00:02:48,080 remote pmem such as distributed 54 00:02:45,280 --> 00:02:50,800 database distributed file system 55 00:02:48,080 --> 00:02:53,680 key value store and so on 56 00:02:50,800 --> 00:02:54,879 traditional tcp becomes the performance 57 00:02:53,680 --> 00:02:59,040 bottleneck 58 00:02:54,879 --> 00:03:02,159 due to a lot of redundant overhead 59 00:02:59,040 --> 00:03:03,599 look at this figure 60 00:03:02,159 --> 00:03:05,840 for example 61 00:03:03,599 --> 00:03:07,760 copying data between user space and 62 00:03:05,840 --> 00:03:11,120 kernel space 63 00:03:07,760 --> 00:03:14,080 packaging data by the software tcp stack 64 00:03:11,120 --> 00:03:17,200 of the operating system 65 00:03:14,080 --> 00:03:20,480 in 
this case we need a faster way 66 00:03:17,200 --> 00:03:20,480 to access remote pmem 67 00:03:21,120 --> 00:03:25,519 rdma is a good solution to access remote 68 00:03:24,319 --> 00:03:28,400 pmem 69 00:03:25,519 --> 00:03:31,120 rdma is the short form of remote 70 00:03:28,400 --> 00:03:34,120 direct memory access 71 00:03:31,120 --> 00:03:36,799 it is a technology that enables 72 00:03:34,120 --> 00:03:39,599 computers in a network to exchange 73 00:03:36,799 --> 00:03:42,080 data in main memory without 74 00:03:39,599 --> 00:03:43,680 involving the operating system of either 75 00:03:42,080 --> 00:03:46,879 computer 76 00:03:43,680 --> 00:03:48,560 it avoids redundant overhead because of 77 00:03:46,879 --> 00:03:50,080 its advantages 78 00:03:48,560 --> 00:03:52,879 for example 79 00:03:50,080 --> 00:03:54,560 it provides zero copy between kernel space 80 00:03:52,879 --> 00:03:58,840 and user space 81 00:03:54,560 --> 00:04:02,480 bypasses the host system's software tcp 82 00:03:58,840 --> 00:04:03,680 stack and moves data without cpu 83 00:04:02,480 --> 00:04:06,959 involvement 84 00:04:03,680 --> 00:04:06,959 by dma engine 85 00:04:08,720 --> 00:04:14,480 is it good enough to access remote 86 00:04:11,360 --> 00:04:16,560 pmem by traditional rdma 87 00:04:14,480 --> 00:04:20,000 not really 88 00:04:16,560 --> 00:04:22,960 rdma has two problems for accessing 89 00:04:20,000 --> 00:04:25,440 remote pmem 90 00:04:22,960 --> 00:04:27,520 the first problem is no guarantee of 91 00:04:25,440 --> 00:04:29,440 data persistency 92 00:04:27,520 --> 00:04:32,160 look at this figure 93 00:04:29,440 --> 00:04:36,080 the responder returns an acknowledgement as soon as 94 00:04:32,160 --> 00:04:38,720 the rdma write reaches the remote rnic 95 00:04:36,080 --> 00:04:41,280 the written data will be lost when it has 96 00:04:38,720 --> 00:04:44,320 not been saved into remote pmem and 97 00:04:41,280 --> 00:04:46,560 the remote system is powered down 98 
00:04:44,320 --> 00:04:49,440 for data persistency 99 00:04:46,560 --> 00:04:52,479 we need a way to confirm that the data 100 00:04:49,440 --> 00:04:55,479 is actually written to remote 101 00:04:52,479 --> 00:04:55,479 pmem 102 00:04:56,240 --> 00:05:00,800 the second problem is no guarantee of 103 00:04:58,720 --> 00:05:03,600 data consistency 104 00:05:00,800 --> 00:05:06,400 two-phase commit is widely used by 105 00:05:03,600 --> 00:05:08,720 distributed databases 106 00:05:06,400 --> 00:05:11,840 for example an application writes a 107 00:05:08,720 --> 00:05:14,560 block of data by two-phase commit 108 00:05:11,840 --> 00:05:15,919 like the following figure 109 00:05:14,560 --> 00:05:18,080 step one 110 00:05:15,919 --> 00:05:21,520 write a block of data into remote 111 00:05:18,080 --> 00:05:24,320 pmem step two mark the data as 112 00:05:21,520 --> 00:05:25,919 valid by updating an 8-byte value 113 00:05:24,320 --> 00:05:28,800 atomically 114 00:05:25,919 --> 00:05:32,880 another application can know if the data 115 00:05:28,800 --> 00:05:36,240 is valid by reading the 8-byte value 116 00:05:32,880 --> 00:05:38,479 rdma doesn't provide an api for atomic 117 00:05:36,240 --> 00:05:40,960 write yet as in step 2 118 00:05:38,479 --> 00:05:44,320 so we need a way to update an 8-byte 119 00:05:40,960 --> 00:05:44,320 value atomically 120 00:05:44,800 --> 00:05:48,000 there are two ways to solve these 121 00:05:46,880 --> 00:05:50,400 problems 122 00:05:48,000 --> 00:05:54,240 the first way is to introduce a new 123 00:05:50,400 --> 00:05:57,360 specification to extend rdma 124 00:05:54,240 --> 00:06:00,560 it adds rdma flush to guarantee data 125 00:05:57,360 --> 00:06:03,600 persistency and adds rdma atomic 126 00:06:00,560 --> 00:06:06,240 write to guarantee data consistency 127 00:06:03,600 --> 00:06:10,240 the second way is to make a new upper layer 128 00:06:06,240 --> 00:06:12,800 library it not only guarantees the 129 00:06:10,240 
--> 00:06:14,400 persistency and the consistency of data 130 00:06:12,800 --> 00:06:15,759 but also 131 00:06:14,400 --> 00:06:19,120 hides the 132 00:06:15,759 --> 00:06:22,000 complexity of rdma and provides a set of 133 00:06:19,120 --> 00:06:24,160 simple apis to applications 134 00:06:22,000 --> 00:06:26,639 in addition it will support the new 135 00:06:24,160 --> 00:06:27,600 specification in the future 136 00:06:26,639 --> 00:06:30,560 ok 137 00:06:27,600 --> 00:06:34,360 i will talk about both solutions and 138 00:06:30,560 --> 00:06:34,360 our effort next 139 00:06:35,440 --> 00:06:42,000 let me show the new specification of rdma for 140 00:06:38,800 --> 00:06:42,000 remote pmem 141 00:06:42,479 --> 00:06:46,560 there are two associations making a new 142 00:06:45,039 --> 00:06:49,039 specification 143 00:06:46,560 --> 00:06:52,880 ibta and ietf 144 00:06:49,039 --> 00:06:55,759 ibta released the v1.5 specification 145 00:06:52,880 --> 00:06:59,280 in august 2021 146 00:06:55,759 --> 00:07:01,280 and it defined the new rdma operations 147 00:06:59,280 --> 00:07:05,840 for remote pmem 148 00:07:01,280 --> 00:07:08,479 ietf released a draft in march 2020 but 149 00:07:05,840 --> 00:07:11,520 has not updated it since 150 00:07:08,479 --> 00:07:14,000 it also defined the new rdma operations 151 00:07:11,520 --> 00:07:17,120 for remote pmem 152 00:07:14,000 --> 00:07:20,479 intel has shown the overview of ibta's 153 00:07:17,120 --> 00:07:22,000 new specification at storage developer 154 00:07:20,479 --> 00:07:25,199 conference 155 00:07:22,000 --> 00:07:28,080 today i'll talk about ibta's new 156 00:07:25,199 --> 00:07:28,080 specification 157 00:07:28,479 --> 00:07:35,599 ibta's new specification defines a new rdma 158 00:07:32,800 --> 00:07:37,520 flush operation 159 00:07:35,599 --> 00:07:40,720 look at this figure 160 00:07:37,520 --> 00:07:43,039 a new rdma flush can flush all previous 161 00:07:40,720 --> 00:07:46,400 writes or specific 162 00:07:43,039 
--> 00:07:49,520 memory regions it guarantees that the 163 00:07:46,400 --> 00:07:52,240 data is pushed to global 164 00:07:49,520 --> 00:07:55,199 visibility or persistency 165 00:07:52,240 --> 00:07:58,560 it will send an rdma read response 166 00:07:55,199 --> 00:08:00,960 with zero size to the requester after 167 00:07:58,560 --> 00:08:04,000 the data has been posted 168 00:08:00,960 --> 00:08:06,879 on both requester and responder the 169 00:08:04,000 --> 00:08:10,160 rdma write and rdma flush should be 170 00:08:06,879 --> 00:08:10,160 handled in order 171 00:08:11,280 --> 00:08:17,919 ibta's new specification also defines a new 172 00:08:15,039 --> 00:08:20,240 rdma atomic write operation 173 00:08:17,919 --> 00:08:23,599 look at this figure 174 00:08:20,240 --> 00:08:26,560 a new rdma atomic write can write an 175 00:08:23,599 --> 00:08:29,599 aligned 8-byte value atomically it 176 00:08:26,560 --> 00:08:31,759 will send an rdma read response with 177 00:08:29,599 --> 00:08:33,839 zero size to the requester 178 00:08:31,759 --> 00:08:35,200 after the 8-byte value has been 179 00:08:33,839 --> 00:08:38,159 written 180 00:08:35,200 --> 00:08:41,120 on both requester and responder the 181 00:08:38,159 --> 00:08:45,120 rdma flush and the rdma atomic write 182 00:08:41,120 --> 00:08:45,120 should also be handled in order 183 00:08:46,000 --> 00:08:51,760 to support rdma flush and rdma atomic 184 00:08:49,680 --> 00:08:54,560 write operations 185 00:08:51,760 --> 00:08:56,959 what must be extended in the stack of 186 00:08:54,560 --> 00:09:00,480 rdma 187 00:08:56,959 --> 00:09:03,040 as shown in the figure below 188 00:09:00,480 --> 00:09:06,800 the whole stack of rdma needs to be 189 00:09:03,040 --> 00:09:08,080 extended to support new operations 190 00:09:06,800 --> 00:09:11,519 the libibverbs 191 00:09:08,080 --> 00:09:14,959 library provides rdma api 192 00:09:11,519 --> 00:09:17,920 to applications and it does not support 193 
00:09:14,959 --> 00:09:19,279 the new operations yet 194 00:09:17,920 --> 00:09:22,640 currently 195 00:09:19,279 --> 00:09:26,880 there is no hardware rnic and related 196 00:09:22,640 --> 00:09:26,880 driver to support the new operations 197 00:09:27,760 --> 00:09:33,440 the next thing i'm going to talk about 198 00:09:30,560 --> 00:09:37,839 is how to implement the new specification 199 00:09:33,440 --> 00:09:37,839 on softroce and libibverbs 200 00:09:38,560 --> 00:09:44,080 why use softroce as i said the new 201 00:09:41,519 --> 00:09:45,240 specification requires hardware support 202 00:09:44,080 --> 00:09:49,360 usually 203 00:09:45,240 --> 00:09:52,160 the v1.5 specification has been released but 204 00:09:49,360 --> 00:09:53,200 hardware vendors need time to make new 205 00:09:52,160 --> 00:09:55,680 rnic 206 00:09:53,200 --> 00:09:58,000 it may take a long time and we don't want 207 00:09:55,680 --> 00:10:00,959 to wait for the new rnic 208 00:09:58,000 --> 00:10:04,560 for this reason we are focusing on the softroce 209 00:10:00,959 --> 00:10:07,360 driver which is designed to make a 210 00:10:04,560 --> 00:10:10,480 normal nic support rdma 211 00:10:07,360 --> 00:10:14,160 though it is slower than a real rnic 212 00:10:10,480 --> 00:10:17,680 users can experience rdma easily 213 00:10:14,160 --> 00:10:20,959 finally we decided to extend softroce 214 00:10:17,680 --> 00:10:20,959 and libibverbs 215 00:10:21,279 --> 00:10:29,200 this figure shows the software 216 00:10:24,399 --> 00:10:33,200 stack of rdma based on softroce 217 00:10:29,200 --> 00:10:36,000 what is softroce we have to know rocev2 218 00:10:33,200 --> 00:10:39,360 before introducing softroce 219 00:10:36,000 --> 00:10:42,800 rocev2 is the short form of ip routable 220 00:10:39,360 --> 00:10:46,720 rdma over converged ethernet 221 00:10:42,800 --> 00:10:48,959 rocev2 is a network protocol that can 222 00:10:46,720 --> 00:10:52,079 transfer ib transport header and the 223 00:10:48,959 --> 00:10:54,880 
payload through the traditional ethernet 224 00:10:52,079 --> 00:10:57,760 ip and udp headers 225 00:10:54,880 --> 00:11:01,519 if packets are formatted 226 00:10:57,760 --> 00:11:04,640 by rocev2 they can be forwarded by tcp/ip 227 00:11:01,519 --> 00:11:08,720 routers and switches 228 00:11:04,640 --> 00:11:11,680 softroce is software-based rocev2 229 00:11:08,720 --> 00:11:14,560 it produces the ib transport header and 230 00:11:11,680 --> 00:11:16,800 inserts it and the payload into the udp 231 00:11:14,560 --> 00:11:19,120 header by software 232 00:11:16,800 --> 00:11:22,480 the red figure shows the 233 00:11:19,120 --> 00:11:25,600 difference between hardware rocev2 and 234 00:11:22,480 --> 00:11:25,600 softroce 235 00:11:26,320 --> 00:11:33,120 let me talk about how to implement the new 236 00:11:29,440 --> 00:11:36,240 rdma flush process on softroce 237 00:11:33,120 --> 00:11:38,320 please see the detailed logic of rdma 238 00:11:36,240 --> 00:11:39,600 flush we implemented 239 00:11:38,320 --> 00:11:43,040 step 1 240 00:11:39,600 --> 00:11:45,760 when local softroce processes an rdma 241 00:11:43,040 --> 00:11:48,720 flush request from user space 242 00:11:45,760 --> 00:11:51,920 it ensures all previous requests have 243 00:11:48,720 --> 00:11:54,560 been sent by default or ensures all 244 00:11:51,920 --> 00:11:58,240 previous requests have been completed by 245 00:11:54,560 --> 00:12:02,720 a fence flag the fence flag is used to 246 00:11:58,240 --> 00:12:07,519 ensure the execution order of operations 247 00:12:02,720 --> 00:12:09,920 step 2 local softroce prepares an 248 00:12:07,519 --> 00:12:11,920 rdma flush request packet by the 249 00:12:09,920 --> 00:12:15,920 following changes 250 00:12:11,920 --> 00:12:19,600 add the new ib opcode rc rdma flush 251 00:12:15,920 --> 00:12:21,360 opcode in base transport header 252 00:12:19,600 --> 00:12:23,760 add the new 253 00:12:21,360 --> 00:12:26,639 flush extended transport header 254 00:12:23,760 --> 00:12:29,519 
including selectivity level and 255 00:12:26,639 --> 00:12:32,560 placement type 256 00:12:29,519 --> 00:12:35,760 and specify the address and length 257 00:12:32,560 --> 00:12:37,279 to flush in rdma extended transport 258 00:12:35,760 --> 00:12:38,480 header 259 00:12:37,279 --> 00:12:41,839 step 3 260 00:12:38,480 --> 00:12:46,000 local softroce sends the rdma flush 261 00:12:41,839 --> 00:12:46,000 request packet over udp 262 00:12:47,440 --> 00:12:54,399 step 4 remote softroce accepts the rdma 263 00:12:51,360 --> 00:12:57,360 flush request packet after executing 264 00:12:54,399 --> 00:12:58,720 all previous requests once 265 00:12:57,360 --> 00:13:02,079 step 5 266 00:12:58,720 --> 00:13:04,720 remote softroce flushes the specified 267 00:13:02,079 --> 00:13:07,600 range and type according to the 268 00:13:04,720 --> 00:13:09,040 content of flush extended transport 269 00:13:07,600 --> 00:13:11,760 header 270 00:13:09,040 --> 00:13:14,720 selectivity level defines the memory 271 00:13:11,760 --> 00:13:16,480 region ranges the rdma flush should 272 00:13:14,720 --> 00:13:19,040 apply to 273 00:13:16,480 --> 00:13:24,560 placement type defines the memory 274 00:13:19,040 --> 00:13:24,560 placement guarantee of this rdma flush 275 00:13:25,519 --> 00:13:32,839 step 6 remote softroce prepares an rdma 276 00:13:29,760 --> 00:13:36,959 flush response packet by the following 277 00:13:32,839 --> 00:13:39,920 changes use ib opcode rc rdma read 278 00:13:36,959 --> 00:13:41,120 response only opcode in base transport 279 00:13:39,920 --> 00:13:44,160 header 280 00:13:41,120 --> 00:13:47,120 set ack or nak in 281 00:13:44,160 --> 00:13:48,560 ack extended transport header 282 00:13:47,120 --> 00:13:51,600 step 7 283 00:13:48,560 --> 00:13:54,800 remote softroce sends the rdma flush 284 00:13:51,600 --> 00:13:56,720 response packet over udp 285 00:13:54,800 --> 00:13:59,839 step 8 286 00:13:56,720 --> 00:14:03,360 local softroce accepts the 
rdma 287 00:13:59,839 --> 00:14:07,040 flush response packet and generates the 288 00:14:03,360 --> 00:14:07,040 corresponding completion 289 00:14:08,480 --> 00:14:13,360 let me talk about 290 00:14:10,560 --> 00:14:16,720 how to implement the new rdma atomic 291 00:14:13,360 --> 00:14:19,199 write process on softroce 292 00:14:16,720 --> 00:14:22,560 please see the detailed logic of 293 00:14:19,199 --> 00:14:26,079 rdma atomic write we implemented 294 00:14:22,560 --> 00:14:29,360 step one when local softroce 295 00:14:26,079 --> 00:14:33,199 processes an rdma atomic write request 296 00:14:29,360 --> 00:14:36,000 from user space it ensures all previous 297 00:14:33,199 --> 00:14:38,720 requests have been sent by default 298 00:14:36,000 --> 00:14:42,160 or ensures all previous requests have 299 00:14:38,720 --> 00:14:43,360 been completed by a fence flag 300 00:14:42,160 --> 00:14:46,959 step 2 301 00:14:43,360 --> 00:14:49,440 local softroce prepares an rdma atomic 302 00:14:46,959 --> 00:14:50,720 write request packet by the following 303 00:14:49,440 --> 00:14:52,160 changes 304 00:14:50,720 --> 00:14:56,000 add the new 305 00:14:52,160 --> 00:14:58,720 ib opcode rc rdma atomic write opcode 306 00:14:56,000 --> 00:15:02,480 in base transport header 307 00:14:58,720 --> 00:15:05,760 specify the address and the length to 308 00:15:02,480 --> 00:15:06,880 atomic write in rdma extended transport 309 00:15:05,760 --> 00:15:10,480 header 310 00:15:06,880 --> 00:15:13,480 and populate an aligned 8-byte 311 00:15:10,480 --> 00:15:13,480 payload 312 00:15:14,480 --> 00:15:20,959 step 3 local softroce sends the rdma 313 00:15:18,399 --> 00:15:22,160 atomic write request packet over 314 00:15:20,959 --> 00:15:26,240 udp 315 00:15:22,160 --> 00:15:28,560 step 4 remote softroce accepts the rdma 316 00:15:26,240 --> 00:15:31,680 atomic write request packet 317 00:15:28,560 --> 00:15:32,639 after executing all previous requests 318 00:15:31,680 --> 00:15:36,880 once 
319 00:15:32,639 --> 00:15:40,440 step 5 remote softroce writes the 8-byte 320 00:15:36,880 --> 00:15:40,440 payload atomically 321 00:15:41,120 --> 00:15:46,320 step 6 322 00:15:42,560 --> 00:15:48,800 remote softroce prepares an rdma atomic 323 00:15:46,320 --> 00:15:50,399 write response packet by the following 324 00:15:48,800 --> 00:15:54,160 changes 325 00:15:50,399 --> 00:15:57,920 use ib opcode rc rdma read response 326 00:15:54,160 --> 00:16:00,079 only opcode in base transport header 327 00:15:57,920 --> 00:16:03,519 set ack or nak in 328 00:16:00,079 --> 00:16:05,120 ack extended transport header 329 00:16:03,519 --> 00:16:08,320 step 7 330 00:16:05,120 --> 00:16:11,600 remote softroce sends the rdma atomic 331 00:16:08,320 --> 00:16:12,959 write response packet over udp 332 00:16:11,600 --> 00:16:16,079 step 8 333 00:16:12,959 --> 00:16:19,040 local softroce accepts the rdma 334 00:16:16,079 --> 00:16:23,839 atomic write response packet and 335 00:16:19,040 --> 00:16:23,839 generates the corresponding completion 336 00:16:24,399 --> 00:16:30,959 okay let's go on to the next how to 337 00:16:27,440 --> 00:16:32,320 implement the new rdma flush api on 338 00:16:30,959 --> 00:16:35,440 libibverbs 339 00:16:32,320 --> 00:16:36,320 to support rdma flush in 340 00:16:35,440 --> 00:16:40,000 ibv_post_send 341 00:16:36,320 --> 00:16:43,199 we defined the new IBV_WR_RDMA_FLUSH 342 00:16:40,000 --> 00:16:46,480 opcode to identify a flush operation 343 00:16:43,199 --> 00:16:48,800 and added the new structure flush to 344 00:16:46,480 --> 00:16:50,480 transfer the information required by the 345 00:16:48,800 --> 00:16:54,959 flush operation 346 00:16:50,480 --> 00:16:58,959 to support rdma flush in ibv_poll_cq 347 00:16:54,959 --> 00:17:02,320 we defined the new IBV_WC_RDMA_FLUSH 348 00:16:58,959 --> 00:17:04,640 opcode to identify a completed flush 349 00:17:02,320 --> 00:17:07,199 operation 350 00:17:04,640 --> 00:17:08,959 the 
following code shows 351 00:17:07,199 --> 00:17:10,160 how an application 352 00:17:08,959 --> 00:17:13,120 uses 353 00:17:10,160 --> 00:17:14,640 the rdma flush api 354 00:17:13,120 --> 00:17:19,120 for example 355 00:17:14,640 --> 00:17:20,079 post an rdma flush request by ibv_post_send 356 00:17:19,120 --> 00:17:23,760 and 357 00:17:20,079 --> 00:17:26,480 get the completion of rdma flush by 358 00:17:23,760 --> 00:17:26,480 ibv_poll_cq 359 00:17:28,559 --> 00:17:34,400 how to implement the new rdma atomic 360 00:17:31,520 --> 00:17:37,760 write api on libibverbs 361 00:17:34,400 --> 00:17:41,760 to support rdma atomic write in 362 00:17:37,760 --> 00:17:45,039 ibv_post_send we defined the new 363 00:17:41,760 --> 00:17:46,960 IBV_WR_RDMA_ATOMIC_WRITE opcode to identify 364 00:17:45,039 --> 00:17:49,919 an atomic write operation 365 00:17:46,960 --> 00:17:52,960 and make use of structure rdma to 366 00:17:49,919 --> 00:17:55,280 transfer the information required by 367 00:17:52,960 --> 00:17:58,799 the atomic write operation 368 00:17:55,280 --> 00:18:03,679 to support rdma atomic write in 369 00:17:58,799 --> 00:18:06,640 ibv_poll_cq we defined the new 370 00:18:03,679 --> 00:18:10,640 IBV_WC_RDMA_ATOMIC_WRITE opcode to identify a 371 00:18:06,640 --> 00:18:13,360 completed atomic write operation 372 00:18:10,640 --> 00:18:16,160 the following code also shows how 373 00:18:13,360 --> 00:18:18,480 an application uses the rdma atomic write 374 00:18:16,160 --> 00:18:22,799 api for example 375 00:18:18,480 --> 00:18:26,000 post an rdma atomic write request by 376 00:18:22,799 --> 00:18:31,200 ibv_post_send and get the completion of 377 00:18:26,000 --> 00:18:31,200 rdma atomic write by ibv_poll_cq 378 00:18:31,840 --> 00:18:37,679 the new rdma operations are under 379 00:18:34,480 --> 00:18:40,679 development are there any available 380 00:18:37,679 --> 00:18:40,679 solutions 381 00:18:42,640 --> 00:18:49,679 remote persistent memory access library 382 00:18:45,840 --> 
00:18:49,679 is an available solution 383 00:18:49,760 --> 00:18:56,080 what is remote persistent memory 384 00:18:52,480 --> 00:18:59,840 access library it is a new library to 385 00:18:56,080 --> 00:19:02,799 access remote pmem over rdma 386 00:18:59,840 --> 00:19:06,400 librpma is the short form of remote 387 00:19:02,799 --> 00:19:08,400 persistent memory access library 388 00:19:06,400 --> 00:19:10,960 librpma 389 00:19:08,400 --> 00:19:13,919 provides a complete set of apis for 390 00:19:10,960 --> 00:19:18,160 applications to access remote pmem 391 00:19:13,919 --> 00:19:19,600 like rpma_send rpma_recv rpma_write 392 00:19:18,160 --> 00:19:22,880 and so on 393 00:19:19,600 --> 00:19:25,280 it has rpma_flush to flush previous 394 00:19:22,880 --> 00:19:28,559 writes into remote pmem 395 00:19:25,280 --> 00:19:31,840 it also has rpma atomic write to mark 396 00:19:28,559 --> 00:19:34,400 the previous write valid atomically after 397 00:19:31,840 --> 00:19:36,000 the previous rpma_flush has been 398 00:19:34,400 --> 00:19:39,840 completed 399 00:19:36,000 --> 00:19:41,919 it will support the new rdma operations when 400 00:19:39,840 --> 00:19:45,039 they are available 401 00:19:41,919 --> 00:19:48,240 intel and fujitsu are the main contributors 402 00:19:45,039 --> 00:19:48,240 to librpma 403 00:19:48,799 --> 00:19:55,120 let's look at the basic api of 404 00:19:51,600 --> 00:19:57,360 librpma i will explain the functions of 405 00:19:55,120 --> 00:19:59,600 some librpma apis 406 00:19:57,360 --> 00:20:03,039 for memory management 407 00:19:59,600 --> 00:20:03,919 we can register a memory region by 408 00:20:03,039 --> 00:20:07,039 rpma_mr_reg 409 00:20:03,919 --> 00:20:08,720 and deregister a memory region by 410 00:20:07,039 --> 00:20:11,200 rpma_mr_dereg 411 00:20:08,720 --> 00:20:14,400 for connection management 412 00:20:11,200 --> 00:20:18,960 we can create a new outgoing connection 413 00:20:14,400 --> 00:20:21,600 request by rpma_conn_req_new
and get 414 00:20:18,960 --> 00:20:24,720 an incoming connection request by 415 00:20:21,600 --> 00:20:28,159 rpma_ep_next_conn_req 416 00:20:24,720 --> 00:20:31,600 for messaging we can send data to the remote 417 00:20:28,159 --> 00:20:36,000 side by rpma_send and receive data 418 00:20:31,600 --> 00:20:38,880 from the remote side by rpma_recv 419 00:20:36,000 --> 00:20:41,919 for remote pmem access 420 00:20:38,880 --> 00:20:45,360 we can write data to remote pmem by 421 00:20:41,919 --> 00:20:49,520 rpma_write flush data into remote 422 00:20:45,360 --> 00:20:52,320 pmem by rpma_flush and write an 8-byte 423 00:20:49,520 --> 00:20:55,840 value to remote pmem atomically by 424 00:20:52,320 --> 00:20:55,840 rpma_write_atomic 425 00:20:56,799 --> 00:21:03,039 in the librpma community there are 11 426 00:21:00,240 --> 00:21:05,840 examples to show how to use various 427 00:21:03,039 --> 00:21:08,559 librpma apis together 428 00:21:05,840 --> 00:21:10,320 let's look at the example 05 429 00:21:08,559 --> 00:21:13,840 flush-to-persistent 430 00:21:10,320 --> 00:21:13,840 for details 431 00:21:14,159 --> 00:21:20,799 look at the following example 432 00:21:17,120 --> 00:21:24,000 the client uses dram to register a memory 433 00:21:20,799 --> 00:21:27,120 region by rpma_mr_reg 434 00:21:24,000 --> 00:21:30,799 the server uses pmem to register a memory 435 00:21:27,120 --> 00:21:32,320 region by rpma_mr_reg 436 00:21:30,799 --> 00:21:34,320 the client 437 00:21:32,320 --> 00:21:37,360 and the server 438 00:21:34,320 --> 00:21:40,559 establish a connection and transfer 439 00:21:37,360 --> 00:21:42,080 private data by several rpma_conn 440 00:21:40,559 --> 00:21:45,360 functions 441 00:21:42,080 --> 00:21:48,320 with the connection the client can transfer 442 00:21:45,360 --> 00:21:51,600 data to remote pmem by rpma_write 443 00:21:48,320 --> 00:21:51,600 and rpma_flush 444 00:21:52,320 --> 00:21:58,480 currently flush and atomic 
write are not 445 00:21:55,200 --> 00:22:01,440 supported by rdma so how about librpma's 446 00:21:58,480 --> 00:22:01,440 implementation 447 00:22:01,679 --> 00:22:08,559 how to implement the rpma_flush operation 448 00:22:05,360 --> 00:22:11,120 librpma implemented rpma_flush by 449 00:22:08,559 --> 00:22:13,440 traditional rdma read 450 00:22:11,120 --> 00:22:16,880 look at this figure 451 00:22:13,440 --> 00:22:18,960 the requester sends an rdma read as 452 00:22:16,880 --> 00:22:21,679 an rpma_flush 453 00:22:18,960 --> 00:22:25,360 the rdma read waits for the completion of 454 00:22:21,679 --> 00:22:28,159 previous writes automatically 455 00:22:25,360 --> 00:22:30,960 the rdma read flushes all written data 456 00:22:28,159 --> 00:22:34,320 from rnic to the remote pmem before 457 00:22:30,960 --> 00:22:36,640 reading data from the remote pmem 458 00:22:34,320 --> 00:22:40,600 this way is called the appliance 459 00:22:36,640 --> 00:22:40,600 persistency method 460 00:22:41,440 --> 00:22:45,520 how to implement the rpma atomic write 461 00:22:44,000 --> 00:22:48,640 operation 462 00:22:45,520 --> 00:22:51,440 librpma implemented the rpma atomic 463 00:22:48,640 --> 00:22:56,320 write by traditional rdma write with an 464 00:22:51,440 --> 00:23:00,400 aligned 8-byte value look at this figure 465 00:22:56,320 --> 00:23:04,080 the requester sends an rdma write with an aligned 8-byte 466 00:23:00,400 --> 00:23:06,400 value and a fence flag as an rpma 467 00:23:04,080 --> 00:23:09,360 atomic write 468 00:23:06,400 --> 00:23:11,840 the rdma write waits for the 469 00:23:09,360 --> 00:23:16,000 completion of the previous flush by the 470 00:23:11,840 --> 00:23:16,000 fence flag and then writes the value 471 00:23:16,559 --> 00:23:21,960 the rdma write needs to be flushed to 472 00:23:19,200 --> 00:23:25,200 remote pmem as well 473 00:23:21,960 --> 00:23:29,840 unfortunately the rdma write has to wait for 474 00:23:25,200 --> 00:23:29,840 all previous reads due to the fence flag 475 00:23:30,559 
--> 00:23:36,159 one more necessary consideration for 476 00:23:33,039 --> 00:23:39,440 the rpma_flush operation 477 00:23:36,159 --> 00:23:42,159 intel ddio is a key feature introduced on 478 00:23:39,440 --> 00:23:46,000 the intel xeon e5 processor and the 479 00:23:42,159 --> 00:23:49,279 intel xeon e7 processor v2 480 00:23:46,000 --> 00:23:51,520 as the intel document mentions ddio 481 00:23:49,279 --> 00:23:53,919 makes the processor cache 482 00:23:51,520 --> 00:23:57,360 the primary destination and source 483 00:23:53,919 --> 00:24:00,559 of io data rather than main memory 484 00:23:57,360 --> 00:24:01,679 helping to deliver increased 485 00:24:00,559 --> 00:24:04,320 bandwidth 486 00:24:01,679 --> 00:24:05,840 lower latency and reduced power 487 00:24:04,320 --> 00:24:09,679 consumption 488 00:24:05,840 --> 00:24:13,440 with ddio the traditional rpma_flush using 489 00:24:09,679 --> 00:24:16,000 rdma read can only flush data to the 490 00:24:13,440 --> 00:24:18,720 last level cache of the cpu 491 00:24:16,000 --> 00:24:22,400 so remote applications need to flush the 492 00:24:18,720 --> 00:24:27,440 data to pmem by themselves 493 00:24:22,400 --> 00:24:27,440 rpma_flush has to consider ddio 494 00:24:28,080 --> 00:24:32,880 how to implement the rpma_flush operation 495 00:24:30,960 --> 00:24:34,960 with ddio 496 00:24:32,880 --> 00:24:38,159 in this case 497 00:24:34,960 --> 00:24:43,200 librpma implemented rpma_flush by 498 00:24:38,159 --> 00:24:43,200 traditional rdma send and rdma receive 499 00:24:43,440 --> 00:24:48,240 look at this figure 500 00:24:45,120 --> 00:24:50,320 the requester passes the address and length to 501 00:24:48,240 --> 00:24:54,159 flush to the 502 00:24:50,320 --> 00:24:56,960 responder by rdma send 503 00:24:54,159 --> 00:24:59,840 the rdma send waits for the completion of 504 00:24:56,960 --> 00:25:02,799 previous writes automatically and then 505 00:24:59,840 --> 00:25:04,799 flushes the written data from rnic to the 506 
00:25:02,799 --> 00:25:07,360 llc 507 00:25:04,799 --> 00:25:10,320 the responder flushes all written data from 508 00:25:07,360 --> 00:25:12,880 llc to pmem according to the 509 00:25:10,320 --> 00:25:15,919 contents received 510 00:25:12,880 --> 00:25:18,640 the responder notifies the requester that the 511 00:25:15,919 --> 00:25:22,159 data has been written into pmem by 512 00:25:18,640 --> 00:25:22,159 another rdma send 513 00:25:23,279 --> 00:25:28,880 librpma is an upper layer library so we 514 00:25:26,720 --> 00:25:32,240 would like to know how the performance 515 00:25:28,880 --> 00:25:32,240 of librpma is 516 00:25:32,480 --> 00:25:37,919 how to evaluate the performance of 517 00:25:34,960 --> 00:25:41,600 librpma librpma introduced the 518 00:25:37,919 --> 00:25:43,760 librpma dedicated engine to fio so that we 519 00:25:41,600 --> 00:25:44,640 can use fio to do the performance 520 00:25:43,760 --> 00:25:46,400 test 521 00:25:44,640 --> 00:25:49,279 in our environment 522 00:25:46,400 --> 00:25:51,039 fio is a benchmark to test io 523 00:25:49,279 --> 00:25:53,360 performance 524 00:25:51,039 --> 00:25:56,559 the table on the left shows the common 525 00:25:53,360 --> 00:25:59,440 configuration of our environment the 526 00:25:56,559 --> 00:26:02,080 example on the right shows the detailed 527 00:25:59,440 --> 00:26:03,279 steps to run the fio benchmark in our 528 00:26:02,080 --> 00:26:04,720 environment 529 00:26:03,279 --> 00:26:08,320 for example 530 00:26:04,720 --> 00:26:11,279 the client builds the latest fio including 531 00:26:08,320 --> 00:26:14,000 the librpma engine 532 00:26:11,279 --> 00:26:17,919 it references sample jobs to create a 533 00:26:14,000 --> 00:26:20,799 new job file for the librpma client 534 00:26:17,919 --> 00:26:23,520 and then runs fio with the client's 535 00:26:20,799 --> 00:26:23,520 job file 536 00:26:23,840 --> 00:26:29,760 the server needs to do similar steps 537 00:26:27,200 --> 00:26:29,760 as 
well 538 00:26:31,600 --> 00:26:36,960 with the fio benchmark we got the bandwidth 539 00:26:34,720 --> 00:26:40,400 and the latency of remote pmem 540 00:26:36,960 --> 00:26:42,880 access based on librpma and the 541 00:26:40,400 --> 00:26:45,200 results of local pmem access as 542 00:26:42,880 --> 00:26:48,400 shown in the tables below 543 00:26:45,200 --> 00:26:50,640 compared with local pmem access 544 00:26:48,400 --> 00:26:54,000 the performance of remote pmem 545 00:26:50,640 --> 00:26:57,039 access is slightly worse 546 00:26:54,000 --> 00:26:59,520 but i think librpma is still a good 547 00:26:57,039 --> 00:27:02,159 solution to access remote pmem 548 00:26:59,520 --> 00:27:05,559 and it may provide higher performance 549 00:27:02,159 --> 00:27:05,559 in the future 550 00:27:06,559 --> 00:27:10,960 okay finally 551 00:27:08,640 --> 00:27:13,600 i will give a conclusion and show our 552 00:27:10,960 --> 00:27:13,600 future work 553 00:27:13,919 --> 00:27:20,080 in this presentation i explained why 554 00:27:17,600 --> 00:27:24,000 pmem with rdma and showed the new 555 00:27:20,080 --> 00:27:27,520 specification of rdma for remote pmem 556 00:27:24,000 --> 00:27:31,360 i also showed how to implement new rdma 557 00:27:27,520 --> 00:27:34,559 operations on soft roce and libibverbs 558 00:27:31,360 --> 00:27:37,120 and introduced librpma 559 00:27:34,559 --> 00:27:40,480 in the future we will finish 560 00:27:37,120 --> 00:27:42,880 implementing new rdma operations on soft 561 00:27:40,480 --> 00:27:45,919 roce and libibverbs 562 00:27:42,880 --> 00:27:48,240 and then push them into the kernel and 563 00:27:45,919 --> 00:27:50,799 rdma-core 564 00:27:48,240 --> 00:27:54,480 we will also make librpma support 565 00:27:50,799 --> 00:27:54,480 the new rdma operations 566 00:27:55,360 --> 00:27:59,279 thank you for listening to my 567 00:27:57,360 --> 00:28:02,240 presentation 568 00:27:59,279 --> 00:28:05,440 please contact me
by email if you have 569 00:28:02,240 --> 00:28:09,000 any questions about these slides 570 00:28:05,440 --> 00:28:09,000 thanks a lot