Welcome, everybody, to this afternoon's session. I'm on the wrong page... whoops... I'm thoroughly on the wrong page, I apologize. I do apologize, but I won't waste any more of your time, so I would like everybody to make Christopher feel welcome. Thank you.

Thank you. I'm from Trustworthy Systems at the University of New South Wales. Today I'm going to be talking about Ceph and DRBD, two distributed storage systems. I'm going to be comparing them, talking about what I liked and didn't like, and what their strengths and weaknesses are.

At Trustworthy Systems we provide networked home directories to our members: you can log into any machine on the network and you'll see the same home directory, with all the changes you've made to it. The way we accomplish that is using a combination of something called DRBD, and NFS. The file system that those home directories live in has to be stored somewhere; DRBD handles that task, and it's better if that file system can be stored in multiple places at once. DRBD does the work of storing the file system on multiple servers at a time and keeping it consistent across those different servers, and then we use NFS, the network file system protocol, to export those home directories over the network so they can be mounted on whatever machine someone's using.

So let's go into some more detail on that. DRBD stands for Distributed Replicated Block Device. A block device is what DRBD is designed to work on: it deals in block devices. Replicated means that the contents of that block device are going to be copied identically to several places. Distributed means that that copying happens across several machines on a network.
So multiple machines each have a block device, those block devices on separate machines have the same content, and DRBD handles making sure that stays the same. Then we mount that file system on one of the machines and use NFS to export it.

This setup works pretty well, but it does have a couple of weaknesses. One of them is that NFS and DRBD don't know about each other. DRBD has no idea that the block device it's replicating is actually being served over the network to other machines; higher up, NFS doesn't know that the file system it's exporting is actually replicated across multiple physical servers. That means if you ever want to change the configuration, there's more work you've got to do; there are more places you have to make the changes.

Also, because they don't know about each other, a DRBD resource can only be in primary mode on one or two hosts at a time. Primary mode is the mode where you're allowed to read and write the data on that block device. The easiest way to deal with that is usually to put the DRBD resource in primary mode on only one of the servers and leave it in secondary on the others, so it can't be read from or written to on the other servers, and then run your NFS daemon that's doing the exports on the same server where the resource is primary.

If that server goes down, however, DRBD is not going to automatically put the resource into primary mode on another server, and you're not going to get an NFS daemon automatically spinning up on another server. Even if they did do that, there's no guarantee that their efforts to recover would work together, because again, they don't know about each other. There are tools you can use to try to automate this and get them to work together, but that's also adding an extra layer of complexity on top.
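As an illustration of the kind of glue those tools end up encoding, here is a toy sketch (not from the talk) of a script that promotes DRBD and starts NFS when the local node needs to take over. The resource name r0, the device node, the mount point and the service name are all placeholders, and a real deployment would use a proper cluster manager such as Pacemaker rather than a script like this.

```python
#!/usr/bin/env python3
"""Toy failover glue: promote DRBD and start NFS if this node should take over.

Only an illustration of the coordination problem described above; 'r0',
'/dev/drbd0', '/srv/home' and 'nfs-kernel-server' are placeholder names.
"""
import subprocess

RESOURCE = "r0"            # DRBD resource name (placeholder)
DEVICE = "/dev/drbd0"      # device node (placeholder, depends on your config)
MOUNTPOINT = "/srv/home"   # where the replicated filesystem gets mounted

def local_role() -> str:
    # 'drbdadm role <res>' reports the resource role, e.g. 'Primary' or 'Secondary'.
    out = subprocess.run(["drbdadm", "role", RESOURCE],
                         capture_output=True, text=True, check=True).stdout
    return out.split("/")[0].strip()

def take_over() -> None:
    # Promote the resource, mount the filesystem, and start the NFS server.
    subprocess.run(["drbdadm", "primary", RESOURCE], check=True)
    subprocess.run(["mount", DEVICE, MOUNTPOINT], check=True)
    subprocess.run(["systemctl", "start", "nfs-kernel-server"], check=True)

if __name__ == "__main__":
    if local_role() != "Primary":
        take_over()
```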
Another issue with this setup is that DRBD doesn't have the greatest conflict resolution support. If two servers end up concluding that the contents of the block device should be different, if they disagree, you often end up having to just manually issue the command to re-synchronize the data on the device, and that can take hours to happen.

So we've been thinking about alternatives for how we store our home directories, and an option that we landed on is called Ceph. Ceph is a distributed storage system which is designed to be able to scale to really large deployments. It supports three different ways of mounting file systems over the network, and it dynamically redistributes where the data is stored on different hosts. It's really intended to be an all-in-one solution, and I like that; I think it's nice to have a system where the components are designed to know about each other and work together. But it's also a bit more complex than DRBD is, so to determine whether or not this would actually be a good choice to replace our existing NFS-and-DRBD setup, I created test deployments of each of these two setups so that I could compare them.

The first point of comparison we come to is how easy these things are to set up from scratch. How much work does it take to get DRBD and NFS going? How much work does it take to get Ceph going?

DRBD has a big advantage here. DRBD is a fairly simple technology; there are not that many moving parts to it. DRBD consists of two main components: there is some user space software that you use to configure the resource and give it instructions, and there is a kernel module which does most of the work of replicating the data and communicating with other servers.
The steps for getting the NFS and DRBD deployment going look something like this. First, you need to download the user space configuration software and you need to load the kernel module. You then have to write a configuration file to describe what your resource is going to look like, and you start it running. You need to decide where you're going to have the resource in primary mode, and then you can create a file system on the resource and mount it. Once it's mounted on one of your servers, you can start up an NFS daemon and tell NFS: here's the file system, export this thing over the network.

When I was setting this up, the NFS part was pretty straightforward; I didn't run into any problems there. Setting up a DRBD resource had a couple more steps to it, but it was mostly not much of a problem. DRBD's parent company, LINBIT, provides a user manual which tells you just about everything you need to know to get DRBD going. There was one substantial problem I ran into with DRBD setup, and that was that the kernel module and the user space software can be on different versions. I wanted to be using DRBD version 9, because that allows replication to three or more hosts at a time; DRBD version 8 and before can only replicate between two hosts. Now, the latest Debian packages for the user space software were on DRBD version 9, but the version 9 kernel module hasn't been integrated into the upstream Linux kernel yet. So for a while I was trying to send commands to a kernel module that didn't understand them, and the format of the configuration file changed a bit between the versions as well, so it had no idea what I was telling it to do. Fortunately that wasn't too hard to solve: LINBIT also provides source code for the kernel module, so I could just download and compile the version 9 kernel module, load that, and from that point DRBD pretty much worked.
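For concreteness, here is a rough sketch of that bring-up sequence as a script, run on the node chosen to be primary. The resource name r0, the device node, the ext4 file system, the mount point, the export options and the Debian-style NFS service name are assumptions for illustration, not details given in the talk.

```python
#!/usr/bin/env python3
"""Rough sketch of the DRBD + NFS bring-up sequence described above.

Assumptions (not from the talk): a resource named 'r0' already described in
/etc/drbd.d/, the device node /dev/drbd0, an ext4 filesystem, a mount point
of /srv/home, and Debian's nfs-kernel-server package.
"""
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("drbdadm", "create-md", "r0")            # initialise DRBD metadata for the resource
run("drbdadm", "up", "r0")                   # attach the device and connect to the peers
run("drbdadm", "primary", "--force", "r0")   # promote this node for the initial sync
run("mkfs.ext4", "/dev/drbd0")               # put a file system on the replicated device
run("mount", "/dev/drbd0", "/srv/home")

# Export the mounted file system over NFS.
with open("/etc/exports", "a") as f:
    f.write("/srv/home  *(rw,sync,no_subtree_check)\n")
run("exportfs", "-ra")
run("systemctl", "start", "nfs-kernel-server")
```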
So, on to Ceph. Ceph has a lot more moving parts than DRBD does. A Ceph cluster is made up of many different daemons, all working together and doing different tasks. These are the four most prominent kinds of Ceph daemons. There are the object storage daemons, or OSDs; each OSD is responsible for managing one unit of storage, such as one hard disk, one drive. There are monitors; their job is to make sure that all the other daemons are performing their tasks correctly and to track where the data is. Manager daemons watch the cluster's health and performance. And metadata servers store metadata for the file systems that your Ceph cluster is managing, to make them easier and faster to access. But these four are just the most prominent ones; there are a bunch of other daemons as well in a Ceph cluster.

Ceph has so many moving parts to it that it comes with its own configuration utility, called cephadm, to help you set up the cluster. If you're using cephadm, then making your Ceph cluster will look something like this. One, you tell cephadm to bootstrap the cluster, which means it's going to create a single monitor, a single manager, and some configuration and authentication information. That bootstrapping happens on one machine; you then have to connect other hosts up to it so that the cluster can start spreading. Then you take all the devices you want to actually use for storing things and create an OSD on top of each of them. Then you tell the Ceph cluster to create a file system, and it'll use those OSDs to get the storage it needs to put that file system on. Finally, a client needs to be authenticated to the Ceph cluster before it's allowed to access the cluster and mount the file systems that are in it.
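A sketch of that cephadm flow might look roughly like the following. The monitor IP, hostnames, device paths, file system name and client name are placeholders, and the exact commands and flags can vary between Ceph releases, so treat this as an outline of the steps rather than a deployment script.

```python
#!/usr/bin/env python3
"""Outline of the cephadm bring-up flow described above (placeholder names)."""
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Bootstrap: creates one monitor, one manager, and the initial config and keys.
run("cephadm", "bootstrap", "--mon-ip", "10.0.0.1")

# 2. Connect further hosts so the cluster can spread.
for host in ["storage2", "storage3"]:
    run("ceph", "orch", "host", "add", host)

# 3. Create an OSD on each device you want to use for storage.
for host, dev in [("storage1", "/dev/sdb"), ("storage2", "/dev/sdb"), ("storage3", "/dev/sdb")]:
    run("ceph", "orch", "daemon", "add", "osd", f"{host}:{dev}")

# 4. Create a CephFS file system; Ceph places it on the available OSDs.
run("ceph", "fs", "volume", "create", "homefs")

# 5. Authorise a client so it is allowed to mount that file system.
run("ceph", "fs", "authorize", "homefs", "client.homes", "/", "rw")
```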
Now, Ceph has documentation, and quite a lot of it. There is a documentation website. You can also look at man pages for the ceph commands, and you can often run a ceph command and stick "-h" on the end and it'll show you the further ways you can extend that command.

One pretty substantial challenge I've run into with understanding Ceph is that it doesn't feel like any of these three sources is really comprehensive. I found things in each of them that don't seem to appear in the others. You kind of have to check all three to be sure that you've seen what your options are, and even then you can come across a situation where you've got, say, two different commands that seem like they're supposed to do the same thing, but in fact they might not.

Ceph's documentation is good at giving a high-level overview of how a cluster works; the four key types of Ceph daemons I listed earlier are an example of information that is very easy to get out of Ceph's documentation. There are also a lot of places in the documentation where you are told: in order to perform this action, run this command. Low-level instructions. But I find the documentation kind of struggles with giving mid-level explanations: what are my options, why would I want to do this rather than that, what does this really mean?

For instance, step three of setting up a Ceph cluster was to create OSDs. These are two different commands. Both of these commands appear on Ceph's documentation website, on different pages. Both of these commands will create an OSD for you in your cluster on top of a particular logical volume. Something that the documentation does not make clear is that these two commands differ in a significant way. The first command will create that OSD under the management of cephadm, the tool which is explicitly designed to help you with deploying all the daemons in a Ceph cluster. The second command will not create the OSD under the management of cephadm; cephadm will be able to detect that an OSD is there, but won't be able to do anything with it until you explicitly adopt it.
This is the sort of thing that you can run into when you're trying to understand what's going on inside Ceph's documentation.

On simplicity of setup, DRBD is definitely further ahead than Ceph. That's to be expected, though: DRBD is a simpler tool, it does not do as much, and it does not support as many options. It's not surprising that it's less work to set the thing up.

So what about features? What do these tools support, how easy are they to use, what options can they give us?

Well, as I just mentioned, DRBD is a pretty simple technology. All it's really doing is replicating the data on this block device. DRBD doesn't have any higher-level options on top of that: it can't, for instance, put a file system on the block device, and it can't do any kind of cluster awareness for you. You have to create your own file system on top of the block device. Once you have DRBD replicating the thing, you can view the status of your DRBD resource from the command line: you can see where it is primary, where it is secondary, where it is up to date. This was taken while a DRBD resource was synchronizing to the two other servers. But that's about all the information you can get; there's just not that much high-level stuff going on in the operations DRBD is performing.

In contrast, Ceph's feature set is probably my favourite part of it. I mentioned earlier that Ceph dynamically redistributes data. It's not like DRBD, which is essentially copying the exact contents of this block device onto another block device on another server. Ceph will actually decide: this OSD is this size, that OSD is that size, I'm going to give this one more data because it can hold more. It will move where the data is replicated based on what the storage options are, so it's really quite easy to add and remove OSDs from your cluster as you gain more storage or as you lose storage options.
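As a toy illustration of that idea (my own, not from the talk): placement that is pseudo-random but weighted by capacity, so larger OSDs receive proportionally more objects. Ceph's actual placement uses the CRUSH algorithm, which also deals with failure domains and rebalancing and is far more sophisticated than this sketch.

```python
import hashlib
import random

# Toy weight-proportional placement: bigger OSDs get proportionally more objects.
osd_capacity_tb = {"osd.0": 4, "osd.1": 4, "osd.2": 8, "osd.3": 12}

def place(object_name: str, replicas: int = 3) -> list:
    # Seed a PRNG from the object name so placement is deterministic for that object,
    # then pick distinct OSDs with probability proportional to their capacity.
    seed = int.from_bytes(hashlib.sha256(object_name.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    chosen = []
    candidates = dict(osd_capacity_tb)
    for _ in range(replicas):
        names, weights = zip(*candidates.items())
        pick = rng.choices(names, weights=weights)[0]
        chosen.append(pick)
        del candidates[pick]   # no two replicas on the same OSD
    return chosen

if __name__ == "__main__":
    for obj in ["home/alice/notes.txt", "home/bob/thesis.tex"]:
        print(obj, "->", place(obj))
```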
If you have bootstrapped your Ceph cluster using cephadm, then it will come with a graphical dashboard that you can reach over the web. I mentioned that Ceph managers watch the cluster's health and performance; you can see a lot of that information here. You can also perform a lot of the setup tasks for a Ceph cluster from this dashboard: you can add hosts to your cluster from here, and you can deploy services from here. You are supposed to be able to create OSDs from here too; I haven't managed it, because of an issue with how Ceph detects whether a device is empty or not.

One of my favourite things about Ceph, from a system administrator's perspective, is how automatic it is. I've had times when I have taken down one of the hosts in a Ceph cluster, then brought the host back up and rejoined it, and initially the cluster status up in the top left will say health warning, bad data. But in a few minutes, without me having to issue any commands, Ceph will have cleaned up the problem and the cluster will be back to full health again. Ceph is intentionally designed to be self-healing, so that it will find and try to fix problems for you. When that same kind of thing happens with DRBD, if I take a host down, bring it back up and try to reconnect it into the cluster, I will often need to tell it to re-synchronize the data on the drive, and that can take hours to happen.

So far I've been talking about qualitative things: how these tools feel, what sort of options I have with them. Let's get some numbers involved: how do they perform?
I mentioned there are three different ways you can mount a Ceph file system over the network. There is a kernel driver you can use for mounting Ceph file systems, you can mount them in user space with FUSE, and you can export a Ceph file system over NFS, similar to what we were doing with DRBD, although it's all handled inside Ceph. So when we include DRBD, that's four different ways we can mount a file system to run tests on.

For testing performance I had three main strategies. The first was a program called Postmark. The way Postmark works is that it creates a couple of hundred files and then performs creates, deletes and reads on those files. It completes a set total workload and then finishes; the faster you can get that workload done, the better your file system is performing.

nfsbench forks itself to create several processes. Each process creates one file, and that process is the only one that will work on that file. Each process will then open its file, write data into the file, close the file, and repeat that many times over. Again, there is a set total workload; the sooner you can get through that workload, the better your file system is performing.

And the last thing I did: I took the nfsbench program and modified it so that rather than completing a set workload, it would just run continuously for about ten minutes. Then I mounted a file system on several different clients at a time, ran that long-running nfsbench program on all of them at once, and watched CPU usage on the server and client machines while that was happening.
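Here is a rough sketch of that per-process open/write/close workload, for illustration only; this is not the actual nfsbench code. The four workers and 400 opens per worker match figures mentioned later in the talk, while the block size is a placeholder.

```python
#!/usr/bin/env python3
"""Rough sketch of an nfsbench-style workload (not the actual benchmark code).

Each worker process owns exactly one file and repeatedly opens it, appends a
block, and closes it. Run it inside the mounted file system under test.
"""
import multiprocessing
import os
import time

WORKERS = 4      # one worker per CPU, as described later in the talk
OPENS = 400      # open/write/close rounds per worker, a figure mentioned in the talk
BLOCK = 8192     # bytes written per round (placeholder)

def worker(index: int) -> None:
    path = f"bench-{index}.dat"
    for _ in range(OPENS):
        with open(path, "ab") as f:
            f.write(os.urandom(BLOCK))
            f.flush()
            os.fsync(f.fileno())   # commit this worker's file (the talk later contrasts fsync with sync)
    os.unlink(path)

if __name__ == "__main__":
    start = time.monotonic()
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f"completed in {time.monotonic() - start:.1f}s")  # shorter is better
```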
Actually, I'm going to stay on that slide for now. If you don't really understand a system well, there are a lot of ways you can end up measuring the wrong thing, especially with something like this, distributed file systems. There are a lot of components that go into the end results you get when you try to test their performance. You've got to think about what the capabilities of the clients are, what kind of network connection the clients have to the servers, what the capabilities of the servers are, how the servers talk to each other, and what sort of algorithms they use to decide where the data ends up. If you don't really know what you're doing, it's easy to end up measuring the performance of some bizarre configuration that you shouldn't be using in production anyway. That happened to me several times over.

First, I originally created my OSDs so that they were storing all of their data on hard disks. Well, the Ceph documentation recommends that, if you can, you should put the write-ahead log for the OSD on something faster, so you don't have to wait for the slower hard disk as often.

When I made that change, I then had a look at nfsbench. I noticed nfsbench was using the sync system call, so I tried running it using the fsync system call instead. Both of those system calls are basically designed to commit changes to the underlying file system or storage, and I saw a huge difference in performance. Had I found something Ceph was optimized for? Nope. It turns out the difference is that fsync commits one file at a time, while sync commits the entire file system every time you call it. It was the difference between each process committing its own file and each process committing every file; Ceph was just doing quadratically less work because I changed which system call I was using.
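Seen from the calling side, the difference is that fsync takes a file descriptor and commits only that file, while sync asks the kernel to flush everything. A minimal sketch of the two variants, my own illustration rather than nfsbench itself:

```python
import os

# fsync: commit one specific file's data to stable storage.
def write_and_fsync(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # flushes only this file

# sync: ask the kernel to flush *all* dirty data, on every file system.
def write_and_sync(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
    os.sync()                  # flushes everything, not just `path`

# With N processes each writing its own file, the fsync variant commits N files
# per round in total, while in the sync variant each of the N processes ends up
# committing all N files, so the flushing work grows roughly with N squared.
```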
And then, when I was running the CPU usage experiments, there were some cases where I found that the clients were spending almost 100 percent of their time waiting for I/O, but the servers were virtually completely idle. Somehow the clients were hitting a huge bottleneck and the servers had nothing to do. It turns out that was because the switch connecting the clients to the servers just couldn't keep up with all the traffic. The problem had nothing to do with Ceph or DRBD; the problem was that the network wasn't fast enough.

So, having been through those experiences, here are the results I ended up with on my last run of these tests.

These are the results for Postmark. On the y-axis we've got how long the experiment took to run; remember, a faster completion is better. On the bottom, on the horizontal, we've got how big the experiments were. The yellow bar is a Ceph file system mounted using NFS, green is a Ceph file system mounted using the kernel driver, blue is Ceph mounted using FUSE, and red is DRBD mounted over NFS. I'm going to be maintaining that colour scheme for the next couple of slides.

So on Postmark we're seeing the Ceph kernel driver doing quite well for itself, and DRBD and Ceph over NFS are actually performing extremely similarly. I was hoping, coming into this, that Ceph would just visibly outperform DRBD, but performing quite similarly is still pretty good when you consider that Ceph is rather more complex and does a lot more for you. Ceph FUSE, however, is having some issues. I actually had to cut off the top of this graph so that we could see what was going on with the other mount types; Ceph FUSE is not happy. The Ceph kernel driver and Ceph over NFS are the ones we're particularly interested in: the Ceph kernel driver seems to be performing very well, and Ceph over NFS is useful because the tools that allow you to do the Ceph-specific file system mounts are not available on every platform. We want to be exporting over NFS where we can, because that's something we know virtually every machine is going to be able to accept.
So it's good to see that Ceph is not doing worse than DRBD over NFS.

Now, nfsbench has a lot more parameters than Postmark does for varying the size of the experiment. Again, on the y-axis we've got how long the experiment took; shorter is better. This is specifically with each process opening and writing into its file 400 times, and on the bottom we've got how many blocks of data were written and the minimum size of the file being written into. A kind of interesting thing I noticed here: the number of blocks written and the minimum file size aren't really changing anything; we're getting pretty similar levels of performance as we vary those. The only parameter that I found visibly made a difference with nfsbench was the number of opens. When I doubled the number of opens, the time it took to complete the experiment was a lot longer. That's not remotely surprising, though; the number of opens is roughly the total workload you've got to do, so if you double it you can expect roughly double the duration to get the experiment done.

Okay, so for the CPU usage experiment, remember the way this works: I've got ten client machines, the file system is mounted on all ten at once, and then I'm running nfsbench on each of those machines at the same time and watching the CPU usage on the clients and on the servers. Incidentally, I was running nfsbench so that it would fork itself into four worker processes per instance, because the machines I was running on had four CPUs; so all the CPUs have things to do, but they're not overloaded with loads of tasks.

So here we have, when mounted using Ceph FUSE, what percentage of total CPU time was spent waiting for I/O.
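The talk doesn't say which tool recorded these numbers; as an illustration, one way to sample I/O-wait and idle percentages on Linux is to read the aggregate CPU counters from /proc/stat at intervals (the fields are user, nice, system, idle, iowait, and so on):

```python
#!/usr/bin/env python3
"""Sample CPU I/O-wait and idle percentages from /proc/stat on Linux.

Just one way to record the kind of numbers shown in these graphs; the talk
does not say which tool was actually used.
"""
import time

def read_cpu_counters():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def sample(interval: float = 5.0):
    prev = read_cpu_counters()
    while True:
        time.sleep(interval)
        cur = read_cpu_counters()
        delta = [c - p for c, p in zip(cur, prev)]
        total = sum(delta) or 1
        idle_pct = 100.0 * delta[3] / total     # 4th field: idle
        iowait_pct = 100.0 * delta[4] / total   # 5th field: iowait
        print(f"iowait {iowait_pct:5.1f}%   idle {idle_pct:5.1f}%")
        prev = cur

if __name__ == "__main__":
    sample()
```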
The y-axis is percentages; notice it's not going up to 100, I had to shrink it so that we could see what was going on here. Not much I/O waiting is happening; these are pretty small numbers, really. There are a lot of lines on this graph, because I have ten clients plus three servers. The black and the grey lines are the servers, and all the coloured lines are clients; I'm going to keep that convention for the rest of these graphs. Along the bottom is simply the time as the experiment progressed; the experiment is designed to run for about ten minutes, so we're seeing these experiments take between 600 and 700 seconds.

So yeah, not very much I/O waiting is happening here. We can see one of the servers, the light grey, is getting a little more I/O wait than the others, but the numbers are so small that that's not really a big conclusion. For idle time, what percentage of time the CPU spent idle, we can see the black and the greys are below the colours: the servers are spending a little less time idle. That's not really weird; the servers have to service everything from all the clients, while each client is only doing its own work.

Now, under the Ceph kernel driver, we have a much more interesting picture. The clients are spending a whole lot of time on I/O wait under the Ceph kernel driver. Something about the kernel driver is really inefficient in its usage of the CPU, or at least in the time it takes for I/O to happen. The servers, however, are not having to do much I/O. It's a little hard to see, because there are so many colours going everywhere, but if you look at the bottom you can see the grey lines and the black line are still down the bottom. The servers are not having a hard time; it's the clients that are having to wait a lot. Idle time looks like it's just about a mirror of the I/O wait time: again, the servers are at the top, they are mostly idle, and the clients are the ones spending big chunks of time not idle, but presumably that's mostly I/O wait.
DRBD mounted over NFS looks a lot more consistent than the kernel driver does. DRBD is holding at around 50-ish percent of CPU time waiting for I/O. It's a lot, but it's not going everywhere; it's pretty predictably the same thing. You can see one of the servers here is spending a lot more time on I/O than the others. If you remember the way DRBD works, we can only have it in primary mode on one server at a time, so only one of the servers is actually exporting over NFS; that's the one getting longer I/O wait times than the other two servers, and that's why one of the servers shows a higher I/O wait than the others. And again, idle time looks pretty much like a mirror image.

I did also record things like the percentage of CPU time spent in user space and in kernel space, as opposed to idle and I/O wait, and so forth, but those numbers were so consistently small across everything that I didn't bother showing them. It's the I/O wait and the idle time where we get interesting-looking graphs.

And for Ceph mounted over NFS, interestingly, like DRBD mounted over NFS, it's holding around the 50 percent region; it's just more jagged. Ceph seems a bit less consistent in the time spent on I/O wait than DRBD is, but it is holding around the same area, and I think that commonality is related to the fact that both of these are exporting over NFS. The idle time is again basically a mirror of the I/O wait time.

So what do I think of these things in the end? Well, DRBD is definitely simpler to form your mental model of. If you need to spend time understanding how something works before you can start using it, DRBD is going to take you less time to understand what the pieces are and how they fit together; there aren't as many moving parts. It's probably going to be easier to set up from scratch.
You may need to compile the kernel module if you want DRBD version 9, but that's the most major snag you're likely to run into. Of course, that simplicity comes at the expense that DRBD does not do nearly as much. It will replicate the contents of this block device, but it has a hard time recovering from conflicts, and it doesn't really support any additional features on top of that.

Ceph is much more automatic. It does a lot of things for you without you even having to issue commands; it is very capable of fixing up issues with the consistency of data, for instance. But it has a much more complex structure: to make a Ceph cluster work, there are a lot of daemons that all need to exist and communicate with each other. For the most part, you don't need to actually set up all those daemons yourself. You might have noticed, when I was talking about setting up a Ceph cluster from scratch, that I didn't say anything about creating extra monitors or managers or metadata servers. You don't have to do that; cephadm will spin those up as necessary when your cluster gets bigger. But you will, for instance, probably need to create your own OSDs.

Ceph also probably needs your configuration in order to get the best out of it. You'll need to spend a bit of time changing the setup from what it is out of the box, like I had to realize that just putting an OSD on top of a hard disk on its own is not getting the best out of that OSD; I should also give it a faster drive for its write-ahead log, that sort of thing.

So I think DRBD is going to be easier to start out with. If you've got a small deployment you're working with, DRBD might be a good idea; it's going to have less initial investment to get it going.
But as the size of your deployment gets bigger, as the complexity of what you want to do gets higher, Ceph is going to become more and more valuable, because it is so capable of dealing with different situations, different machines with different storage options, and rebalancing and redistributing things as what it's asked to deal with changes.

That's all I've got to say.

Thank you, Christopher. Have you got any questions for him?

Just a couple of implementation questions around NFS. First, the NFS that gets exported from Ceph: is that just the normal Linux kernel NFS engine, or does it have its own NFS engine? And another question: you mentioned exporting over NFS because the clients would need it, as opposed to native connectivity to Ceph. What clients are supported natively?

Okay, so, Ceph over NFS, let's do that first. The Ceph NFS export you set up inside the Ceph cluster, so there's a place you go, and you can do this from the dashboard: you say I want an NFS export, and then Ceph will create some daemons that do that. So Ceph does the NFS from inside the cluster.

As for what platforms Ceph is available on, I'm not sure what the full list is. I've been doing all this on Debian, so I know for a fact it works on Debian. I don't believe, for instance, that Ceph works on Macs; that is, I don't believe you can mount things as a Ceph client on a Mac, and some of the people in our organization use Macs, so we need to be able to set things up so that that will still work.
The full list of what Ceph supports, I'm not sure; I've basically learned what I needed to get this going.

Hi, thanks for the talk. I wanted to ask about the original use case. I think you kind of already answered that it's within your organization, but you are on the UNSW campus, and what are the clients that are accessing these home directories? Yeah, the use case and the clients that are actually using it.

So the people who are accessing the home directories are members of Trustworthy Systems. We are attached to UNSW, but we have our own servers and we manage our own little infrastructure for our research group. So our home directories are not part of the general UNSW Computer Science and Engineering department.

What's the actual reason for that, other than if someone's using the same...?

Potentially, yes. In some cases there might be someone wanting to do some kind of tests on a particular machine that's not their work machine or the laptop they bring in with them. There can be situations like that, and I've found it is useful to have these home directories that mount on everything, to move stuff around sometimes.

So, to give you a simple scenario: a directory that has, say, a million files in it, which might hit one OSD, or a very, very large file which lots of processes are hitting, which again could have an impact on metadata performance. Have you done any testing like that?

No, the tests I've done are the ones I've talked about here.
889 00:35:00,180 --> 00:35:04,560 uh I nowadays spend a lot more time 890 00:35:03,180 --> 00:35:06,540 on ceph though more on the block side 891 00:35:04,560 --> 00:35:09,780 but in my past web hosting life I did a 892 00:35:06,540 --> 00:35:11,640 lot of drbd and NFS in the late 2000s and 893 00:35:09,780 --> 00:35:13,500 we were using what was then heartbeat now 894 00:35:11,640 --> 00:35:14,460 Pacemaker to do the failover stuff 895 00:35:13,500 --> 00:35:15,839 I was curious whether you were actually 896 00:35:14,460 --> 00:35:17,700 leveraging that or left it out or whether 897 00:35:15,839 --> 00:35:19,320 you weren't leveraging that to handle 898 00:35:17,700 --> 00:35:21,839 the you know moving stuff around and 899 00:35:19,320 --> 00:35:25,040 yeah our current setup does 900 00:35:21,839 --> 00:35:27,060 not use Pacemaker I've been working on 901 00:35:25,040 --> 00:35:30,060 redesigning our setup and I'm using 902 00:35:27,060 --> 00:35:32,760 Pacemaker for that 903 00:35:30,060 --> 00:35:36,300 but for our home directory storage 904 00:35:32,760 --> 00:35:38,339 I'm using ceph because I like 905 00:35:36,300 --> 00:35:39,960 the self-healing that's one of my 906 00:35:38,339 --> 00:35:41,480 favorite things about it it fixes 907 00:35:39,960 --> 00:35:44,820 problems for you 908 00:35:41,480 --> 00:35:46,800 and when it does that 909 00:35:44,820 --> 00:35:50,240 I found it seems to do it a lot faster 910 00:35:46,800 --> 00:35:50,240 than drbd does as well
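[As a loose illustration of the self-healing being described, the sketch below polls ceph status and reports any placement groups that are not active+clean, which is one way to watch recovery progress after an OSD or host failure; this is not from the talk, and the JSON field names (pgmap, pgs_by_state) are assumptions based on recent Ceph releases, so treat them as such.]

    #!/usr/bin/env python3
    # Rough sketch: poll `ceph status --format json` and print how many
    # placement groups are in each non-clean state, to watch the cluster
    # heal after a failure. Field names may differ between Ceph releases,
    # so everything is accessed defensively.
    import json
    import subprocess
    import time

    def pg_states():
        out = subprocess.run(
            ["ceph", "status", "--format", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        pgmap = json.loads(out).get("pgmap", {})
        return {s.get("state_name", "?"): s.get("count", 0)
                for s in pgmap.get("pgs_by_state", [])}

    if __name__ == "__main__":
        while True:
            unhealthy = {state: count for state, count in pg_states().items()
                         if state != "active+clean"}
            print("all PGs active+clean" if not unhealthy
                  else f"recovering: {unhealthy}")
            time.sleep(5)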
911 00:35:50,579 --> 00:35:56,640 thanks so one thing that I've seen quite 912 00:35:52,920 --> 00:36:00,660 often with NFS mounts is a lack of 913 00:35:56,640 --> 00:36:02,280 transport encryption so transit being 914 00:36:00,660 --> 00:36:05,640 plain text 915 00:36:02,280 --> 00:36:07,800 um how does ceph compare 916 00:36:05,640 --> 00:36:11,400 in that regard out of the box 917 00:36:07,800 --> 00:36:15,079 capabilities I don't think ceph does any 918 00:36:11,400 --> 00:36:15,079 encryption on the data it sends either 919 00:36:17,099 --> 00:36:20,700 thank you I noticed you talked about the 920 00:36:19,200 --> 00:36:22,200 networking side of it and how your 921 00:36:20,700 --> 00:36:25,079 switch was a bottleneck at one point 922 00:36:22,200 --> 00:36:27,000 have you then decided to leverage a 923 00:36:25,079 --> 00:36:30,960 separate switch for the ceph clustering 924 00:36:27,000 --> 00:36:33,540 backplane so the switch that I was 925 00:36:30,960 --> 00:36:36,420 running those tests over is not a switch 926 00:36:33,540 --> 00:36:38,640 that's responsible for running 927 00:36:36,420 --> 00:36:40,740 the main networks that 928 00:36:38,640 --> 00:36:43,140 everyone's working on this switch was 929 00:36:40,740 --> 00:36:44,579 connecting to a set of machines that we 930 00:36:43,140 --> 00:36:46,980 generally use specifically for 931 00:36:44,579 --> 00:36:49,260 benchmarking things they're useful 932 00:36:46,980 --> 00:36:51,480 because they're all basically the same 933 00:36:49,260 --> 00:36:52,800 and since we're using them for 934 00:36:51,480 --> 00:36:55,140 benchmarking there's not other things 935 00:36:52,800 --> 00:36:56,700 going on on them at the same time so 936 00:36:55,140 --> 00:36:59,640 the switch that was causing the 937 00:36:56,700 --> 00:37:01,020 bottleneck there is not a switch that's 938 00:36:59,640 --> 00:37:03,839 really going to be coming into play in 939 00:37:01,020 --> 00:37:04,980 the day-to-day use of ceph although 940 00:37:03,839 --> 00:37:07,260 of course we have replaced that with a 941 00:37:04,980 --> 00:37:08,220 faster switch for the benchmarks I 942 00:37:07,260 --> 00:37:09,480 showed here 943 00:37:08,220 --> 00:37:11,960 so that we could get rid of the 944 00:37:09,480 --> 00:37:11,960 bottleneck 945 00:37:12,720 --> 00:37:17,960 anything else right now 946 00:37:15,359 --> 00:37:17,960 you are going to 947 00:37:18,240 --> 00:37:22,079 handle as well and I'll take it you'll 948 00:37:19,920 --> 00:37:23,760 be around for the rest of the day yep 949 00:37:22,079 --> 00:37:25,440 lovely in the meantime thank you so much 950 00:37:23,760 --> 00:37:27,900 Christopher there's a little gift from 951 00:37:25,440 --> 00:37:30,079 us for you thank you for your time thank 952 00:37:27,900 --> 00:37:30,079 you