Welcome, everybody, to this afternoon's session. I'm on the wrong page... whoops... I'm thoroughly on the wrong page, I apologize. I do apologize, but I won't waste any more of your time, so I would like everybody to make Christopher feel welcome. Thank you.

Thank you. I'm from Trustworthy Systems at the University of New South Wales. Today I'm going to be talking about Ceph and DRBD, two distributed storage systems. I'm going to be comparing them, talking about what I liked and didn't like, and what their strengths and weaknesses are.

At Trustworthy Systems we provide networked home directories to our members: you can log into any machine on the network and you'll see the same home directory, with all the changes you've made to it. The way we accomplish that is using a combination of something called DRBD, and NFS. The file system that those home directories live in has to be stored somewhere; DRBD handles that task, and it's better if that file system can be stored in multiple places at once. DRBD does the work of storing the file system on multiple servers at a time and keeping it consistent across those different servers, and then we use NFS, the network file system protocol, to export those home directories over the network so they can be mounted on whatever machine someone's using.

So let's go into some more detail on that. DRBD stands for Distributed Replicated Block Device. A block device is what DRBD is designed to work on: it deals in block devices. Replicated means that the contents of that block device are going to be copied identically to several places. Distributed means that that copying happens across several machines on a network.
So multiple machines each have a block device, those block devices on separate machines have the same content, and DRBD handles making sure that stays the same. Then we mount that file system on one of the machines and use NFS to export it.

This setup works pretty well, but it does have a couple of weaknesses. One of them is that NFS and DRBD don't know about each other. DRBD has no idea that the block device it's replicating is actually being served over the network to other machines; higher up, NFS doesn't know that the file system it's exporting is actually replicated across multiple physical servers. That means if you ever want to change the configuration, there's more work you've got to do; there are more places you have to make the changes.

Also, because they don't know about each other, a DRBD resource can only be in primary mode on one or two hosts at a time. Primary mode is the mode where you're allowed to read and write the data on that block device. The easiest way to deal with that is usually to put the DRBD resource in primary mode on only one of the servers and leave it in secondary on the others, so it can't be read from or written to on the other servers, and then run your NFS daemon that's doing the exports on the same server where the resource is primary.

If that server goes down, however, DRBD is not going to automatically put the resource into primary mode on another server, and you're not going to get an NFS daemon automatically spinning up on another server. Even if they did do that, there's no guarantee that their efforts to recover would work together, because again, they don't know about each other. There are tools you can use to try to automate this and get them to work together, but that's also adding an extra layer of complexity on top.
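As an illustration of the kind of glue those tools end up encoding, here is a toy sketch (not from the talk) of a script that promotes DRBD and starts NFS when the local node needs to take over. The resource name r0, the device node, the mount point and the service name are all placeholders, and a real deployment would use a proper cluster manager such as Pacemaker rather than a script like this.

```python
#!/usr/bin/env python3
"""Toy failover glue: promote DRBD and start NFS if this node should take over.

Only an illustration of the coordination problem described above; 'r0',
'/dev/drbd0', '/srv/home' and 'nfs-kernel-server' are placeholder names.
"""
import subprocess

RESOURCE = "r0"            # DRBD resource name (placeholder)
DEVICE = "/dev/drbd0"      # device node (placeholder, depends on your config)
MOUNTPOINT = "/srv/home"   # where the replicated filesystem gets mounted

def local_role() -> str:
    # 'drbdadm role <res>' reports the resource role, e.g. 'Primary' or 'Secondary'.
    out = subprocess.run(["drbdadm", "role", RESOURCE],
                         capture_output=True, text=True, check=True).stdout
    return out.split("/")[0].strip()

def take_over() -> None:
    # Promote the resource, mount the filesystem, and start the NFS server.
    subprocess.run(["drbdadm", "primary", RESOURCE], check=True)
    subprocess.run(["mount", DEVICE, MOUNTPOINT], check=True)
    subprocess.run(["systemctl", "start", "nfs-kernel-server"], check=True)

if __name__ == "__main__":
    if local_role() != "Primary":
        take_over()
```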
Another issue with this setup is that DRBD doesn't have the greatest conflict resolution support. If two servers end up concluding that the contents of the block device should be different, if they disagree, you often end up having to just manually issue the command to re-synchronize the data on the device, and that can take hours to happen.

So we've been thinking about alternatives for how we store our home directories, and an option that we landed on is called Ceph. Ceph is a distributed storage system which is designed to be able to scale to really large deployments. It supports three different ways of mounting file systems over the network, and it dynamically redistributes where the data is stored on different hosts. It's really intended to be an all-in-one solution, and I like that; I think it's nice to have a system where the components are designed to know about each other and work together. But it's also a bit more complex than DRBD is, so to determine whether or not this would actually be a good choice to replace our existing NFS-and-DRBD setup, I created test deployments of each of these two setups so that I could compare them.

The first point of comparison we come to is how easy these things are to set up from scratch. How much work does it take to get DRBD and NFS going? How much work does it take to get Ceph going?

DRBD has a big advantage here. DRBD is a fairly simple technology; there are not that many moving parts to it. DRBD consists of two main components: there is some user space software that you use to configure the resource and give it instructions, and there is a kernel module which does most of the work of replicating the data and communicating with other servers.
The steps for getting the NFS and DRBD deployment going look something like this. First, you need to download the user space configuration software and you need to load the kernel module. You then have to write a configuration file to describe what your resource is going to look like, and you start it running. You need to decide where you're going to have the resource in primary mode, and then you can create a file system on the resource and mount it. Once it's mounted on one of your servers, you can start up an NFS daemon and tell NFS: here's the file system, export this thing over the network.

When I was setting this up, the NFS part was pretty straightforward; I didn't run into any problems there. Setting up a DRBD resource had a couple more steps to it, but it was mostly not much of a problem. DRBD's parent company, LINBIT, provides a user manual which tells you just about everything you need to know to get DRBD going. There was one substantial problem I ran into with DRBD setup, and that was that the kernel module and the user space software can be on different versions. I wanted to be using DRBD version 9, because that allows replication to three or more hosts at a time; DRBD version 8 and before can only replicate between two hosts. Now, the latest Debian packages for the user space software were on DRBD version 9, but the version 9 kernel module hasn't been integrated into the upstream Linux kernel yet. So for a while I was trying to send commands to a kernel module that didn't understand them, and the format of the configuration file changed a bit between the versions as well, so it had no idea what I was telling it to do. Fortunately that wasn't too hard to solve: LINBIT also provides source code for the kernel module, so I could just download and compile the version 9 kernel module, load that, and from that point DRBD pretty much worked.
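For concreteness, here is a rough sketch of that bring-up sequence as a script, run on the node chosen to be primary. The resource name r0, the device node, the ext4 file system, the mount point, the export options and the Debian-style NFS service name are assumptions for illustration, not details given in the talk.

```python
#!/usr/bin/env python3
"""Rough sketch of the DRBD + NFS bring-up sequence described above.

Assumptions (not from the talk): a resource named 'r0' already described in
/etc/drbd.d/, the device node /dev/drbd0, an ext4 filesystem, a mount point
of /srv/home, and Debian's nfs-kernel-server package.
"""
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("drbdadm", "create-md", "r0")            # initialise DRBD metadata for the resource
run("drbdadm", "up", "r0")                   # attach the device and connect to the peers
run("drbdadm", "primary", "--force", "r0")   # promote this node for the initial sync
run("mkfs.ext4", "/dev/drbd0")               # put a file system on the replicated device
run("mount", "/dev/drbd0", "/srv/home")

# Export the mounted file system over NFS.
with open("/etc/exports", "a") as f:
    f.write("/srv/home  *(rw,sync,no_subtree_check)\n")
run("exportfs", "-ra")
run("systemctl", "start", "nfs-kernel-server")
```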
So, on to Ceph. Ceph has a lot more moving parts than DRBD does. A Ceph cluster is made up of many different daemons, all working together and doing different tasks. These are the four most prominent kinds of Ceph daemons. There are the object storage daemons, or OSDs; each OSD is responsible for managing one unit of storage, such as one hard disk, one drive. There are monitors; their job is to make sure that all the other daemons are performing their tasks correctly and to track where the data is. Manager daemons watch the cluster's health and performance. And metadata servers store metadata for the file systems that your Ceph cluster is managing, to make them easier and faster to access. But these four are just the most prominent ones; there are a bunch of other daemons as well in a Ceph cluster.

Ceph has so many moving parts to it that it comes with its own configuration utility, called cephadm, to help you set up the cluster. If you're using cephadm, then making your Ceph cluster will look something like this. One, you tell cephadm to bootstrap the cluster, which means it's going to create a single monitor, a single manager, and some configuration and authentication information. That bootstrapping happens on one machine; you then have to connect other hosts up to it so that the cluster can start spreading. Then you take all the devices you want to actually use for storing things and create an OSD on top of each of them. Then you tell the Ceph cluster to create a file system, and it'll use those OSDs to get the storage it needs to put that file system on. Finally, a client needs to be authenticated to the Ceph cluster before it's allowed to access the cluster and mount the file systems that are in it.
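A sketch of that cephadm flow might look roughly like the following. The monitor IP, hostnames, device paths, file system name and client name are placeholders, and the exact commands and flags can vary between Ceph releases, so treat this as an outline of the steps rather than a deployment script.

```python
#!/usr/bin/env python3
"""Outline of the cephadm bring-up flow described above (placeholder names)."""
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Bootstrap: creates one monitor, one manager, and the initial config and keys.
run("cephadm", "bootstrap", "--mon-ip", "10.0.0.1")

# 2. Connect further hosts so the cluster can spread.
for host in ["storage2", "storage3"]:
    run("ceph", "orch", "host", "add", host)

# 3. Create an OSD on each device you want to use for storage.
for host, dev in [("storage1", "/dev/sdb"), ("storage2", "/dev/sdb"), ("storage3", "/dev/sdb")]:
    run("ceph", "orch", "daemon", "add", "osd", f"{host}:{dev}")

# 4. Create a CephFS file system; Ceph places it on the available OSDs.
run("ceph", "fs", "volume", "create", "homefs")

# 5. Authorise a client so it is allowed to mount that file system.
run("ceph", "fs", "authorize", "homefs", "client.homes", "/", "rw")
```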
Now, Ceph has documentation, and quite a lot of it. There is a documentation website. You can also look at man pages for the ceph commands, and you can often run a ceph command and stick "-h" on the end and it'll show you the further ways you can extend that command.

One pretty substantial challenge I've run into with understanding Ceph is that it doesn't feel like any of these three sources is really comprehensive. I found things in each of them that don't seem to appear in the others. You kind of have to check all three to be sure that you've seen what your options are, and even then you can come across a situation where you've got, say, two different commands that seem like they're supposed to do the same thing, but in fact they might not.

Ceph's documentation is good at giving a high-level overview of how a cluster works; the four key types of Ceph daemons I listed earlier are an example of information that is very easy to get out of Ceph's documentation. There are also a lot of places in the documentation where you are told: in order to perform this action, run this command. Low-level instructions. But I find the documentation kind of struggles with giving mid-level explanations: what are my options, why would I want to do this rather than that, what does this really mean?

For instance, step three of setting up a Ceph cluster was to create OSDs. These are two different commands. Both of these commands appear on Ceph's documentation website, on different pages. Both of these commands will create an OSD for you in your cluster on top of a particular logical volume. Something that the documentation does not make clear is that these two commands differ in a significant way. The first command will create that OSD under the management of cephadm, the tool which is explicitly designed to help you with deploying all the daemons in a Ceph cluster. The second command will not create the OSD under the management of cephadm; cephadm will be able to detect that an OSD is there, but won't be able to do anything with it until you explicitly adopt it.
This is the sort of thing that you can run into when you're trying to understand what's going on inside Ceph's documentation.

On simplicity of setup, DRBD is definitely further ahead than Ceph. That's to be expected, though: DRBD is a simpler tool, it does not do as much, and it does not support as many options. It's not surprising that it's less work to set the thing up.

So what about features? What do these tools support, how easy are they to use, what options can they give us?

Well, as I just mentioned, DRBD is a pretty simple technology. All it's really doing is replicating the data on this block device. DRBD doesn't have any higher-level options on top of that: it can't, for instance, put a file system on the block device, and it can't do any kind of cluster awareness for you. You have to create your own file system on top of the block device. Once you have DRBD replicating the thing, you can view the status of your DRBD resource from the command line: you can see where it is primary, where it is secondary, where it is up to date. This was taken while a DRBD resource was synchronizing to the two other servers. But that's about all the information you can get; there's just not that much high-level stuff going on in the operations DRBD is performing.

In contrast, Ceph's feature set is probably my favourite part of it. I mentioned earlier that Ceph dynamically redistributes data. It's not like DRBD, which is essentially copying the exact contents of this block device onto another block device on another server. Ceph will actually decide: this OSD is this size, that OSD is that size, I'm going to give this one more data because it can hold more. It will move where the data is replicated based on what the storage options are, so it's really quite easy to add and remove OSDs from your cluster as you gain more storage or as you lose storage options.
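As a toy illustration of that idea (my own, not from the talk): placement that is pseudo-random but weighted by capacity, so larger OSDs receive proportionally more objects. Ceph's actual placement uses the CRUSH algorithm, which also deals with failure domains and rebalancing and is far more sophisticated than this sketch.

```python
import hashlib
import random

# Toy weight-proportional placement: bigger OSDs get proportionally more objects.
osd_capacity_tb = {"osd.0": 4, "osd.1": 4, "osd.2": 8, "osd.3": 12}

def place(object_name: str, replicas: int = 3) -> list:
    # Seed a PRNG from the object name so placement is deterministic for that object,
    # then pick distinct OSDs with probability proportional to their capacity.
    seed = int.from_bytes(hashlib.sha256(object_name.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    chosen = []
    candidates = dict(osd_capacity_tb)
    for _ in range(replicas):
        names, weights = zip(*candidates.items())
        pick = rng.choices(names, weights=weights)[0]
        chosen.append(pick)
        del candidates[pick]   # no two replicas on the same OSD
    return chosen

if __name__ == "__main__":
    for obj in ["home/alice/notes.txt", "home/bob/thesis.tex"]:
        print(obj, "->", place(obj))
```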
If you have bootstrapped your Ceph cluster using cephadm, then it will come with a graphical dashboard that you can reach over the web. I mentioned that Ceph managers watch the cluster's health and performance; you can see a lot of that information here. You can also perform a lot of the setup tasks for a Ceph cluster from this dashboard: you can add hosts to your cluster from here, and you can deploy services from here. You are supposed to be able to create OSDs from here too; I haven't managed it, because of an issue with how Ceph detects whether a device is empty or not.

One of my favourite things about Ceph, from a system administrator's perspective, is how automatic it is. I've had times when I have taken down one of the hosts in a Ceph cluster, then brought the host back up and rejoined it, and initially the cluster status up in the top left will say health warning, bad data. But in a few minutes, without me having to issue any commands, Ceph will have cleaned up the problem and the cluster will be back to full health again. Ceph is intentionally designed to be self-healing, so that it will find and try to fix problems for you. When that same kind of thing happens with DRBD, if I take a host down, bring it back up and try to reconnect it into the cluster, I will often need to tell it to re-synchronize the data on the drive, and that can take hours to happen.

So far I've been talking about qualitative things: how these tools feel, what sort of options I have with them. Let's get some numbers involved: how do they perform?
I mentioned there are three different ways you can mount a Ceph file system over the network. There is a kernel driver you can use for mounting Ceph file systems, you can mount them in user space with FUSE, and you can export a Ceph file system over NFS, similar to what we were doing with DRBD, although it's all handled inside Ceph. So when we include DRBD, that's four different ways we can mount a file system to run tests on.

For testing performance I had three main strategies. The first was a program called Postmark. The way Postmark works is that it creates a couple of hundred files and then performs creates, deletes and reads on those files. It completes a set total workload and then finishes; the faster you can get that workload done, the better your file system is performing.

nfsbench forks itself to create several processes. Each process creates one file, and that process is the only one that will work on that file. Each process will then open its file, write data into the file, close the file, and repeat that many times over. Again, there is a set total workload; the sooner you can get through that workload, the better your file system is performing.

And the last thing I did: I took the nfsbench program and modified it so that rather than completing a set workload, it would just run continuously for about ten minutes. Then I mounted a file system on several different clients at a time, ran that long-running nfsbench program on all of them at once, and watched CPU usage on the server and client machines while that was happening.
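Here is a rough sketch of that per-process open/write/close workload, for illustration only; this is not the actual nfsbench code. The four workers and 400 opens per worker match figures mentioned later in the talk, while the block size is a placeholder.

```python
#!/usr/bin/env python3
"""Rough sketch of an nfsbench-style workload (not the actual benchmark code).

Each worker process owns exactly one file and repeatedly opens it, appends a
block, and closes it. Run it inside the mounted file system under test.
"""
import multiprocessing
import os
import time

WORKERS = 4      # one worker per CPU, as described later in the talk
OPENS = 400      # open/write/close rounds per worker, a figure mentioned in the talk
BLOCK = 8192     # bytes written per round (placeholder)

def worker(index: int) -> None:
    path = f"bench-{index}.dat"
    for _ in range(OPENS):
        with open(path, "ab") as f:
            f.write(os.urandom(BLOCK))
            f.flush()
            os.fsync(f.fileno())   # commit this worker's file (the talk later contrasts fsync with sync)
    os.unlink(path)

if __name__ == "__main__":
    start = time.monotonic()
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f"completed in {time.monotonic() - start:.1f}s")  # shorter is better
```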
Actually, I'm going to stay on that slide for now. If you don't really understand a system well, there are a lot of ways you can end up measuring the wrong thing, especially with something like this, distributed file systems. There are a lot of components that go into the end results you get when you try to test their performance. You've got to think about what the capabilities of the clients are, what kind of network connection the clients have to the servers, what the capabilities of the servers are, how the servers talk to each other, and what sort of algorithms they use to decide where the data ends up. If you don't really know what you're doing, it's easy to end up measuring the performance of some bizarre configuration that you shouldn't be using in production anyway. That happened to me several times over.

First, I originally created my OSDs so that they were storing all of their data on hard disks. Well, the Ceph documentation recommends that, if you can, you should put the write-ahead log for the OSD on something faster, so you don't have to wait for the slower hard disk as often.

When I made that change, I then had a look at nfsbench. I noticed nfsbench was using the sync system call, so I tried running it using the fsync system call instead. Both of those system calls are basically designed to commit changes to the underlying file system or storage, and I saw a huge difference in performance. Had I found something Ceph was optimized for? Nope. It turns out the difference is that fsync commits one file at a time, while sync commits the entire file system every time you call it. It was the difference between each process committing its own file and each process committing every file; Ceph was just doing quadratically less work because I changed which system call I was using.
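Seen from the calling side, the difference is that fsync takes a file descriptor and commits only that file, while sync asks the kernel to flush everything. A minimal sketch of the two variants, my own illustration rather than nfsbench itself:

```python
import os

# fsync: commit one specific file's data to stable storage.
def write_and_fsync(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # flushes only this file

# sync: ask the kernel to flush *all* dirty data, on every file system.
def write_and_sync(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
    os.sync()                  # flushes everything, not just `path`

# With N processes each writing its own file, the fsync variant commits N files
# per round in total, while in the sync variant each of the N processes ends up
# committing all N files, so the flushing work grows roughly with N squared.
```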
And then, when I was running the CPU usage experiments, there were some cases where I found that the clients were spending almost 100 percent of their time waiting for I/O, but the servers were virtually completely idle. Somehow the clients were hitting a huge bottleneck and the servers had nothing to do. It turns out that was because the switch connecting the clients to the servers just couldn't keep up with all the traffic. The problem had nothing to do with Ceph or DRBD; the problem was that the network wasn't fast enough.

So, having been through those experiences, here are the results I ended up with on my last run of these tests.

These are the results for Postmark. On the y-axis we've got how long the experiment took to run; remember, a faster completion is better. On the bottom, on the horizontal, we've got how big the experiments were. The yellow bar is a Ceph file system mounted using NFS, green is a Ceph file system mounted using the kernel driver, blue is Ceph mounted using FUSE, and red is DRBD mounted over NFS. I'm going to be maintaining that colour scheme for the next couple of slides.

So on Postmark we're seeing the Ceph kernel driver doing quite well for itself, and DRBD and Ceph over NFS are actually performing extremely similarly. I was hoping, coming into this, that Ceph would just visibly outperform DRBD, but performing quite similarly is still pretty good when you consider that Ceph is rather more complex and does a lot more for you. Ceph FUSE, however, is having some issues. I actually had to cut off the top of this graph so that we could see what was going on with the other mount types; Ceph FUSE is not happy. The Ceph kernel driver and Ceph over NFS are the ones we're particularly interested in: the Ceph kernel driver seems to be performing very well, and Ceph over NFS is useful because the tools that allow you to do the Ceph-specific file system mounts are not available on every platform. We want to be exporting over NFS where we can, because that's something we know virtually every machine is going to be able to accept.
So it's good to see that Ceph is not doing worse than DRBD over NFS.

Now, nfsbench has a lot more parameters than Postmark does for varying the size of the experiment. Again, on the y-axis we've got how long the experiment took; shorter is better. This is specifically with each process opening and writing into its file 400 times, and on the bottom we've got how many blocks of data were written and the minimum size of the file being written into. A kind of interesting thing I noticed here: the number of blocks written and the minimum file size aren't really changing anything; we're getting pretty similar levels of performance as we vary those. The only parameter that I found visibly made a difference with nfsbench was the number of opens. When I doubled the number of opens, the time it took to complete the experiment was a lot longer. That's not remotely surprising, though; the number of opens is roughly the total workload you've got to do, so if you double it you can expect roughly double the duration to get the experiment done.

Okay, so for the CPU usage experiment, remember the way this works: I've got ten client machines, the file system is mounted on all ten at once, and then I'm running nfsbench on each of those machines at the same time and watching the CPU usage on the clients and on the servers. Incidentally, I was running nfsbench so that it would fork itself into four worker processes per instance, because the machines I was running on had four CPUs; so all the CPUs have things to do, but they're not overloaded with loads of tasks.

So here we have, when mounted using Ceph FUSE, what percentage of total CPU time was spent waiting for I/O.
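The talk doesn't say which tool recorded these numbers; as an illustration, one way to sample I/O-wait and idle percentages on Linux is to read the aggregate CPU counters from /proc/stat at intervals (the fields are user, nice, system, idle, iowait, and so on):

```python
#!/usr/bin/env python3
"""Sample CPU I/O-wait and idle percentages from /proc/stat on Linux.

Just one way to record the kind of numbers shown in these graphs; the talk
does not say which tool was actually used.
"""
import time

def read_cpu_counters():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def sample(interval: float = 5.0):
    prev = read_cpu_counters()
    while True:
        time.sleep(interval)
        cur = read_cpu_counters()
        delta = [c - p for c, p in zip(cur, prev)]
        total = sum(delta) or 1
        idle_pct = 100.0 * delta[3] / total     # 4th field: idle
        iowait_pct = 100.0 * delta[4] / total   # 5th field: iowait
        print(f"iowait {iowait_pct:5.1f}%   idle {idle_pct:5.1f}%")
        prev = cur

if __name__ == "__main__":
    sample()
```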
The y-axis is percentages; notice it's not going up to 100, I had to shrink it so that we could see what was going on here. Not much I/O waiting is happening; these are pretty small numbers, really. There are a lot of lines on this graph, because I have ten clients plus three servers. The black and the grey lines are the servers, and all the coloured lines are clients; I'm going to keep that convention for the rest of these graphs. Along the bottom is simply the time as the experiment progressed; the experiment is designed to run for about ten minutes, so we're seeing these experiments take between 600 and 700 seconds.

So yeah, not very much I/O waiting is happening here. We can see one of the servers, the light grey, is getting a little more I/O wait than the others, but the numbers are so small that that's not really a big conclusion. For idle time, what percentage of time the CPU spent idle, we can see the black and the greys are below the colours: the servers are spending a little less time idle. That's not really weird; the servers have to service everything from all the clients, while each client is only doing its own work.

Now, under the Ceph kernel driver, we have a much more interesting picture. The clients are spending a whole lot of time on I/O wait under the Ceph kernel driver. Something about the kernel driver is really inefficient in its usage of the CPU, or at least in the time it takes for I/O to happen. The servers, however, are not having to do much I/O. It's a little hard to see, because there are so many colours going everywhere, but if you look at the bottom you can see the grey lines and the black line are still down the bottom. The servers are not having a hard time; it's the clients that are having to wait a lot. Idle time looks like it's just about a mirror of the I/O wait time: again, the servers are at the top, they are mostly idle, and the clients are the ones spending big chunks of time not idle, but presumably that's mostly I/O wait.
DRBD mounted over NFS looks a lot more consistent than the kernel driver does. DRBD is holding at around 50-ish percent of CPU time waiting for I/O. It's a lot, but it's not going everywhere; it's pretty predictably the same thing. You can see one of the servers here is spending a lot more time on I/O than the others. If you remember the way DRBD works, we can only have it in primary mode on one server at a time, so only one of the servers is actually exporting over NFS; that's the one getting longer I/O wait times than the other two servers, and that's why one of the servers shows a higher I/O wait than the others. And again, idle time looks pretty much like a mirror image.

I did also record things like the percentage of CPU time spent in user space and in kernel space, as opposed to idle and I/O wait, and so forth, but those numbers were so consistently small across everything that I didn't bother showing them. It's the I/O wait and the idle time where we get interesting-looking graphs.

And for Ceph mounted over NFS, interestingly, like DRBD mounted over NFS, it's holding around the 50 percent region; it's just more jagged. Ceph seems a bit less consistent in the time spent on I/O wait than DRBD is, but it is holding around the same area, and I think that commonality is related to the fact that both of these are exporting over NFS. The idle time is again basically a mirror of the I/O wait time.

So what do I think of these things in the end? Well, DRBD is definitely simpler to form your mental model of. If you need to spend time understanding how something works before you can start using it, DRBD is going to take you less time to understand what the pieces are and how they fit together; there aren't as many moving parts. It's probably going to be easier to set up from scratch.
You may need to compile the kernel module if you want DRBD version 9, but that's the most major snag you're likely to run into. Of course, that simplicity comes at the expense that DRBD does not do nearly as much. It will replicate the contents of this block device, but it has a hard time recovering from conflicts, and it doesn't really support any additional features on top of that.

Ceph is much more automatic. It does a lot of things for you without you even having to issue commands; it is very capable of fixing up issues with the consistency of data, for instance. But it has a much more complex structure: to make a Ceph cluster work, there are a lot of daemons that all need to exist and communicate with each other. For the most part, you don't need to actually set up all those daemons yourself. You might have noticed, when I was talking about setting up a Ceph cluster from scratch, that I didn't say anything about creating extra monitors or managers or metadata servers. You don't have to do that; cephadm will spin those up as necessary when your cluster gets bigger. But you will, for instance, probably need to create your own OSDs.

Ceph also probably needs your configuration in order to get the best out of it. You'll need to spend a bit of time changing the setup from what it is out of the box, like I had to realize that just putting an OSD on top of a hard disk on its own is not getting the best out of that OSD; I should also give it a faster drive for its write-ahead log, that sort of thing.

So I think DRBD is going to be easier to start out with. If you've got a small deployment you're working with, DRBD might be a good idea; it's going to have less initial investment to get it going.
But as the size of your deployment gets bigger, as the complexity of what you want to do gets higher, Ceph is going to become more and more valuable, because it is so capable of dealing with different situations, different machines with different storage options, and rebalancing and redistributing things as what it's asked to deal with changes.

That's all I've got to say.

Thank you, Christopher. Have you got any questions for him?

Just a couple of implementation questions around NFS. First, the NFS that gets exported from Ceph: is that just the normal Linux kernel NFS engine, or does it have its own NFS engine? And another question: you mentioned exporting over NFS because the clients would need it, as opposed to native connectivity to Ceph. What clients are supported natively?

Okay, so, Ceph over NFS, let's do that first. The Ceph NFS export you set up inside the Ceph cluster, so there's a place you go, and you can do this from the dashboard: you say I want an NFS export, and then Ceph will create some daemons that do that. So Ceph does the NFS from inside the cluster.

As for what platforms Ceph is available on, I'm not sure what the full list is. I've been doing all this on Debian, so I know for a fact it works on Debian. I don't believe, for instance, that Ceph works on Macs; that is, I don't believe you can mount things as a Ceph client on a Mac, and some of the people in our organization use Macs, so we need to be able to set things up so that that will still work.
The full list of what Ceph supports, I'm not sure; I've basically learned what I needed to get this going.

Hi, thanks for the talk. I wanted to ask about the original use case. I think you kind of already answered that it's within your organization, but you are on the UNSW campus, and what are the clients that are accessing these home directories? Yeah, the use case and the clients that are actually using it.

So the people who are accessing the home directories are members of Trustworthy Systems. We are attached to UNSW, but we have our own servers and we manage our own little infrastructure for our research group. So our home directories are not part of the general UNSW Computer Science and Engineering department.

What's the actual reason for that, other than if someone's using the same...?

Potentially, yes. In some cases there might be someone wanting to do some kind of tests on a particular machine that's not their work machine or the laptop they bring in with them. There can be situations like that, and I've found it is useful to have these home directories that mount on everything, to move stuff around sometimes.

So, to give you a simple scenario: a directory that has, say, a million files in it, which might hit one OSD, or a very, very large file which lots of processes are hitting, which again could have an impact on metadata performance. Have you done any testing like that?

No, the tests I've done are the ones I've talked about here.
889 00:35:00,180 --> 00:35:04,560 uh I nowadays spend a lot more time 890 00:35:03,180 --> 00:35:06,540 on ceph though more on the block side 891 00:35:04,560 --> 00:35:09,780 but in my past web hosting life I did a 892 00:35:06,540 --> 00:35:11,640 lot of drbd and NFS in the late 2000s and 893 00:35:09,780 --> 00:35:13,500 we were using what was then heartbeat now 894 00:35:11,640 --> 00:35:14,460 Pacemaker to do the failover stuff 895 00:35:13,500 --> 00:35:15,839 I was curious whether you were actually 896 00:35:14,460 --> 00:35:17,700 leveraging that or left it out or whether 897 00:35:15,839 --> 00:35:19,320 you weren't leveraging that to handle 898 00:35:17,700 --> 00:35:21,839 the you know moving stuff around and 899 00:35:19,320 --> 00:35:25,040 yeah our current setup does 900 00:35:21,839 --> 00:35:27,060 not use Pacemaker I've been working on 901 00:35:25,040 --> 00:35:30,060 redesigning our setup and I'm using 902 00:35:27,060 --> 00:35:32,760 Pacemaker for that 903 00:35:30,060 --> 00:35:36,300 but for our home directory storage 904 00:35:32,760 --> 00:35:38,339 I'm using ceph because I like 905 00:35:36,300 --> 00:35:39,960 the self-healing that's one of my 906 00:35:38,339 --> 00:35:41,480 favorite things about it it fixes 907 00:35:39,960 --> 00:35:44,820 problems for you 908 00:35:41,480 --> 00:35:46,800 and when it does that 909 00:35:44,820 --> 00:35:50,240 I found it seems to do it a lot faster 910 00:35:46,800 --> 00:35:50,240 than drbd does as well
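[As a loose illustration of the self-healing being described, the sketch below polls ceph status and reports any placement groups that are not active+clean, which is one way to watch recovery progress after an OSD or host failure; this is not from the talk, and the JSON field names (pgmap, pgs_by_state) are assumptions based on recent Ceph releases, so treat them as such.]

    #!/usr/bin/env python3
    # Rough sketch: poll `ceph status --format json` and print how many
    # placement groups are in each non-clean state, to watch the cluster
    # heal after a failure. Field names may differ between Ceph releases,
    # so everything is accessed defensively.
    import json
    import subprocess
    import time

    def pg_states():
        out = subprocess.run(
            ["ceph", "status", "--format", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        pgmap = json.loads(out).get("pgmap", {})
        return {s.get("state_name", "?"): s.get("count", 0)
                for s in pgmap.get("pgs_by_state", [])}

    if __name__ == "__main__":
        while True:
            unhealthy = {state: count for state, count in pg_states().items()
                         if state != "active+clean"}
            print("all PGs active+clean" if not unhealthy
                  else f"recovering: {unhealthy}")
            time.sleep(5)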
911 00:35:50,579 --> 00:35:56,640 thanks so one thing that I've seen quite 912 00:35:52,920 --> 00:36:00,660 often with NFS mounts is a lack of 913 00:35:56,640 --> 00:36:02,280 transport encryption so transit being 914 00:36:00,660 --> 00:36:05,640 plain text 915 00:36:02,280 --> 00:36:07,800 um how does ceph compare 916 00:36:05,640 --> 00:36:11,400 in that regard out of the box 917 00:36:07,800 --> 00:36:15,079 capabilities I don't think ceph does any 918 00:36:11,400 --> 00:36:15,079 encryption on the data it sends either 919 00:36:17,099 --> 00:36:20,700 thank you I noticed you talked about the 920 00:36:19,200 --> 00:36:22,200 networking side of it and how your 921 00:36:20,700 --> 00:36:25,079 switch was a bottleneck at one point 922 00:36:22,200 --> 00:36:27,000 have you then decided to leverage a 923 00:36:25,079 --> 00:36:30,960 separate switch for the ceph clustering 924 00:36:27,000 --> 00:36:33,540 backplane so the switch that I was 925 00:36:30,960 --> 00:36:36,420 running those tests over is not a switch 926 00:36:33,540 --> 00:36:38,640 that's responsible for running 927 00:36:36,420 --> 00:36:40,740 the main networks that 928 00:36:38,640 --> 00:36:43,140 everyone's working on this switch was 929 00:36:40,740 --> 00:36:44,579 connecting to a set of machines that we 930 00:36:43,140 --> 00:36:46,980 generally use specifically for 931 00:36:44,579 --> 00:36:49,260 benchmarking things they're useful 932 00:36:46,980 --> 00:36:51,480 because they're all basically the same 933 00:36:49,260 --> 00:36:52,800 and since we're using them for 934 00:36:51,480 --> 00:36:55,140 benchmarking there's not other things 935 00:36:52,800 --> 00:36:56,700 going on on them at the same time so 936 00:36:55,140 --> 00:36:59,640 the switch that was causing the 937 00:36:56,700 --> 00:37:01,020 bottleneck there is not a switch that's 938 00:36:59,640 --> 00:37:03,839 really going to be coming into play in 939 00:37:01,020 --> 00:37:04,980 the day-to-day use of ceph although 940 00:37:03,839 --> 00:37:07,260 of course we have replaced that with a 941 00:37:04,980 --> 00:37:08,220 faster switch for the benchmarks I 942 00:37:07,260 --> 00:37:09,480 showed here 943 00:37:08,220 --> 00:37:11,960 so that we could get rid of the 944 00:37:09,480 --> 00:37:11,960 bottleneck 945 00:37:12,720 --> 00:37:17,960 anything else right now 946 00:37:15,359 --> 00:37:17,960 you are going to 947 00:37:18,240 --> 00:37:22,079 handle as well and I'll take it you'll 948 00:37:19,920 --> 00:37:23,760 be around for the rest of the day yep 949 00:37:22,079 --> 00:37:25,440 lovely in the meantime thank you so much 950 00:37:23,760 --> 00:37:27,900 Christopher there's a little gift from 951 00:37:25,440 --> 00:37:30,079 us for you thank you for your time thank 952 00:37:27,900 --> 00:37:30,079 you