Thank you very much, good afternoon. Today I want to talk about some of the tools that I use in solving problems. There are two kinds of problems that I'm going to talk about solving, or at least trying to diagnose on the way to solving; the solving part is usually the hard part, but we'd like to figure out exactly what's going on. The first category of tools I use for this I would put as performance analysis: tools that give us some kind of information about latency, or operations per second, or bytes per second, that kind of thing. The second category is what I sort of call behavior analysis, and that's more about looking at exactly what a program is doing and what steps it is taking, maybe trying to figure out whether, rather than being too slow, it's doing the completely wrong thing.

In my day-to-day role I've been a systems administrator or SRE for more than 15 years, but currently I work in the Ubuntu support team, and we support a wide array of customer problems. It's not just one product; it could be anything that runs on Ubuntu. We spend quite a bit of time on OpenStack and Ceph, which themselves have lots and lots of components, but really someone can open a case about anything, so we don't always have the best tools or the best metrics built in for finding out what we want to know, because the problems are always different. And I need to do this at runtime, in production, without crashing or slowing down the system in the process. So that's what we're going to talk about today, and my hope is that I'll introduce this tool, which you may already be a little bit familiar with, or maybe give you some ideas about what you can do with it to help solve your problems.
In the performance analysis bucket we have some traditional tools that you might be familiar with: things like iostat, top, vmstat, nfsstat, bwm-ng, all those kinds of tools. What most of these tools are doing is very fast and doesn't typically impact system performance, because they're looking at some kind of performance counter that the kernel is already taking care of and counting for us: how many transactions are going through a network card or a disk, or something to do with a process.

These are super great for a top-down view, but they have some limitations. Number one, they make heavy use of averages, and averages are really good at hiding outliers; we'll talk in a minute about why that can be a problem. They also often have a predefined and limited kind of breakdown, so we might be able to see disk statistics for a disk but without correlation to a process, or we may be able to see I/O information for a process but without correlation to which disk that I/O is going to.

Another subcategory of these tools, and top and netstat are really good examples of this, take an instant snapshot of the system every second, and if something happened in, say, 0.1 of a second between the last snapshot and this snapshot, they might miss it completely. In the case of top, if you're launching lots and lots of processes that take a very short time to execute, those may not show up in top at all and you might miss the CPU usage from them. And netstat will show you all the live TCP and UDP connections on a system, but only at the instant you run the tool, and you won't see all the ones that were short-lived in between those executions.
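As a rough illustration of where these numbers come from (this example is mine, not from the talk), most of these tools simply sample counters the kernel already maintains under /proc and print the difference between two samples:

```bash
# Counter files the traditional tools sample; the kernel keeps these up to date anyway
cat /proc/diskstats    # per-disk I/O counters, the data iostat works from
cat /proc/net/dev      # per-interface byte and packet counters
grep ctxt /proc/stat   # e.g. total context switches, one of the counters vmstat reports
```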
So let's look at some examples of how averages can hurt us. This is a tool that I really like to use called iostat, and it gives us a breakdown of I/O performance statistics for each disk on the system. Here we've got multiple disks, some hard drives and an NVMe device, and some columns of data: reads per second and writes per second (that's IOPS), read kilobytes and write kilobytes per second (that's the bandwidth going through the disk), and also the average latency, where r_await and w_await are the average latencies of the read and write operations.

This is good as an overall view, but the latency, and latency can be a really important thing to look at, is an average over all 27,000 of the operations that happened in that second. It could be that all 27,000 of them took 0.03 of a second each. It could also be that almost all of them took 0.01 of a second and one of them took three seconds, and if you've got a disk operation that's taking three seconds, that's really interesting, but you can entirely miss it in tools like this.

If you've been around I/O and storage you might appreciate that, but if it's confusing, one example any kind of computer enthusiast can relate to is CPU usage. In this example of top output I'm running the fio benchmark tool, and I've set it to use only one thread to stress the point. Top is telling us that our CPU usage is only about six percent and 93 percent of our CPU is idle, but at the same time it's telling us that fio is using 100 percent of the CPU, and those numbers don't seem to add up. If you haven't used this tool before it might surprise you: the per-process CPU percentage doesn't top out at 100, it goes up to however many CPUs you have times 100, so if this was using my whole system it would say something like 3200 percent.
But we can ask top to break it down: if you hit the number one key in top, it will show all the CPUs on the system and the utilization of each one, and in this view we can see that one CPU is actually running at full steam, because fio is single-threaded and can only run on a single CPU, while the other seven CPUs are idle. If we only looked at that average, six percent used and 93 percent idle, we might think our CPU is not a problem, when in reality one CPU is the bottleneck and we were missing it because of the average.

These are simple examples, but these kinds of problems show up everywhere, in the latency or utilization of any kind of system resource. Outliers and averages hide things from us all the time, and that ends up making it difficult to solve problems that I hit quite often.
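Roughly how you could reproduce both demos; the exact fio parameters from the talk aren't shown, so the flags here are illustrative:

```bash
# Extended per-disk statistics every second: r/s, w/s, rkB/s, wkB/s, r_await, w_await
iostat -x 1

# A single-threaded fio job, in the spirit of the demo
fio --name=onethread --rw=randread --bs=4k --size=1G --numjobs=1 --time_based --runtime=60

# Watch it in top; press "1" to toggle the per-CPU breakdown instead of the averaged summary
top
```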
Another category of tools, my behavior analysis category, are the ones that, rather than looking at very fast performance counters, try to look at everything a program is doing. strace is probably my favourite: strace will attach to every system call a program makes and report exactly what that system call was and how much data was written; it can even tell you what data was written. But this tool transfers every one of those operations from the other process into user space, to strace, to then print out, and that's very heavyweight. If you're tracing a very small program that might be a small amount of data, but if you're trying to analyse something doing gigabits or gigabytes a second of disk or network, number one it won't be able to keep up with processing that amount of data, and it will also slow down the application.

blktrace is a pretty awesome tool; it's a kernel facility that will dump out every block operation that happens on the whole system: every request queued, when it comes out of the queue, when it goes to the disk, when it comes back. But again, it pushes all of that information out into user space and into a file, and that's very heavyweight.

iptraf looks quite innocent: its first mode will tell you every connection and the number of bytes going through each of those connections, but to do that it's actually packet-dumping the entire network card into iptraf to work out those values. At a few megabits that's not a problem; if you're doing 10 or 20 gigabits a second, that can be a problem. Even plain debug logging, if you've ever turned the debug level in something up to 15 or 20, produces so much output that it usually slows things down.

So those are the kinds of problems we have with these behavior analysis tools, but they can often tell us things that the performance counters can't, because there are only so many performance counters. If none of them covers the situation we're looking at, this detailed output might let us find what we're looking for, at the cost of a big slowdown.

To demonstrate how big that slowdown is, I've got an example case here of running dd, which is just a disk copying tool we've probably all used with at least a Raspberry Pi or something. I'm reading from /dev/zero, which is a stream of null bytes, and I'm writing to /dev/null, so I'm reading a bunch of zeros and throwing them away straight away. To make the example a little contrived and make the point stand out, I'm doing that one byte at a time, not batching it into four kilobytes or a megabyte, and we're going to do it 500,000 times. If I do that with dd on its own, it takes 0.2 of a second, which is pretty fast. If I use strace, and I don't even ask strace to trace anything dd actually does, I say with -e accept that I only want to look at the accept system call, which accepts a connection from the network, and dd doesn't even make network connections, that same task will now take 18 seconds. That's nearly a 100 times slowdown.
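A sketch of that comparison; the exact command lines aren't shown in the transcript, so treat these as an approximation (note that dd treats the k suffix as 1024, so count=500k is 512,000 one-byte copies):

```bash
# Baseline: copy single bytes from /dev/zero to /dev/null
time dd if=/dev/zero of=/dev/null bs=1 count=500k

# The same workload under strace, filtering on a syscall dd never makes;
# the ptrace round trips alone make it dramatically slower
time strace -e trace=accept dd if=/dev/zero of=/dev/null bs=1 count=500k
```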
In a production system that's very busy, that's a penalty we may not be able to incur. On a development or test system it might be reasonable, or on a quiet system; I've used it a lot, but there are cases where it's not going to work for us.

The reason it's 100 times slower is the way strace works: it uses a kernel facility called ptrace, which is the debugger interface, the same interface GDB uses. It's remote-controlling dd, literally like a little remote control. It says to the kernel, via the ptrace syscall, "run this program until the next system call". The kernel says okay, dd runs, and it makes its write system call. It goes into the kernel to do that system call, and the kernel goes "oh, I've got to talk to strace, it wanted to know about this", so dd is paused and doing nothing while control goes over to strace. strace says "cool, we had a system call, I want to know what it was". To find that out it needs the CPU registers in the context of dd, so it has to ask the kernel for the registers; it goes back to the kernel, and the kernel sends them back over the ptrace interface. strace then says "great, now we need to actually run the system call", so control goes back to the kernel, which finishes the work of that write system call, but then we go back to strace again. Why? Well, at the start the registers tell us the parameters to the system call, write this data to this file, this many bytes, but we don't know the result yet, so after it runs we have to fetch the registers again to get the result; and at that point we can no longer see all the parameters that were passed in.
So we have to interrupt it twice, once at the start and once at the end, to get those registers. After it fetches the registers at the end it tells the kernel "okay, you can continue", and the result goes back to the process. That's a context switch from dd to strace through the kernel eight times per system call, and that's why it's 100 times slower. And dd has to wait the whole time: all the time we've spent in strace-land, dd is doing nothing, and that's why it slows down.

So it would be really great if we could write a custom tool to gather performance metrics on anything we wanted. It would also be super great if we could group them in any way we wanted that wasn't predetermined: by device, network card or disk, by which process, by which cgroup, anything we can think of. It would be awesome if we could do that, or even just get the same information we had before but with a much smaller impact on system performance. The good news is we can, and there are a number of ways to do it now. So here are a few quick demos.

These are bpftrace tools, and I'll explain how they work in a minute, but they can do some really interesting things. Here's one called gethostlatency, and it will trace not just one process (strace only attaches to one process at a time) but the entire system, every process. We hook into the gethostbyname call in libc and say: every time that happens, I want to collect it and print out which process was looking up which hostname and how long it took. That latency is one of the cool things we can get from bpftrace that other tools often won't show us: a per-request latency.
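gethostlatency ships with the bcc and bpftrace tool collections; the sketch below is a simplified version of the same idea using uprobes on libc. The libc path and the output formatting are assumptions, so adjust them for your system:

```bash
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname {
  // remember when the lookup started and which name was asked for
  @start[tid] = nsecs;
  @host[tid] = str(arg0);
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname /@start[tid]/ {
  // print process, per-request latency in milliseconds, and the hostname
  printf("%-8d %-16s %6d ms  %s\n", pid, comm,
         (nsecs - @start[tid]) / 1000000, @host[tid]);
  delete(@start[tid]);
  delete(@host[tid]);
}'
```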
And I've solved real problems like this. When I was doing web hosting, before my current job, customer applications would be written in WordPress, and the wonderful thing about WordPress is that there are like 100 million plugins written by 100 million people who sort of know how to write plugins, and then people install 100 of them, so you get all those bad decisions amplified 100 times. Often they're written by people who maybe live in the US and have a 20 or 30 millisecond latency to the entire internet. Most of us in Australia know that our latency is ten times that, more like 300 milliseconds to most things, so if you make 20 network calls or 20 DNS calls at 0.3 of a second each, that really adds up, and suddenly it takes 10 seconds to load your web page. This is an example of a tool where we can see how many of those requests we made and how long they took.

Another tool we have, and these are all in the bpftrace toolkit, and we can write custom ones, is udplife. It shows us all the connections that happened on the system, not just TCP but UDP, and how long they took, how long traffic was going back and forward. It's a more generic version of the last one: there we were looking only at DNS requests, now we're looking at latency by program for all UDP connections, how long they took and where they were going. In this output we can see the same DNS requests and their latency (the latency is in the milliseconds column, and port 53 is DNS), and we can also see an NTP request that had a 15 millisecond latency. Unlike netstat, this shows every connection in real time as they happen, and again for the entire system.
Another tool is opensnoop: it shows, for every program, which path it opened and whether the open returned an error. So far these three tools are still in the category where they transfer a lot of data to user space, essentially every operation. One of the other nice things we can do with bpftrace is start to summarize information: instead of transferring every operation out of the kernel into the user-space analysis tool, we can send only a summary. Here's an example of that. We're looking at the I/O latency of every I/O request to a disk on the entire system, for every process, but rather than just an average we're now bucketing it into a histogram. In this example the vast majority, about 8,000 operations, took 8 to 16 microseconds, another eight thousand took 16 to 32, and you can see a bar graph of what proportion of the total requests fell into each bucket. Most of them are very fast, a few tens of microseconds, but for some reason one request out of 30,000-plus took over a million microseconds, which is one to two seconds. These are the outliers I was talking about: if one program made all these requests sequentially, it doesn't matter that every single one except one took about 10 microseconds; one of them took a second, so the whole process takes a second to complete. Those are the outliers we need to understand when we're trying to solve some of these weird problems. What's unique here is that we're getting a histogram, not just an average.
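The histogram tool described here is in the same family as biolatency from the bcc and bpftrace tool collections. A simplified sketch of the idea, keyed on the block-layer tracepoints; this is my approximation and it ignores merged requests and other corner cases:

```bash
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  // remember when each request was issued, keyed by device and sector
  @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  // latency in microseconds, bucketed into a power-of-two histogram
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'
```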
Here's another tool doing summary analysis, called syscount. It shows the top system calls by ID (this tool unfortunately doesn't print the friendly text name, but it is possible to do that) and also the top ten processes by process name, so which processes are making the most system calls. And we can customize all of these things; these are just example programs. Whatever problem you're trying to solve, you can change that summary or analysis to be whatever you want it to be.

So what makes this possible? That was a little taster of what we can do with this tool; let's understand how it does it. The first technology it makes use of comes from the ftrace subsystem, which was merged way back in 2008, so it's really not that new a technology. The first thing ftrace brought in was the ability to have events in the kernel that we could attach to or record in some way, and there are a few different types of those; there are more, which I'll explain in a minute, but here are a couple of the main ones. The first are static tracepoints. The kernel developer of, say, the disk subsystem or the networking subsystem has defined some tracepoints that aren't going to change, along with some predetermined information that gets recorded with them: it might be the network device name, it might be the file name, that kind of thing. Those are ABI-stable, so we can rely on them; even if the internals of the kernel change, that tracepoint will always have the same name and output the same information.

We also have the ability to attach to any function in the kernel. If you go looking through the kernel source and there's something that interests you, you can attach to any one of those functions and trace it too. But in that case, on different kernel versions those functions might change, and you also have to figure out how to get the extra context yourself: if you want the file name or the disk sector or something like that, you have to start poking around in the kernel's data structures. That's great once you get really good at it, but for the more typical case it's nice to have these predefined tracepoints that already know what information to record.
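For example, you can list both kinds of probe points and inspect what a tracepoint will hand you; the particular patterns here are just illustrations:

```bash
sudo bpftrace -l 'tracepoint:block:*'                  # ABI-stable tracepoints for the block layer
sudo bpftrace -l 'kprobe:blk_*'                        # kernel functions; names can change between versions
sudo bpftrace -lv 'tracepoint:block:block_rq_issue'    # show the arguments this tracepoint provides
```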
This infrastructure is also super high performance, and the way that works is that when the kernel is compiled (there are a couple of different mechanisms in different places, but basically) it emits a string of no-op opcodes for the CPU to execute, and CPUs are really fast at doing nothing; that's been optimized to death by the CPU vendors. So it's just these five bytes in memory that do nothing, and only when we turn on tracing for a particular function does that memory get overwritten with instructions that jump out to the code that handles the tracing. Every function in the kernel is ready to be traced like this, but there's no performance impact until you turn it on. Incidentally, this is similar to how livepatching works: livepatch patches the start of the function to jump off to the new version rather than the existing one, so the two are related.

The second technology that brings this together and makes it really work is the ring buffer. In the strace case we had a remote control for the process: the kernel would hand control to strace, strace would tell the kernel what it wanted to do, back and forth, back and forth. With the ftrace system there's a ring buffer in memory in the kernel, and the kernel itself handles the trace event, so it doesn't break out to the user-space analysis application at all; it writes the trace event into that ring buffer entirely in kernel context.
The user-space tool then reads from that buffer asynchronously, at the same time, and there's no lock between them. What that allows is that the recording is now entirely in kernel space, so we're not wasting all that time context switching back and forth, and our process isn't paused either; that little bit of recording work is just poking a few bits in memory, which is very fast compared with all the jumping back and forth.

One of the disadvantages, however, is that for the predefined static tracepoints the kernel knows how to gather the context and put it into that in-memory buffer so the user-space application can read it. If we don't have a predefined tracepoint, which happens when we want to trace some arbitrary kernel function instead, it doesn't know how to gather its context, and the only thing we can record is the fact that the function was called; we can't really store the context like the disk sector, the length of the operation, and that kind of thing.

This has some safety elements built in as well. If the buffer fills up, those events simply get lost, so rather than the kernel stalling and maybe making applications slow because the user-space side can't read fast enough, it just drops events. And if the processing of these events takes too much time, the kernel also has mechanisms for throttling it back to make sure it stays at maybe a few percent of your system's time at most.
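This is the same machinery you can drive by hand through tracefs. A rough illustration, run as root; on older systems the path may be /sys/kernel/debug/tracing instead:

```bash
cd /sys/kernel/tracing
echo 1 > events/syscalls/sys_enter_write/enable   # arm a static tracepoint
cat trace_pipe                                    # stream events out of the in-kernel ring buffer
echo 0 > events/syscalls/sys_enter_write/enable   # disarm it again
```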
So let's look at the effect of this. Before, we saw that the same dd operation took 0.2 of a second on its own and 18 seconds under strace, even though strace wasn't tracing anything dd actually did. If we use perf trace instead, which takes advantage of that in-kernel tracing infrastructure and the ring buffer, so the applications aren't waiting on each other, we can now trace the same information in three seconds. That's still 15 times slower than our baseline, but it's a huge improvement over 18 seconds. And we can improve it again. That run was tracing every system call: reads, writes, connects, everything. If we narrow it down to only the write system call (and remember, strace took 18 seconds to observe nothing, whereas here we're observing half the work the process did, because it reads 500,000 bytes and writes 500,000 bytes, so half the work is reads and half is writes) we can get it down to one second, which is only five times slower than the baseline case.

But we're still copying a lot of data to user space: every one of those operations goes into the ring buffer, and user space reads out all 500,000 of them. So we're way faster, but we're still spending resources that could be saved.
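A sketch of the perf trace comparison, again approximating the exact commands from the demo:

```bash
# Trace every syscall dd makes, using the in-kernel tracing infrastructure and ring buffer
sudo perf trace dd if=/dev/zero of=/dev/null bs=1 count=500k

# Narrow it down to just the write syscall
sudo perf trace -e write dd if=/dev/zero of=/dev/null bs=1 count=500k
```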
to run it has all these 663 00:21:45,659 --> 00:21:49,020 safety checks that make it safe to 664 00:21:47,100 --> 00:21:51,299 actually put user code into the kernel 665 00:21:49,020 --> 00:21:52,799 which sounds a little crazy and uses 666 00:21:51,299 --> 00:21:54,419 that to filter the packets and so now 667 00:21:52,799 --> 00:21:55,860 only the packets we're interested in are 668 00:21:54,419 --> 00:21:58,020 actually sent out to T speed dump which 669 00:21:55,860 --> 00:22:01,260 had a huge performance win but it was 670 00:21:58,020 --> 00:22:03,600 very specific to to packet capture 671 00:22:01,260 --> 00:22:05,840 and this was in 1992 so very long time 672 00:22:03,600 --> 00:22:05,840 ago 673 00:22:06,480 --> 00:22:12,179 in about 2013 uh some developers took 674 00:22:10,260 --> 00:22:13,679 that BPF and they enhanced it they added 675 00:22:12,179 --> 00:22:15,840 that little virtual machine they've 676 00:22:13,679 --> 00:22:18,240 added more registers more storage space 677 00:22:15,840 --> 00:22:19,860 they made it jit compiled so rather than 678 00:22:18,240 --> 00:22:21,360 kind of interpreting that virtual 679 00:22:19,860 --> 00:22:24,240 machine that's now converted to machine 680 00:22:21,360 --> 00:22:25,799 code kind of like JavaScript might do 681 00:22:24,240 --> 00:22:27,659 and they added the ability to drive 682 00:22:25,799 --> 00:22:28,620 these programs not just from packets but 683 00:22:27,659 --> 00:22:30,360 from all sorts of different event 684 00:22:28,620 --> 00:22:32,640 sources 685 00:22:30,360 --> 00:22:34,140 and one of those event sources is these 686 00:22:32,640 --> 00:22:36,480 same events that we're tracing so those 687 00:22:34,140 --> 00:22:37,679 those static Trace points those K probes 688 00:22:36,480 --> 00:22:39,840 all these event sources can now be 689 00:22:37,679 --> 00:22:42,000 connected to a BBF program 690 00:22:39,840 --> 00:22:43,380 and we have all that same safety the 691 00:22:42,000 --> 00:22:44,700 kernel is going to verify there's 692 00:22:43,380 --> 00:22:46,380 actually reads and parcels and 693 00:22:44,700 --> 00:22:48,120 understands the entire program before it 694 00:22:46,380 --> 00:22:50,039 loads it to make sure it won't crash 695 00:22:48,120 --> 00:22:51,659 that it won't take too long to run but 696 00:22:50,039 --> 00:22:53,700 it can't access unsafe memory in that 697 00:22:51,659 --> 00:22:55,260 and that kind of thing 698 00:22:53,700 --> 00:22:57,179 that does put some limits on the program 699 00:22:55,260 --> 00:22:58,799 and so to get a lot of the work done we 700 00:22:57,179 --> 00:23:01,200 need to do the kernel has little helper 701 00:22:58,799 --> 00:23:02,460 programs the BPF program basically make 702 00:23:01,200 --> 00:23:04,620 some function calls out to certain 703 00:23:02,460 --> 00:23:05,880 kernel programs that will maybe make 704 00:23:04,620 --> 00:23:08,299 sure that whatever you're getting back 705 00:23:05,880 --> 00:23:08,299 is safe 706 00:23:08,880 --> 00:23:12,720 the win this gives us is now we can take 707 00:23:11,159 --> 00:23:14,340 any tracing event and instead of just 708 00:23:12,720 --> 00:23:15,720 putting the full output of that into the 709 00:23:14,340 --> 00:23:17,400 ring buffer and sending it to user space 710 00:23:15,720 --> 00:23:19,140 and now we can process that in kernel 711 00:23:17,400 --> 00:23:22,020 and we can actually use code just like 712 00:23:19,140 --> 00:23:24,059 the packet dump to summarize or extract 713 00:23:22,020 --> 00:23:25,559 
In about 2013 some developers took that BPF and enhanced it. They took the little virtual machine and added more registers and more storage space, they made it JIT-compiled, so rather than the virtual machine being interpreted it's converted to machine code, a bit like JavaScript engines do, and they added the ability to drive these programs not just from packets but from all sorts of different event sources. One of those event sources is the same events we've been tracing: the static tracepoints, the kprobes, all of these can now be connected to a BPF program. And we keep all the same safety: the kernel verifies the program, it actually reads, parses and understands the entire program before loading it, to make sure it won't crash, won't take too long to run, and can't access unsafe memory, that kind of thing. That does put some limits on the program, so to get a lot of the work done the kernel provides little helper functions; the BPF program makes calls out to those kernel helpers, which make sure that whatever you get back is safe.

The win this gives us is that we can now take any tracing event and, instead of putting its full output into the ring buffer and sending it to user space, process it in the kernel. Just like the packet filter, we can use code to summarize or extract data in the kernel and then put only the results into the ring buffer that goes out to user space. Now we're only transmitting a very small amount of data, so we're even faster again.

Looking at this example, I've written a tool in bpftrace using this technology to do the same thing we did before. Previously it took us one second to trace every write call, because we were transferring every write call out to user space, but maybe I just wanted to know how many there were. So I wrote a program that counts the write calls, does all the counting in the kernel, and the only thing it sends to user space is the number 512,000. That takes 0.3 of a second, which is only 50 percent slower than our baseline case. And this is a really contrived example, because the workload is purely system calls; if in reality the program spent a lot of CPU time doing the actual work programs do, and not just shuttling data back and forth, the impact would be even less, often maybe five or ten percent at most. As an example of that, with fio, which is a really good disk benchmarking tool, running an actual benchmark: I get 37,000 IOPS with no analysis running, and it only drops to 34,000 when I start analyzing the latency of every I/O request on the system. That's a much, much smaller drop than we had with all those other methods.
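The counting tool described a moment ago isn't shown in the transcript, but it would look something like this one-liner; treat it as my approximation:

```bash
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /comm == "dd"/ {
  @writes = count();   // counted entirely in the kernel; only the total is printed at exit
}' -c 'dd if=/dev/zero of=/dev/null bs=1 count=500k'
```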
Incidentally, eBPF was first added to the kernel not for strace as such, but for a thing called seccomp BPF. That was a little program we could send to the kernel: when we used that ptrace mechanism and said "intercept every system call", we could send a small program to the kernel to filter which system calls we wanted to trace, so strace wouldn't be woken up for all the unrelated system calls, only the ones we wanted. This is used for a few things, but when we use it in strace it halves the time we take; it still takes 10 seconds, versus the one second or 0.3 of a second we managed with in-kernel collection and in-kernel analysis of the data, because it only cuts out the events we didn't care about; every event we do care about is still transferred to user space by strace.

So let's have a quick look at the language of bpftrace. You can actually write BPF programs in Python or C or Go, but doing that is quite involved; if you're a software engineer that's quite possible, but for more of a sysadmin or SRE type of person, who maybe knows just enough to be dangerous when it comes to coding or how the kernel works, it's a fairly forbidding task. bpftrace's much nicer custom language lets us write these programs in a much friendlier way, so here's an example of what the language looks like. The bpftrace language is inspired by the DTrace language, a tool on Solaris from 2005 that did very much the same thing, and in the same way DTrace was inspired by awk; the reason it looks like awk is that it was heavily inspired by awk.
And that's why we have a BEGIN and an END, so we can run some code at the start and end of our script; in this case we just print out a nice little string so the person knows we're actually doing something and we've started our work. We can specify a probe type, and an example of a probe we might use is a system call, the write syscall, but there are a lot of different probes and we'll look at all of them in a second. We then have a predicate, so we can have a condition, again like awk: we can say I only want to match this probe if the process is a python process. Then we can run some actual code, and in this case we store into a global map called start, indexed by the thread ID, the system time when that probe fired.

Here's another probe with a more complicated predicate: first we check that we actually got a non-error return value, we also check that we're python, and we check that we matched that other probe first; then we do some more logic, in this case summing the latency, the time it took from the first probe to the second. We calculate that latency and sum it per process ID, so for whatever we're probing we add up the total time spent in that period, and then every 30 seconds we print out all the data we've stored, and at the end we print it too. So, a bit like iostat and top and those tools do, it can print at intervals and then print the result at the end.

[Audience question about the count() in the slide.] So the question is that there seem to be some magic variables in there, like count. Yes, there are a bunch of those, although count is actually not a variable, it's more of a function that counts up... oh, I see what you're saying: I've actually made a mistake when I was preparing this slide. My mistake, I put count there; that should probably be @time. That was just an error in my slide preparation; this is not a real program, I was just trying to show the shape of it in text. Thank you for catching that.
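The slide itself isn't reproduced in the transcript, but a program matching that description (including the @time correction) would look roughly like this; the probe names, the "python" filter and the 30-second interval follow the description above, so treat it as a sketch:

```bash
sudo bpftrace -e '
BEGIN { printf("Tracing write() latency for python... hit Ctrl-C to end.\n"); }

// first probe: remember when the syscall started, keyed by thread ID
tracepoint:syscalls:sys_enter_write /comm == "python"/ {
  @start[tid] = nsecs;
}

// second probe: only if the call succeeded, we are python, and the first probe matched
tracepoint:syscalls:sys_exit_write /args->ret >= 0 && comm == "python" && @start[tid]/ {
  @time[pid] = sum(nsecs - @start[tid]);   // total nanoseconds spent, per process ID
  delete(@start[tid]);
}

interval:s:30 { print(@time); }   // print the summary every 30 seconds

END { clear(@start); }            // remaining maps such as @time are printed automatically at exit
'
```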
I mentioned before that there are a bunch of different probe events we can attach to. The two we've talked about so far: one was tracepoints, the predefined ones in the kernel that kernel developers have given us as a convenience, where bpftrace already knows how to extract some context. We can also use a kprobe, which attaches to any kernel function, and kprobes come as a kprobe and a kretprobe: as I described earlier in the strace example, at the start of a function we need to intercept it to capture the arguments of the function, and then we need to intercept it again at the end to get the return value, or perhaps collect the results of that function running, so we probe those separately.

We can also probe user-space programs: a uprobe will probe a function not in the kernel but in a program or a shared library, and I'll show in a bit how, for example, we can probe the SSL library in every process on the system and capture information from that. User-space programs can also, much like tracepoints in the kernel, have statically defined user-space tracepoints called USDT, and a bunch of pieces of software ship some of those. And we can do a bunch of periodic things: we can run on an interval, and we can run based on performance counters from the CPU. Those last ones get a bit more advanced, so we're not going to talk about them today, but there's a whole range of them you can look at. The user-space probes, the kernel probes and the tracepoints are the most interesting ones.
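For reference, the attach-point syntax for these probe types looks roughly like this (the library paths, the usdt example and the counter settings are placeholders, not taken from the talk):

    tracepoint:syscalls:sys_enter_openat        # static kernel tracepoints
    kprobe:vfs_read                             # entry of (almost) any kernel function
    kretprobe:vfs_read                          # return of a kernel function
    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write    # user-space function entry
    uretprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read  # user-space function return
    usdt:/path/to/binary:provider:probe_name    # statically defined user-space tracepoints
    interval:s:30                               # run periodically on a timer
    profile:hz:99                               # sample across CPUs at 99 Hz
    hardware:cache-misses:1000000               # fire per N CPU performance-counter events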
So I promised five recipes, so I'm going to go through five solutions I've actually used when solving problems and show the program and what it looks like.

The first one: I was trying to profile some Python applications. Now, bpftrace actually has the ability to get the kernel stack and the user-space stack, the actual call stack like a debugger would, but it can only do that for programs with traditional frame pointers and debug info. That doesn't work with Python right now; it might get fixed, because there is some ability to add a bit of magic so that it can understand what's going on inside Python, but I was using a different Python tool called Austin, which will profile the Python application. The catch was that I wasn't launching python myself, something else was, so what I wanted to do was catch every time python was started and then immediately launch the Austin profiler against that python process.

This is a simple program where we capture every time we execute something (it isn't called exec, it's called execve), check whether it was successful, and, if it's a Python 3 process, output the process ID; I then process that output to launch the profiler. I was using this because at Canonical we deploy software with a bunch of Python code that's orchestrated by a tool called Juju, and Juju was launching it; I just wanted to profile everything it launched without having to modify Juju, and this was a super easy way to do that.
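A minimal sketch of that idea (the comm match is illustrative and may need adjusting on your system, for example to "python3.10"):

    #!/usr/bin/env bpftrace
    // Print the PID of every successful execve() of a python3 binary.
    tracepoint:syscalls:sys_exit_execve
    /args->ret == 0 && strncmp(comm, "python3", 7) == 0/
    {
        printf("%d\n", pid);
    }

The printed PIDs could then be fed from a small shell loop into something like "austin -p <pid>" (assuming Austin's attach-to-PID flag), skipping bpftrace's initial "Attaching..." banner line.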
Here's another example. A version of this tool ships with bpftrace, but it doesn't tell you the offset into the disk that it was reading from or the length it was reading. I was debugging a Ceph case and I wanted to understand where on the disk we were reading, at what length, and what the latency was, so I customized that tool and enhanced it to understand the disk offset and the disk length and output those.

In doing so I discovered why tracepoints are great: kernel developers who know much more about that kernel subsystem have predefined all that information and how to get it. I was using kprobes, so I had to look through the data structures and follow several structures myself to find things like the length and the offset, and it turns out I did that wrong, in such a way that the length and offset looked right most of the time but were sometimes entirely wrong. So I do advise some caution if you're a kernel developer who knows just enough to be dangerous (that's me): verify the output of anything you've come up with if it's a kprobe, or rely on someone who knows a bit more. But I've only made that mistake once, so...
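A sketch of that kind of tool, but built on the block-layer tracepoints rather than the kprobes the speaker used (field names as defined by the block:block_rq_issue and block:block_rq_complete tracepoints on recent kernels; matching issue to complete by device and sector is approximate when requests are merged or split):

    #!/usr/bin/env bpftrace
    // Per-I/O device, sector offset, length and latency, plus a histogram.

    tracepoint:block:block_rq_issue
    {
        @start[args->dev, args->sector] = nsecs;
    }

    tracepoint:block:block_rq_complete
    /@start[args->dev, args->sector]/
    {
        $us = (nsecs - @start[args->dev, args->sector]) / 1000;
        printf("dev %d sector %llu len %u sectors: %d us\n",
            args->dev, args->sector, args->nr_sector, $us);
        @latency_us = hist($us);
        delete(@start[args->dev, args->sector]);
    }

    END
    {
        clear(@start);   // @latency_us is printed automatically on exit
    }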
Another tool I used in a real case: I did a whole talk on this at LCA in 2020, called "Fixing Windows guests on Ceph", with the whole story. In that case, what was happening is that Linux machines do all their I/O in four-kilobyte blocks, while Windows machines do it in 512-byte blocks, and most modern storage devices want all their I/O done in four-kilobyte blocks, not 512-byte blocks. So Windows was sending a 512-byte write, and internally Ceph was sending that to the kernel as a buffered I/O request; the kernel only got 512 bytes, but it needed four kilobytes, because it can't send any less than that to the disk. So instead of the write call just being buffered in memory, with the kernel not even waiting before sending it on to the disk, it had to go out to the disk, read the rest of the four kilobytes, pull that in, change only those 512 bytes, and then return to the program. That's called a read-modify-write penalty.

We didn't realize that was happening at the time, but I had already narrowed down that the pwritev system call was taking a long time and I didn't know why. So with this program I broke down the latency of each pwritev call and which process made it, and from that I determined that the tp_fstore_op thread, which is an internal Ceph thread, had these bad latencies, and that was the process I was interested in.
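A minimal sketch of that breakdown, keying the latency histogram by thread name and PID (Ceph names its threads, so the comm key separates threads such as tp_fstore_op from everything else):

    #!/usr/bin/env bpftrace
    // Per-thread pwritev() latency in microseconds.

    tracepoint:syscalls:sys_enter_pwritev
    {
        @start[tid] = nsecs;
    }

    tracepoint:syscalls:sys_exit_pwritev
    /@start[tid]/
    {
        @usecs[comm, pid] = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
    }

    END
    {
        clear(@start);   // the @usecs histograms are printed automatically on exit
    }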
I had another case that was a murder mystery: something was randomly killing processes on the system and I didn't know what, and that's a really hard problem to debug; traditionally there's no easy way to find out who killed what. So in this case I attached to every point where a signal is generated in the kernel and printed out which process sent which signal to which other process. From that I was able to figure out it was actually systemd killing my process, for a very complicated reason to do with session scope management, but that's another excellent bpftrace example.
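A sketch of that signal tracing (field names follow the signal:signal_generate tracepoint; a predicate such as /args->sig == 9/ could restrict it to SIGKILL only):

    #!/usr/bin/env bpftrace
    // Log every signal generated in the kernel: who sent what to whom.
    tracepoint:signal:signal_generate
    {
        printf("%s (pid %d) sent signal %d to %s (pid %d)\n",
            comm, pid, args->sig, str(args->comm), args->pid);
    }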
And this was kind of a fun one that I like. All the stuff before was looking at the kernel; this one uses uprobes. What we're doing here is probing the SSL library, again across the entire system, and hooking into the SSL_read and SSL_write calls. When there's a call, we read the buffer that was passed into SSL_read or SSL_write and print out the data the program is sending. If you've ever tried to debug an HTTPS connection with tcpdump, all you get is the encrypted data, and sometimes you can convince your program to write the encryption key out so you can try to decrypt it, but we can just bypass all of that: hook into the SSL calls and look at the unencrypted data before it's encrypted or after it's decrypted.

This does unfortunately highlight a current limitation of bpftrace, which is that we can only print or capture strings of up to 200 bytes at most. That's a fixable limitation, but no one has fixed it yet. There are tools that use a full language such as Go to do the same thing, one called eCapture, and they can bypass that limitation and print out all the data, but if you use bpftrace you'll only get 200 bytes at a time.
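A sketch of the SSL hooking (the library path is an assumption and will vary by distro and libssl version; the str() output is capped by the string-length limit just mentioned):

    #!/usr/bin/env bpftrace
    // Dump plaintext passing through OpenSSL, system-wide.

    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write
    {
        // SSL_write(ssl, buf, num): arg1 is the buffer, arg2 its length
        printf("SSL_write %s(%d): %s\n", comm, pid, str(arg1, arg2));
    }

    uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read
    {
        @buf[tid] = arg1;    // remember the destination buffer until return
    }

    uretprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read
    /@buf[tid] && retval > 0/
    {
        printf("SSL_read  %s(%d): %s\n", comm, pid, str(@buf[tid], retval));
        delete(@buf[tid]);
    }

    END
    {
        clear(@buf);
    }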
And as a bonus, because I wanted to throw in an extra one, number six out of five: I wanted to trace something in the kernel and then correlate it with the stack of whatever called it. In this case I was trying to understand why systemd was, at some point, changing the CPU set of a process, the list of CPUs it can run on, and I didn't know which area of systemd was doing it. You can actually wildcard kprobes (unfortunately it's not on the slide right now), so I said cpuset-star: probe every cpuset function in the kernel and print me the user-space stack of whatever ultimately called it. That way I could see all the code paths in systemd that were calling into those functions, and that let me figure out where to go looking for why it was changing the CPU set in a way I didn't expect.
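A minimal sketch of that wildcard probe (readable stacks need frame pointers and symbols in the target binary, as noted earlier; the maps are printed when the tool exits):

    #!/usr/bin/env bpftrace
    // Fire on every kernel function whose name starts with "cpuset_",
    // but only when systemd is the caller, and aggregate the user-space
    // stacks that led there.
    kprobe:cpuset_*
    /comm == "systemd"/
    {
        @calls[probe, ustack] = count();
    }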
So ultimately there are many different methods of doing these things. Some of the limitations I talked about, like top missing things, you can solve in other ways, but the nice thing about bpftrace is that it lets you solve a lot of those problems in a language that's very sysadmin friendly; you don't necessarily need to be a kernel engineer. And you're always using the same tool: there is, for example, a version of top called atop which uses process accounting so it doesn't miss processes, but it's quite complicated. So bpftrace lets you solve lots of problems in a quick and easy way, but there are a bunch of limitations, like that string-length one, and if you want to do more advanced things you can still use the same technology from a real language like Python or Go, which gives you more flexibility.

There are a couple of other related tools I often use for this. The perf tool itself: if you don't want to get into bpftrace, you can do a lot with just perf, it just doesn't do the summarizing part for you. And FlameGraph, if you've never heard of it, is a really great tool for showing you graphically where you're spending CPU time, which can help with these problems.

Last of all, as a quick note, this concept of putting tiny user-defined, event-driven programs into the kernel has really taken off, which some of you may or may not have seen. We're not just doing packet captures or tracing any more: you can now put entire network load balancers in the kernel, custom CPU schedulers, firewall processing, profiling, auditing. It's become an industry; BPF has become a technology in its own right, and every company doing interesting things, often in Kubernetes, is putting all sorts of stuff into it. If that sounds interesting, or horrifying, I'd recommend reading up on it and seeing what you can do.

If you want to find out more, there are some resources here: there are a couple of good books by Brendan Gregg, who is a very good resource on this topic, and that's where to find bpftrace; if you search YouTube for bpftrace there are a lot more talks that go into much more depth. And that is all I have.

So I believe we have maybe a couple of minutes for questions, a little bit of time. Anyone got any questions?

An audience member asks: my go-to tool for this is SystemTap; how does bpftrace compare to SystemTap? Great question. So the question is how SystemTap compares to bpftrace. SystemTap lets you do a bunch of the same things that bpftrace does, and has done for a long time. The way SystemTap worked internally is that it would take a similar kind of language (different, but similar), turn it into C, compile it, and load it as a kernel module. So it didn't have the same implicit safety guarantee of the generic BPF system; it tried to provide its own safety guarantees, but not in the same way. It also required you to have a compiler and kernel headers and to load a module, and that has become more of a problem now with Secure Boot, because you then need to sign that module and so on, which you potentially avoid with the BPF method. But you can solve a lot of the same problems with SystemTap.

One thing I didn't talk about very much: right now, when you probe a user-space program, that's actually a much bigger performance penalty than the kernel side, because it traps to the kernel and back every time one of those probes is hit. SystemTap had a solution for that, where it would push some of that work into user space; the same thing could be done with bpftrace, but no one has really done the work to make that happen yet, so there might be an advantage there in some cases.

Okay, thank you very much, Trent, for your time and effort; here's a little gift from us, thank you very much. Just a little thing, people: the key card for the Oaks accommodation hasn't been claimed, so if you're staying at one of the Oaks venues, please check that you've got your key card, because finding out when you get there this evening is not going to be a lot of fun. All right, let me go...