[Music]

Hi folks, apologies for the delay. It's my pleasure to present Jenna and her talk on the birdwatcher's guide to optimized data pipelines. Take it away, Jenna.

Thank you.

Oh, I thought presenting would be the scary part. It turned out to be getting the slides on. Hey everyone, thank you so much for coming. My name is Jenna, and today I would like to share with you a little story about performance optimization. For the next 30 minutes, it's going to be birds, planes, and data pipelines.

I am a data scientist, and I work in aviation at an AI company called Assaia, where we build machine learning and computer vision systems for airlines and airports.

Recently I had to optimize one of our feature engineering pipelines. That inspired this talk, but it also sent me down a rabbit hole of optimization techniques. Don't worry, I'm not here to drag you down that rabbit hole with me. Instead, I would like to share with you a few things that I learned along the way. This is a checklist I put together for myself, and I find it quite handy when your pipeline is running slow or eating too much memory. All right, let's get started.

On this slide you can see a dashboard from the real world. The white bars show when ground operations are happening: fueling, cleaning, catering, passengers in, passengers out. My task was to turn these detections into features for a machine learning model that will later predict delays: will the plane leave on time, or is it going to be late, and for how long? But in the spirit of spring here in Australia, I would like to translate this task into a bird story. So instead of aircraft operations, imagine my backyard feeder.
The task is to count how many birds of each species are present at every 5-minute interval. The goal is to answer: do I have enough food for the day, or do I need to run to the supermarket real quick and grab some worms and seeds for my birdies?

And this is the outline: a bird's-eye view of all the optimizations we will cover, from algorithmic complexity to vectorization, data frame choice, and all the way down to streaming and hardware acceleration.

The bird version of the task is the following. We've got three species: sparrow, greenfinch, and crochets. Those are the local birdies from the Vietnamese highlands, where my home currently is. Each bar is when a bird is at the feeder, and our job is, every 5 minutes, to count how many birds of each species are present.

Here's the same thing as a pipeline. The input is detections with timestamps, stored on disk, plus a list of the species we care about, because we'll be filtering some out. The output is a table of counts by species.

To compare optimizations, I'll be using benchmarks on synthetic data that fits comfortably on one local machine: 25 GB of RAM, five CPU cores, some GPU, and, for this task, let's say unlimited disk. Throughout the talk I'll show two extremes: a tiny XS case, and a bigger L case with 300 species and 5 million detections. Each step is warmed up and then run five times, and we measure runtime and peak memory.

All right, let's get started with the optimizations. But first we need to look at the naive algorithm. It's really naive: for each time step, we loop over every detection, check if the bird is there, and add one, with an early return, the break, in the inner loop. So the time complexity of this thing is O(T × D), time steps multiplied by detections.
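To make that concrete, here is a minimal sketch of the naive counting, assuming each detection is a hypothetical (species, start, end) tuple and `timestamps` is the 5-minute grid; the actual code from the talk's slides isn't reproduced here.

```python
from collections import defaultdict

def count_naive(timestamps, detections):
    """O(T * D): for every time step, scan every detection."""
    counts = {t: defaultdict(int) for t in timestamps}
    for t in timestamps:                         # T time steps
        for species, start, end in detections:   # D detections
            if start <= t < end:                 # bird present at this moment
                counts[t][species] += 1
    return counts
```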
And this is our baseline. It runs, but it's painfully slow and inefficient. So optimization number one: let us fix the algorithm. In our case we can do better with a sweep-line approach. Let's simplify to just sparrows for now and add the other birdies later. The trick here is to turn each detection into two events: plus one when the bird arrives, and minus one when it leaves the feeder. That gives us a table of times and deltas.

Now we can run a cumulative sum, like a running total, and at every moment it will tell us how many sparrows are present. The sweep step itself is linear, but the sorting dominates the complexity, so the whole algorithm runs in O(T + D log D): the time steps are sorted beforehand, and sorting the D events costs D log D. Finally, we merge these events with the timestamps we cared about in the beginning, and that gives us the counts we wanted.

So in our case we sped things up with a simplified sweep-line algorithm. Of course, this is just my example. The point is to check the algorithm first, and only once we've squeezed that does it make sense to move on to the further, fancier tricks.

All right, here is how it shows up in numbers. On the XS data set, naive takes about 20 seconds, and sweep line drops that to under a second, which is a huge win. Starting from the next optimization, I'll show results only on the larger data set, so you can see how they work at a larger scale, and I'll come back to this XS version at the end, so we can see how those optimizations play out at the tiny scale.

With the algorithm sorted out, the next step is vectorization.
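For reference, a minimal pandas sketch of that sweep line for a single species, assuming hypothetical frames `det` (with "start" and "end" timestamp columns) and `grid` (the sorted 5-minute timestamps):

```python
import pandas as pd

# each detection becomes two events: +1 on arrival, -1 on departure
events = pd.concat([
    pd.DataFrame({"time": det["start"], "delta": 1}),
    pd.DataFrame({"time": det["end"], "delta": -1}),
]).sort_values("time")                        # sorting dominates: O(D log D)

events["present"] = events["delta"].cumsum()  # linear sweep: running total

# attach the running total to the (already sorted) 5-minute grid
counts = pd.merge_asof(grid, events[["time", "present"]], on="time")
```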
A loop is imperative: you write a for loop and tell Python exactly how to walk each row. The problem is that in Python, every iteration has to go through the interpreter, and that means function calls, type checks, and object overhead. The vectorized version is declarative: you tell pandas which columns you want, and it applies the operation to the whole column at once. That lets pandas use its C engine under the hood, which is much faster than our slow Python loop.

And this is how it looks on the L data set. The loopy version took more than 9 minutes, and the vectorized version cut that down to about four. That's more than twice as fast, and we achieved it just by changing how we express the code. So vectorization definitely helped, but the engine we're using matters too, immensely.
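The contrast looks roughly like this: a small sketch using the hypothetical events table from before, with an assumed "kind" column marking arrivals.

```python
import numpy as np

# imperative: one interpreter round-trip per row
deltas = []
for _, row in events.iterrows():
    deltas.append(1 if row["kind"] == "arrive" else -1)
events["delta"] = deltas

# declarative: one whole-column operation, executed in compiled code
events["delta"] = np.where(events["kind"] == "arrive", 1, -1)
```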
And that brings us to the next point, data frame choice. My favorite one, probably. Unlike some ten years ago, when there was mostly pandas, today we have a whole zoo of data frame libraries we can choose from, and there is no single best one for every task. It depends on our pipeline, on the task we're aiming to solve. So let us ask ourselves a few guiding questions and eliminate the choices one by one.

First: does our data fit on one machine, or do we need a distributed cluster? If we needed a cluster, we would probably reach for Spark or Dask. Dask is basically pandas that scales: it splits a data frame into many chunks and runs them in parallel, either on your laptop or on a cluster. Spark and Dask both handle distributed workloads really well, but since our bird data fits comfortably in memory on one machine, we don't need either of them here.

All right, question number two: am I prototyping fast, or building something that needs to be durable and scalable? If we were just prototyping, pandas would be a wonderful choice. It is familiar, we all probably know the syntax, and it has a huge ecosystem. But at larger scale it would most definitely slow things down for us. So here we'll cross out pandas for our job, since speed is one of our major concerns and the data is quite large.

And the last question: do we need SQL, or like SQL, or would we rather prefer a data frame API? If you prefer SQL, DuckDB is a fantastic choice: it's columnar and super fast. But since we started in pandas, I would like to stay in the data frame world for this talk. So, after crossing out all the others, we landed on Polars.

And just a quick note before we get deeper into Polars: there's a brand-new library called Narwhals, released just last year. It gives us one syntax across all those different libraries we've been talking about, and it lets you easily experiment and even mix them seamlessly in one code base. It's still very early stages, and I haven't used it in production myself, but it's worth keeping an eye on.
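As an illustration of that idea, here is a small sketch of a backend-agnostic function, assuming Narwhals' Polars-style expression API (from_native, group_by, nw.len); treat the details as approximate rather than authoritative.

```python
import narwhals as nw

def detections_per_species(native_df):
    # the same code accepts a pandas, Polars, or other supported frame
    df = nw.from_native(native_df)
    out = df.group_by("species").agg(nw.len().alias("n_detections"))
    return out.to_native()  # result comes back in the caller's own library
```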
So, Polars. This slide shows you two things. First, Polars is genuinely fast. I think the word they use is "blazingly", so genuinely, blazingly fast. Benchmarks like this one, on joins and group-bys, show it running in seconds where pandas takes minutes. And second, it's not just theory: the community is adopting it quickly. GitHub stars for Polars and DuckDB have been climbing steeply since 2022, while pandas has been growing steadily, but only steadily, for over a decade. So the excitement around Polars comes from both sides: its strong performance and a real momentum in the community.

So what actually makes Polars so fast? There are two pillars: the Rust back end and the Apache Arrow memory format. Rust gives us multi-threading and SIMD (single instruction, multiple data), which means safe parallel execution across cores. Arrow gives us a columnar layout, which is compact and cache-friendly, and it tracks nulls with a bitmap. That's a nice thing for people coming from a pandas background: pandas with the NumPy nullable backend has a similar mechanism, but it uses one byte per value instead of one bit. Same idea, less efficient. And because Arrow is a standard, Polars can interoperate seamlessly with other libraries. DuckDB uses Arrow's in-memory format under the hood, as do PyArrow and Spark, all without any copies. That interoperability is a huge deal.

So Polars is fast, but what I also love about it is how it looks from the outside: its syntax. Polars works with expressions. You can think of an expression as a recipe: pl.col, cumulative sum, over, alias, all of these are expressions. Each one is a recipe, but they don't cook on their own. What they need is a kitchen, a context where they will get executed. The context here, the one encapsulating all the expressions, is select. There are other contexts too: group_by, which should be familiar from the other data frames; with_columns, which is like assign in pandas; and filter. And the real power is that you can chain these expressions, and that chaining will unlock the next optimization for us: lazy execution.
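In code, the recipe-and-kitchen split looks something like this: a sketch of our running total per species, with the expressions (col, cum_sum, over, alias) evaluated inside a select context. The frame and column names are the hypothetical ones from the bird example.

```python
import polars as pl

running = events_df.sort("time").select(
    pl.col("time"),
    pl.col("species"),
    # expressions are the recipes; select is the kitchen that executes them
    pl.col("delta").cum_sum().over("species").alias("present"),
)
```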
But first, let's look at the numbers on size L. Switching from pandas to Polars makes the pipeline more than twice as fast, and the key thing is we didn't change the logic at all, just the syntax. We could have done it with an LLM in a few minutes. You might remember the benchmark numbers I showed you earlier for Polars, where the speedup was really dramatic, sometimes an order of magnitude. In a full pipeline like this one, the gain is more modest, since the work is more diverse and not just the highly optimized joins and group-bys. Still, it gave us another great speed improvement, and from this step on, everything will run in Polars.

And here is where Polars brings in something not every library has: lazy execution. The trick is that Polars doesn't have to run each step immediately. Instead, it can hold off, build a full plan, and then execute it in the most efficient way it can. That's what lazy execution is about. It's like making a shopping list before you go to the supermarket. If you just buy items one by one, in the order you remember them, you'll be running back and forth between the aisles all the time. You'll get your steps in, that's a good thing, but you'll waste a lot of time. With a full list, you can make one efficient trip through the aisles and get your groceries done.

And here is how this idea looks in code. In the snippet, we're selecting the first three greenfinches to leave the feeder. Notice we use scan_csv instead of read_csv there; that's what creates a lazy data frame for us. Nothing is running yet. Polars just collects the steps, and only when we call .collect does it execute them, with all the optimizations applied.
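A minimal sketch of that kind of lazy query, assuming a hypothetical detections.csv with species, start, and end columns:

```python
import polars as pl

q = (
    pl.scan_csv("detections.csv")           # lazy scan: nothing is read yet
    .filter(pl.col("species") == "greenfinch")
    .sort("end")
    .select("species", "start", "end")
    .head(3)                                # first three to leave the feeder
)
print(q.explain())   # inspect the optimized query plan
df = q.collect()     # only now does Polars actually execute
```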
If we peek at the query plan, you can see how it works. You read it from bottom to top. On the left is the unoptimized plan: exactly the steps I wrote, in the order I wrote them. On the right is the optimized plan: the filters and selects have been pushed all the way down to the scan, so we never even load unnecessary data into memory. And you see those little sigma and pi symbols there on the right? They stand for projection and predicate pushdowns. Projection pushdown means we only read the columns we need, in our case just three out of five columns. It's like a SQL SELECT, or a Polars select. Predicate pushdown means we only read the rows we need, the greenfinch rows in our case. It's like a SQL WHERE, or a Polars filter. Keep these in mind; they'll come in handy when we talk about file formats.

Lazy execution isn't just for local files; it also works with data lakes. I like to think of a data lake as a giant library full of super cool books, but all of them are just stored in boxes on the floor. You can find something if you really need to, but it's going to be super painful. Table formats like Apache Iceberg here, the one our bird is sitting on, add a catalog to this library. Now you know which shelf, which book, and which page to look at. And what Polars does in this picture is use that catalog like a librarian: it finds your book and fetches the exact pages you ask for, like, for example, the notes on the Atlantic Puffin in our little snippet.
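A sketch of what that might look like against an Iceberg table. The table location and column names here are hypothetical, and scan_iceberg expects an Iceberg table source (for example via PyIceberg):

```python
import polars as pl

lf = pl.scan_iceberg("s3://bird-lake/detections")      # catalog-aware lazy scan
puffin_notes = (
    lf.filter(pl.col("species") == "Atlantic Puffin")  # predicate pushdown
      .select("time", "notes")                         # projection pushdown
      .collect()                                       # fetch only those "pages"
)
```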
Lazy mode doesn't change much here. We went from 104 seconds to 96, but we changed how Polars runs the query, and that will unlock a whole family of optimizations a little bit later. But first, let's look at another easy win: choosing the right data types.

Polars inherits the full set of Apache Arrow data types. We get the usual suspects: integers, unsigned integers, floats, strings. But Arrow also has some nested types, and Polars supports them too: a variable-length collection called List, a fixed-length one called Array, and a dictionary-like structure called, well, Struct. What I want to highlight, though, are the three string-like types: plain String, Categorical, and Enum. These can make a big difference in terms of memory and speed.

Take the species in our bird example. We only have a handful of unique species but millions of detections. If we store them all as plain strings, we'll repeat the same value over and over again. We don't want that. The Categorical type replaces those strings with integer codes and keeps a dictionary on the side. That means less memory and faster calculations.

And then there is Enum, which is stricter. It defines the full set of possible values up front, which is perfect when the set is fixed, like in our bird feeder example, where we know exactly which species are going to appear. Categoricals are looser: they can handle new, unseen birds, which is safer for production, but comes with a little more overhead.

Data types might look small, but they matter: about a 30% speedup here, and less memory too.
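Concretely, the cast might look like this, with the species names from the running example:

```python
import polars as pl

SPECIES = ["sparrow", "greenfinch", "crochets"]  # the fixed set we expect

# Enum: strict, the full set is declared up front; unseen values raise
df = df.with_columns(pl.col("species").cast(pl.Enum(SPECIES)))

# Categorical: looser, builds its dictionary as new values appear
# df = df.with_columns(pl.col("species").cast(pl.Categorical))
```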
And the next lever we can pull is a little different from what we've been doing before. Up to now, we've been focused on what happens inside memory: all those algorithms, vectorization, data types. But performance also depends on what happens on disk. So let's talk file formats. Under the hood, they differ in how they store data.

Some formats store data row by row. That's our common CSV, or Avro, a nice binary row format. It's simple and fine for logs or interchange, but for analytics it's super slow: you always read the whole row, even if you only need one column.

Other formats flip this around and store data column by column. That would be Feather or ORC. This is great for analytics: you can read only the columns you care about. Feather is blazingly fast for moving data around, but it doesn't support the pushdowns we need. ORC is very powerful, but it really pays off at a huge scale.

And then there is something in the middle, a hybrid storage layout, which is Parquet. It chops the data into row groups, and inside each group it's columnar. That means you can skip big chunks you don't need while still getting columnar access, and this balance is one of the reasons Parquet is so widely used.

All right. With this layout discussion in mind, here is a small zoo of common formats, and as we did with the data frames, let us narrow them down with three quick questions. First: writing or reading? In our case, only reading, no writing, so we can drop the row-based ones. Second: are we just moving data in and out, or are we doing analytics? Feather is great for interchange but doesn't support pushdowns, so it's out. And finally: how big is our data? ORC shines at massive scale, but here it would just be overhead for us. So that leaves Parquet, hybrid storage with pushdowns, the most practical choice for us.

On top of its hybrid layout, Parquet also applies some fancy encodings inside each column. There's dictionary encoding, like Polars' Categorical type, which replaces strings with integers. There's run-length encoding: if a value repeats many times, you can store just the value plus the count. And there's delta encoding: for numbers that are close together, you can store the minimum and then just the differences, and the values will be much smaller. Finally, Parquet supports different compression codecs, so you can trade off speed against file size depending on your workload.

So here's the impact of changing the input format. Switching to Parquet saves us some time, not huge, because our pipeline is dominated by the in-memory work. But in other cases, maybe in your pipelines, it can be really dramatic.
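In Polars the switch is small. A sketch, with the codec choice as the speed-versus-size knob:

```python
import polars as pl

# write once, picking a compression codec (speed vs. file size trade-off)
df.write_parquet("detections.parquet", compression="zstd")

# lazy scan: the pushdowns now reach into Parquet's row groups
out = (
    pl.scan_parquet("detections.parquet")
      .select("time", "species")               # projection pushdown
      .filter(pl.col("species") == "sparrow")  # predicate pushdown
      .collect()
)
```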
Next up is streaming. After the Parquet magic, our happy bird here expects peak memory to go down, because streaming is specifically designed for data sets that are larger than memory. Instead of loading everything all at once, it processes the data in chunks: it loads a chunk, processes it, and never, ever materializes all the data in memory. So the idea with streaming is that peak memory should go down. That's what we want. But in our case, the peak memory stayed the same, and that's why our ostrich here looks so unimpressed. As do I.

To get to the bottom of this, let's take a look at the streaming query plan. Most of the operations, those white cells over there, can be streamed; they're perfectly streamable. But some of them can't. That's the red cell, and that's what breaks our chain: the cumulative sum means the peak memory will stay high.

Still, not all is lost. Even though the whole query couldn't be streamed, parts of it could, and that gave us a really nice speedup. Here's the effect: from 1 minute down to 36 seconds. That's a really nice improvement for practically no effort. All we needed to do was write engine="streaming", and we got about a 40% speedup. So in our case, streaming did not help with memory, but it did help with speed.
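Switching engines really is a one-argument change on the same lazy query (the q from the earlier sketch):

```python
# same lazy plan as before; only the engine changes
result = q.collect(engine="streaming")  # chunked execution where the plan allows
```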
The last optimization we'll be looking at is hardware acceleration: GPU support. Starting from last year, Polars can offload queries to NVIDIA GPUs through RAPIDS cuDF. Using it is just as simple as with streaming: you can just write collect(engine="gpu"), and that's it. But in our case, if there were an ostrich on this slide, it would be just as deeply disappointed as with streaming. Again, similarly to streaming, the query plan we have written cannot be executed on the GPU. And unlike streaming, which at least helped us with speed, if not with memory, the GPU didn't help us at all. In fact, the numbers even got worse: it disabled the previous streaming optimization, because the two cannot be applied at once, and it added some overhead of its own.

So the trophy marks the real winner: everything up to streaming. From 9 minutes with the sweep-line version (and that's already after fixing the algorithm) down to only 36 seconds in the final pipeline. The fully naive version we didn't even bother to measure here, because it would have taken ages.

And this is the XS data set that I promised to show you again. We've already seen this big drop in the beginning, from naive to the algorithmic fix. And do you see where the minimum is? It's not where it was on the L side, right? At this scale, the optimizations really help only up to Polars, and once we apply all those fancy tricks that are really meant for larger scale, we just add more and more overhead. So it's a good reminder: just because you've got a shiny new optimization hammer from your PyCon doesn't mean that every pipeline gets to be a nail.

So, what about memory? We started with 14 GB, and by the latest stages we brought that down to only 9. You'll notice the spike at vectorization; that's expected. Vectorization created big intermediate arrays, so memory usage went up while the runtime went down.

And that's our checklist. Some fixes were tiny, others more dramatic, but stacked together they made the pipeline lean and quick. And that's the whole idea of the birdwatcher's guide: a field guide that you can flick through when you're out in the wild with your own pipelines.
So, here's my last slide, and with it comes a little bonus: some footage from this summer of two birds sharing food. It was just too cute not to show you. Thank you so much; I'm so happy I got to share this with you. The links are here, you can scan the QR code. This is the birdies repo, and my socials, if you would like to talk about birds, planes, data, Python, anything. And if the plane part was your favorite, Assaia has some open roles too, so feel free to check it out. And once again, thank you so much.

[Music]

Thank you so much, Jenna. What a fascinating deep dive into optimization. This is a small token of PyCon's appreciation. Thank you. Stay tuned for the next session; we'll be back very, very soon. So, thanks, and another big round of applause for Jenna.