[Music]

Hi folks, apologies for the delay. It's my pleasure to present Jenna and her talk on the birdwatcher's guide to optimized data pipelines. Take it away, Jenna.

Thank you.

Oh, I thought presenting would be the scary part. It turned out to be getting the slides on. Hey everyone, thank you so much for coming. My name is Jenna, and today I would like to share with you a little story about performance optimization. For the next 30 minutes, it's going to be birds, planes, and data pipelines.

I am a data scientist, and I work in aviation at an AI company called Assaia, where we build machine learning and computer vision systems for airlines and airports.

Recently I had to optimize one of our feature engineering pipelines. That inspired this talk, but it also sent me down a rabbit hole of optimization techniques. Don't worry, I'm not here to drag you down that rabbit hole with me. Instead, I would like to share with you a few things that I learned along the way. This is a checklist I put together for myself, and I find it quite handy when your pipeline is running slow or eating too much memory. All right, let's get started.

On this slide you can see a dashboard from the real world. The white bars show when ground operations are happening: fueling, cleaning, catering, passengers in, passengers out. My task was to turn these detections into features for a machine learning model that will later predict delays: will the plane leave on time, or is it going to be late, and for how long? But in the spirit of spring here in Australia, I would like to translate this task into a bird story. So instead of aircraft operations, imagine my backyard feeder.
The task is to count how many birds of each species are present at every 5-minute interval. The goal is to answer: do I have enough food for the day, or do I need to run to the supermarket real quick and grab some worms and seeds for my birdies?

And this is the outline: a bird's-eye view of all the optimizations we will cover, from algorithmic complexity to vectorization, data frame choice, and all the way down to streaming and hardware acceleration.

The bird version of the task is the following. We've got three species: sparrow, greenfinch, and crochets. Those are the local birdies from the Vietnamese highlands, where my home currently is. Each bar is when a bird is at the feeder, and our job is, every 5 minutes, to count how many birds of each species are present.

Here's the same thing as a pipeline. The input is detections with timestamps, stored on disk, plus a list of the species we care about, because we'll be filtering some out. The output is a table of counts by species.

To compare optimizations, I'll be using benchmarks on synthetic data that fits comfortably on one local machine: 25 GB of RAM, five CPU cores, some GPU, and, for this task, let's say unlimited disk. Throughout the talk I'll show two extremes: a tiny XS case, and a bigger L case with 300 species and 5 million detections. Each step is warmed up and then run five times, and we measure runtime and peak memory.

All right, let's get started with the optimizations. But first we need to look at the naive algorithm. It's really naive: for each time step, we loop over every detection, check if the bird is there, and add one, with an early return, the break, in the inner loop. So the time complexity of this thing is O(T × D), time steps multiplied by detections.
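To make that concrete, here is a minimal sketch of the naive counting, assuming each detection is a hypothetical (species, start, end) tuple and `timestamps` is the 5-minute grid; the actual code from the talk's slides isn't reproduced here.

```python
from collections import defaultdict

def count_naive(timestamps, detections):
    """O(T * D): for every time step, scan every detection."""
    counts = {t: defaultdict(int) for t in timestamps}
    for t in timestamps:                         # T time steps
        for species, start, end in detections:   # D detections
            if start <= t < end:                 # bird present at this moment
                counts[t][species] += 1
    return counts
```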
And this is our baseline. It runs, but it's painfully slow and inefficient. So optimization number one: let us fix the algorithm. In our case we can do better with a sweep-line approach. Let's simplify to just sparrows for now and add the other birdies later. The trick here is to turn each detection into two events: plus one when the bird arrives, and minus one when it leaves the feeder. That gives us a table of times and deltas.

Now we can run a cumulative sum, like a running total, and at every moment it will tell us how many sparrows are present. The sweep step itself is linear, but the sorting dominates the complexity, so the whole algorithm runs in O(T + D log D): the time steps are sorted beforehand, and sorting the D events costs D log D. Finally, we merge these events with the timestamps we cared about in the beginning, and that gives us the counts we wanted.

So in our case we sped things up with a simplified sweep-line algorithm. Of course, this is just my example. The point is to check the algorithm first, and only once we've squeezed that does it make sense to move on to the further, fancier tricks.

All right, here is how it shows up in numbers. On the XS data set, naive takes about 20 seconds, and sweep line drops that to under a second, which is a huge win. Starting from the next optimization, I'll show results only on the larger data set, so you can see how they work at a larger scale, and I'll come back to this XS version at the end, so we can see how those optimizations play out at the tiny scale.

With the algorithm sorted out, the next step is vectorization.
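For reference, a minimal pandas sketch of that sweep line for a single species, assuming hypothetical frames `det` (with "start" and "end" timestamp columns) and `grid` (the sorted 5-minute timestamps):

```python
import pandas as pd

# each detection becomes two events: +1 on arrival, -1 on departure
events = pd.concat([
    pd.DataFrame({"time": det["start"], "delta": 1}),
    pd.DataFrame({"time": det["end"], "delta": -1}),
]).sort_values("time")                        # sorting dominates: O(D log D)

events["present"] = events["delta"].cumsum()  # linear sweep: running total

# attach the running total to the (already sorted) 5-minute grid
counts = pd.merge_asof(grid, events[["time", "present"]], on="time")
```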
A loop is imperative: you write a for loop and tell Python exactly how to walk each row. The problem is that in Python, every iteration has to go through the interpreter, and that means function calls, type checks, and object overhead. The vectorized version is declarative: you tell pandas which columns you want, and it applies the operation to the whole column at once. That lets pandas use its C engine under the hood, which is much faster than our slow Python loop.

And this is how it looks on the L data set. The loopy version took more than 9 minutes, and the vectorized version cut that down to about four. That's more than twice as fast, and we achieved it just by changing how we express the code. So vectorization definitely helped, but the engine we're using matters too, immensely.
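The contrast looks roughly like this: a small sketch using the hypothetical events table from before, with an assumed "kind" column marking arrivals.

```python
import numpy as np

# imperative: one interpreter round-trip per row
deltas = []
for _, row in events.iterrows():
    deltas.append(1 if row["kind"] == "arrive" else -1)
events["delta"] = deltas

# declarative: one whole-column operation, executed in compiled code
events["delta"] = np.where(events["kind"] == "arrive", 1, -1)
```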
And that brings us to the next point, data frame choice. My favorite one, probably. Unlike some ten years ago, when there was mostly pandas, today we have a whole zoo of data frame libraries we can choose from, and there is no single best one for every task. It depends on our pipeline, on the task we're aiming to solve. So let us ask ourselves a few guiding questions and eliminate the choices one by one.

First: does our data fit on one machine, or do we need a distributed cluster? If we needed a cluster, we would probably reach for Spark or Dask. Dask is basically pandas that scales: it splits a data frame into many chunks and runs them in parallel, either on your laptop or on a cluster. Spark and Dask both handle distributed workloads really well, but since our bird data fits comfortably in memory on one machine, we don't need either of them here.

All right, question number two: am I prototyping fast, or building something that needs to be durable and scalable? If we were just prototyping, pandas would be a wonderful choice. It is familiar, we all probably know the syntax, and it has a huge ecosystem. But at larger scale it would most definitely slow things down for us. So here we'll cross out pandas for our job, since speed is one of our major concerns and the data is quite large.

And the last question: do we need SQL, or like SQL, or would we rather prefer a data frame API? If you prefer SQL, DuckDB is a fantastic choice: it's columnar and super fast. But since we started in pandas, I would like to stay in the data frame world for this talk. So, after crossing out all the others, we landed on Polars.

And just a quick note before we get deeper into Polars: there's a brand-new library called Narwhals, released just last year. It gives us one syntax across all those different libraries we've been talking about, and it lets you easily experiment and even mix them seamlessly in one code base. It's still very early stages, and I haven't used it in production myself, but it's worth keeping an eye on.
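As an illustration of that idea, here is a small sketch of a backend-agnostic function, assuming Narwhals' Polars-style expression API (from_native, group_by, nw.len); treat the details as approximate rather than authoritative.

```python
import narwhals as nw

def detections_per_species(native_df):
    # the same code accepts a pandas, Polars, or other supported frame
    df = nw.from_native(native_df)
    out = df.group_by("species").agg(nw.len().alias("n_detections"))
    return out.to_native()  # result comes back in the caller's own library
```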
So, Polars. This slide shows you two things. First, Polars is genuinely fast. I think the word they use is "blazingly", so genuinely, blazingly fast. Benchmarks like this one, on joins and group-bys, show it running in seconds where pandas takes minutes. And second, it's not just theory: the community is adopting it quickly. GitHub stars for Polars and DuckDB have been climbing steeply since 2022, while pandas has been growing steadily, but only steadily, for over a decade. So the excitement around Polars comes from both sides: its strong performance and a real momentum in the community.

So what actually makes Polars so fast? There are two pillars: the Rust back end and the Apache Arrow memory format. Rust gives us multi-threading and SIMD (single instruction, multiple data), which means safe parallel execution across cores. Arrow gives us a columnar layout, which is compact and cache-friendly, and it tracks nulls with a bitmap. That's a nice thing for people coming from a pandas background: pandas with the NumPy nullable backend has a similar mechanism, but it uses one byte per value instead of one bit. Same idea, less efficient. And because Arrow is a standard, Polars can interoperate seamlessly with other libraries. DuckDB uses Arrow's in-memory format under the hood, as do PyArrow and Spark, all without any copies. That interoperability is a huge deal.

So Polars is fast, but what I also love about it is how it looks from the outside: its syntax. Polars works with expressions. You can think of an expression as a recipe: pl.col, cumulative sum, over, alias, all of these are expressions. Each one is a recipe, but they don't cook on their own. What they need is a kitchen, a context where they will get executed. The context here, the one encapsulating all the expressions, is select. There are other contexts too: group_by, which should be familiar from the other data frames; with_columns, which is like assign in pandas; and filter. And the real power is that you can chain these expressions, and that chaining will unlock the next optimization for us: lazy execution.
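In code, the recipe-and-kitchen split looks something like this: a sketch of our running total per species, with the expressions (col, cum_sum, over, alias) evaluated inside a select context. The frame and column names are the hypothetical ones from the bird example.

```python
import polars as pl

running = events_df.sort("time").select(
    pl.col("time"),
    pl.col("species"),
    # expressions are the recipes; select is the kitchen that executes them
    pl.col("delta").cum_sum().over("species").alias("present"),
)
```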
But first, let's look at the numbers on size L. Switching from pandas to Polars makes the pipeline more than twice as fast, and the key thing is we didn't change the logic at all, just the syntax. We could have done it with an LLM in a few minutes. You might remember the benchmark numbers I showed you earlier for Polars, where the speedup was really dramatic, sometimes an order of magnitude. In a full pipeline like this one, the gain is more modest, since the work is more diverse and not just the highly optimized joins and group-bys. Still, it gave us another great speed improvement, and from this step on, everything will run in Polars.

And here is where Polars brings in something not every library has: lazy execution. The trick is that Polars doesn't have to run each step immediately. Instead, it can hold off, build a full plan, and then execute it in the most efficient way it can. That's what lazy execution is about. It's like making a shopping list before you go to the supermarket. If you just buy items one by one, in the order you remember them, you'll be running back and forth between the aisles all the time. You'll get your steps in, that's a good thing, but you'll waste a lot of time. With a full list, you can make one efficient trip through the aisles and get your groceries done.

And here is how this idea looks in code. In the snippet, we're selecting the first three greenfinches to leave the feeder. Notice we use scan_csv instead of read_csv there; that's what creates a lazy data frame for us. Nothing is running yet. Polars just collects the steps, and only when we call .collect does it execute them, with all the optimizations applied.
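A minimal sketch of that kind of lazy query, assuming a hypothetical detections.csv with species, start, and end columns:

```python
import polars as pl

q = (
    pl.scan_csv("detections.csv")           # lazy scan: nothing is read yet
    .filter(pl.col("species") == "greenfinch")
    .sort("end")
    .select("species", "start", "end")
    .head(3)                                # first three to leave the feeder
)
print(q.explain())   # inspect the optimized query plan
df = q.collect()     # only now does Polars actually execute
```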
If we peek at the query plan, you can see how it works. You read it from bottom to top. On the left is the unoptimized plan: exactly the steps I wrote, in the order I wrote them. On the right is the optimized plan: the filters and selects have been pushed all the way down to the scan, so we never even load unnecessary data into memory. And you see those little sigma and pi symbols there on the right? They stand for projection and predicate pushdowns. Projection pushdown means we only read the columns we need, in our case just three out of five columns. It's like a SQL SELECT, or a Polars select. Predicate pushdown means we only read the rows we need, the greenfinch rows in our case. It's like a SQL WHERE, or a Polars filter. Keep these in mind; they'll come in handy when we talk about file formats.

Lazy execution isn't just for local files; it also works with data lakes. I like to think of a data lake as a giant library full of super cool books, but all of them are just stored in boxes on the floor. You can find something if you really need to, but it's going to be super painful. Table formats like Apache Iceberg here, the one our bird is sitting on, add a catalog to this library. Now you know which shelf, which book, and which page to look at. And what Polars does in this picture is use that catalog like a librarian: it finds your book and fetches the exact pages you ask for, like, for example, the notes on the Atlantic Puffin in our little snippet.
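A sketch of what that might look like against an Iceberg table. The table location and column names here are hypothetical, and scan_iceberg expects an Iceberg table source (for example via PyIceberg):

```python
import polars as pl

lf = pl.scan_iceberg("s3://bird-lake/detections")      # catalog-aware lazy scan
puffin_notes = (
    lf.filter(pl.col("species") == "Atlantic Puffin")  # predicate pushdown
      .select("time", "notes")                         # projection pushdown
      .collect()                                       # fetch only those "pages"
)
```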
Lazy mode doesn't change much here. We went from 104 seconds to 96, but we changed how Polars runs the query, and that will unlock a whole family of optimizations a little bit later. But first, let's look at another easy win: choosing the right data types.

Polars inherits the full set of Apache Arrow data types. We get the usual suspects: integers, unsigned integers, floats, strings. But Arrow also has some nested types, and Polars supports them too: a variable-length collection called List, a fixed-length one called Array, and a dictionary-like structure called, well, Struct. What I want to highlight, though, are the three string-like types: plain String, Categorical, and Enum. These can make a big difference in terms of memory and speed.

Take the species in our bird example. We only have a handful of unique species but millions of detections. If we store them all as plain strings, we'll repeat the same value over and over again. We don't want that. The Categorical type replaces those strings with integer codes and keeps a dictionary on the side. That means less memory and faster calculations.

And then there is Enum, which is stricter. It defines the full set of possible values up front, which is perfect when the set is fixed, like in our bird feeder example, where we know exactly which species are going to appear. Categoricals are looser: they can handle new, unseen birds, which is safer for production, but comes with a little more overhead.

Data types might look small, but they matter: about a 30% speedup here, and less memory too.
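Concretely, the cast might look like this, with the species names from the running example:

```python
import polars as pl

SPECIES = ["sparrow", "greenfinch", "crochets"]  # the fixed set we expect

# Enum: strict, the full set is declared up front; unseen values raise
df = df.with_columns(pl.col("species").cast(pl.Enum(SPECIES)))

# Categorical: looser, builds its dictionary as new values appear
# df = df.with_columns(pl.col("species").cast(pl.Categorical))
```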
And the next lever we can pull is a little different from what we've been doing before. Up to now, we've been focused on what happens inside memory: all those algorithms, vectorization, data types. But performance also depends on what happens on disk. So let's talk file formats. Under the hood, they differ in how they store data.

Some formats store data row by row. That's our common CSV, or Avro, a nice binary row format. It's simple and fine for logs or interchange, but for analytics it's super slow: you always read the whole row, even if you only need one column.

Other formats flip this around and store data column by column. That would be Feather or ORC. This is great for analytics: you can read only the columns you care about. Feather is blazingly fast for moving data around, but it doesn't support the pushdowns we need. ORC is very powerful, but it really pays off at a huge scale.

And then there is something in the middle, a hybrid storage layout, which is Parquet. It chops the data into row groups, and inside each group it's columnar. That means you can skip big chunks you don't need while still getting columnar access, and this balance is one of the reasons Parquet is so widely used.

All right. With this layout discussion in mind, here is a small zoo of common formats, and as we did with the data frames, let us narrow them down with three quick questions. First: writing or reading? In our case, only reading, no writing, so we can drop the row-based ones. Second: are we just moving data in and out, or are we doing analytics? Feather is great for interchange but doesn't support pushdowns, so it's out. And finally: how big is our data? ORC shines at massive scale, but here it would just be overhead for us. So that leaves Parquet, hybrid storage with pushdowns, the most practical choice for us.

On top of its hybrid layout, Parquet also applies some fancy encodings inside each column. There's dictionary encoding, like Polars' Categorical type, which replaces strings with integers. There's run-length encoding: if a value repeats many times, you can store just the value plus the count. And there's delta encoding: for numbers that are close together, you can store the minimum and then just the differences, and the values will be much smaller. Finally, Parquet supports different compression codecs, so you can trade off speed against file size depending on your workload.

So here's the impact of changing the input format. Switching to Parquet saves us some time, not huge, because our pipeline is dominated by the in-memory work. But in other cases, maybe in your pipelines, it can be really dramatic.
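In Polars the switch is small. A sketch, with the codec choice as the speed-versus-size knob:

```python
import polars as pl

# write once, picking a compression codec (speed vs. file size trade-off)
df.write_parquet("detections.parquet", compression="zstd")

# lazy scan: the pushdowns now reach into Parquet's row groups
out = (
    pl.scan_parquet("detections.parquet")
      .select("time", "species")               # projection pushdown
      .filter(pl.col("species") == "sparrow")  # predicate pushdown
      .collect()
)
```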
Next up is streaming. After the Parquet magic, our happy bird here expects peak memory to go down, because streaming is specifically designed for data sets that are larger than memory. Instead of loading everything all at once, it processes the data in chunks: it loads a chunk, processes it, and never, ever materializes all the data in memory. So the idea with streaming is that peak memory should go down. That's what we want. But in our case, the peak memory stayed the same, and that's why our ostrich here looks so unimpressed. As do I.

To get to the bottom of this, let's take a look at the streaming query plan. Most of the operations, those white cells over there, can be streamed; they're perfectly streamable. But some of them can't. That's the red cell, and that's what breaks our chain: the cumulative sum means the peak memory will stay high.

Still, not all is lost. Even though the whole query couldn't be streamed, parts of it could, and that gave us a really nice speedup. Here's the effect: from 1 minute down to 36 seconds. That's a really nice improvement for practically no effort. All we needed to do was write engine="streaming", and we got about a 40% speedup. So in our case, streaming did not help with memory, but it did help with speed.
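Switching engines really is a one-argument change on the same lazy query (the q from the earlier sketch):

```python
# same lazy plan as before; only the engine changes
result = q.collect(engine="streaming")  # chunked execution where the plan allows
```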
The last optimization we'll be looking at is hardware acceleration: GPU support. Starting from last year, Polars can offload queries to NVIDIA GPUs through RAPIDS cuDF. Using it is just as simple as with streaming: you can just write collect(engine="gpu"), and that's it. But in our case, if there were an ostrich on this slide, it would be just as deeply disappointed as with streaming. Again, similarly to streaming, the query plan we have written cannot be executed on the GPU. And unlike streaming, which at least helped us with speed, if not with memory, the GPU didn't help us at all. In fact, the numbers even got worse: it disabled the previous streaming optimization, because the two cannot be applied at once, and it added some overhead of its own.

So the trophy marks the real winner: everything up to streaming. From 9 minutes with the sweep-line version (and that's already after fixing the algorithm) down to only 36 seconds in the final pipeline. The fully naive version we didn't even bother to measure here, because it would have taken ages.

And this is the XS data set that I promised to show you again. We've already seen this big drop in the beginning, from naive to the algorithmic fix. And do you see where the minimum is? It's not where it was on the L side, right? At this scale, the optimizations really help only up to Polars, and once we apply all those fancy tricks that are really meant for larger scale, we just add more and more overhead. So it's a good reminder: just because you've got a shiny new optimization hammer from your PyCon doesn't mean that every pipeline gets to be a nail.

So, what about memory? We started with 14 GB, and by the latest stages we brought that down to only 9. You'll notice the spike at vectorization; that's expected. Vectorization created big intermediate arrays, so memory usage went up while the runtime went down.

And that's our checklist. Some fixes were tiny, others more dramatic, but stacked together they made the pipeline lean and quick. And that's the whole idea of the birdwatcher's guide: a field guide that you can flick through when you're out in the wild with your own pipelines.
So, here's my last slide, and with it comes a little bonus: some footage from this summer of two birds sharing food. It was just too cute not to show you. Thank you so much; I'm so happy I got to share this with you. The links are here, you can scan the QR code. This is the birdies repo, and my socials, if you would like to talk about birds, planes, data, Python, anything. And if the plane part was your favorite, Assaia has some open roles too, so feel free to check it out. And once again, thank you so much.

[Music]

Thank you so much, Jenna. What a fascinating deep dive into optimization. This is a small token of PyCon's appreciation. Thank you. Stay tuned for the next session; we'll be back very, very soon. So, thanks, and another big round of applause for Jenna.