I'd like to introduce Ed Schofield, who today will be talking about better data frames. A big hand for Ed.

Thanks. Hi, thanks for coming. So, the pictures here... oh, there's nothing showing yet. Can we have... there we are, beautiful. All right: we have a panda, we have a polar bear, and we have an ibis. These are three of the data frame libraries I'll be talking about in this session. Thanks for coming.

Now, pandas is sort of the heavyweight data frame library in Python. It's been around for about 10 to 15 years, and at the moment there's a kind of proliferation of upstart packages which have been exploring different ideas. I'd like to help you know what these are and make sense of them, so you know which package to choose for your next project, or, if you're a library maintainer, how you can support different kinds of data frame as inputs to your functions, for example.

So here's the outline of this talk. I'd like to go briefly over the history of data frames, to help you
understand some design limitations that pandas has, and then describe some of the alternatives and efforts at interoperability; then, if there's time, a few minutes with some more philosophical thoughts on the nature of the creative destruction process in open source, and some recommendations about which packages and which techniques to use for different tasks. And if we have time at the end, I'd welcome questions as well.

So, my name is Ed Schofield. I have a background in mathematics and computer science, and I did a PhD in language models before most people knew what they were. In fact, while procrastinating writing up my PhD, I got involved in development with SciPy, and at the time the tools were actually quite primitive and broken, so NumPy and SciPy needed to be developed to be useful for machine learning research. Matplotlib didn't exist; that eventually came out
around 2005 or so.

I founded Python Charmers after that. It's a training business, we're based in Melbourne and Singapore, and we've given about 700 training courses over that time, about 500 of them in Australia; most of them are live, with hands-on coding. I was involved in the python-future project when I was concerned that the Python 3 migration was happening too slowly. It looked for a time like Python 3 might be dead on arrival, and that it might become a Perl 6 kind of situation. Thankfully the community rallied in a big push to make the migration possible, but it looked quite close for a while, so I got involved in the python-future project to assist with that effort. And for fun, at the moment I'm doing a set of courses through Stanford in AI, particularly in reinforcement learning, which I'm finding really fun. It's something I hadn't really explored, but it's a brilliant field. Anyway, that's what I'm doing at the moment for fun.

So, first of all, what is a data frame? Most
of you probably use them. Can I have a show of hands, please: who uses data frames? Most of you, okay. So: it's not a vector, it's not a matrix, it's not a multi-dimensional array. It's a bit like a matrix in a sense, but you've got heterogeneously typed columns.

So how is that different from just a table? I'll come back to that in a moment, but here's another view of a data frame: you've got a column index, or column names, and you've got multiple series of data. That's the terminology used by both pandas and Polars. Notice there's no index in general; that's a pandas concept, and most other data frame libraries don't have the concept of a row index.

And how are these different from tables in SQL databases, or tables in spreadsheets? Well, terminology-wise, tables in relational databases have row relationships, and traditionally relational databases store data row-wise, so analytic operations are not
optimized as they are in the new breed of columnar OLAP databases like DuckDB, and in data frame libraries. Data frame libraries store data column-wise, and this allows fast vectorized operations. Traditionally there's also been this idea that a data frame is something that lives in memory. That's not always true now, and the lines are blurring, but traditionally you would do an import of data, sucking in lots of data until you've got it all in RAM, which of course naturally puts a limit on the size of the data sets you can work with. But that's changing these days.

So data frames really support the kind of janitorial tasks which are necessary before you put data into a machine learning model, for example, or do a regression or something like that. Data frames have features for cleaning up data, for aggregating data, for working with time series, and things like that. Now, here's a post from Big Data Borat, from 11 years ago: "In data
science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data." So he understood that very early.

All right, so now here's a brief set of words about the design of pandas and where it's come from. In the beginning we had this, but then a few billion years passed and we had the 1970s. In the '70s, the S language came out of Bell Labs, SQL was developed and started to be standardized, and the first spreadsheet package, VisiCalc, became widespread. In the '80s we had data.
frame as an object in the S language; Python came out in 1991, and R came out later that decade. And in the '90s there were these packages, Numeric and Numarray, which made use of Python's great C API to extend Python to make it useful for scientific computing in particular, and engineering computing, which were some of its early applications. The MySQL database also came out in the 1990s, just for historical context, as part of the LAMP stack; Postgres was in 2000. It wasn't until 2005 that NumPy came out in its first release in its current form, which is a kind of amalgam of those two earlier array packages, Numeric and Numarray; SciPy came out at around the same time, and Matplotlib a year or so later. When I started working on SciPy it contained three separate plotting libraries, and all of them were broken, so I was very happy to have Matplotlib to create some plots for my thesis.

Pandas development started around 2008, I think, which is pretty early. So this is an interesting historical note: it
predates the innovations in R on other kinds of data frames; it predates Spark, and big data tools like BigQuery and so on. The first of the tidyverse packages in R came out in 2011, dplyr and tidyr came later, and the tidyverse was formalized more in 2016. That's also when development started on the Apache Arrow project, which was really an initiative of Wes McKinney, who is the original author of pandas, and Hadley Wickham from the R community.

So, Wes McKinney wrote a blog post in about 2013 or so, I think, perhaps a little bit later, 2015, called "10 Things I Hate About pandas", and he gave a talk at the New York City Python meetup, or something like that, about it. So he was aware of these design limitations from the beginning, but by then pandas' adoption was really growing, and I think we've got pandas to thank for the fact that Python is where it is today as a tool for manipulating data. R was
very much loved by users, and still is; there are users out there who say, you know, you'll take R out of my cold dead hands, you'll have to claw it out. And the tidyverse is a wonderful ecosystem of packages. But it's quite possible that without pandas, Python wouldn't have had a big role to play in many different subfields of data analytics and data science, or at least not as big a role as it has today.

So, pandas builds upon NumPy as a foundation, and that also gives it a couple of limitations. Sorry, this is an old version of this slide, but: NumPy provides a multi-dimensional array object, but no native support for missing data across a range of data types. It does support a masked array, but that feature never really took off. It also provides record arrays and structured arrays, but those aren't really a substitute for data frames as in pandas.
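A quick illustration of that limitation, as a minimal sketch using only NumPy and pandas (the column names here are invented for the example):

```python
import numpy as np
import pandas as pd

# NumPy's structured arrays do give you named, heterogeneously typed columns...
arr = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i8"), ("score", "f8")])
print(arr["score"].mean())   # column access by name works: 3.0

# ...but NumPy has no native missing-data marker for integers, and classic
# pandas inherits that: one missing value silently upcasts int64 to float64.
s = pd.Series([1, 2, None])
print(s.dtype)               # float64, not int64
```

This silent float upcasting is exactly the kind of wart that the Arrow-backed dtypes in newer pandas releases were designed to remove.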
So pandas built upon NumPy and provided that kind of tabular analytics functionality. Here's a data frame in pandas: it's like the diagram you saw before, but there's this row index down the side. Now, there are people who use the row index to its advantage, and you can get better performance by making use of it, but a lot of people just don't know what to do with it, find it an annoyance, and try to drop it or reset it or something like that. It's an interesting design decision, because it does complicate learning pandas, and I think a lot of the new breed of data frame libraries don't use a row index. Many SQL databases also traditionally haven't returned data in a consistent order, unless you explicitly add an ORDER BY clause.

All right, so I'd like to say three cheers for pandas: it's done us well, it's used by millions and millions of people, and what it does, it does well. It does in-memory analytics, it's expressive, it's
got an enormous API, which means you can do a lot with it, but that also makes it harder to maintain; its focus isn't there, and it's got a lot of open bugs, partly because there are few maintainers, putting in a herculean effort to support millions of users who are all complaining that such-and-such doesn't work. So it's played a very important role in the Python data ecosystem.

Now, pandas 2 was released last year, and there are a couple of architectural improvements. It supports optional backing of all of the columns in a data frame with the PyArrow package, following the Arrow standard, and this provides the ability to have missing data on all data types. It also provides better string handling than NumPy has traditionally provided; NumPy 2.0 does change that, but NumPy 2.0 still doesn't have a good answer for missing data. So there's this limitation.
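Here's roughly what that looks like in practice, as a minimal sketch: the nullable "Int64" dtype ships with pandas itself, while the Arrow-backed lines are left commented out because they additionally require the pyarrow package to be installed.

```python
import pandas as pd

# pandas' nullable extension dtype keeps integer data as integers and marks
# the hole with pd.NA, instead of upcasting the whole column to float64:
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)          # Int64
print(s.isna().sum())   # one missing value; the others remain integers

# With pyarrow installed, pandas 2.x can back columns with Arrow memory:
#   s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
#   df = pd.read_csv("data.csv", dtype_backend="pyarrow")
```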
There are some other limitations as well, about memory usage in particular, that limit the scale of data you can work with in pandas; there's the assumption that you pull all your data into memory, for example, though there are some new copy-on-write features which can assist a little bit. But the pandas 3.0 release is still in an early design phase, and the design is not decided yet. There's been some back and forth about whether PyArrow types should be supported by default: initially that was going to be the case, and then someone said, hey, wait, this brings in another 500 megabytes of dependencies, what's going on here? It's a very heavyweight dependency, it turns out. So it's unclear what the future of the project is. I'm thinking it will probably not go away anytime soon, but it's going to become continually less relevant in relation to the other data frame libraries. I think Marie Kondo says something like: if you
no longer need something, you bless it and then throw it in the bin. So this is perhaps what we should be doing here, and I don't think this would offend Wes McKinney, who is very aware of pandas' design limitations and has been actively contributing the last ten years or so of his career towards improving this situation.

So here are some alternatives. This is a chart of GitHub stars. Look at the blue line: that's Polars. The green line is DuckDB; the red line is pandas. So pandas is king of the castle right now, but its acceleration, its second moment if you like, seems to be smaller, and we've got various other little tools on there, but in particular the blue and green lines are interesting.

One thing that's not really listed here, which is super important, is Apache Arrow. This was, as I mentioned, the initiative between Wes McKinney and Hadley Wickham to define a cross-language standard for how data can be accessed in memory, which supports
nullability of all data types, and efficient zero-copy reads and writes. So this is critical, and in a way this is better than NumPy: NumPy is Python-specific, Arrow is cross-language, so it's really very powerful. DuckDB is based on Arrow, Polars is based on Arrow; a lot of the current breed of new data frame libraries are based on this standard. This is the invisible stitching behind the scenes that makes it all work. So if it looks like there's a complicated ecosystem of new ways of doing things with data frames, in a way most of them are linked together by this, so it's not as incoherent as it seems. Arrow solves these kinds of problems that pandas has from being tightly coupled to NumPy, with its limitations. Arrow also supports other languages: there's the nanoarrow package, for example, and a variety of other C libraries you can use for linking with various languages. And Arrow does say something about what you can do to process
larger-than-memory data sets. Pandas has never had a good answer to this story; for example, you can't really use disk-backed arrays, it won't support transparent mapping of memory-mapped files, is what I'm trying to say. Pandas also creates lots of intermediate objects in memory, which is quite wasteful: an expression like this one creates an intermediate boolean series and then filters on that, which creates an intermediate data frame. Arrow can help with that, and I mentioned memory mapping.

Okay, so Polars is a very interesting library. As I mentioned, it's based on Arrow. It does have a fundamental kind of limitation, which is that it's designed, at the moment anyway, for local, in-process data processing, so it's not designed to run across a cluster; it's not an alternative to Apache Spark or PySpark. But DuckDB is also targeting this niche; they're
very closely related projects. Polars is fast, neck and neck with DuckDB in terms of benchmarks. It's written in Rust, it has documentation and interfaces in Python and Rust, and the API is quite nice: it's not as clunky as the pandas API, though it's a bit more verbose, and people like it, just like people like the dplyr API from R.

With DuckDB you'll notice some similarities: it's also based on Arrow, and it's also for local, in-process data processing. It does, however, have good support for spilling data to disk, to some degree. You probably don't want to go above more than 10 times RAM, and certainly not more than 100 times RAM, but DuckDB and Polars will let you handle data sets potentially up to 100 gigabytes, or a terabyte if you've got lots and lots of RAM; you can't really push them beyond that very well. DuckDB has interfaces in most languages, which is nice. Some of you will have gone to the DuckDB talk; you can think of it as like SQLite but for analytical
queries. But the Python API is verbose at the moment, and it hasn't really had the same attention as the underlying DuckDB library, so I wouldn't recommend the Python API for DuckDB at this stage. There was an original plan to build out a data frame API, like pandas or Polars or something, with DuckDB backing, which is not a bad plan, but now others are doing it better, so they're focusing on where they contribute, which is the basic library.

So here's Ibis. It was started by Wes McKinney around the same time as the Apache Arrow project, and what Ibis lets you do is scale from local to cloud with the same API, which is a really interesting value proposition, I'd say. Originally it was implemented, apparently, as an interface for Impala, but now it supports about 20 different backends, including DuckDB and Polars. It did support pandas, but that's being removed now, because pandas is so different from all of the other engines: it was using something like 40 or 50% of the entire code base, and it was not
475 00:22:38,120 --> 00:22:44,559 worth it since it was slower anyway it's 476 00:22:41,000 --> 00:22:46,520 still fine though this will quite happily 477 00:22:44,559 --> 00:22:48,760 read data from pandas data frames in 478 00:22:46,520 --> 00:22:51,080 memory that's not a problem it just 479 00:22:48,760 --> 00:22:52,360 no longer pushes data out to pandas as 480 00:22:51,080 --> 00:22:54,200 an execution 481 00:22:52,360 --> 00:22:59,120 engine 482 00:22:54,200 --> 00:23:01,559 um so I think of this 483 00:22:59,120 --> 00:23:03,600 as being like SQLAlchemy for 484 00:23:01,559 --> 00:23:06,080 analytical queries I haven't seen that 485 00:23:03,600 --> 00:23:08,400 kind of epithet used but that would 486 00:23:06,080 --> 00:23:10,039 be a good tagline for the project 487 00:23:08,400 --> 00:23:12,120 SQLAlchemy is brilliant of course we don't 488 00:23:10,039 --> 00:23:15,760 need to write custom code for 489 00:23:12,120 --> 00:23:17,799 each database and we get portability and 490 00:23:15,760 --> 00:23:20,240 so on and that's what this project aims to 491 00:23:17,799 --> 00:23:22,720 do okay and it's 492 00:23:20,240 --> 00:23:25,360 actively developed and growing it doesn't 493 00:23:22,720 --> 00:23:28,080 have the same traction or kind of 494 00:23:25,360 --> 00:23:31,919 hoo-ha as Polars but it probably 495 00:23:28,080 --> 00:23:35,679 deserves more than it's getting so 496 00:23:31,919 --> 00:23:38,039 here's a quote from the 497 00:23:35,679 --> 00:23:39,679 blog post from the Ibis project if 498 00:23:38,039 --> 00:23:42,240 you're considering Polars for new code 499 00:23:39,679 --> 00:23:43,480 give Ibis a try with a DuckDB backend 500 00:23:42,240 --> 00:23:45,799 you'll get better performance than 501 00:23:43,480 --> 00:23:48,400 Polars on some workloads with a broader 502 00:23:45,799 --> 00:23:50,039 cross-backend API that helps you scale 503 00:23:48,400 --> 00:23:52,120 from
development to 504 00:23:50,039 --> 00:23:54,120 production okay so you can use the same 505 00:23:52,120 --> 00:23:56,360 API for processing data on your local 506 00:23:54,120 --> 00:23:59,840 machine as for processing data through 507 00:23:56,360 --> 00:24:02,760 BigQuery or ClickHouse or 508 00:23:59,840 --> 00:24:05,080 something like that so there's the 509 00:24:02,760 --> 00:24:07,200 current list of backends from the Ibis 510 00:24:05,080 --> 00:24:09,240 website that Ibis 511 00:24:07,200 --> 00:24:10,440 supports so it's really interesting 512 00:24:09,240 --> 00:24:12,760 you'll notice that there's one data 513 00:24:10,440 --> 00:24:15,360 frame library Polars and the others are 514 00:24:12,760 --> 00:24:15,360 all SQL 515 00:24:16,720 --> 00:24:24,320 engines so it does a good job of 516 00:24:20,640 --> 00:24:26,440 converting SQL these days it's making 517 00:24:24,320 --> 00:24:30,760 use of a package called SQLGlot to do 518 00:24:26,440 --> 00:24:34,480 that um which is something that's 519 00:24:30,760 --> 00:24:35,720 probably worth watching 520 00:24:34,480 --> 00:24:37,880 so you can just translate between 521 00:24:35,720 --> 00:24:40,679 SQL dialects there now there's this 522 00:24:37,880 --> 00:24:43,080 deferred execution model that Ibis 523 00:24:40,679 --> 00:24:45,880 offers you can connect to some 524 00:24:43,080 --> 00:24:48,159 remote database whatever the store is 525 00:24:45,880 --> 00:24:51,559 you perform some SQL query you get your 526 00:24:48,159 --> 00:24:54,200 results back but it feels like a 527 00:24:51,559 --> 00:24:55,799 local database and that's especially 528 00:24:54,200 --> 00:24:59,240 important when the connection between 529 00:24:55,799 --> 00:25:01,679 your local machine and the cloud is 530 00:24:59,240 --> 00:25:04,039 weak like some slow 531 00:25:01,679 --> 00:25:05,600 connection okay so you've got all this 532 00:25:04,039
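That deferred execution model can be sketched in a few lines of plain Python. This is not the real Ibis API, just a toy stdlib-only illustration of the idea: each method call records an operation on an expression tree, and nothing is sent to any database until you ask for the compiled SQL.

```python
# Toy illustration of Ibis-style deferred execution (NOT the real
# Ibis API). An expression records operations; nothing runs until
# .compile() turns the recorded tree into SQL text.

class Table:
    def __init__(self, name):
        self.name = name
        self._filters = []
        self._selection = ["*"]

    def filter(self, condition):
        # Record the predicate; no data is read yet.
        new = Table(self.name)
        new._filters = self._filters + [condition]
        new._selection = list(self._selection)
        return new

    def select(self, *columns):
        # Record the projection; still nothing executed.
        new = Table(self.name)
        new._filters = list(self._filters)
        new._selection = list(columns)
        return new

    def compile(self):
        # Only now is the recorded expression turned into SQL.
        sql = f"SELECT {', '.join(self._selection)} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql

events = Table("events")
expr = events.filter("ts >= '2024-01-01'").select("user_id", "ts")
print(expr.compile())
# SELECT user_id, ts FROM events WHERE ts >= '2024-01-01'
```

In real Ibis the compile step targets whichever backend dialect you connected to, which is why the same expression can run locally on DuckDB or remotely on a warehouse.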
--> 00:25:07,960 data up there and you've got a fast 533 00:25:05,600 --> 00:25:10,880 computer but this tiny little wire 534 00:25:07,960 --> 00:25:12,679 connecting them so I saw this was an 535 00:25:10,880 --> 00:25:16,159 important kind of news announcement 536 00:25:12,679 --> 00:25:17,919 recently so until we have a new 537 00:25:16,159 --> 00:25:20,039 carrier-pigeon-based network that's 538 00:25:17,919 --> 00:25:22,559 competitive it's a good idea to do the 539 00:25:20,039 --> 00:25:24,720 processing on the remote instance 540 00:25:22,559 --> 00:25:27,320 instead of pulling all the 541 00:25:24,720 --> 00:25:31,360 data over the wire just 542 00:25:27,320 --> 00:25:34,279 a couple of others this is a GPU library 543 00:25:31,360 --> 00:25:36,399 it's limited by VRAM it's NVIDIA cards 544 00:25:34,279 --> 00:25:38,000 only so it won't run on your Mac but it 545 00:25:36,399 --> 00:25:39,000 now has nice integrations with Polars 546 00:25:38,000 --> 00:25:40,600 and 547 00:25:39,000 --> 00:25:43,679 pandas 548 00:25:40,600 --> 00:25:46,399 um Modin and Dask DataFrames are both 549 00:25:43,679 --> 00:25:48,799 efforts to scale out pandas data 550 00:25:46,399 --> 00:25:51,840 frames to a cluster so they try to mimic 551 00:25:48,799 --> 00:25:54,520 the pandas API which is good for 552 00:25:51,840 --> 00:25:59,279 compatibility but it has its warts and 553 00:25:54,520 --> 00:26:01,440 so on as well Narwhals is worth 554 00:25:59,279 --> 00:26:04,240 using if you want 555 00:26:01,440 --> 00:26:05,960 cross-data-frame compatibility if 556 00:26:04,240 --> 00:26:08,559 you're writing a library that needs to 557 00:26:05,960 --> 00:26:11,399 accept different data frames okay you 558 00:26:08,559 --> 00:26:13,919 can use it for example like this so 559 00:26:11,399 --> 00:26:16,600 this would be a function that accepts 560 00:26:13,919 --> 00:26:18,600 a variety of different data frames 561 00:26:16,600 -->
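The kind of function being described here, one library function that accepts several data frame types, can be sketched with plain duck typing. This stdlib-only toy (DictFrame, ListFrame and column_totals are all hypothetical stand-ins, not Narwhals API) just shows the dispatch pattern that a layer like Narwhals automates across real backends.

```python
# Stdlib-only sketch of the pattern Narwhals automates: a library
# function that accepts "any" data frame by normalising it to one
# internal representation before operating on it.
# DictFrame / ListFrame are hypothetical stand-ins for pandas/Polars.

class DictFrame:          # column-oriented: {"a": [1, 2]}
    def __init__(self, cols):
        self.cols = cols

class ListFrame:          # row-oriented: [{"a": 1}, {"a": 2}]
    def __init__(self, rows):
        self.rows = rows

def _to_columns(df):
    # Normalise either flavour to a plain dict of columns.
    if isinstance(df, DictFrame):
        return df.cols
    if isinstance(df, ListFrame):
        keys = df.rows[0].keys() if df.rows else []
        return {k: [r[k] for r in df.rows] for k in keys}
    raise TypeError(f"unsupported frame type: {type(df).__name__}")

def column_totals(df):
    """Library function that works on either frame type."""
    return {k: sum(v) for k, v in _to_columns(df).items()}

print(column_totals(DictFrame({"a": [1, 2], "b": [10, 20]})))
# {'a': 3, 'b': 30}
print(column_totals(ListFrame([{"a": 1}, {"a": 2}])))
# {'a': 3}
```

A real cross-backend layer would dispatch to each backend's native operations rather than converting like this toy does, which is how it can avoid memory copies.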
00:26:21,840 and does what's necessary preferably 562 00:26:18,600 --> 00:26:27,279 without memory copies to support those 563 00:26:21,840 --> 00:26:30,840 um now just a note on benchmarks 564 00:26:27,279 --> 00:26:32,760 Polars and DuckDB are often neck and neck in 565 00:26:30,840 --> 00:26:34,760 performance but they're both a kind of 566 00:26:32,760 --> 00:26:37,679 big step up from these other kinds of 567 00:26:34,760 --> 00:26:40,159 libraries and pandas and Dask are 568 00:26:37,679 --> 00:26:43,440 sort of in the dust I'm not sure what 569 00:26:40,159 --> 00:26:45,720 Arrow's doing there I'm not sure 570 00:26:43,440 --> 00:26:46,919 because these are based on Arrow so 571 00:26:45,720 --> 00:26:51,279 that's a bit surprising I'm not quite 572 00:26:46,919 --> 00:26:53,880 sure what that's about um so Arrow is a 573 00:26:51,279 --> 00:26:56,159 very well-designed 574 00:26:53,880 --> 00:26:57,960 specification here's a quote from 575 00:26:56,159 --> 00:27:01,000 Will Ayd from an interesting blog post 576 00:26:57,960 --> 00:27:02,520 about the Arrow C data interface so 577 00:27:01,000 --> 00:27:04,799 you'll want to look at this 578 00:27:02,520 --> 00:27:07,360 for napari and anyone else who wants to 579 00:27:04,799 --> 00:27:11,120 support 580 00:27:07,360 --> 00:27:15,240 zero-copy memory access from data 581 00:27:11,120 --> 00:27:17,919 frames the mechanism is this Arrow C 582 00:27:15,240 --> 00:27:19,679 stream the C data interface it's 583 00:27:17,919 --> 00:27:23,200 becoming the standard mechanism to 584 00:27:19,679 --> 00:27:25,840 support Arrow-based data with a lot of 585 00:27:23,200 --> 00:27:28,799 benefits there so if widespread 586 00:27:25,840 --> 00:27:31,919 this would support zero-copy 587 00:27:28,799 --> 00:27:34,279 operations without requiring PyArrow 588 00:27:31,919 --> 00:27:36,640 which as I mentioned is a big 589 00:27:34,279 --> 00:27:39,480 dependency so 590
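The Arrow C stream mechanism is exposed to Python as the `__arrow_c_stream__` method of the Arrow PyCapsule interface, which recent pandas, Polars and PyArrow objects implement. The consuming function and FakeFrame class below are hypothetical, a stdlib-only sketch of how a library could accept any Arrow-compatible frame without importing PyArrow itself.

```python
# Sketch of dataframe-agnostic input handling via the Arrow C stream
# protocol (the __arrow_c_stream__ dunder from the Arrow PyCapsule
# interface). FakeFrame is a stand-in so the example runs without
# any third-party dependency; a real caller would pass a pandas,
# Polars or PyArrow object.

def accepts_any_dataframe(df):
    """Accept any object exporting the Arrow C stream protocol."""
    if not hasattr(df, "__arrow_c_stream__"):
        raise TypeError(
            f"{type(df).__name__} does not export __arrow_c_stream__"
        )
    # A real consumer would hand the returned PyCapsule to an
    # Arrow-aware engine for zero-copy access; here we only
    # confirm the protocol is present.
    return f"ok: {type(df).__name__} is Arrow-stream compatible"

class FakeFrame:
    # Hypothetical stand-in for an Arrow-compatible data frame.
    def __arrow_c_stream__(self, requested_schema=None):
        ...  # would return a PyCapsule wrapping an ArrowArrayStream

print(accepts_any_dataframe(FakeFrame()))
# ok: FakeFrame is Arrow-stream compatible
```

The point of the protocol is exactly what the quote says: a library can consume Arrow-based data from any producer through this one dunder, without taking PyArrow on as a hard dependency.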
00:27:36,640 --> 00:27:42,240 so this is the goal of data frame 591 00:27:39,480 --> 00:27:45,880 agnosticism currently there's an 592 00:27:42,240 --> 00:27:47,440 effort underway toward removing pandas 593 00:27:45,880 --> 00:27:49,039 from a whole bunch of different projects 594 00:27:47,440 --> 00:27:51,600 as a dependency which is quite 595 00:27:49,039 --> 00:27:53,480 interesting but still accepting 596 00:27:51,600 --> 00:27:57,760 pandas data frames of course just no longer 597 00:27:53,480 --> 00:28:02,799 requiring it there so I will just skip 598 00:27:57,760 --> 00:28:05,720 to the end as I'm out of time so 599 00:28:02,799 --> 00:28:09,840 if you're interested please 600 00:28:05,720 --> 00:28:13,320 approach me sorry I've 601 00:28:09,840 --> 00:28:15,440 got to go quickly through these 602 00:28:13,320 --> 00:28:18,559 so I'd be happy to give you a 603 00:28:15,440 --> 00:28:20,519 live demo I have prepared a live demo 604 00:28:18,559 --> 00:28:21,760 of alternative data frame 605 00:28:20,519 --> 00:28:25,240 libraries how you can use them in 606 00:28:21,760 --> 00:28:27,080 practice and what the 607 00:28:25,240 --> 00:28:29,279 experience is there so please 608 00:28:27,080 --> 00:28:32,279 email me and I'd be happy to chat to you 609 00:28:29,279 --> 00:28:35,159 more yeah and please come and find me 610 00:28:32,279 --> 00:28:37,679 afterwards I'll be hanging out probably 611 00:28:35,159 --> 00:28:38,980 outside yeah all right so thank you 612 00:28:37,679 --> 00:28:44,279 very 613 00:28:38,980 --> 00:28:46,559 [Applause] 614 00:28:44,279 --> 00:28:48,720 much thank you very much Ed for that 615 00:28:46,559 --> 00:28:51,679 talk I would like to present you with 616 00:28:48,720 --> 00:28:54,720 the pon thank you uh I think we've got 617 00:28:51,679 --> 00:28:55,679 time for one question okay anyone 618 00:28:54,720 --> 00:28:57,720 have a 619 00:28:55,679
--> 00:29:00,039 question oh yeah we've got one at the 620 00:28:57,720 --> 00:29:00,039 back 621 00:29:03,320 --> 00:29:08,320 thanks that was brilliant I come from the 622 00:29:05,720 --> 00:29:11,399 Earth science community and there we use 623 00:29:08,320 --> 00:29:13,799 xarray NetCDF HDF5 how do you see the 624 00:29:11,399 --> 00:29:16,000 relationship between that and what you 625 00:29:13,799 --> 00:29:18,799 presented 626 00:29:16,000 --> 00:29:21,320 yeah 627 00:29:18,799 --> 00:29:24,880 these data frame libraries are 628 00:29:21,320 --> 00:29:28,559 not really focusing on that use case 629 00:29:24,880 --> 00:29:30,240 there's a little bit of overlap but I 630 00:29:28,559 --> 00:29:33,519 don't yet see any efforts underway to 631 00:29:30,240 --> 00:29:35,159 for example generalize xarray to use 632 00:29:33,519 --> 00:29:37,640 Polars data frames under the hood or 633 00:29:35,159 --> 00:29:41,399 something like that 634 00:29:37,640 --> 00:29:45,320 um yeah so I 635 00:29:41,399 --> 00:29:48,360 don't see much overlap at the 636 00:29:45,320 --> 00:29:50,880 moment yeah there will naturally be 637 00:29:48,360 --> 00:29:52,440 improvements available in performance 638 00:29:50,880 --> 00:29:56,120 for example when working with large data 639 00:29:52,440 --> 00:29:57,840 sets by virtue of the Apache Arrow 640 00:29:56,120 --> 00:29:59,240 format if these libraries start to 641 00:29:57,840 --> 00:30:03,320 use that so that might be the first kind 642 00:29:59,240 --> 00:30:07,120 of point of entry for those 643 00:30:03,320 --> 00:30:09,159 libraries to support for example 644 00:30:07,120 --> 00:30:11,559 missing data at a deeper level than 645 00:30:09,159 --> 00:30:14,440 numpy does 646 00:30:11,559 --> 00:30:18,540 yeah well thank you could you please 647 00:30:14,440 --> 00:30:22,490 join me in thanking Ed once more 648 00:30:18,540 --> 00:30:22,490 [Applause]