[Music]

Welcome back, everyone. Another brief reminder for those sitting in the masked area: please wear your masks. And now it is my pleasure to introduce our next speaker, Sheena, who is going to be talking on scaling Python-powered machine learning with Snowflake. A round of applause, please.

[Applause]

Hello, everyone. Thank you so much for joining in. I'm Sheena, and today I'll be talking about how you can scale your ML pipelines, because when you really start working on one, or you are building a prototype, you don't really think about the enterprise-scale problems that can come at you. So my idea for the session is also to give you a view of how we solve those kinds of data scaling problems.

All right, who am I? I am an AI Field CTO at Snowflake. I work with a lot of enterprise customers across APJ, and we help them build AI/ML solutions with best practices and things like that. So this is today's agenda.
I'll be going through, just to give you an intro, what Snowflake is, and then how we scale different parts of the ML lifecycle: data processing, feature engineering, model development, inference, and finally the consumption part, and I'll give you a summary of it.

All right, what is Snowflake? Anybody working on Snowflake? Anybody heard about Snowflake? Okay, a few hands. Good. So Snowflake is a unified platform where you can bring in all your data and build your AI on top of it. And finally, you can also build your apps, because the consumption part is very important: if you cannot build something that helps users actually use your AI, the value is not going to come out. So you can build the entire thing in Snowflake.

Now, what is the key messaging of Snowflake? We are a very easy platform. We are a SaaS platform; we manage all the infrastructure, so you don't have to worry about provisioning it, maintaining it, or tuning it. If you want CPUs or GPUs, you can just go in and leverage them on the go.
And we are completely connected: we give you options to share your data and collaborate with other customers, vendors, and so on. And it's trusted; we take security very, very seriously at Snowflake, so it's a trusted platform as well.

Now I'm going to talk about distributed and scalable Python in Snowflake. We'll go through each of these sections and see how it is done. First one: data processing.

How do we do data processing today? That's the first thing we are going to address. Pandas. Who all are using pandas here? Okay, I'm with the right audience. Awesome. So pandas is one of the most popular libraries; I'm sure you are all using it for processing data. Now, what's the problem with pandas?

If you write code in pandas and you want to productionize it at enterprise level, we usually see an average of six months taken to rewrite that code as enterprise-level production code. It's all because of how pandas really works.
Pandas works in memory, and you mostly get out-of-memory errors. I'm sure some of you have already hit this; there is hardly anyone who works with pandas and has never encountered an out-of-memory error. It's very common when working with pandas. And pandas is single-threaded: it doesn't matter how many CPU cores you have, it's not going to use them. It will always take one core of that CPU and keep working. So, no distributed processing at all.

As a result, it cannot scale even on GBs of data, on millions of rows; it's not going to happen. And because of these problems, you will have to rewrite your pandas code when you go into productionization at enterprise scale, and it is also very difficult to experiment and debug, so you lose productivity.

Now, how does Snowflake help our customers, or anybody building the data processing part, in Snowflake? We have what is known as the Snowpark Python API.
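To make the in-memory constraint concrete, here is a small, runnable illustration with stock pandas: the whole DataFrame has to fit in local RAM, and memory_usage(deep=True) shows exactly how much that is. The one-million-row frame below is an arbitrary example, not data from the talk.

```python
import numpy as np
import pandas as pd

# A pandas DataFrame lives entirely in local RAM, in one process.
# memory_usage(deep=True) reports the per-column footprint that has to fit
# in memory -- when it doesn't, you get the out-of-memory errors the talk
# describes.
df = pd.DataFrame({
    "id": np.arange(1_000_000, dtype=np.int64),   # 8 bytes per row
    "value": np.random.rand(1_000_000),           # 8 bytes per row
})

total_bytes = int(df.memory_usage(deep=True).sum())
print(f"in-memory size: {total_bytes / 1e6:.1f} MB")  # ~16 MB for 1M rows x 2 columns
```

Scale the same two columns to billions of rows and the footprint grows past typical laptop RAM long before the computation even starts, which is the failure mode being described.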
Think of it as a Python API that is a wrapper on top of the most common Python libraries. Under the Snowpark API there are two main points I want to highlight. One, Snowpark helps you run data processing and transformation in a distributed fashion. And the second thing is that we are moving towards executing your code near the data. Think about it: when you are writing code on your laptop and your data is in a database, you just download that data, put it in your local memory with read_csv or something like that, and then work on it locally. But here the concept is pushing the code back to where the data is residing.

Now, we have two main offerings under the Snowpark API: the pandas API and the DataFrame API. The pandas API is distributed; it solves all the problems we discussed for open-source pandas. And then we have the DataFrame API, which is lazily evaluated, something very similar to Spark if you have used that.
How do I get started with pre-processing in Snowflake? I need an environment to write in. It can be your own IDE, or Snowflake itself has a notebook inside, so you can leverage those, or use your own IDE like Jupyter Notebook or VS Code, anything, to write your code.

Now you have options. I want to highlight again: if you ever want to just stick to normal pandas, the OSS version, and do the pre-processing that way, it's completely fine. These are the extra options we provide in Snowflake so customers can do things in a very scalable manner. We have the Snowpark DataFrame API; you can see a sample of the code there, df.filter(df.state == 'WA'), so we use it very similarly to how you would if you're a pandas user. And then we have the other option, pandas on Snowflake, which is built on top of the open-source one. We keep the function names very similar to the open-source version, so you don't have to change much of the code.
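The lazy evaluation mentioned for the Snowpark DataFrame API can be sketched with a toy class: method calls only record operations, and nothing executes until you ask for results. This is an illustration of the concept only, not Snowflake's implementation.

```python
# Toy sketch of lazy evaluation (concept only, not Snowflake's code):
# each method call records an operation; work happens only at collect().
class LazyFrame:
    def __init__(self, rows, ops=None):
        self._rows = rows
        self._ops = ops or []                  # pending, unexecuted operations

    def filter(self, predicate):
        # No work here -- just return a new frame with the op appended.
        return LazyFrame(self._rows, self._ops + [("filter", predicate)])

    def collect(self):
        # Only now do the recorded ops run. In Snowpark, this is the point
        # where the plan becomes a SQL query executed inside Snowflake.
        rows = self._rows
        for op, fn in self._ops:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
        return rows

df = LazyFrame([{"state": "WA", "n": 1}, {"state": "CA", "n": 2}])
pending = df.filter(lambda r: r["state"] == "WA")   # nothing has executed yet
print(pending.collect())  # [{'state': 'WA', 'n': 1}]
```

The payoff of deferring work like this is that the whole chain of operations can be optimized and executed in one shot where the data lives, rather than eagerly step by step on the client.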
And in case you want to write custom Python code, very custom logic, we also have user-defined functions in Snowflake as well. Now, the best part is that if you write code with these APIs, in this manner, it always gets pushed down to Snowflake compute. You write the code, and there are client-level libraries that push it, so the execution always happens inside Snowflake.

Now we'll double-click a little bit more into the pandas on Snowflake part: how it works, how it is distributed, and how it becomes scalable. Again, it's an extension of the Snowpark API, modeled on top of OSS pandas. Now, has anybody heard about Modin? Yeah, okay, awesome. Modin was an open-source initiative to make pandas distributed and scalable; it has been over five years of research and things like that. So Modin is also available as open source.
We recently acquired Modin as well, and Snowpark pandas is already available in Snowflake. What this essentially does is let you keep using your code as such; the main thing you need to change is your import statement. You can see it in the picture there: you are basically changing "import pandas" to "import modin.pandas", and the rest of the functions and everything remain the same, but you will still leverage the distributed and scalable processing behind the scenes. The OSS version is still available if you want to check it out; please feel free to do so.

Now, how does it really work? How does it really scale in Snowflake? What happens is that when you write a statement like pd.concat([df1, df2]), there is a query translator that literally takes your statement, and then we have a connector which connects to Snowflake and automatically pushes it down as a SQL query.
So your pandas is getting translated into a SQL query, and it is pushed down to the Snowflake processing engine.

Now, the processing engine consists of multiple nodes; it is distributed. So Snowflake behind the scenes automatically distributes and scales things for you. What are the advantages? Seamless prototyping: you don't have to worry about refactoring your code and everything. Zero data movement, because we are pushing the work to Snowflake. And you can just keep using the pandas you are very familiar with and keep writing your code in that.

Now, this is the main highlight of the difference when you switch between these two libraries, when the difference is mainly an import statement. Let's take a look at the second bar graph, which is about 10 GB of data. The x-axis is the data size and the y-axis is seconds. When you look at the 10 GB data, the processing is 30 times faster than using the normal OSS pandas.
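As a toy illustration of the pushdown idea (not Snowflake's actual query translator), a pandas-style concat of two server-side tables maps naturally onto a single SQL UNION ALL statement, so the rows never have to travel to the client:

```python
# Toy sketch of pandas-to-SQL pushdown (NOT Snowflake's real translator):
# a concat of two tables becomes one SQL statement the engine can run
# where the data lives.
def translate_concat(table_a: str, table_b: str) -> str:
    """What a pd.concat([df1, df2]) over warehouse tables could translate to."""
    return f"SELECT * FROM {table_a} UNION ALL SELECT * FROM {table_b}"

sql = translate_concat("DF1", "DF2")
print(sql)  # SELECT * FROM DF1 UNION ALL SELECT * FROM DF2
```

A real translator handles the whole pandas surface (joins, groupbys, window functions), but the principle is the same: the client emits a query, and the distributed engine does the work.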
You can see the blue-colored one is the Snowflake version, and the other one is using the pandas OSS version. And for the last one, when it goes to 1 TB, you can see the processing finishes in under 50 seconds for Snowpark pandas, whereas the other one didn't finish; you will mostly encounter out-of-memory errors.

Now, on to feature engineering. Any data scientists here? Okay, good. So when it comes to data science, it is very hard to do certain kinds of feature engineering techniques, what we call one-hot encoding and things like that. These are very computationally expensive operations. So how do we help with those kinds of operations? We have another API, the Snowflake ML Python package. What it essentially does is, again, provide a wrapper built on top of the open-source packages, mainly scikit-learn, and we also have wrappers on top of XGBoost and LightGBM, all these packages that help you do different kinds of pre-processing as well as the ML training part in a very distributed fashion.
Under pre-processing, you can see most of the pre-processing operations you use, like standard scaling and ordinal encoding; all those things are already there. So you just have to use the functions, and behind the scenes we handle the distribution for you. Now again, how does it work behind the scenes? How is it distributed? Same logic: the ML pre-processing again gets passed through a query translator, which translates it into SQL and pushes it back into the Snowflake engine, and then execution happens. Whereas for model training, some training we honestly cannot distribute; so if it is an XGBoost regressor or something, it will just be pushed down as Python bytecode and executed. We'll go into how to distribute the ML part as well later in the slides. All right. Now, what is the difference? Why should a person use Snowpark ML for feature engineering compared to plain scikit-learn or anything else?
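For reference, this is the open-source scikit-learn fit/transform pattern that the Snowflake ML preprocessing classes mirror; in Snowflake, the same-shaped calls are pushed down to the engine instead of running locally. The tiny arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The OSS pattern the wrappers keep: fit() learns the statistics,
# transform() applies them.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X)                      # learns the per-column mean and std
X_scaled = scaler.transform(X)     # applies (x - mean) / std

# One-hot encoding: one indicator column per category -- cheap here,
# expensive at hundreds of millions of rows.
cities = np.array([["Sydney"], ["Melbourne"], ["Sydney"]])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cities).toarray()

print(one_hot.shape)  # (3, 2)
```

Because the function names and call shapes match, swapping the import for the Snowflake wrapper is the main code change, which is the same migration story as the pandas API earlier.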
So, this is an experiment we did on 200 million rows of data, which is 16 GB in memory, and that is the only difference between the two bar graphs that you see. The first one is using the scikit-learn StandardScaler library; everybody is familiar with that. When you use a scaler, there are usually two steps to it: you fit it and then you transform. That is why you see the two colors, red and blue: there is a fitting time and there is a transforming time as well. Now, when we run the same data with the Snowpark ML StandardScaler, which is the second one here, you see the difference is quite significant: about a 25 to 50x speedup. Another operation we also ran is the one-hot encoder. Again, these are very computationally expensive feature engineering operations that data scientists use, and you can see the significant difference there too. And the last one, the purple bar graph, is about correlation. In scikit-learn you can also compute the correlation of features.
This becomes very, very hard as the number of features increases. In this case we did the experiment on 1 million rows with 512 columns, running on a medium Snowpark-optimized warehouse ("warehouse" is the name of the compute that we use). And the result is that at 1,024 columns you usually start getting out-of-memory errors with the scikit-learn packages, but if you use Snowpark ML, everything keeps going, and without hitting a memory error you can compute those correlation matrices.

Now, on to model training. For model training, we offer a container runtime for ML. Whatever code you have, however optimized the way you write it, if you don't have scalable infrastructure on which to run that code, it's not going to be effective.
So that is why we came up with this container runtime for ML, where you have CPU and GPU pools that you can configure out of the box. It is powered by a Ray compute cluster, and when you spin up a container, it comes with most of the Python libraries that you really use: you can see scikit-learn, XGBoost, LightGBM, etc. come as pre-built packages. We also have optimized data ingestion: if you want to get 1 TB of data, or any huge dataset, into the container runtime, it is much easier, and then we have scalable training APIs on top of it, and so on.

Now, how do we distribute it? Say I have an XGBoost model and I want to distribute the training of this XGBoost over multiple nodes. Usually, outside Snowflake, or if you want to do it with OSS Python, it is quite difficult. So this is why we again built a wrapper on top of it. You can just keep writing any open-source code, and you can scale it easily by using the scaling configuration.
You have seen that the only difference is that you have to specify this particular scaling configuration, where you specify how many GPUs you want, what the estimators are, and so on; that scaling configuration in particular helps you do everything easily. All the infra management and all the scaling is done by Snowflake, and Snowflake also manages how to distribute the training across your multiple nodes, so you can just freely use these distributed training APIs and it just runs.

Now, what's the difference? You can see the first, blue line, which is using the Snowflake ML API that I just mentioned, and the other experiment is with the OSS library for XGBoost. You can see the difference: the x-axis is the size of the data, plotted against the XGBoost training time. When the size of the data increases, it takes more and more time for the OSS version to finish the training. But in the case of Snowflake, it can automatically distribute.
And you can see how wide the gap is.

All right. Now, we also have a hyperparameter tuning API. The one we discussed before was for running one XGBoost training in a distributed fashion. Now, how about if I have multiple hyperparameter configurations to tune? This is a very common technique all data scientists use: you run multiple configurations of the model, see which is the best one, and pick the best among them. It is a very difficult technique, but very essential: if you want to build a model, you have to test different hyperparameter configurations and select the best one. So you can easily scale hundreds of model trainings across this by using it. Again, just leverage your normal Python OSS code.
You can bring in any package or anything; you write your train function with your open-source Python code, and under the tuner you basically specify what configurations or hyperparameters you want to use to train these models in parallel, and you can also define different parameters, the metric, etc. It will automatically distribute for you, you will get different models running in parallel, and finally you can pick the optimized best one out of all your hyperparameter configurations. You should also be able to do grid search strategies and random search strategies; these are the strategies data scientists normally use to run different kinds of parameter tuning and then pick the best out of it.

Now, as a part of the Snowflake ML API, we also have a bit of hyperparameter optimization that comes out of the box. You can see the code is very simple and very similar to when you use scikit-learn or anything like that.
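The tuner idea, evaluating many configurations in parallel and keeping the best, can be sketched with nothing but the standard library. The parameter names and the quadratic "score" below are made up for illustration; a real train_and_score would fit and validate a model for each configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy hyperparameter tuner: score every configuration concurrently, keep
# the best. The score is a stand-in with its peak at lr=0.1, depth=5.
def train_and_score(config):
    lr, depth = config["learning_rate"], config["max_depth"]
    score = -((lr - 0.1) ** 2 + (depth - 5) ** 2)
    return score, config

# Grid search: the cross product of every candidate value (9 configs here).
grid = [
    {"learning_rate": lr, "max_depth": d}
    for lr, d in product([0.01, 0.1, 0.3], [3, 5, 7])
]

with ThreadPoolExecutor() as pool:              # configs evaluated concurrently
    results = list(pool.map(train_and_score, grid))

best_score, best_config = max(results, key=lambda r: r[0])
print(best_config)  # {'learning_rate': 0.1, 'max_depth': 5}
```

A random search strategy would sample configurations instead of enumerating the full product; either way, the runs are independent, which is exactly why this workload distributes so well across nodes.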
You can see we are doing a GridSearchCV, and you are just specifying all these parameters, but behind the scenes what we do is run everything in a distributed fashion. So with the same code, when you increase the number of nodes, or the size of the compute from medium to large to extra-large, the time goes down, because the automatic distribution happens behind the scenes for you in Snowflake.

Now, this was a question I encountered when I was talking with one of our customers. Their question was: "Hey, I want to build 60 million models for my use case."
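For contrast, this is the familiar open-source GridSearchCV pattern the slide's code resembles; the hosted version exposes a similarly shaped API but distributes the candidate fits on warehouse compute rather than local cores. The synthetic dataset and parameter grid here are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Open-source grid search with cross-validation: every parameter
# combination is trained and scored, and the best one is kept.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=3,          # 3-fold cross-validation per candidate
    n_jobs=-1,     # all local cores; a distributed backend spans nodes instead
)
search.fit(X, y)   # 6 candidates x 3 folds = 18 model fits
print(search.best_params_, round(search.best_score_, 3))
```

Since every candidate-fold fit is independent, moving from "all local cores" to "many nodes" changes where the fits run, not the code you write, which is the point of the medium/large/extra-large comparison on the slide.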
There was only a single use case, but it required 60 million models, because it was a hyper-personalization use case: they wanted a model for every single customer, to understand how much that customer is going to spend or whether they are going to churn, rather than one generic model. So how do we do this? In Snowflake, we solved this problem using a partitioned model training feature. When they were running this on another platform, without any parallelization or distribution, it was taking 18 hours. We ported the code into Snowflake, used this feature, and finished in under 3 hours. That's the difference we are talking about.

This is particularly helpful when there is a partition key, a partition you can really leverage. For example, think about different stores.
You want to understand the demand for different stores and build a model at the store level, or even at the product level: to predict demand for each product, you have to go down to the individual product and build a model there. This is how we do that. Your training data just stays the same, the data in the table; you don't have to partition it or do anything else. What we do is create the partitions for the different stores, train a sub-model for each store, and finally push everything into our model registry as one single model. Any time you want to do inference, you call that single model from the registry, and it picks the corresponding sub-model for that particular store and gives you the results. That is how it works, and there are two versions of it, stateless and stateful. Imagine you are building 60 million models: it is hard to save all 60 million of them.
If you are running this every day, you don't want to save all the models every time. It has to be train-on-the-go: train, run the inference, and you're done; the next day you repeat the same thing. Those kinds of models are called stateless models, because you don't need to save them. Then you have stateful models: you want to save that particular model version, which is the normal behavior when you build a data science model, you save it and then call it to do the inference. If you want to do that, that is also fine, but consider that when you are building millions of models, it is going to take time and it will affect performance as well.

All right, so how do we do this many-model, partitioned inference in Snowflake? It looks very much the same. You write a Python class, here an example forecasting model, just another example, and you put a decorator on it, the custom-model decorator; under its predict function you can write any Python code you want, any algorithm, any packages: normal OSS Python, XGBoost, anything. After that, you log the model into the registry. We log the model into the registry even before training, because we are building 60 million models in parallel and then pushing them to the registry; the registry handles everything behind the scenes for you. So in this case I push the model into the registry first. If you're not familiar with a model registry, think of it as a centralized place where I put all my models, so that later I have auditability and traceability. It also acts as a handshake point between different teams: I push my model into the registry and tell the apps team or the ML engineering team, "you can take the model from there and productionize it," and things like that. So I do that, and the next step is to just call the run function and specify the partition column. This partition column can be the store name, which is very easy: you don't have to physically partition anything.
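As a mental model of what such a partitioned model does, here is a minimal plain-Python sketch. `PartitionedModel`, `fit`, `predict`, and `partition_column` are hypothetical names standing in for the registry-backed custom model described above, not the actual Snowflake API: one logical model holds a sub-model per partition key and dispatches to it at prediction time.

```python
# One logical model, many sub-models: group rows by a partition key,
# "train" one sub-model per group, and dispatch at prediction time.
from statistics import mean

class PartitionedModel:
    def __init__(self):
        self.submodels = {}  # partition value -> fitted sub-model

    def fit(self, rows, partition_column):
        # Group rows by the partition key; no physical partitioning needed.
        groups = {}
        for row in rows:
            groups.setdefault(row[partition_column], []).append(row["sales"])
        # Stand-in for real per-partition training: here just the mean.
        for key, values in groups.items():
            self.submodels[key] = mean(values)
        return self

    def predict(self, row, partition_column):
        # Dispatch to the sub-model for this row's partition.
        return self.submodels[row[partition_column]]

rows = [
    {"store": "A", "sales": 10.0},
    {"store": "A", "sales": 14.0},
    {"store": "B", "sales": 100.0},
]
model = PartitionedModel().fit(rows, partition_column="store")
print(model.predict({"store": "A"}, partition_column="store"))  # -> 12.0
```

Switching from store-level to country-level models is then just a different `partition_column`, with no change to the rest of the code.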
It's just one line of code where you specify which partition column you want to use. If tomorrow I decide I don't want to model at the store level but at the country level, I come back and change the partition column to the country. It only needs a partition key it can use; the code remains the same and you don't have to change anything.

So that's how we build millions of models in parallel in Snowflake.

All right, inference. When it comes to inference, we provide the two usual kinds. The first is batch inference, which we provide through the warehouse. This kind of inference comes distributed out of the box: say you have a million rows and you are just using warehouse inference, it will automatically distribute your data behind the scenes and do the prediction for you.
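The behind-the-scenes behavior just described boils down to chunk, score in parallel, merge. A small sketch of that pattern, where `predict_chunk` and `batch_predict` are illustrative stand-ins, not Snowflake internals:

```python
# Sketch of distributed batch inference: split the input into chunks,
# score each chunk on a worker, then merge the partial results back
# into one ordered result set.
from concurrent.futures import ThreadPoolExecutor

def predict_chunk(chunk):
    # Stand-in for model.predict on one slice of the data.
    return [2 * x + 1 for x in chunk]

def batch_predict(rows, chunk_size=2, max_workers=4):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(predict_chunk, chunks)  # order-preserving
    merged = []
    for part in partials:  # merge partial results into one table
        merged.extend(part)
    return merged

print(batch_predict([0, 1, 2, 3, 4]))  # -> [1, 3, 5, 7, 9]
```

From the caller's point of view this is still a single predict call; only the fan-out and merge happen underneath.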
You can see the inference code here: it's model.predict, and you just run the function. You take your test data and predict with the model; behind the scenes, Snowflake handles the distributed processing, then merges all the results back together and gives them to you as one single table of results. The second kind uses Snowpark Container Services, which is what we use if you want GPU-level inference or, say, real-time inference. With real-time inference there is also load to balance. For example, today I have 10,000 users, so I'm getting 10,000 API calls; tomorrow I start getting 20,000. How do you scale that? That is also handled by Snowflake: you can configure the number of nodes, the scalability.
You say you want this many minimum nodes and this many maximum nodes, and once you start inference in Snowpark Container Services, it will automatically scale for you according to the incoming load. So you don't have to worry about those things.

All right, and finally the most important part: how do I make the results of my AI or ML available to the customers, or to the business users in a company? That is a very important part. So how do we do that? You have probably seen it in every single presentation today, or at least every one I attended at this conference: everybody talks about Streamlit, right? Does anybody know that Streamlit was acquired by Snowflake? Okay, only one hand. Yes, we acquired Streamlit, and it is now available as part of Snowflake; we call it Streamlit in Snowflake, and the open-source version is still available too. But inside Snowflake we provide it as an enterprise-grade app.
We have a native app framework where you can easily build your Streamlit app and share it very seamlessly, with all the RBAC controls, auth, and so on handled by Snowflake. And then you also have the open-source version, which, as I can see from most of the talks, you are all already leveraging.

Now, why run Streamlit in Snowflake? It comes with additional advantages on top of the OSS version. It is fully managed. We are also working on a feature where you can run your Streamlit app on our container runtime, which again means it will be very scalable for you; you won't have to worry about load handling and things like that. Governance and security: you don't have to configure auth, who can log in, and so on; out of the box, that is taken care of by Snowflake, and you can set RBAC-level permissions. And it is completely integrated with Snowflake, so it is easier to build.
We come with an interface where you can go in and start coding your Streamlit app in Python, and in the same window you can see the dashboard being built as you go; it is also integrated with our notebooks, so if you code in a notebook, it is easy to see it there as well.

Now, the summary. Snowflake provides a very simple, easy-to-use platform that helps you scale every single part of your ML workload. If you want high-performance preprocessing, you can leverage the APIs we provide out of the box: the Snowpark API, Snowpark pandas, and the Snowpark ML APIs for your feature engineering and processing. But you are always free to use the OSS Python version, bring in your existing code, or write anything you want and execute it in Snowflake. Moving on to distributed training and tuning: you can leverage CPUs and GPUs very easily in Snowflake without setting up any of these libraries.
Normally, if you want to use a GPU, you have to get started by installing CUDA, cuDNN, and all those libraries, and set everything up from scratch; in Snowflake, it all comes out of the box for you, so you can get started right away. Seamless deployment and serving: you can serve your models on CPU or GPU, whichever way you want, and the distributed machinery is provided out of the box and handled behind the scenes; for the user, it is very similar to writing normal Python code with any other library. And Streamlit, I don't think I have to say much about Streamlit; everybody is using it now. It will help you build all those apps and prototypes more easily and share them with your wider set of users, so they can leverage the results of your AI and ML. All right, that is all for the session today.
If you're curious to know more about Python in Snowflake, this is a QR code you can scan; that GitHub link has a lot of Python recipes that will help you build all of this in a very distributed fashion. All right, that's all for today. Thank you.

[Applause]