[Music]

Welcome back, everyone. Another brief reminder for those sitting in the masked area: please wear your masks. And now it is my pleasure to introduce our next speaker, Sheena, who is going to be talking on scaling Python-powered machine learning with Snowflake. A round of applause, please.

[Applause]

Hello, everyone. Thank you so much for joining in. I'm Sheena, and today I'll be talking about how you can scale your ML pipelines, because when you really start working on one, or you are building a prototype, you don't really think about the enterprise-scale problems that can come at you. So my idea for the session is also to give you a view of how we solve those kinds of data scaling problems.

All right, who am I? I am an AI Field CTO at Snowflake. I work with a lot of enterprise customers across APJ, and we help them build AI/ML solutions with best practices and things like that. So this is today's agenda.
I'll be going through, just to give you an intro, what Snowflake is, and then how we scale different parts of the ML lifecycle: data processing, feature engineering, model development, inference, and finally the consumption part, and I'll give you a summary of it.

All right, what is Snowflake? Anybody working on Snowflake? Anybody heard about Snowflake? Okay, a few hands. Good. So Snowflake is a unified platform where you can bring in all your data and build your AI on top of it. And finally, you can also build your apps, because the consumption part is very important: if you cannot build something that helps users actually use your AI, the value is not going to come out. So you can build the entire thing in Snowflake.

Now, what is the key messaging of Snowflake? We are a very easy platform. We are a SaaS platform; we manage all the infrastructure, so you don't have to worry about provisioning it, maintaining it, or tuning it. If you want CPUs or GPUs, you can just go in and leverage them on the go.
And we are completely connected: we give you options to share your data and collaborate with other customers, vendors, and so on. And it's trusted; we take security very, very seriously at Snowflake, so it's a trusted platform as well.

Now I'm going to talk about distributed and scalable Python in Snowflake. We'll go through each of these sections and see how it is done. First one: data processing.

How do we do data processing today? That's the first thing we are going to address. Pandas. Who all are using pandas here? Okay, I'm with the right audience. Awesome. So pandas is one of the most popular libraries; I'm sure you are all using it for processing data. Now, what's the problem with pandas?

If you write code in pandas and you want to productionize it at enterprise level, we usually see an average of six months taken to rewrite that code as enterprise-level production code. It's all because of how pandas really works.
Pandas works in memory, and you mostly get out-of-memory errors. I'm sure some of you have already hit this; there is hardly anyone who works with pandas and has never encountered an out-of-memory error. It's very common when working with pandas. And pandas is single-threaded: it doesn't matter how many CPU cores you have, it's not going to use them. It will always take one core of that CPU and keep working. So, no distributed processing at all.

As a result, it cannot scale even on GBs of data, on millions of rows; it's not going to happen. And because of these problems, you will have to rewrite your pandas code when you go into productionization at enterprise scale, and it is also very difficult to experiment and debug, so you lose productivity.

Now, how does Snowflake help our customers, or anybody building the data processing part, in Snowflake? We have what is known as the Snowpark Python API.
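To make the in-memory constraint concrete, here is a small, runnable illustration with stock pandas: the whole DataFrame has to fit in local RAM, and memory_usage(deep=True) shows exactly how much that is. The one-million-row frame below is an arbitrary example, not data from the talk.

```python
import numpy as np
import pandas as pd

# A pandas DataFrame lives entirely in local RAM, in one process.
# memory_usage(deep=True) reports the per-column footprint that has to fit
# in memory -- when it doesn't, you get the out-of-memory errors the talk
# describes.
df = pd.DataFrame({
    "id": np.arange(1_000_000, dtype=np.int64),   # 8 bytes per row
    "value": np.random.rand(1_000_000),           # 8 bytes per row
})

total_bytes = int(df.memory_usage(deep=True).sum())
print(f"in-memory size: {total_bytes / 1e6:.1f} MB")  # ~16 MB for 1M rows x 2 columns
```

Scale the same two columns to billions of rows and the footprint grows past typical laptop RAM long before the computation even starts, which is the failure mode being described.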
Think of it as a Python API that is a wrapper on top of the most common Python libraries. Under the Snowpark API there are two main points I want to highlight. One, Snowpark helps you run data processing and transformation in a distributed fashion. And the second thing is that we are moving towards executing your code near the data. Think about it: when you are writing code on your laptop and your data is in a database, you just download that data, put it in your local memory with read_csv or something like that, and then work on it locally. But here the concept is pushing the code back to where the data is residing.

Now, we have two main offerings under the Snowpark API: the pandas API and the DataFrame API. The pandas API is distributed; it solves all the problems we discussed for open-source pandas. And then we have the DataFrame API, which is lazily evaluated, something very similar to Spark if you have used that.
How do I get started with pre-processing in Snowflake? I need an environment to write in. It can be your own IDE, or Snowflake itself has a notebook inside, so you can leverage those, or use your own IDE like Jupyter Notebook or VS Code, anything, to write your code.

Now you have options. I want to highlight again: if you ever want to just stick to normal pandas, the OSS version, and do the pre-processing that way, it's completely fine. These are the extra options we provide in Snowflake so customers can do things in a very scalable manner. We have the Snowpark DataFrame API; you can see a sample of the code there, df.filter(df.state == 'WA'), so we use it very similarly to how you would if you're a pandas user. And then we have the other option, pandas on Snowflake, which is built on top of the open-source one. We keep the function names very similar to the open-source version, so you don't have to change much of the code.
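The lazy evaluation mentioned for the Snowpark DataFrame API can be sketched with a toy class: method calls only record operations, and nothing executes until you ask for results. This is an illustration of the concept only, not Snowflake's implementation.

```python
# Toy sketch of lazy evaluation (concept only, not Snowflake's code):
# each method call records an operation; work happens only at collect().
class LazyFrame:
    def __init__(self, rows, ops=None):
        self._rows = rows
        self._ops = ops or []                  # pending, unexecuted operations

    def filter(self, predicate):
        # No work here -- just return a new frame with the op appended.
        return LazyFrame(self._rows, self._ops + [("filter", predicate)])

    def collect(self):
        # Only now do the recorded ops run. In Snowpark, this is the point
        # where the plan becomes a SQL query executed inside Snowflake.
        rows = self._rows
        for op, fn in self._ops:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
        return rows

df = LazyFrame([{"state": "WA", "n": 1}, {"state": "CA", "n": 2}])
pending = df.filter(lambda r: r["state"] == "WA")   # nothing has executed yet
print(pending.collect())  # [{'state': 'WA', 'n': 1}]
```

The payoff of deferring work like this is that the whole chain of operations can be optimized and executed in one shot where the data lives, rather than eagerly step by step on the client.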
And in case you want to write custom Python code, very custom logic, we also have user-defined functions in Snowflake as well. Now, the best part is that if you write code with these APIs, in this manner, it always gets pushed down to Snowflake compute. You write the code, and there are client-level libraries that push it, so the execution always happens inside Snowflake.

Now we'll double-click a little bit more into the pandas on Snowflake part: how it works, how it is distributed, and how it becomes scalable. Again, it's an extension of the Snowpark API, modeled on top of OSS pandas. Now, has anybody heard about Modin? Yeah, okay, awesome. Modin was an open-source initiative to make pandas distributed and scalable; it has been over five years of research and things like that. So Modin is also available as open source.
We recently acquired Modin as well, and Snowpark pandas is already available in Snowflake. What this essentially does is let you keep using your code as such; the main thing you need to change is your import statement. You can see it in the picture there: you are basically changing "import pandas" to "import modin.pandas", and the rest of the functions and everything remain the same, but you will still leverage the distributed and scalable processing behind the scenes. The OSS version is still available if you want to check it out; please feel free to do so.

Now, how does it really work? How does it really scale in Snowflake? What happens is that when you write a statement like pd.concat([df1, df2]), there is a query translator that literally takes your statement, and then we have a connector which connects to Snowflake and automatically pushes it down as a SQL query.
So your pandas is getting translated into a SQL query, and it is pushed down to the Snowflake processing engine.

Now, the processing engine consists of multiple nodes; it is distributed. So Snowflake behind the scenes automatically distributes and scales things for you. What are the advantages? Seamless prototyping: you don't have to worry about refactoring your code and everything. Zero data movement, because we are pushing the work to Snowflake. And you can just keep using the pandas you are very familiar with and keep writing your code in that.

Now, this is the main highlight of the difference when you switch between these two libraries, when the difference is mainly an import statement. Let's take a look at the second bar graph, which is about 10 GB of data. The x-axis is the data size and the y-axis is seconds. When you look at the 10 GB data, the processing is 30 times faster than using the normal OSS pandas.
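As a toy illustration of the pushdown idea (not Snowflake's actual query translator), a pandas-style concat of two server-side tables maps naturally onto a single SQL UNION ALL statement, so the rows never have to travel to the client:

```python
# Toy sketch of pandas-to-SQL pushdown (NOT Snowflake's real translator):
# a concat of two tables becomes one SQL statement the engine can run
# where the data lives.
def translate_concat(table_a: str, table_b: str) -> str:
    """What a pd.concat([df1, df2]) over warehouse tables could translate to."""
    return f"SELECT * FROM {table_a} UNION ALL SELECT * FROM {table_b}"

sql = translate_concat("DF1", "DF2")
print(sql)  # SELECT * FROM DF1 UNION ALL SELECT * FROM DF2
```

A real translator handles the whole pandas surface (joins, groupbys, window functions), but the principle is the same: the client emits a query, and the distributed engine does the work.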
You can see the blue-colored one is the Snowflake version, and the other one is using the pandas OSS version. And for the last one, when it goes to 1 TB, you can see the processing finishes in under 50 seconds for Snowpark pandas, whereas the other one didn't finish; you will mostly encounter out-of-memory errors.

Now, on to feature engineering. Any data scientists here? Okay, good. So when it comes to data science, it is very hard to do certain kinds of feature engineering techniques, what we call one-hot encoding and things like that. These are very computationally expensive operations. So how do we help with those kinds of operations? We have another API, the Snowflake ML Python package. What it essentially does is, again, provide a wrapper built on top of the open-source packages, mainly scikit-learn, and we also have wrappers on top of XGBoost and LightGBM, all these packages that help you do different kinds of pre-processing as well as the ML training part in a very distributed fashion.
Under pre-processing, you can see most of the pre-processing operations you use, like standard scaling and ordinal encoding; all those things are already there. So you just have to use the functions, and behind the scenes we handle the distribution for you. Now again, how does it work behind the scenes? How is it distributed? Same logic: the ML pre-processing again gets passed through a query translator, which translates it into SQL and pushes it back into the Snowflake engine, and then execution happens. Whereas for model training, some training we honestly cannot distribute; so if it is an XGBoost regressor or something, it will just be pushed down as Python bytecode and executed. We'll go into how to distribute the ML part as well later in the slides. All right. Now, what is the difference? Why should a person use Snowpark ML for feature engineering compared to plain scikit-learn or anything else?
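For reference, this is the open-source scikit-learn fit/transform pattern that the Snowflake ML preprocessing classes mirror; in Snowflake, the same-shaped calls are pushed down to the engine instead of running locally. The tiny arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The OSS pattern the wrappers keep: fit() learns the statistics,
# transform() applies them.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X)                      # learns the per-column mean and std
X_scaled = scaler.transform(X)     # applies (x - mean) / std

# One-hot encoding: one indicator column per category -- cheap here,
# expensive at hundreds of millions of rows.
cities = np.array([["Sydney"], ["Melbourne"], ["Sydney"]])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cities).toarray()

print(one_hot.shape)  # (3, 2)
```

Because the function names and call shapes match, swapping the import for the Snowflake wrapper is the main code change, which is the same migration story as the pandas API earlier.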
So, this is an experiment we did on 200 million rows of data, which is 16 GB in memory, and that is the only difference between the two bar graphs that you see. The first one is using the scikit-learn StandardScaler library; everybody is familiar with that. When you use a scaler, there are usually two steps to it: you fit it and then you transform. That is why you see the two colors, red and blue: there is a fitting time and there is a transforming time as well. Now, when we run the same data with the Snowpark ML StandardScaler, which is the second one here, you see the difference is quite significant: about a 25 to 50x speedup. Another operation we also ran is the one-hot encoder. Again, these are very computationally expensive feature engineering operations that data scientists use, and you can see the significant difference there too. And the last one, the purple bar graph, is about correlation. In scikit-learn you can also compute the correlation of features.
This becomes very, very hard as the number of features increases. In this case we did the experiment on 1 million rows with 512 columns, running on a medium Snowpark-optimized warehouse ("warehouse" is the name of the compute that we use). And the result is that at 1,024 columns you usually start getting out-of-memory errors with the scikit-learn packages, but if you use Snowpark ML, everything keeps going, and without hitting a memory error you can compute those correlation matrices.

Now, on to model training. For model training, we offer a container runtime for ML. Whatever code you have, however optimized the way you write it, if you don't have scalable infrastructure on which to run that code, it's not going to be effective.
So that is why we came up with this container runtime for ML, where you have CPU and GPU pools that you can configure out of the box. It is powered by a Ray compute cluster, and when you spin up a container, it comes with most of the Python libraries that you really use: you can see scikit-learn, XGBoost, LightGBM, etc. come as pre-built packages. We also have optimized data ingestion: if you want to get 1 TB of data, or any huge dataset, into the container runtime, it is much easier, and then we have scalable training APIs on top of it, and so on.

Now, how do we distribute it? Say I have an XGBoost model and I want to distribute the training of this XGBoost over multiple nodes. Usually, outside Snowflake, or if you want to do it with OSS Python, it is quite difficult. So this is why we again built a wrapper on top of it. You can just keep writing any open-source code, and you can scale it easily by using the scaling configuration.
You have seen that the only difference is that you have to specify this particular scaling configuration, where you specify how many GPUs you want, what the estimators are, and so on; that scaling configuration in particular helps you do everything easily. All the infra management and all the scaling is done by Snowflake, and Snowflake also manages how to distribute the training across your multiple nodes, so you can just freely use these distributed training APIs and it just runs.

Now, what's the difference? You can see the first, blue line, which is using the Snowflake ML API that I just mentioned, and the other experiment is with the OSS library for XGBoost. You can see the difference: the x-axis is the size of the data, plotted against the XGBoost training time. When the size of the data increases, it takes more and more time for the OSS version to finish the training. But in the case of Snowflake, it can automatically distribute.
And you can see how wide the gap is.

All right. Now, we also have a hyperparameter tuning API. The one we discussed before was for running one XGBoost training in a distributed fashion. Now, how about if I have multiple hyperparameter configurations to tune? This is a very common technique all data scientists use: you run multiple configurations of the model, see which is the best one, and pick the best among them. It is a very difficult technique, but very essential: if you want to build a model, you have to test different hyperparameter configurations and select the best one. So you can easily scale hundreds of model trainings across this by using it. Again, just leverage your normal Python OSS code.
You can bring in any package or anything; you write your train function with your open-source Python code, and under the tuner you basically specify what configurations or hyperparameters you want to use to train these models in parallel, and you can also define different parameters, the metric, etc. It will automatically distribute for you, you will get different models running in parallel, and finally you can pick the optimized best one out of all your hyperparameter configurations. You should also be able to do grid search strategies and random search strategies; these are the strategies data scientists normally use to run different kinds of parameter tuning and then pick the best out of it.

Now, as a part of the Snowflake ML API, we also have a bit of hyperparameter optimization that comes out of the box. You can see the code is very simple and very similar to when you use scikit-learn or anything like that.
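The tuner idea, evaluating many configurations in parallel and keeping the best, can be sketched with nothing but the standard library. The parameter names and the quadratic "score" below are made up for illustration; a real train_and_score would fit and validate a model for each configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy hyperparameter tuner: score every configuration concurrently, keep
# the best. The score is a stand-in with its peak at lr=0.1, depth=5.
def train_and_score(config):
    lr, depth = config["learning_rate"], config["max_depth"]
    score = -((lr - 0.1) ** 2 + (depth - 5) ** 2)
    return score, config

# Grid search: the cross product of every candidate value (9 configs here).
grid = [
    {"learning_rate": lr, "max_depth": d}
    for lr, d in product([0.01, 0.1, 0.3], [3, 5, 7])
]

with ThreadPoolExecutor() as pool:              # configs evaluated concurrently
    results = list(pool.map(train_and_score, grid))

best_score, best_config = max(results, key=lambda r: r[0])
print(best_config)  # {'learning_rate': 0.1, 'max_depth': 5}
```

A random search strategy would sample configurations instead of enumerating the full product; either way, the runs are independent, which is exactly why this workload distributes so well across nodes.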
You can see we are doing a GridSearchCV, and you are just specifying all these parameters, but behind the scenes what we do is run everything in a distributed fashion. So with the same code, when you increase the number of nodes, or the size of the compute from medium to large to extra-large, the time goes down, because the automatic distribution happens behind the scenes for you in Snowflake.

Now, this was a question I encountered when I was talking with one of our customers. Their question was: "Hey, I want to build 60 million models for my use case."
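For contrast, this is the familiar open-source GridSearchCV pattern the slide's code resembles; the hosted version exposes a similarly shaped API but distributes the candidate fits on warehouse compute rather than local cores. The synthetic dataset and parameter grid here are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Open-source grid search with cross-validation: every parameter
# combination is trained and scored, and the best one is kept.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=3,          # 3-fold cross-validation per candidate
    n_jobs=-1,     # all local cores; a distributed backend spans nodes instead
)
search.fit(X, y)   # 6 candidates x 3 folds = 18 model fits
print(search.best_params_, round(search.best_score_, 3))
```

Since every candidate-fold fit is independent, moving from "all local cores" to "many nodes" changes where the fits run, not the code you write, which is the point of the medium/large/extra-large comparison on the slide.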
There was only a single use case, but it required 60 million models, because it was a hyper-personalization use case: they wanted a model for every single customer, to understand how much that customer is going to spend or whether they are going to churn, rather than one generic model. So how do we do this? In Snowflake, we solved this problem using a partitioned model training feature. When they were running this on another platform, without any parallelization or distribution, it was taking 18 hours. We ported the code into Snowflake, used this feature, and finished in under 3 hours. That's the difference we are talking about.

This is particularly helpful when there is a partition key, a partition you can really leverage. For example, think about different stores.
You want to understand the demand for different stores and build a model at the store level, or even at the product level: to predict demand for each product, you have to go down to the individual product and build a model there. This is how we do that. Your training data just stays the same, the data in the table; you don't have to partition it or do anything else. What we do is create the partitions for the different stores, train a sub-model for each store, and finally push everything into our model registry as one single model. Any time you want to do inference, you call that single model from the registry, and it picks the corresponding sub-model for that particular store and gives you the results. That is how it works, and there are two versions of it, stateless and stateful. Imagine you are building 60 million models: it is hard to save all 60 million of them.
If you are running this every day, you don't want to save all the models every time. It has to be train-on-the-go: train, run the inference, and you're done; the next day you repeat the same thing. Those kinds of models are called stateless models, because you don't need to save them. Then you have stateful models: you want to save that particular model version, which is the normal behavior when you build a data science model, you save it and then call it to do the inference. If you want to do that, that is also fine, but consider that when you are building millions of models, it is going to take time and it will affect performance as well.

All right, so how do we do this many-model, partitioned inference in Snowflake? It looks very much the same. You write a Python class, here an example forecasting model, just another example, and you put a decorator on it, the custom-model decorator; under its predict function you can write any Python code you want, any algorithm, any packages: normal OSS Python, XGBoost, anything. After that, you log the model into the registry. We log the model into the registry even before training, because we are building 60 million models in parallel and then pushing them to the registry; the registry handles everything behind the scenes for you. So in this case I push the model into the registry first. If you're not familiar with a model registry, think of it as a centralized place where I put all my models, so that later I have auditability and traceability. It also acts as a handshake point between different teams: I push my model into the registry and tell the apps team or the ML engineering team, "you can take the model from there and productionize it," and things like that. So I do that, and the next step is to just call the run function and specify the partition column. This partition column can be the store name, which is very easy: you don't have to physically partition anything.
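As a mental model of what such a partitioned model does, here is a minimal plain-Python sketch. `PartitionedModel`, `fit`, `predict`, and `partition_column` are hypothetical names standing in for the registry-backed custom model described above, not the actual Snowflake API: one logical model holds a sub-model per partition key and dispatches to it at prediction time.

```python
# One logical model, many sub-models: group rows by a partition key,
# "train" one sub-model per group, and dispatch at prediction time.
from statistics import mean

class PartitionedModel:
    def __init__(self):
        self.submodels = {}  # partition value -> fitted sub-model

    def fit(self, rows, partition_column):
        # Group rows by the partition key; no physical partitioning needed.
        groups = {}
        for row in rows:
            groups.setdefault(row[partition_column], []).append(row["sales"])
        # Stand-in for real per-partition training: here just the mean.
        for key, values in groups.items():
            self.submodels[key] = mean(values)
        return self

    def predict(self, row, partition_column):
        # Dispatch to the sub-model for this row's partition.
        return self.submodels[row[partition_column]]

rows = [
    {"store": "A", "sales": 10.0},
    {"store": "A", "sales": 14.0},
    {"store": "B", "sales": 100.0},
]
model = PartitionedModel().fit(rows, partition_column="store")
print(model.predict({"store": "A"}, partition_column="store"))  # -> 12.0
```

Switching from store-level to country-level models is then just a different `partition_column`, with no change to the rest of the code.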
It's just one line of code where you specify which partition column you want to use. If tomorrow I decide I don't want to model at the store level but at the country level, I come back and change the partition column to the country. It only needs a partition key it can use; the code remains the same and you don't have to change anything.

So that's how we build millions of models in parallel in Snowflake.

All right, inference. When it comes to inference, we provide the two usual kinds. The first is batch inference, which we provide through the warehouse. This kind of inference comes distributed out of the box: say you have a million rows and you are just using warehouse inference, it will automatically distribute your data behind the scenes and do the prediction for you.
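The behind-the-scenes behavior just described boils down to chunk, score in parallel, merge. A small sketch of that pattern, where `predict_chunk` and `batch_predict` are illustrative stand-ins, not Snowflake internals:

```python
# Sketch of distributed batch inference: split the input into chunks,
# score each chunk on a worker, then merge the partial results back
# into one ordered result set.
from concurrent.futures import ThreadPoolExecutor

def predict_chunk(chunk):
    # Stand-in for model.predict on one slice of the data.
    return [2 * x + 1 for x in chunk]

def batch_predict(rows, chunk_size=2, max_workers=4):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(predict_chunk, chunks)  # order-preserving
    merged = []
    for part in partials:  # merge partial results into one table
        merged.extend(part)
    return merged

print(batch_predict([0, 1, 2, 3, 4]))  # -> [1, 3, 5, 7, 9]
```

From the caller's point of view this is still a single predict call; only the fan-out and merge happen underneath.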
You can see the inference code here: it's model.predict, and you just run the function. You take your test data and predict with the model; behind the scenes, Snowflake handles the distributed processing, then merges all the results back together and gives them to you as one single table of results. The second kind uses Snowpark Container Services, which is what we use if you want GPU-level inference or, say, real-time inference. With real-time inference there is also load to balance. For example, today I have 10,000 users, so I'm getting 10,000 API calls; tomorrow I start getting 20,000. How do you scale that? That is also handled by Snowflake: you can configure the number of nodes, the scalability.
You say you want this many minimum nodes and this many maximum nodes, and once you start inference in Snowpark Container Services, it will automatically scale for you according to the incoming load. So you don't have to worry about those things.

All right, and finally the most important part: how do I make the results of my AI or ML available to the customers, or to the business users in a company? That is a very important part. So how do we do that? You have probably seen it in every single presentation today, or at least every one I attended at this conference: everybody talks about Streamlit, right? Does anybody know that Streamlit was acquired by Snowflake? Okay, only one hand. Yes, we acquired Streamlit, and it is now available as part of Snowflake; we call it Streamlit in Snowflake, and the open-source version is still available too. But inside Snowflake we provide it as an enterprise-grade app.
We have a native app framework where you can easily build your Streamlit app and share it very seamlessly, with all the RBAC controls, auth, and so on handled by Snowflake. And then you also have the open-source version, which, as I can see from most of the talks, you are all already leveraging.

Now, why run Streamlit in Snowflake? It comes with additional advantages on top of the OSS version. It is fully managed. We are also working on a feature where you can run your Streamlit app on our container runtime, which again means it will be very scalable for you; you won't have to worry about load handling and things like that. Governance and security: you don't have to configure auth, who can log in, and so on; out of the box, that is taken care of by Snowflake, and you can set RBAC-level permissions. And it is completely integrated with Snowflake, so it is easier to build.
We come with an interface where you can go in and start coding your Streamlit app in Python, and in the same window you can see the dashboard being built as you go; it is also integrated with our notebooks, so if you code in a notebook, it is easy to see it there as well.

Now, the summary. Snowflake provides a very simple, easy-to-use platform that helps you scale every single part of your ML workload. If you want high-performance preprocessing, you can leverage the APIs we provide out of the box: the Snowpark API, Snowpark pandas, and the Snowpark ML APIs for your feature engineering and processing. But you are always free to use the OSS Python version, bring in your existing code, or write anything you want and execute it in Snowflake. Moving on to distributed training and tuning: you can leverage CPUs and GPUs very easily in Snowflake without setting up any of these libraries.
Normally, if you want to use a GPU, you have to get started by installing CUDA, cuDNN, and all those libraries, and set everything up from scratch; in Snowflake, it all comes out of the box for you, so you can get started right away. Seamless deployment and serving: you can serve your models on CPU or GPU, whichever way you want, and the distributed machinery is provided out of the box and handled behind the scenes; for the user, it is very similar to writing normal Python code with any other library. And Streamlit, I don't think I have to say much about Streamlit; everybody is using it now. It will help you build all those apps and prototypes more easily and share them with your wider set of users, so they can leverage the results of your AI and ML. All right, that is all for the session today.
If you're curious to know more about Python in Snowflake, this is a QR code you can scan; that GitHub link has a lot of Python recipes that will help you build all of this in a very distributed fashion. All right, that's all for today. Thank you.

[Applause]