Hello everyone. First of all, thank you for coming to my presentation today. Today I hope to teach you about embeddings: what they are, how they work, and how you can use them. If I've done my job right, I hope to make you just as excited by embeddings as I am.

To start off, we're going to go through some of the pre-2013 methods for embeddings and work our way up to state-of-the-art neural networks.

So first of all, what are embeddings? Fundamentally, embeddings are a way to capture information and represent it on a computer. This information is typically language or images, but it can really be anything. For the purposes of this talk, I'm just going to be covering language.

When we say we're going to embed a word, that means to take a word and embed it into a vector space where similar words are closer together. This, in essence, captures semantic meaning and turns it into a vector of numbers that computers can then work with.

So how can computers understand language? Well, maybe we can try to define a giant dictionary full of all the English words and their definitions. But that doesn't really capture any sort of semantic meaning. Maybe we can then order the dictionary so that similar words are closer together. But what happens when a word changes depending on the context? For example, if I'm talking about bugs, am I talking about bugs the insects or bugs the computer errors? Maybe then we could try to hardcode all the different types of context, but I don't think any of us really want to be writing a million if statements.

So it turns out the solution to capturing word meaning begins outside the world of computing, and instead in the world of linguistics.
John Rupert Firth proposed the idea that words with similar meanings will appear in similar contexts. This is called the distributional hypothesis, and it is best described by his 1957 quote: "You shall know a word by the company it keeps."

The central idea here is that words with similar meanings will appear in similar contexts. This means that you can derive a word's meaning from the context in which it is used. This is the basis for the way modern computer science captures semantic meaning in natural language processing.

So prior to the advent and subsequent domination of neural network algorithms in the early 2010s, natural language processing relied heavily on rule-based and symbolic systems, much like the ideas we played around with in the slides before. But there's actually another class of methods, called the statistical methods. Many of these statistical methods are based on word co-occurrence and bag-of-words approaches, as a form of embedding words and text respectively.

So, for example, we can use word co-occurrence matrices: we use the frequency of words within the context window of a certain word to create embeddings. A context window is just the words in front of and behind the word that you're looking at, and it can vary in length depending on the model you're using. When you count all the times a word occurs within another word's context window, you get what is called a co-occurrence matrix. In this one, for example, along the top I've just taken a subset of the words, and likewise down the side, but in reality you're going to get a matrix that's as long as it is wide. Along the top we have a subset, and this will often be thousands to tens of thousands of words long, depending on how big your documents are.
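To make the co-occurrence idea concrete, here is a minimal sketch (not from the talk) that counts how often words fall inside a two-word context window over a toy corpus; the corpus itself is made up for illustration.

```python
from collections import defaultdict

# Toy illustration: build a co-occurrence matrix by counting how often
# words appear within a +/- 2-word context window of each other.
corpus = [
    "digital computer data information".split(),
    "cherry strawberry pie sugar".split(),
    "digital information data computer".split(),
]
window = 2
cooc = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sentence[j]] += 1

# How often "computer" fell inside "digital"'s context window.
print(cooc["digital"]["computer"])
```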
So the word "digital", for example, has had the word "computer" occur 1,670 times within its context window, and the word "data" occur 1,683 times. This makes it very dissimilar to words like "cherry" and "strawberry", which have very few occurrences of those words but score highly on words like "pie" and "sugar".

If you take just two dimensions out of this matrix, you can actually use it to plot the words. So if we just use the dimensions "computer" and "data", we can see that the words "digital" and "information" are actually very close together. And often, when we're measuring closeness with vectors, we're not using how far apart the points are; instead we're measuring the angle between them. The angle between these two words is very small. This is called cosine similarity, and it would be pretty high in this case.

An adjacent concept to this is bag of words. To create document embeddings we use a bag of words, which means we encode each document as a count of all the words it contains. So if we have these three documents, this would be our vocabulary, which is just all the unique words in our documents, and these would be the word frequencies. What we are left with are document embeddings, and this is a useful way of comparing documents based on the words that occur within them, but it doesn't make any attempt to compare the documents based on the meaning of the words within them.

So both of these families of statistical methods use the frequency of occurrences to derive some notion of similarity. The idea is that similar words will have similar numbers of co-occurrences, and likewise similar documents will have similar word frequencies.
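Here is a short sketch (my own toy example, not the documents on the slide) of bag-of-words document embeddings plus cosine similarity. It also hints at the limitation mentioned above: two documents with identical word counts score a perfect 1.0 even if they mean different things.

```python
import math
from collections import Counter

docs = [
    "the cat chased the dog",
    "the dog chased the cat",
    "my code has a bug",
]
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]          # one count per vocabulary word

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vectors = [bag_of_words(d) for d in docs]
print(cosine(vectors[0], vectors[1]))  # 1.0: same word counts, different meaning
print(cosine(vectors[0], vectors[2]))  # much lower: almost no shared words
```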
These methods produce vectors with a large number of dimensions, because the size of your vocabulary is usually how many dimensions you end up with. They also produce what are called sparse embeddings, or sparse vectors, because they mostly contain zeros, since most words never occur next to each other.

By the early 2010s, neural networks were experiencing a bit of a renaissance in computer science, driven by advances in compute power and by neural networks demonstrating how good they are at classifying whether an image is a cat, a dog, or a muffin.

So the introduction of word2vec in 2013, by Tomas Mikolov and his team at Google, combined natural language processing with neural networks. Instead of just counting the word neighbors like we did in the previous methods, we feed the neural network examples of words that are neighbors and examples of words that are not neighbors, and then we force it to predict which is which. For its predictions to be accurate, the neural network is forced to capture some sort of deeper semantic understanding of the words.

The training data for the word2vec model is nothing complex. All you need to do is process your data to create a bunch of word pairs. You can do this by looking at the words that occur within the context window of a specific word and labeling them as neighbors. So, for example, "hard" has a context window with the word "implementation" in it, and we'd want the neural network to predict that those two words are neighbors. We do the same thing for "is" until we get a list for that word, then we iterate to the next word, "to", and we do the same thing for all the words in all of our documents, until eventually we get a giant list of all the words and their neighbors.
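A minimal sketch of that pair-generation step, using a sentence in the spirit of the one on the slide (the exact sentence and window size are my assumptions):

```python
# Generate "neighbor" training pairs by sliding a +/- 2-word context
# window over a sentence.
sentence = "if the implementation is hard to explain it is a bad idea".split()
window = 2

pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j], 1))  # 1 = these words are neighbors

print(pairs[:4])
# [('if', 'the', 1), ('if', 'implementation', 1),
#  ('the', 'if', 1), ('the', 'implementation', 1)]
```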
Equally as important is that you also randomly sample words that don't occur together and label them as non-neighbors. If you don't do this, the neural network could get away with predicting everything as neighbors and it would be 100% accurate.

Before we move on to discuss more about the word2vec algorithm, we need to quickly do a bit of a primer on neural networks. Neural networks are structured so that they take data as input and transform it, via their layers, into a desired output. So, for example, if you were to train a cat-classifying neural network, the inputs would be all the pixel values of the image, and those values would then be transformed through the layers into, say, a binary output that represents whether the image is a cat or not.

Essentially, neural networks are really good at learning patterns, and much like a child, they need to be exposed to examples and feedback to become good at recognizing something. Unlike a child, a neural network stores its learnings in matrices that are often referred to as its weights. In this diagram, these would be the neural network's weights. The weights are responsible for propagating the data from one layer to the next. And each time you show a neural network an example, it adjusts its weights to minimize its prediction errors, which improves its ability to make correct predictions on unseen data. Just a bit of a clarification: this is a major simplification, but it should give you just enough intuition to work with.
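As a toy illustration of that picture (entirely my own, not from the talk), here is a two-layer forward pass: the weight matrices are what carry the data from layer to layer, and training would nudge their values to reduce the prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)            # input, e.g. 4 pixel values
W1 = rng.random((4, 3))      # weights: input layer -> hidden layer
W2 = rng.random((3, 1))      # weights: hidden layer -> output layer

hidden = np.tanh(x @ W1)                     # hidden-layer activations
output = 1 / (1 + np.exp(-(hidden @ W2)))    # sigmoid: "probability it's a cat"
print(output)
```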
the 261 00:10:33,200 --> 00:10:37,920 weights and made them into squares for 262 00:10:35,000 --> 00:10:40,880 it to be easier for us to 263 00:10:37,920 --> 00:10:44,279 visualize so the first set of Weights in 264 00:10:40,880 --> 00:10:47,120 the word algorithm is a v byn Matrix 265 00:10:44,279 --> 00:10:49,079 where each row corresponds to a word 266 00:10:47,120 --> 00:10:51,399 from the V dimensional 267 00:10:49,079 --> 00:10:53,360 vocabulary the N columns contain 268 00:10:51,399 --> 00:10:55,959 flirting Point numbers that are used to 269 00:10:53,360 --> 00:10:57,959 represent the word's meaning they are 270 00:10:55,959 --> 00:11:00,480 typically around 300 columns for word 271 00:10:57,959 --> 00:11:02,920 models but there's no limit and they may 272 00:11:00,480 --> 00:11:05,519 they may range between you know 50 to a 273 00:11:02,920 --> 00:11:06,720 th000 where smaller dimensions are more 274 00:11:05,519 --> 00:11:08,680 efficient 275 00:11:06,720 --> 00:11:11,200 computationally but they do tend to 276 00:11:08,680 --> 00:11:12,920 sacrifice their semantic resolution much 277 00:11:11,200 --> 00:11:16,160 like you would when you limit the amount 278 00:11:12,920 --> 00:11:18,320 of pixels you can represent an image 279 00:11:16,160 --> 00:11:20,680 with so these particular weights in the 280 00:11:18,320 --> 00:11:23,079 word Matrix are actually referred to as 281 00:11:20,680 --> 00:11:24,760 its embedding Matrix and as it's 282 00:11:23,079 --> 00:11:27,040 learning to maximize its prediction 283 00:11:24,760 --> 00:11:28,959 accuracy the neuronet is slowly nudging 284 00:11:27,040 --> 00:11:29,760 the numbers in there so that similar 285 00:11:28,959 --> 00:11:32,399 words 286 00:11:29,760 --> 00:11:32,399 have similar 287 00:11:32,720 --> 00:11:36,120 numbers as you may have guessed we're 288 00:11:34,959 --> 00:11:38,880 not actually interested in the 289 00:11:36,120 --> 00:11:40,560 predictive ability of the neuron network 290 00:11:38,880 --> 00:11:42,800 but rather the embedding Matrix that is 291 00:11:40,560 --> 00:11:46,040 created as a byproduct of 292 00:11:42,800 --> 00:11:48,120 it once training is complete the Matrix 293 00:11:46,040 --> 00:11:50,040 is harvested which is what we use for 294 00:11:48,120 --> 00:11:52,360 our word em word 295 00:11:50,040 --> 00:11:54,680 embeddings the trained embedding Matrix 296 00:11:52,360 --> 00:11:56,560 provides dense representations of Words 297 00:11:54,680 --> 00:11:58,839 which means every number in the word 298 00:11:56,560 --> 00:12:01,880 embedding is Meaningful and contributes 299 00:11:58,839 --> 00:12:04,800 to its representation unlike the sparse 300 00:12:01,880 --> 00:12:07,120 predecessors we just talked 301 00:12:04,800 --> 00:12:08,959 about these dense embeddings contain 302 00:12:07,120 --> 00:12:11,160 really interesting properties that 303 00:12:08,959 --> 00:12:13,680 actually are the best way to illustrate 304 00:12:11,160 --> 00:12:15,519 them I think is to compress your 300 305 00:12:13,680 --> 00:12:17,880 Dimensions down to two so you can see 306 00:12:15,519 --> 00:12:19,600 them on a graph so if we look at the 307 00:12:17,880 --> 00:12:21,560 first property for example we can see 308 00:12:19,600 --> 00:12:23,920 that embeddings these embeddings are 309 00:12:21,560 --> 00:12:26,519 fantastic at capturing meaning similar 310 00:12:23,920 --> 00:12:28,760 words like fantastic awesome and amazing 311 00:12:26,519 --> 00:12:30,920 are all on the same location whereas 312 00:12:28,760 --> 
These dense embeddings have some really interesting properties, and I think the best way to illustrate them is to compress the 300 dimensions down to two so you can see them on a graph. If we look at the first property, we can see that these embeddings are fantastic at capturing meaning: similar words like "fantastic", "awesome" and "amazing" all sit in the same location, whereas "terrible", "awful" and "dreadful" are in their own group, and unrelated words like "bug" are all the way off by themselves.

The second thing I find really interesting is the ability of embeddings to preserve semantic relationships, like tense. Here we can see "flew" is to "flying" as "ran" is to "running", and the cool thing is that no one trained the neural network to do this; just by forcing it to predict its word neighbors, it was able to capture these relationships by itself. It also captures real-world information: "Canberra" is to "Australia" as "Beijing" is to "China".

A consequence of these captured relationships is the ability to perform meaningful vector arithmetic. If we take the embedding for "Canberra", subtract the vector for "Australia" and add the vector for "China", we get a vector that is approximately the same as the vector for "Beijing". To prove this, I used the most_similar function in the model I've been using and did the computation in there, and in the top five most similar vectors you get "Beijing" right at the top, followed by "China", "Canberra", "Chinese", and then a typo of "Beijing".
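A sketch of that arithmetic with gensim's most_similar; the pretrained model name is an assumption, since the talk doesn't say which model was used.

```python
import gensim.downloader as api

# Assumed pretrained model: the Google News word2vec vectors.
wv = api.load("word2vec-google-news-300")

# Canberra - Australia + China  ~=  Beijing
print(wv.most_similar(positive=["Canberra", "China"], negative=["Australia"], topn=5))
```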
So up until this point, our embeddings have been static representations of words. The pre-2013 statistical methods used static vectors that were long and sparse; they were long because every dimension literally represented the count of a specific word. Word2vec, on the other hand, uses the learned representations of words from a neural network, meaning that each dimension, while not directly interpretable, is used to represent the meaning of the words. This makes word2vec much more efficient and effective at capturing word meaning, because its dimensionality is a lot smaller.

But it's not perfect; there's a problem. Word2vec produces static word embeddings, so words with multiple meanings end up being encoded as some sort of awkward average over all the contexts in which they appear. For example, this is a general model, and you can see the word "bug" is stuck between "insect" and "error"/"glitch". This model was probably trained on general text, but if you trained your model on, say, biological texts, you'd expect to see "bug" right over by the word "insect", and likewise, with computer science textbooks, you'd expect to see "bug" right next to all the "error" and "glitch" sort of words. So what's clear is that for the best embedding of a word, you really need an embedding for each context it can appear in.

Enter BERT, in 2018, by Jacob Devlin and colleagues at Google. BERT produces dynamic embeddings, making it much more effective at capturing meaning. BERT uses a deep neural network based on the Transformer architecture, and "deep" in this context literally just means that it has more layers, unlike the word2vec network, which had just one.

A bit of a tangent, and I don't expect you to take all of this in, but this is what the Transformer architecture looks like: on the left you have the encoder and on the right you have the decoder. The Transformer architecture is foundational to generative models like GPT-4 and representative models like BERT. BERT uses the encoder part of the Transformer, which takes words and outputs embeddings. GPT-4, on the other hand, utilizes the decoder side, which takes embeddings and outputs words. Both the encoder and the decoder use what is called an attention mechanism, which allows them to focus on the words most important to the context of a word rather than generic words like "is", "the" and "and". This allows the encoder to create high-quality, dynamic embeddings.
This Transformer architecture is behind the recent explosion in AI capabilities, and it's quite literally in their names: GPT-4 stands for Generative Pre-trained Transformer 4, while BERT stands for Bidirectional Encoder Representations from Transformers.

So BERT is a bit different, but it's also much the same. BERT trains its equivalent of the word2vec embedding matrix by predicting missing words in a sentence. So, for "explicit is better than implicit", it should predict "better" if we blank out the word "better". And likewise with sequences of sentences, like "my dog is cute" and "he likes playing", we force the model to predict that those are sequential.

Because BERT produces contextual word embeddings, it takes sentences as inputs rather than single words, and then produces an embedding for each word in the sentence. Rather than harvesting the embedding matrix like we did with word2vec, we keep the neural network intact so that we can continue to produce dynamic word embeddings.

So now that we have context-sensitive word embedding models, given a word, a model will produce a vector that is specific to that word and its context. But what if we want to understand and embed sentences, paragraphs, or even entire documents? Well, it turns out the easiest way to do this is just to average all of the output word embeddings that our BERT model gave us, and this will give you a single vector that represents the entire meaning of your text.
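A minimal sketch of that averaging idea (often called mean pooling), using the Hugging Face transformers library; the library choice and model name are my assumptions, not the speaker's code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the cat chased the dog", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state   # (1, seq_len, 768)

# One contextual vector per token, averaged into a single sentence vector.
sentence_embedding = token_embeddings.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)   # torch.Size([768])
```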
The reason we can't use models like word2vec to do this is that their word embeddings are static and do not change depending on the context. So sentences like "the cat chased the dog" and "the dog chased the cat" would be encoded identically, despite the sentences having different meanings.

A bit more on this. Static embeddings produce the same vector for a word regardless of the context it's used in, so "my code has a bug" and "there's a bug in my soup" would produce the same word embedding for "bug". A model like BERT, on the other hand, produces vectors for a word that are then modified by the presence of the neighboring words. This is the role of the attention mechanism, which selectively modifies the vector based on the relevance of the surrounding words. For example, in "my code has a bug", the presence of the word "code" is highly relevant to the context of the word "bug", and therefore the embedding for "bug" is modified so that it's closer to the word "code". Likewise for "there's a bug in my soup". And as you can see, BERT is using the context on both sides of the word; this is why it's called bidirectional.
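To see this contextual effect directly, here is a sketch (my own, with an assumed model name) that pulls out BERT's vector for "bug" in both sentences and compares them; the two vectors come out clearly different.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bug_vector(sentence):
    # Run the sentence through BERT and pull out the contextual vector
    # for the token "bug".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bug")]

code_bug = bug_vector("my code has a bug")
soup_bug = bug_vector("there's a bug in my soup")

# Well below 1.0: the same word gets a different vector in each context.
print(torch.cosine_similarity(code_bug, soup_bug, dim=0))
```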
Before we move on to a quick demo, a quick recap of the methods we've discussed. Pre-2013 methods use static and sparse vectors to represent words, whereas the word2vec algorithm uses static but dense embeddings. This shift towards dense embeddings hugely improved how we embed words, and it became the go-to method until BERT was released in 2018, which provides dynamic, densely represented word vectors. That doesn't mean word2vec can't still be used, and it often still is, since it's a lot more efficient; it's just not used where context is highly important to what you're trying to do.

So we're going to do a quick demo using the descriptions of the PyCon presentations to see which ones are most similar. We'll use a library called sentence-transformers, or SBERT, which provides great sentence embedding models trained using BERT.

The first thing we do is load in all the PyCon session titles and their descriptions, and we also load a sentence transformer model and a cross-encoder model; I'll talk more about the cross-encoder model later.

Using the sentence transformer model, we calculate the sentence embeddings for all of our descriptions, all in one line of code, and then, like we talked about before, we can compress these embeddings down to just two dimensions and plot them.

This plot is pretty cool. There aren't really any strong clusters, except for maybe the top left, where we have all the education and teaching related talks, and the top right, where we have all the Django and database related talks. In the middle there are clusters, but they're not very strong, which probably goes to show that the PyCon team did a great job at choosing a very diverse set of topics.

So let's say we want to find the talks most similar to "Teaching Digital Technologies in Australian Schools with Python and the Kookaberry". We get its index, and then we use PyTorch to compare the distance, using cosine similarity, between the target embedding and all of the other talk embeddings, and we take the top 10 closest embeddings to our target. This is just like before, when we had two-dimensional angles; now we have 300-dimensional angles, and we're calculating which ones are closest.

Under the hood it looks a bit like this: we take two embeddings and put them through a cosine similarity function, and that returns a number from negative one to one, where negative one means the embeddings point in completely opposite directions and one means they point in the same direction, i.e. they're parallel.
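A sketch of that retrieval step with sentence-transformers; the model name and the hypothetical load_descriptions() helper are assumptions, not the speaker's exact code.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = load_descriptions()     # hypothetical: list of session description strings
embeddings = model.encode(descriptions, convert_to_tensor=True)   # one line, as in the talk

target_idx = 42                        # index of the talk we want matches for
scores = util.cos_sim(embeddings[target_idx], embeddings)[0]      # cosine similarity to every talk
top = scores.topk(11)                  # 11 because the best hit is the talk itself
for score, idx in zip(top.values, top.indices):
    print(f"{score.item():.3f}  {descriptions[int(idx)][:60]}")
```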
It seems to have done a pretty good job, because these talks are all mostly related to education, teaching and schools. But one thing we notice is MicroPython, which I don't think is specifically about education in schools. So we can probably do a bit better here, and that's where the cross-encoder comes in.

Using just the top 10 talks, we compare our target talk description, one at a time, with each of the top 10 talk descriptions. This is what the cross-encoder is for. It's also a BERT model, but instead of producing word embeddings it's been optimized to produce a score representing the similarity of its inputs. It does this by comparing the embeddings at a word level, rather than measuring the distance between the averaged word embeddings of the previous model. This makes it much more precise at judging which texts are similar; however, it's far less efficient, and that's why it's often used as a second step, to rerank the results of the first, much more efficient step. That saves you from reranking a much larger set of documents.
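A sketch of that reranking step; the cross-encoder model name is an assumption, and target_description and top10_descriptions are assumed to come from the previous retrieval step.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

# Score the target description against each of the top-10 candidates.
pairs = [(target_description, candidate) for candidate in top10_descriptions]
scores = reranker.predict(pairs)            # one similarity score per pair

reranked = sorted(zip(scores, top10_descriptions), reverse=True)
for score, description in reranked:
    print(f"{score:.3f}  {description[:60]}")
```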
If we then go back to the results, we can see that MicroPython, for example, has been reranked all the way down to the bottom, which goes to show it probably wasn't related to education. I've actually got a live web app version of this which will let you select a talk and see the most similar talks, and I'll make that available right at the end.

Embeddings can be used for a whole range of tasks, and I want to go through a few more cool applications. This one is probably my favorite: you can actually see how the meaning of words has changed over time. If you were to train a word2vec model on documents from, say, the 1850s, you'd see that the word "broadcast" is used in a context very similar to farming and sowing seeds, you know, broadcasting the seeds. Move to the 1900s and "broadcast" becomes about newspapers, and by the 1990s "broadcast" quickly becomes about television, radio and even the BBC. Another one I find very cool is "awful": in the 1850s, "awful" meant something closer to full of awe, amazing, majestic, but through the 1900s and into the 1990s it very quickly becomes a negative word meaning terrible or horrible.

Another application is visualizing embeddings. I know I've spoken about this before, but it's really good for exploratory data analysis, and if the graph you're using is interactive you can get a quick idea of what makes up all the clusters and get a real understanding of your data if you haven't seen it before. In this example, you can see that the non-fiction genre of books sits on the complete opposite side to science fiction, which intuitively makes sense.

One of the bigger applications is vector databases. Vector databases essentially store all the vectors for your embeddings, which allows you to query them using embeddings as well. So, for example, if you wanted to build a document search, you could embed your query and find the documents most similar to it. As far as I'm aware, this is actually used in Google and Yahoo search as well, to get more of a semantic understanding of what you're searching for.

A common use case in modern applications is generative LLMs. A generative LLM generally isn't going to be trained on your data, so to give it access to data it hasn't been trained on, you can embed the user's prompt, query the vector database with it, retrieve the most relevant chunks of text or documents, append those to the prompt, and then feed the prompt and the relevant chunks into your LLM, which will then produce relevant information. That's called retrieval augmented generation, and I believe there's a workshop on that this Monday, so that would be good. It's really good for chatbots.
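A rough sketch of that retrieval-augmented flow, under stated assumptions: the model name, the chunks list, and the llm_generate() function are all placeholders (llm_generate stands in for whatever LLM API you call), not a real or complete implementation.

```python
from sentence_transformers import SentenceTransformer, util

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embed_model.encode(chunks, convert_to_tensor=True)   # chunks: your documents

def answer(prompt, k=3):
    # 1. Embed the user's prompt and retrieve the k most relevant chunks.
    prompt_embedding = embed_model.encode(prompt, convert_to_tensor=True)
    hits = util.semantic_search(prompt_embedding, chunk_embeddings, top_k=k)[0]
    context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
    # 2. Append the retrieved chunks to the prompt and hand it to the LLM.
    return llm_generate(f"Answer using this context:\n{context}\n\nQuestion: {prompt}")
```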
Another thing you can do is fine-tune your sentence embedding models to fit your own definition of similarity. Out of the box, this model is doing what it should: all the product descriptions sit close together. But if you fine-tune it, you can get it to, for example, group text by positive and negative sentiment instead.

So that's all the applications I've brought up today, but it really is up to your imagination what you do with embeddings.

As promised, here's a QR code, or a link if you prefer links, to the interactive demo. I haven't really tested it on anyone except myself, so we'll see how it goes. There's probably no time for questions, but if you have any, please feel free to come and see me outside. Thank you.