Hello everyone. First of all, thank you for coming to my presentation today. Today I hope to teach you about embeddings: what they are, how they work, and how you can use them. If I've done my job right, I hope to make you just as excited by embeddings as I am.

To start off, we're going to go through some of the pre-2013 methods for embeddings and work our way up to state-of-the-art neural networks.

So first of all, what are embeddings? Fundamentally, embeddings are a way to capture information and represent it on a computer. This information is typically language or images, but it can really be anything. For the purposes of this talk, I'm just going to be covering language.

When we say we're going to embed a word, that means to take a word and embed it into a vector space where similar words are closer together. This, in essence, captures semantic meaning and turns it into a vector of numbers that computers can then work with.

So how can computers understand language? Well, maybe we can try to define a giant dictionary full of all the English words and their definitions. But that doesn't really capture any sort of semantic meaning. Maybe we can then order the dictionary so that similar words are closer together. But what happens when a word changes depending on the context? For example, if I'm talking about bugs, am I talking about bugs the insects or bugs the computer errors? Maybe then we could try to hardcode all the different types of context, but I don't think any of us really want to be writing a million if statements.

So it turns out the solution to capturing word meaning begins outside the world of computing, and instead in the world of linguistics.
John Rupert Firth proposed the idea that words with similar meanings will appear in similar contexts. This is called the distributional hypothesis, and it is best described by his 1957 quote: "You shall know a word by the company it keeps."

The central idea here is that words with similar meanings will appear in similar contexts. This means that you can derive a word's meaning from the context in which it is used. This is the basis for the way modern computer science captures semantic meaning in natural language processing.

So prior to the advent and subsequent domination of neural network algorithms in the early 2010s, natural language processing relied heavily on rule-based and symbolic systems, much like the ideas we played around with in the slides before. But there's actually another class of methods, called the statistical methods. Many of these statistical methods are based on word co-occurrence and bag-of-words approaches, as a form of embedding words and text respectively.

So, for example, we can use word co-occurrence matrices: we use the frequency of words within the context window of a certain word to create embeddings. A context window is just the words in front of and behind the word that you're looking at, and it can vary in length depending on the model you're using. When you count all the times a word occurs within another word's context window, you get what is called a co-occurrence matrix. In this one, for example, along the top I've just taken a subset of the words, and likewise down the side, but in reality you're going to get a matrix that's as long as it is wide. Along the top we have a subset, and this will often be thousands to tens of thousands of words long, depending on how big your documents are.
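To make the co-occurrence idea concrete, here is a minimal sketch (not from the talk) that counts how often words fall inside a two-word context window over a toy corpus; the corpus itself is made up for illustration.

```python
from collections import defaultdict

# Toy illustration: build a co-occurrence matrix by counting how often
# words appear within a +/- 2-word context window of each other.
corpus = [
    "digital computer data information".split(),
    "cherry strawberry pie sugar".split(),
    "digital information data computer".split(),
]
window = 2
cooc = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sentence[j]] += 1

# How often "computer" fell inside "digital"'s context window.
print(cooc["digital"]["computer"])
```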
So the word "digital", for example, has had the word "computer" occur 1,670 times within its context window, and the word "data" occur 1,683 times. This makes it very dissimilar to words like "cherry" and "strawberry", which have very few occurrences of those words but score highly on words like "pie" and "sugar".

If you take just two dimensions out of this matrix, you can actually use it to plot the words. So if we just use the dimensions "computer" and "data", we can see that the words "digital" and "information" are actually very close together. And often, when we're measuring closeness with vectors, we're not using how far apart the points are; instead we're measuring the angle between them. The angle between these two words is very small. This is called cosine similarity, and it would be pretty high in this case.

An adjacent concept to this is bag of words. To create document embeddings we use a bag of words, which means we encode each document as a count of all the words it contains. So if we have these three documents, this would be our vocabulary, which is just all the unique words in our documents, and these would be the word frequencies. What we are left with are document embeddings, and this is a useful way of comparing documents based on the words that occur within them, but it doesn't make any attempt to compare the documents based on the meaning of the words within them.

So both of these families of statistical methods use the frequency of occurrences to derive some notion of similarity. The idea is that similar words will have similar numbers of co-occurrences, and likewise similar documents will have similar word frequencies.
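Here is a short sketch (my own toy example, not the documents on the slide) of bag-of-words document embeddings plus cosine similarity. It also hints at the limitation mentioned above: two documents with identical word counts score a perfect 1.0 even if they mean different things.

```python
import math
from collections import Counter

docs = [
    "the cat chased the dog",
    "the dog chased the cat",
    "my code has a bug",
]
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]          # one count per vocabulary word

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vectors = [bag_of_words(d) for d in docs]
print(cosine(vectors[0], vectors[1]))  # 1.0: same word counts, different meaning
print(cosine(vectors[0], vectors[2]))  # much lower: almost no shared words
```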
These methods produce vectors with a large number of dimensions, because the size of your vocabulary is usually how many dimensions you end up with. They also produce what are called sparse embeddings, or sparse vectors, because they mostly contain zeros, since most words never occur next to each other.

By the early 2010s, neural networks were experiencing a bit of a renaissance in computer science, driven by advances in compute power and by neural networks demonstrating how good they are at classifying whether an image is a cat, a dog, or a muffin.

So the introduction of word2vec in 2013, by Tomas Mikolov and his team at Google, combined natural language processing with neural networks. Instead of just counting the word neighbors like we did in the previous methods, we feed the neural network examples of words that are neighbors and examples of words that are not neighbors, and then we force it to predict which is which. For its predictions to be accurate, the neural network is forced to capture some sort of deeper semantic understanding of the words.

The training data for the word2vec model is nothing complex. All you need to do is process your data to create a bunch of word pairs. You can do this by looking at the words that occur within the context window of a specific word and labeling them as neighbors. So, for example, "hard" has a context window with the word "implementation" in it, and we'd want the neural network to predict that those two words are neighbors. We do the same thing for "is" until we get a list for that word, then we iterate to the next word, "to", and we do the same thing for all the words in all of our documents, until eventually we get a giant list of all the words and their neighbors.
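A minimal sketch of that pair-generation step, using a sentence in the spirit of the one on the slide (the exact sentence and window size are my assumptions):

```python
# Generate "neighbor" training pairs by sliding a +/- 2-word context
# window over a sentence.
sentence = "if the implementation is hard to explain it is a bad idea".split()
window = 2

pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j], 1))  # 1 = these words are neighbors

print(pairs[:4])
# [('if', 'the', 1), ('if', 'implementation', 1),
#  ('the', 'if', 1), ('the', 'implementation', 1)]
```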
Equally as important is that you also randomly sample words that don't occur together and label them as non-neighbors. If you don't do this, the neural network could get away with predicting everything as neighbors and it would be 100% accurate.

Before we move on to discuss more about the word2vec algorithm, we need to quickly do a bit of a primer on neural networks. Neural networks are structured so that they take data as input and transform it, via their layers, into a desired output. So, for example, if you were to train a cat-classifying neural network, the inputs would be all the pixel values of the image, and those values would then be transformed through the layers into, say, a binary output that represents whether the image is a cat or not.

Essentially, neural networks are really good at learning patterns, and much like a child, they need to be exposed to examples and feedback to become good at recognizing something. Unlike a child, a neural network stores its learnings in matrices that are often referred to as its weights. In this diagram, these would be the neural network's weights. The weights are responsible for propagating the data from one layer to the next. And each time you show a neural network an example, it adjusts its weights to minimize its prediction errors, which improves its ability to make correct predictions on unseen data. Just a bit of a clarification: this is a major simplification, but it should give you just enough intuition to work with.
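As a toy illustration of that picture (entirely my own, not from the talk), here is a two-layer forward pass: the weight matrices are what carry the data from layer to layer, and training would nudge their values to reduce the prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)            # input, e.g. 4 pixel values
W1 = rng.random((4, 3))      # weights: input layer -> hidden layer
W2 = rng.random((3, 1))      # weights: hidden layer -> output layer

hidden = np.tanh(x @ W1)                     # hidden-layer activations
output = 1 / (1 + np.exp(-(hidden @ W2)))    # sigmoid: "probability it's a cat"
print(output)
```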
the 261 00:10:33,200 --> 00:10:37,920 weights and made them into squares for 262 00:10:35,000 --> 00:10:40,880 it to be easier for us to 263 00:10:37,920 --> 00:10:44,279 visualize so the first set of Weights in 264 00:10:40,880 --> 00:10:47,120 the word algorithm is a v byn Matrix 265 00:10:44,279 --> 00:10:49,079 where each row corresponds to a word 266 00:10:47,120 --> 00:10:51,399 from the V dimensional 267 00:10:49,079 --> 00:10:53,360 vocabulary the N columns contain 268 00:10:51,399 --> 00:10:55,959 flirting Point numbers that are used to 269 00:10:53,360 --> 00:10:57,959 represent the word's meaning they are 270 00:10:55,959 --> 00:11:00,480 typically around 300 columns for word 271 00:10:57,959 --> 00:11:02,920 models but there's no limit and they may 272 00:11:00,480 --> 00:11:05,519 they may range between you know 50 to a 273 00:11:02,920 --> 00:11:06,720 th000 where smaller dimensions are more 274 00:11:05,519 --> 00:11:08,680 efficient 275 00:11:06,720 --> 00:11:11,200 computationally but they do tend to 276 00:11:08,680 --> 00:11:12,920 sacrifice their semantic resolution much 277 00:11:11,200 --> 00:11:16,160 like you would when you limit the amount 278 00:11:12,920 --> 00:11:18,320 of pixels you can represent an image 279 00:11:16,160 --> 00:11:20,680 with so these particular weights in the 280 00:11:18,320 --> 00:11:23,079 word Matrix are actually referred to as 281 00:11:20,680 --> 00:11:24,760 its embedding Matrix and as it's 282 00:11:23,079 --> 00:11:27,040 learning to maximize its prediction 283 00:11:24,760 --> 00:11:28,959 accuracy the neuronet is slowly nudging 284 00:11:27,040 --> 00:11:29,760 the numbers in there so that similar 285 00:11:28,959 --> 00:11:32,399 words 286 00:11:29,760 --> 00:11:32,399 have similar 287 00:11:32,720 --> 00:11:36,120 numbers as you may have guessed we're 288 00:11:34,959 --> 00:11:38,880 not actually interested in the 289 00:11:36,120 --> 00:11:40,560 predictive ability of the neuron network 290 00:11:38,880 --> 00:11:42,800 but rather the embedding Matrix that is 291 00:11:40,560 --> 00:11:46,040 created as a byproduct of 292 00:11:42,800 --> 00:11:48,120 it once training is complete the Matrix 293 00:11:46,040 --> 00:11:50,040 is harvested which is what we use for 294 00:11:48,120 --> 00:11:52,360 our word em word 295 00:11:50,040 --> 00:11:54,680 embeddings the trained embedding Matrix 296 00:11:52,360 --> 00:11:56,560 provides dense representations of Words 297 00:11:54,680 --> 00:11:58,839 which means every number in the word 298 00:11:56,560 --> 00:12:01,880 embedding is Meaningful and contributes 299 00:11:58,839 --> 00:12:04,800 to its representation unlike the sparse 300 00:12:01,880 --> 00:12:07,120 predecessors we just talked 301 00:12:04,800 --> 00:12:08,959 about these dense embeddings contain 302 00:12:07,120 --> 00:12:11,160 really interesting properties that 303 00:12:08,959 --> 00:12:13,680 actually are the best way to illustrate 304 00:12:11,160 --> 00:12:15,519 them I think is to compress your 300 305 00:12:13,680 --> 00:12:17,880 Dimensions down to two so you can see 306 00:12:15,519 --> 00:12:19,600 them on a graph so if we look at the 307 00:12:17,880 --> 00:12:21,560 first property for example we can see 308 00:12:19,600 --> 00:12:23,920 that embeddings these embeddings are 309 00:12:21,560 --> 00:12:26,519 fantastic at capturing meaning similar 310 00:12:23,920 --> 00:12:28,760 words like fantastic awesome and amazing 311 00:12:26,519 --> 00:12:30,920 are all on the same location whereas 312 00:12:28,760 --> 
These dense embeddings have some really interesting properties, and I think the best way to illustrate them is to compress the 300 dimensions down to two so you can see them on a graph. If we look at the first property, we can see that these embeddings are fantastic at capturing meaning: similar words like "fantastic", "awesome" and "amazing" all sit in the same location, whereas "terrible", "awful" and "dreadful" are in their own group, and unrelated words like "bug" are all the way off by themselves.

The second thing I find really interesting is the ability of embeddings to preserve semantic relationships, like tense. Here we can see "flew" is to "flying" as "ran" is to "running", and the cool thing is that no one trained the neural network to do this; just by forcing it to predict its word neighbors, it was able to capture these relationships by itself. It also captures real-world information: "Canberra" is to "Australia" as "Beijing" is to "China".

A consequence of these captured relationships is the ability to perform meaningful vector arithmetic. If we take the embedding for "Canberra", subtract the vector for "Australia" and add the vector for "China", we get a vector that is approximately the same as the vector for "Beijing". To prove this, I used the most_similar function in the model I've been using and did the computation in there, and in the top five most similar vectors you get "Beijing" right at the top, followed by "China", "Canberra", "Chinese", and then a typo of "Beijing".
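A sketch of that arithmetic with gensim's most_similar; the pretrained model name is an assumption, since the talk doesn't say which model was used.

```python
import gensim.downloader as api

# Assumed pretrained model: the Google News word2vec vectors.
wv = api.load("word2vec-google-news-300")

# Canberra - Australia + China  ~=  Beijing
print(wv.most_similar(positive=["Canberra", "China"], negative=["Australia"], topn=5))
```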
So up until this point, our embeddings have been static representations of words. The pre-2013 statistical methods used static vectors that were long and sparse; they were long because every dimension literally represented the count of a specific word. Word2vec, on the other hand, uses the learned representations of words from a neural network, meaning that each dimension, while not directly interpretable, is used to represent the meaning of the words. This makes word2vec much more efficient and effective at capturing word meaning, because its dimensionality is a lot smaller.

But it's not perfect; there's a problem. Word2vec produces static word embeddings, so words with multiple meanings end up being encoded as some sort of awkward average over all the contexts in which they appear. For example, this is a general model, and you can see the word "bug" is stuck between "insect" and "error"/"glitch". This model was probably trained on general text, but if you trained your model on, say, biological texts, you'd expect to see "bug" right over by the word "insect", and likewise, with computer science textbooks, you'd expect to see "bug" right next to all the "error" and "glitch" sort of words. So what's clear is that for the best embedding of a word, you really need an embedding for each context it can appear in.

Enter BERT, in 2018, by Jacob Devlin and colleagues at Google. BERT produces dynamic embeddings, making it much more effective at capturing meaning. BERT uses a deep neural network based on the Transformer architecture, and "deep" in this context literally just means that it has more layers, unlike the word2vec network, which had just one.

A bit of a tangent, and I don't expect you to take all of this in, but this is what the Transformer architecture looks like: on the left you have the encoder and on the right you have the decoder. The Transformer architecture is foundational to generative models like GPT-4 and representative models like BERT. BERT uses the encoder part of the Transformer, which takes words and outputs embeddings. GPT-4, on the other hand, utilizes the decoder side, which takes embeddings and outputs words. Both the encoder and the decoder use what is called an attention mechanism, which allows them to focus on the words most important to the context of a word rather than generic words like "is", "the" and "and". This allows the encoder to create high-quality, dynamic embeddings.
This Transformer architecture is behind the recent explosion in AI capabilities, and it's quite literally in their names: GPT-4 stands for Generative Pre-trained Transformer 4, while BERT stands for Bidirectional Encoder Representations from Transformers.

So BERT is a bit different, but it's also much the same. BERT trains its equivalent of the word2vec embedding matrix by predicting missing words in a sentence. So, for "explicit is better than implicit", it should predict "better" if we blank out the word "better". And likewise with sequences of sentences, like "my dog is cute" and "he likes playing", we force the model to predict that those are sequential.

Because BERT produces contextual word embeddings, it takes sentences as inputs rather than single words, and then produces an embedding for each word in the sentence. Rather than harvesting the embedding matrix like we did with word2vec, we keep the neural network intact so that we can continue to produce dynamic word embeddings.

So now that we have context-sensitive word embedding models, given a word, a model will produce a vector that is specific to that word and its context. But what if we want to understand and embed sentences, paragraphs, or even entire documents? Well, it turns out the easiest way to do this is just to average all of the output word embeddings that our BERT model gave us, and this will give you a single vector that represents the entire meaning of your text.
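A minimal sketch of that averaging idea (often called mean pooling), using the Hugging Face transformers library; the library choice and model name are my assumptions, not the speaker's code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the cat chased the dog", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state   # (1, seq_len, 768)

# One contextual vector per token, averaged into a single sentence vector.
sentence_embedding = token_embeddings.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)   # torch.Size([768])
```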
The reason we can't use models like word2vec to do this is that their word embeddings are static and do not change depending on the context. So sentences like "the cat chased the dog" and "the dog chased the cat" would be encoded identically, despite the sentences having different meanings.

A bit more on this. Static embeddings produce the same vector for a word regardless of the context it's used in, so "my code has a bug" and "there's a bug in my soup" would produce the same word embedding for "bug". A model like BERT, on the other hand, produces vectors for a word that are then modified by the presence of the neighboring words. This is the role of the attention mechanism, which selectively modifies the vector based on the relevance of the surrounding words. For example, in "my code has a bug", the presence of the word "code" is highly relevant to the context of the word "bug", and therefore the embedding for "bug" is modified so that it's closer to the word "code". Likewise for "there's a bug in my soup". And as you can see, BERT is using the context on both sides of the word; this is why it's called bidirectional.
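To see this contextual effect directly, here is a sketch (my own, with an assumed model name) that pulls out BERT's vector for "bug" in both sentences and compares them; the two vectors come out clearly different.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bug_vector(sentence):
    # Run the sentence through BERT and pull out the contextual vector
    # for the token "bug".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bug")]

code_bug = bug_vector("my code has a bug")
soup_bug = bug_vector("there's a bug in my soup")

# Well below 1.0: the same word gets a different vector in each context.
print(torch.cosine_similarity(code_bug, soup_bug, dim=0))
```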
Before we move on to a quick demo, a quick recap of the methods we've discussed. Pre-2013 methods use static and sparse vectors to represent words, whereas the word2vec algorithm uses static but dense embeddings. This shift towards dense embeddings hugely improved how we embed words, and it became the go-to method until BERT was released in 2018, which provides dynamic, densely represented word vectors. That doesn't mean word2vec can't still be used, and it often still is, since it's a lot more efficient; it's just not used where context is highly important to what you're trying to do.

So we're going to do a quick demo using the descriptions of the PyCon presentations to see which ones are most similar. We'll use a library called sentence-transformers, or SBERT, which provides great sentence embedding models trained using BERT.

The first thing we do is load in all the PyCon session titles and their descriptions, and we also load a sentence transformer model and a cross-encoder model; I'll talk more about the cross-encoder model later.

Using the sentence transformer model, we calculate the sentence embeddings for all of our descriptions, all in one line of code, and then, like we talked about before, we can compress these embeddings down to just two dimensions and plot them.

This plot is pretty cool. There aren't really any strong clusters, except for maybe the top left, where we have all the education and teaching related talks, and the top right, where we have all the Django and database related talks. In the middle there are clusters, but they're not very strong, which probably goes to show that the PyCon team did a great job at choosing a very diverse set of topics.

So let's say we want to find the talks most similar to "Teaching Digital Technologies in Australian Schools with Python and the Kookaberry". We get its index, and then we use PyTorch to compare the distance, using cosine similarity, between the target embedding and all of the other talk embeddings, and we take the top 10 closest embeddings to our target. This is just like before, when we had two-dimensional angles; now we have 300-dimensional angles, and we're calculating which ones are closest.

Under the hood it looks a bit like this: we take two embeddings and put them through a cosine similarity function, and that returns a number from negative one to one, where negative one means the embeddings point in completely opposite directions and one means they point in the same direction, i.e. they're parallel.
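A sketch of that retrieval step with sentence-transformers; the model name and the hypothetical load_descriptions() helper are assumptions, not the speaker's exact code.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = load_descriptions()     # hypothetical: list of session description strings
embeddings = model.encode(descriptions, convert_to_tensor=True)   # one line, as in the talk

target_idx = 42                        # index of the talk we want matches for
scores = util.cos_sim(embeddings[target_idx], embeddings)[0]      # cosine similarity to every talk
top = scores.topk(11)                  # 11 because the best hit is the talk itself
for score, idx in zip(top.values, top.indices):
    print(f"{score.item():.3f}  {descriptions[int(idx)][:60]}")
```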
It seems to have done a pretty good job, because these talks are all mostly related to education, teaching and schools. But one thing we notice is MicroPython, which I don't think is specifically about education in schools. So we can probably do a bit better here, and that's where the cross-encoder comes in.

Using just the top 10 talks, we compare our target talk description, one at a time, with each of the top 10 talk descriptions. This is what the cross-encoder is for. It's also a BERT model, but instead of producing word embeddings it's been optimized to produce a score representing the similarity of its inputs. It does this by comparing the embeddings at a word level, rather than measuring the distance between the averaged word embeddings of the previous model. This makes it much more precise at judging which texts are similar; however, it's far less efficient, and that's why it's often used as a second step, to rerank the results of the first, much more efficient step. That saves you from reranking a much larger set of documents.
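A sketch of that reranking step; the cross-encoder model name is an assumption, and target_description and top10_descriptions are assumed to come from the previous retrieval step.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

# Score the target description against each of the top-10 candidates.
pairs = [(target_description, candidate) for candidate in top10_descriptions]
scores = reranker.predict(pairs)            # one similarity score per pair

reranked = sorted(zip(scores, top10_descriptions), reverse=True)
for score, description in reranked:
    print(f"{score:.3f}  {description[:60]}")
```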
If we then go back to the results, we can see that MicroPython, for example, has been reranked all the way down to the bottom, which goes to show it probably wasn't related to education. I've actually got a live web app version of this which will let you select a talk and see the most similar talks, and I'll make that available right at the end.

Embeddings can be used for a whole range of tasks, and I want to go through a few more cool applications. This one is probably my favorite: you can actually see how the meaning of words has changed over time. If you were to train a word2vec model on documents from, say, the 1850s, you'd see that the word "broadcast" is used in a context very similar to farming and sowing seeds, you know, broadcasting the seeds. Move to the 1900s and "broadcast" becomes about newspapers, and by the 1990s "broadcast" quickly becomes about television, radio and even the BBC. Another one I find very cool is "awful": in the 1850s, "awful" meant something closer to full of awe, amazing, majestic, but through the 1900s and into the 1990s it very quickly becomes a negative word meaning terrible or horrible.

Another application is visualizing embeddings. I know I've spoken about this before, but it's really good for exploratory data analysis, and if the graph you're using is interactive you can get a quick idea of what makes up all the clusters and get a real understanding of your data if you haven't seen it before. In this example, you can see that the non-fiction genre of books sits on the complete opposite side to science fiction, which intuitively makes sense.

One of the bigger applications is vector databases. Vector databases essentially store all the vectors for your embeddings, which allows you to query them using embeddings as well. So, for example, if you wanted to build a document search, you could embed your query and find the documents most similar to it. As far as I'm aware, this is actually used in Google and Yahoo search as well, to get more of a semantic understanding of what you're searching for.

A common use case in modern applications is generative LLMs. A generative LLM generally isn't going to be trained on your data, so to give it access to data it hasn't been trained on, you can embed the user's prompt, query the vector database with it, retrieve the most relevant chunks of text or documents, append those to the prompt, and then feed the prompt and the relevant chunks into your LLM, which will then produce relevant information. That's called retrieval augmented generation, and I believe there's a workshop on that this Monday, so that would be good. It's really good for chatbots.
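A rough sketch of that retrieval-augmented flow, under stated assumptions: the model name, the chunks list, and the llm_generate() function are all placeholders (llm_generate stands in for whatever LLM API you call), not a real or complete implementation.

```python
from sentence_transformers import SentenceTransformer, util

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embed_model.encode(chunks, convert_to_tensor=True)   # chunks: your documents

def answer(prompt, k=3):
    # 1. Embed the user's prompt and retrieve the k most relevant chunks.
    prompt_embedding = embed_model.encode(prompt, convert_to_tensor=True)
    hits = util.semantic_search(prompt_embedding, chunk_embeddings, top_k=k)[0]
    context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
    # 2. Append the retrieved chunks to the prompt and hand it to the LLM.
    return llm_generate(f"Answer using this context:\n{context}\n\nQuestion: {prompt}")
```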
Another thing you can do is fine-tune your sentence embedding models to fit your own definition of similarity. Out of the box, this model is doing what it should: all the product descriptions sit close together. But if you fine-tune it, you can get it to, for example, group text by positive and negative sentiment instead.

So that's all the applications I've brought up today, but it really is up to your imagination what you do with embeddings.

As promised, here's a QR code, or a link if you prefer links, to the interactive demo. I haven't really tested it on anyone except myself, so we'll see how it goes. There's probably no time for questions, but if you have any, please feel free to come and see me outside. Thank you.