Welcome back, and good evening from Wellington. We have Greg Baker with us here today. Greg Baker is an entrepreneur, author, translator, and an internationally awarded composer and musician. He also codes a bit. He's been running software that populates the Leaftop database, which has the goal of being the largest lexicon, and he is also building a universal grammar extractor which can currently inflect a plural from a singular for 11 percent of the world's nouns. This talk is for language geeks and machine learning nerds. Over to you, Greg.

Well: some tentative first steps towards a Star Trek universal communicator. I've noticed that a lot of people have been giving acknowledgements of the land that they're on at the moment, and I thought it would be really appropriate to give an acknowledgement of the languages of the land that I'm bringing this from. Unfortunately, I can't, and that's because the Wallumedegal people had their culture so completely destroyed by colonisation that we don't even know what they called their language; it's called "the Sydney language" because we don't know what else to call it. That's kind of tragic. When we lose a language we lose a part of what it is to be human, and so it's rather sad to hear that we estimate something like 50 to 90 percent of the languages spoken today will be dead by the year 2100.

There's a kind of cycle that we've seen happen time and time again that causes languages to die. You can start anywhere around the circle, rotate around, and after a few iterations the language is essentially gone. I'll just do an example, starting in the bottom right-hand corner: children being sent to dominant-language schools.
Take the Wu language, which is one of the great connections we have to middle-era China. It largely died when children were forced to go to Mandarin-speaking schools by the CCP. Moving across the circle to loss of culture: it became just a kitchen language, a language that you hear at home when your parents speak it, but not something you speak with your friends or anybody else you know. The next step around is that you stop identifying with that minority language, and so the population of people who speak it declines, which means people who only speak that language can't communicate with the wider population, so they have poor job prospects, which makes them all the more keen to send their children to schools that teach in the dominant language so they can get themselves out of that poverty rut. And that is happening all over the world at the moment.

I'll give you another example: Quechua. That's 10 million speakers, about half the population of Australia, spread out across South America. It's our last connection to the Inca Empire; it's the language of the empire as it evolved. And it's dying. In fact, people who speak Quechua, if they're in the cities, will probably hide the fact that they can speak it, because of this extreme level of "I don't want to identify with my language and culture."

At this point you're probably thinking this is going to be one of those really depressing talks, but it's not, because there are two things happening right now that are changing this, breaking the cycle of language death in two ways that have never really happened before. The first, which I'll talk about for just a few minutes, is the idea of a language being associated with technology.
When that happens, when you can start interacting with a computer in your own language, it gives you an incredible sense of power and capability. The other half, of course, is that as we get better machine translation, there's the possibility that this may also break the cycle of language death.

Let me talk about language being associated with technology, and make a quick pitch here. There's a language that I'll use in this talk. It goes by three different names, and they're subtly different languages. There's the language called Bislama (the pronunciation depends on where you grew up), which is spoken in Vanuatu; strangely enough, the name is actually the word for a sea slug. Not many people name their languages after sea slugs, but that's what happened. In Papua New Guinea it's called Tok Pisin, with a couple of tiny differences, and in the Solomon Islands it's called Pijin. This is a language that has only really diverged from English for about 120 years. It derives from the period when Australian landholders in particular captured a lot of Melanesians and forced them to work on sugar cane fields and on ships, so you end up with a language that has English vocabulary but Melanesian grammar, and then we've watched it diverge. Now there are something like five to six million speakers of the language I'll call Tok Pisin, mostly in Papua New Guinea but all throughout Melanesia.

So I decided, just for fun, that I would pay Bradley and Jimmy to translate and localise LibreOffice into Bislama. If you want to donate: I'm aiming to raise ten thousand dollars, which will be enough to fully translate and fully localise it.
The effect this has is extraordinary, because there are a lot of people who are not comfortable with computers and not comfortable with English, and that's a double burden. When they see "hey, I can actually operate in my own language on a computer", it's like the world has suddenly opened up. It's like that feeling you get when you first play around with open source software and you suddenly go: hey, this is a community, this is a culture, this is something different. In previous versions of this talk I used other languages which nobody knew. Nobody will know these languages either, but it's kind of fun, because when I sound out some of the words you can get a sense that they sound familiar, and sound right.

Okay, so the other tool in our toolbox is universal translation: what we want is the universal translator from Star Trek. Now, there's something I need to say here that may be a shock to some people, and that is that Star Trek is actually fictional; it's not a depiction of reality. So we don't actually have to follow how it works in Star Trek. In the original series the universal translator worked by scanning the brain waves of the alien species, and in Enterprise it was Hoshi's creation. Either way, what we notice is that the Star Trek universal translator can cope with really small languages. You land on a planet and there are only two people on it, and somehow the crew are able to communicate with these aliens who might be the last of their species. And that's a key point.
Because if you just want to translate big, major languages, that's easy. The process for building a translator for a major language is: assemble a few million pairs of translated documents. The proceedings of the European Union, for example, are a good set: you've got document A and document B, you know that B is a direct translation of A, and it will have been translated sentence for sentence. So you've got a couple of million sentence pairs that you can work with.
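As a rough illustration of that first step (my sketch, not anything shown in the talk; the filenames are hypothetical stand-ins for a line-aligned corpus with one sentence per line):

def load_sentence_pairs(source_path, target_path):
    # Line N of one file is the translation of line N of the other.
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            source, target = source_line.strip(), target_line.strip()
            if source and target:          # skip blank lines
                yield source, target

pairs = list(load_sentence_pairs("corpus.en.txt", "corpus.other.txt"))
print(len(pairs), "sentence pairs; first pair:", pairs[0])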
As of about 2018, the hip new technology is to create a transformer (deep learning with attention mechanisms), and you basically try to get it to predict the next token that comes out of a translation; I'll do an example of that a little later. And it works. Then if you throw in a speech recognition system, you probably only need a few thousand hours of correctly transcribed speech to get into the 90 percent range for speech recognition accuracy, and a bit more than that and you can start getting some really accurate models.

But the problem is that beginning step: first, assemble a few million pairs of translated documents. Just to get a sense of how big that is: in the keynote yesterday Brian Kernighan mentioned, I think, the 7,000 lines of code in Unix version 6, and the commentary on that gets you 254 pages. That's a lot of information; it's enough to have launched an industry for, what is it, 50 years now. And over on the right I took photographs of a large number of quite thick books; they're mostly Bible translations. Now, a Bible translation gets you about 30,000 sentences in your target language. That's a great start, but it's not sufficient on its own. Just to give you a sense: in that stack, most of the books are Bibles, and you can see how thick they are. Well, that's about one thirtieth of what you need translated in order to start building a fully effective major-language translator. So that stack of books, twenty books high, is not enough; you would need more than that kind of corpus. And when you think about it, that's really hard when you're talking about a language with perhaps fewer than a million speakers. Getting that much material translated is a huge job; by the way, a Bible translation, where they've got the process pretty well down pat, takes between five and twenty person-years. Multiply that by the roughly thirty you need and you're talking about 150 to 600 person-years of work to start building a deep-learning-based translator.

When you're talking about low-resource languages you just laugh at that; there's no way. So you do whatever dirty tricks you can to extract as much vocabulary as possible from whatever sources you have. Then step two is to learn the rules of the target language's grammar, and then you should be able to synthetically put together some sentences. You can skip that step if you want: make your machine learning model and try to start generating translations. And then, separately, you probably don't shoot for full speech recognition; just shoot for recognisers that detect whether a particular word was spoken (did we hear this word or not), and that gets you a starting point on some of this translation work.

Well, given that you can't do a really good job unless you have a million translated sentences, what's the best you can do with what little you have? Now, if I've managed to get this to work properly, I should be able to switch over here. Here are the translation texts, just what Jimmy and Bradley have managed to do so far. You'll notice a few things. We're looking at the Save menu (I'll just put my mouse up here), looking at the menu entry for Save: that translates as "sevim", and "Save As" turns into "sev olsem". You can sort of hear the word "save" in there, and that is actually where it comes from. You'll also see that "ch" often changes into an "s", so instead of "change" you get "senis" or "senisim", and "rename" becomes "senisim nem", and so on. Okay, so we've got this vocabulary. In a few places Bradley and Jimmy just sort of freaked out and said "I don't know how to translate 'templates'; it's not a word we've ever used in Tok Pisin, we always use the English word", which was kind of interesting. All up we've got about 300 texts translated. You'll notice there are big gaps where the hard stuff is: how do you translate these things? They might not exist, or they might be difficult to say, or whatever. You also see that, in general, Tok Pisin is a lot wordier than English: you tend to need more words to say the same thing; they're just spoken faster.

So what if you wanted to translate just a single word? Let's say they hadn't done the translation for the word "save", but we wanted to extend our localisation and just grab that.
Well, we can do that. What you do is start with all the translation texts that have the word "save" in them, and then you look across to the Tok Pisin side and ask: what are all the words that appear in those translations? Okay, well, there's "sevim", there's "olsem", there's "narapela", there's "wok", there's "wan", there's "tru", there's also "dispela", and we can then ask the question: what's the probability that this word or this phrase is the translation of the word "save"?

So if I now jump over to some code over here, and let's see if I've got the right one... yep, this is the right code. There's surprisingly little infrastructure you need to make this work. I'm using pandas and numpy because they were convenient. The only thing I'm using out of SciPy is a binomial test, and the only thing I'm using out of NLTK (the Natural Language Toolkit for Python) is splitting words up, the word tokenisation. Jumping down through a bit of text, I load in the strings from the localisation work that Jimmy and Bradley have done, and you can see I've removed "Save" so that I'm not cheating: "Save As", "Save a Copy", "Close", "Open" and so on. Let's run through a little more. The bulk of the code is actually in these two functions: the every-gram generator and the possible-translation-phrases function. The every-gram generator basically says: you give me a sentence like "the quick brown" and I'll return "the", "the quick", "the quick brown", "quick", "quick brown" and "brown", because I don't know how many words in Tok Pisin correspond to one word in English. It might be a phrase, and if I'm translating a single word it's probably a phrase of words that come one after another, so that's just a little bit of optimisation there.
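A minimal sketch of that every-gram generator (my reconstruction, not the code shown in the talk):

def everygrams(tokens):
    # For ["the", "quick", "brown"] this yields ("the",), ("the", "quick"),
    # ("the", "quick", "brown"), ("quick",), ("quick", "brown"), ("brown",):
    # every contiguous run of one or more words.
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            yield tuple(tokens[start:end])

print(list(everygrams("the quick brown".split())))

(NLTK also ships a similar helper, nltk.util.everygrams; the talk itself only uses NLTK for word tokenisation.)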
So what I can do then is ask for all the possible translations of the word "save": all the strings I've seen where the word "save" appeared in the English somewhere, and all the words, and not just words but phrases, that appeared on the other side. I think I mentioned "dispela komputa" when I was reading something out earlier, and so on; there are literally hundreds of them. But what we can do, and this is the core of the code, is count up the number of times I saw word A along with "save", the number of times it appeared in the English, and then the number of times it just generally appears in the whole corpus. The word "ol", for example, is a grammatical marker in Tok Pisin, so it's in lots and lots and lots of sentences; you'd see it in lots of places, and we need to weight by the fact that "ol" appears everywhere, so it's very unlikely to be the translation of the word "save". Well, what happens when we guess the word "save"? Lo and behold, it actually works, and the probability is something absolutely astronomically small, something like 10 to the minus 100, so of course it just comes up showing nothing there.
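Here's a rough sketch of that counting-and-weighting step as I understand it, reusing the everygrams helper above. The function and variable names are mine, and the corpus is assumed to be a list of English-text / Tok-Pisin-token pairs; SciPy 1.7+ calls the test binomtest, older versions binom_test.

from scipy.stats import binomtest

def score_candidates(corpus, english_word):
    # corpus: list of (english_text, tok_pisin_tokens) pairs.
    # Returns {candidate_phrase: p-value}; a tiny p-value means the phrase
    # turns up alongside english_word far more often than its base rate in
    # the whole corpus would predict.
    containing = [tp for en, tp in corpus if english_word in en.lower().split()]
    if not containing:
        return {}
    candidates = {gram for tp in containing for gram in everygrams(tp)}
    scores = {}
    for gram in candidates:
        together = sum(gram in set(everygrams(tp)) for tp in containing)
        everywhere = sum(gram in set(everygrams(tp)) for _, tp in corpus)
        scores[gram] = binomtest(together, len(containing),
                                 everywhere / len(corpus),
                                 alternative="greater").pvalue
    return scores

# The most over-represented phrase is then min(scores, key=scores.get).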
And if I ask it to run through and find all the vocabulary it possibly can, it does okay-ish. "Longpela" means a long thing, and that was a pretty good translation of the word "length". "Arg 1": that seems right. "Spelling" is actually right; if I'm going to spell something, that's the correct translation. Kind of amusing: how do you say "click", as in a mouse click? You "presim long", press on, a thing. So: some right, some actually completely bonkers wrong. I'm just looking at that last one; that doesn't look right to me, I actually don't even know how to say what it's saying, but it's wrong. Still, it got things somewhat right, and if your problem is that you just need to get one word out in your target language, this is not a bad technique; it does kind of work.

In fact, as I jump to my next slide, if it's not going to crash on me: I've done this on a very large number of languages. I took Bible translations in 1,500 different languages and... it looks like my LibreOffice session is about to crash; I've got the spinning beach ball of death. I had to switch over from my Linux box to the Mac because there were problems with my camera that the tech check complained about, so I'm going to have to kill that and hopefully restart it. Let's relaunch and hope it behaves itself. Let me just check that I'm sharing my screen successfully here... it does not look very promising at the moment. It goes up to about here. So: I did this for a very large number of languages, and 28,000 CPU hours later, if you just want to translate a single word, you can get about 70 percent accuracy out of it.
But if you want to translate a sentence, that gets a little more complicated. I've put together a little GitHub repo; it's not state-of-the-art or anything particularly special, but it is how you write a translator when you're dealing with very low-resource languages. As I said, the problem is this: if you've got a couple of million documents you can use deep learning, and that's very accurate, but deep learning models tend to be very hungry for data, and you wouldn't dream of trying this with a hundred translation strings and expecting it to work. There are other machine learning algorithms that aren't as spectacularly accurate but can at least get some improvement out of a small amount of data. Linear methods, like a linear support vector machine or logistic regression, can often cope with really small languages and give you better-than-chance translations.

So how does this work? I've got, on my left here, a couple of things translated from English into Tok Pisin. The translation for "next page" is "pes" followed by the word for "next"; the adjective meaning "next" comes after the noun. It's actually kind of flexible in Tok Pisin, you can be a bit rough on that, but in this case the translators have chosen to put the noun first in both cases. "First page" is "pes fes-pela", and in "fes-pela" you can almost hear "first fellow", which is where it actually comes from. And then finally, with "painim", what's going on is this:
"p" often substitutes for the letter "f", so the English word "find" turned into "pain"; then the "d" got dropped off, and the "-im" is Melanesian grammar saying that this is a transitive verb.

Right, so given those, what do I set my machine learning model up to do? Well, the first thing I want is to get the first word of the translation out. So I set up a training set, and my training set says: if you see "next" and "page", please emit "pes"; if you see "first" and "page", please emit "pes"; and if you see "search" and "for", please emit "painim". Then I have a second machine learning model (or, if I'm using a transformer, I can incorporate this into the model itself) that says: I'm going to give you some English words and one Tok Pisin word. So, "next page" plus "pes": please give me the word for "next". If you see "first page" plus "pes", please give me "fes-pela". And if you see "search for" plus "painim": that's it, end, there are no more words coming, you have completely translated that particular text.
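Before the walkthrough of the actual code, here's a rough sketch of that setup as I'd reconstruct it. The row format, the END marker and the function name are my assumptions, but the ingredients are the ones the talk describes: a count vectoriser for the "lazy" embedding, scikit-learn's random forest, and one model per number of already-emitted words.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

END = "<end>"   # emitted when the translation is complete

# Hypothetical training rows: (English text, full Tok Pisin translation as words).
translations = [
    ("next page",  ["pes", "..."]),        # "..." stands in for the word for "next"
    ("first page", ["pes", "fes-pela"]),
    ("search for", ["painim"]),
]

def make_model(data, n_prior):
    # Model that predicts Tok Pisin word number n_prior+1 from the English text
    # plus the n_prior words already emitted (or END if nothing is left).
    rows, targets = [], []
    for english, words in data:
        if len(words) < n_prior:
            continue
        rows.append(english + " " + " ".join(words[:n_prior]))   # one bag-of-words string
        targets.append(words[n_prior] if len(words) > n_prior else END)
    if not rows:
        return None
    model = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=100))
    model.fit(rows, targets)
    return model

# One model per prefix length, up to the longest translation seen (about 19 in the talk).
models = {k: m for k in range(20) if (m := make_model(translations, k)) is not None}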
And I've got that coded up over here. Same sort of beginning: I went through a variety of different scikit-learn options to try and get a good model, reading the texts as per usual, and I decided to work with just the shorter texts so the talk doesn't take three hours in training. These are the phrases it knows about, the ones that have been translated by a human being, and our goal is: could we translate some other terms into Tok Pisin that it hasn't seen, by learning the rules that were used? What is the simplest set of rules that would have got you from English to Tok Pisin, and if we apply those rules to some unseen texts, how do we go? Now, after a little bit of playing around, I found that a random forest classifier happened to be the best of the options I tried. It's a good compromise: it can cope with quite small data sets and still get reasonable accuracy, and it can still do some of that universal function approximation that is so useful in deep learning.

Now, this make-model function is doing two things at once. The first part is the for loop here (oops, zoomed past really quickly), which sets up the training set. You tell it: take the English sentence and, say, three Tok Pisin words, and make a data frame where the fourth word is the next column; and, as you can see down here, if there are no more tokens, respond with the end marker. Then afterwards it does some kind of word embedding: take those English words and the Tok Pisin words and make some sort of vector we can use for machine learning. I'm being really, really lazy here and just using a count vectoriser, and then I make a model from it. Then I've got another function to make multiple models: the model where you've got no Tok Pisin words, the model where you've got one Tok Pisin word, the model where you've got two, and so on up to, I think, about 19 words, because there was one case where two words of English turned into something like, I can't remember the actual number, maybe 14 different Tok Pisin words. Hopefully that's not sitting in a menu somewhere, because it's not going to fit real well. So here's the kind of output: it's saying that if I've got the English words and then 12 Tok Pisin words, then, coming after those 12, the thing you should be predicting is "i stap", to be somewhere, I think.

So let's build a translation model in real time. The fun part about these tiny vocabularies is that this training step takes 1.18 seconds, and even that's a little slow; I've only got about 200 phrases that I'm translating here, and it's not using a hundred GPUs on a parallel set of machines for a couple of weeks. The flip side is that I need to be able to suggest the next token. So I've got a function that takes the English sentence and the tokens we've output so far, and suggests what the next token is; and then a function called suggest-translation which builds up the tokens one at a time: it takes my English sentence, gets the first token, then calls next-token again with the English sentence and that first token, then again with the English sentence and the first two tokens, and so on until it gets to an end.

Okay, let's see how that works. If I ask it to translate "last page" (it's never seen that before; it's seen the word "last" and it's seen the word "page"), it's smart enough to invert the word order: it's not just trying to translate "last", it translates "page" first, because that's what it should do. Then I say: okay, if you were translating "last page" and I gave you "pes", what would you emit next? It emits "antap". And if I was translating "last page" and I gave you "pes" and "antap", what would the next token be? The answer is: that's the end of the phrase. It did pretty well on this, actually.
"Antap" is coming from the English "on top of": it's like the page that's on top of the one you just had, which actually means "previous". What's going on there, I think, is that the only other time it saw the word "last" was in something like "the last document that you had open", and that got translated using "antap": the last one, as in the one that was on top of where you were before. Still, it did pretty well; it's getting the sense of "the last page" as the page that came before the one you've got at the moment. Or, if I put the whole thing together, it says "pes antap", because it gathers all the tokens. Or take "next page": notice that it knows the order can go the other way around, that the word for "next" can appear before the noun, so it's saying, well, seeing "next page", what you probably want is the noun, "pes", second. It's able to pick up that kind of word order, noun then adjective or adjective then noun depending on the context, which is pretty impressive machine learning given we've got such a tiny vocabulary. As we add more vocabulary this will get smarter and its ability to predict translations will get better; maybe it will learn that "last" has a couple of different meanings, and then it would be able to suggest something other than "antap", something like "finis", which is "final".
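Tying that demo back to code, here's roughly how the next-token and suggest-translation functions described above fit together (again my reconstruction, continuing the make_model sketch from earlier):

def next_token(models, english, emitted):
    # Pick the model trained for this many already-emitted words and ask it
    # for the next Tok Pisin token.
    model = models.get(len(emitted))
    if model is None:
        return END
    return model.predict([english + " " + " ".join(emitted)])[0]

def suggest_translation(models, english, max_len=19):
    emitted = []
    while len(emitted) < max_len:
        token = next_token(models, english, emitted)
        if token == END:
            break
        emitted.append(token)
    return " ".join(emitted)

# suggest_translation(models, "last page") is the loop that, with the talk's real
# data, produced "pes", then "antap", then the end marker.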
Popping back to my slides here, and hoping it doesn't crash again: as we head into the last 15 minutes of the talk, I've got my warning, the scary part, because now we need to talk about grammar and about number theory, which is kind of an unusual place to go. One of the things we need when we're localising programs into languages, one of the really key useful things, is to know how plurals work. It's really hard to get by without plurals: one document, two documents; you have many documents open, you have one document open. In English there are some pretty simple rules. Generally you add the letter "s", so "cat" goes to "cats" and "dog" goes to "dogs"; except when a word ends in a "y" after a consonant, where "sky" goes to "skies" and "butterfly" goes to "butterflies", and so on. And then there are irregular nouns: "person" goes to "people", "sheep" goes to "sheep", "ox" goes to "oxen", which is kind of weird. And this is linux.conf.au, so "unix" goes to "unixen": I have many unix boxes, or I have many unix boxen. I heard a plural for docker yesterday which I really liked: I have many docker containers, therefore I have "dockren". I hope that catches on and gets into the dictionary soon. Sorry, one more thing on that: there's a principle of linguistics which basically says that no matter how stupid or weird or complicated some grammar rule is in the language you're studying (like "person" going to "people" as a plural), the person next to you speaks a language with a rule that's even more complicated and more ridiculous. Which is interesting.

So what we're trying to do here is essentially make a machine learning model where I feed in the singular and out comes the plural. And with a bit of staring at this, it's actually pretty linear; it's locally linear for certain kinds of words. The rule that gets you from "cat" to "cats" is the same rule that gets you from "dog" to "dogs".
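Just to pin those English rules down as code (a toy sketch of mine, nowhere near complete, which is rather the point):

IRREGULAR = {"person": "people", "sheep": "sheep", "ox": "oxen"}

def naive_english_plural(noun):
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"       # sky -> skies, butterfly -> butterflies
    return noun + "s"                  # cat -> cats, dog -> dogs

for word in ["cat", "dog", "sky", "butterfly", "person", "ox"]:
    print(word, "->", naive_english_plural(word))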
The only problem is that the residuals can be infinite. What I mean by that: if I plotted "person" on the singular axis here and then drew a spot for "people" as the plural, they're going to be way, way off, and that's going to ruin my line. If I try to minimise the sum of least-squares distances on this line, I'm going to end up in serious trouble; it's never going to produce correct results.

And here's where I get to proudly talk about some of my research. Kurt Hensel, in 1897, came up with the idea of p-adic numbers, and, to completely paraphrase him: two numbers are really close together if they end with the same sequence of bits. This is a perfectly valid metric space, just as good as the Euclidean distance you learned at school: there's a triangle inequality, there are infinitesimals, you can do calculus. It's very, very strange. So, for example, 3 and 5 are very close, because if you write them out in binary, 3 is 11 and 5 is 101: they're the same in the last bit, they both have a 1 at the end, so they're very close. If you write 10 and 18 out in binary, they also end up very close together; in fact the last three bits, 010, are the same, so they're very, very close. Or, two numbers that are really, really spectacularly close: 1 and 65,537. 65,536 is 2 to the 16, which means 65,537 in binary is a 1, then 15 zeros, then a 1, so the final 1 and the 15 zeros before it are in common, and those two numbers are really, really close together.
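Here's that distance as a tiny sketch (my code, just to make the examples above concrete):

def two_adic_distance(a, b):
    # 2-adic distance: 2**(-k), where k is how many trailing bits a and b share,
    # i.e. the number of trailing zero bits of their difference.
    if a == b:
        return 0.0
    diff = abs(a - b)
    shared_bits = (diff & -diff).bit_length() - 1
    return 2.0 ** -shared_bits

print(two_adic_distance(3, 5))        # 0.5: they agree in the final bit only
print(two_adic_distance(10, 18))      # 0.125: the last three bits agree
print(two_adic_distance(1, 65537))    # 2**-16: the last sixteen bits agree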
So, big deep breath, here's how it all ties together. I take my words, I just take the UTF-8 encoding of them, and I use Hensel's measure of distance. Then sky and butterfly are really close together, because the last eight bits of sky are the letter y, and that's the same as the last eight bits of butterfly. Great, I've got eight bits the same, so they're really close together. Now, dog and frog are unbelievably close together, because they have the last 16 bits of their representations in common.

That means I can now do a linear regression using those numbers, just using the UTF-8 encoding, and instead of minimizing the sum of least squares, I'm minimizing the sum of this really weird formula, which is 2 to the power of negative the number of bits that were in common between the true answer and the predicted answer. In other words, if my line predicted that the plural of dog was frogs, when the true answer is dogs, that would be pretty close: it would say, well, the "ogs" is the same, that's 24 bits, so it only counts 2 to the minus 24 in terms of penalty. It's that close.
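(Another small sketch, again in Python and again not the actual implementation from the talk: it treats a word's UTF-8 bytes as one big integer, with the end of the word in the low-order bytes, and uses the 2-adic idea above as the regression penalty. The function names are invented for illustration.)

def word_to_int(word: str) -> int:
    # Big-endian, so the end of the word lands in the low-order bytes.
    return int.from_bytes(word.encode("utf-8"), "big")

def penalty(predicted: str, truth: str) -> float:
    # 2^-(number of trailing bits the two encodings share); 0 for an exact match.
    diff = word_to_int(predicted) ^ word_to_int(truth)
    if diff == 0:
        return 0.0
    return 2.0 ** -((diff & -diff).bit_length() - 1)

print(penalty("frogs", "dogs"))     # tiny: the last three bytes, "ogs" (24 bits), agree
print(penalty("butterfly", "sky"))  # 2**-8: the last byte, "y", agrees
print(penalty("people", "person"))  # 1.0: the endings share no trailing bits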
So then, dramatic drumroll, add lots of time and lots of research. One of the things I proved last year is that you can actually solve these problems in a finite length of time. Then I had to remember how to program in Haskell, because the whole point of academia is to look as though you're trying to be smart, and programming in Haskell is how you look really smart, so I presume that works. And then open-sourcing it, because who on earth would commercialize it; that's just dropped off on the text there. Running it on another 1,500 languages for another 30,000 hours of compute time, I get some interesting results.

I'm hiding half of what I'm talking about here, but looking at, say, Latin, one of the rules that came out was: the plural is 256 cubed times x, plus the letters for "nis". And the plural rule in Bislama and Tok Pisin was "ol" times 256 to the power of 4, plus x. If I look at the actual text there, rule 4 on Latin: cogitatio goes to cogitationis. Take cogitatio, multiply it by 256 cubed, and that'll shift you three places to the left; add the "nis" on the end. These translation rules are linear regression problems. With Bislama the same kind of thing happens: just add the word "ol" at the beginning, and it's a nice, simple linear regression problem. Linear regression problems are great because you don't need a lot of data for them, and what do you know, I have a very large number of languages for which we don't have much data. Which means, as in the preview, yes, I can come up with the correct pluralization for 11 percent of the world's nouns.

This is not fabulous. My ability to use this is limited, firstly, by the inaccuracy of the vocabulary extraction: back earlier I said it was about right 70 percent of the time when we're pulling out individual words. And then there's this technique for identifying singulars and plurals and how the grammar rules work, which kind of works. You can see, unless you're colorblind, that there's some green here; if you're looking at this on a grayscale monitor you probably can't see it, but the little green line here is the p-adic linear regression algorithm that I developed. Generally it's right about 11 percent of the time; sometimes it gets up to 30 percent correct, and very, very occasionally it gets 60 or 70 percent of pluralizations correct.
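(One more illustrative sketch, in Python rather than the Haskell the project actually uses, of what a rule like "256 cubed times x plus 'nis'" means in practice. The helper names and the Bislama example word are my own, chosen only to show the arithmetic.)

def word_to_int(word: str) -> int:
    return int.from_bytes(word.encode("utf-8"), "big")

def int_to_word(n: int) -> str:
    return n.to_bytes((n.bit_length() + 7) // 8, "big").decode("utf-8")

def apply_linear_rule(singular: str, a: int, b: int) -> str:
    # A learned rule is just: plural = a * singular + b, on the integer encoding.
    return int_to_word(a * word_to_int(singular) + b)

# English regular plural: shift left one byte and append "s".
print(apply_linear_rule("dog", 256, ord("s")))                       # dogs
# The Latin rule quoted above: shift left three bytes and append "nis".
print(apply_linear_rule("cogitatio", 256 ** 3, word_to_int("nis")))  # cogitationis
# Bislama-style plural: put the marker "ol " in front, which is linear
# only per word length (here a four-byte word, hence 256 ** 4).
print(int_to_word(word_to_int("ol ") * 256 ** 4 + word_to_int("trak")))  # ol trak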
So we're a long way from being able to make a universal translator, but according to Star Trek that only happens in 2155 anyway, so I've still got 130 years of improvements to make on this. But this is what you can do even on the barest, smallest languages. This ran, for example, on one language of Arnhem Land, whose name I'm sure I'm going to pronounce correctly, where there are only a few hundred speakers; it's a very, very sparse language. I did this also on some languages that are in the last stages of extinction, where the only people who can speak the language are in their 60s or 70s. There's a lot of work to be done. I've put up some URLs there if you want to contribute or just play around with it.

I still haven't got very far in terms of synthesizing sentences, so until I can do that I'm constrained to very parsimonious machine learning models, which really, really limits the accuracy of the translations. And I need to work on better embeddings for these low-resource languages. There are some techniques that have just come out where, with a few tens of thousands of texts, you can do a sentence embedding that you can then improve using just monolingual text; that's a bit better, that's got a bit more hope. I'm also kind of hanging out for the open-source speech-to-text engines: Mozilla and their Common Voice project are making some progress on that, and I'm pretty confident we'll see some good results there. And I'm really hanging out for some alien races for us to communicate with, so we can see whether these techniques are just human or whether they're actually universal to language itself.
Now, given that I've nearly run out of time and I'm talking about communicating with alien races, it's probably a good point to stop there. So how about I close off the presentation, and let's see if there are any questions.

Well, that was really good; it reminded me why I took sociolinguistics in third year rather than syntax and computational linguistics. We have one question, and we've got just enough time to go through it. There were a bunch of requests about putting the links in; as I said, we're a bit rushed for time, so we'll get those into the post-talk channel straight after this talk. But the question is: when you have such a small and distributed speaker base, does it help or hinder to fix specific word meanings, when you've only got a small corpus to run through?

Yeah, if you've only got a very small vocabulary, then anything you can do to improve it is good. So, for example, if I can feed back "look, 'last page' comes out as 'page' on top; could you give me a correct translation for 'last page'?", then yes, that should make a big difference to my accuracy. And that's the process: I'm finding as many texts as I can where it's getting things nearly right, and then we can improve each of those as it goes along.

Great, we will get those links to you. I see there are more questions coming in about links; I'm just checking to see there are no final questions. I could pop the screen up, or something like that, with the last couple of links on it. Yeah, we'll get them into the chat for sure.
Okay, we'll harvest those links from you in the post-talk chat in a few minutes and then get them to the eager participants. Yep. Cool, I think that's pretty much us, so thank you once again, Greg. We'll see you back here in 10 minutes.