Welcome back, and good evening from Wellington. We have Greg Baker with us here today. Greg Baker is an entrepreneur, author, translator, and an internationally awarded composer and musician. He also codes a bit. He's been running software that populates the Leaftop database, which has the goal of being the largest lexicon, and he is also building a universal grammar extractor which can currently inflect a plural from a singular for 11 percent of the world's nouns. This talk is for language geeks and machine learning nerds. Over to you, Greg.

Well: some tentative first steps towards a Star Trek universal communicator. I've noticed that a lot of people have been giving acknowledgements of the land that they're on at the moment, and I thought it would be really appropriate to give an acknowledgement of the languages of the land that I'm bringing this from. Unfortunately, I can't, and that's because the Wallumedegal people had their culture so completely destroyed by colonisation that we don't even know what they called their language; it's called "the Sydney language" because we don't know what else to call it. That's kind of tragic. When we lose a language we lose a part of what it is to be human, and so it's rather sad to hear that we estimate something like 50 to 90 percent of the languages spoken today will be dead by the year 2100.

There's a kind of cycle that we've seen happen time and time again that causes languages to die. You can start anywhere around the circle, rotate around, and after a few iterations the language is essentially gone. I'll just do an example, starting in the bottom right-hand corner: children being sent to dominant-language schools.
Take the Wu language, which is one of the great connections we have to middle-era China. It largely died when children were forced to go to Mandarin-speaking schools by the CCP. Moving across the circle to loss of culture: it became just a kitchen language, a language that you hear at home when your parents speak it, but not something you speak with your friends or anybody else you know. The next step around is that you stop identifying with that minority language, and so the population of people who speak it declines, which means people who only speak that language can't communicate with the wider population, so they have poor job prospects, which makes them all the more keen to send their children to schools that teach in the dominant language so they can get themselves out of that poverty rut. And that is happening all over the world at the moment.

I'll give you another example: Quechua. That's 10 million speakers, about half the population of Australia, spread out across South America. It's our last connection to the Inca Empire; it's the language of the empire as it evolved. And it's dying. In fact, people who speak Quechua, if they're in the cities, will probably hide the fact that they can speak it, because of this extreme level of "I don't want to identify with my language and culture."

At this point you're probably thinking this is going to be one of those really depressing talks, but it's not, because there are two things happening right now that are changing this, breaking the cycle of language death in two ways that have never really happened before. The first, which I'll talk about for just a few minutes, is the idea of a language being associated with technology.
When that happens, when you can start interacting with a computer in your own language, it gives you an incredible sense of power and capability. The other half, of course, is that as we get better machine translation, there's the possibility that this may also break the cycle of language death.

Let me talk about language being associated with technology, and make a quick pitch here. There's a language that I'll use in this talk. It goes by three different names, and they're subtly different languages. There's the language called Bislama (the pronunciation depends on where you grew up), which is spoken in Vanuatu; strangely enough, the name is actually the word for a sea slug. Not many people name their languages after sea slugs, but that's what happened. In Papua New Guinea it's called Tok Pisin, with a couple of tiny differences, and in the Solomon Islands it's called Pijin. This is a language that has only really diverged from English for about 120 years. It derives from the period when Australian landholders in particular captured a lot of Melanesians and forced them to work on sugar cane fields and on ships, so you end up with a language that has English vocabulary but Melanesian grammar, and then we've watched it diverge. Now there are something like five to six million speakers of the language I'll call Tok Pisin, mostly in Papua New Guinea but all throughout Melanesia.

So I decided, just for fun, that I would pay Bradley and Jimmy to translate and localise LibreOffice into Bislama. If you want to donate: I'm aiming to raise ten thousand dollars, which will be enough to fully translate and fully localise it.
The effect this has is extraordinary, because there are a lot of people who are not comfortable with computers and not comfortable with English, and that's a double burden. When they see "hey, I can actually operate in my own language on a computer", it's like the world has suddenly opened up. It's like that feeling you get when you first play around with open source software and you suddenly go: hey, this is a community, this is a culture, this is something different. In previous versions of this talk I used other languages which nobody knew. Nobody will know these languages either, but it's kind of fun, because when I sound out some of the words you can get a sense that they sound familiar, and sound right.

Okay, so the other tool in our toolbox is universal translation: what we want is the universal translator from Star Trek. Now, there's something I need to say here that may be a shock to some people, and that is that Star Trek is actually fictional; it's not a depiction of reality. So we don't actually have to follow how it works in Star Trek. In the original series the universal translator worked by scanning the brain waves of the alien species, and in Enterprise it was Hoshi's creation. Either way, what we notice is that the Star Trek universal translator can cope with really small languages. You land on a planet and there are only two people on it, and somehow the crew are able to communicate with these aliens who might be the last of their species. And that's a key point.
Because if you just want to translate big, major languages, that's easy. The process for building a translator for a major language is: assemble a few million pairs of translated documents. The proceedings of the European Union, for example, are a good set: you've got document A and document B, you know that B is a direct translation of A, and it will have been translated sentence for sentence. So you've got a couple of million sentence pairs that you can work with.
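As a rough illustration of that first step (my sketch, not anything shown in the talk; the filenames are hypothetical stand-ins for a line-aligned corpus with one sentence per line):

def load_sentence_pairs(source_path, target_path):
    # Line N of one file is the translation of line N of the other.
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            source, target = source_line.strip(), target_line.strip()
            if source and target:          # skip blank lines
                yield source, target

pairs = list(load_sentence_pairs("corpus.en.txt", "corpus.other.txt"))
print(len(pairs), "sentence pairs; first pair:", pairs[0])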
As of about 2018, the hip new technology is to create a transformer (deep learning with attention mechanisms), and you basically try to get it to predict the next token that comes out of a translation; I'll do an example of that a little later. And it works. Then if you throw in a speech recognition system, you probably only need a few thousand hours of correctly transcribed speech to get into the 90 percent range for speech recognition accuracy, and a bit more than that and you can start getting some really accurate models.

But the problem is that beginning step: first, assemble a few million pairs of translated documents. Just to get a sense of how big that is: in the keynote yesterday Brian Kernighan mentioned, I think, the 7,000 lines of code in Unix version 6, and the commentary on that gets you 254 pages. That's a lot of information; it's enough to have launched an industry for, what is it, 50 years now. And over on the right I took photographs of a large number of quite thick books; they're mostly Bible translations. Now, a Bible translation gets you about 30,000 sentences in your target language. That's a great start, but it's not sufficient on its own. Just to give you a sense: in that stack, most of the books are Bibles, and you can see how thick they are. Well, that's about one thirtieth of what you need translated in order to start building a fully effective major-language translator. So that stack of books, twenty books high, is not enough; you would need more than that kind of corpus. And when you think about it, that's really hard when you're talking about a language with perhaps fewer than a million speakers. Getting that much material translated is a huge job; by the way, a Bible translation, where they've got the process pretty well down pat, takes between five and twenty person-years. Multiply that by the roughly thirty you need and you're talking about 150 to 600 person-years of work to start building a deep-learning-based translator.

When you're talking about low-resource languages you just laugh at that; there's no way. So you do whatever dirty tricks you can to extract as much vocabulary as possible from whatever sources you have. Then step two is to learn the rules of the target language's grammar, and then you should be able to synthetically put together some sentences. You can skip that step if you want: make your machine learning model and try to start generating translations. And then, separately, you probably don't shoot for full speech recognition; just shoot for recognisers that detect whether a particular word was spoken (did we hear this word or not), and that gets you a starting point on some of this translation work.

Well, given that you can't do a really good job unless you have a million translated sentences, what's the best you can do with what little you have? Now, if I've managed to get this to work properly, I should be able to switch over here. Here are the translation texts, just what Jimmy and Bradley have managed to do so far. You'll notice a few things. We're looking at the Save menu (I'll just put my mouse up here), looking at the menu entry for Save: that translates as "sevim", and "Save As" turns into "sev olsem". You can sort of hear the word "save" in there, and that is actually where it comes from. You'll also see that "ch" often changes into an "s", so instead of "change" you get "senis" or "senisim", and "rename" becomes "senisim nem", and so on. Okay, so we've got this vocabulary. In a few places Bradley and Jimmy just sort of freaked out and said "I don't know how to translate 'templates'; it's not a word we've ever used in Tok Pisin, we always use the English word", which was kind of interesting. All up we've got about 300 texts translated. You'll notice there are big gaps where the hard stuff is: how do you translate these things? They might not exist, or they might be difficult to say, or whatever. You also see that, in general, Tok Pisin is a lot wordier than English: you tend to need more words to say the same thing; they're just spoken faster.

So what if you wanted to translate just a single word? Let's say they hadn't done the translation for the word "save", but we wanted to extend our localisation and just grab that.
Well, we can do that. What you do is start with all the translation texts that have the word "save" in them, and then you look across to the Tok Pisin side and ask: what are all the words that appear in those translations? Okay, well, there's "sevim", there's "olsem", there's "narapela", there's "wok", there's "wan", there's "tru", there's also "dispela", and we can then ask the question: what's the probability that this word or this phrase is the translation of the word "save"?

So if I now jump over to some code over here, and let's see if I've got the right one... yep, this is the right code. There's surprisingly little infrastructure you need to make this work. I'm using pandas and numpy because they were convenient. The only thing I'm using out of SciPy is a binomial test, and the only thing I'm using out of NLTK (the Natural Language Toolkit for Python) is splitting words up, the word tokenisation. Jumping down through a bit of text, I load in the strings from the localisation work that Jimmy and Bradley have done, and you can see I've removed "Save" so that I'm not cheating: "Save As", "Save a Copy", "Close", "Open" and so on. Let's run through a little more. The bulk of the code is actually in these two functions: the every-gram generator and the possible-translation-phrases function. The every-gram generator basically says: you give me a sentence like "the quick brown" and I'll return "the", "the quick", "the quick brown", "quick", "quick brown" and "brown", because I don't know how many words in Tok Pisin correspond to one word in English. It might be a phrase, and if I'm translating a single word it's probably a phrase of words that come one after another, so that's just a little bit of optimisation there.
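A minimal sketch of that every-gram generator (my reconstruction, not the code shown in the talk):

def everygrams(tokens):
    # For ["the", "quick", "brown"] this yields ("the",), ("the", "quick"),
    # ("the", "quick", "brown"), ("quick",), ("quick", "brown"), ("brown",):
    # every contiguous run of one or more words.
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            yield tuple(tokens[start:end])

print(list(everygrams("the quick brown".split())))

(NLTK also ships a similar helper, nltk.util.everygrams; the talk itself only uses NLTK for word tokenisation.)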
So what I can do then is ask for all the possible translations of the word "save": all the strings I've seen where the word "save" appeared in the English somewhere, and all the words, and not just words but phrases, that appeared on the other side. I think I mentioned "dispela komputa" when I was reading something out earlier, and so on; there are literally hundreds of them. But what we can do, and this is the core of the code, is count up the number of times I saw word A along with "save", the number of times it appeared in the English, and then the number of times it just generally appears in the whole corpus. The word "ol", for example, is a grammatical marker in Tok Pisin, so it's in lots and lots and lots of sentences; you'd see it in lots of places, and we need to weight by the fact that "ol" appears everywhere, so it's very unlikely to be the translation of the word "save". Well, what happens when we guess the word "save"? Lo and behold, it actually works, and the probability is something absolutely astronomically small, something like 10 to the minus 100, so of course it just comes up showing nothing there.
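Here's a rough sketch of that counting-and-weighting step as I understand it, reusing the everygrams helper above. The function and variable names are mine, and the corpus is assumed to be a list of English-text / Tok-Pisin-token pairs; SciPy 1.7+ calls the test binomtest, older versions binom_test.

from scipy.stats import binomtest

def score_candidates(corpus, english_word):
    # corpus: list of (english_text, tok_pisin_tokens) pairs.
    # Returns {candidate_phrase: p-value}; a tiny p-value means the phrase
    # turns up alongside english_word far more often than its base rate in
    # the whole corpus would predict.
    containing = [tp for en, tp in corpus if english_word in en.lower().split()]
    if not containing:
        return {}
    candidates = {gram for tp in containing for gram in everygrams(tp)}
    scores = {}
    for gram in candidates:
        together = sum(gram in set(everygrams(tp)) for tp in containing)
        everywhere = sum(gram in set(everygrams(tp)) for _, tp in corpus)
        scores[gram] = binomtest(together, len(containing),
                                 everywhere / len(corpus),
                                 alternative="greater").pvalue
    return scores

# The most over-represented phrase is then min(scores, key=scores.get).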
And if I ask it to run through and find all the vocabulary it possibly can, it does okay-ish. "Longpela" means a long thing, and that was a pretty good translation of the word "length". "Arg 1": that seems right. "Spelling" is actually right; if I'm going to spell something, that's the correct translation. Kind of amusing: how do you say "click", as in a mouse click? You "presim long", press on, a thing. So: some right, some actually completely bonkers wrong. I'm just looking at that last one; that doesn't look right to me, I actually don't even know how to say what it's saying, but it's wrong. Still, it got things somewhat right, and if your problem is that you just need to get one word out in your target language, this is not a bad technique; it does kind of work.

In fact, as I jump to my next slide, if it's not going to crash on me: I've done this on a very large number of languages. I took Bible translations in 1,500 different languages and... it looks like my LibreOffice session is about to crash; I've got the spinning beach ball of death. I had to switch over from my Linux box to the Mac because there were problems with my camera that the tech check complained about, so I'm going to have to kill that and hopefully restart it. Let's relaunch and hope it behaves itself. Let me just check that I'm sharing my screen successfully here... it does not look very promising at the moment. It goes up to about here. So: I did this for a very large number of languages, and 28,000 CPU hours later, if you just want to translate a single word, you can get about 70 percent accuracy out of it.
But if you want to translate a sentence, that gets a little more complicated. I've put together a little GitHub repo; it's not state-of-the-art or anything particularly special, but it is how you write a translator when you're dealing with very low-resource languages. As I said, the problem is this: if you've got a couple of million documents you can use deep learning, and that's very accurate, but deep learning models tend to be very hungry for data, and you wouldn't dream of trying this with a hundred translation strings and expecting it to work. There are other machine learning algorithms that aren't as spectacularly accurate but can at least get some improvement out of a small amount of data. Linear methods, like a linear support vector machine or logistic regression, can often cope with really small languages and give you better-than-chance translations.

So how does this work? I've got, on my left here, a couple of things translated from English into Tok Pisin. The translation for "next page" is "pes" followed by the word for "next"; the adjective meaning "next" comes after the noun. It's actually kind of flexible in Tok Pisin, you can be a bit rough on that, but in this case the translators have chosen to put the noun first in both cases. "First page" is "pes fes-pela", and in "fes-pela" you can almost hear "first fellow", which is where it actually comes from. And then finally, with "painim", what's going on is this:
"p" often substitutes for the letter "f", so the English word "find" turned into "pain"; then the "d" got dropped off, and the "-im" is Melanesian grammar saying that this is a transitive verb.

Right, so given those, what do I set my machine learning model up to do? Well, the first thing I want is to get the first word of the translation out. So I set up a training set, and my training set says: if you see "next" and "page", please emit "pes"; if you see "first" and "page", please emit "pes"; and if you see "search" and "for", please emit "painim". Then I have a second machine learning model (or, if I'm using a transformer, I can incorporate this into the model itself) that says: I'm going to give you some English words and one Tok Pisin word. So, "next page" plus "pes": please give me the word for "next". If you see "first page" plus "pes", please give me "fes-pela". And if you see "search for" plus "painim": that's it, end, there are no more words coming, you have completely translated that particular text.
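Before the walkthrough of the actual code, here's a rough sketch of that setup as I'd reconstruct it. The row format, the END marker and the function name are my assumptions, but the ingredients are the ones the talk describes: a count vectoriser for the "lazy" embedding, scikit-learn's random forest, and one model per number of already-emitted words.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

END = "<end>"   # emitted when the translation is complete

# Hypothetical training rows: (English text, full Tok Pisin translation as words).
translations = [
    ("next page",  ["pes", "..."]),        # "..." stands in for the word for "next"
    ("first page", ["pes", "fes-pela"]),
    ("search for", ["painim"]),
]

def make_model(data, n_prior):
    # Model that predicts Tok Pisin word number n_prior+1 from the English text
    # plus the n_prior words already emitted (or END if nothing is left).
    rows, targets = [], []
    for english, words in data:
        if len(words) < n_prior:
            continue
        rows.append(english + " " + " ".join(words[:n_prior]))   # one bag-of-words string
        targets.append(words[n_prior] if len(words) > n_prior else END)
    if not rows:
        return None
    model = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=100))
    model.fit(rows, targets)
    return model

# One model per prefix length, up to the longest translation seen (about 19 in the talk).
models = {k: m for k in range(20) if (m := make_model(translations, k)) is not None}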
And I've got that coded up over here. Same sort of beginning: I went through a variety of different scikit-learn options to try and get a good model, reading the texts as per usual, and I decided to work with just the shorter texts so the talk doesn't take three hours in training. These are the phrases it knows about, the ones that have been translated by a human being, and our goal is: could we translate some other terms into Tok Pisin that it hasn't seen, by learning the rules that were used? What is the simplest set of rules that would have got you from English to Tok Pisin, and if we apply those rules to some unseen texts, how do we go? Now, after a little bit of playing around, I found that a random forest classifier happened to be the best of the options I tried. It's a good compromise: it can cope with quite small data sets and still get reasonable accuracy, and it can still do some of that universal function approximation that is so useful in deep learning.

Now, this make-model function is doing two things at once. The first part is the for loop here (oops, zoomed past really quickly), which sets up the training set. You tell it: take the English sentence and, say, three Tok Pisin words, and make a data frame where the fourth word is the next column; and, as you can see down here, if there are no more tokens, respond with the end marker. Then afterwards it does some kind of word embedding: take those English words and the Tok Pisin words and make some sort of vector we can use for machine learning. I'm being really, really lazy here and just using a count vectoriser, and then I make a model from it. Then I've got another function to make multiple models: the model where you've got no Tok Pisin words, the model where you've got one Tok Pisin word, the model where you've got two, and so on up to, I think, about 19 words, because there was one case where two words of English turned into something like, I can't remember the actual number, maybe 14 different Tok Pisin words. Hopefully that's not sitting in a menu somewhere, because it's not going to fit real well. So here's the kind of output: it's saying that if I've got the English words and then 12 Tok Pisin words, then, coming after those 12, the thing you should be predicting is "i stap", to be somewhere, I think.

So let's build a translation model in real time. The fun part about these tiny vocabularies is that this training step takes 1.18 seconds, and even that's a little slow; I've only got about 200 phrases that I'm translating here, and it's not using a hundred GPUs on a parallel set of machines for a couple of weeks. The flip side is that I need to be able to suggest the next token. So I've got a function that takes the English sentence and the tokens we've output so far, and suggests what the next token is; and then a function called suggest-translation which builds up the tokens one at a time: it takes my English sentence, gets the first token, then calls next-token again with the English sentence and that first token, then again with the English sentence and the first two tokens, and so on until it gets to an end.

Okay, let's see how that works. If I ask it to translate "last page" (it's never seen that before; it's seen the word "last" and it's seen the word "page"), it's smart enough to invert the word order: it's not just trying to translate "last", it translates "page" first, because that's what it should do. Then I say: okay, if you were translating "last page" and I gave you "pes", what would you emit next? It emits "antap". And if I was translating "last page" and I gave you "pes" and "antap", what would the next token be? The answer is: that's the end of the phrase. It did pretty well on this, actually.
"Antap" is coming from the English "on top of": it's like the page that's on top of the one you just had, which actually means "previous". What's going on there, I think, is that the only other time it saw the word "last" was in something like "the last document that you had open", and that got translated using "antap": the last one, as in the one that was on top of where you were before. Still, it did pretty well; it's getting the sense of "the last page" as the page that came before the one you've got at the moment. Or, if I put the whole thing together, it says "pes antap", because it gathers all the tokens. Or take "next page": notice that it knows the order can go the other way around, that the word for "next" can appear before the noun, so it's saying, well, seeing "next page", what you probably want is the noun, "pes", second. It's able to pick up that kind of word order, noun then adjective or adjective then noun depending on the context, which is pretty impressive machine learning given we've got such a tiny vocabulary. As we add more vocabulary this will get smarter and its ability to predict translations will get better; maybe it will learn that "last" has a couple of different meanings, and then it would be able to suggest something other than "antap", something like "finis", which is "final".
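Tying that demo back to code, here's roughly how the next-token and suggest-translation functions described above fit together (again my reconstruction, continuing the make_model sketch from earlier):

def next_token(models, english, emitted):
    # Pick the model trained for this many already-emitted words and ask it
    # for the next Tok Pisin token.
    model = models.get(len(emitted))
    if model is None:
        return END
    return model.predict([english + " " + " ".join(emitted)])[0]

def suggest_translation(models, english, max_len=19):
    emitted = []
    while len(emitted) < max_len:
        token = next_token(models, english, emitted)
        if token == END:
            break
        emitted.append(token)
    return " ".join(emitted)

# suggest_translation(models, "last page") is the loop that, with the talk's real
# data, produced "pes", then "antap", then the end marker.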
Popping back to my slides here, and hoping it doesn't crash again: as we head into the last 15 minutes of the talk, I've got my warning, the scary part, because now we need to talk about grammar and about number theory, which is kind of an unusual place to go. One of the things we need when we're localising programs into languages, one of the really key useful things, is to know how plurals work. It's really hard to get by without plurals: one document, two documents; you have many documents open, you have one document open. In English there are some pretty simple rules. Generally you add the letter "s", so "cat" goes to "cats" and "dog" goes to "dogs"; except when a word ends in a "y" after a consonant, where "sky" goes to "skies" and "butterfly" goes to "butterflies", and so on. And then there are irregular nouns: "person" goes to "people", "sheep" goes to "sheep", "ox" goes to "oxen", which is kind of weird. And this is linux.conf.au, so "unix" goes to "unixen": I have many unix boxes, or I have many unix boxen. I heard a plural for docker yesterday which I really liked: I have many docker containers, therefore I have "dockren". I hope that catches on and gets into the dictionary soon. Sorry, one more thing on that: there's a principle of linguistics which basically says that no matter how stupid or weird or complicated some grammar rule is in the language you're studying (like "person" going to "people" as a plural), the person next to you speaks a language with a rule that's even more complicated and more ridiculous. Which is interesting.

So what we're trying to do here is essentially make a machine learning model where I feed in the singular and out comes the plural. And with a bit of staring at this, it's actually pretty linear; it's locally linear for certain kinds of words. The rule that gets you from "cat" to "cats" is the same rule that gets you from "dog" to "dogs".
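Just to pin those English rules down as code (a toy sketch of mine, nowhere near complete, which is rather the point):

IRREGULAR = {"person": "people", "sheep": "sheep", "ox": "oxen"}

def naive_english_plural(noun):
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"       # sky -> skies, butterfly -> butterflies
    return noun + "s"                  # cat -> cats, dog -> dogs

for word in ["cat", "dog", "sky", "butterfly", "person", "ox"]:
    print(word, "->", naive_english_plural(word))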
The only problem is that the residuals can be infinite. What I mean by that: if I plotted "person" on the singular axis here and then drew a spot for "people" as the plural, they're going to be way, way off, and that's going to ruin my line. If I try to minimise the sum of least-squares distances on this line, I'm going to end up in serious trouble; it's never going to produce correct results.

And here's where I get to proudly talk about some of my research. Kurt Hensel, in 1897, came up with the idea of p-adic numbers, and, to completely paraphrase him: two numbers are really close together if they end with the same sequence of bits. This is a perfectly valid metric space, just as good as the Euclidean distance you learned at school: there's a triangle inequality, there are infinitesimals, you can do calculus. It's very, very strange. So, for example, 3 and 5 are very close, because if you write them out in binary, 3 is 11 and 5 is 101: they're the same in the last bit, they both have a 1 at the end, so they're very close. If you write 10 and 18 out in binary, they also end up very close together; in fact the last three bits, 010, are the same, so they're very, very close. Or, two numbers that are really, really spectacularly close: 1 and 65,537. 65,536 is 2 to the 16, which means 65,537 in binary is a 1, then 15 zeros, then a 1, so the final 1 and the 15 zeros before it are in common, and those two numbers are really, really close together.
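Here's that distance as a tiny sketch (my code, just to make the examples above concrete):

def two_adic_distance(a, b):
    # 2-adic distance: 2**(-k), where k is how many trailing bits a and b share,
    # i.e. the number of trailing zero bits of their difference.
    if a == b:
        return 0.0
    diff = abs(a - b)
    shared_bits = (diff & -diff).bit_length() - 1
    return 2.0 ** -shared_bits

print(two_adic_distance(3, 5))        # 0.5: they agree in the final bit only
print(two_adic_distance(10, 18))      # 0.125: the last three bits agree
print(two_adic_distance(1, 65537))    # 2**-16: the last sixteen bits agree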
So, big deep breath, here's how it all ties together. I take my words, I just take the UTF-8 encoding of them, and I use Hensel's measure of distance. Then sky and butterfly are really close together, because the last eight bits of sky are the letter y, and that's the same as the last eight bits of butterfly. Great, I've got eight bits the same, so they're really close together. Now, dog and frog are unbelievably close together, because they have the last 16 bits of their representations in common.

That means I can now do a linear regression using those numbers, just using the UTF-8 encoding, and instead of minimizing the sum of least squares, I'm minimizing the sum of this really weird formula, which is 2 to the power of negative the number of bits that were in common between the true answer and the predicted answer. In other words, if my line predicted that the plural of dog was frogs, when the true answer is dogs, that would be pretty close: it would say, well, the "ogs" is the same, that's 24 bits, so it only counts 2 to the minus 24 in terms of penalty. It's that close.
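(Another small sketch, again in Python and again not the actual implementation from the talk: it treats a word's UTF-8 bytes as one big integer, with the end of the word in the low-order bytes, and uses the 2-adic idea above as the regression penalty. The function names are invented for illustration.)

def word_to_int(word: str) -> int:
    # Big-endian, so the end of the word lands in the low-order bytes.
    return int.from_bytes(word.encode("utf-8"), "big")

def penalty(predicted: str, truth: str) -> float:
    # 2^-(number of trailing bits the two encodings share); 0 for an exact match.
    diff = word_to_int(predicted) ^ word_to_int(truth)
    if diff == 0:
        return 0.0
    return 2.0 ** -((diff & -diff).bit_length() - 1)

print(penalty("frogs", "dogs"))     # tiny: the last three bytes, "ogs" (24 bits), agree
print(penalty("butterfly", "sky"))  # 2**-8: the last byte, "y", agrees
print(penalty("people", "person"))  # 1.0: the endings share no trailing bits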
So then, dramatic drumroll, add lots of time and lots of research. One of the things I proved last year is that you can actually solve these problems in a finite length of time. Then I had to remember how to program in Haskell, because the whole point of academia is to look as though you're trying to be smart, and programming in Haskell is how you look really smart, so I presume that works. And then open-sourcing it, because who on earth would commercialize it; that's just dropped off on the text there. Running it on another 1,500 languages for another 30,000 hours of compute time, I get some interesting results.

I'm hiding half of what I'm talking about here, but looking at, say, Latin, one of the rules that came out was: the plural is 256 cubed times x, plus the letters for "nis". And the plural rule in Bislama and Tok Pisin was "ol" times 256 to the power of 4, plus x. If I look at the actual text there, rule 4 on Latin: cogitatio goes to cogitationis. Take cogitatio, multiply it by 256 cubed, and that'll shift you three places to the left; add the "nis" on the end. These translation rules are linear regression problems. With Bislama the same kind of thing happens: just add the word "ol" at the beginning, and it's a nice, simple linear regression problem. Linear regression problems are great because you don't need a lot of data for them, and what do you know, I have a very large number of languages for which we don't have much data. Which means, as in the preview, yes, I can come up with the correct pluralization for 11 percent of the world's nouns.

This is not fabulous. My ability to use this is limited, firstly, by the inaccuracy of the vocabulary extraction: back earlier I said it was about right 70 percent of the time when we're pulling out individual words. And then there's this technique for identifying singulars and plurals and how the grammar rules work, which kind of works. You can see, unless you're colorblind, that there's some green here; if you're looking at this on a grayscale monitor you probably can't see it, but the little green line here is the p-adic linear regression algorithm that I developed. Generally it's right about 11 percent of the time; sometimes it gets up to 30 percent correct, and very, very occasionally it gets 60 or 70 percent of pluralizations correct.
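(One more illustrative sketch, in Python rather than the Haskell the project actually uses, of what a rule like "256 cubed times x plus 'nis'" means in practice. The helper names and the Bislama example word are my own, chosen only to show the arithmetic.)

def word_to_int(word: str) -> int:
    return int.from_bytes(word.encode("utf-8"), "big")

def int_to_word(n: int) -> str:
    return n.to_bytes((n.bit_length() + 7) // 8, "big").decode("utf-8")

def apply_linear_rule(singular: str, a: int, b: int) -> str:
    # A learned rule is just: plural = a * singular + b, on the integer encoding.
    return int_to_word(a * word_to_int(singular) + b)

# English regular plural: shift left one byte and append "s".
print(apply_linear_rule("dog", 256, ord("s")))                       # dogs
# The Latin rule quoted above: shift left three bytes and append "nis".
print(apply_linear_rule("cogitatio", 256 ** 3, word_to_int("nis")))  # cogitationis
# Bislama-style plural: put the marker "ol " in front, which is linear
# only per word length (here a four-byte word, hence 256 ** 4).
print(int_to_word(word_to_int("ol ") * 256 ** 4 + word_to_int("trak")))  # ol trak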
So we're a long way from being able to make a universal translator, but according to Star Trek that only happens in 2155 anyway, so I've still got 130 years of improvements to make on this. But this is what you can do even on the barest, smallest languages. This ran, for example, on one language of Arnhem Land, whose name I'm sure I'm going to pronounce correctly, where there are only a few hundred speakers; it's a very, very sparse language. I did this also on some languages that are in the last stages of extinction, where the only people who can speak the language are in their 60s or 70s. There's a lot of work to be done. I've put up some URLs there if you want to contribute or just play around with it.

I still haven't got very far in terms of synthesizing sentences, so until I can do that I'm constrained to very parsimonious machine learning models, which really, really limits the accuracy of the translations. And I need to work on better embeddings for these low-resource languages. There are some techniques that have just come out where, with a few tens of thousands of texts, you can do a sentence embedding that you can then improve using just monolingual text; that's a bit better, that's got a bit more hope. I'm also kind of hanging out for the open-source speech-to-text engines: Mozilla and their Common Voice project are making some progress on that, and I'm pretty confident we'll see some good results there. And I'm really hanging out for some alien races for us to communicate with, so we can see whether these techniques are just human or whether they're actually universal to language itself.
Now, given that I've nearly run out of time and I'm talking about communicating with alien races, it's probably a good point to stop there. So how about I close off the presentation, and let's see if there are any questions.

Well, that was really good; it reminded me why I took sociolinguistics in third year rather than syntax and computational linguistics. We have one question, and we've got just enough time to go through it. There were a bunch of requests about putting the links in; as I said, we're a bit rushed for time, so we'll get those into the post-talk channel straight after this talk. But the question is: when you have such a small and distributed speaker base, does it help or hinder to fix specific word meanings, when you've only got a small corpus to run through?

Yeah, if you've only got a very small vocabulary, then anything you can do to improve it is good. So, for example, if I can feed back "look, 'last page' comes out as 'page' on top; could you give me a correct translation for 'last page'?", then yes, that should make a big difference to my accuracy. And that's the process: I'm finding as many texts as I can where it's getting things nearly right, and then we can improve each of those as it goes along.

Great, we will get those links to you. I see there are more questions coming in about links; I'm just checking to see there are no final questions. I could pop the screen up, or something like that, with the last couple of links on it. Yeah, we'll get them into the chat for sure.
Okay, we'll harvest those links from you in the post-talk chat in a few minutes and then get them to the eager participants. Yep. Cool, I think that's pretty much us, so thank you once again, Greg. We'll see you back here in 10 minutes.