MODERATOR: Welcome back. Before I introduce our next speakers, please remember that the conference close is going to be in the Curlyboi theatre; that's the room just above this one in Venueless. So don't go anywhere except that room after Q&A, and we'll have the conference close and a nice time there.

Next up, we've got "Setting up a machine translation service for Timor-Leste" with Raphael Merx and Mel Mistica.

>> Hi everyone, thanks for coming to our talk. We'll be around for Q&A after the talk, and we'll also be online, so please ask questions. Thanks very much.

>> Raphael: All right, hello everyone. Today we're going to talk about how we set up a machine translation service for Timor-Leste. "Machine translation service" is a fancy natural language processing term for something like Google Translate: a service that translates sentences and whole articles. And we set up a new one for a language which is not supported by Google Translate or Bing Translate or any other free online machine translation service.

This was set up entirely as a volunteer project. You can check out the project on the website. We have apps for Android and iOS, there are about 10,000 monthly active users, and it is growing quickly. A little about us: I am Raphael. I am the head of governance and transparency at a technology-for-development-sector NGO.
Then we have Mel, our resident NLP expert; she is a language data scientist at the University of Queensland.

We will start with background: what the language is and why it is important. Mel is going to take us through the project fundamentals and how we approached it. Then I am going to talk about how we did it in practice: gathered the data, trained the model and published it. Mel is going to conclude with interesting future work and some of the usage statistics.

All right, a little bit of background. Timor-Leste is a half-island nation; the other half of the island is part of Indonesia. It has around 1.3 million inhabitants and gained independence in 2002. It has a variety of languages, and people might speak Tetun or Indonesian. One of the languages stands out: Tetun, the lingua franca of the country. People who don't share the same local language will speak Tetun with each other. Tetun is the most spoken language in the media, in politics, etc. It is definitely a language of high importance. Despite that, like I said, it was not on Google Translate or Bing Translate or any other translation service. As a person trying to learn Tetun, I can find words in individual dictionaries, but if I find a sentence I don't have an easy way to translate it into English.
A Tetun speaker who doesn't yet have a good command of English can't easily translate a whole article from English into Tetun. And if they have a sentence they are most comfortable saying in Tetun and want to translate it into English, there was no easy way to do this until we set up this new service.

All right, I am going to hand over to Mel, who will take us through the fundamentals of our project.

>> Mel: Thanks for a great summary of the context in which Tetun is used, Raphael. Given the diversity of linguistic backgrounds in the country, we thought it would be great if there was a machine translation service for anybody to use.

We asked ourselves a couple of questions when we embarked on this. What resources did we have, or what resources could we quickly and easily develop ourselves? And what was required to build an MT system? Obviously the resources we needed would depend on the kind of system we wanted to build. For example, some of the earlier systems were symbolic, rule-based systems that required a bunch of rules and lexicons. These kinds of systems were made up of three components: a set of rules for analysis, a set of rules for transfer (or correspondence rules), and a set of rules to generate the target language. And actually, very few resources were required to just get one up and running.
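The three-component pipeline Mel describes (analysis, transfer, generation) can be sketched as a toy. This is an invented illustration, not any system from the talk; the tiny English-to-Spanish lexicon and the single reordering rule are assumptions chosen only to show the shape of a transfer-based system:

```python
# Toy transfer-based MT: analysis -> transfer -> generation.
# Lexicon and rules are illustrative assumptions, not from the talk.
LEXICON = {"the": "la", "red": "roja", "house": "casa"}
POS = {"the": "DET", "red": "ADJ", "house": "NOUN"}

def analyse(sentence):
    """Analysis rules: tokenize and tag each source word."""
    return [(tok, POS.get(tok, "UNK")) for tok in sentence.lower().split()]

def transfer(tagged):
    """Transfer (correspondence) rules: map each word via the lexicon."""
    return [(LEXICON.get(tok, tok), pos) for tok, pos in tagged]

def generate(tagged):
    """Generation rules for the target language: Spanish adjectives
    usually follow the noun, so swap ADJ NOUN -> NOUN ADJ."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return " ".join(tok for tok, _ in out)

def translate(sentence):
    return generate(transfer(analyse(sentence)))
```

Even this toy shows the brittleness of the approach: any word or construction outside the lexicon and rule set simply falls through untranslated.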
One of the very first systems, from the '50s, had only a handful of rules and no more than a couple of hundred words. These kinds of systems were great for a very limited domain and very quick to get up and going, but beyond the few rules that were developed and the very limited number of words they could analyze, these were not very robust systems, not to a wide variety of inputs. They were quite brittle.

From the 1990s onwards, statistical machine translation (SMT) was the main mode. These models comprised a translation model and a language model. The language model, here in purple, is trained on vast amounts of monolingual data; it is the part of the model responsible for fluent-sounding output. The translation model is learned from parallel text, often millions of sentences, so it is very resource hungry. It is not shown in this formula, but building this kind of system and developing this kind of translation model also required modeling the best word alignment between the source and the target language. The source is denoted by x and the target output by y. In order to learn a good translation model, we required lots and lots of parallel data. After 2014, we experienced the next big seismic change in machine translation with the use of neural models.
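The SMT decision rule Mel describes, with x the source sentence and y the target output as on the slide, is conventionally written as the noisy-channel factorization (a reconstruction of the standard formula, not copied from the slides):

```latex
\hat{y} \;=\; \arg\max_{y}\, P(y \mid x)
       \;=\; \arg\max_{y}\; \underbrace{P(x \mid y)}_{\text{translation model}} \;\underbrace{P(y)}_{\text{language model}}
```

The word alignments Mel mentions are the latent variables inside P(x|y) that classic SMT systems, such as the IBM models, estimate from parallel data.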
We will talk about neural models a bit more later, but first I want to return to our resource issue. We had only started with an idea of building an MT system, with no other resources: we had no available parallel corpora, and having a brittle system was not that appealing to us. Building a symbolic MT system would require the help of many linguists who knew how to code and who were possibly proficient in both languages. But going down the SMT path meant building our own corpus, and this brought other issues.

If we created a corpus for an SMT project, we wanted the kind of translation system that reflected standard Tetun. This would limit what kinds of text we could include; it meant that we couldn't include some sources that were too formal. The other issue we had was that even when texts were bilingual, they were not completely parallel, so parts would have to be thrown away.

OK, popping back to the different types of MT systems. In 2014, we were introduced to a new way of doing MT using neural networks. Before 2017, we had sequence-to-sequence (seq2seq) models. These were end-to-end models comprised of an encoder and a decoder. The encoder handles the input sequence and produces one encoding, a single representation of the entire input sentence.
The decoder was then responsible for generating an output based on the encoded input. It is a conditional language model, conditioned on the encoding of the input.

OK, cut to 2017, when attention-based neural networks were introduced to us. This brought about significant improvements in NMT. We were no longer relying on how well the encoder represented the whole input: the decoder could focus on the specific parts of the input it was currently translating, so the encoder was not responsible for creating just one representation. The decoder was still a language model conditioned on the input, but at different time steps. In essence, attention was a more elegant solution to SMT's word alignment problem. So, given these advances in NMT, we thought we would try our hand at implementing an attention-based model from pre-existing and publicly available tools, and Raphael will let you know how we went about that process.

>> Raphael: Thank you, Mel. Now that Mel has given us the theoretical fundamentals of how we would approach building our machine translation model, I will take you through what we did in practice. The first step is gathering parallel English-Tetun text. What you find in the wild is not parallel sentences, which is what you want to train a model on, but parallel articles.
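Looking back at the attention mechanism Mel described: at each decoding step, the decoder scores every encoder state against its current state, turns the scores into weights with a softmax, and takes a weighted sum as context. A pure-Python toy of dot-product attention (illustrative only; real NMT attention is learned and batched, and the vectors here are arbitrary):

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """One attention step: weight each encoder state by its
    relevance to the current decoder state."""
    # Similarity score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Softmax turns scores into weights that sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: weighted sum of the encoder states.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(len(encoder_states[0]))]
    return weights, context
```

The decoder conditions on this context at every time step, instead of on a single fixed encoding of the whole sentence.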
In our case, there is a wealth of websites that publish content in both Tetun and English: government websites, the UN, NGOs, think tanks, and some educational books as well. We set out to scrape this data, gather it, and then, in the next step, create parallel sentences from it. You can see an example on the right-hand side of the slide: an article published on the official Timor-Leste government website, in both English and Tetun, about receiving 65,000 doses of the vaccine from Australia.

Once we have a parallel article, we need to find the parallel sentences. The same paragraph in English and Tetun might have a different sentence count, and sometimes a paragraph in English is never translated to Tetun, or the other way around. The kind of tool used here is a sentence aligner. You feed it a parallel article and a dictionary of which Tetun word corresponds to which English word, and it uses this to guess which sentences correspond. Sentences that don't align get thrown away. There is a confidence score for each pair, and we tried to adjust the threshold so that we would not discard too many sentences while keeping the quality of the sentence pairs pretty good.

Now that we have our initial set of sentence pairs, we still have to clean them before we feed them to the model.
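A dictionary-based sentence aligner of the kind Raphael describes can be sketched as follows: score each candidate sentence pair by dictionary overlap and keep pairs above a confidence threshold. The English-French mini-dictionary is invented for illustration; the project's actual aligner is not shown in the talk, and real projects often use a dedicated tool such as hunalign:

```python
def overlap_score(src_sentence, tgt_sentence, dictionary):
    """Fraction of source words whose dictionary translation
    appears in the candidate target sentence."""
    src = src_sentence.lower().split()
    tgt = set(tgt_sentence.lower().split())
    hits = sum(1 for w in src if dictionary.get(w) in tgt)
    return hits / len(src) if src else 0.0

def align(src_sentences, tgt_sentences, dictionary, threshold=0.3):
    """Greedy 1-to-1 alignment: pair each source sentence with its
    best-scoring target sentence, discarding low-confidence pairs."""
    pairs = []
    used = set()
    for s in src_sentences:
        best = max(
            (t for t in tgt_sentences if t not in used),
            key=lambda t: overlap_score(s, t, dictionary),
            default=None,
        )
        if best is not None and overlap_score(s, best, dictionary) >= threshold:
            pairs.append((s, best))
            used.add(best)
    return pairs
```

Raising `threshold` discards more pairs but improves pair quality, which is exactly the trade-off Raphael describes tuning.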
This step is actually fairly important, because you can have different Unicode representations of the same text, and you will need to remove non-printing characters that might confuse your model. Another thing of high importance was standardizing spelling. This was especially true for Tetun, and also for English: for example, UK versus American versus Australian spelling. For Tetun, even though there is a government-mandated spelling, sometimes people won't use it. That is the case for the word on the right, which means "which" or "where" in Tetun: at the bottom is the government-mandated form, and the forms at the top are the ones used in the wild, so we need to standardize these. There are two goals. One is that we want our model to use the correct form of Tetun; the second is that it will make our model smaller, because the model won't be confused by different forms of the same word and have to guess which form is the right one.

OK. After this, we apply tokenization: separating each one of the words by a space, with the punctuation also separated by a space. We split the corpus into a training set, a validation set and a test set, and we are ready to create a first model.

What did we use to train the model? This is going to be a neural network, so we will need a GPU to train it. In our case, we used Google Colab Pro. It gives you a notebook, and if you pay for the service you get a fairly good GPU.
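The cleaning steps just described (one canonical Unicode form, stripping non-printing characters, spelling standardization, and whitespace-and-punctuation tokenization) can be sketched like this. The spelling map is a guess built around the ne'ebé example from the slide, not the project's actual rules:

```python
import re
import unicodedata

# Spelling standardization map: variant -> government-mandated form.
# The variants of ne'ebé ("which"/"where") are assumed examples.
SPELLING = {
    "nebe": "ne'ebé",
    "nebé": "ne'ebé",
    "ne'ebe": "ne'ebé",
}

def clean_sentence(text):
    # One canonical Unicode representation for visually identical text.
    text = unicodedata.normalize("NFC", text)
    # Drop non-printing characters (zero-width spaces, BOMs, etc.).
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Standardize spelling word by word.
    return " ".join(SPELLING.get(w, w) for w in text.split())

def tokenize(text):
    # Separate punctuation from words with spaces.
    return re.sub(r"([.,!?;:])", r" \1 ", text).split()
```

Collapsing spelling variants shrinks the vocabulary, which is the "smaller model" benefit Raphael mentions.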
We got P100s, which were more than good enough; I think the whole model takes half a day to train. The library we used is called fairseq. It is a toolkit built by Facebook AI on top of PyTorch, specifically for sequence-to-sequence models: machine translation models take as input a sentence, a sequence of tokens, and the output is a sentence which is also a sequence of tokens. Fairseq has examples of how you can use it to create your own model, and ready-made architectures depending on what you want to do and on the size of your training data.

All right. After this, we created a first model. The quality was pretty poor, but it was clear that if we improved it, it would be more than enough to be useful. At this stage, we just iterate. Iterating means getting feedback from actual Tetun linguists: letting them try out the model so they can point out "this is not working like it should", "this is fine". It means tracking progress during training using TensorBoard: is the model overfitting on the training set, how is the validation loss doing, etc. And, like I said before, more spell checks and standardization; in our case that made the model a lot better.

All right. So by the end of model training, after something like 15-20 iterations, we had a model of fairly good quality. Google Translate, for moderately well-resourced languages, has a BLEU score of around 40.
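A fairseq training run of the kind Raphael describes might look roughly like this. The flags follow fairseq's public translation examples, and the language code and file layout are assumptions, not the talk's exact configuration:

```shell
# Binarize the tokenized train/valid/test splits for fairseq.
fairseq-preprocess \
    --source-lang en --target-lang tet \
    --trainpref corpus/train --validpref corpus/valid --testpref corpus/test \
    --destdir data-bin

# Train a transformer; on a Colab P100 this is roughly a half-day job.
fairseq-train data-bin \
    --arch transformer --optimizer adam --lr 5e-4 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints \
    --tensorboard-logdir logs

# Translate interactively with the best checkpoint.
fairseq-interactive data-bin --path checkpoints/checkpoint_best.pt
```

The `--tensorboard-logdir` flag is what makes the TensorBoard progress tracking Raphael mentions possible.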
This is what we had for our model: I think a BLEU score of 38 from English to Tetun and 40 from Tetun to English on the test set. We were pretty happy with it. Trying it out on various articles in both directions, we could see it was useful for getting the gist of an article, and people could actually use it to get the meaning of text, especially in a language that isn't theirs.

Now that we have a model of fairly decent quality, we serve it using a Django API. The model is loaded by fairseq, but the loading happens inside a Django endpoint: Django REST framework, one endpoint, three parameters (the text, which is a series of sentences, the source language code, and the target language code). All it does is split the text into sentences; for each sentence, we apply the same cleaning as for the corpus (the spell checks), translate each sentence individually using our model, and respond to the request with the translated text.

That API is used by the front ends: a website and mobile apps called Tetun.org. I had created them before as just a simple English-Tetun dictionary, and a fair number of people were using it. The translate feature was added to the existing front ends; you can see it here on the right-hand side. It is essentially like Google Translate: you enter a sentence in English, click translate, and it gives you the sentence in Tetun.
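The endpoint logic Raphael describes (split the request text into sentences, clean each one, translate it, rejoin) can be mirrored framework-free. The stubs below stand in for the fairseq model call and the cleaning step; all names are illustrative, not the service's actual code:

```python
import re

def translate_sentence(sentence, src, tgt):
    """Stub standing in for the fairseq model call."""
    return f"[{src}->{tgt}] {sentence}"

def clean(sentence):
    """Stub for the corpus-style cleaning / spell-check step."""
    return sentence.strip()

def translate_text(text, source_lang, target_lang):
    """Mirror of the API endpoint: split the request text into
    sentences, clean and translate each one, then rejoin."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    translated = [
        translate_sentence(clean(s), source_lang, target_lang)
        for s in sentences
    ]
    return " ".join(translated)
```

In the real service this function would sit inside a Django REST framework view, with the fairseq model loaded once at process start rather than per request.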
>> Mel: As Raphael mentioned, the site hosting the current service used to be a bilingual dictionary service, and now our core number of users has doubled. We get around two and a half thousand translations a day and 10,000 active users per month. The people who normally use the service are people who translate reports, or students of the language translating everyday interactions. We are now thinking about the next steps and how we improve the quality of the translations.

We are able to translate reports sufficiently well, but we fall down translating everyday conversational language, because the model wasn't trained on this kind of language at all. One way to remedy this, of course, is to include this kind of conversational style in the training data. In terms of techniques to improve the overall translation quality, we wanted to explore ways to do that without relying on manually constructing parallel corpora, because that is an expensive task. One avenue we wanted to pursue was using more monolingual data to create a synthetic parallel corpus. Please keep in touch with us via the website so you can check how we are faring in this endeavor.

Thanks very much for attending our presentation. We hope you enjoyed us taking you on our journey of creating this NMT service for Tetun. Thanks.

MODERATOR: Thank you for the talk.
Before we welcome you back to do the Q&A after the session, just reminding people: after this Q&A there is a conference close session in the Curlyboi theatre above Platypus Hall, so don't go anywhere except that room. That went well! We have several questions with votes on them. We will go to the first question, which is: what sort of corpus sizes do you need for the language model and translation model?

>> Mel, do you want to answer that?

>> You may need to unmute.

>> Do you mean if we went for SMT rather than neural? You usually need millions and millions. Raphael, we only had about 120,000 sentences, is that right?

>> Yeah, that's right. For Tetun it was enough. It depends on the size of the vocabulary, on whether the language has a very wide vocabulary or not. In the case of Tetun, each word usually has only one form, apart from some words that come from Portuguese. It happened to work fine for us.

>> That's interesting. Cool work. Have you tried using embedding-based metrics, e.g. BERTScore?

>> We haven't.

>> No, we haven't. But maybe, yeah: rather than using BLEU we could use BERTScore; that's really what it was invented for. Thanks for that suggestion. This was kind of done on a shoestring as well. Actually, Raphael and I met via somebody that I met via Facebook.
We actually, Raphael and I, have never actually met in person. We just kind of thought it would be really nice to have this resource, and it wasn't available.

>> There are a thousand things that we could do and try if we had unlimited time; this is one of them. There are so many others, like back-translation, etc. There is just not enough time.

MODERATOR: We have another question, which is: how much do you estimate the entire system costs?

>> Raphael: Django is on a DigitalOcean droplet, and I think that's $10 per month, and that's it. The front end uses Netlify. The iOS app is costing me $100 a year because of the Apple developer fee. The usage is low enough that it is still free, and there is the domain. You can say all together maybe $250 per year.

>> But it also cost a lot of love. We put a lot of love in it too.

>> We did. It cost way more time than it cost money, that's for sure.

>> Yes. [Laughter]

MODERATOR: Absolutely. I am not sure if we answered this in the talk: what help do you need going forward?

>> It is really just time. Actually, we have to try to figure out how we can accommodate this conversational sort of language.
Like we kind of have to be a bit picky about how we construct that corpus, because, like we said, we wanted to have more standard language, but conversational language is really quite informal. So how can we stick to standard language and still have that type of conversational style? With blogs, for example, sometimes you are going to have spelling that we don't want. Raphael started looking at more informal kinds of corpora that we decided to throw out. Did you want to talk a bit about that?

>> Raphael: This is something we would need to look for. For example, we don't have any data coming from places that have informal speech, like Facebook. It exists out there: there are a bunch of people posting on Facebook in Tetun, probably using more informal language than, say, a government website. This is what I would start looking into; we haven't had the time to do it. Trying to find sources that have more informal language, trying to integrate that, and setting up a test corpus of informal sentences whose translations we know. The model probably doesn't perform well on them now. Then, yeah, feed more informal sentences to the model and see how it adapts; hopefully it doesn't lose quality overall and gains quality on informal speech. That is high priority. Apart from that, there are many things that would be interesting to do.
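The back-translation idea mentioned earlier, using monolingual Tetun text to build a synthetic parallel corpus, can be sketched with stub models (the reverse model here is a placeholder; a real one would be a trained Tetun-to-English NMT system):

```python
def back_translate(monolingual_tgt, reverse_model):
    """Create a synthetic parallel corpus from monolingual target-side
    text: translate each target sentence back into the source language
    with a reverse model, then pair (synthetic source, real target)."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_tgt:
        synthetic_src = reverse_model(tgt_sentence)
        # Keeping real target text on the output side preserves the
        # training signal for fluent target-language generation.
        synthetic_pairs.append((synthetic_src, tgt_sentence))
    return synthetic_pairs

def stub_reverse_model(sentence):
    """Placeholder for a trained target-to-source translation model."""
    return "[src] " + sentence
```

The synthetic pairs are then mixed into the real parallel data and the forward model is retrained, which is one way to exploit monolingual data without manual translation.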
Trying to have speech-to-text transcription for the Tetun language: that is not something anybody has, and it would be useful to a bunch of people. Or adding more languages apart from English. I regularly have people reaching out over the Facebook Messenger channel we have for the service saying "can you add Indonesian?". I have tried, the quality wasn't good, and I gave up for lack of time. These are things that, if we added them, I know there are people out there who would use them.

MODERATOR: We have several questions and only time for one more. I think we will pick the next highest voted one, which is: what is needed to get this translation service added to Google Translate or other platforms?

>> Raphael: I reached out to the product manager of Google Translate before we started working on this, saying, hey, it would be really awesome for the country if you could add this language to Google Translate. I was in touch with the ministry of education, and they were the ones saying, hey, can you figure out some way to reach out to Google. Anyway, I reached out to Google, and the product manager responded saying they are essentially focusing on the languages they already support and trying to improve the quality of translation.
Even if they wanted to add languages to their list: I don't have a ranking of languages by number of speakers, but I am assuming that if they have a hundred languages now, the hundredth language has many more speakers than Tetun. Even if they added another hundred, it is possible Tetun still wouldn't be in that batch. Essentially, Google has priorities, and they put a high priority on languages that have a lot of speakers, so it isn't their priority at the moment. He did mention that if we had on the order of one million parallel sentences, they could start looking into it. But we are one order of magnitude lower than that, so I am not sure we can find that many more.

MODERATOR: That's rough. Thank you so much. We have more questions, so I will put them into the Platypus text chat. As Jack said, please don't go anywhere: the conference close is in the Curlyboi theatre after we finish. Head over there as soon as we finish up here, and we will have a lovely closing with thank-yous and recaps of the day. Thank you so much for being here at PyConline AU 2021.

>> Thanks, everyone.

>> Bye.