MODERATOR: Welcome back. Before I introduce our next speakers, please remember that the conference close is going to be in the Curlyboi theatre; that's the room just above this one in Venueless. So don't go anywhere except that room after Q&A, and we'll have the conference close and a nice time there.

Next up, we've got "Setting up a machine translation service for Timor-Leste" with Raphael Merx and Mel Mistica.

>> Hi everyone, thanks for coming to our talk. We'll be around for Q&A after the talk, and we'll also be online, so please ask questions. Thanks very much.

>> Raphael: All right, hello everyone. Today we're going to talk about how we set up a machine translation service for Timor-Leste. "Machine translation service" is a fancy natural language processing term for something like Google Translate: a service that translates sentences and whole articles. And we set up a new one for a language which is not supported by Google Translate or Bing Translate or any other free online machine translation service.

This was set up entirely as a volunteer project. You can check out the project on the website. We have apps for Android and iOS, there are about 10,000 monthly active users, and it is growing quickly. A little about us: I am Raphael. I am the head of governance and transparency at a technology-for-development-sector NGO.
Then we have Mel, our resident NLP expert; she is a language data scientist at the University of Queensland.

We will start with background: what the language is and why it is important. Mel is going to take us through the project fundamentals and how we approached it. Then I am going to talk about how we did it in practice: gathered the data, trained the model and published it. Mel is going to conclude with interesting future work and some of the usage statistics.

All right, a little bit of background. Timor-Leste is a half-island nation; the other half of the island is part of Indonesia. It has around 1.3 million inhabitants and gained independence in 2002. It has a variety of languages, and people might speak Tetun or Indonesian. One of the languages stands out: Tetun, the lingua franca of the country. People who don't share the same local language will speak Tetun with each other. Tetun is the most spoken language in the media, in politics, etc. It is definitely a language of high importance. Despite that, like I said, it was not on Google Translate or Bing Translate or any other translation service. As a person trying to learn Tetun, I can find words in individual dictionaries, but if I find a sentence I don't have an easy way to translate it into English.
A Tetun speaker who doesn't yet have a good command of English can't easily translate a whole article from English into Tetun. And if they have a sentence they are most comfortable saying in Tetun and want to translate it into English, there was no easy way to do this until we set up this new service.

All right, I am going to hand over to Mel, who will take us through the fundamentals of our project.

>> Mel: Thanks for a great summary of the context in which Tetun is used, Raphael. Given the diversity of linguistic backgrounds in the country, we thought it would be great if there was a machine translation service for anybody to use.

We asked ourselves a couple of questions when we embarked on this. What resources did we have, or what resources could we quickly and easily develop ourselves? And what was required to build an MT system? Obviously the resources we needed would depend on the kind of system we wanted to build. For example, some of the earlier systems were symbolic, rule-based systems that required a bunch of rules and lexicons. These kinds of systems were made up of three components: a set of rules for analysis, a set of rules for transfer (or correspondence rules), and a set of rules to generate the target language. And actually, very few resources were required to just get one up and running.
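The three-component pipeline Mel describes (analysis, transfer, generation) can be sketched as a toy. This is an invented illustration, not any system from the talk; the tiny English-to-Spanish lexicon and the single reordering rule are assumptions chosen only to show the shape of a transfer-based system:

```python
# Toy transfer-based MT: analysis -> transfer -> generation.
# Lexicon and rules are illustrative assumptions, not from the talk.
LEXICON = {"the": "la", "red": "roja", "house": "casa"}
POS = {"the": "DET", "red": "ADJ", "house": "NOUN"}

def analyse(sentence):
    """Analysis rules: tokenize and tag each source word."""
    return [(tok, POS.get(tok, "UNK")) for tok in sentence.lower().split()]

def transfer(tagged):
    """Transfer (correspondence) rules: map each word via the lexicon."""
    return [(LEXICON.get(tok, tok), pos) for tok, pos in tagged]

def generate(tagged):
    """Generation rules for the target language: Spanish adjectives
    usually follow the noun, so swap ADJ NOUN -> NOUN ADJ."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return " ".join(tok for tok, _ in out)

def translate(sentence):
    return generate(transfer(analyse(sentence)))
```

Even this toy shows the brittleness of the approach: any word or construction outside the lexicon and rule set simply falls through untranslated.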
One of the very first systems, from the '50s, had only a handful of rules and no more than a couple of hundred words. These kinds of systems were great for a very limited domain and very quick to get up and going, but beyond the few rules that were developed and the very limited number of words they could analyze, these were not very robust systems, not to a wide variety of inputs. They were quite brittle.

From the 1990s onwards, statistical machine translation (SMT) was the main mode. These models comprised a translation model and a language model. The language model, here in purple, is trained on vast amounts of monolingual data; it is the part of the model responsible for fluent-sounding output. The translation model is learned from parallel text, often millions of sentences, so it is very resource hungry. It is not shown in this formula, but building this kind of system and developing this kind of translation model also required modeling the best word alignment between the source and the target language. The source is denoted by x and the target output by y. In order to learn a good translation model, we required lots and lots of parallel data. After 2014, we experienced the next big seismic change in machine translation with the use of neural models.
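The SMT decision rule Mel describes, with x the source sentence and y the target output as on the slide, is conventionally written as the noisy-channel factorization (a reconstruction of the standard formula, not copied from the slides):

```latex
\hat{y} \;=\; \arg\max_{y}\, P(y \mid x)
       \;=\; \arg\max_{y}\; \underbrace{P(x \mid y)}_{\text{translation model}} \;\underbrace{P(y)}_{\text{language model}}
```

The word alignments Mel mentions are the latent variables inside P(x|y) that classic SMT systems, such as the IBM models, estimate from parallel data.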
We will talk about neural models a bit more later, but first I want to return to our resource issue. We had only started with an idea of building an MT system, with no other resources: we had no available parallel corpora, and having a brittle system was not that appealing to us. Building a symbolic MT system would require the help of many linguists who knew how to code and who were possibly proficient in both languages. But going down the SMT path meant building our own corpus, and this brought other issues.

If we created a corpus for an SMT project, we wanted the kind of translation system that reflected standard Tetun. This would limit what kinds of text we could include; it meant that we couldn't include some sources that were too formal. The other issue we had was that even when texts were bilingual, they were not completely parallel, so parts would have to be thrown away.

OK, popping back to the different types of MT systems. In 2014, we were introduced to a new way of doing MT using neural networks. Before 2017, we had sequence-to-sequence (seq2seq) models. These were end-to-end models comprised of an encoder and a decoder. The encoder handles the input sequence and produces one encoding, a single representation of the entire input sentence.
The decoder was then responsible for generating an output based on the encoded input. It is a conditional language model, conditioned on the encoding of the input.

OK, cut to 2017, when attention-based neural networks were introduced to us. This brought about significant improvements in NMT. We were no longer relying on how well the encoder represented the whole input: the decoder could focus on the specific parts of the input it was currently translating, so the encoder was not responsible for creating just one representation. The decoder was still a language model conditioned on the input, but at different time steps. In essence, attention was a more elegant solution to SMT's word alignment problem. So, given these advances in NMT, we thought we would try our hand at implementing an attention-based model from pre-existing and publicly available tools, and Raphael will let you know how we went about that process.

>> Raphael: Thank you, Mel. Now that Mel has given us the theoretical fundamentals of how we would approach building our machine translation model, I will take you through what we did in practice. The first step is gathering parallel English-Tetun text. What you find in the wild is not parallel sentences, which is what you want to train a model on, but parallel articles.
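Looking back at the attention mechanism Mel described: at each decoding step, the decoder scores every encoder state against its current state, turns the scores into weights with a softmax, and takes a weighted sum as context. A pure-Python toy of dot-product attention (illustrative only; real NMT attention is learned and batched, and the vectors here are arbitrary):

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """One attention step: weight each encoder state by its
    relevance to the current decoder state."""
    # Similarity score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Softmax turns scores into weights that sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: weighted sum of the encoder states.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(len(encoder_states[0]))]
    return weights, context
```

The decoder conditions on this context at every time step, instead of on a single fixed encoding of the whole sentence.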
In our case, there is a wealth of websites that publish content in both Tetun and English: government websites, the UN, NGOs, think tanks, and some educational books as well. We set out to scrape this data, gather it, and then, in the next step, create parallel sentences from it. You can see an example on the right-hand side of the slide: an article published on the official Timor-Leste government website, in both English and Tetun, about receiving 65,000 doses of the vaccine from Australia.

Once we have a parallel article, we need to find the parallel sentences. The same paragraph in English and Tetun might have a different sentence count, and sometimes a paragraph in English is never translated to Tetun, or the other way around. The kind of tool used here is a sentence aligner. You feed it a parallel article and a dictionary of which Tetun word corresponds to which English word, and it uses this to guess which sentences correspond. Sentences that don't align get thrown away. There is a confidence score for each pair, and we tried to adjust the threshold so that we would not discard too many sentences while keeping the quality of the sentence pairs pretty good.

Now that we have our initial set of sentence pairs, we still have to clean them before we feed them to the model.
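A dictionary-based sentence aligner of the kind Raphael describes can be sketched as follows: score each candidate sentence pair by dictionary overlap and keep pairs above a confidence threshold. The English-French mini-dictionary is invented for illustration; the project's actual aligner is not shown in the talk, and real projects often use a dedicated tool such as hunalign:

```python
def overlap_score(src_sentence, tgt_sentence, dictionary):
    """Fraction of source words whose dictionary translation
    appears in the candidate target sentence."""
    src = src_sentence.lower().split()
    tgt = set(tgt_sentence.lower().split())
    hits = sum(1 for w in src if dictionary.get(w) in tgt)
    return hits / len(src) if src else 0.0

def align(src_sentences, tgt_sentences, dictionary, threshold=0.3):
    """Greedy 1-to-1 alignment: pair each source sentence with its
    best-scoring target sentence, discarding low-confidence pairs."""
    pairs = []
    used = set()
    for s in src_sentences:
        best = max(
            (t for t in tgt_sentences if t not in used),
            key=lambda t: overlap_score(s, t, dictionary),
            default=None,
        )
        if best is not None and overlap_score(s, best, dictionary) >= threshold:
            pairs.append((s, best))
            used.add(best)
    return pairs
```

Raising `threshold` discards more pairs but improves pair quality, which is exactly the trade-off Raphael describes tuning.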
This step is actually fairly important, because you can have different Unicode representations of the same text, and you will need to remove non-printing characters that might confuse your model. Another thing of high importance was standardizing spelling. This was especially true for Tetun, and also for English: for example, UK versus American versus Australian spelling. For Tetun, even though there is a government-mandated spelling, sometimes people won't use it. That is the case for the word on the right, which means "which" or "where" in Tetun: at the bottom is the government-mandated form, and the forms at the top are the ones used in the wild, so we need to standardize these. There are two goals. One is that we want our model to use the correct form of Tetun; the second is that it will make our model smaller, because the model won't be confused by different forms of the same word and have to guess which form is the right one.

OK. After this, we apply tokenization: separating each one of the words by a space, with the punctuation also separated by a space. We split the corpus into a training set, a validation set and a test set, and we are ready to create a first model.

What did we use to train the model? This is going to be a neural network, so we will need a GPU to train it. In our case, we used Google Colab Pro. It gives you a notebook, and if you pay for the service you get a fairly good GPU.
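The cleaning steps just described (one canonical Unicode form, stripping non-printing characters, spelling standardization, and whitespace-and-punctuation tokenization) can be sketched like this. The spelling map is a guess built around the ne'ebé example from the slide, not the project's actual rules:

```python
import re
import unicodedata

# Spelling standardization map: variant -> government-mandated form.
# The variants of ne'ebé ("which"/"where") are assumed examples.
SPELLING = {
    "nebe": "ne'ebé",
    "nebé": "ne'ebé",
    "ne'ebe": "ne'ebé",
}

def clean_sentence(text):
    # One canonical Unicode representation for visually identical text.
    text = unicodedata.normalize("NFC", text)
    # Drop non-printing characters (zero-width spaces, BOMs, etc.).
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Standardize spelling word by word.
    return " ".join(SPELLING.get(w, w) for w in text.split())

def tokenize(text):
    # Separate punctuation from words with spaces.
    return re.sub(r"([.,!?;:])", r" \1 ", text).split()
```

Collapsing spelling variants shrinks the vocabulary, which is the "smaller model" benefit Raphael mentions.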
We got P100s, which were more than good enough; I think the whole model takes half a day to train. The library we used is called fairseq. It is a toolkit built by Facebook AI on top of PyTorch, specifically for sequence-to-sequence models: machine translation models take as input a sentence, a sequence of tokens, and the output is a sentence which is also a sequence of tokens. Fairseq has examples of how you can use it to create your own model, and ready-made architectures depending on what you want to do and on the size of your training data.

All right. After this, we created a first model. The quality was pretty poor, but it was clear that if we improved it, it would be more than enough to be useful. At this stage, we just iterate. Iterating means getting feedback from actual Tetun linguists: letting them try out the model so they can point out "this is not working like it should", "this is fine". It means tracking progress during training using TensorBoard: is the model overfitting on the training set, how is the validation loss doing, etc. And, like I said before, more spell checks and standardization; in our case that made the model a lot better.

All right. So by the end of model training, after something like 15-20 iterations, we had a model of fairly good quality. Google Translate, for moderately well-resourced languages, has a BLEU score of around 40.
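A fairseq training run of the kind Raphael describes might look roughly like this. The flags follow fairseq's public translation examples, and the language code and file layout are assumptions, not the talk's exact configuration:

```shell
# Binarize the tokenized train/valid/test splits for fairseq.
fairseq-preprocess \
    --source-lang en --target-lang tet \
    --trainpref corpus/train --validpref corpus/valid --testpref corpus/test \
    --destdir data-bin

# Train a transformer; on a Colab P100 this is roughly a half-day job.
fairseq-train data-bin \
    --arch transformer --optimizer adam --lr 5e-4 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints \
    --tensorboard-logdir logs

# Translate interactively with the best checkpoint.
fairseq-interactive data-bin --path checkpoints/checkpoint_best.pt
```

The `--tensorboard-logdir` flag is what makes the TensorBoard progress tracking Raphael mentions possible.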
This is what we had for our model: I think a BLEU score of 38 from English to Tetun and 40 from Tetun to English on the test set. We were pretty happy with it. Trying it out on various articles in both directions, we could see it was useful for getting the gist of an article, and people could actually use it to get the meaning of text, especially in a language that isn't theirs.

Now that we have a model of fairly decent quality, we serve it using a Django API. The model is loaded by fairseq, but the loading happens inside a Django endpoint: Django REST framework, one endpoint, three parameters (the text, which is a series of sentences, the source language code, and the target language code). All it does is split the text into sentences; for each sentence, we apply the same cleaning as for the corpus (the spell checks), translate each sentence individually using our model, and respond to the request with the translated text.

That API is used by the front ends: a website and mobile apps called Tetun.org. I had created them before as just a simple English-Tetun dictionary, and a fair number of people were using it. The translate feature was added to the existing front ends; you can see it here on the right-hand side. It is essentially like Google Translate: you enter a sentence in English, click translate, and it gives you the sentence in Tetun.
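The endpoint logic Raphael describes (split the request text into sentences, clean each one, translate it, rejoin) can be mirrored framework-free. The stubs below stand in for the fairseq model call and the cleaning step; all names are illustrative, not the service's actual code:

```python
import re

def translate_sentence(sentence, src, tgt):
    """Stub standing in for the fairseq model call."""
    return f"[{src}->{tgt}] {sentence}"

def clean(sentence):
    """Stub for the corpus-style cleaning / spell-check step."""
    return sentence.strip()

def translate_text(text, source_lang, target_lang):
    """Mirror of the API endpoint: split the request text into
    sentences, clean and translate each one, then rejoin."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    translated = [
        translate_sentence(clean(s), source_lang, target_lang)
        for s in sentences
    ]
    return " ".join(translated)
```

In the real service this function would sit inside a Django REST framework view, with the fairseq model loaded once at process start rather than per request.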
>> Mel: As Raphael mentioned, the site hosting the current service used to be a bilingual dictionary service, and now our core number of users has doubled. We get around two and a half thousand translations a day and 10,000 active users per month. The people who normally use the service are people who translate reports, or students of the language translating everyday interactions. We are now thinking about the next steps and how we improve the quality of the translations.

We are able to translate reports sufficiently well, but we fall down translating everyday conversational language, because the model wasn't trained on this kind of language at all. One way to remedy this, of course, is to include this kind of conversational style in the training data. In terms of techniques to improve the overall translation quality, we wanted to explore ways to do that without relying on manually constructing parallel corpora, because that is an expensive task. One avenue we wanted to pursue was using more monolingual data to create a synthetic parallel corpus. Please keep in touch with us via the website so you can check how we are faring in this endeavor.

Thanks very much for attending our presentation. We hope you enjoyed us taking you on our journey of creating this NMT service for Tetun. Thanks.

MODERATOR: Thank you for the talk.
Before we welcome you back to do the Q&A after the session, just reminding people: after this Q&A there is a conference close session in the Curlyboi theatre above Platypus Hall, so don't go anywhere except that room. That went well! We have several questions with votes on them. We will go to the first question, which is: what sort of corpus sizes do you need for the language model and translation model?

>> Mel, do you want to answer that?

>> You may need to unmute.

>> Do you mean if we went for SMT rather than neural? You usually need millions and millions. Raphael, we only had about 120,000 sentences, is that right?

>> Yeah, that's right. For Tetun it was enough. It depends on the size of the vocabulary, on whether the language has a very wide vocabulary or not. In the case of Tetun, each word usually has only one form, apart from some words that come from Portuguese. It happened to work fine for us.

>> That's interesting. Cool work. Have you tried using embedding-based metrics, e.g. BERTScore?

>> We haven't.

>> No, we haven't. But maybe, yeah: rather than using BLEU we could use BERTScore; that's really what it was invented for. Thanks for that suggestion. This was kind of done on a shoestring as well. Actually, Raphael and I met via somebody that I met via Facebook.
We actually, Raphael and I, have never actually met in person. We just kind of thought it would be really nice to have this resource, and it wasn't available.

>> There are a thousand things that we could do and try if we had unlimited time; this is one of them. There are so many others, like back-translation, etc. There is just not enough time.

MODERATOR: We have another question, which is: how much do you estimate the entire system costs?

>> Raphael: Django is on a DigitalOcean droplet, and I think that's $10 per month, and that's it. The front end uses Netlify. The iOS app is costing me $100 a year because of the Apple developer fee. The usage is low enough that it is still free, and there is the domain. You can say all together maybe $250 per year.

>> But it also cost a lot of love. We put a lot of love in it too.

>> We did. It cost way more time than it cost money, that's for sure.

>> Yes. [Laughter]

MODERATOR: Absolutely. I am not sure if we answered this in the talk: what help do you need going forward?

>> It is really just time. Actually, we have to try to figure out how we can accommodate this conversational sort of language.
Like we kind of have to be a bit picky about how we construct that corpus, because, like we said, we wanted to have more standard language, but conversational language is really quite informal. So how can we stick to standard language and still have that type of conversational style? With blogs, for example, sometimes you are going to have spelling that we don't want. Raphael started looking at more informal kinds of corpora that we decided to throw out. Did you want to talk a bit about that?

>> Raphael: This is something we would need to look for. For example, we don't have any data coming from places that have informal speech, like Facebook. It exists out there: there are a bunch of people posting on Facebook in Tetun, probably using more informal language than, say, a government website. This is what I would start looking into; we haven't had the time to do it. Trying to find sources that have more informal language, trying to integrate that, and setting up a test corpus of informal sentences whose translations we know. The model probably doesn't perform well on them now. Then, yeah, feed more informal sentences to the model and see how it adapts; hopefully it doesn't lose quality overall and gains quality on informal speech. That is high priority. Apart from that, there are many things that would be interesting to do.
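The back-translation idea mentioned earlier, using monolingual Tetun text to build a synthetic parallel corpus, can be sketched with stub models (the reverse model here is a placeholder; a real one would be a trained Tetun-to-English NMT system):

```python
def back_translate(monolingual_tgt, reverse_model):
    """Create a synthetic parallel corpus from monolingual target-side
    text: translate each target sentence back into the source language
    with a reverse model, then pair (synthetic source, real target)."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_tgt:
        synthetic_src = reverse_model(tgt_sentence)
        # Keeping real target text on the output side preserves the
        # training signal for fluent target-language generation.
        synthetic_pairs.append((synthetic_src, tgt_sentence))
    return synthetic_pairs

def stub_reverse_model(sentence):
    """Placeholder for a trained target-to-source translation model."""
    return "[src] " + sentence
```

The synthetic pairs are then mixed into the real parallel data and the forward model is retrained, which is one way to exploit monolingual data without manual translation.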
Trying to have speech-to-text transcription for the Tetun language: that is not something anybody has, and it would be useful to a bunch of people. Or adding more languages apart from English. I regularly have people reaching out over the Facebook Messenger channel we have for the service saying "can you add Indonesian?". I have tried, the quality wasn't good, and I gave up for lack of time. These are things that, if we added them, I know there are people out there who would use them.

MODERATOR: We have several questions and only time for one more. I think we will pick the next highest voted one, which is: what is needed to get this translation service added to Google Translate or other platforms?

>> Raphael: I reached out to the product manager of Google Translate before we started working on this, saying, hey, it would be really awesome for the country if you could add this language to Google Translate. I was in touch with the ministry of education, and they were the ones saying, hey, can you figure out some way to reach out to Google. Anyway, I reached out to Google, and the product manager responded saying they are essentially focusing on the languages they already support and trying to improve the quality of translation.
Even if they wanted to add languages to their list: I don't have a ranking of languages by number of speakers, but I am assuming that if they have a hundred languages now, the hundredth language has many more speakers than Tetun. Even if they added another hundred, it is possible Tetun still wouldn't be in that batch. Essentially, Google has priorities, and they put a high priority on languages that have a lot of speakers, so it isn't their priority at the moment. He did mention that if we had on the order of one million parallel sentences, they could start looking into it. But we are one order of magnitude lower than that, so I am not sure we can find that many more.

MODERATOR: That's rough. Thank you so much. We have more questions, so I will put them into the Platypus text chat. As Jack said, please don't go anywhere: the conference close is in the Curlyboi theatre after we finish. Head over there as soon as we finish up here, and we will have a lovely closing with thank-yous and recaps of the day. Thank you so much for being here at PyConline AU 2021.

>> Thanks, everyone.

>> Bye.