MODERATOR: Hello, and welcome back to the Platypus Hall. Alan Green is next. He is a software engineer, and in his spare time Alan likes to play with oscilloscopes and do soldering.

>> Thank you, Kaitlyn. I am coming to you from the leafy side of the hills. I would like to acknowledge the people of the land, the traditional custodians of the land.

I am excited to present the CFU-Playground, which is for building machine learning accelerators. Python is key to making it all work and making it fun. Myself, my co-author and a number of others have been building the CFU-Playground for the past year. It uses FPGAs and ML, which is machine learning. You can think of an FPGA as hardware, but with software-defined configuration.

We will examine what machine learning is and how models are run, specifically in the context of small computers, and then look at FPGAs and how they can accelerate arithmetic. We will take a high-level overview of RISC-V CPUs and how we extend them with custom function units. Then we will tie it all back together with an example of accelerating ML on FPGAs. I have tried to pitch the talk so that anyone with some experience gets something out of it, but if you find yourself lost, don't worry: a new topic will be along soon.

Running machine learning models on FPGAs is unusual. We had a project where it was necessary to run ML models at low power, quickly, and using a completely open source stack. CPUs were too power hungry for our application, so we looked at microcontrollers with embedded digital signal processors, or DSPs. Many met the power budget, but all the ones we found had a closed source component: a secret C++ compiler, a model preprocessor or a blob to link against, so they were not suitable. And while a microcontroller without a DSP can be low power and open, it can't also, at the same time, be fast enough. So we thought we might build our own ML engine out of an FPGA.

My co-author suggested using custom function units as a way to reduce complexity. The idea is that we would implement our own low-power microcontroller design and add custom function units to make the ML evaluation fast, while doing all of this with a completely open source toolchain. Once we decided to build our own ML accelerators with custom function units, it became clear we needed a way to experiment with CFUs, and that's how the CFU-Playground came about. It lets programmers build ML accelerators that are fast, low power and open source on FPGAs, and, as a bonus, in Python.
The CFU-Playground runs on real hardware. This is one board we use. We use it because it is reasonably obtainable, reasonably capable, has an open source toolchain, and has a USB port to communicate with the host computer.

What goes on the FPGA? The FPGA is a blank slate when we start. We know we need a CPU, and the CPU needs memory, so we have memory too. To communicate with the host, we add a serial port. And then there is the custom function unit. The exact kind of CFU we build depends largely on the model we want to accelerate. Did I mention open source? I will also call out the Python framework we use.

Machine learning has two parts: training, which builds the model, and running, which uses the model. Let's look at training first. In training, one takes a bunch of labelled examples, which is a fancy way of saying "when you see this example, give this label as the answer". You give it examples for all the different kinds of output you expect. For something like image recognition, you would need many labelled examples; the more the better. After a lot of computation, maybe hundreds or thousands of hours, we have a model that knows what a hot dog looks like. We can then take this model and use it: run the model on an input and get the result.

What does it mean to run the model, and how does it actually work? The model specifies a series of operations. To run the model, we set the input data, then run each of the operations, one after the other, to get a result. Data in, operations, result out. Easy!

An important type of operation is 2D convolution. Let's look at the TensorFlow source code and see how these 2D convolutions work. The code is in C++ and is about 30 lines. It will look complicated, but in less than two minutes you will understand everything we need to know about this code. The first thing to notice is the seven-levels-deep nested loops, which we can illustrate with a friendly rainbow. Let's focus on the innermost loop, since that is where the code spends almost all of its time.

The inner loop does three things. It fetches something called input from memory, then fetches something called filter, and then performs a simple calculation: it adds an offset to the input, multiplies, and adds the result to an accumulator. The code spends a lot of time multiplying and adding. Let's look at the accumulator. It is reset before the loop. After the loop, there is a somewhat complicated process to compress it from a large 32-bit value down to 8 bits, and finally the output is stored in the output data array.
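In Python terms, the inner loop being described boils down to something like the sketch below. The names are illustrative; the real kernel is the C++ code in TensorFlow Lite.

```python
def conv2d_inner(input_vals, filter_vals, input_offset):
    """One output point of a 2D convolution: the inner multiply-accumulate.

    input_vals and filter_vals are lists of small (8-bit) integers for one
    filter window; input_offset is a constant added to each input value.
    """
    acc = 0  # the accumulator is reset before the inner loop
    for x, w in zip(input_vals, filter_vals):
        # the one-line calculation the talk keeps coming back to
        acc += (x + input_offset) * w
    # in the real kernel, the 32-bit accumulator is then compressed down to
    # an 8-bit output value (scaling, rounding and clamping omitted here)
    return acc
```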
The rest of the code is concerned with making sure that we get the right indices into the input data, filter data and output data arrays at the right time. The most important thing to learn from this is that an ML evaluation is mostly a matter of running this one-line calculation millions of times with the right inputs.

That was a whirlwind tour of machine learning and 2D convolution. In the next section we will look at FPGAs and how they are configured. Many of you probably already know that FPGA is an acronym standing for field programmable gate array. FPGAs are relatively slow by hardware standards, but incredibly fast by software standards. We build up the gateware in Python with the nMigen library. You could use Verilog, but that is not Python.

The FPGA contains many different kinds of logic building blocks. Here are three that are important: the look-up table, the flip-flop and the multiplier. The look-up table can be programmed to calculate any function that has up to four 1-bit inputs and one 1-bit output. For example, if we program it to always output 0 unless all four inputs are 1, then we have an AND function. If it is programmed to output 1 unless all inputs are 0, then we have an OR function. Arbitrary functions of four bits can be programmed, and we can use a large number of these look-up tables. They operate continuously, and any change to the inputs will be reflected in the output after just a few nanoseconds.

To synchronise the outputs we use flip-flops, whose job is to remember a single bit. If enabled, a flip-flop takes the input and sends it to the output on the next clock cycle. If it is not enabled, it ignores the input and keeps showing the same output value. A single bit can be useful, but to store multi-bit numbers we can chain flip-flops together into registers. And finally, the third kind of logic block mentioned is the multiplier, and it does exactly what you would expect: it multiplies numbers.

The question is: given a large number of these types of logic building blocks, how would we use them to implement a single function that did an addition and a multiplication? The picture on the right shows one way to do it with look-up tables and a multiplier. A and B are inputs; A is added to a constant offset, and the result is multiplied by B. The adder is made of look-up tables, and we use the multiplier block because multiplication is harder. We can express this in nMigen. The mul function begins by creating a 16-bit Signal, r.
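Here is a minimal sketch of what such a mul function might look like in nMigen, assuming illustrative names rather than the exact code on the slide (nMigen has since been renamed Amaranth, with a very similar API):

```python
from nmigen import Module, Signal, signed

OFFSET = 128  # a fixed constant offset for now; later this becomes a register

def mul(m: Module, a, b):
    """Combinational (a + OFFSET) * b, as in the schematic."""
    r = Signal(signed(16))                 # 16-bit result signal
    m.d.comb += r.eq((a + OFFSET) * b)     # computed continuously, no flip-flops
    return r
```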
The m.d.comb += notation puts the statement in the combinational domain, meaning it happens continuously, without flip-flops to synchronise it.

Let's write a bit more Python here. Our next step is to have four of these multiply-adds. FPGAs are great at having many copies of the same logic running at the same time, and our configuration language, Python, is great at making many copies of the same object. On the right, we have inputs capital A and B, which are 32 bits each. We slice each 32-bit input into four values and, for each of the four slices, we perform the same add-and-multiply, then add everything together to get a single result number. Looking at the Python code on the left, we have the mul function from the previous slide and have added a new function, mul4, because it does four of them. It uses a generator to slice the 32-bit inputs, feeding one slice to each mul, and then specifies that the results be added together.

At this point, we think it might be convenient to add the result of mul4 to an accumulator. We make a register and connect it up, on the right, so the result is added to the existing value of the accumulator. To do this in nMigen we declare a 32-bit value named acc and use m.d.sync: the value on the next clock cycle is whatever value there is now, plus the result of mul4. That's quite a bit. I hope you can see that nMigen is a relatively friendly way for Python programmers to specify FPGA configuration.

Back to the outline: we will move on from FPGAs to how a RISC-V processor can be extended with custom function units. Here is the architecture diagram again. We are interested in the interface between the CPU and the custom function unit. To illustrate how the interface works, we will begin with a standard RISC-V add instruction. This is the assembly instruction in the form a programmer would use. As the comment hopefully points out, it adds register 4 to register 7 and puts the result into register 24. Of course, the CPU doesn't read the instruction or the comment. It only understands the 32-bit instruction word, so let's look at those 32 bits.

The first thing the RISC-V CPU looks at is the right-most seven bits, forming the opcode. In this case, 0110011 means this instruction is going to do something with the registers and the arithmetic logic unit, or ALU. It tells the CPU how it should interpret the rest of the 32-bit instruction word.
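As a quick illustration, the 32-bit word for that add can be packed and unpacked like this. The field positions are the standard RISC-V R-type layout; the helper name is just for this sketch.

```python
def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six R-type fields into a 32-bit RISC-V instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

# add x24, x4, x7: opcode 0110011, funct3 000, funct7 0000000
word = encode_r_type(opcode=0b0110011, rd=24, funct3=0b000,
                     rs1=4, rs2=7, funct7=0b0000000)

print(f"{word:032b}")                  # the full 32-bit instruction word
print("opcode:", bin(word & 0x7f))     # right-most seven bits
print("rd:  ", (word >> 7) & 0x1f)     # destination register: 24
print("rs1: ", (word >> 15) & 0x1f)    # first source register: 4
print("rs2: ", (word >> 20) & 0x1f)    # second source register: 7
```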
The rs1 and rs2 fields specify which registers are the sources, the inputs that need to be routed to the ALU. In this case, registers 7 and 4 are read from the register file. The funct7 and funct3 fields tell the ALU what operation to perform; here, they tell the ALU to add. And rd specifies the destination register: the result is sent to register 24. So two values come from the register file and go to the ALU, which performs the calculation, and the result is placed back into the register file.

Now let's look at how the custom function unit is used. The RISC-V instruction set has two reserved opcodes, and we use one of them to tell the CPU to process data using the CFU. Once the CPU sees that opcode, it understands how to interpret the remainder of the instruction. rs1 and rs2 specify the source registers, and the CPU routes them to the CFU instead of the ALU. funct7 and funct3 tell the CFU what operation to perform, and rd specifies where the output should be sent. With the CFU instruction, a program can tell the RISC-V CPU to take two 32-bit values from registers, send them to the CFU for processing, and put a 32-bit result back into a register.

To allow the CFU to be used from a C program, the CFU-Playground provides a macro that translates into a single CFU machine instruction. The example at the bottom of the slide shows C code and the assembler it is translated into. As you can see, the translation is pretty much one-to-one.

So that was an overview of how a CFU works and how to use it from a C program. Now we are going to put it all together and accelerate TensorFlow Lite 2D convolutions. Here is our system on chip again. To accelerate 2D convolutions, we will need to build a custom function unit that can assist with the 2D convolution, and modify the TensorFlow implementation of 2D convolution to take advantage of the CFU.

Here is the schematic for the gateware we defined ten minutes ago. It needs a couple of tweaks. Having a fixed offset might work, but it is a little crude, so we will replace it with a register and some way to set its contents. We will add a way to reset the accumulator too. This is enough to be useful. Now, here is the nMigen code to implement all of this. There is a little more boilerplate that surrounds this code, but I omitted it to get it onto one slide. Here are the important parts. We begin by defining offset and acc as 9-bit and 32-bit signed numbers. Next are the mul and mul4 functions, working the same way as before.
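As a rough sketch, not the exact slide code, gateware along these lines might look like the following. The port names start, done, funct7, in0, in1 and output are assumptions here; the real CFU-Playground wraps this in its own CFU interface class with a ready/valid handshake, and the signedness of the 8-bit lanes is glossed over.

```python
from nmigen import Elaboratable, Module, Signal, signed

class Cfu(Elaboratable):
    """Illustrative CFU: funct7 0 sets the offset register, 1 resets the
    accumulator, 2 does four 8-bit multiply-accumulates."""

    def __init__(self):
        self.start  = Signal()     # assumed handshake: operation requested
        self.done   = Signal()     # assumed handshake: result is ready
        self.funct7 = Signal(7)    # operation selector from the instruction
        self.in0    = Signal(32)   # value from source register rs1
        self.in1    = Signal(32)   # value from source register rs2
        self.output = Signal(32)   # value written back to register rd

    def elaborate(self, platform):
        m = Module()
        offset = Signal(signed(9))
        acc    = Signal(signed(32))

        def mul(a, b):
            r = Signal(signed(16))
            m.d.comb += r.eq((offset + a) * b)
            return r

        def mul4(a, b):
            return sum(mul(a.word_select(i, 8), b.word_select(i, 8))
                       for i in range(4))

        m.d.comb += self.done.eq(self.start)   # every operation takes one cycle
        m.d.comb += self.output.eq(acc)        # always output the accumulator

        with m.If(self.start):
            with m.If(self.funct7 == 0):
                m.d.sync += offset.eq(self.in0)    # set the offset register
            with m.If(self.funct7 == 1):
                m.d.sync += acc.eq(0)              # reset the accumulator
            with m.If(self.funct7 == 2):
                m.d.sync += acc.eq(acc + mul4(self.in0, self.in1))
        return m
```

Dispatching on funct7 like this keeps the decode trivial: one funct7 value per operation, each completing in a single cycle.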
The next line sends the done signal as soon as start is received, since the functions we perform all take a single cycle. This line ensures the CFU always outputs the accumulator value. The m.If means "whenever": whenever the condition is true, the contained logic is enabled, otherwise it is disabled. Whenever start is received and funct7 is 0, the first input value, in0, is written into the offset register. If funct7 is 1 while start is true, the accumulator is set to 0. And if funct7 is 2 while start is true, we perform the multiply-add on the input values and add the result to the accumulator. That's all the gateware we need. The Python nMigen syntax makes it pretty compact.

So remember this code. We are going to use our new gateware to accelerate the inner loop. The first thing to note is the original code that retrieves the values from the input data and filter data arrays. At the top of the file, we #define a macro that wraps the CFU macro, and we use it here. Besides that, the main changes we need are to set the offset and reset the accumulator.

Here is how it works out for the example model. The overall model evaluation time went from 147 million cycles to 8 million. If we look at just the 2D convolution, we went from 95 million cycles to 33 million cycles. Not bad for 20 lines of Python and another 15 to 20 lines of C++.

In this talk, we looked at how tiny ML models run: they are doing millions of adds and multiplies. We looked at what FPGAs are and how to build gateware that does those adds and multiplies. We looked at how a CPU interacts with a custom function unit, and we put it all together to build an accelerator that makes running an ML model faster.

There are plenty of aspects of the CFU-Playground we didn't cover. We put a lot of effort into the tutorials and guides for beginners; people who do them tell us they are helpful and fun, so please try them out and raise bugs to give feedback. Unit testing gateware is more fun than debugging on hardware. I am a bit sad I didn't have time to talk about the iterative, baby-steps approach to building gateware, which is easier and more fun than trying to build and integrate giant components.

There are many people to thank for their part in making the CFU-Playground a reality, and I would like to call out a couple. First, Tim Callahan put together the structure of the playground so the rest of us could have fun writing gateware. I would also like to call out the work of Rachel and Joey, who started off knowing very little about FPGAs and ended up implementing whole accelerators.
Joey made a step-by-step guide for getting started, and I used it heavily in this talk.

If you would like to learn more about FPGAs, or about how ML models are evaluated, or about building ML accelerators with open source tools, here is how to get started. The tutorial on Read the Docs is the best place to start. The code is on GitHub. Feel free to reach out.

>> We have great questions, so we can get straight into them. Fantastic. What are you building?

>> We are building a thing which may or may not work. If it works, it may appear in a Chromebook.

>> Great. Next up: is there support for the interesting Chinese FPGAs, and the Lattice parts targeted by the open source toolchain?

>> We really want our product to be open source, and we are going out of our way to support open source toolchains. The Lattice toolchain is great. These are all good.

>> Thanks a lot. The next question was: as a person with curiosity but no hardware knowledge, what's a good entry point into edge or tiny ML? Projects or references? Thanks for the beautiful explanations.

>> The CFU-Playground is a great way to get into it. [Laughter] The TensorFlow for Microcontrollers site has pretty good instructions if you want to get started. It recommends a number of microcontroller products, and that's not a bad way to get into it. And the CFU-Playground, if you want to get hold of an FPGA development board, makes it super easy to get into it and start messing around with the C code.

>> And I think some conferences give FPGA boards as the badge for that particular conference. That can often be quite good.

>> Yeah, if you have a development board, or a conference badge you would like to see the CFU-Playground on, please get in touch. That would be fun.

>> You mentioned that you are using a reserved RISC-V opcode. Is this space reserved for custom opcodes, or for future use by standardised opcodes?

>> Good question. No, the RISC-V specification reserves these two opcodes for implementers. They are never going to be used for standard RISC-V instructions; there are other reserved spaces for those. These two are specifically so that implementers, like the CFU-Playground, can try out new things. It is one of the really great things about RISC-V: you have this giant RISC-V GCC toolchain that already works, and you can add just the little bit you need to make it do what you want.

>> One more question, and if there are more you might have time to type them in. Is it easy to write
functional tests for components in nMigen?

>> Yes, it is, with nMigen.

>> I had to ask.

>> Yeah, you would have to. Do you run it through Python or hardware simulators?

>> We do both, but we definitely find the most bugs with the Python unit tests. The Python unit tests import unittest and subclass the usual test case, and then nMigen has a simulator: you can poke at the simulator and check it is giving you back the results that you want. The test cases are really fun. I don't know of any other environment that lets you write some gateware, test it, fix it and run it again in less than 30 seconds.

>> That's pretty good. We have a flip side: as a person with lots of hardware experience but no ML experience, where is a good starting point for that?

>> I would have to think about that. Please come and talk to me afterwards.

>> We have the hallway track for that. One last question: why do you need four muls of 8 bits?

>> Right. This all comes down to the technology your ML model is using. There is a lot of research being put into 8-bit models at the moment, and the good thing about 8-bit models is you can move the data around four times as fast, because there is four times less of it. That's why we are using 8-bit models.

>> Thank you so much for your talk, Alan. This was really good.

>> Thank you.

>> Thank you to everybody.

>> If you want to continue the conversation, we have a long break coming up. You can continue to use this hall, or the hall text chat that's related to it. During the lunch break, if you have got some spare time and you are interested, the lightning talk submissions close at 1:00 p.m. Australian time, which is in half an hour. If you think you can come up with a 5-minute talk on anything interesting, it doesn't have to be Python related. Thank you, Mickey; my cat meowed. You can find the links, which have all the social events happening in the next day. Thank you.
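For anyone curious what that kind of test looks like, here is a minimal sketch of a unittest driving the nMigen simulator against the illustrative Cfu class sketched earlier. The import path differs slightly between nMigen versions, and the real CFU-Playground tests use their own helpers.

```python
import unittest
from nmigen.sim import Simulator   # nmigen.back.pysim in older releases

# from cfu_sketch import Cfu       # wherever the earlier illustrative Cfu lives

class TestCfu(unittest.TestCase):
    def test_reset_then_multiply_accumulate(self):
        dut = Cfu()                # the illustrative CFU from earlier
        sim = Simulator(dut)
        sim.add_clock(1e-6)

        def process():
            yield dut.start.eq(1)
            yield dut.funct7.eq(1)           # reset the accumulator
            yield
            yield dut.funct7.eq(2)           # multiply-accumulate four lanes
            yield dut.in0.eq(0x01010101)     # four input bytes, each 1
            yield dut.in1.eq(0x02020202)     # four filter bytes, each 2
            yield
            yield dut.start.eq(0)
            yield
            # offset is still 0, so we expect 4 * (1 * 2) = 8
            self.assertEqual((yield dut.output), 8)

        sim.add_sync_process(process)
        sim.run()

if __name__ == "__main__":
    unittest.main()
```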