MODERATOR: Hello, and welcome back to the Platypus Hall. Alan Green is next. He is a software engineer, and in his spare time Alan likes to play with oscilloscopes and do soldering.

>> Thank you, Kaitlyn. I am coming to you from the leafy side of the hills. I would like to acknowledge the people of the land, the traditional custodians of the land.

I am excited to present the CFU-Playground, which is for building machine learning accelerators. Python is key to making it all work and making it fun. Myself, my co-author and a number of others have been building the CFU-Playground for the past year. It uses FPGAs and ML, which is machine learning. You can think of an FPGA as hardware, but with software-defined configuration.

We will examine what machine learning is and how models are run, specifically in the context of small computers, and then look at FPGAs and how they can accelerate arithmetic. We will take a high-level overview of RISC-V CPUs and how we extend them with custom function units. Then we will tie it all back together with an example of accelerating ML on FPGAs. I have tried to pitch the talk so that anyone with some experience gets something out of it, but if you find yourself lost, don't worry: a new topic will be along soon.

Running machine learning models on FPGAs is unusual. We had a project where it was necessary to run ML models at low power, quickly, and using a completely open source stack. CPUs were too power hungry for our application, so we looked at microcontrollers with embedded digital signal processors, or DSPs. Many met the power budget, but all the ones we found had a closed source component: a secret C++ compiler, a model preprocessor or a blob to link against, so they were not suitable. And while a microcontroller without a DSP can be low power and open, it can't also, at the same time, be fast enough. So we thought we might build our own ML engine out of an FPGA.

My co-author suggested using custom function units as a way to reduce complexity. The idea is that we would implement our own low-power microcontroller design and add custom function units to make the ML evaluation fast, while doing all of this with a completely open source toolchain. Once we decided to build our own ML accelerators with custom function units, it became clear we needed a way to experiment with CFUs, and that's how the CFU-Playground came about. It lets programmers build ML accelerators that are fast, low power and open source on FPGAs, and, as a bonus, in Python.
The CFU-Playground runs on real hardware. This is one board we use. We use it because it is reasonably obtainable, reasonably capable, has an open source toolchain, and has a USB port to communicate with the host computer.

What goes on the FPGA? The FPGA is a blank slate when we start. We know we need a CPU, and the CPU needs memory, so we have memory too. To communicate with the host, we add a serial port. And then there is the custom function unit. The exact kind of CFU we build depends largely on the model we want to accelerate. Did I mention open source? I will also call out the Python framework we use.

Machine learning has two parts: training, which builds the model, and running, which uses the model. Let's look at training first. In training, one takes a bunch of labelled examples, which is a fancy way of saying "when you see this example, give this label as the answer". You give it examples for all the different kinds of output you expect. For something like image recognition, you would need many labelled examples; the more the better. After a lot of computation, maybe hundreds or thousands of hours, we have a model that knows what a hot dog looks like. We can then take this model and use it: run the model on an input and get the result.

What does it mean to run the model, and how does it actually work? The model specifies a series of operations. To run the model, we set the input data, then run each of the operations, one after the other, to get a result. Data in, operations, result out. Easy!

An important type of operation is 2D convolution. Let's look at the TensorFlow source code and see how these 2D convolutions work. The code is in C++ and is about 30 lines. It will look complicated, but in less than two minutes you will understand everything we need to know about this code. The first thing to notice is the seven-levels-deep nested loops, which we can illustrate with a friendly rainbow. Let's focus on the innermost loop, since that is where the code spends almost all of its time.

The inner loop does three things. It fetches something called input from memory, then fetches something called filter, and then performs a simple calculation: it adds an offset to the input, multiplies, and adds the result to an accumulator. The code spends a lot of time multiplying and adding. Let's look at the accumulator. It is reset before the loop. After the loop, there is a somewhat complicated process to compress it from a large 32-bit value down to 8 bits, and finally the output is stored in the output data array.
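In Python terms, the inner loop being described boils down to something like the sketch below. The names are illustrative; the real kernel is the C++ code in TensorFlow Lite.

```python
def conv2d_inner(input_vals, filter_vals, input_offset):
    """One output point of a 2D convolution: the inner multiply-accumulate.

    input_vals and filter_vals are lists of small (8-bit) integers for one
    filter window; input_offset is a constant added to each input value.
    """
    acc = 0  # the accumulator is reset before the inner loop
    for x, w in zip(input_vals, filter_vals):
        # the one-line calculation the talk keeps coming back to
        acc += (x + input_offset) * w
    # in the real kernel, the 32-bit accumulator is then compressed down to
    # an 8-bit output value (scaling, rounding and clamping omitted here)
    return acc
```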
The rest of the code is concerned with making sure that we get the right indices into the input data, filter data and output data arrays at the right time. The most important thing to learn from this is that an ML evaluation is mostly a matter of running this one-line calculation millions of times with the right inputs.

That was a whirlwind tour of machine learning and 2D convolution. In the next section we will look at FPGAs and how they are configured. Many of you probably already know that FPGA is an acronym standing for field programmable gate array. FPGAs are relatively slow by hardware standards, but incredibly fast by software standards. We build up the gateware in Python with the nMigen library. You could use Verilog, but that is not Python.

The FPGA contains many different kinds of logic building blocks. Here are three that are important: the look-up table, the flip-flop and the multiplier. The look-up table can be programmed to calculate any function that has up to four 1-bit inputs and one 1-bit output. For example, if we program it to always output 0 unless all four inputs are 1, then we have an AND function. If it is programmed to output 1 unless all inputs are 0, then we have an OR function. Arbitrary functions of four bits can be programmed, and we can use a large number of these look-up tables. They operate continuously, and any change to the inputs will be reflected in the output after just a few nanoseconds.

To synchronise the outputs we use flip-flops, whose job is to remember a single bit. If enabled, a flip-flop takes the input and sends it to the output on the next clock cycle. If it is not enabled, it ignores the input and keeps showing the same output value. A single bit can be useful, but to store multi-bit numbers we can chain flip-flops together into registers. And finally, the third kind of logic block mentioned is the multiplier, and it does exactly what you would expect: it multiplies numbers.

The question is: given a large number of these types of logic building blocks, how would we use them to implement a single function that did an addition and a multiplication? The picture on the right shows one way to do it with look-up tables and a multiplier. A and B are inputs; A is added to a constant offset, and the result is multiplied by B. The adder is made of look-up tables, and we use the multiplier block because multiplication is harder. We can express this in nMigen. The mul function begins by creating a 16-bit Signal, r.
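Here is a minimal sketch of what such a mul function might look like in nMigen, assuming illustrative names rather than the exact code on the slide (nMigen has since been renamed Amaranth, with a very similar API):

```python
from nmigen import Module, Signal, signed

OFFSET = 128  # a fixed constant offset for now; later this becomes a register

def mul(m: Module, a, b):
    """Combinational (a + OFFSET) * b, as in the schematic."""
    r = Signal(signed(16))                 # 16-bit result signal
    m.d.comb += r.eq((a + OFFSET) * b)     # computed continuously, no flip-flops
    return r
```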
The m.d.comb += notation puts the statement in the combinational domain, meaning it happens continuously, without flip-flops to synchronise it.

Let's write a bit more Python here. Our next step is to have four of these multiply-adds. FPGAs are great at having many copies of the same logic running at the same time, and our configuration language, Python, is great at making many copies of the same object. On the right, we have inputs capital A and B, which are 32 bits each. We slice each 32-bit input into four values and, for each of the four slices, we perform the same add-and-multiply, then add everything together to get a single result number. Looking at the Python code on the left, we have the mul function from the previous slide and have added a new function, mul4, because it does four of them. It uses a generator to slice the 32-bit inputs, feeding one slice to each mul, and then specifies that the results be added together.

At this point, we think it might be convenient to add the result of mul4 to an accumulator. We make a register and connect it up, on the right, so the result is added to the existing value of the accumulator. To do this in nMigen we declare a 32-bit value named acc and use m.d.sync: the value on the next clock cycle is whatever value there is now, plus the result of mul4. That's quite a bit. I hope you can see that nMigen is a relatively friendly way for Python programmers to specify FPGA configuration.

Back to the outline: we will move on from FPGAs to how a RISC-V processor can be extended with custom function units. Here is the architecture diagram again. We are interested in the interface between the CPU and the custom function unit. To illustrate how the interface works, we will begin with a standard RISC-V add instruction. This is the assembly instruction in the form a programmer would use. As the comment hopefully points out, it adds register 4 to register 7 and puts the result into register 24. Of course, the CPU doesn't read the instruction or the comment. It only understands the 32-bit instruction word, so let's look at those 32 bits.

The first thing the RISC-V CPU looks at is the right-most seven bits, forming the opcode. In this case, 0110011 means this instruction is going to do something with the registers and the arithmetic logic unit, or ALU. It tells the CPU how it should interpret the rest of the 32-bit instruction word.
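As a quick illustration, the 32-bit word for that add can be packed and unpacked like this. The field positions are the standard RISC-V R-type layout; the helper name is just for this sketch.

```python
def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six R-type fields into a 32-bit RISC-V instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

# add x24, x4, x7: opcode 0110011, funct3 000, funct7 0000000
word = encode_r_type(opcode=0b0110011, rd=24, funct3=0b000,
                     rs1=4, rs2=7, funct7=0b0000000)

print(f"{word:032b}")                  # the full 32-bit instruction word
print("opcode:", bin(word & 0x7f))     # right-most seven bits
print("rd:  ", (word >> 7) & 0x1f)     # destination register: 24
print("rs1: ", (word >> 15) & 0x1f)    # first source register: 4
print("rs2: ", (word >> 20) & 0x1f)    # second source register: 7
```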
The rs1 and rs2 fields specify which registers are the sources, the inputs that need to be routed to the ALU. In this case, registers 7 and 4 are read from the register file. The funct7 and funct3 fields tell the ALU what operation to perform; here, they tell the ALU to add. And rd specifies the destination register: the result is sent to register 24. So two values come from the register file and go to the ALU, which performs the calculation, and the result is placed back into the register file.

Now let's look at how the custom function unit is used. The RISC-V instruction set has two reserved opcodes, and we use one of them to tell the CPU to process data using the CFU. Once the CPU sees that opcode, it understands how to interpret the remainder of the instruction. rs1 and rs2 specify the source registers, and the CPU routes them to the CFU instead of the ALU. funct7 and funct3 tell the CFU what operation to perform, and rd specifies where the output should be sent. With the CFU instruction, a program can tell the RISC-V CPU to take two 32-bit values from registers, send them to the CFU for processing, and put a 32-bit result back into a register.

To allow the CFU to be used from a C program, the CFU-Playground provides a macro that translates into a single CFU machine instruction. The example at the bottom of the slide shows C code and the assembler it is translated into. As you can see, the translation is pretty much one-to-one.

So that was an overview of how a CFU works and how to use it from a C program. Now we are going to put it all together and accelerate TensorFlow Lite 2D convolutions. Here is our system on chip again. To accelerate 2D convolutions, we will need to build a custom function unit that can assist with the 2D convolution, and modify the TensorFlow implementation of 2D convolution to take advantage of the CFU.

Here is the schematic for the gateware we defined ten minutes ago. It needs a couple of tweaks. Having a fixed offset might work, but it is a little crude, so we will replace it with a register and some way to set its contents. We will add a way to reset the accumulator too. This is enough to be useful. Now, here is the nMigen code to implement all of this. There is a little more boilerplate that surrounds this code, but I omitted it to get it onto one slide. Here are the important parts. We begin by defining offset and acc as 9-bit and 32-bit signed numbers. Next are the mul and mul4 functions, working the same way as before.
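As a rough sketch, not the exact slide code, gateware along these lines might look like the following. The port names start, done, funct7, in0, in1 and output are assumptions here; the real CFU-Playground wraps this in its own CFU interface class with a ready/valid handshake, and the signedness of the 8-bit lanes is glossed over.

```python
from nmigen import Elaboratable, Module, Signal, signed

class Cfu(Elaboratable):
    """Illustrative CFU: funct7 0 sets the offset register, 1 resets the
    accumulator, 2 does four 8-bit multiply-accumulates."""

    def __init__(self):
        self.start  = Signal()     # assumed handshake: operation requested
        self.done   = Signal()     # assumed handshake: result is ready
        self.funct7 = Signal(7)    # operation selector from the instruction
        self.in0    = Signal(32)   # value from source register rs1
        self.in1    = Signal(32)   # value from source register rs2
        self.output = Signal(32)   # value written back to register rd

    def elaborate(self, platform):
        m = Module()
        offset = Signal(signed(9))
        acc    = Signal(signed(32))

        def mul(a, b):
            r = Signal(signed(16))
            m.d.comb += r.eq((offset + a) * b)
            return r

        def mul4(a, b):
            return sum(mul(a.word_select(i, 8), b.word_select(i, 8))
                       for i in range(4))

        m.d.comb += self.done.eq(self.start)   # every operation takes one cycle
        m.d.comb += self.output.eq(acc)        # always output the accumulator

        with m.If(self.start):
            with m.If(self.funct7 == 0):
                m.d.sync += offset.eq(self.in0)    # set the offset register
            with m.If(self.funct7 == 1):
                m.d.sync += acc.eq(0)              # reset the accumulator
            with m.If(self.funct7 == 2):
                m.d.sync += acc.eq(acc + mul4(self.in0, self.in1))
        return m
```

Dispatching on funct7 like this keeps the decode trivial: one funct7 value per operation, each completing in a single cycle.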
The next line sends the done signal as soon as start is received, since the functions we perform all take a single cycle. This line ensures the CFU always outputs the accumulator value. The m.If means "whenever": whenever the condition is true, the contained logic is enabled, otherwise it is disabled. Whenever start is received and funct7 is 0, the first input value, in0, is written into the offset register. If funct7 is 1 while start is true, the accumulator is set to 0. And if funct7 is 2 while start is true, we perform the multiply-add on the input values and add the result to the accumulator. That's all the gateware we need. The Python nMigen syntax makes it pretty compact.

So remember this code. We are going to use our new gateware to accelerate the inner loop. The first thing to note is the original code that retrieves the values from the input data and filter data arrays. At the top of the file, we #define a macro that wraps the CFU macro, and we use it here. Besides that, the main changes we need are to set the offset and reset the accumulator.

Here is how it works out for the example model. The overall model evaluation time went from 147 million cycles to 8 million. If we look at just the 2D convolution, we went from 95 million cycles to 33 million cycles. Not bad for 20 lines of Python and another 15 to 20 lines of C++.

In this talk, we looked at how tiny ML models run: they are doing millions of adds and multiplies. We looked at what FPGAs are and how to build gateware that does those adds and multiplies. We looked at how a CPU interacts with a custom function unit, and we put it all together to build an accelerator that makes running an ML model faster.

There are plenty of aspects of the CFU-Playground we didn't cover. We put a lot of effort into the tutorials and guides for beginners; people who do them tell us they are helpful and fun, so please try them out and raise bugs to give feedback. Unit testing gateware is more fun than debugging on hardware. I am a bit sad I didn't have time to talk about the iterative, baby-steps approach to building gateware, which is easier and more fun than trying to build and integrate giant components.

There are many people to thank for their part in making the CFU-Playground a reality, and I would like to call out a couple. First, Tim Callahan put together the structure of the playground so the rest of us could have fun writing gateware. I would also like to call out the work of Rachel and Joey, who started off knowing very little about FPGAs and ended up implementing whole accelerators.
Joey made a step-by-step guide for getting started, and I used it heavily in this talk.

If you would like to learn more about FPGAs, or about how ML models are evaluated, or about building ML accelerators with open source tools, here is how to get started. The tutorial on Read the Docs is the best place to start. The code is on GitHub. Feel free to reach out.

>> We have great questions, so we can get straight into them. Fantastic. What are you building?

>> We are building a thing which may or may not work. If it works, it may appear in a Chromebook.

>> Great. Next up: is there support for the interesting Chinese FPGAs, and the Lattice parts targeted by the open source toolchain?

>> We really want our product to be open source, and we are going out of our way to support open source toolchains. The Lattice toolchain is great. These are all good.

>> Thanks a lot. The next question was: as a person with curiosity but no hardware knowledge, what's a good entry point into edge or tiny ML? Projects or references? Thanks for the beautiful explanations.

>> The CFU-Playground is a great way to get into it. [Laughter] The TensorFlow for Microcontrollers site has pretty good instructions if you want to get started. It recommends a number of microcontroller products, and that's not a bad way to get into it. And the CFU-Playground, if you want to get hold of an FPGA development board, makes it super easy to get into it and start messing around with the C code.

>> And I think some conferences give FPGA boards as the badge for that particular conference. That can often be quite good.

>> Yeah, if you have a development board, or a conference badge you would like to see the CFU-Playground on, please get in touch. That would be fun.

>> You mentioned that you are using a reserved RISC-V opcode. Is this space reserved for custom opcodes, or for future use by standardised opcodes?

>> Good question. No, the RISC-V specification reserves these two opcodes for implementers. They are never going to be used for standard RISC-V instructions; there are other reserved spaces for those. These two are specifically so that implementers, like the CFU-Playground, can try out new things. It is one of the really great things about RISC-V: you have this giant RISC-V GCC toolchain that already works, and you can add just the little bit you need to make it do what you want.

>> One more question, and if there are more you might have time to type them in. Is it easy to write
functional tests for components in nMigen?

>> Yes, it is, with nMigen.

>> I had to ask.

>> Yeah, you would have to. Do you run it through Python or hardware simulators?

>> We do both, but we definitely find the most bugs with the Python unit tests. The Python unit tests import unittest and subclass the usual test case, and then nMigen has a simulator: you can poke at the simulator and check it is giving you back the results that you want. The test cases are really fun. I don't know of any other environment that lets you write some gateware, test it, fix it and run it again in less than 30 seconds.

>> That's pretty good. We have a flip side: as a person with lots of hardware experience but no ML experience, where is a good starting point for that?

>> I would have to think about that. Please come and talk to me afterwards.

>> We have the hallway track for that. One last question: why do you need four muls of 8 bits?

>> Right. This all comes down to the technology your ML model is using. There is a lot of research being put into 8-bit models at the moment, and the good thing about 8-bit models is you can move the data around four times as fast, because there is four times less of it. That's why we are using 8-bit models.

>> Thank you so much for your talk, Alan. This was really good.

>> Thank you.

>> Thank you to everybody.

>> If you want to continue the conversation, we have a long break coming up. You can continue to use this hall, or the hall text chat that's related to it. During the lunch break, if you have got some spare time and you are interested, the lightning talk submissions close at 1:00 p.m. Australian time, which is in half an hour. If you think you can come up with a 5-minute talk on anything interesting, it doesn't have to be Python related. Thank you, Mickey; my cat meowed. You can find the links, which have all the social events happening in the next day. Thank you.
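For anyone curious what that kind of test looks like, here is a minimal sketch of a unittest driving the nMigen simulator against the illustrative Cfu class sketched earlier. The import path differs slightly between nMigen versions, and the real CFU-Playground tests use their own helpers.

```python
import unittest
from nmigen.sim import Simulator   # nmigen.back.pysim in older releases

# from cfu_sketch import Cfu       # wherever the earlier illustrative Cfu lives

class TestCfu(unittest.TestCase):
    def test_reset_then_multiply_accumulate(self):
        dut = Cfu()                # the illustrative CFU from earlier
        sim = Simulator(dut)
        sim.add_clock(1e-6)

        def process():
            yield dut.start.eq(1)
            yield dut.funct7.eq(1)           # reset the accumulator
            yield
            yield dut.funct7.eq(2)           # multiply-accumulate four lanes
            yield dut.in0.eq(0x01010101)     # four input bytes, each 1
            yield dut.in1.eq(0x02020202)     # four filter bytes, each 2
            yield
            yield dut.start.eq(0)
            yield
            # offset is still 0, so we expect 4 * (1 * 2) = 8
            self.assertEqual((yield dut.output), 8)

        sim.add_sync_process(process)
        sim.run()

if __name__ == "__main__":
    unittest.main()
```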