[Music]

Good afternoon, and welcome back to our next session for PyCon 2025. I hope you have all enjoyed your lunch and are ready and eager to hear from our next speaker, Mr. Anthony Shaw, presenting "My AI Is Slow, Make It Faster." Thank you, Anthony.

All right.

[Applause]

Cool. Hey, everybody. I hope you enjoyed lunch and have been enjoying PyCon AU so far. My name is Anthony, and I'm going to be talking about AI performance. And when I say performance, I mean speed, not how good or bad it is at doing things. Before we get started, a little bit about me. I work at Microsoft, I'm a Python Software Foundation fellow, and I'm also a fellow at Macquarie Uni. People talk about a scale of opinion on AI, from "I've drunk all the Kool-Aid and I think this is going to be the greatest thing that's ever happened" to "I think it's the worst thing that's ever been created."
I'm sort of floating somewhere in the middle, and to be honest, I waver day by day between "this is a terrible idea" and "this is amazing and it's changing everything." AI is quite unpredictable. It's very, very difficult to work with for someone who's come from an engineering background, where you expect things to work the same way each time, and the way it surprises you day by day is unbelievably frustrating. And then sometimes you get these moments of joy in there as well, where it's just saved you a lot of time. You can follow me on socials; I'm on lots of different platforms. I've also written a book called CPython Internals, which is all about the internals of CPython and how it works, if you're interested in that. I also made this thing called VS Code Pets, which is an extension for VS Code where pets walk around. It turns out to be really popular and, frustratingly, is the most successful piece of software I've ever written.
It's actually just coming up to 2 million users now. Just yesterday, someone on the Windows team messaged me (I work at Microsoft) and asked, "How can we get this into Windows?" And I thought: the productivity of millions of people around the world is going to decrease, but it's going to look really good on my performance review, so it's a sacrifice I'm willing to make. But, yeah, anyway. Our agenda today: AI is slow; if you've used it, you will have realized that. We're going to look at some requirements in terms of performance and what your expectations are. I'm going to spend a big chunk of this talk on benchmarking, performance tools, and things like that, and how we can use them with AI. I've got a few tips to improve the speed of how you're calling AIs, and then some takeaways, so there'll be a nice slide at the end that you can just take a picture of, with all the bits you need to know. And my promises to you today:
Nothing in this talk is proprietary. I do work for Microsoft, but everything here is open source. Everything is free and available to download and use yourself; nothing has any weird licenses. And this is all based on experience as well. These are things that I've been doing as part of my day-to-day work, and I just wanted to share some of those lessons with you today. So this is an example I wanted to use. This is me on a train from Bordeaux to Paris. This train goes 300 km an hour, and I had two and a half hours, so a nice casual time to look out the window, and really not very much to do. We had Wi-Fi on the train as well, which is always dangerous, because instead of relaxing, you're like, "Oh, I can do things on the internet. That's more fun." So I wanted to just make a game and then play the game on the train.
And the game that I worked on was this, which is a Sudoku game, but instead of using one to nine, it's hexadecimal. So on the train from Bordeaux to Paris, I basically worked on this with the AI doing most of the heavy work, because, well, firstly, I was on holiday, so I shouldn't have been doing any work anyway. I just wanted to do something fun; there was no obligation. And it took ages. I mean, two and a half hours to make a game is impressive, but every time I'd ask the AI a question, it would go away for a couple of minutes, then come back and give me an answer. But I could look out the window and just, you know, watch France go past, which is lovely. But that's not always the case. In terms of user expectations: I've been working on web benchmarking for many, many years, and also on the performance of applications and code running on computers. And if I asked you, is your phone fast or slow?
It's a subjective question, but you know that when you use a system, like the menu on one of those smart TVs, it can feel laggy, a bit slow. This is because, over time using technology, you build up expectations. If you're using something on a local system, a local computer, your expectation is that within about 10 to 100 milliseconds of clicking or touching something, it responds. And if it starts to take longer than 100 milliseconds, it feels slow and it feels laggy. So if you've ever upgraded from one phone to the next model and it feels snappier, the difference can be as little as 10 milliseconds. If you're working on the internet, though, loading a website or doing a search online, your expectations shift slightly further to the right. A typical web search takes about 50 to 100 milliseconds.
That's actually got slower over time, not faster, because they've put more and more crap into the page results. But your expectations when you're clicking on and loading websites are roughly between 50 milliseconds and a second, and lots of data shows us that if a page takes a second to load, users do get bored and then find other things to do. So your expectations shift. When it comes to AI, it's even further to the right. You've learned, as you've used ChatGPT or something else, that if you type in a question, it will take a few seconds to respond. So there's this real shift in what you deem snappy and what you deem fast. And the point I want to make early on is that if you try to stick AI into things that are further down the left of this scale, local time and internet time, you've basically introduced lag, and all of a sudden things feel slow to users.
There's also another trajectory to this, which is Australian internet time, which also seems to be getting slower over time, somehow. So, when is slow bad? In this example, you've got a search box. I have seen this in products where they've got basic search functionality that used to work on a database: it would take out keywords and do a keyword search against the database. So users know that when they type into that box, it's going to look for keywords. And over the last 20 years of having web search, people have intuitively learned that when there's a search box, you don't write fully formed sentences; you just write keywords, and then you click search and it gives you back answers. Now, what has been very tempting for people to do over the last few years, where you've got an expectation that the box will respond pretty quickly, is to add AI to the search.
So where users would previously type in a question or a series of keywords, they introduce AI so that the AI can give you more customized or smarter answers, or whatever it is. The problem is that you've shifted from the user's expectation that it will respond in 100 to 500 milliseconds to it taking several seconds, and also from a keyword search to a sort of vague AI kind of search, whatever that is. This is where I've found that people have been really, really frustrated, because it feels slow to interact with these systems. As an example, there are two apps on my phone that have done this recently, made by some very large tech companies, where previously I would search for a post on socials or something like that, and now the AI tries to do extra things that I wasn't looking for, because really all I wanted to do was a keyword search. I didn't want a smart AI search. So, sometimes slow is actually okay. I use this other example where I have a question. It's not a search.
I have a research question, and I can click go, and it can come back and say, "Okay, I'm going to research your question for you using AI. I'm going to look online, I'm going to read some papers, and it's going to take a couple of hours." And so my expectations are set. It's pretty clear what's going to happen, and what I'm looking for involves a lot more information and a lot more searching, so I'm quite happy for the AI to take its time. But they're two different expectations. And so the first point I want to make on AI is that users have an expectation of how things respond when they run locally and when they run on the web, and, increasingly, they have an expectation of what happens when they interact with AIs. So, to summarize my usability points: don't put AI in the way of something which is otherwise fast and efficient, especially if it doesn't really add much value. Search is my main example, the one I keep bringing up, because AI almost never improves search functionality.
Search is a solved problem. It's very efficient, it scales really well, and we have plenty of algorithms to do search. AI can do customization of the answer that comes from the search, but don't put AI in the search itself. Next, don't wait for the entire result. I'll come back to this in a second, but basically, we stream results back from the AI so that the user sees it the way it looks in ChatGPT and other chat bots. It feels like the AI is typing back to you, but what's actually happening is that as it computes the answer, it streams it back to you; I'll come to some details on that in a minute. And don't use reasoning models (I'll cover what they are a bit later) unless you actually need reasoning, because reasoning models are all very, very slow today, and a lot of the time you don't actually need them. So, using AI to enrich features is great.
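As a minimal sketch of the "stream, don't wait" advice: chat UIs commonly deliver tokens to the browser as server-sent events, one "data:" line per chunk, so text appears as it is generated. Everything below, including the fake model output, is an illustrative stand-in rather than anything shown in the talk:

```python
def sse_events(chunks):
    """Format an iterable of text chunks as server-sent events (SSE).

    Each chunk becomes one 'data: ...' event, which a browser's
    EventSource (or a fetch reader) can render as it arrives, so the
    user starts reading instead of waiting for the full answer.
    """
    for chunk in chunks:
        yield f"data: {chunk}\n\n"   # a blank line terminates each event
    yield "data: [DONE]\n\n"         # conventional end-of-stream marker

# Stand-in for a model producing tokens incrementally.
fake_model_output = ["The capital", " of France", " is Paris."]

for event in sse_events(fake_model_output):
    print(event, end="")
```

A real application would wire this generator to a streaming HTTP response, but the framing is the same.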
Having the option to disable and dismiss AI features is also really important, because there are a lot of concerns about AI safety and privacy at the moment, and there are also cases where the usability actually gets worse because you've introduced AI. Giving users the ability to switch that off, I think, is critical. And stream back the results, as I mentioned. So, this QR code at the top: I've included a few QR codes in this talk, and only one of them is a Rickroll, but this one goes to some guidelines that we've published. Basically, if you're designing a system using AI, there is a set of guidelines that we have called the HAX guidelines: essentially UX guidelines for AI-enriched applications. I can't remember what it stands for, but some very smart people at Microsoft have sat down and said, in an ideal world, here are the 16 things that you should do if you've got AI in your application. And yes, not all of those things are done in Microsoft products either. They are guidelines. They're ambitious.
And I really wish that more product managers would try to keep to some of them especially. There are some really smart ideas in there, including, as in the keynote this morning, around the AI ethics challenges with biases and things like that. There are a lot of guidelines in there, and there are also a lot of lessons that we've learned over the years from putting bots online, which have caused all sorts of challenges. So there's a tool called PyRIT. It is basically a red-teaming tool for AIs: it tries to get them to produce all sorts of horrible things, and then it gives you a checklist to see whether you've actually put safety guards in place. So yes, we have learned some lessons. Now, when it comes to actually measuring performance: has anyone here done any web benchmarking, like web app benchmarking and things like that? Okay, a little bit.
Normally, what you do with web benchmarking is send thousands of requests to your application and measure how long it takes to get back the answer, and it's normally just the full response; then you'd also measure how long it takes for the browser to render the page and load up all the assets and so on. So that's how we measure web applications. AI is a bit different, because the way these models work is that they produce streams of tokens. So, take the question "What is the capital of France?", which has become the go-to test question for AIs. I don't know why; it just is. If you ask an AI what the capital of France is, it will send you back these tokens, and each token basically represents a whole word or a part of a word. Over the network, it actually sends those tokens back in separate packets, in separate HTTP response parts, called a stream. So you send it one request, and you don't get back one response; you actually get back several responses.
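This request-and-stream shape can be simulated end to end. A real client would iterate over an SDK's streaming response (for example, passing stream=True to the OpenAI Python client); the generator below just fakes a model that pauses to think and then emits chunks, so the timings are illustrative only:

```python
import time

def fake_stream():
    """Stand-in for a streamed AI response: a 'thinking' delay,
    then a few chunks arriving over the wire."""
    time.sleep(0.05)                # model thinking before the first token
    for chunk in ["The capital", " of France", " is Paris."]:
        time.sleep(0.01)            # per-chunk generation/network gap
        yield chunk

start = time.perf_counter()
time_to_first_chunk = None
chunks = []
for chunk in fake_stream():
    if time_to_first_chunk is None:
        # How long the model 'thought' before sending anything back.
        time_to_first_chunk = time.perf_counter() - start
    chunks.append(chunk)
total_time = time.perf_counter() - start

answer = "".join(chunks)
print(f"answer: {answer!r} ({len(answer)} characters, {len(chunks)} chunks)")
print(f"time to first chunk: {time_to_first_chunk:.3f}s")
print(f"total time: {total_time:.3f}s")
print(f"throughput: {len(chunks) / total_time:.1f} chunks/sec")
```

The same loop structure works for any streaming client: note the clock before sending, record it on the first chunk, and again when the stream ends.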
So, with pretty much every AI these days, you send a request and you get back a stream of data, and as you receive the stream, you draw it on the screen, you save it to disk, whatever it is. This basically means that you can give the user feedback really early on that information is coming back, and they can start to read the answer. So, in terms of how we measure this: the time between the moment you click go and send the request and the moment you get back the first token is called the time to first token. Tokens come in chunks, so I also call it the time to first chunk. That's a really important metric, because it basically says how long the model took to think about the answer before it started sending you back any information. Then you've got the total time, which is how long it takes for it to finish its answer. It's also important to note that sometimes, and I'm pretty sure we've all done this:
We've canceled an AI, stopped it, because it's just been waffling on for ages, and, like, I don't care, you got to the point ages ago. At least the good thing is you can stop the stream early on. There are some other metrics as well which are important, like the number of characters. So "The capital of France is Paris." is 31 characters; there were seven tokens and three chunks. And so I can work out the total time and how many chunks there were, and therefore the throughput, by dividing the two: what is the chunks per second, or the tokens per second, for that AI model? These things are important because they basically show you how fast an AI can write back to you, and there are huge differences between different AI models, which is what I want to spend some of this time looking into. So, on the right-hand side is a graph which was produced by a tool that I made, and I'll give you a demo of that and how it works. Basically, it will produce these four measures. One is total time.
These are all box-and-whisker charts. If you're not familiar with box and whiskers: the colored box is the interquartile range, which is basically the middle range of values, so how long did it take to answer, and what is the most common part of that. The black line in the middle of it is the median. Any little circles and things like that are outliers, and the whiskers show the spread outside the quartiles. Box-and-whisker charts are really helpful because you're looking at how thin the box is, so how consistently it gives you the same response in the same time; the bigger the box, the more variable it is. How big are the whiskers? So, how spread out are the answers? And then these little circles are where you've got anomalies in your data. Box-and-whisker graphs are great when you're running performance benchmarks, because you're looking for a nice small box with small whiskers, and you're looking for it to be consistent.
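Those box-plot terms can be computed directly with the standard library. As a sketch, with made-up response-time samples in seconds (the usual convention draws points beyond 1.5 times the interquartile range past the quartiles as outlier circles):

```python
import statistics

# Hypothetical total-time samples (seconds) from repeated benchmark runs.
times = [1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.3, 2.4, 6.0]

q1, median, q3 = statistics.quantiles(times, n=4)  # quartile cut points
iqr = q3 - q1                  # the colored box: interquartile range
lo = q1 - 1.5 * iqr            # lower whisker fence
hi = q3 + 1.5 * iqr            # upper whisker fence
outliers = [t for t in times if t < lo or t > hi]  # drawn as circles

print(f"median={median}, IQR={iqr:.2f}, outliers={outliers}")
```

Here the single 6.0-second run falls outside the fences, which is exactly the kind of anomaly the little circles on the charts flag.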
So basically, on this one, I ran a benchmark against four different models and asked them all exactly the same question. In terms of total time, model C, which is Llama 3.2, responded within a couple of seconds, whereas Phi-4 went all the way up to a minute. So that's a pretty big difference to answer exactly the same question. What's important here is that we've got the total time, which I covered on the last slide; the time to first chunk, which is how long it took to think about the answer; the length of response, where you'll notice there is an enormous difference between the models (and I'll talk about that in a bit); and the chunks per second. So what you can do with this tool is give it any range of models to test against, and you get this graph. If you've got a problem or a question and you want to figure out which is the best model to pick, it gives you a nice comparison. So let me show you that working.
Um, so the way this tool 477 00:19:12,559 --> 00:19:18,000 works is it's built into a command line 478 00:19:15,600 --> 00:19:20,640 interface called LLM. Has anybody used 479 00:19:18,000 --> 00:19:23,440 the LLM CLI? 480 00:19:20,640 --> 00:19:26,400 Okay, a couple of you. Um, so I really 481 00:19:23,440 --> 00:19:29,840 like the LLM CLI. It's a tool by Simon 482 00:19:26,400 --> 00:19:32,559 Willison. Um, basically you can do all 483 00:19:29,840 --> 00:19:34,320 sorts of cool stuff. Um, it's a command 484 00:19:32,559 --> 00:19:36,799 line interface for you to run locally. 485 00:19:34,320 --> 00:19:39,039 So you can do things like pipe stuff 486 00:19:36,799 --> 00:19:40,880 from your local disk directly into an 487 00:19:39,039 --> 00:19:43,760 LLM, so if you've got files and 488 00:19:40,880 --> 00:19:46,720 things and you want to extract data. Um 489 00:19:43,760 --> 00:19:48,960 it's also very extensible. So uh in 490 00:19:46,720 --> 00:19:51,679 terms of the models that you're using, 491 00:19:48,960 --> 00:19:53,840 um you can install plugins to talk to 492 00:19:51,679 --> 00:19:55,760 basically anything. So you can run 493 00:19:53,840 --> 00:19:58,400 models locally. Like I'm running lots of 494 00:19:55,760 --> 00:20:02,160 models on my laptop. Um you can connect 495 00:19:58,400 --> 00:20:04,960 it to openai.com, AWS Bedrock, Azure, 496 00:20:02,160 --> 00:20:07,600 GitHub Models. Basically like any model 497 00:20:04,960 --> 00:20:09,520 hosting platform, or one that you're 498 00:20:07,600 --> 00:20:11,679 running locally on your machine, or if 499 00:20:09,520 --> 00:20:14,559 you've got a server and a GPU cluster 500 00:20:11,679 --> 00:20:16,080 you can connect it to that. So uh with 501 00:20:14,559 --> 00:20:17,600 LLM, 502 00:20:16,080 --> 00:20:19,760 once you've got these plugins, you can 503 00:20:17,600 --> 00:20:21,840 run "llm models" and it will list what 504 00:20:19,760 --> 00:20:25,200 models you have available to you.
I have 505 00:20:21,840 --> 00:20:27,440 a lot because I do this as a job. Um, so 506 00:20:25,200 --> 00:20:29,600 I've got a lot available to me and I can 507 00:20:27,440 --> 00:20:32,559 basically kind of pick and choose which 508 00:20:29,600 --> 00:20:34,400 of these models I want to compare uh and 509 00:20:32,559 --> 00:20:36,000 see where they're hosted. So for 510 00:20:34,400 --> 00:20:39,280 example, I was doing a comparison 511 00:20:36,000 --> 00:20:41,919 between GPT-5 in Australia and one hosted 512 00:20:39,280 --> 00:20:43,760 in the US to see what the difference in 513 00:20:41,919 --> 00:20:47,200 performance was and what difference it 514 00:20:43,760 --> 00:20:49,840 made. Um, if you're running local models 515 00:20:47,200 --> 00:20:52,559 as well, um, whether that's using 516 00:20:49,840 --> 00:20:56,480 something like Ollama for example. Um so 517 00:20:52,559 --> 00:20:58,960 in Ollama you can download and run uh a 518 00:20:56,480 --> 00:21:01,280 number of smaller language models and 519 00:20:58,960 --> 00:21:03,679 also embedding models uh and you can 520 00:21:01,280 --> 00:21:06,320 benchmark those and test them as well. 521 00:21:03,679 --> 00:21:09,039 Uh you need a pretty decent laptop to do 522 00:21:06,320 --> 00:21:10,960 that. So I will caution you. Um also 523 00:21:09,039 --> 00:21:15,679 don't run it with the 524 00:21:10,960 --> 00:21:17,919 laptop on your lap. Um it gets 525 00:21:15,679 --> 00:21:19,760 uncomfortably warm. 526 00:21:17,919 --> 00:21:22,240 Um, 527 00:21:19,760 --> 00:21:25,280 so the benchmarking tool is basically a 528 00:21:22,240 --> 00:21:26,880 plugin for LLM called LLM profile and it 529 00:21:25,280 --> 00:21:29,440 gives you this extra command where you 530 00:21:26,880 --> 00:21:32,320 say LLM benchmark and then your prompt 531 00:21:29,440 --> 00:21:34,880 input, and then you list as many models 532 00:21:32,320 --> 00:21:37,039 as you feel like.
So they can be models 533 00:21:34,880 --> 00:21:38,480 online, they can be models locally, they 534 00:21:37,039 --> 00:21:41,039 can be models on different clouds, 535 00:21:38,480 --> 00:21:42,480 different locations, whatever. Um, and 536 00:21:41,039 --> 00:21:44,720 then you say how many times you want it 537 00:21:42,480 --> 00:21:46,720 to repeat the test, uh and then if you 538 00:21:44,720 --> 00:21:49,360 want it to produce one of those graphs, 539 00:21:46,720 --> 00:21:51,200 you just ask it for a graph, uh 540 00:21:49,360 --> 00:21:54,880 and it will give you back this sort of 541 00:21:51,200 --> 00:21:58,400 summary table of information. 542 00:21:54,880 --> 00:22:00,480 So it will run that test locally. Um if 543 00:21:58,400 --> 00:22:02,960 that is too simple for you, if you want 544 00:22:00,480 --> 00:22:05,679 to do some more complicated test 545 00:22:02,960 --> 00:22:07,440 scenarios, for example, you want to 546 00:22:05,679 --> 00:22:10,080 see what the difference is if the 547 00:22:07,440 --> 00:22:12,320 temperature is lower in one test but 548 00:22:10,080 --> 00:22:14,320 higher in another, or you want to test 549 00:22:12,320 --> 00:22:15,919 one model in one country and one in 550 00:22:14,320 --> 00:22:18,320 another, or even test two different 551 00:22:15,919 --> 00:22:22,400 prompts, um then you can do that by 552 00:22:18,320 --> 00:22:23,600 providing this YAML file um with 553 00:22:22,400 --> 00:22:25,919 different inputs and different 554 00:22:23,600 --> 00:22:28,240 questions. So like the go-to question of 555 00:22:25,919 --> 00:22:29,760 what is the capital of France? Um we can 556 00:22:28,240 --> 00:22:33,679 compare that with what is the capital of 557 00:22:29,760 --> 00:22:35,360 Azerbaijan?
Um and the differences in 558 00:22:33,679 --> 00:22:37,760 length, for example, and the differences 559 00:22:35,360 --> 00:22:39,440 in thinking time are actually quite 560 00:22:37,760 --> 00:22:41,039 different between the models for 561 00:22:39,440 --> 00:22:43,200 those questions. Even though the 562 00:22:41,039 --> 00:22:45,760 questions semantically are the same, like 563 00:22:43,200 --> 00:22:47,440 what is the capital city of this country 564 00:22:45,760 --> 00:22:50,000 uh should be a simple, straightforward 565 00:22:47,440 --> 00:22:52,080 one-sentence answer. Um however in some 566 00:22:50,000 --> 00:22:55,039 cases it will write you multiple 567 00:22:52,080 --> 00:22:58,159 paragraphs of travel tips uh about 568 00:22:55,039 --> 00:23:00,799 Azerbaijan, including the answer. Does anyone 569 00:22:58,159 --> 00:23:04,240 know the answer? 570 00:23:00,799 --> 00:23:08,159 No Formula 1 fans in the room. Baku, 571 00:23:04,240 --> 00:23:11,159 thank you. Um that is next weekend. 572 00:23:08,159 --> 00:23:11,159 Okay. 573 00:23:11,600 --> 00:23:15,760 All right. So um once you've done this 574 00:23:13,919 --> 00:23:17,200 and you've got your graph, uh you can 575 00:23:15,760 --> 00:23:19,440 see the differences in performance 576 00:23:17,200 --> 00:23:21,280 between the models. So let's get on to 577 00:23:19,440 --> 00:23:25,120 uh how do we actually make 578 00:23:21,280 --> 00:23:28,480 it faster? Uh my number one tip uh is to 579 00:23:25,120 --> 00:23:30,559 look for shorter answers. Um there was a 580 00:23:28,480 --> 00:23:32,400 talk yesterday actually, Jack's 581 00:23:30,559 --> 00:23:36,080 talk, where he talked about how Charles Dickens 582 00:23:32,400 --> 00:23:38,559 was paid by the word. Um which is why 583 00:23:36,080 --> 00:23:41,840 the opening part of A Tale of Two Cities is 584 00:23:38,559 --> 00:23:44,400 so ridiculously long.
Um, these LLM 585 00:23:41,840 --> 00:23:46,960 models are rewarded in their training 586 00:23:44,400 --> 00:23:49,760 for their ability to solve problems, 587 00:23:46,960 --> 00:23:52,240 answer questions, uh, and solve coding 588 00:23:49,760 --> 00:23:53,760 challenges. There is no penalty for the 589 00:23:52,240 --> 00:23:56,720 length of the answer for them to achieve 590 00:23:53,760 --> 00:23:59,200 that. Therefore, they are also paid by 591 00:23:56,720 --> 00:24:02,640 the word. Um, the way that you pay them 592 00:23:59,200 --> 00:24:05,039 is also by the token. So, LLMs will give 593 00:24:02,640 --> 00:24:06,799 you really, really long answers when you 594 00:24:05,039 --> 00:24:09,120 don't ask for them. So, what is the 595 00:24:06,799 --> 00:24:12,720 capital of Azerbaijan? The capital is 596 00:24:09,120 --> 00:24:15,039 Baku. Everyone knows that. Um, 597 00:24:12,720 --> 00:24:17,279 uh, this one here, instead of 598 00:24:15,039 --> 00:24:19,600 taking one second, took 100 seconds 599 00:24:17,279 --> 00:24:23,279 and it wrote four paragraphs about Baku 600 00:24:19,600 --> 00:24:24,960 and its cultural history. Um, and it 601 00:24:23,279 --> 00:24:26,559 also did not mention the Formula 1, 602 00:24:24,960 --> 00:24:30,159 which is ridiculous, like why would you 603 00:24:26,559 --> 00:24:32,559 not? Um, the total time and also the 604 00:24:30,159 --> 00:24:35,039 time to first chunk was significantly 605 00:24:32,559 --> 00:24:36,480 higher for that as well. So my first 606 00:24:35,039 --> 00:24:39,919 kind of suggestion is that you should 607 00:24:36,480 --> 00:24:42,400 get shorter responses. You can 608 00:24:39,919 --> 00:24:46,159 do that either by using a smaller model, 609 00:24:42,400 --> 00:24:49,120 or you can say in your prompt I want a 610 00:24:46,159 --> 00:24:52,159 single sentence, or I want two to three 611 00:24:49,120 --> 00:24:54,000 sentences, or I want three answers.
You 612 00:24:52,159 --> 00:24:56,080 basically set the expectation to the 613 00:24:54,000 --> 00:24:57,679 model as to the length of the response. If 614 00:24:56,080 --> 00:25:00,240 you don't do that, if it's completely 615 00:24:57,679 --> 00:25:02,080 unbounded, they will tend to, especially 616 00:25:00,240 --> 00:25:03,679 the bigger models, they will tend to 617 00:25:02,080 --> 00:25:05,120 produce way more output than you 618 00:25:03,679 --> 00:25:07,760 actually need and it will take a lot 619 00:25:05,120 --> 00:25:09,600 longer. It'll also cost you more money. 620 00:25:07,760 --> 00:25:12,400 Oops. Uh the second thing is 621 00:25:09,600 --> 00:25:15,919 distillation. So this is basically, uh, we 622 00:25:12,400 --> 00:25:17,279 don't have time to cover it, but um 623 00:25:15,919 --> 00:25:21,039 what they've done is they've condensed 624 00:25:17,279 --> 00:25:22,480 models down, um so that you've taken all 625 00:25:21,039 --> 00:25:24,159 the weights, so the number of 626 00:25:22,480 --> 00:25:27,120 parameters of the model, and you've 627 00:25:24,159 --> 00:25:29,360 reduced them. Basically uh it's called 628 00:25:27,120 --> 00:25:32,000 matryoshka, like those dolls you get where 629 00:25:29,360 --> 00:25:33,919 one sits inside another. Um and the 630 00:25:32,000 --> 00:25:35,760 reason they've used that analogy is 631 00:25:33,919 --> 00:25:37,679 because you've basically taken the 632 00:25:35,760 --> 00:25:40,240 answer from a big model and you've 633 00:25:37,679 --> 00:25:42,159 stuffed it into a smaller one. So 634 00:25:40,240 --> 00:25:44,159 instead of it having to know about every 635 00:25:42,159 --> 00:25:45,520 single capital city and how to think 636 00:25:44,159 --> 00:25:47,039 about it and stuff like that, it's kind 637 00:25:45,520 --> 00:25:49,679 of cheated. It's got like a lookup table 638 00:25:47,039 --> 00:25:53,279 and it's saved the information.
So 639 00:25:49,679 --> 00:25:57,039 like that's a very brief summary. This 640 00:25:53,279 --> 00:25:59,919 means that you can run a model with the 641 00:25:57,039 --> 00:26:02,159 same kind of corpus as a much bigger 642 00:25:59,919 --> 00:26:05,200 model on smaller hardware and it's also 643 00:26:02,159 --> 00:26:07,679 a lot faster. So when you see in Ollama 644 00:26:05,200 --> 00:26:09,919 you've got these small models, uh like 7 645 00:26:07,679 --> 00:26:12,400 billion, 8 billion parameter models, it's basically 646 00:26:09,919 --> 00:26:13,520 the full model, but they've asked it lots 647 00:26:12,400 --> 00:26:15,279 and lots of questions and they've 648 00:26:13,520 --> 00:26:16,960 condensed it down into a smaller 649 00:26:15,279 --> 00:26:19,039 one. 650 00:26:16,960 --> 00:26:22,799 Um you can see this in this benchmark. 651 00:26:19,039 --> 00:26:25,760 This is the Qwen 3 series model, 652 00:26:22,799 --> 00:26:27,919 and basically I've taken a larger one, 653 00:26:25,760 --> 00:26:30,400 which is 8 billion parameters, and I've 654 00:26:27,919 --> 00:26:32,480 taken the smallest one, which is 0.6 billion. 655 00:26:30,400 --> 00:26:34,640 And you'll see that the time it takes to 656 00:26:32,480 --> 00:26:36,080 give me an answer 657 00:26:34,640 --> 00:26:37,919 to the same question, 658 00:26:36,080 --> 00:26:42,000 which is can you come up with some names 659 00:26:37,919 --> 00:26:44,240 for my pet octopus, um, takes a lot longer 660 00:26:42,000 --> 00:26:46,799 for an 8 billion parameter model than it 661 00:26:44,240 --> 00:26:50,080 does for a 600 million one. Um also the 662 00:26:46,799 --> 00:26:52,880 length of response uh goes up and then 663 00:26:50,080 --> 00:26:55,279 weirdly enough it comes back down again 664 00:26:52,880 --> 00:26:57,760 between the 4 billion and the 8 billion versions 665 00:26:55,279 --> 00:26:59,760 of the model.
I don't know why, but it gives 666 00:26:57,760 --> 00:27:02,000 you shorter answers for eight than it 667 00:26:59,760 --> 00:27:04,000 does for four, which is why you should 668 00:27:02,000 --> 00:27:04,880 test things. Um, because you have 669 00:27:04,000 --> 00:27:06,960 assumptions and you have 670 00:27:04,880 --> 00:27:08,240 generalizations, but when you use tools 671 00:27:06,960 --> 00:27:10,320 like this, you can actually come to 672 00:27:08,240 --> 00:27:12,720 different conclusions. Uh, the next 673 00:27:10,320 --> 00:27:15,919 point is quantization, which is a really 674 00:27:12,720 --> 00:27:17,440 cool word to say. Um, and it's something 675 00:27:15,919 --> 00:27:19,440 you can just drop into conversations all 676 00:27:17,440 --> 00:27:22,880 the time. 677 00:27:19,440 --> 00:27:25,200 Um, and basically the theory 678 00:27:22,880 --> 00:27:29,200 is fairly simple. If you have a range of 679 00:27:25,200 --> 00:27:31,279 values on a scale, um you can basically 680 00:27:29,200 --> 00:27:34,480 take the smallest one and the biggest one 681 00:27:31,279 --> 00:27:36,320 and say okay, the smallest one is now the 682 00:27:34,480 --> 00:27:40,320 smallest point on a different 683 00:27:36,320 --> 00:27:44,480 scale. So if we use a scale that's int8, 684 00:27:40,320 --> 00:27:46,720 uh which is 256 different values, um so 685 00:27:44,480 --> 00:27:49,120 basically the smallest one is now minus 686 00:27:46,720 --> 00:27:50,480 128 and the biggest one is now plus 127. And 687 00:27:49,120 --> 00:27:53,919 basically what we're doing is taking 688 00:27:50,480 --> 00:27:56,559 data from a 32-bit or 64-bit value into 689 00:27:53,919 --> 00:27:59,200 a much smaller format. The reason you do 690 00:27:56,559 --> 00:28:01,919 that is because CPUs and GPUs can do 691 00:27:59,200 --> 00:28:04,159 that calculation much, much faster. So 692 00:28:01,919 --> 00:28:05,360 like four to eight times faster.
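That min-to-max remapping is just an affine transform. A toy sketch of int8 quantization (real runtimes do this per-tensor or per-channel with far more care; the example weights and the guard for a flat range are my additions):

```python
def quantize_int8(values):
    """Affine-map floats onto the 256 int8 codes [-128, 127], then invert."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # 256 codes -> 255 steps; guard a flat range
    codes = [round((v - lo) / scale) - 128 for v in values]     # int8 codes
    approx = [(c + 128) * scale + lo for c in codes]            # dequantized
    return codes, approx

weights = [-0.9, -0.1, 0.0, 0.3, 1.2]
codes, approx = quantize_int8(weights)
print(codes)  # the minimum maps to -128, the maximum to 127
```

Each dequantized value is within half a step of the original, which is why accuracy usually survives the squeeze while the arithmetic gets much cheaper.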
So 693 00:28:04,159 --> 00:28:08,080 basically what we're doing is we're 694 00:28:05,360 --> 00:28:09,840 approximating the data inside the 695 00:28:08,080 --> 00:28:12,399 model, like the weights, into 696 00:28:09,840 --> 00:28:14,799 different values. Do you need 697 00:28:12,399 --> 00:28:18,159 to know any of this? No, not really. Uh 698 00:28:14,799 --> 00:28:20,000 all you need to know is that um by using 699 00:28:18,159 --> 00:28:22,640 this technique you can run things on 700 00:28:20,000 --> 00:28:24,960 smaller hardware and it will run faster. 701 00:28:22,640 --> 00:28:26,320 So when you hear quantization, or you see 702 00:28:24,960 --> 00:28:28,480 there's different flavors of a model 703 00:28:26,320 --> 00:28:30,960 that you can pull, um you can use a 704 00:28:28,480 --> 00:28:33,120 smaller model to basically get similar 705 00:28:30,960 --> 00:28:36,000 answers and similar accuracy uh but 706 00:28:33,120 --> 00:28:37,120 with much smaller hardware. Uh the other 707 00:28:36,000 --> 00:28:38,960 thing you can do is something called 708 00:28:37,120 --> 00:28:42,159 semantic caching. This is a really 709 00:28:38,960 --> 00:28:45,840 really new technique. Um 710 00:28:42,159 --> 00:28:48,240 in uh Python there is a great function 711 00:28:45,840 --> 00:28:50,720 built into functools called lru_cache. You can use it 712 00:28:48,240 --> 00:28:52,159 as a decorator on a Python function. It 713 00:28:50,720 --> 00:28:54,799 basically means if you call the function 714 00:28:52,159 --> 00:28:56,399 uh with the same arguments it 715 00:28:54,799 --> 00:28:59,279 will cache the result and give you back 716 00:28:56,399 --> 00:29:01,279 the result from the first time.
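A minimal example of functools.lru_cache doing exactly that (the slow function here is a stand-in for an expensive model call, with a made-up sleep for latency):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)
def slow_answer(question):
    """Stand-in for an expensive LLM call; results are cached per exact string."""
    time.sleep(0.05)  # simulated model latency
    return f"answer to: {question}"

t0 = time.perf_counter()
slow_answer("What is the capital of France?")   # does the slow work
first = time.perf_counter() - t0

t0 = time.perf_counter()
slow_answer("What is the capital of France?")   # identical string: cache hit
second = time.perf_counter() - t0

slow_answer("What is the capital city of France?")  # reworded: cache miss
print(slow_answer.cache_info())  # hits=1, misses=2
```

The second call returns almost instantly, but note the last line: the reworded question is a miss, because lru_cache only compares arguments for exact equality.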
Um you should use 717 00:28:59,279 --> 00:29:03,120 lru_cache a lot where you have a function 718 00:29:01,279 --> 00:29:05,039 that takes a long time to execute and 719 00:29:03,120 --> 00:29:07,200 you just want to cache the responses 720 00:29:05,039 --> 00:29:09,919 into memory. So I really recommend using 721 00:29:07,200 --> 00:29:12,559 lru_cache. The problem is that it is very 722 00:29:09,919 --> 00:29:14,720 specific. Um so if we ask more or less 723 00:29:12,559 --> 00:29:16,320 the same question in different words, it's just going 724 00:29:14,720 --> 00:29:19,760 to compare the strings, and it won't see them as 725 00:29:16,320 --> 00:29:21,360 the same thing. Um so in the QR code on 726 00:29:19,760 --> 00:29:23,840 the bottom right hand corner, that one 727 00:29:21,360 --> 00:29:26,480 is not a Rick roll. It is a link to a 728 00:29:23,840 --> 00:29:27,919 repo that I put together for this talk 729 00:29:26,480 --> 00:29:30,640 with a demo of something called a 730 00:29:27,919 --> 00:29:34,720 semantic cache. It is a function 731 00:29:30,640 --> 00:29:37,600 decorator. Um and instead of using 732 00:29:34,720 --> 00:29:40,320 literal comparisons, it does an 733 00:29:37,600 --> 00:29:42,000 embedding model similarity, and you can 734 00:29:40,320 --> 00:29:44,480 tell it at what threshold you 735 00:29:42,000 --> 00:29:46,000 want to match the inputs. So basically 736 00:29:44,480 --> 00:29:48,000 if you had the question that I had in 737 00:29:46,000 --> 00:29:49,360 the last slide, which is uh what is the 738 00:29:48,000 --> 00:29:51,039 capital of France and what is the 739 00:29:49,360 --> 00:29:52,960 capital city of France, it would see 740 00:29:51,039 --> 00:29:56,080 those as being very similar questions 741 00:29:52,960 --> 00:29:58,720 and actually cache the answer.
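The repo behind the QR code has the real implementation; this is only a toy sketch of the idea, with a bag-of-words cosine standing in for a proper embedding model and a made-up 0.8 similarity threshold:

```python
import math
from collections import Counter
from functools import wraps

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def semantic_cache(threshold=0.8):
    """Cache hits on *similar* prompts, not just identical strings."""
    def decorator(fn):
        store = []  # list of (embedding, result) pairs
        @wraps(fn)
        def wrapper(prompt):
            emb = toy_embed(prompt)
            for cached_emb, result in store:
                if cosine(emb, cached_emb) >= threshold:
                    return result  # close enough: reuse the old answer
            result = fn(prompt)
            store.append((emb, result))
            return result
        return wrapper
    return decorator

calls = []

@semantic_cache(threshold=0.8)
def ask_llm(prompt):
    calls.append(prompt)  # track how often the "model" is really called
    return "Paris"

ask_llm("What is the capital of France?")
ask_llm("What is the capital city of France?")  # similar: served from cache
print(len(calls))  # only one real model call happened
```

The two phrasings share almost all their words, so their cosine similarity clears the threshold and the second call never reaches the model; lru_cache would have called it twice.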
Um this 742 00:29:56,080 --> 00:30:00,240 is quite a new technique and we're still 743 00:29:58,720 --> 00:30:02,880 kind of working out some details on 744 00:30:00,240 --> 00:30:05,200 this. Um but it really gives you the 745 00:30:02,880 --> 00:30:06,720 ability to improve performance uh where 746 00:30:05,200 --> 00:30:10,120 you've got users asking very, very 747 00:30:06,720 --> 00:30:10,120 similar questions. 748 00:30:11,200 --> 00:30:14,960 Okay. The next technique is something 749 00:30:12,640 --> 00:30:17,360 called model routing. Uh this is where 750 00:30:14,960 --> 00:30:18,799 you have an input uh and you're 751 00:30:17,360 --> 00:30:21,039 basically trying to direct it to the 752 00:30:18,799 --> 00:30:24,399 cheapest and fastest model. There are 753 00:30:21,039 --> 00:30:27,440 lots of challenges with this approach, um, 754 00:30:24,399 --> 00:30:29,760 as has been revealed with GPT-5, 755 00:30:27,440 --> 00:30:31,279 which has this built in, and I don't 756 00:30:29,760 --> 00:30:32,320 think people realized they were using it, 757 00:30:31,279 --> 00:30:35,360 and that's why there's been this 758 00:30:32,320 --> 00:30:36,880 mismatch in expectations with GPT-5. But 759 00:30:35,360 --> 00:30:39,440 basically you get a user question and 760 00:30:36,880 --> 00:30:42,399 you say, is this a simple question? If so, 761 00:30:39,440 --> 00:30:44,720 give it to a cheap model. Uh if it's not, 762 00:30:42,399 --> 00:30:46,720 then give it to a chat model, and then 763 00:30:44,720 --> 00:30:49,120 otherwise give it to a reasoning model. 764 00:30:46,720 --> 00:30:51,120 Uh we have effectively run out of time 765 00:30:49,120 --> 00:30:52,880 so I'm going to summarize the talk. Uh 766 00:30:51,120 --> 00:30:54,399 one, smaller models are faster. The 767 00:30:52,880 --> 00:30:56,640 bigger the model, the longer the 768 00:30:54,399 --> 00:30:58,159 response. Models are paid by the word.
769 00:30:56,640 --> 00:31:01,200 Um so keep that in mind when you're 770 00:30:58,159 --> 00:31:03,760 using them. Longer responses take more 771 00:31:01,200 --> 00:31:06,640 time. Uh and users have preset 772 00:31:03,760 --> 00:31:08,640 expectations on responsiveness as well. 773 00:31:06,640 --> 00:31:10,960 So um yeah, thank you very much for your 774 00:31:08,640 --> 00:31:13,919 time. My name is Anthony, and I'll leave 775 00:31:10,960 --> 00:31:17,520 the last slide up. There we go. All 776 00:31:13,919 --> 00:31:17,520 right. Thanks very much.