[Music]

Good afternoon everyone. Welcome back to the Data and AI track here at PyCon AU 2025. Very happy to introduce Dilprit, who is here to present "Beyond Vibes: Building Evals for Generative AI". Can I have a warm round of welcoming applause for Dilprit?

[Applause]

Thanks for that. Hello, hello. I am here to present the super exciting world of evals to you all. My name is Dilprit and I've been doing machine learning for about 10 years. I run a little place called Loom Labs, where I work with small organizations and large organizations, helping companies ship ML products.

So in this talk I hope to first cover why evals are important and why we should pay attention to them. Then a sort of evals 101, so that if you're writing your first prompt or your first model, you can take some of this stuff back and apply it. And along the way I'll hopefully make all these evals a bit more concrete with some examples. So let's get started.
First thing is the hype slide, because this is going to be the most exciting thing. We live in a very exciting time where we get new models every week. This is a chart I've created that only covers the major labs, the frontier labs as we call them, and it doesn't cover all the small open-source contributions we see, but you can see it's relentless, and exciting at the same time. But when you talk about evals next to these shiny models, you can feel like the bearer of bad news, like turning the lights on at a party. The new and shiny isn't going to solve all your problems, even though benchmarks tend to go up.

This is my favorite benchmark. I could have shown you any bar chart where OpenAI says "look, we're at 90% on this benchmark", but this is from someone called Simon Willison, who likes to take a model and give it the prompt: create an SVG of a pelican riding a bicycle.
You can see we start from the top left with GPT-4 Mini, the oldest generation, and I don't know, it doesn't look like a bicycle or a pelican. Then at the bottom right we have GPT-5, and you have a Tour de France style, very aerodynamic pelican.

So when you look at all these benchmarks, even fun proxy benchmarks, you might think: well, my problem, my 10-step agent workflow that I've been cooking, it doesn't work right now, but surely GPT-6 or GPT-7 is going to work, right? Yeah, unless you're trying to draw pelicans, maybe not. So yes, the models keep getting better, but it's important to stay grounded: your problems aren't just going to be magically solved by OpenAI.

Evals, I mean, they've been around forever. If there are any data scientists in the room, they might be like, "This guy, evals, we've been doing them forever. Why are we talking about this right now?" Right? These are just some of the choice metrics I pulled.
You might be familiar with them, or maybe not, but why now? Because it's never been as accessible to write and ship an AI model as it is right now. You can just pip install the Anthropic SDK, get an API key, write some English, and you've got a chatbot. That's very different to how it was 10 years ago. These tools weren't really this powerful back then, and not everyone needed to use them, but now we all can, and that's exciting. But it's also dangerous.

I like to think of these models like this: we just invented cars and everyone's getting free cars, but I'm the guy talking about seat belts. You can go super fast and that's fine, but just a little bit of safety can go a long way.

And this is sort of what I mean. We're all very familiar with the left-hand side; also, I had to show a bit of Python, of course. We're very familiar with the left-hand side, and we know it's going to be deterministic.
You can run that for loop millions of times and you're going to get "hello". On the right-hand side is similar-looking code, five lines, but the output: it's in the realm of "hello", but is it the same string? Can you really tell me with 100% certainty what you're going to get out of the right-hand side code? Not really.

This kind of probabilistic thinking may not come naturally, because we're not really used to it. We've got to start thinking in distributions. We've got to think about the consistency of the output and how to narrow down this variety that we have. It's really a change of mindset, and doing evals gets you into these new habits that are very important for our probabilistic world.

Yeah, this slide covers it. It looks like normal code, but it doesn't break like it. It'll drift, it'll improvise, it'll be confidently wrong, and you kind of won't even know it.
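The contrast on this slide can be put in runnable form. The chat call below is stubbed with a seeded random choice so the demo is repeatable; a real API would sample from a far richer distribution:

```python
import random
from collections import Counter

def greet() -> str:
    # Ordinary code: same input, same output, every single run.
    return "hello"

def fake_llm(prompt: str, rng: random.Random) -> str:
    # Stand-in for a real chat-completion call: the API returns
    # *samples* from a distribution, not one fixed string.
    return rng.choice(["hello", "Hello!", "Hi there!", "hello :)"])

rng = random.Random(0)  # seeded only so the demo is repeatable
deterministic = Counter(greet() for _ in range(100))
probabilistic = Counter(fake_llm("Say hello", rng) for _ in range(100))

print(deterministic)  # exactly one key: {'hello': 100}
print(probabilistic)  # a spread over several "hello"-ish variants
```

Counting distinct outputs over repeated runs like this is the simplest way to start thinking in distributions rather than in single answers.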
I think someone in a previous presentation had a slide about Air Canada and their chatbot promising a consumer something. And Reddit is full of this, right? You go to Reddit and someone's like, "Oh, if you go to this e-commerce chatbot, you can just get a refund or a voucher", because they know how to push the chatbot just the way they want, to get whatever they want out of it. But did the business really expect that? Did they test it? Did they think about these scenarios and what the internet can do to a chatbot? I don't think so. And that's why, again: evals.

Before we start looking at ways we can address the evaluation situation, there's the other side of evals. I like to think of evals as the thing you do before you've decided to ship a model to prod or to your users, and verification as what you can do afterwards. So you eval what the model is going to do, but you could also just verify all the outputs that the model produces.
So for example, say you were the New York Times and you wanted this AI summarization thing that everyone's pushing. Summaries are really hot right now, so we're going to do summaries. But the New York Times can't afford to trust a summary from an AI bot. They need to verify. They have zero risk tolerance, right? So they could do all the evals, but at the end of the day they've got to have a human make sure it didn't make anything up, all the facts are there, etc. That is verification, and that's great for the New York Times: big organization, it's their mission, they can afford it. But not everyone can verify everything. You cannot check all of the messages your chatbot is going to send, or anything like that. And of course, that's not even talking about the money. We're trying to use these models and software to create efficiencies, but if we then verify everything they produce, why not just have a human do the thing in the first place? So verification isn't the answer for everything.
But it is an important thing to consider. A friend, when I was rehearsing this talk, asked: "Okay, what if I'm building a travel itinerary recommender for people? Do I verify or do I eval?" I said, okay, let's talk through this. If you're recommending itineraries, you've got to make sure that the places and restaurant names you suggest actually exist, right? Those need to be verified, and that can be done with a Google API call or whatever it may be. But if you recommend a French restaurant instead of an Italian one, that doesn't matter. So it's not just the whole system; you've got to think about the components of your system. So that's verification.

All right. Now, hopefully I've motivated why we need evals, why I think they're important, and why you should think they're important. These are the steps we're going to go through. This is sort of a 101. I'm going to try to introduce all these concepts, but you can make this as simple or as complex as you like.
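The itinerary example splits the system into components: hard facts get verified, taste gets evaluated. A minimal sketch, with a stub standing in for a real places lookup (the place names here are invented):

```python
# Known places stand in for what a real lookup (e.g. a places API
# request) would return; the names are made up for illustration.
KNOWN_PLACES = {"Chez Pierre", "Trattoria Roma", "Flinders Street Station"}

def place_exists(name: str) -> bool:
    # Stub for the real existence check.
    return name in KNOWN_PLACES

def verify_itinerary(stops: list[str]) -> list[str]:
    # Return the stops that fail verification, so they can be
    # regenerated or flagged rather than shipped to the user.
    return [s for s in stops if not place_exists(s)]

bad = verify_itinerary(["Chez Pierre", "Cafe Imaginaire"])
print(bad)  # ['Cafe Imaginaire']
```

Whether the model picked French over Italian is left out on purpose: that part is subjective, so it belongs to evals, not verification.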
These can be integrated into your CI/CD pipelines or your pre-merge setup, but we're not going to talk about that. We're going to focus on the steps themselves. So the first thing I'm going to talk about is the golden set. Then we have quick sanity checks, then human preference capture, which is what you do with subjective outputs. Then a newish technique called LLM-as-a-judge. And then we rinse and repeat until we get a good model.

All right, first one: the golden set. This may not seem very exciting, but I've got to tell you, this is one of the most useful things you can do. You can use Numbers, Google Sheets or whatever it may be, and it's simply a list of things that you'd like your model to be good at. The example I've chosen is a support chatbot, a chatbot that will simply answer support queries. So what is a golden set? It's a set of inputs that I expect my bot to handle, right?
And this doesn't need to be a thousand rows; 10 or 20 to start off with. For example: "How do I reset my password?" And given that input query, what do I really expect the bot to say to me? Another query might be "What are your business hours?", or maybe how to handle a customer who's not really happy about things. So those are my inputs and what I expect the bot to do. Then I have a whole bunch of columns for categorization so I can do filtering, but you can kind of ignore those: the first two columns are all you need, and the later ones just help you organize your set.

Okay, so let's say you wrote down 10 things that you'd like your bot to do. What then? OpenAI just announced a new model, GPT-7. What do you do? You take your list, one by one, you put each row into GPT-7 and see what it does. You put the row in three times, just for consistency's sake. Did it do well? You get a check. Didn't do well? You get a cross, right?
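That checklist loop is small enough to sketch end to end. The model call and the pass/fail check below are deliberately naive stand-ins (canned answers, substring matching) for your real bot and your real judgment:

```python
# A tiny golden set: (input query, something a good answer must mention).
GOLDEN_SET = [
    ("How do I reset my password?", "reset link"),
    ("What are your business hours?", "9am"),
    ("I want to speak to a human", "support team"),
]

def model(query: str) -> str:
    # Stub standing in for a real chat API call.
    canned = {
        "How do I reset my password?": "Click the reset link in your email.",
        "What are your business hours?": "We're open 9am to 5pm weekdays.",
        "I want to speak to a human": "Let me look that up...",
    }
    return canned[query]

def passes(answer: str, must_mention: str) -> bool:
    # Naive check; in practice this is your own judgment, a regex,
    # or (later in the talk) an LLM judge.
    return must_mention.lower() in answer.lower()

RUNS = 3  # send each row a few times, for consistency's sake
results = [
    passes(model(query), expected)
    for query, expected in GOLDEN_SET
    for _ in range(RUNS)
]
score = 100 * sum(results) / len(results)
print(f"{score:.0f}% of golden-set runs passed")
```

The single percentage at the end is the point: rerun the same loop against a new model or a new prompt and you can compare the two numbers directly.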
You go through the list, and at the end you'll have a nice percentage. Is it 80%? Is it 70%? Is it better than or worse than what you have right now? And that immediately gives you confidence: is the new model, or my prompt change, or whatever it may be, really helping solve my problem? That's why this is a very simple but critical piece of testing these models.

And yeah, don't try to cover the world with this. I like to break these down and have, say, 10 golden sets, depending on whatever I'm thinking about. Just start off small and, trust me, it'll grow organically and you'll add to it as you think of things.

Next: sanity checks. I'm trying to cover a variety of inputs here, so let's say you're building a document parser. You get given a document, you feed it into your favorite model, you get nice structured JSON output back, and in there you have some fields, right?
You have the date; in this case we're parsing an invoice, so who it was billed to, what the amounts are, what the tax is. And this is plain old code that we're used to. You can unit test that. You can make sure that your date is actually a date, because you'd be surprised: these models can make up anything. Their dates are not the same as our dates; we've got to check that they're dates. Their numbers: do they have strings in them? Are they negative? Do you live in a country with a negative tax percentage? I don't think so. So you should check. Are they internally consistent? Do the totals add up to what they should? And for the text fields, is there only one word in there when you expect 20? Is the email field actually formatted as an email?
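Those sanity checks are ordinary unit-test material. A sketch, with invented field names; adapt them to whatever JSON your parser actually emits:

```python
from datetime import date

def check_invoice(inv: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # The model's "dates" are not always dates: insist on a real ISO date.
    try:
        date.fromisoformat(inv.get("date", ""))
    except ValueError:
        problems.append("date is not a real ISO date")
    # Numbers should be numbers, and nobody has a negative tax percentage.
    if not isinstance(inv.get("tax"), (int, float)) or inv["tax"] < 0:
        problems.append("tax should be a non-negative number")
    # Internal consistency: the totals should add up.
    subtotal = inv.get("subtotal", 0)
    tax = inv.get("tax", 0) if isinstance(inv.get("tax"), (int, float)) else 0
    if abs(subtotal + tax - inv.get("total", 0)) > 0.01:
        problems.append("totals do not add up")
    # Text fields should look like what they claim to be.
    if "@" not in inv.get("billed_to_email", ""):
        problems.append("billed_to_email is not an email")
    return problems

bad = check_invoice({"date": "2025-13-40", "tax": -5,
                     "subtotal": 100.0, "total": 90.0,
                     "billed_to_email": "not-an-email"})
print(bad)  # all four checks fail on this made-up invoice
```

The same function can run offline while comparing models and again at serving time as a verification gate.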
So if you're testing two 330 00:14:42,639 --> 00:14:48,880 models, you give them the same document, 331 00:14:46,480 --> 00:14:52,320 you pause it, you have these tests and 332 00:14:48,880 --> 00:14:56,240 you check does model A consistently get 333 00:14:52,320 --> 00:14:58,480 dates right? Does model B not? Suddenly 334 00:14:56,240 --> 00:15:01,120 you can actually make a valid choice 335 00:14:58,480 --> 00:15:03,519 based on grounded data. And of course 336 00:15:01,120 --> 00:15:06,720 you can automate this, right? So this 337 00:15:03,519 --> 00:15:09,680 will also help in verification. So if 338 00:15:06,720 --> 00:15:12,399 someone is uh giving you an invoice and 339 00:15:09,680 --> 00:15:14,079 you're pausing it and you generate a 340 00:15:12,399 --> 00:15:16,480 number that doesn't look like a number, 341 00:15:14,079 --> 00:15:18,720 maybe instead of returning 342 00:15:16,480 --> 00:15:20,880 negatives, you sort of say, I don't 343 00:15:18,720 --> 00:15:24,800 know, something went wrong. Let's try 344 00:15:20,880 --> 00:15:28,079 again. Um so this again, seemingly 345 00:15:24,800 --> 00:15:32,320 simple checks can be very critical, 346 00:15:28,079 --> 00:15:33,519 especially at test time. 347 00:15:32,320 --> 00:15:37,839 How about we move to something 348 00:15:33,519 --> 00:15:41,279 complicated? Um, now you want to compare 349 00:15:37,839 --> 00:15:43,839 subjective outputs. So you've got, so 350 00:15:41,279 --> 00:15:47,040 you're an Amazon vendor and you've got 351 00:15:43,839 --> 00:15:49,040 50,000 products and you think using AI 352 00:15:47,040 --> 00:15:51,040 to create these descriptions is like the 353 00:15:49,040 --> 00:15:54,800 way to go. So you're going to gen 354 00:15:51,040 --> 00:15:57,360 generate 50,000 product descriptions. 355 00:15:54,800 --> 00:15:59,839 What is a good product description? How 356 00:15:57,360 --> 00:16:03,360 do you know? Right? So you have three 357 00:15:59,839 --> 00:16:05,120 models. 
They each give you an output, but how do you really choose between them? Humans, it turns out, can be important in this process. I don't know how many of you use the ChatGPT app, but I get this very often: I'll get two model responses, and OpenAI is using me to tell them which model is better. So this is a proven approach. I think VC, in his presentation, also talked about getting humans to label and gather data; he wanted 2,500 samples but only got a thousand. So aim for the highest number you can. With 100 samples, 500 samples, suddenly you can tell: does everyone really prefer model A, or is it a competition between A and C? You can use very simple metrics like percentages, or you can make it a chess-style Elo ranking, where the best models compete against each other. Again, the simplicity or complexity of these is kind of all up to you.
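Both scoring options can be sketched in a few lines. The votes here are made up, and the Elo update is the standard chess-style formula:

```python
from collections import Counter

# Each vote records which model's description a human rater preferred.
votes = ["A", "A", "C", "A", "B", "C", "A", "C", "A", "B"]

# Option 1: simple percentages.
tally = Counter(votes)
for model_name, n in tally.most_common():
    print(f"model {model_name}: {100 * n / len(votes):.0f}%")

# Option 2: a chess-style Elo rating built from head-to-head wins.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    # Standard Elo: the winner gains more when the win was unexpected.
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
for winner, loser in [("A", "B"), ("A", "C"), ("C", "B"), ("A", "B")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
print(ratings)
```

Percentages are fine to start with; Elo pays off once you have many models and pairwise comparisons rather than one big vote.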
But an important thing to capture here is the field at the bottom: why did you choose this option? When someone chooses an option, get them to put in a rationale: why did they pick the option they picked? And the reason for that is that we're going to try to automate the human work. This is where LLM-as-a-judge comes in. So given that you have your 50,000 product descriptions: sure, you know that model A won out by 60 to 40. That's not a clear victory. How do you really know that the 50,000 descriptions you generated are going to be good? Unless you pay someone a lot of money, it's going to be very hard for people to check 50,000 descriptions. So this is where we use another model to check a model's work; we're kind of getting meta here. We take the description that we generated from our model, then we give our judge model some assessment criteria: is it playful? Does this description really speak to my brand?
Is there clarity? Are there errors? Does it have bad language in there? And suddenly you can get from a very subjective description to a very structured output on the right, which is JSON. And we all know and love JSON, right? You can work through these and figure out, of the 50,000 descriptions, how many were approved by the judge and how many really need human review. And this can be really important, especially at scale.

Having said that, now you have another problem: who's going to validate the judge model? There is such a thing as a meta-judge model, but we're not going to go down that rabbit hole. This is again rinse and repeat: you've got to have a golden set for your judge model too. You've got to make sure that the judge agrees with you, so that when it's judging things, its judgments match your expectations. But this is very important: we cannot blindly trust probabilistic models. You've got to verify.
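A sketch of the judge loop, with the judge stubbed out as fixed rules (a real one would be a model call prompted to return JSON) and the criteria names invented:

```python
import json

CRITERIA = {"playful", "on_brand", "clear", "error_free"}

def judge(description: str) -> dict:
    # Stub: a real judge is a chat API call that returns JSON text.
    # These toy rules only exist to make the sketch runnable.
    raw = json.dumps({
        "playful": "fun" in description.lower(),
        "on_brand": True,
        "clear": len(description.split()) >= 5,
        "error_free": True,
    })
    verdict = json.loads(raw)
    # Never trust the judge's structure blindly: validate the schema.
    if set(verdict) != CRITERIA or not all(isinstance(v, bool) for v in verdict.values()):
        raise ValueError("judge returned malformed JSON")
    return verdict

def approved(verdict: dict) -> bool:
    # Approved descriptions ship; the rest go to human review.
    return all(verdict.values())

descriptions = [
    "A fun, durable water bottle for every adventure you take.",
    "Bottle good.",
]
flags = [approved(judge(d)) for d in descriptions]
print(flags)  # [True, False]: the second one needs human review
```

The schema validation is the easy half; the judge's verdicts themselves should still be spot-checked against a small golden set of human judgments, exactly as described above.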
And the final thing I really wanted to touch on was just old-school error analysis. Whatever your judge or your check or your flag may be, maybe you're just using heuristics: in a chatbot, say your chatbot produced four paragraphs' worth of text in one go. That should raise a flag. Why is it producing so much output? You can capture all those flagged things, put them in a CSV file, and just look through them. And then you've got to do the hard work of looking through your data systematically. I'm sure if you flag 500 examples, you're not going to have 500 different errors; there are going to be about five, and they're going to account for 80% of your issues. And what do you do then? You fix your issue. You take three to five examples of each issue and you put them in your golden set, right?
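The bucketing step is a one-liner with `collections.Counter`; the error categories below are invented for illustration:

```python
from collections import Counter

# Flagged outputs, each already labeled with a failure mode by hand.
flagged = [
    {"id": 1, "error": "too_long"},
    {"id": 2, "error": "made_up_fact"},
    {"id": 3, "error": "too_long"},
    {"id": 4, "error": "too_long"},
    {"id": 5, "error": "wrong_tone"},
    {"id": 6, "error": "made_up_fact"},
]

# A handful of categories usually dominates; count them to find out which.
counts = Counter(row["error"] for row in flagged)
total = len(flagged)
for error, n in counts.most_common():
    print(f"{error}: {n} ({100 * n / total:.0f}% of flagged rows)")

# Close the loop: take a few examples of the top error for the golden set.
top_error = counts.most_common(1)[0][0]
golden_candidates = [row for row in flagged if row["error"] == top_error][:3]
print(golden_candidates)
```

The hard part is the labeling, which stays manual; the counting just tells you where to spend your fixing effort first.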
And you've closed the loop, so that the next model you develop, the next prompt you write, doesn't make the same mistake. Because that's what we're trying to do: we're trying to build confidence. We're trying to know, with some sort of certainty, that these models aren't going to produce random outputs.

And I'd be remiss if I didn't talk about latency and cost, because you can of course use a bigger model, or thinking tokens, or whatever it may be, to try to squeeze out a few more percentage points, say going from 80% on your golden set to 85%. But if the cost is 2x, is that really worth it to your organization? That's a choice completely up to you. It's easy to be driven by quality alone, but latency and cost are very important considerations.

Now, I thought I'd walk you through some of the things we talked about in a very different domain. So far it's been text, text, and more text, because text is super popular.
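That 80%-to-85%-at-double-the-cost question is just arithmetic, and it's worth writing down. A sketch with invented prices:

```python
# Invented numbers: a baseline model vs. a bigger, pricier one.
candidates = {
    "small-model": {"accuracy": 0.80, "usd_per_1k_calls": 1.00},
    "big-model":   {"accuracy": 0.85, "usd_per_1k_calls": 2.00},
}

base = candidates["small-model"]
for name, c in candidates.items():
    extra_points = (c["accuracy"] - base["accuracy"]) * 100
    extra_cost = c["usd_per_1k_calls"] - base["usd_per_1k_calls"]
    if extra_points > 0:
        # Dollars per extra point of golden-set accuracy, per 1k calls:
        # the number to take to your organization.
        print(name, round(extra_cost / extra_points, 2))
```

Whether that price per point is worth paying is the organizational choice the talk describes; the eval just makes it explicit.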
But this is a project I have with Monash University called Eduard, where we take elevation models, height maps of a landform, and turn them into Swiss-style relief shadings. These get used by expert cartographers for things like National Geographic books, and published as maps. Once you start noticing them, they're everywhere: Google Maps uses relief shading, for instance.

We were working on a new feature for contours. If you're not familiar with contours, basically different colors represent different elevation ranges. The darkest green is the lowest, and then we build up through the browns and dark browns to the purples. It's a way to represent height, so if you're hiking somewhere, you can look at the map and figure out roughly where you stand.

But making a map like this is actually very difficult, because it's very subjective. Where do you cut off a landform? Do you join two landforms? Do you leave out a valley? These are very human preferences.
So think about how you'd evaluate this. The most obvious approach: you have pixel information, and your model produces an image, so why not just check pixel colors? Is green matched with green, purple with purple? Yes, but where? If that huge block at the bottom is all correct, that's maybe 20% of our data. But if the model gets all the tiny crevices wrong, which are less than, you know, 0.0001% of the pixels, that's exactly what's important to us. So how do we evaluate this?

Golden sets, right? I've taken a section of the map and I'll show you what we did. We created masks; these are like the rows in our CSV. One mask represents the tiny areas we actually care about, the areas where performance is critical, because we know the model is going to get the other stuff right. Our expert cartographers told us this is where it tends to make mistakes. And we're not just going to have one map.
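A toy sketch of why the naive pixel check misleads, and what a mask buys you. The 4x4 grid and its single "crevice" pixel are invented:

```python
# Toy 4x4 "map": 0 = the big easy block, 1 = a tiny crevice class.
truth = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# A prediction that nails the big block but misses the one crevice pixel.
pred = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Mask marking the pixels the cartographers actually care about.
mask = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

def accuracy(truth, pred, mask=None):
    """Pixel accuracy, optionally restricted to a region-of-interest mask."""
    hits = total = 0
    for i in range(len(truth)):
        for j in range(len(truth[0])):
            if mask is not None and not mask[i][j]:
                continue
            total += 1
            hits += truth[i][j] == pred[i][j]
    return hits / total

print(accuracy(truth, pred))        # 0.9375: looks great overall
print(accuracy(truth, pred, mask))  # 0.0: fails exactly where it matters
```

The overall score rewards the model for the easy 15 pixels; the masked score exposes the failure on the one pixel that counts.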
Just like our golden set, we're going to have multiple masks that focus on different areas: these focus on the tiny landforms, those focus on other types of landform. Once we have that, we run a bunch of models and see how they do on our golden set. This is what we call a parallel coordinates plot. Each line, running all the way from left to right, represents a single model and its performance on the different metrics. You can ignore the left-hand side and just focus on the three axes at the right, because those are the masks I showed you. And immediately we can see that a bunch of models drop towards the bottom. That's a clear indication they're not going to be good. We know we only need to focus on the top third of that top branch. It's a really easy way to cut through thousands of model iterations really quickly.
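That triage step can also be done programmatically. A sketch with invented run names and scores: keep only the runs that clear a floor on every mask metric, and never look at the bottom branch again:

```python
# Invented scores: each run's performance on the three mask metrics
# from the parallel-coordinates plot.
runs = {
    "run-a": {"tiny_landforms": 0.91, "ridges": 0.88, "valleys": 0.90},
    "run-b": {"tiny_landforms": 0.52, "ridges": 0.61, "valleys": 0.58},
    "run-c": {"tiny_landforms": 0.89, "ridges": 0.93, "valleys": 0.87},
    "run-d": {"tiny_landforms": 0.40, "ridges": 0.45, "valleys": 0.50},
}

# A run must clear the floor on EVERY mask metric to stay in contention.
FLOOR = 0.85
shortlist = sorted(
    name for name, m in runs.items() if all(v >= FLOOR for v in m.values())
)
print(shortlist)  # ['run-a', 'run-c']
```

Out of thousands of iterations, only the shortlist goes forward to human review.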
And you'll see that there's a line that goes right to the top, and you might say, "Well, isn't that the best model?" No, it's not, because this is subjective. Just because your metric says something is the best doesn't mean it is; in subjective domains, human preferences and human aesthetics are the final arbiter.

So what did we do next? We captured preference data from our experts. We built a completely custom interface, and I'm very pro building custom interfaces: Streamlit, anything. You can use open-source tools like Label Studio, but I feel that if this is something you're going to be investing in, a custom interface that's just right lets you move much faster. Here we made it so that if you zoomed into any of the maps, they would all zoom in at the same time, so the experts could really look at what was going on. And we asked them: which one of these is better?
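Once those side-by-side judgments come in, aggregating them is simple. A sketch with invented model names and judgments, scoring each model by its win rate across the comparisons it appeared in:

```python
from collections import defaultdict

# Invented pairwise judgments from the experts: each tuple is
# (model shown on the left, model shown on the right, winner).
judgments = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-z", "model-z"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-z", "model-x"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for left, right, winner in judgments:
    appearances[left] += 1
    appearances[right] += 1
    wins[winner] += 1

win_rates = {m: wins[m] / appearances[m] for m in appearances}
best = max(win_rates, key=win_rates.get)
print(best, round(win_rates[best], 2))  # model-x 0.75
```

With more data you might fit a Bradley-Terry model instead, but a plain win rate is often enough to pick what ships.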
And that's how we collected preference data and figured out which model we were going to ship in our app.

So hopefully that gives you a taste of the things I talked about, in a very different domain. The last thing I want to mention is easy pitfalls and anti-patterns. This is also quite subjective, just like the contour lines. But I'm not a fan of 10,000-line golden sets where you've kind of forgotten what's really in there, and you're not even sure whether it's just quantity in there, because what you care about is variety. It's very important to know what your evals are and to be looking at them all the time.

Second: I think we all know we can't trust an AI, even if we call it a judge. Just because we call it a judge doesn't make it any better.

Stay away from vanity dashboards, please. A general dashboard with a 4.2 score is meaningless. What does 4.2 mean? What errors is it catching? So, not a fan.
Chasing proxy metrics: again, just because it's the top-performing model on a metric doesn't necessarily mean it's the best model for you. And the same goes for latency, cost, and blind spots. In short: evals are a habit, not just a pretty dashboard. Thank you.

[Applause]

Dilprit, that was fascinating. To say thank you, I've brought you the 2025 PyCon AU speakers mug.

Thank you.

Thank you very much for the talk. We do have time for probably one question.

Hey, great talk. So, I work as an AI engineer at an AI startup, and evals are big for us. We've tried all of these different platforms, like DeepEval, Comet, Weights & Biases, LangSmith, so many. And while they all look good from the outside, once you start using them for really specific stuff, you see the cracks, because they're all so new. So, in your experience, what have you found to work well?
Yeah, that's a great point, because I have used a bunch of those, and that plot you saw was from Weights & Biases. What I've found, personally, for the problems I work on, is to start very simple. Whatever the problem may be, even if I have to write custom charts in matplotlib, I'll do that. I really prefer building things up to the point where I've got a good model and repeatable tests, and only then moving to a platform. So I wouldn't move to Weights & Biases for my first experiment, or my first 10, not even my first 100. But after that, when you kind of know what you want and you've got repeatable things, that's when you upgrade. Yeah.

All right, I apologize, that is probably all we have time for. Another very warm thank-you for Dilprit. Thank you.