[Music]

Good afternoon everyone. Welcome back to the Data and AI track here at PyCon AU 2025. Very happy to introduce Dilprit, who is here to present "Beyond Vibes: Building Evals for Generative AI". Can I have a warm round of welcoming applause for Dilprit?

[Applause]

Thanks for that. Hello, hello. I am here to present the super exciting world of evals to you all. My name is Dilprit and I've been doing machine learning for about 10 years. I run a little place called Loom Labs, where I work with small organizations and large organizations, helping companies ship ML products.

So in this talk I hope to first cover why evals are important and why we should pay attention to them. Then a sort of evals 101, so that if you're writing your first prompt or your first model, you can take some of this stuff back and apply it. And along the way I'll hopefully make all these evals a bit more concrete with some examples. So let's get started.
First thing is the hype slide, because this is going to be the most exciting thing. We live in a very exciting time where we get new models every week. This is a chart I've created that only covers the major labs, the frontier labs as we call them, and it doesn't cover all the small open-source contributions we see, but you can see it's relentless, and exciting at the same time. But when you talk about evals next to these shiny models, you can feel like the bearer of bad news, like turning the lights on at a party. The new and shiny isn't going to solve all your problems, even though benchmarks tend to go up.

This is my favorite benchmark. I could have shown you any bar chart where OpenAI says "look, we're at 90% on this benchmark", but this is from someone called Simon Willison, who likes to take a model and give it the prompt: create an SVG of a pelican riding a bicycle.
You can see we start from the top left with GPT-4 Mini, the oldest generation, and I don't know, it doesn't look like a bicycle or a pelican. Then at the bottom right we have GPT-5, and you have a Tour de France style, very aerodynamic pelican.

So when you look at all these benchmarks, even fun proxy benchmarks, you might think: well, my problem, my 10-step agent workflow that I've been cooking, it doesn't work right now, but surely GPT-6 or GPT-7 is going to work, right? Yeah, unless you're trying to draw pelicans, maybe not. So yes, the models keep getting better, but it's important to stay grounded: your problems aren't just going to be magically solved by OpenAI.

Evals, I mean, they've been around forever. If there are any data scientists in the room, they might be like, "This guy, evals, we've been doing them forever. Why are we talking about this right now?" Right? These are just some of the choice metrics I pulled.
You might be familiar with them, or maybe not, but why now? Because it's never been as accessible to write and ship an AI model as it is right now. You can just pip install the Anthropic SDK, get an API key, write some English, and you've got a chatbot. That's very different to how it was 10 years ago. These tools weren't really this powerful back then, and not everyone needed to use them, but now we all can, and that's exciting. But it's also dangerous.

I like to think of these models like this: we just invented cars and everyone's getting free cars, but I'm the guy talking about seat belts. You can go super fast and that's fine, but just a little bit of safety can go a long way.

And this is sort of what I mean. We're all very familiar with the left-hand side; also, I had to show a bit of Python, of course. We're very familiar with the left-hand side, and we know it's going to be deterministic.
You can run that for loop millions of times and you're going to get "hello". On the right-hand side is similar-looking code, five lines, but the output: it's in the realm of "hello", but is it the same string? Can you really tell me with 100% certainty what you're going to get out of the right-hand side code? Not really.

This kind of probabilistic thinking may not come naturally, because we're not really used to it. We've got to start thinking in distributions. We've got to think about the consistency of the output and how to narrow down this variety that we have. It's really a change of mindset, and doing evals gets you into these new habits that are very important for our probabilistic world.

Yeah, this slide covers it. It looks like normal code, but it doesn't break like it. It'll drift, it'll improvise, it'll be confidently wrong, and you kind of won't even know it.
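The contrast on this slide can be put in runnable form. The chat call below is stubbed with a seeded random choice so the demo is repeatable; a real API would sample from a far richer distribution:

```python
import random
from collections import Counter

def greet() -> str:
    # Ordinary code: same input, same output, every single run.
    return "hello"

def fake_llm(prompt: str, rng: random.Random) -> str:
    # Stand-in for a real chat-completion call: the API returns
    # *samples* from a distribution, not one fixed string.
    return rng.choice(["hello", "Hello!", "Hi there!", "hello :)"])

rng = random.Random(0)  # seeded only so the demo is repeatable
deterministic = Counter(greet() for _ in range(100))
probabilistic = Counter(fake_llm("Say hello", rng) for _ in range(100))

print(deterministic)  # exactly one key: {'hello': 100}
print(probabilistic)  # a spread over several "hello"-ish variants
```

Counting distinct outputs over repeated runs like this is the simplest way to start thinking in distributions rather than in single answers.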
I think someone in a previous presentation had a slide about Air Canada and their chatbot promising a consumer something. And Reddit is full of this, right? You go to Reddit and someone's like, "Oh, if you go to this e-commerce chatbot, you can just get a refund or a voucher", because they know how to push the chatbot just the way they want, to get whatever they want out of it. But did the business really expect that? Did they test it? Did they think about these scenarios and what the internet can do to a chatbot? I don't think so. And that's why, again: evals.

Before we start looking at ways we can address the evaluation situation, there's the other side of evals. I like to think of evals as the thing you do before you've decided to ship a model to prod or to your users, and verification as what you can do afterwards. So you eval what the model is going to do, but you could also just verify all the outputs that the model produces.
So for example, say you were the New York Times and you wanted this AI summarization thing that everyone's pushing. Summaries are really hot right now, so we're going to do summaries. But the New York Times can't afford to trust a summary from an AI bot. They need to verify. They have zero risk tolerance, right? So they could do all the evals, but at the end of the day they've got to have a human make sure it didn't make anything up, all the facts are there, etc. That is verification, and that's great for the New York Times: big organization, it's their mission, they can afford it. But not everyone can verify everything. You cannot check all of the messages your chatbot is going to send, or anything like that. And of course, that's not even talking about the money. We're trying to use these models and software to create efficiencies, but if we then verify everything they produce, why not just have a human do the thing in the first place? So verification isn't the answer for everything.
But it is an important thing to consider. A friend, when I was rehearsing this talk, asked: "Okay, what if I'm building a travel itinerary recommender for people? Do I verify or do I eval?" I said, okay, let's talk through this. If you're recommending itineraries, you've got to make sure that the places and restaurant names you suggest actually exist, right? Those need to be verified, and that can be done with a Google API call or whatever it may be. But if you recommend a French restaurant instead of an Italian one, that doesn't matter. So it's not just the whole system; you've got to think about the components of your system. So that's verification.

All right. Now, hopefully I've motivated why we need evals, why I think they're important, and why you should think they're important. These are the steps we're going to go through. This is sort of a 101. I'm going to try to introduce all these concepts, but you can make this as simple or as complex as you like.
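The itinerary example splits the system into components: hard facts get verified, taste gets evaluated. A minimal sketch, with a stub standing in for a real places lookup (the place names here are invented):

```python
# Known places stand in for what a real lookup (e.g. a places API
# request) would return; the names are made up for illustration.
KNOWN_PLACES = {"Chez Pierre", "Trattoria Roma", "Flinders Street Station"}

def place_exists(name: str) -> bool:
    # Stub for the real existence check.
    return name in KNOWN_PLACES

def verify_itinerary(stops: list[str]) -> list[str]:
    # Return the stops that fail verification, so they can be
    # regenerated or flagged rather than shipped to the user.
    return [s for s in stops if not place_exists(s)]

bad = verify_itinerary(["Chez Pierre", "Cafe Imaginaire"])
print(bad)  # ['Cafe Imaginaire']
```

Whether the model picked French over Italian is left out on purpose: that part is subjective, so it belongs to evals, not verification.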
These can be integrated into your CI/CD pipelines or your pre-merge setup, but we're not going to talk about that. We're going to focus on the steps themselves. So the first thing I'm going to talk about is the golden set. Then we have quick sanity checks, then human preference capture, which is what you do with subjective outputs. Then a newish technique called LLM-as-a-judge. And then we rinse and repeat until we get a good model.

All right, first one: the golden set. This may not seem very exciting, but I've got to tell you, this is one of the most useful things you can do. You can use Numbers, Google Sheets or whatever it may be, and it's simply a list of things that you'd like your model to be good at. The example I've chosen is a support chatbot, a chatbot that will simply answer support queries. So what is a golden set? It's a set of inputs that I expect my bot to handle, right?
And this doesn't need to be a thousand rows; 10 or 20 to start off with. For example: "How do I reset my password?" And given that input query, what do I really expect the bot to say to me? Another query might be "What are your business hours?", or maybe how to handle a customer who's not really happy about things. So those are my inputs and what I expect the bot to do. Then I have a whole bunch of columns for categorization so I can do filtering, but you can kind of ignore those: the first two columns are all you need, and the later ones just help you organize your set.

Okay, so let's say you wrote down 10 things that you'd like your bot to do. What then? OpenAI just announced a new model, GPT-7. What do you do? You take your list, one by one, you put each row into GPT-7 and see what it does. You put the row in three times, just for consistency's sake. Did it do well? You get a check. Didn't do well? You get a cross, right?
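That checklist loop is small enough to sketch end to end. The model call and the pass/fail check below are deliberately naive stand-ins (canned answers, substring matching) for your real bot and your real judgment:

```python
# A tiny golden set: (input query, something a good answer must mention).
GOLDEN_SET = [
    ("How do I reset my password?", "reset link"),
    ("What are your business hours?", "9am"),
    ("I want to speak to a human", "support team"),
]

def model(query: str) -> str:
    # Stub standing in for a real chat API call.
    canned = {
        "How do I reset my password?": "Click the reset link in your email.",
        "What are your business hours?": "We're open 9am to 5pm weekdays.",
        "I want to speak to a human": "Let me look that up...",
    }
    return canned[query]

def passes(answer: str, must_mention: str) -> bool:
    # Naive check; in practice this is your own judgment, a regex,
    # or (later in the talk) an LLM judge.
    return must_mention.lower() in answer.lower()

RUNS = 3  # send each row a few times, for consistency's sake
results = [
    passes(model(query), expected)
    for query, expected in GOLDEN_SET
    for _ in range(RUNS)
]
score = 100 * sum(results) / len(results)
print(f"{score:.0f}% of golden-set runs passed")
```

The single percentage at the end is the point: rerun the same loop against a new model or a new prompt and you can compare the two numbers directly.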
You go through the list, and at the end you'll have a nice percentage. Is it 80%? Is it 70%? Is it better than or worse than what you have right now? And that immediately gives you confidence: is the new model, or my prompt change, or whatever it may be, really helping solve my problem? That's why this is a very simple but critical piece of testing these models.

And yeah, don't try to cover the world with this. I like to break these down and have, say, 10 golden sets, depending on whatever I'm thinking about. Just start off small and, trust me, it'll grow organically and you'll add to it as you think of things.

Next: sanity checks. I'm trying to cover a variety of inputs here, so let's say you're building a document parser. You get given a document, you feed it into your favorite model, you get nice structured JSON output back, and in there you have some fields, right?
You have the date; in this case we're parsing an invoice, so who it was billed to, what the amounts are, what the tax is. And this is plain old code that we're used to. You can unit test that. You can make sure that your date is actually a date, because you'd be surprised: these models can make up anything. Their dates are not the same as our dates; we've got to check that they're dates. Their numbers: do they have strings in them? Are they negative? Do you live in a country with a negative tax percentage? I don't think so. So you should check. Are they internally consistent? Do the totals add up to what they should? And for the text fields, is there only one word in there when you expect 20? Is the email field actually formatted as an email?
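Those sanity checks are ordinary unit-test material. A sketch, with invented field names; adapt them to whatever JSON your parser actually emits:

```python
from datetime import date

def check_invoice(inv: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # The model's "dates" are not always dates: insist on a real ISO date.
    try:
        date.fromisoformat(inv.get("date", ""))
    except ValueError:
        problems.append("date is not a real ISO date")
    # Numbers should be numbers, and nobody has a negative tax percentage.
    if not isinstance(inv.get("tax"), (int, float)) or inv["tax"] < 0:
        problems.append("tax should be a non-negative number")
    # Internal consistency: the totals should add up.
    subtotal = inv.get("subtotal", 0)
    tax = inv.get("tax", 0) if isinstance(inv.get("tax"), (int, float)) else 0
    if abs(subtotal + tax - inv.get("total", 0)) > 0.01:
        problems.append("totals do not add up")
    # Text fields should look like what they claim to be.
    if "@" not in inv.get("billed_to_email", ""):
        problems.append("billed_to_email is not an email")
    return problems

bad = check_invoice({"date": "2025-13-40", "tax": -5,
                     "subtotal": 100.0, "total": 90.0,
                     "billed_to_email": "not-an-email"})
print(bad)  # all four checks fail on this made-up invoice
```

The same function can run offline while comparing models and again at serving time as a verification gate.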
So if you're testing two 330 00:14:42,639 --> 00:14:48,880 models, you give them the same document, 331 00:14:46,480 --> 00:14:52,320 you pause it, you have these tests and 332 00:14:48,880 --> 00:14:56,240 you check does model A consistently get 333 00:14:52,320 --> 00:14:58,480 dates right? Does model B not? Suddenly 334 00:14:56,240 --> 00:15:01,120 you can actually make a valid choice 335 00:14:58,480 --> 00:15:03,519 based on grounded data. And of course 336 00:15:01,120 --> 00:15:06,720 you can automate this, right? So this 337 00:15:03,519 --> 00:15:09,680 will also help in verification. So if 338 00:15:06,720 --> 00:15:12,399 someone is uh giving you an invoice and 339 00:15:09,680 --> 00:15:14,079 you're pausing it and you generate a 340 00:15:12,399 --> 00:15:16,480 number that doesn't look like a number, 341 00:15:14,079 --> 00:15:18,720 maybe instead of returning 342 00:15:16,480 --> 00:15:20,880 negatives, you sort of say, I don't 343 00:15:18,720 --> 00:15:24,800 know, something went wrong. Let's try 344 00:15:20,880 --> 00:15:28,079 again. Um so this again, seemingly 345 00:15:24,800 --> 00:15:32,320 simple checks can be very critical, 346 00:15:28,079 --> 00:15:33,519 especially at test time. 347 00:15:32,320 --> 00:15:37,839 How about we move to something 348 00:15:33,519 --> 00:15:41,279 complicated? Um, now you want to compare 349 00:15:37,839 --> 00:15:43,839 subjective outputs. So you've got, so 350 00:15:41,279 --> 00:15:47,040 you're an Amazon vendor and you've got 351 00:15:43,839 --> 00:15:49,040 50,000 products and you think using AI 352 00:15:47,040 --> 00:15:51,040 to create these descriptions is like the 353 00:15:49,040 --> 00:15:54,800 way to go. So you're going to gen 354 00:15:51,040 --> 00:15:57,360 generate 50,000 product descriptions. 355 00:15:54,800 --> 00:15:59,839 What is a good product description? How 356 00:15:57,360 --> 00:16:03,360 do you know? Right? So you have three 357 00:15:59,839 --> 00:16:05,120 models. 
They each give you an output, but how do you really choose between them? Humans, it turns out, can be important in this process. I don't know how many of you use the ChatGPT app, but I get this very often: I'll get two model responses, and OpenAI is using me to tell them which model is better. So this is a proven approach. I think VC, in his presentation, also talked about getting humans to label and gather data; he wanted 2,500 samples but only got a thousand. So aim for the highest number you can. With 100 samples, 500 samples, suddenly you can tell: does everyone really prefer model A, or is it a competition between A and C? You can use very simple metrics like percentages, or you can make it a chess-style Elo ranking, where the best models compete against each other. Again, the simplicity or complexity of these is kind of all up to you.
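Both scoring options can be sketched in a few lines. The votes here are made up, and the Elo update is the standard chess-style formula:

```python
from collections import Counter

# Each vote records which model's description a human rater preferred.
votes = ["A", "A", "C", "A", "B", "C", "A", "C", "A", "B"]

# Option 1: simple percentages.
tally = Counter(votes)
for model_name, n in tally.most_common():
    print(f"model {model_name}: {100 * n / len(votes):.0f}%")

# Option 2: a chess-style Elo rating built from head-to-head wins.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    # Standard Elo: the winner gains more when the win was unexpected.
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
for winner, loser in [("A", "B"), ("A", "C"), ("C", "B"), ("A", "B")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
print(ratings)
```

Percentages are fine to start with; Elo pays off once you have many models and pairwise comparisons rather than one big vote.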
But an important thing to capture here is the field at the bottom: why did you choose this option? When someone chooses an option, get them to put in a rationale: why did they pick the option they picked? And the reason for that is that we're going to try to automate the human work. This is where LLM-as-a-judge comes in. So given that you have your 50,000 product descriptions: sure, you know that model A won out by 60 to 40. That's not a clear victory. How do you really know that the 50,000 descriptions you generated are going to be good? Unless you pay someone a lot of money, it's going to be very hard for people to check 50,000 descriptions. So this is where we use another model to check a model's work; we're kind of getting meta here. We take the description that we generated from our model, then we give our judge model some assessment criteria: is it playful? Does this description really speak to my brand?
Is there clarity? Are there errors? Does it have bad language in there? And suddenly you can get from a very subjective description to a very structured output on the right, which is JSON. And we all know and love JSON, right? You can work through these and figure out, of the 50,000 descriptions, how many were approved by the judge and how many really need human review. And this can be really important, especially at scale.

Having said that, now you have another problem: who's going to validate the judge model? There is such a thing as a meta-judge model, but we're not going to go down that rabbit hole. This is again rinse and repeat: you've got to have a golden set for your judge model too. You've got to make sure that the judge agrees with you, so that when it's judging things, its judgments match your expectations. But this is very important: we cannot blindly trust probabilistic models. You've got to verify.
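A sketch of the judge loop, with the judge stubbed out as fixed rules (a real one would be a model call prompted to return JSON) and the criteria names invented:

```python
import json

CRITERIA = {"playful", "on_brand", "clear", "error_free"}

def judge(description: str) -> dict:
    # Stub: a real judge is a chat API call that returns JSON text.
    # These toy rules only exist to make the sketch runnable.
    raw = json.dumps({
        "playful": "fun" in description.lower(),
        "on_brand": True,
        "clear": len(description.split()) >= 5,
        "error_free": True,
    })
    verdict = json.loads(raw)
    # Never trust the judge's structure blindly: validate the schema.
    if set(verdict) != CRITERIA or not all(isinstance(v, bool) for v in verdict.values()):
        raise ValueError("judge returned malformed JSON")
    return verdict

def approved(verdict: dict) -> bool:
    # Approved descriptions ship; the rest go to human review.
    return all(verdict.values())

descriptions = [
    "A fun, durable water bottle for every adventure you take.",
    "Bottle good.",
]
flags = [approved(judge(d)) for d in descriptions]
print(flags)  # [True, False]: the second one needs human review
```

The schema validation is the easy half; the judge's verdicts themselves should still be spot-checked against a small golden set of human judgments, exactly as described above.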
And the final thing I really wanted to touch on was just old-school error analysis. Whatever your judge or your check or your flag may be, maybe you're just using heuristics: in a chatbot, say your chatbot produced four paragraphs' worth of text in one go. That should raise a flag. Why is it producing so much output? You can capture all those flagged things, put them in a CSV file, and just look through them. And then you've got to do the hard work of looking through your data systematically. I'm sure if you flag 500 examples, you're not going to have 500 different errors; there are going to be about five, and they're going to account for 80% of your issues. And what do you do then? You fix your issue. You take three to five examples of each issue and you put them in your golden set, right?
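The bucketing step is a one-liner with `collections.Counter`; the error categories below are invented for illustration:

```python
from collections import Counter

# Flagged outputs, each already labeled with a failure mode by hand.
flagged = [
    {"id": 1, "error": "too_long"},
    {"id": 2, "error": "made_up_fact"},
    {"id": 3, "error": "too_long"},
    {"id": 4, "error": "too_long"},
    {"id": 5, "error": "wrong_tone"},
    {"id": 6, "error": "made_up_fact"},
]

# A handful of categories usually dominates; count them to find out which.
counts = Counter(row["error"] for row in flagged)
total = len(flagged)
for error, n in counts.most_common():
    print(f"{error}: {n} ({100 * n / total:.0f}% of flagged rows)")

# Close the loop: take a few examples of the top error for the golden set.
top_error = counts.most_common(1)[0][0]
golden_candidates = [row for row in flagged if row["error"] == top_error][:3]
print(golden_candidates)
```

The hard part is the labeling, which stays manual; the counting just tells you where to spend your fixing effort first.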
And you've closed the loop, so that the next model you develop, the next prompt you write, doesn't make the same mistake. Because that's what we're trying to do: we're trying to build confidence. We're trying to know, with some sort of certainty, that these models aren't going to produce random outputs.

And I'd be remiss if I didn't talk about latency and cost, because you can of course use a bigger model, or thinking tokens, or whatever it may be, to try to squeeze out a few more percentage points, say going from 80% on your golden set to 85%. But if the cost is 2x, is that really worth it to your organization? That's a choice completely up to you. It's easy to be driven by quality alone, but latency and cost are very important considerations.

Now, I thought I'd walk you through some of the things we talked about in a very different domain. So far it's been text, text, and more text, because text is super popular.
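That 80%-to-85%-at-double-the-cost question is just arithmetic, and it's worth writing down. A sketch with invented prices:

```python
# Invented numbers: a baseline model vs. a bigger, pricier one.
candidates = {
    "small-model": {"accuracy": 0.80, "usd_per_1k_calls": 1.00},
    "big-model":   {"accuracy": 0.85, "usd_per_1k_calls": 2.00},
}

base = candidates["small-model"]
for name, c in candidates.items():
    extra_points = (c["accuracy"] - base["accuracy"]) * 100
    extra_cost = c["usd_per_1k_calls"] - base["usd_per_1k_calls"]
    if extra_points > 0:
        # Dollars per extra point of golden-set accuracy, per 1k calls:
        # the number to take to your organization.
        print(name, round(extra_cost / extra_points, 2))
```

Whether that price per point is worth paying is the organizational choice the talk describes; the eval just makes it explicit.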
But this is a project I have with Monash University called Eduard, where we take elevation models, height maps of a landform, and turn them into Swiss-style relief shadings. These get used by expert cartographers for things like National Geographic books, and published as maps. Once you start noticing them, they're everywhere: Google Maps uses relief shading, for instance.

We were working on a new feature for contours. If you're not familiar with contours, basically different colors represent different elevation ranges. The darkest green is the lowest, and then we build up through the browns and dark browns to the purples. It's a way to represent height, so if you're hiking somewhere, you can look at the map and figure out roughly where you stand.

But making a map like this is actually very difficult, because it's very subjective. Where do you cut off a landform? Do you join two landforms? Do you leave out a valley? These are very human preferences.
So think about how you'd evaluate this. The most obvious approach: you have pixel information, and your model produces an image, so why not just check pixel colors? Is green matched with green, purple with purple? Yes, but where? If that huge block at the bottom is all correct, that's maybe 20% of our data. But if the model gets all the tiny crevices wrong, which are less than, you know, 0.0001% of the pixels, that's exactly what's important to us. So how do we evaluate this?

Golden sets, right? I've taken a section of the map and I'll show you what we did. We created masks; these are like the rows in our CSV. One mask represents the tiny areas we actually care about, the areas where performance is critical, because we know the model is going to get the other stuff right. Our expert cartographers told us this is where it tends to make mistakes. And we're not just going to have one map.
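A toy sketch of why the naive pixel check misleads, and what a mask buys you. The 4x4 grid and its single "crevice" pixel are invented:

```python
# Toy 4x4 "map": 0 = the big easy block, 1 = a tiny crevice class.
truth = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# A prediction that nails the big block but misses the one crevice pixel.
pred = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Mask marking the pixels the cartographers actually care about.
mask = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

def accuracy(truth, pred, mask=None):
    """Pixel accuracy, optionally restricted to a region-of-interest mask."""
    hits = total = 0
    for i in range(len(truth)):
        for j in range(len(truth[0])):
            if mask is not None and not mask[i][j]:
                continue
            total += 1
            hits += truth[i][j] == pred[i][j]
    return hits / total

print(accuracy(truth, pred))        # 0.9375: looks great overall
print(accuracy(truth, pred, mask))  # 0.0: fails exactly where it matters
```

The overall score rewards the model for the easy 15 pixels; the masked score exposes the failure on the one pixel that counts.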
Just like our golden set, we're going to have multiple masks that focus on different areas: these focus on the tiny landforms, those focus on other types of landform. Once we have that, we run a bunch of models and see how they do on our golden set. This is what we call a parallel coordinates plot. Each line, running all the way from left to right, represents a single model and its performance on the different metrics. You can ignore the left-hand side and just focus on the three axes at the right, because those are the masks I showed you. And immediately we can see that a bunch of models drop towards the bottom. That's a clear indication they're not going to be good. We know we only need to focus on the top third of that top branch. It's a really easy way to cut through thousands of model iterations really quickly.
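That triage step can also be done programmatically. A sketch with invented run names and scores: keep only the runs that clear a floor on every mask metric, and never look at the bottom branch again:

```python
# Invented scores: each run's performance on the three mask metrics
# from the parallel-coordinates plot.
runs = {
    "run-a": {"tiny_landforms": 0.91, "ridges": 0.88, "valleys": 0.90},
    "run-b": {"tiny_landforms": 0.52, "ridges": 0.61, "valleys": 0.58},
    "run-c": {"tiny_landforms": 0.89, "ridges": 0.93, "valleys": 0.87},
    "run-d": {"tiny_landforms": 0.40, "ridges": 0.45, "valleys": 0.50},
}

# A run must clear the floor on EVERY mask metric to stay in contention.
FLOOR = 0.85
shortlist = sorted(
    name for name, m in runs.items() if all(v >= FLOOR for v in m.values())
)
print(shortlist)  # ['run-a', 'run-c']
```

Out of thousands of iterations, only the shortlist goes forward to human review.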
And you'll see that there's a line that goes right to the top, and you might say, "Well, isn't that the best model?" No, it's not, because this is subjective. Just because your metric says something is the best doesn't mean it is; in subjective domains, human preferences and human aesthetics are the final arbiter.

So what did we do next? We captured preference data from our experts. We built a completely custom interface, and I'm very pro building custom interfaces: Streamlit, anything. You can use open-source tools like Label Studio, but I feel that if this is something you're going to be investing in, a custom interface that's just right lets you move much faster. Here we made it so that if you zoomed into any of the maps, they would all zoom in at the same time, so the experts could really look at what was going on. And we asked them: which one of these is better?
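Once those side-by-side judgments come in, aggregating them is simple. A sketch with invented model names and judgments, scoring each model by its win rate across the comparisons it appeared in:

```python
from collections import defaultdict

# Invented pairwise judgments from the experts: each tuple is
# (model shown on the left, model shown on the right, winner).
judgments = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-z", "model-z"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-z", "model-x"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for left, right, winner in judgments:
    appearances[left] += 1
    appearances[right] += 1
    wins[winner] += 1

win_rates = {m: wins[m] / appearances[m] for m in appearances}
best = max(win_rates, key=win_rates.get)
print(best, round(win_rates[best], 2))  # model-x 0.75
```

With more data you might fit a Bradley-Terry model instead, but a plain win rate is often enough to pick what ships.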
And that's how we collected preference data and figured out which model we were going to ship in our app.

So hopefully that gives you a taste of the things I talked about, in a very different domain. The last thing I want to mention is easy pitfalls and anti-patterns. This is also quite subjective, just like the contour lines. But I'm not a fan of 10,000-line golden sets where you've kind of forgotten what's really in there, and you're not even sure whether it's just quantity in there, because what you care about is variety. It's very important to know what your evals are and to be looking at them all the time.

Second: I think we all know we can't trust an AI, even if we call it a judge. Just because we call it a judge doesn't make it any better.

Stay away from vanity dashboards, please. A general dashboard with a 4.2 score is meaningless. What does 4.2 mean? What errors is it catching? So, not a fan.
Chasing proxy metrics: again, just because it's the top-performing model on a metric doesn't necessarily mean it's the best model for you. And the same goes for latency, cost, and blind spots. In short: evals are a habit, not just a pretty dashboard. Thank you.

[Applause]

Dilprit, that was fascinating. To say thank you, I've brought you the 2025 PyCon AU speakers mug.

Thank you.

Thank you very much for the talk. We do have time for probably one question.

Hey, great talk. So, I work as an AI engineer at an AI startup, and evals are big for us. We've tried all of these different platforms, like DeepEval, Comet, Weights & Biases, LangSmith, so many. And while they all look good from the outside, once you start using them for really specific stuff, you see the cracks, because they're all so new. So, in your experience, what have you found to work well?
Yeah, that's a great point, because I have used a bunch of those, and that plot you saw was from Weights & Biases. What I've found, personally, for the problems I work on, is to start very simple. Whatever the problem may be, even if I have to write custom charts in matplotlib, I'll do that. I really prefer building things up to the point where I've got a good model and repeatable tests, and only then moving to a platform. So I wouldn't move to Weights & Biases for my first experiment, or my first 10, not even my first 100. But after that, when you kind of know what you want and you've got repeatable things, that's when you upgrade. Yeah.

All right, I apologize, that is probably all we have time for. Another very warm thank-you for Dilprit. Thank you.