[Music]

Good afternoon, and welcome back to our next session for PyCon 2025. I hope you have all enjoyed your lunch and are ready and eager to hear from our next speaker, Mr. Anthony Shaw, presenting "My AI Is Slow, Make It Faster." Thank you, Anthony.

All right.

[Applause]

Cool. Hey, everybody. I hope you enjoyed lunch and have been enjoying PyCon AU so far. My name is Anthony, and I'm going to be talking about AI performance. And when I say performance, I mean speed, not how good or bad it is at doing things. Before we get started, a little bit about me. I work at Microsoft, I'm a Python Software Foundation fellow, and I'm also a fellow at Macquarie Uni. People talk about a scale of opinion on AI, from "I've drunk all the Kool-Aid and I think this is going to be the greatest thing that's ever happened" to "I think it's the worst thing that's ever been created."
I'm sort of floating somewhere in the middle, and to be honest, I waver day by day between "this is a terrible idea" and "this is amazing and it's changing everything." AI is quite unpredictable. It's very, very difficult to work with for someone who's come from an engineering background, where you expect things to work the same way each time, and the way it surprises you day by day is unbelievably frustrating. And then sometimes you get these moments of joy in there as well, where it's just saved you a lot of time. You can follow me on socials; I'm on lots of different platforms. I've also written a book called CPython Internals, which is all about the internals of CPython and how it works, if you're interested in that. I also made this thing called VS Code Pets, which is an extension for VS Code where pets walk around. It turns out to be really popular and, frustratingly, is the most successful piece of software I've ever written.
It's actually just coming up to 2 million users now. Just yesterday, someone on the Windows team messaged me (I work at Microsoft) and asked, "How can we get this into Windows?" And I thought: the productivity of millions of people around the world is going to decrease, but it's going to look really good on my performance review, so it's a sacrifice I'm willing to make. But, yeah, anyway. Our agenda today: AI is slow; if you've used it, you will have realized that. We're going to look at some requirements in terms of performance and what your expectations are. I'm going to spend a big chunk of this talk on benchmarking, performance tools, and things like that, and how we can use them with AI. I've got a few tips to improve the speed of how you're calling AIs, and then some takeaways, so there'll be a nice slide at the end that you can just take a picture of, with all the bits you need to know. And my promises to you today:
Nothing in this talk is proprietary. I do work for Microsoft, but everything here is open source. Everything is free and available to download and use yourself; nothing has any weird licenses. And this is all based on experience as well. These are things that I've been doing as part of my day-to-day work, and I just wanted to share some of those lessons with you today. So this is an example I wanted to use. This is me on a train from Bordeaux to Paris. This train goes 300 km an hour, and I had two and a half hours, so a nice casual time to look out the window, and really not very much to do. We had Wi-Fi on the train as well, which is always dangerous, because instead of relaxing, you're like, "Oh, I can do things on the internet. That's more fun." So I wanted to just make a game and then play the game on the train.
And the game that I worked on was this, which is a Sudoku game, but instead of using one to nine, it's hexadecimal. So on the train from Bordeaux to Paris, I basically worked on this with the AI doing most of the heavy work, because, well, firstly, I was on holiday, so I shouldn't have been doing any work anyway. I just wanted to do something fun; there was no obligation. And it took ages. I mean, two and a half hours to make a game is impressive, but every time I'd ask the AI a question, it would go away for a couple of minutes, then come back and give me an answer. But I could look out the window and just, you know, watch France go past, which is lovely. But that's not always the case. In terms of user expectations: I've been working on web benchmarking for many, many years, and also on the performance of applications and code running on computers. And if I asked you, is your phone fast or slow?
It's a subjective question, but you know that when you use a system, like the menu on one of those smart TVs, it can feel laggy, a bit slow. This is because, over time using technology, you build up expectations. If you're using something on a local system, a local computer, your expectation is that within about 10 to 100 milliseconds of clicking or touching something, it responds. And if it starts to take longer than 100 milliseconds, it feels slow and it feels laggy. So if you've ever upgraded from one phone to the next model and it feels snappier, the difference can be as little as 10 milliseconds. If you're working on the internet, though, loading a website or doing a search online, your expectations shift slightly further to the right. A typical web search takes about 50 to 100 milliseconds.
That's actually got slower over time, not faster, because they've put more and more crap into the page results. But your expectations when you're clicking on and loading websites are roughly between 50 milliseconds and a second, and lots of data shows us that if a page takes a second to load, users do get bored and then find other things to do. So your expectations shift. When it comes to AI, it's even further to the right. You've learned, as you've used ChatGPT or something else, that if you type in a question, it will take a few seconds to respond. So there's this real shift in what you deem snappy and what you deem fast. And the point I want to make early on is that if you try to stick AI into things that are further down the left of this scale, local time and internet time, you've basically introduced lag, and all of a sudden things feel slow to users.
There's also another trajectory to this, which is Australian internet time, which also seems to be getting slower over time, somehow. So, when is slow bad? In this example, you've got a search box. I have seen this in products where they've got basic search functionality that used to work on a database: it would take out keywords and do a keyword search against the database. So users know that when they type into that box, it's going to look for keywords. And over the last 20 years of having web search, people have intuitively learned that when there's a search box, you don't write fully formed sentences; you just write keywords, and then you click search and it gives you back answers. Now, what has been very tempting for people to do over the last few years, where you've got an expectation that the box will respond pretty quickly, is to add AI to the search.
So where users would previously type in a question or a series of keywords, they introduce AI so that the AI can give you more customized or smarter answers, or whatever it is. The problem is that you've shifted from the user's expectation that it will respond in 100 to 500 milliseconds to it taking several seconds, and also from a keyword search to a sort of vague AI kind of search, whatever that is. This is where I've found that people have been really, really frustrated, because it feels slow to interact with these systems. As an example, there are two apps on my phone that have done this recently, made by some very large tech companies, where previously I would search for a post on socials or something like that, and now the AI tries to do extra things that I wasn't looking for, because really all I wanted to do was a keyword search. I didn't want a smart AI search. So, sometimes slow is actually okay. I use this other example where I have a question. It's not a search.
I have a research question, and I can click go, and it can come back and say, "Okay, I'm going to research your question for you using AI. I'm going to look online, I'm going to read some papers, and it's going to take a couple of hours." And so my expectations are set. It's pretty clear what's going to happen, and what I'm looking for involves a lot more information and a lot more searching, so I'm quite happy for the AI to take its time. But they're two different expectations. And so the first point I want to make on AI is that users have an expectation of how things respond when they run locally and when they run on the web, and, increasingly, they have an expectation of what happens when they interact with AIs. So, to summarize my usability points: don't put AI in the way of something which is otherwise fast and efficient, especially if it doesn't really add much value. Search is my main example, the one I keep bringing up, because AI almost never improves search functionality.
Search is a solved problem. It's very efficient, it scales really well, and we have plenty of algorithms to do search. AI can do customization of the answer that comes from the search, but don't put AI in the search itself. Next, don't wait for the entire result. I'll come back to this in a second, but basically, we stream results back from the AI so that the user sees it the way it looks in ChatGPT and other chat bots. It feels like the AI is typing back to you, but what's actually happening is that as it computes the answer, it streams it back to you; I'll come to some details on that in a minute. And don't use reasoning models (I'll cover what they are a bit later) unless you actually need reasoning, because reasoning models are all very, very slow today, and a lot of the time you don't actually need them. So, using AI to enrich features is great.
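As a minimal sketch of the "stream, don't wait" advice: chat UIs commonly deliver tokens to the browser as server-sent events, one "data:" line per chunk, so text appears as it is generated. Everything below, including the fake model output, is an illustrative stand-in rather than anything shown in the talk:

```python
def sse_events(chunks):
    """Format an iterable of text chunks as server-sent events (SSE).

    Each chunk becomes one 'data: ...' event, which a browser's
    EventSource (or a fetch reader) can render as it arrives, so the
    user starts reading instead of waiting for the full answer.
    """
    for chunk in chunks:
        yield f"data: {chunk}\n\n"   # a blank line terminates each event
    yield "data: [DONE]\n\n"         # conventional end-of-stream marker

# Stand-in for a model producing tokens incrementally.
fake_model_output = ["The capital", " of France", " is Paris."]

for event in sse_events(fake_model_output):
    print(event, end="")
```

A real application would wire this generator to a streaming HTTP response, but the framing is the same.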
Having the option to disable and dismiss AI features is also really important, because there are a lot of concerns about AI safety and privacy at the moment, and there are also cases where the usability actually gets worse because you've introduced AI. Giving users the ability to switch that off, I think, is critical. And stream back the results, as I mentioned. So, this QR code at the top: I've included a few QR codes in this talk, and only one of them is a Rickroll, but this one goes to some guidelines that we've published. Basically, if you're designing a system using AI, there is a set of guidelines that we have called the HAX guidelines: essentially UX guidelines for AI-enriched applications. I can't remember what it stands for, but some very smart people at Microsoft have sat down and said, in an ideal world, here are the 16 things that you should do if you've got AI in your application. And yes, not all of those things are done in Microsoft products either. They are guidelines. They're ambitious.
And I really wish that more product managers would try to keep to some of them especially. There are some really smart ideas in there, including, as in the keynote this morning, around the AI ethics challenges with biases and things like that. There are a lot of guidelines in there, and there are also a lot of lessons that we've learned over the years from putting bots online, which have caused all sorts of challenges. So there's a tool called PyRIT. It is basically a red-teaming tool for AIs: it tries to get them to produce all sorts of horrible things, and then it gives you a checklist to see whether you've actually put safety guards in place. So yes, we have learned some lessons. Now, when it comes to actually measuring performance: has anyone here done any web benchmarking, like web app benchmarking and things like that? Okay, a little bit.
Normally, what you do with web benchmarking is send thousands of requests to your application and measure how long it takes to get back the answer, and it's normally just the full response; then you'd also measure how long it takes for the browser to render the page and load up all the assets and so on. So that's how we measure web applications. AI is a bit different, because the way these models work is that they produce streams of tokens. So, take the question "What is the capital of France?", which has become the go-to test question for AIs. I don't know why; it just is. If you ask an AI what the capital of France is, it will send you back these tokens, and each token basically represents a whole word or a part of a word. Over the network, it actually sends those tokens back in separate packets, in separate HTTP response parts, called a stream. So you send it one request, and you don't get back one response; you actually get back several responses.
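This request-and-stream shape can be simulated end to end. A real client would iterate over an SDK's streaming response (for example, passing stream=True to the OpenAI Python client); the generator below just fakes a model that pauses to think and then emits chunks, so the timings are illustrative only:

```python
import time

def fake_stream():
    """Stand-in for a streamed AI response: a 'thinking' delay,
    then a few chunks arriving over the wire."""
    time.sleep(0.05)                # model thinking before the first token
    for chunk in ["The capital", " of France", " is Paris."]:
        time.sleep(0.01)            # per-chunk generation/network gap
        yield chunk

start = time.perf_counter()
time_to_first_chunk = None
chunks = []
for chunk in fake_stream():
    if time_to_first_chunk is None:
        # How long the model 'thought' before sending anything back.
        time_to_first_chunk = time.perf_counter() - start
    chunks.append(chunk)
total_time = time.perf_counter() - start

answer = "".join(chunks)
print(f"answer: {answer!r} ({len(answer)} characters, {len(chunks)} chunks)")
print(f"time to first chunk: {time_to_first_chunk:.3f}s")
print(f"total time: {total_time:.3f}s")
print(f"throughput: {len(chunks) / total_time:.1f} chunks/sec")
```

The same loop structure works for any streaming client: note the clock before sending, record it on the first chunk, and again when the stream ends.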
So, with pretty much every AI these days, you send a request and you get back a stream of data, and as you receive the stream, you draw it on the screen, you save it to disk, whatever it is. This basically means that you can give the user feedback really early on that information is coming back, and they can start to read the answer. So, in terms of how we measure this: the time between the moment you click go and send the request and the moment you get back the first token is called the time to first token. Tokens come in chunks, so I also call it the time to first chunk. That's a really important metric, because it basically says how long the model took to think about the answer before it started sending you back any information. Then you've got the total time, which is how long it takes for it to finish its answer. It's also important to note that sometimes, and I'm pretty sure we've all done this:
We've canceled an AI, stopped it, because it's just been waffling on for ages, and, like, I don't care, you got to the point ages ago. At least the good thing is you can stop the stream early on. There are some other metrics as well which are important, like the number of characters. So "The capital of France is Paris." is 31 characters; there were seven tokens and three chunks. And so I can work out the total time and how many chunks there were, and therefore the throughput, by dividing the two: what is the chunks per second, or the tokens per second, for that AI model? These things are important because they basically show you how fast an AI can write back to you, and there are huge differences between different AI models, which is what I want to spend some of this time looking into. So, on the right-hand side is a graph which was produced by a tool that I made, and I'll give you a demo of that and how it works. Basically, it will produce these four measures. One is total time.
These are all box-and-whisker charts. If you're not familiar with box and whiskers: the colored box is the interquartile range, which is basically the middle range of values, so how long did it take to answer, and what is the most common part of that. The black line in the middle of it is the median. Any little circles and things like that are outliers, and the whiskers show the spread outside the quartiles. Box-and-whisker charts are really helpful because you're looking at how thin the box is, so how consistently it gives you the same response in the same time; the bigger the box, the more variable it is. How big are the whiskers? So, how spread out are the answers? And then these little circles are where you've got anomalies in your data. Box-and-whisker graphs are great when you're running performance benchmarks, because you're looking for a nice small box with small whiskers, and you're looking for it to be consistent.
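Those box-plot terms can be computed directly with the standard library. As a sketch, with made-up response-time samples in seconds (the usual convention draws points beyond 1.5 times the interquartile range past the quartiles as outlier circles):

```python
import statistics

# Hypothetical total-time samples (seconds) from repeated benchmark runs.
times = [1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.3, 2.4, 6.0]

q1, median, q3 = statistics.quantiles(times, n=4)  # quartile cut points
iqr = q3 - q1                  # the colored box: interquartile range
lo = q1 - 1.5 * iqr            # lower whisker fence
hi = q3 + 1.5 * iqr            # upper whisker fence
outliers = [t for t in times if t < lo or t > hi]  # drawn as circles

print(f"median={median}, IQR={iqr:.2f}, outliers={outliers}")
```

Here the single 6.0-second run falls outside the fences, which is exactly the kind of anomaly the little circles on the charts flag.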
So basically, on this one, I ran a benchmark against four different models and asked them all exactly the same question. In terms of total time, model C, which is Llama 3.2, responded within a couple of seconds, whereas Phi-4 went all the way up to a minute. So that's a pretty big difference to answer exactly the same question. What's important here is that we've got the total time, which I covered on the last slide; the time to first chunk, which is how long it took to think about the answer; the length of response, where you'll notice there is an enormous difference between the models (and I'll talk about that in a bit); and the chunks per second. So what you can do with this tool is give it any range of models to test against, and you get this graph. If you've got a problem or a question and you want to figure out which is the best model to pick, it gives you a nice comparison. So let me show you that working.
Um, so the way this tool 477 00:19:12,559 --> 00:19:18,000 works is it's built into a command line 478 00:19:15,600 --> 00:19:20,640 interface called LLM. Has anybody used 479 00:19:18,000 --> 00:19:23,440 the LLM CLI? 480 00:19:20,640 --> 00:19:26,400 Okay, a couple of you. Um, so I really 481 00:19:23,440 --> 00:19:29,840 like the LLM CLI. It's a tool by Simon 482 00:19:26,400 --> 00:19:32,559 Willison. Um, basically you can do all 483 00:19:29,840 --> 00:19:34,320 sorts of cool stuff. Um, it's a command 484 00:19:32,559 --> 00:19:36,799 line interface for you to run locally. 485 00:19:34,320 --> 00:19:39,039 So you can do things like pipe stuff 486 00:19:36,799 --> 00:19:40,880 from your local disk directly into an 487 00:19:39,039 --> 00:19:43,760 LLM, so if you've got files and 488 00:19:40,880 --> 00:19:46,720 things and you want to extract data. Um 489 00:19:43,760 --> 00:19:48,960 it's also very extensible. So uh in 490 00:19:46,720 --> 00:19:51,679 terms of the models that you're using, 491 00:19:48,960 --> 00:19:53,840 um you can install plugins to talk to 492 00:19:51,679 --> 00:19:55,760 basically anything. So you can run 493 00:19:53,840 --> 00:19:58,400 models locally. Like I'm running lots of 494 00:19:55,760 --> 00:20:02,160 models on my laptop. Um you can connect 495 00:19:58,400 --> 00:20:04,960 it to openai.com, AWS Bedrock, Azure, 496 00:20:02,160 --> 00:20:07,600 GitHub Models. Basically like any model 497 00:20:04,960 --> 00:20:09,520 hosting platform, or one that you're 498 00:20:07,600 --> 00:20:11,679 running locally on your machine, or if 499 00:20:09,520 --> 00:20:14,559 you've got a server and a GPU cluster 500 00:20:11,679 --> 00:20:16,080 you can connect it to that. So uh with 501 00:20:14,559 --> 00:20:17,600 LLM, 502 00:20:16,080 --> 00:20:19,760 once you've got these plugins, you can 503 00:20:17,600 --> 00:20:21,840 run "llm models" and it will list what 504 00:20:19,760 --> 00:20:25,200 models you have available to you.
I have 505 00:20:21,840 --> 00:20:27,440 a lot because I do this as a job. Um, so 506 00:20:25,200 --> 00:20:29,600 I've got a lot available to me and I can 507 00:20:27,440 --> 00:20:32,559 basically kind of pick and choose which 508 00:20:29,600 --> 00:20:34,400 of these models I want to compare uh and 509 00:20:32,559 --> 00:20:36,000 see where they're hosted. So for 510 00:20:34,400 --> 00:20:39,280 example, I was doing a comparison 511 00:20:36,000 --> 00:20:41,919 between GPT-5 in Australia and one hosted 512 00:20:39,280 --> 00:20:43,760 in the US to see what the difference in 513 00:20:41,919 --> 00:20:47,200 performance was and what difference it 514 00:20:43,760 --> 00:20:49,840 made. Um, if you're running local models 515 00:20:47,200 --> 00:20:52,559 as well, um, whether that's using 516 00:20:49,840 --> 00:20:56,480 something like Ollama for example. Um so 517 00:20:52,559 --> 00:20:58,960 in Ollama you can download and run uh a 518 00:20:56,480 --> 00:21:01,280 number of smaller language models and 519 00:20:58,960 --> 00:21:03,679 also embedding models uh and you can 520 00:21:01,280 --> 00:21:06,320 benchmark those and test them as well. 521 00:21:03,679 --> 00:21:09,039 Uh you need a pretty decent laptop to do 522 00:21:06,320 --> 00:21:10,960 that. So I will caution you. Um also 523 00:21:09,039 --> 00:21:15,679 don't run it with the 524 00:21:10,960 --> 00:21:17,919 laptop on your lap. Um it gets 525 00:21:15,679 --> 00:21:19,760 uncomfortably warm. 526 00:21:17,919 --> 00:21:22,240 Um, 527 00:21:19,760 --> 00:21:25,280 so the benchmarking tool is basically a 528 00:21:22,240 --> 00:21:26,880 plugin for LLM called LLM profile and it 529 00:21:25,280 --> 00:21:29,440 gives you this extra command where you 530 00:21:26,880 --> 00:21:32,320 say LLM benchmark and then your prompt 531 00:21:29,440 --> 00:21:34,880 input, and then you list as many models 532 00:21:32,320 --> 00:21:37,039 as you feel like.
So they can be models 533 00:21:34,880 --> 00:21:38,480 online, they can be models locally, they 534 00:21:37,039 --> 00:21:41,039 can be models on different clouds, 535 00:21:38,480 --> 00:21:42,480 different locations, whatever. Um, and 536 00:21:41,039 --> 00:21:44,720 then you say how many times you want it 537 00:21:42,480 --> 00:21:46,720 to repeat the test, uh and then if you 538 00:21:44,720 --> 00:21:49,360 want it to produce one of those graphs, 539 00:21:46,720 --> 00:21:51,200 you just ask it for a graph, uh 540 00:21:49,360 --> 00:21:54,880 and it will give you back this sort of 541 00:21:51,200 --> 00:21:58,400 summary table of information. 542 00:21:54,880 --> 00:22:00,480 So it will run that test locally. Um if 543 00:21:58,400 --> 00:22:02,960 that is too simple for you, if you want 544 00:22:00,480 --> 00:22:05,679 to do some more complicated test 545 00:22:02,960 --> 00:22:07,440 scenarios, for example, you want to 546 00:22:05,679 --> 00:22:10,080 see what the difference is if the 547 00:22:07,440 --> 00:22:12,320 temperature is lower in one test but 548 00:22:10,080 --> 00:22:14,320 higher in another, or you want to test 549 00:22:12,320 --> 00:22:15,919 one model in one country and one in 550 00:22:14,320 --> 00:22:18,320 another, or even test two different 551 00:22:15,919 --> 00:22:22,400 prompts, um then you can do that by 552 00:22:18,320 --> 00:22:23,600 providing this YAML file um with 553 00:22:22,400 --> 00:22:25,919 different inputs and different 554 00:22:23,600 --> 00:22:28,240 questions. So like the go-to question of 555 00:22:25,919 --> 00:22:29,760 what is the capital of France? Um we can 556 00:22:28,240 --> 00:22:33,679 compare that with what is the capital of 557 00:22:29,760 --> 00:22:35,360 Azerbaijan?
Um and the differences in 558 00:22:33,679 --> 00:22:37,760 length, for example, and the differences 559 00:22:35,360 --> 00:22:39,440 in thinking time are actually quite 560 00:22:37,760 --> 00:22:41,039 different between the models for 561 00:22:39,440 --> 00:22:43,200 those questions. Even though the 562 00:22:41,039 --> 00:22:45,760 questions semantically are the same, like 563 00:22:43,200 --> 00:22:47,440 what is the capital city of this country 564 00:22:45,760 --> 00:22:50,000 uh should be a simple, straightforward 565 00:22:47,440 --> 00:22:52,080 one-sentence answer. Um however in some 566 00:22:50,000 --> 00:22:55,039 cases it will write you multiple 567 00:22:52,080 --> 00:22:58,159 paragraphs of travel tips uh about 568 00:22:55,039 --> 00:23:00,799 Azerbaijan, including the answer. Does anyone 569 00:22:58,159 --> 00:23:04,240 know the answer? 570 00:23:00,799 --> 00:23:08,159 No Formula 1 fans in the room. Baku, 571 00:23:04,240 --> 00:23:11,159 thank you. Um that is next weekend. 572 00:23:08,159 --> 00:23:11,159 Okay. 573 00:23:11,600 --> 00:23:15,760 All right. So um once you've done this 574 00:23:13,919 --> 00:23:17,200 and you've got your graph, uh you can 575 00:23:15,760 --> 00:23:19,440 see the differences in performance 576 00:23:17,200 --> 00:23:21,280 between the models. So let's get on to 577 00:23:19,440 --> 00:23:25,120 uh how do we actually make 578 00:23:21,280 --> 00:23:28,480 it faster? Uh my number one tip uh is to 579 00:23:25,120 --> 00:23:30,559 look for shorter answers. Um there was a 580 00:23:28,480 --> 00:23:32,400 talk yesterday actually, Jack's 581 00:23:30,559 --> 00:23:36,080 talk, where he talked about how Charles Dickens 582 00:23:32,400 --> 00:23:38,559 was paid by the word. Um which is why 583 00:23:36,080 --> 00:23:41,840 the opening part of A Tale of Two Cities is 584 00:23:38,559 --> 00:23:44,400 so ridiculously long.
Um, these LLM 585 00:23:41,840 --> 00:23:46,960 models are rewarded in their training 586 00:23:44,400 --> 00:23:49,760 for their ability to solve problems, 587 00:23:46,960 --> 00:23:52,240 answer questions, uh, and solve coding 588 00:23:49,760 --> 00:23:53,760 challenges. There is no penalty for the 589 00:23:52,240 --> 00:23:56,720 length of the answer for them to achieve 590 00:23:53,760 --> 00:23:59,200 that. Therefore, they are also paid by 591 00:23:56,720 --> 00:24:02,640 the word. Um, the way that you pay them 592 00:23:59,200 --> 00:24:05,039 is also by the token. So, LLMs will give 593 00:24:02,640 --> 00:24:06,799 you really, really long answers when you 594 00:24:05,039 --> 00:24:09,120 don't ask for them. So, what is the 595 00:24:06,799 --> 00:24:12,720 capital of Azerbaijan? The capital is 596 00:24:09,120 --> 00:24:15,039 Baku. Everyone knows that. Um, 597 00:24:12,720 --> 00:24:17,279 uh, this one here, instead of 598 00:24:15,039 --> 00:24:19,600 taking one second, took 100 seconds 599 00:24:17,279 --> 00:24:23,279 and it wrote four paragraphs about Baku 600 00:24:19,600 --> 00:24:24,960 and its cultural history. Um, and it 601 00:24:23,279 --> 00:24:26,559 also did not mention the Formula 1, 602 00:24:24,960 --> 00:24:30,159 which is ridiculous, like why would you 603 00:24:26,559 --> 00:24:32,559 not? Um, the total time and also the 604 00:24:30,159 --> 00:24:35,039 time to first chunk was significantly 605 00:24:32,559 --> 00:24:36,480 higher for that as well. So my first 606 00:24:35,039 --> 00:24:39,919 kind of suggestion is that you should 607 00:24:36,480 --> 00:24:42,400 get shorter responses. You can 608 00:24:39,919 --> 00:24:46,159 do that either by using a smaller model, 609 00:24:42,400 --> 00:24:49,120 or you can say in your prompt I want a 610 00:24:46,159 --> 00:24:52,159 single sentence, or I want two to three 611 00:24:49,120 --> 00:24:54,000 sentences, or I want three answers.
You 612 00:24:52,159 --> 00:24:56,080 basically set the expectation to the 613 00:24:54,000 --> 00:24:57,679 model as to the length of the response. If 614 00:24:56,080 --> 00:25:00,240 you don't do that, if it's completely 615 00:24:57,679 --> 00:25:02,080 unbounded, they will tend to, especially 616 00:25:00,240 --> 00:25:03,679 the bigger models, they will tend to 617 00:25:02,080 --> 00:25:05,120 produce way more output than you 618 00:25:03,679 --> 00:25:07,760 actually need and it will take a lot 619 00:25:05,120 --> 00:25:09,600 longer. It'll also cost you more money. 620 00:25:07,760 --> 00:25:12,400 Oops. Uh the second thing is 621 00:25:09,600 --> 00:25:15,919 distillation. So this is basically, uh, we 622 00:25:12,400 --> 00:25:17,279 don't have time to cover it, but um 623 00:25:15,919 --> 00:25:21,039 what they've done is they've condensed 624 00:25:17,279 --> 00:25:22,480 models down, um so that you've taken all 625 00:25:21,039 --> 00:25:24,159 the weights, so the number of 626 00:25:22,480 --> 00:25:27,120 parameters of the model, and you've 627 00:25:24,159 --> 00:25:29,360 reduced them. Basically uh it's called 628 00:25:27,120 --> 00:25:32,000 matryoshka, like those dolls you get where 629 00:25:29,360 --> 00:25:33,919 one sits inside another. Um and the 630 00:25:32,000 --> 00:25:35,760 reason they've used that analogy is 631 00:25:33,919 --> 00:25:37,679 because you've basically taken the 632 00:25:35,760 --> 00:25:40,240 answer from a big model and you've 633 00:25:37,679 --> 00:25:42,159 stuffed it into a smaller one. So 634 00:25:40,240 --> 00:25:44,159 instead of it having to know about every 635 00:25:42,159 --> 00:25:45,520 single capital city and how to think 636 00:25:44,159 --> 00:25:47,039 about it and stuff like that, it's kind 637 00:25:45,520 --> 00:25:49,679 of cheated. It's got like a lookup table 638 00:25:47,039 --> 00:25:53,279 and it's saved the information.
So 639 00:25:49,679 --> 00:25:57,039 like that's a very brief summary. This 640 00:25:53,279 --> 00:25:59,919 means that you can run a model with the 641 00:25:57,039 --> 00:26:02,159 same kind of corpus as a much bigger 642 00:25:59,919 --> 00:26:05,200 model on smaller hardware and it's also 643 00:26:02,159 --> 00:26:07,679 a lot faster. So when you see in Ollama 644 00:26:05,200 --> 00:26:09,919 you've got these small models, uh like 7 645 00:26:07,679 --> 00:26:12,400 billion, 8 billion parameter models, it's basically 646 00:26:09,919 --> 00:26:13,520 the full model, but they've asked it lots 647 00:26:12,400 --> 00:26:15,279 and lots of questions and they've 648 00:26:13,520 --> 00:26:16,960 condensed it down into a smaller 649 00:26:15,279 --> 00:26:19,039 one. 650 00:26:16,960 --> 00:26:22,799 Um you can see this in this benchmark. 651 00:26:19,039 --> 00:26:25,760 This is the Qwen 3 series model, 652 00:26:22,799 --> 00:26:27,919 and basically I've taken a larger one, 653 00:26:25,760 --> 00:26:30,400 which is 8 billion parameters, and I've 654 00:26:27,919 --> 00:26:32,480 taken the smallest one, which is 0.6 billion. 655 00:26:30,400 --> 00:26:34,640 And you'll see that the time it takes to 656 00:26:32,480 --> 00:26:36,080 give me an answer 657 00:26:34,640 --> 00:26:37,919 to the same question, 658 00:26:36,080 --> 00:26:42,000 which is can you come up with some names 659 00:26:37,919 --> 00:26:44,240 for my pet octopus, um, takes a lot longer 660 00:26:42,000 --> 00:26:46,799 for an 8 billion parameter model than it 661 00:26:44,240 --> 00:26:50,080 does for a 600 million one. Um also the 662 00:26:46,799 --> 00:26:52,880 length of response uh goes up and then 663 00:26:50,080 --> 00:26:55,279 weirdly enough it comes back down again 664 00:26:52,880 --> 00:26:57,760 between the 4 billion and the 8 billion versions 665 00:26:55,279 --> 00:26:59,760 of the model.
I don't know why, but it gives 666 00:26:57,760 --> 00:27:02,000 you shorter answers for eight than it 667 00:26:59,760 --> 00:27:04,000 does for four, which is why you should 668 00:27:02,000 --> 00:27:04,880 test things. Um, because you have 669 00:27:04,000 --> 00:27:06,960 assumptions and you have 670 00:27:04,880 --> 00:27:08,240 generalizations, but when you use tools 671 00:27:06,960 --> 00:27:10,320 like this, you can actually come to 672 00:27:08,240 --> 00:27:12,720 different conclusions. Uh, the next 673 00:27:10,320 --> 00:27:15,919 point is quantization, which is a really 674 00:27:12,720 --> 00:27:17,440 cool word to say. Um, and it's something 675 00:27:15,919 --> 00:27:19,440 you can just drop into conversations all 676 00:27:17,440 --> 00:27:22,880 the time. 677 00:27:19,440 --> 00:27:25,200 Um, and basically the theory 678 00:27:22,880 --> 00:27:29,200 is fairly simple. If you have a range of 679 00:27:25,200 --> 00:27:31,279 values on a scale, um you can basically 680 00:27:29,200 --> 00:27:34,480 take the smallest one and the biggest one 681 00:27:31,279 --> 00:27:36,320 and say okay, the smallest one is now the 682 00:27:34,480 --> 00:27:40,320 smallest point on a different 683 00:27:36,320 --> 00:27:44,480 scale. So if we use a scale that's int8, 684 00:27:40,320 --> 00:27:46,720 uh which is 256 different values, um so 685 00:27:44,480 --> 00:27:49,120 basically the smallest one is now minus 686 00:27:46,720 --> 00:27:50,480 128 and the biggest one is now plus 127. And 687 00:27:49,120 --> 00:27:53,919 basically what we're doing is taking 688 00:27:50,480 --> 00:27:56,559 data from a 32-bit or 64-bit value into 689 00:27:53,919 --> 00:27:59,200 a much smaller format. The reason you do 690 00:27:56,559 --> 00:28:01,919 that is because CPUs and GPUs can do 691 00:27:59,200 --> 00:28:04,159 that calculation much, much faster. So 692 00:28:01,919 --> 00:28:05,360 like four to eight times faster.
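That min-to-max remapping is just an affine transform. A toy sketch of int8 quantization (real runtimes do this per-tensor or per-channel with far more care; the example weights and the guard for a flat range are my additions):

```python
def quantize_int8(values):
    """Affine-map floats onto the 256 int8 codes [-128, 127], then invert."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # 256 codes -> 255 steps; guard a flat range
    codes = [round((v - lo) / scale) - 128 for v in values]     # int8 codes
    approx = [(c + 128) * scale + lo for c in codes]            # dequantized
    return codes, approx

weights = [-0.9, -0.1, 0.0, 0.3, 1.2]
codes, approx = quantize_int8(weights)
print(codes)  # the minimum maps to -128, the maximum to 127
```

Each dequantized value is within half a step of the original, which is why accuracy usually survives the squeeze while the arithmetic gets much cheaper.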
So 693 00:28:04,159 --> 00:28:08,080 basically what we're doing is we're 694 00:28:05,360 --> 00:28:09,840 approximating the data inside the 695 00:28:08,080 --> 00:28:12,399 model, like the weights, into 696 00:28:09,840 --> 00:28:14,799 different values. Do you need 697 00:28:12,399 --> 00:28:18,159 to know any of this? No, not really. Uh 698 00:28:14,799 --> 00:28:20,000 all you need to know is that um by using 699 00:28:18,159 --> 00:28:22,640 this technique you can run things on 700 00:28:20,000 --> 00:28:24,960 smaller hardware and it will run faster. 701 00:28:22,640 --> 00:28:26,320 So when you hear quantization, or you see 702 00:28:24,960 --> 00:28:28,480 there's different flavors of a model 703 00:28:26,320 --> 00:28:30,960 that you can pull, um you can use a 704 00:28:28,480 --> 00:28:33,120 smaller model to basically get similar 705 00:28:30,960 --> 00:28:36,000 answers and similar accuracy uh but 706 00:28:33,120 --> 00:28:37,120 with much smaller hardware. Uh the other 707 00:28:36,000 --> 00:28:38,960 thing you can do is something called 708 00:28:37,120 --> 00:28:42,159 semantic caching. This is a really 709 00:28:38,960 --> 00:28:45,840 really new technique. Um 710 00:28:42,159 --> 00:28:48,240 in uh Python there is a great function 711 00:28:45,840 --> 00:28:50,720 built into functools called lru_cache. You can use it 712 00:28:48,240 --> 00:28:52,159 as a decorator on a Python function. It 713 00:28:50,720 --> 00:28:54,799 basically means if you call the function 714 00:28:52,159 --> 00:28:56,399 uh with the same arguments it 715 00:28:54,799 --> 00:28:59,279 will cache the result and give you back 716 00:28:56,399 --> 00:29:01,279 the result from the first time.
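A minimal example of functools.lru_cache doing exactly that (the slow function here is a stand-in for an expensive model call, with a made-up sleep for latency):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)
def slow_answer(question):
    """Stand-in for an expensive LLM call; results are cached per exact string."""
    time.sleep(0.05)  # simulated model latency
    return f"answer to: {question}"

t0 = time.perf_counter()
slow_answer("What is the capital of France?")   # does the slow work
first = time.perf_counter() - t0

t0 = time.perf_counter()
slow_answer("What is the capital of France?")   # identical string: cache hit
second = time.perf_counter() - t0

slow_answer("What is the capital city of France?")  # reworded: cache miss
print(slow_answer.cache_info())  # hits=1, misses=2
```

The second call returns almost instantly, but note the last line: the reworded question is a miss, because lru_cache only compares arguments for exact equality.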
Um you should use 717 00:28:59,279 --> 00:29:03,120 lru_cache a lot where you have a function 718 00:29:01,279 --> 00:29:05,039 that takes a long time to execute and 719 00:29:03,120 --> 00:29:07,200 you just want to cache the responses 720 00:29:05,039 --> 00:29:09,919 into memory. So I really recommend using 721 00:29:07,200 --> 00:29:12,559 lru_cache. The problem is that it is very 722 00:29:09,919 --> 00:29:14,720 specific. Um so if we ask more or less 723 00:29:12,559 --> 00:29:16,320 the same question in different words, it's just going 724 00:29:14,720 --> 00:29:19,760 to compare the strings, and it won't see them as 725 00:29:16,320 --> 00:29:21,360 the same thing. Um so in the QR code on 726 00:29:19,760 --> 00:29:23,840 the bottom right hand corner, that one 727 00:29:21,360 --> 00:29:26,480 is not a Rick roll. It is a link to a 728 00:29:23,840 --> 00:29:27,919 repo that I put together for this talk 729 00:29:26,480 --> 00:29:30,640 with a demo of something called a 730 00:29:27,919 --> 00:29:34,720 semantic cache. It is a function 731 00:29:30,640 --> 00:29:37,600 decorator. Um and instead of using 732 00:29:34,720 --> 00:29:40,320 literal comparisons, it does an 733 00:29:37,600 --> 00:29:42,000 embedding model similarity, and you can 734 00:29:40,320 --> 00:29:44,480 tell it at what threshold you 735 00:29:42,000 --> 00:29:46,000 want to match the inputs. So basically 736 00:29:44,480 --> 00:29:48,000 if you had the question that I had in 737 00:29:46,000 --> 00:29:49,360 the last slide, which is uh what is the 738 00:29:48,000 --> 00:29:51,039 capital of France and what is the 739 00:29:49,360 --> 00:29:52,960 capital city of France, it would see 740 00:29:51,039 --> 00:29:56,080 those as being very similar questions 741 00:29:52,960 --> 00:29:58,720 and actually cache the answer.
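The repo behind the QR code has the real implementation; this is only a toy sketch of the idea, with a bag-of-words cosine standing in for a proper embedding model and a made-up 0.8 similarity threshold:

```python
import math
from collections import Counter
from functools import wraps

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def semantic_cache(threshold=0.8):
    """Cache hits on *similar* prompts, not just identical strings."""
    def decorator(fn):
        store = []  # list of (embedding, result) pairs
        @wraps(fn)
        def wrapper(prompt):
            emb = toy_embed(prompt)
            for cached_emb, result in store:
                if cosine(emb, cached_emb) >= threshold:
                    return result  # close enough: reuse the old answer
            result = fn(prompt)
            store.append((emb, result))
            return result
        return wrapper
    return decorator

calls = []

@semantic_cache(threshold=0.8)
def ask_llm(prompt):
    calls.append(prompt)  # track how often the "model" is really called
    return "Paris"

ask_llm("What is the capital of France?")
ask_llm("What is the capital city of France?")  # similar: served from cache
print(len(calls))  # only one real model call happened
```

The two phrasings share almost all their words, so their cosine similarity clears the threshold and the second call never reaches the model; lru_cache would have called it twice.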
Um this 742 00:29:56,080 --> 00:30:00,240 is quite a new technique and we're still 743 00:29:58,720 --> 00:30:02,880 kind of working out some details on 744 00:30:00,240 --> 00:30:05,200 this. Um but it really gives you the 745 00:30:02,880 --> 00:30:06,720 ability to improve performance uh where 746 00:30:05,200 --> 00:30:10,120 you've got users asking very, very 747 00:30:06,720 --> 00:30:10,120 similar questions. 748 00:30:11,200 --> 00:30:14,960 Okay. The next technique is something 749 00:30:12,640 --> 00:30:17,360 called model routing. Uh this is where 750 00:30:14,960 --> 00:30:18,799 you have an input uh and you're 751 00:30:17,360 --> 00:30:21,039 basically trying to direct it to the 752 00:30:18,799 --> 00:30:24,399 cheapest and fastest model. There are 753 00:30:21,039 --> 00:30:27,440 lots of challenges with this approach, um, 754 00:30:24,399 --> 00:30:29,760 as has been revealed with GPT-5, 755 00:30:27,440 --> 00:30:31,279 which has this built in, and I don't 756 00:30:29,760 --> 00:30:32,320 think people realized they were using it, 757 00:30:31,279 --> 00:30:35,360 and that's why there's been this 758 00:30:32,320 --> 00:30:36,880 mismatch in expectations with GPT-5. But 759 00:30:35,360 --> 00:30:39,440 basically you get a user question and 760 00:30:36,880 --> 00:30:42,399 you say, is this a simple question? If so, 761 00:30:39,440 --> 00:30:44,720 give it to a cheap model. Uh if it's not, 762 00:30:42,399 --> 00:30:46,720 then give it to a chat model, and then 763 00:30:44,720 --> 00:30:49,120 otherwise give it to a reasoning model. 764 00:30:46,720 --> 00:30:51,120 Uh we have effectively run out of time 765 00:30:49,120 --> 00:30:52,880 so I'm going to summarize the talk. Uh 766 00:30:51,120 --> 00:30:54,399 one, smaller models are faster. The 767 00:30:52,880 --> 00:30:56,640 bigger the model, the longer the 768 00:30:54,399 --> 00:30:58,159 response. Models are paid by the word.
769 00:30:56,640 --> 00:31:01,200 Um so keep that in mind when you're 770 00:30:58,159 --> 00:31:03,760 using them. Longer responses take more 771 00:31:01,200 --> 00:31:06,640 time. Uh and users have preset 772 00:31:03,760 --> 00:31:08,640 expectations on responsiveness as well. 773 00:31:06,640 --> 00:31:10,960 So um yeah, thank you very much for your 774 00:31:08,640 --> 00:31:13,919 time. My name is Anthony, and I'll leave 775 00:31:10,960 --> 00:31:17,520 the last slide up. There we go. All 776 00:31:13,919 --> 00:31:17,520 right. Thanks very much.