Hello, welcome back. Next up on the data science and analytics track here at PyCon Australia 2021, coming from London, is Eyal Kazin, an ex-cosmologist turned data scientist with over ten years' experience in machine learning, statistical inference, and data insight visualizations.

Eyal's claim to fame is having lived on four different continents within the span of a decade, including three tennis Grand Slam cities: New York, where he obtained his PhD in astrophysics; Melbourne, where he did a postdoc stint at Swinburne; and London, where he now resides.

Eyal's topic today is one that has a direct relationship to real life: how do we optimize for multiple objectives, especially when they are in conflict? For example, how can one best overcome the classic trade-off between quality and cost? In this talk, Eyal will introduce us to multi-objective optimization using Pareto fronts for the data-driven decision makers in the audience, which I honestly hope is all of
you. We'll hear about the advantages and shortcomings of the technique and be able to assess its applicability for our own projects. Welcome back to Melbourne! Everybody: Eyal Kazin, and "Improved Decision Making with Pareto Fronts."

Hi, good morning. Hi, good morning. I just want to mention that there is a technical difficulty and I cannot hear anything, so Javier, if you have anything to say, please put it in the chat for the moment and we'll just have to communicate that way; I hope that's fine. I'll give you one second to give me that feedback, otherwise I'll just start. Okay, you can hear me, that's great. Okay, great. So hi everyone, hello from London. My name is Eyal, I'm a senior data scientist at Babylon, and I'm very happy to be here at this conference to talk about optimizing with Pareto fronts. Javier, is there anything else I should mention, or should I just start?

Okay, all right, great, so I'll start. We're now living in an era of data-driven
decision-making, and an interesting challenge that always arises is: how do we make decisions when dealing with multiple parameters? Normally there are trade-off decisions to be made, which makes this a very challenging procedure. What sort of trade-offs do I mean? Well, you can think of the classic trade-off between price and quality, which we experience not only in our work lives but also in our everyday lives. Sometimes the relationship between quality and price is not that clear, and so we have to make a trade-off decision. This also manifests in just about any industry you can think of; you can probably imagine different considerations you have to make in your own industry. For example, in the case of home food delivery services: on the one hand they want to optimize the experience for the people ordering food at home, but they also have to optimize for the kitchens as well as the delivery personnel. You cannot focus on one and neglect the others. I
got exposed to this in the field of drug discovery. I worked in a biotech lab with protein engineers, and what I learned there is that, in order for a protein to be considered a viable drug for our health, it has to rigorously pass many tests. Here I'm just listing a few characterizations that a protein has to satisfy, and it takes only one failure for a drug to fail overall. Before the days of COVID, these sorts of processes could take up to a decade, and they are very expensive, costing millions, even billions, of dollars. So if something like toxicity doesn't pass, you want to learn about it as soon as possible, and hence the need to optimize for multiple parameters simultaneously. Normally, when people think about optimizing for two or more parameters or objectives, what they do is not multi-objective optimization: they combine everything into one heuristic, which in the literature is actually called single-objective optimization. So one thing you'll take away from this talk is the
limitations of this common practice, how they are resolved by a technique called Pareto fronts, and why it yields better solutions. Then I'll talk about applicability, and for those who are interested I'll provide free online material where you can gain hands-on experience to really master this topic. This is for you if you use data to make decisions and you're interested in improving your optimization skills. There's no real maths background needed, and your Python knowledge can be fairly basic.

So, just a bit about myself. Thank you for the introduction, Javier. I currently work as a data scientist in health tech. As Javier said, I spent the end of my extensive academic career at Swinburne University in Melbourne, Australia, so I'm very glad to be here at a conference in Australia. Most relevant for this talk are the two years I spent at LabGenius, a biotech company, where I learned the trade of multi-objective optimization in the context of drug discovery. So I'll show a use case from
there. This is the agenda I prepared for today: we've already covered the motivation; next we'll talk about the basics of the concept of Pareto fronts and why they're useful; then I'll give a real-world example with an emphasis on applicability, on whether this is relevant for the projects you're working on. For this last part, if there isn't time, you'll have online material in which you'll be able to use something called genetic algorithms, for those interested in applying this in practice.

So with that, I can start the talk. Optimization means different things to different people. For practitioners, I like this definition by Professor Deb: a procedure of comparing feasible solutions until no better one can be found, either practically or just in terms of resources. You know a topic is interesting if the creator of xkcd has a cartoon about it, so I like this one here, which is
useful to understand the concepts of multi-objective optimization. What you see here is that the creator, Randall, has subjectively plotted how tasty he finds various fruits, along with how easy they are to eat. So imagine you're an office mate of his and you want to treat him to the fruit he finds tastiest; of course, you're going to hand him a peach. But let's say he's leaving the office and you want to give him the fruit that's easiest to eat on the go; then you'll give him seedless grapes. But then the question arises, and this I actually want you to answer in the chat: which is the optimal fruit if you want to optimize both for taste and for ease? Just take a few seconds to think about it and pop into the chat which one fruit you think would be optimal, which one you should give him here, both for ease and taste. I'm not seeing anybody write in the chat yet, so I'll give you a few more seconds; I was told there might be a
delay in some countries. So remember, you want to optimize for both easy and tasty. Okay. Chris, you say bananas; I'm not sure about that. I see one for peaches, okay. While you're thinking about it, I'll just give you the answer: it was actually a trick question. There's no one fruit that's optimal; rather, we have to consider a set of optimal solutions, called an optimal front or Pareto front. The purpose of this talk is to have you understand why these are all considered equally optimal and how we can make use of that when doing multi-objective optimization.

So what does optimizing look like? Imagine we have a parameter space with two objectives; very creatively, I call them objective one and objective two. Imagine, as an analogy, that this is a dark room: you know nothing about these objectives. They could be quality and price, or they could be
characteristics of a protein, or, in the home delivery service example, the experience of the people who ordered food versus that of the people preparing the food in the kitchen: any two parameters you can imagine; you don't know what to expect. So what does optimization look like? What's common practice, what I refer to as single-objective optimization, is that a lot of analysts just do some sort of combination, a multiplication for example. What they're doing is tunneling: with a flashlight, they go through this one tunnel of the space, optimizing in that single direction, and then they conclude that this solution is the most optimal. Of course there are different recipes: you can add the objectives, or take the mean, when you're maximizing both; or, if you're maximizing one and minimizing the other, you can take the difference, or divide. Okay. When I was working in the
biology lab, I learned that some processes use what I call linear, one-dimensional optimization: they choose one objective, maximize in that direction, and then turn to the second one, and then they say, oh, this is optimal. But that was a subjective choice. Why did they take objective one first? They could just as easily have taken objective two, maximized in that direction and then the other, and said, well, this is optimal. Right: three different methods, three different choices. So imagine again that this is using a flashlight in a dark room. What do you really want to do? You want to turn on the light in the room and absorb the full solution space. So here I'm highlighting the three solutions we chose as "optimal" in this highly curated distribution; but if we had the lights turned on, then most likely, just looking at it subjectively, we'd say this one is
more likely to be optimal than the rest of them. That was very subjective, but there is a whole set of others which, with some mathematical rigor, we can say are all considered equally optimal: they're called Pareto front solutions. And you can see that, of the three in red, this one is not in the optimal set. Why is that? Well, that's the purpose of the next slides we'll get to.

But before that, we have to talk about subjectivity. I've mentioned the word subjectivity a few times; why is it important? Well, I like to think of a quote which says, in other words, that if you truly believe you understand something without exploring it, you're just fooling yourself. That's my takeaway from the statement. What does that have to do with what we're talking about? The fact that in any analysis, as objective as we want to be, we're going to make a subjective decision at some stage. The question is when we're going to make it. So
when I talk about single-objective optimization, normally we're creating a heuristic before we actually do the search. The purpose of this talk is to convey that with the Pareto front method you actually want to hold off on that decision, stay open to whatever comes, and make the decision only after you start exploring the data. That's what we'll see in the next few slides; that's the importance of subjectivity. And when you think about it, what do I mean by analysts making subjective decisions in general? Well, any time you have a distribution and you want to quote it as a single number, do you choose the mean, the mode, or the median? If you have a perfectly symmetric bell curve, they're all the same and it doesn't matter, but most real-life data is skewed, so the choice matters; maybe the data is bimodal, right? So you have to learn the data and then make a subjective decision, and good analysts make sure it's a
sound decision that can be justified. The same goes when you decide the bins of a histogram, and when you decide whether to present data on a linear scale or a log scale, right? These are subjective decisions that we make, and it all depends on context. The same applies here: when you decide what's optimal, first look at your parameter space and then make that decision.

So with that in mind, we're ready to define what a Pareto front is. Beforehand, in order to define it, when we look at a scatter plot we have to differentiate between two types of solutions. These are two dimensions of objective space, and we have to classify every solution as either what we call dominated or non-dominated, where non-dominated is also called Pareto optimal. What do we mean by that? Let's learn by example. Take this solution K over here: it's classified as dominated because there is at least one other
solution that has better performance in both objective one and objective two, assuming we want to maximize both; in this case, for example, we have N and E. The reason N is also called dominated is that E dominates N in both objective one and objective two. The reason E is considered non-dominated is that nothing over here dominates it in both objective one and objective two. You can say, well, F dominates it in objective one, in this direction, but it doesn't dominate it in objective two, in the vertical direction. The same thing, but the opposite, holds for D: D performs better in objective two but not in objective one. That's why E is considered non-dominated, and hence what we call Pareto optimal, and the same goes for all of these solutions, A through H. So, again, the definition of a Pareto front is a set of non-dominated solutions.

Then you can ask the question: well, which one is optimal? That's where the
subjectivity comes in. We haven't made that subjective decision yet, and they're all still considered equally optimal. Once we have this view of the space, then you, or the domain expert, create a subjective ranking: you can either focus on these over here, or maybe you're more interested in the ones over here; it really depends on context, as we'll see. Another thing worth highlighting, and a nice property of this approach, is that we do not care about the units or the scale. Values here could be anywhere from milligrams to kilograms; it doesn't matter, because mathematically you'll get exactly the same split between dominated and non-dominated.

So that's what Pareto fronts are. Before we look at applicability, we'll just summarize. A Pareto front is a set of trade-off, non-dominated solutions, and these should all be considered equally
optimal. Throughout this talk I'll show two-dimensional spaces, but this is also relevant for any n-dimensional space; it extrapolates to many dimensions. And remember, in the context of subjectivity, we don't impose any prior constraints before looking at the distributions. If there are real limitations, of course you put them in, but if there's no good reason for constraints, you don't use them. The reason Pareto optimization, as you'll see, is better than single-objective optimization is that you're really getting a broad, bird's-eye view of the solution space, as opposed to putting on horse blinders.

Okay, so with that, in order to practice this concept I created an app, and here's the link, so feel free to follow the link while I quickly demonstrate it. Cool. I call it Pareto because, like a guessing game, the objective is to look at a distribution and try to guess
which solutions are actually Pareto optimal. So, Javier, I'll ask you to give me a number in the chat; this will be the seed of a distribution that I'm not aware of and cannot see. 705? Okay, so I'll put in 705; thank you, very thoughtful of you. Okay, so this is a new distribution I have not seen before. And I want to maximize; let me make sure I'm doing max-max, because that's easiest to understand. You can do any combination, but we'll do max-max in this case. What I'm asked is: how many of the 50 solutions are Pareto optimal? My reading, and this is the intuition you'll gain with time, which is the purpose of this sort of app, is that it's one, two, three. Hopefully I'm right; I don't always get it right, but I think it's three, so I answer that here, and in this case I'm correct. Okay, Javier, give me another number quickly if you have one. I just want to see if I can... Five. Okay, so that's a five. Okay, five: a new
OK, that one's a bit easy — it's two. But I don't always get it right; that's the point I wanted to make. So feel free to play around with the app, just to get a feel for the topic — you'll see the link again, and maybe I'll put it in the chat as well.

OK, we're nearly at the point where I can talk about the real world — let me share my screen. But beforehand I have to talk about two different spaces. We've talked about objective space; now there's something called decision space. So far we've been focusing on Pareto fronts in objective space and how to optimize over them, but there's this other set of parameters, called the decision space. What do I mean by that?
The objective space is what we're interested in — price versus quality, say; that's what we care about at the end of the pipeline, but we don't have direct control over it. Then there are the parameters — think of them as knobs. Those are the things we actually turn; they're what we can make decisions about. I'll give a few toy examples and a few real-life examples. A classic toy example is the knapsack problem: you have a lot of boxes, each with a value and a weight, and you have two objectives. Packing them into a knapsack, you want to maximize the total value of the packages while minimizing the total weight. My axes here aren't perfect, but just imagine making these combinations — that's your decision space — and then you have your objective space.
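As a minimal sketch of that setup (the box values and weights here are made-up numbers), every subset of boxes is one point in decision space, and summing values and weights maps it into objective space:

```python
from itertools import combinations

# Hypothetical boxes as (value, weight) pairs -- made-up numbers.
boxes = [(10, 4), (7, 2), (5, 5), (12, 9), (3, 1)]

def to_objective_space(subset):
    """Map a decision (which boxes to pack) to objective space:
    (total value to maximize, total weight to minimize)."""
    return (sum(v for v, _ in subset), sum(w for _, w in subset))

# Decision space: all 2**5 = 32 possible subsets of the boxes.
decision_space = [s for r in range(len(boxes) + 1)
                  for s in combinations(boxes, r)]
objective_space = [to_objective_space(s) for s in decision_space]
print(len(decision_space), "decisions mapped into objective space")
```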
Again, it's value versus weight. What's also important to emphasize is that you need a mapping between the two spaces. Here the mapping is trivial: you just weigh the boxes, or add up their monetary value — that's the mapping from decision space to objective space. It's a very useful toy example, and I'll show you later how you can learn it in more detail, but I want to mention that I use it as a perfect analogy for my work in therapeutic discovery. When I was working with protein engineers, they had full control over the DNA sequences — things they can easily order from a lab and construct in their own setting — and I was helping them decide which DNA sequence best optimizes what we're interested in: protein function. That might be how the protein binds to another protein, its potency (meaning how effective it is), or something deleterious, like how toxic it is. So that's what they
decide, and that's what we're interested in. But the real expense is actually in the mapping — having a sequence on your computer screen is one thing, but to actually measure these properties you need a full-fledged laboratory, and that's very expensive. That's what LabGenius, the biotech company, has: the biological setting. I was working in the machine-learning group, and we were building this mapping from sequence to protein function using machine learning, so we could optimize for drug discovery; we were working in the context of Crohn's disease. Here I'll show you a case study. It isn't the Crohn's disease case study — I'm using public data rather than their proprietary data. The data I'm using are small proteins called antibodies, about 200 residues in length, with about 20 locations at which
we're introducing mutations. So that's 20 to the power of 20 — a very large space, intractable: about 10 to the power of 26, a one followed by 26 zeros. The sequences are mostly what's called wild type, and that's what this frequency distribution shows. Here are the 20 locations side by side, and you can see that 90 percent of the time the first one is T, and 10 percent of the time it's these other letters you can't see because they're squashed — that's what I mean by 90 percent wild type. The goal was to discover the best candidate for binding and potency, and again, we used machine learning to create the mapping between protein sequence and binding. I don't expect you to fully understand the setup, and that's fine, but here I want to show you a nice way I visualized exploring the decision space. What you're seeing is that before I
showed frequencies, but here I'm showing the frequency change. The way to read this: remember I mentioned T in the first position — here the model is actually rejecting it, preferring other letters over T, as opposed to these locations here, where it actually likes the wild type (I don't remember which letter it is — probably Y — but it likes the wild type, and definitely here it likes the wild type). That's the decision space. But remember, I talked about the objective space — that's what we're actually interested in: say, binding and potency. So I'll rerun the video from the start. You can see the solution space started over here, and I'm doing explore–exploit, looking at the space; the objective is to get over here, and I'm looking for solutions — these sorts of combinations — that reach this region. So this is one way, and you can see this Pareto front here. If you've started to
get the drift of how to identify Pareto fronts: they're all marked here with an X, so this is the Pareto front of this distribution. Great — so you have this, you ran many iterations. How do you decide which proteins to suggest the engineers actually order? Well, that's really subjective; it depends on the domain. The point is that whatever decisions we made, we made them only after the search — not before. That's exactly what I've been talking about. In our case, the protein engineer told me: we're not interested in this region, it's too low on this objective, and these are too low on that objective; definitely take these, and anything in this vicinity. So what we ended up doing was testing over here, for example. That's one way to go about it. Again, the purpose of that slide is to emphasize that you have to make a subjective decision at some stage, but you want to hold it off until the end. This
is another case study. If you're familiar with machine learning, you'll know these parameters called hyperparameters — again, things we can control. For example, there's a decision-tree technique with parameters like the learning rate and the number of trees. Those are what we control — hence our decision space — but they're not what we're interested in. What we're actually interested in is precision, recall, accuracy and related metrics, and the mapping is the whole machine-learning process of training, cross-validation, testing and so on, going from those parameters to what we actually care about. This is common practice for machine-learning practitioners; most of them do single-objective optimization, but I was happy to find an article online in which a team from the SAS Institute used multi-objective optimization — meaning Pareto fronts — to do hyperparameter tuning.
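Here's a toy sketch of that idea (not the SAS implementation): score a small hyperparameter grid with a synthetic stand-in for the real train/cross-validate/test mapping, then keep the non-dominated combinations under min–min:

```python
from itertools import product

def evaluate(learning_rate, n_trees):
    """Synthetic stand-in for the expensive mapping (training,
    cross-validation, testing). Returns two metrics to minimize:
    (misclassification rate, false-positive rate)."""
    misclassification = 0.15 / (1 + learning_rate * n_trees / 100)
    false_positive = 0.01 + 0.05 * learning_rate
    return misclassification, false_positive

# Decision space: the hyperparameter grid we actually control.
grid = list(product([0.01, 0.1, 0.3, 1.0], [50, 100, 200, 400]))
scores = {hp: evaluate(*hp) for hp in grid}

# Pareto front under min-min: keep combinations that no other
# combination beats on both metrics at once.
front = [hp for hp, s in scores.items()
         if not any(t[0] <= s[0] and t[1] <= s[1] and t != s
                    for t in scores.values())]
print("Pareto-optimal hyperparameters:", sorted(front))
```

With a real model you would swap `evaluate` for actual training and validation; the front then hands the final, subjective pick back to you, exactly as in the workflow described above.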
They have a whole range of case studies; I'm highlighting just one of them, for an organization called DonorsChoose, which provides materials for teachers. The thing is, they get many proposals, and only about 20 percent of them are actually worth their while, so they wanted a machine-learning algorithm that would simultaneously minimize misclassification and minimize the false-positive rate. The details don't matter so much; the point is they had two objectives to minimize, and that's what I'm showing here. Using the vanilla parameters, without any tuning, they had a misclassification rate of about 15 percent and a false-positive rate of about 3.5 percent. Then they tried single-objective optimization — minimizing a single metric — and you can see it massively reduced
misclassification, but it actually did worse on the false-positive rate. So then they did multi-objective optimization. Here you can see the distribution of all the solutions they looked at, with the Pareto front highlighted in green. Now we're at the usual juncture: we have all these solutions, how do we make a decision? Before deciding, they noticed this odd bump over here, so they applied a cut-off on misclassification and re-ran the calculation. Focusing on this box, they have another nice visual: the green dots are the same as before, and all the other dots are the same except for these triangles. The triangles come from what they called a constrained search — whereas before they were open to all parameter values, now they only accept solutions with a misclassification rate below 0.15, where the bump is. Hence they got these triangles, and then they're
again at the decision point: they have to pick one of these, and this is where subjectivity comes in — "well, I think this one is the best, let's go with that." So that's the idea, and the end result, compared with that initial data point, is that they reduced the false-positive rate by a relative 8 percent and improved misclassification by an absolute 5 percent. So if you're a machine-learning practitioner, these calculations take extra resources, but you might get an edge by doing them.

Another question for the audience. I talked about how to identify Pareto fronts and challenged myself with the app; now let's see what you see over here. Which fruits are Pareto optimal in the opposite sense? Suppose we want to be nasty colleagues and give someone the fruit that's both most difficult and most untasty. If people can answer in the chat what they think it is
before I give the answer. I'll give you about 10 seconds: which fruits are both difficult and untasty, according to Randall Munroe? Five more seconds... All right, these are the ones I picked out — these are all considered Pareto optimal. Yes, I appreciate the chat is a bit delayed. If you got it, great; if not, you still have the app to play around with.

So, are Pareto fronts relevant to the projects you're working on? That's what we're addressing here. You want to look into this technique if you're dealing with conflicting objectives. Next, from my conversations with people, I've realized that not everybody has full control over the decision space — you want full control over that. And when it comes to the objective space, you need an inexpensive mapping, because sometimes you
can make the decisions, but mapping them to the actual objective space might be very time-consuming or very expensive — that's another consideration. A further consideration is how big your objective or decision space is. For example, in the drug-discovery case, the DNA sequence space is astronomical, so you need a stochastic algorithm to navigate the space; one recommended technique is genetic algorithms. And that is actually the last topic — but how much more time do we have for this talk? OK, I've been asked to wrap up, that's fine. Don't worry about the rest of the material — you'll have access to it all online. At past conferences I've given a hands-on tutorial covering this material.
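As a rough pure-Python illustration of the genetic-algorithm idea just recommended (a toy sketch, not the speaker's implementation — libraries like DEAP do this properly), here's a tiny generational loop that evolves knapsack bit-strings toward the max-value/min-weight Pareto front:

```python
import random

random.seed(0)
# Toy knapsack items as made-up (value, weight) pairs.
ITEMS = [(random.randint(1, 20), random.randint(1, 20)) for _ in range(15)]

def objectives(bits):
    """Map a decision (one bit per item: pack it or not) to
    (total value, total weight): maximize value, minimize weight."""
    value = sum(v for b, (v, _) in zip(bits, ITEMS) if b)
    weight = sum(w for b, (_, w) in zip(bits, ITEMS) if b)
    return value, weight

def dominates(a, b):
    """True if objective pair a beats b: no worse on both, not equal."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def pareto_front(population):
    scored = [(bits, objectives(bits)) for bits in population]
    return [bits for bits, s in scored
            if not any(dominates(t, s) for _, t in scored)]

def evolve(generations=30, pop_size=40, mutation_rate=0.1):
    population = [[random.randint(0, 1) for _ in ITEMS]
                  for _ in range(pop_size)]
    for _ in range(generations):
        elite = pareto_front(population)      # keep non-dominated parents
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = random.sample(population, 2)
            cut = random.randrange(1, len(ITEMS))     # one-point crossover
            child = [1 - b if random.random() < mutation_rate else b
                     for b in p1[:cut] + p2[cut:]]    # bit-flip mutation
            children.append(child)
        population = elite + children
    return pareto_front(population)

front = evolve()
print(len(front), "non-dominated knapsack fillings after evolution")
```

Serious implementations (such as NSGA-II, available in DEAP) add non-dominated sorting and crowding-distance selection to keep the front well spread; this sketch only preserves the non-dominated elites each generation.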
You can get hands-on experience with the knapsack problem in Jupyter notebooks — you have all of that over here, and I'll just very quickly show you what it looks like. I see 14 of you are already looking at it, which is great. Quickly going through it: it's basically videos of me talking about exactly this topic, plus Jupyter notebooks in which you can practice it and learn about genetic algorithms and related techniques to use in your own projects. Sorry, I'm just going to share my screen so I can see.

OK, so just to summarize: multi-objective optimization is useful when you have conflicting objectives, you have full control over your decision space, and you have an inexpensive mapping to the objective space. If you have an intractable search space, you might consider
genetic algorithms. Again, you have all the material online, you have the tutorial, and you can play the Pareto whack-a-mole game. If anybody's interested in more material, I highly recommend anything by Zitzler — it's a great read — and there's also a Python evolutionary-computation module called DEAP, which I highly recommend as well for prototyping; I find it very useful. So, thank you very much.

Thank you, Ayal — this has been great. Unfortunately we don't have time for more questions, but all these follow-on materials are going to tide us over, so thank you again, Ayal. Next up on today's data science and analytics track at PyCon AU 2021 is the closing session with Lars Jenkin: Systems of the World — Cataloguing the World's Data for Great Good. Please stay on; he'll be on in another 10 minutes, at 4:45 Australian Eastern time.