Good afternoon to those of you on conference time, that's Melbourne time, Australian Eastern Time, and good evening to those of you joining from California, which is where Paco Nathan hails from. Paco is known in the industry as a player/coach, with core expertise in data science, natural language processing, and cloud computing, and about 40 years of tech industry experience ranging from Bell Labs to early-stage startups. He has too many accolades to list completely, but some highlights: he's the lead committer of PyTextRank and kglab, which he will be talking about today, and formerly he was the director of community evangelism for Databricks and Apache Spark. If you want to know more about him, you can find out at derwen.ai/paco.

This talk is about how graphs are amazing. There's the emerging graph ML work in deep learning, there's the rise of knowledge graphs in industry, powerful graph algorithms, huge graph databases, interactive graph visualizations, and so on. But it all seems a bit daunting. Paco's talk is about how you don't really need special platforms and very specialized knowledge: you can get started on graph-based technologies using just Python, kglab, and many of the other Python technologies that you already know and love. Please give a virtual round of applause and show your love to Paco Nathan and graph data science.

Thank you very kindly, thank you Javi, I appreciate it, and I'm grateful to get to present here today. The slides are online; I probably err on the side of having way too many links in my slide decks, so there's the link here, and it should be in the chat as well, and I'll be on chat afterwards. If you want to check through the slides, there are a lot of links out to primary sources and repos and articles and whatnot.

So, this is about graph data science.
Just briefly about me: I've been doing this for a few years. I had some interest in neural networks a long, long time ago; I got involved in machine learning back in the '80s, and then nobody cared, so I did a lot of work in network engineering. I ended up being a guinea pig for something new that was getting launched, which I found out later was named AWS, and I ended up leading one of the first large Hadoop instances ever running in the cloud, and from there moved into Spark, and working with Jupyter and Ray and some other projects.

So this talk is really about understanding how to work with graphs, because there's so much open source available. I want to lead you through a brief story. Imagine a village; let's say it's a medieval village somewhere in the Black Forest, and it's beautiful, of course. If you look at it in terms of a map, there's a road running through, a stream running by, and some different cottages and businesses and whatnot.

So let's talk about who's there in the village. We've got Pat; Pat runs the pub. Pat has a couple of friends, Hannah and Thomas. Now, Hannah works the fields and produces grain, and she's got a friend, Aiden. Thomas raises poultry and produces eggs, and Thomas has a friend named Brenda. If you go back to Aiden: Aiden works the mill, takes the grain from Hannah, and produces flour, and Aiden has a friend, Chris. Then Brenda takes the grain from Hannah, works the brewery, and produces beer, and she's got a friend named Kim. And then there's Chris: Chris runs the bakery and produces bread. Now, Chris is taking products from Aiden, the flour, and eggs from Thomas,
and selling the bread back to Pat at the pub, so they've got some nice bakery goods. And then Kim runs a recycler: she buys organic wastes from Thomas and Chris and the others, and she produces fertilizer, which Hannah buys to use in the fields.

The point I want to make is that with seven nodes here, seven people in a small medieval village, we already have a definition of what we would call a circular economy; we have something that's sustainable. We have this producer/consumer view of what their activities are, we have a friend graph, and we also have this product graph, if you will: the flows of products going through this workflow. So this is a graph, and this is the way to think about how a complex, intertwined system operates.

Now, if we were to take that very same data, put it into a relational database, and do the proper things in terms of normalization, it would most likely look very much like this. In some ways you have to dismember all those relationships and connections that you had and put them into tables. You can kind of see this here: can you understand who's really producing the most important goods by looking at these numbers? That's hard. However, if you come back to a network view, you might get a better sense of it.

So when you do have complex contexts, network views bring the data, and the connections within the data, closer to the people who can make sense of it. And this is predicated on a few things: one is acknowledging the complexity of the kind of use case you're working with, the context, and then being able to understand whether we can identify
emergent patterns within these connections, and from there be able to make informed decisions. So think about patterns whenever you're working with graphs, or vice versa: when you're seeing complex patterns, perhaps graphs are indicated.

Now, Hannah is relatively new in this village, but she'd like to expand her business, so one question she could naturally ask: she's noticed that one of her customers, Brenda, buys a lot of grain. Who are the other villagers who might be similar to Brenda? Can we do some lookalike analysis? If you go across the graph and do some analysis there, you find that, well, Chris also sells product to Pat, and he sells waste to Kim. Maybe the bakery could make good use of unmilled grain; maybe they could produce malt or something.

Hannah is also interested in sponsoring a co-marketing campaign here in her medieval village to help drive demand, helping her customers reach more of their customers. Who are the customers of Hannah's grain business? As it turns out, Chris, Pat, and Kim are each a minimum of two hops away, and we can see this from graph analysis.

Now, just suppose you had some tech millionaire, some Elon Musk type, who uses time travel to relocate back to a medieval village in the Black Forest, like you do. Which businesses are the most influential, say, if they were looking for an acquisition target? It turns out that if you look at this medieval village and the business going on there, Hannah and Chris are really the most central figures, and in fact, if you run graph algorithms on this exact data, you find that Hannah leads the ranking in terms of a betweenness centrality measure.
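To make that concrete, here's a minimal sketch of the village as a NetworkX graph. The edge list is an assumption reconstructed from the story above, not data from the talk, but it's enough to ask the two-hop and centrality questions:

```python
# A toy "product graph" for the village; the edges follow the story
# above and are assumptions, not the talk's actual dataset.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("hannah", "aiden"),   # grain -> mill
    ("hannah", "brenda"),  # grain -> brewery
    ("aiden",  "chris"),   # flour -> bakery
    ("thomas", "chris"),   # eggs -> bakery
    ("chris",  "pat"),     # bread -> pub
    ("brenda", "pat"),     # beer -> pub
    ("chris",  "kim"),     # organic waste -> recycler
    ("thomas", "kim"),     # organic waste -> recycler
    ("brenda", "kim"),     # organic waste -> recycler
    ("kim",    "hannah"),  # fertilizer -> fields
])

# Hannah's customers, and her customers' customers: two hops downstream
print(nx.single_source_shortest_path_length(G, "hannah", cutoff=2))

# who sits on the most product flows? rank by betweenness centrality
ranking = nx.betweenness_centrality(G)
print(sorted(ranking.items(), key=lambda kv: kv[1], reverse=True))
```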
Now, also think about what happens if somebody new comes in. Mika is an old friend of Kim; she's just moved to the village, and she has a delicious recipe for chicken sausage that's going to require some poultry and some grain. So think about how you would position Mika in here, how she would be connected. Naturally, of course, she could sell sausages to the pub, so Pat would be interested in buying those. With this type of approach, bringing in new data takes very little effort; data acquisition, and data integration, I think, is probably a better way to put it.

I want to bring this back to something that isn't really about technology. Let's talk about pedagogy; let's talk about learning and teaching. Susan Ambrose has a great book called How Learning Works; it's one of the typical textbooks in the field of pedagogy. When people look at the cognitive psychology of how individuals learn, how people proceed from being a novice to gaining more and more expertise in a given field, it goes something like this. At first, when someone's a novice, they tend to learn relatively superficial mental structures: disconnected sets of simple facts. And of course, when we're building tests, we query for that. Then, as people become more advanced in a practice, they start learning how to string together different facts and make associations; these tend to be represented as linear kinds of sequences of cognitive structure. When people become competent practitioners, they tend to think in terms of something quite like decision trees, and at this stage they can really get into a lot more decisions, formulating plans and whatnot. But when people begin to show expertise in a given field, one of the first things that they almost always do is
learn where to break the rules. With those decision trees of rules, they know which parts they can rely on and which parts perhaps are questionable. And when you start to look at the cognitive structure for how experts work, they organize what they have learned in terms of graphs; this is what the cognitive science people say, AI notwithstanding. Experts look for emergent patterns when they try to make sense of things.

Now, I could go into much more detail, but I'm going to skip over this part. If you've ever heard the rather unfortunate use in the United States of "unknown unknowns" back 20 years ago, it derived from a gentleman at IBM named Dave Snowden, who created the Cynefin framework, about how leaders make decisions. In it there's a sense of a simple context, where you follow best practices: you have facts that are established, you follow the rules, done and done. But when you get confronted with a more complicated kind of problem, you typically have to bring in some experts, they do some analysis, and that leads to decision-making about trade-offs. But then you get into a complex space, and this is where you reach the infamous unknown unknowns: determining cause and effect just doesn't work, so instead you have to look for emergent patterns. Again, we see this over and over in business in terms of how we approach management: when you hit complex business problems, you're looking at emergent patterns, you're dealing with uncertainty, and for global business today, this is where we're at.

So I just want to make the point that relational data and data warehouses are really great if you're operating at the level of, say, a novice or an advanced beginner. When you proceed into more of the complicated,
competent-practitioner kind of work, then okay, maybe some of the analytic warehouse approaches are good; that's where data science tends to live. But when you get into the expert parts of this, you need to be leveraging graphs, and we know this even before the technology begins to get applied.

And I will say, if you look at the literature: up until 2020, Gartner Research was saying, well, I'm not really so sure about graphs. They did an abrupt about-face early in 2021, and they are now saying, no, actually, even though graph analytics is at 10% penetration in industry today, they're projecting that by 2025 graph technologies will be used in 80% of data and analytics innovations, and currently over half the inquiries they get from enterprises about AI are about graph technology. This bottom line is really good: the very thing that's happening in the background when you're running a SQL query, or when you're using an Excel spreadsheet, is that a graph is executing, and the metadata and the business rules are being applied through a graph; those are just hidden behind the scenes to make them look simple. Now, the problem is, as we've learned from Cynefin, if you take a very complex kind of problem and try to oversimplify it, bad things happen. We saw that in the United States, certainly, where the unknown unknowns came into play.

So this point about a village and about graph thinking, we can embellish a bit more. I want to shout out to some very good friends and colleagues: Jürgen Müller at BASF; there's an article called "Graph Thinking" that goes into much more detail about this. I'll give Jürgen credit: he and I developed this whole village story together, and we've been using it to help train executives
in graph thinking. There's also a recent podcast with my friends Ben Lorica and Jenn Webb, talking about this kind of surge of interest in graphs. Along with it, I'll point out a really good article from last year by Cassie Kozyrkov about what ambiguity aversion is. Because I'll say this: one of the points that Jürgen Müller and others out in industry make about graphs is that when you propose using a graph for a problem, typically about half the people in the room who are domain experts will get it; they'll jump right into it, because it's how they think about their problem, whether they're working in logistics or market intelligence or some sort of business process. But the other half of the people run screaming in the opposite direction. There's a reason for this, out of behavioral economics, called ambiguity aversion; we could go into more detail, but Cassie has written a really fantastic article about it.

So, moving right along: there's something called graph theory, especially algebraic graph theory, which says that if we do have complex data organized as a graph, with nodes and edges and properties and all that, we can simplify it and turn it into, on the one hand, a vector, the really simple case, or a matrix, the very common case in algebraic graph theory, or a tensor, if you want to capture more of the nuances of the graph. This is a very mathematical kind of approach to working with graphs, and it tends to move everything into a numeric space, so that we can do number crunching with things like GPUs, which is very useful.
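As a toy illustration of that algebraic view (the graph here is made up for the example), the same structure can be flattened into an adjacency matrix:

```python
# The algebraic view: the same graph, re-expressed as a matrix that we
# can crunch numerically. Toy data, assuming NetworkX and NumPy.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("hannah", "aiden"), ("aiden", "chris"), ("chris", "pat")])

nodes = sorted(G.nodes)
A = nx.to_numpy_array(G, nodelist=nodes)  # A[i, j] == 1.0 iff edge i -> j
print(nodes)
print(A)
```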
But I want to make a point: sometimes we must use both the symbolic and the numeric. The numeric is great if you're doing deep learning or graph algorithms or visualization. But if you're auditing for compliance, if you're working with natural language applications, if you're doing human-in-the-loop, if there are rules regarding domain expertise that have to be merged in, or if there's a need to explain what your models are doing, you're going to be in the symbolic space. So you need to have transforms back and forth between the numeric and the symbolic. We provide this in an open source sense; it's very similar to, say, what label encoding does in scikit-learn, or of course embeddings in PyTorch and other types of deep learning frameworks (there's a sketch of that round trip below).

In general, for this idea of going back and forth between numeric and symbolic, there are analogies. I like to think of it as "thinking sparse and dense," with all apologies to the author of Thinking, Fast and Slow. The idea is that when you're working with data and workflows, and thinking about how to leverage the hardware, it's really crucial to recognize that at some stages you must work in a sparse sense, which is typically bandwidth-limited; even parts of deep learning, like calculating a loss function, are typically sparse and bandwidth-limited, and data preparation is like that too. But there are other times when we need to crunch on a very dense piece of data that can be packed into a GPU, like the convolution layers in deep learning; that's compute-limited. If you want to go into more detail on this, there's a recent report Dean Wampler and I just did for NVIDIA, working with the open source tech leads there; it's called "Hardware > Software > Process."
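Here's a minimal sketch of the symbolic/numeric round trip mentioned above, in the spirit of scikit-learn's label encoding; the node names are illustrative:

```python
# Encode symbolic node labels as integer ids for numeric work, then
# decode back to symbols for humans; analogous to sklearn's LabelEncoder.
import numpy as np

nodes = ["hannah", "aiden", "chris", "pat"]          # symbolic vocabulary
index = {label: i for i, label in enumerate(nodes)}  # encoding table

edges_sym = [("hannah", "aiden"), ("aiden", "chris")]
edges_num = np.array([(index[s], index[o]) for s, o in edges_sym])

# ... dense, GPU-friendly number crunching happens on edges_num ...

decoded = [(nodes[s], nodes[o]) for s, o in edges_num]  # back to symbols
assert decoded == edges_sym
```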
I'm throwing this out as a framing for graph data science: how can we leverage graphs, how can we leverage the dimensionality within graphs, how can we go through this process of starting with unstructured data that has a lot of connections in it, and produce more structure that we can leverage for data science problems?

And so that leads us to a library called kglab. We've been working on a lot of these pieces for a long time, but as an official project it got started last October, and it's really picked up. It's an abstraction layer in Python, and here are links to the repo and forums and all that. The idea is that in the graph space there's a lot of really interesting open source; unfortunately, it's broken out into different camps that don't really talk with each other. The people doing W3C technologies in RDFlib do great things, but they don't really have an integration path if you want to use NetworkX to do graph algorithms, and neither of them has an integration path if you want to go out and use PyVis to do some nice visualization. And if you happen to have some sort of probabilistic graph problem, and you're doing statistical relational learning using something like PSL or pgmpy, nobody talks with them. So the problem was: how can we establish common ground, abstraction layers to integrate open source projects? And while we're at it, let's make a lot of paths out to things that we know, like pandas, NumPy, scikit-learn, PyTorch, etc. And while we're at it, let's do the right thing and make this stuff parallelizable, using Spark and Ray and Arrow and RAPIDS, on and on. That's the gist of what we're doing with kglab, and there's a very ambitious roadmap, although we've knocked a lot of it down; we have a lot of these integrations now.

There's also a related project I've been working on for a number of years, going back to 2008: TextRank. PyTextRank has become rather popular in the Python space for doing relatively lightweight but very effective entity extraction from text, and you can also do some summarization. The two projects are very closely related and work together.
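For a flavor of PyTextRank, here's a minimal sketch using the spaCy 3.x pipeline style; the sample sentence is made up:

```python
# Minimal PyTextRank usage as a spaCy pipeline component (spaCy 3.x).
# Setup assumed: pip install spacy pytextrank
#                python -m spacy download en_core_web_sm
import spacy
import pytextrank  # noqa: F401  -- registers the "textrank" component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Hannah grows grain, which Aiden mills into flour for the bakery.")

for phrase in doc._.phrases[:3]:   # top-ranked key phrases
    print(phrase.rank, phrase.text)
```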
So let me go through the last parts of this rather quickly. kglab does a lot of things, but I'll give you a bit of a flavor. If you're ever working with semantic graphs and the whole W3C space, like RDF, we do that, and we try to make it as simple and painless as possible; if you've ever worked with RDFlib before, it takes a lot of coding, so we try to provide an abstraction layer to make this simpler. We also support a lot of different serialization formats. Importantly, we support JSON-LD, since a lot of people talk to that; more importantly, these days we're using Parquet, and it's orders of magnitude more efficient than any of those others. It also integrates well with Spark and Trino and Cassandra and other things, and it works really well with GPUs.

There are a number of paths for visualization, working with Cairo and matplotlib and PyVis, plus exports to Gephi and things like that; again, really just a few lines of code. If you need to query, you can use SPARQL; this is built in, and we bring the results back as well: you can get them as an array of rows, or you can get them back as a pandas DataFrame. If you need to do some sort of validation, one of my favorite projects lately is called SHACL: it provides shape constraints, so you can apply rules to validate a graph, like building unit tests; that's also a really great part of this. And when you need to jump into your graph algorithms, we have transforms there and back for working with NetworkX, igraph, and some others on the horizon.
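Pulling those pieces together, here's a hedged sketch of what that looks like in kglab, following its public tutorials from around this time; the file names, namespace, and SPARQL pattern are illustrative, and exact signatures may differ between versions:

```python
import networkx as nx
import kglab

kg = kglab.KnowledgeGraph(
    name="recipe KG example",
    namespaces={"wtm": "http://purl.org/heals/food/"},
)

kg.load_rdf("recipes.ttl")          # W3C semantic graph in, via RDFlib
kg.save_parquet("recipes.parquet")  # far more compact serialization

# SPARQL is built in; results can come back as a pandas DataFrame
sparql = """
SELECT ?recipe ?ingredient
WHERE { ?recipe wtm:hasIngredient ?ingredient }
"""
df = kg.query_as_df(sparql)

# SHACL shape constraints: like unit tests for the graph
conforms, report_graph, report_text = kg.validate(
    shacl_graph="recipe_shapes.ttl",
    shacl_graph_format="ttl",
)

# project a subgraph out to NetworkX when you need graph algorithms
subgraph = kglab.SubgraphMatrix(kg, sparql)
nx_graph = subgraph.build_nx_graph(nx.DiGraph())
```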
So the reason is this: you can use graphs for a lot of things, and they really fit complex data problems. When you look at the space of what's being implemented for graphs, it's kind of all over the map, and I've tried to represent that here. Some of the solutions work really well with uncertainty; some of them are much more formalized, much more of an analytic solution. And so axiomatic transforms don't really fit well with graph embeddings in PyTorch. We're trying to bridge that, so that you can build applications that mix and match between these different parts and leverage these different technologies. So, for instance, here's an idea of using probabilistic graphs with PSL, which I find extremely useful for uncertainty, and also graph embeddings using something like PyTorch Geometric.

So, real briefly, in the time remaining I want to show a general pattern for what we're doing here, along with an example. If you think of an idealized machine learning workflow, this is the kind of thing you're typically working with: data prep, feature engineering, models, etc. Let's look at an example. Here's a dataset from Kaggle, from food.com; it has a quarter of a million recipes and all their ingredients. We take it as a CSV file and build it into a knowledge graph. So you've got to have some data prep: you've got to go and do some deduplication, in addition to reading CSVs, but you've also got to be able to infer relations out of what's there; it's somewhat connected, but it could be connected more. You probably also want some feedback loops, getting people in there to label things appropriately, applying annotations using Rubrix or Snorkel, something like that (see the data-prep sketch below).
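As a sketch of that data-prep step, here's one way to turn recipe rows into triples; the CSV columns, delimiter, and vocabulary are assumptions, not the actual Kaggle schema:

```python
# Build knowledge-graph triples from a recipes CSV; column names and
# the "wtm" vocabulary here are illustrative assumptions.
import pandas as pd
import rdflib
import kglab

kg = kglab.KnowledgeGraph(namespaces={"wtm": "http://purl.org/heals/food/"})
wtm = kg.get_ns("wtm")

df = pd.read_csv("recipes.csv").drop_duplicates()   # simple deduplication

for _, row in df.iterrows():
    recipe = rdflib.URIRef(f"https://www.food.com/recipe/{row['id']}")
    kg.add(recipe, rdflib.RDF.type, wtm.Recipe)
    for name in str(row["ingredients"]).split(";"):  # assumed delimiter
        kg.add(recipe, wtm.hasIngredient, rdflib.Literal(name.strip()))
```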
And then you want to go out and build a graph, and start to measure some of the data quality within it; RDF and SHACL are really good at these points. Then you can go and run some graph algorithms. Perhaps you want to analyze the structure and find the most popular ingredient in this set of recipes for how to make my favorite food, char siu bao; maybe you could use Leiden, or some sort of centrality measure, for that (there's a sketch of that kind of ranking below). On and on: for the different parts of the use cases, you can apply these different types of tools, all within the same structure. And again, we're not really trying to invent a lot of new code here; we're really doing integration, with tests and tutorials, added on top of other existing open source libraries.

All of this is online, and there are about a dozen Jupyter notebooks on our docs page that go through this step by step, using this worked example with the food.com recipe data. So you can take a look and run those; we have a Docker Compose setup that we're building out of GitHub, if you want to run it in Docker. And there are some other tangential tutorials. I really love what's going on with my friends in Spain, in Madrid and Valencia: a company called Recognai, which makes something called Rubrix; we're very close friends and work together on our projects back and forth. Also, I want to give a shout out to our GitHub sponsors on the projects, and to the knowledge architecture team at BASF Global Digital Services; they've been hugely supportive, Jürgen Müller and others there. We'll be going into this in more detail, with tutorials and keynotes and whatnot, at other conferences coming up over the next several months.
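Returning to that "most popular ingredient" question: here's a sketch that continues the illustrative kg built above, projecting the recipe-to-ingredient subgraph out to NetworkX and ranking ingredients by in-degree; method names follow the kglab tutorials and may differ between versions:

```python
# Rank ingredients by how many recipes point at them; continues the
# illustrative `kg` object built in the earlier sketch. Subgraph node
# ids are integers that kglab can map back to RDF terms.
import networkx as nx
import kglab

sparql = """
SELECT ?recipe ?ingredient
WHERE { ?recipe wtm:hasIngredient ?ingredient }
"""
subgraph = kglab.SubgraphMatrix(kg, sparql)
nx_graph = subgraph.build_nx_graph(nx.DiGraph())

# in-degree of an ingredient node == how many recipes use it
ranked = sorted(nx_graph.in_degree(), key=lambda kv: kv[1], reverse=True)
for node_id, degree in ranked[:5]:
    print(subgraph.inverse_transform(node_id), degree)
```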
And with that, I'd like to say thank you very much; I appreciate it. We have a lot more details online, and I wanted to save some time so that we could have some questions here. I have only one screen where I'm at, so I'm going to leave it on this page right here, and we can go to questions.

Thank you, Paco, that has been a great talk, and I particularly appreciate the shout out to my friends in Madrid and Valencia, whom I don't know yet, but now I will. There are questions from the audience, and here they come. The first one is: does the kglab abstraction layer work with Neo4j, and how is it different from NetworkX? I'm mixing them, I'm messing them up.

Yeah, no, that's a great question. If you haven't worked with it, Neo4j is arguably the most popular among the graph databases. It really started out as a single-instance system; they've tried to go into more distributed areas, but it's not going to be the thing you use for really, really large graphs as much. Arguably, though, they have a long-tail distribution of customer use cases, so it's very, very popular. It is a database, and you can do queries inside of it, although some of the other things you'd want to do with graphs aren't going to be quite as native inside Neo4j. There are integration paths, and we do have a connector; actually there are a few different ways, but there's a way of bridging between Neo4j and semantic graphs, and we can use that to import. In fact, one of the developers just used it for, I think it was a 40-terabyte project; it was huge, it was enormous. So we've got a good, industrial-strength integration in place there.

The way it differs from NetworkX: NetworkX is a Python library for algorithms. It doesn't have anything to do with persistence, or really any other
parts of graphs, other than getting a lot of different algorithms implemented, which is something Neo4j is not really what you'd want to be using for. So they're very separate kinds of use cases: for instance, if you wanted to run PageRank, or betweenness centrality, or perhaps some sort of connected components, NetworkX is brilliant for that. So we integrate with both of those.

I see. And someone asks: thinking about the idea of sensing emergent patterns from unknown unknowns, what strategies can experts use to separate what they see as an emergent pattern, sensed from relevant experience they have, from implicit bias they may not be aware of?

Very good question. This is one of the things we find over and over again: if you take somebody who's an expert in global supply chain and logistics, and you put a graph of their data up in front of them, they start to see the connections, especially if there's any kind of real-time updated aspect; they get it right away. I mean, this is how they think. So typically, in a domain, the experts will have these kinds of insights. How can we make it easier for them to surface, though? If you have a five-billion-entity graph, they probably won't catch all those patterns; you probably won't be able to visualize them all on one screen. So what can we use to basically shine a lens? This is where the algorithms come in handy, and some of the query preprocessing. Certainly a lot of the work going on with deep learning is basically shining a lens on different parts of the graph that might be very interesting, given the kinds of things that expert is looking for. And this is where graph data science comes in very handy, working with
the domain experts. But the second part of the question is: what if it's something they're missing? Again, this is where automation helps: having very large deep learning models go in and learn those patterns, and then checking them against the experts. Actually, I'll point out a great use case of this: Novartis has exactly this problem. They take all the research coming in, in terms of pharma, run it through a graph, and then build models out of that to predict the interactions between different genes, different molecules, and different diseases. What they found the first time they did it and presented it to their biologists: the biologists said, oh, we know all this stuff; well, we know 80% of it. So the company said, well, what about the 20% that you didn't know? And they said, hold my beer, I'm going to go back to the lab. And then they made some actually very important discoveries from that 20% they hadn't tried before. So now this approach, leveraging graph analytics, has become part and parcel of the research strategy at Novartis and at other pharma companies. So yes, there may be some bias on the part of the domain experts; how can we use AI technologies to augment that? In particular, we have one project at kglab where we're using reinforcement learning to surface a leaderboard of emergent patterns that maybe nobody had bothered to check before.

Thank you. You've mentioned working at a higher, bigger scale of data, and it segues into the next question from an audience member, who asks: how does kglab perform at scale, when compared with libraries such as igraph or NetworkX, on algorithms like betweenness centrality? So the idea is now performance, where you mostly focused on capability before.

Very good, very good. Well, for things like running betweenness centrality on NetworkX or igraph,
what you probably want to do, and there's an illustration of this a few slides back: have your graph, manage your graph, and you can use different technologies to manage it and persist it. When you get to the point of needing to use some of these algorithms, you typically want to take a projection off of that, a subgraph; in fact, when you're working with NetworkX, you must. So that's one of the abstractions we have: how to project out part of the graph as a subgraph. You won't be using the entire graph when you run it through the algorithm. And in some ways there are actually some pretty good algorithms to filter and pare things down; I did a multi-scale backbone, for instance, that's in production use in Hungary, as I just found out at one of the recent conferences. There are techniques for scaling down, for paring down, if you will, a subgraph, to get, again, the lens on the part that you need.

As far as how kglab itself scales, it really depends on what kind of infrastructure you have. I guess, without letting the cat out of the bag: I do other work besides kglab, some of which is currently private, in early stages, but maybe someday soon will be open source; that's in active negotiation right now. It's built using much the same technologies as we use in kglab, using Ray for scaling on a large cluster. And to be clear, we're already at pretty high scale; we're aiming for trillion-node graphs by the end of the year.
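A tiny sketch of that projection idea, on synthetic data: carve out the neighborhood you care about before running an expensive algorithm on the whole graph.

```python
# Project a subgraph before running an expensive algorithm; the "big"
# graph here is a synthetic stand-in.
import networkx as nx

G = nx.barabasi_albert_graph(10_000, 3, seed=42)   # stand-in large graph

lens = nx.ego_graph(G, n=123, radius=2)  # 2-hop neighborhood of interest
print(len(lens), "nodes in the projection, out of", len(G))

scores = nx.betweenness_centrality(lens)  # feasible on the projection
```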
Thank you. That's all the time we have, unfortunately, but thank you so much for giving us your time today at PyCon AU. The big hand of applause is going to be virtual and venue-less, but thank you.

Much appreciated. Thank you, Javier, have a great day.