Welcome back, friends, to the data and AI specialist track at PyCon AU 2025. I am very excited to introduce the last speaker in this block of talks. Please give a very big round of applause to Niha and her talk, "Hierarchical clustering: finding the awkward reunions." Big round of applause, everybody.

First of all, I would like to thank you all. This is my first time attending the PyCon conference, and I am greatly privileged to be here and to have been selected as a speaker. I want to give you a brief introduction of myself. I am a second-year PhD student at the University of Newcastle, Australia. I work at the intersection of computer science and healthcare, where I develop software solutions, primarily in Python, for complex diseases and disorders. Before moving to Australia, I gained experience as a systems engineer at Tata Consultancy Services and HCL Technologies.
My research interests include medical software engineering, machine learning in healthcare, and Python for health informatics. I also work with communities and NGOs which provide AI and Python training in various underprivileged schools across Australia. We also supply laptops to the students, and I would like to say that nowadays technology has advanced so much that students are willing to learn these things at such a young age. I was surprised to find that they know about even the advanced things happening in Python and AI. Besides that, I have provided my LinkedIn profile here.

As a PhD student, I have developed a framework for oncology which makes predictions on the data, performs survival analysis, and generates various kinds of reports and visualizations.
The other thing I am working on is developing a Python package for complex network analysis. Currently there are libraries like NetworkX and Matplotlib which can be used for network analysis, but there is no comprehensive library for biological network analysis. Complex network analysis really helps in predicting diseases and disease outcomes, but unfortunately there was no comprehensive package for that.

So today I am going to present a generative model called hierarchical random graphs. It is derived from hierarchical clustering, an algorithm for extracting the hierarchies from networks, and the generative model is based on Monte Carlo simulation and probabilistic inference. The first question that comes to mind is: why do we even need to find the reunions, that is, why do we even need to predict the missing links? Whatever networks we observe in real life are incomplete, or only partially observed.
Hence, whatever platform you use, most of the links are missing from the networks you extract from it, and most real-world networks are part of complex systems. Hidden links can significantly influence the system behaviour. If we are able to predict the missing links, it improves our understanding of the system and leads to optimization of the network. When there are missing links, the network becomes fragile; it is not robust enough, you are not able to do any kind of analysis on it, and it leads to misleading metrics.
For example, if you want to know the degree of a node, you analyze the node to check its influence; if there are missing links, you cannot compute the correct degree of the node or tell whether it is influential or not. Then there are faulty predictions, because you are not able to infer enough from those networks, and there is reduced connectivity.

So what I did was analyze a random network using a metric called the Adjusted Rand Index (ARI). It compares the community structures of the network, so as you remove links from it fraction by fraction, you can see the score decreasing: first it was 0.11, then it became 0.10 and 0.07. That means when links are missing, the community structure of the network keeps depleting. The ARI ranges between −1 and 1, where 1 means perfect agreement between the clusterings and −1 means disagreement.
So as the ARI decreases, the network structure is also depleting.

Here is an overview of the various missing-link prediction methods available today. There are experimental methods, where you do not infer the missing links but instead validate, on the basis of experimental approaches, that there should be a link between two nodes. Then there are similarity-based methods, where you infer from the structure of the nodes, on the basis of the node degree, the average path length, or the kinds of bridges it has, that certain nodes should be linked and those links are missing. Then there are matrix and latent-factor methods, which convert the network into an adjacency matrix, decompose the matrix into subcomponents, and during reconstruction infer which links are missing. We have used probabilistic methods because, compared to the other methods, they are more interpretable.
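The Adjusted Rand Index used in that experiment can be computed by pair counting. Here is a minimal pure-Python sketch (in practice you would likely reach for `sklearn.metrics.adjusted_rand_score`); the example labelings are illustrative, not the talk's data:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two flat clusterings of the same items, via pair counting."""
    n = len(labels_a)
    # Pairs grouped together in both clusterings, in A alone, and in B alone.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    return (together - expected) / (max_index - expected)

# Identical partitions (up to relabelling) score a perfect 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Comparing the communities detected before and after link removal with a function like this is how the depletion from 0.11 down to 0.07 would show up.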
The other methods can predict the missing links, but you would not know how they actually predicted them: is that missing link really true, does it really exist? The other methods also treat the network as homogeneous, meaning they assume all the nodes in the network are of the same type. But consider a research network, where the nodes can represent researchers, authors, and citation indexes: the nodes are different, heterogeneous in nature. When you use those approaches, they are not able to account for the heterogeneity of the network. They only capture the local structure, while the probabilistic methods are able to capture the global structure as well. The previous methods are also exhaustive in nature, whereas the probabilistic methods only require a probability distribution, from which they infer the parameters; that is why they are non-exhaustive.

So, networks are a representation of complex systems.
A network means nodes connected by edges. The complex systems which exist in real life are collections of many interacting components. Examples of complex systems include biological networks, technological networks, social networks like the Facebook connections, and ecological networks, and believe me, most existing networks are complex in nature. Even if you consider your brain, how the neurons work, that too is a complex system. All complex systems are networks, but not all networks are complex systems. Complex systems have certain properties which distinguish them from plain networks. They are nonlinear: a small input change can lead to a big change. They are spontaneous: they do not require any kind of central leader and can self-organize. For example, a flock of birds does not need a leader; the birds self-organize into groups and fly.
They are adaptive: they are influenced by the environment and use feedback, changing according to the output. And they are emergent: they function as a whole, not as simple singular components, but as multiple components working together. There are various types of complex systems, but we are specifically focusing on hierarchic systems. Why? Because complex adaptive, polycentric, and disorganized systems all exist in combination. There is no single system which is completely hierarchic, completely disorganized, or completely polycentric; they exist in combination, and it has been inferred that most existing networks are hierarchical in nature. Complex adaptive systems are fully adaptive: they can change according to the environment.
Polycentric means they have different centres of authority, so even if you remove one of the components they would still function as a network. Disorganized systems are random in nature but still follow some rules. And then there are cybernetic systems, which take the output, adapt to it, and change their function. Hierarchical systems are of three types: ordered, nested, and flow. We are considering the nested systems because, as the graphs show, complex systems can be divided into subunits which can be further divided into subunits. Take any example, even a Facebook network: you can divide it into groups of friends, and those can be divided into further groups. So most of the networks we consider here are nested in nature.

So how do hierarchies emerge? This has been wonderfully explained by the Bonabeau model. Nodes are represented on a lattice.
One node tries to interact with the neighbouring nodes; if there is an empty space, it shifts there, but if a node already exists there, there is a fight between them on the basis of the fitness they have, and whichever node wins takes that place while the other node shifts back. Over time the network ages, and nodes either emerge as winners or, more often, consistently end up as losers, and that is how hierarchies arrange themselves: the winner nodes form one level, the loser nodes form the lower level, and that is why most networks have hierarchies. If you watch this lattice, the nodes randomly keep moving and swapping places, and that is why nature has so many hierarchies in its networks.

So how do we extract the hierarchies from the networks? We need clustering: we perform clustering on the networks and extract the hierarchies.
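That winner/loser dynamic can be played out in a few lines. This is only a toy sketch of a Bonabeau-style model on a one-dimensional ring, not the exact model from the slide; the sigmoid steepness `eta` and the +1/−1 fitness updates are illustrative choices:

```python
import math
import random

def bonabeau_step(grid, fitness, eta=1.0, rng=random):
    """One interaction: an agent moves into empty space, or fights its neighbour."""
    n = len(grid)
    i = rng.randrange(n)
    j = (i + 1) % n                       # the neighbouring site on the ring
    a, b = grid[i], grid[j]
    if a is None or b is None:            # empty space: just shift over
        grid[i], grid[j] = b, a
        return
    # Chance that the mover a beats b grows with the fitness gap (a sigmoid);
    # the gap is clamped so math.exp never overflows.
    gap = max(-60.0, min(60.0, eta * (fitness[a] - fitness[b])))
    p_win = 1.0 / (1.0 + math.exp(-gap))
    if rng.random() < p_win:
        fitness[a] += 1                   # winner climbs ...
        fitness[b] -= 1                   # ... loser sinks
        grid[i], grid[j] = b, a           # and a takes b's place
    else:
        fitness[b] += 1
        fitness[a] -= 1                   # a loses the fight and stays put
```

Run many steps and the fitness values spread apart: a few agents accumulate wins while the rest settle into lower levels, which is the hierarchy formation described above.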
The hierarchies are visually represented through dendrograms. You must all have seen dendrograms in various research papers, where graphs are represented through them, but did we really understand them? A dendrogram represents how one point is linked to another through a distance, or dissimilarity, and the linkage decides how one cluster is joined to another. Various metrics can be used for the distance, such as Euclidean distance or Manhattan distance, and linkage likewise has various methods; you can use any of them to extract the clusters. But here we make some assumptions. The clusters are non-overlapping: one point cannot belong to two clusters. It is not like fuzzy clustering, where we assign a point a degree of membership in two clusters together; here a point belongs to one cluster only. And the dendrogram is assumed to be binary.
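In practice this clustering step is usually `scipy.cluster.hierarchy.linkage` plus `dendrogram`, but the idea fits in a short sketch. Below is a minimal single-linkage agglomerative pass over a precomputed distance matrix; the tiny matrix is made up for illustration:

```python
def single_linkage(dist):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    `dist` is a symmetric matrix of pairwise distances. Returns the merge
    history [(cluster_a, cluster_b, distance), ...], i.e. a dendrogram.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = the closest pair across them.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda xy: min(dist[p][q] for p in xy[0] for q in xy[1]),
        )
        d = min(dist[p][q] for p in a for q in b)
        merges.append((set(a), set(b), d))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Two tight pairs far apart: points 0,1 are close, and so are 2,3.
dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [9, 9, 2, 0]]
print(single_linkage(dist))
```

Because every step merges exactly two clusters, the merge history is a binary tree, which is the binary-dendrogram assumption mentioned above.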
That assumption holds because hierarchical clustering is of two types, agglomerative and divisive. Agglomerative is the simpler clustering algorithm, with a complexity of only n² log n, while divisive is exponential in nature, 2 to the power n. So we only consider agglomerative clustering, and in agglomerative clustering we only ever combine two clusters at a time; since we always merge exactly two clusters, the result is a binary tree, which is why we represent it as a binary tree here.

Now we move one step ahead: having obtained the clusters and the dendrogram, we want to predict the missing links. We assign probabilities to the internal nodes of the dendrogram by dividing the number of actual edges by the number of possible edges. So consider the simple sample graph with nodes A, B, C, D, where the candidates for missing links are A-C, A-D, and B-D. How do we get the root r0 as 0.25 and r1 as one?
If you look, there is only one actual edge between A and B, and the number of possible edges between A and B is also one, so the probability is 1/1, which equals one. Similarly for r0: the number of possible edges between {A, B} and {C, D} is four, but there is only one actual edge between them, since they are connected only through the link between B and C, so it becomes 1/4, which is 0.25.

Now we have to infer the posterior probability for the distribution. We calculate the likelihood for each internal node and then the likelihood for the entire tree. The prior is the uniform prior, because we have no information about the dendrogram beforehand, so we use a non-informative, uniform prior over dendrograms. But if you look at P(B), the evidence, it is difficult to calculate. Here I have used only three dendrograms, but in reality we are dealing with hundreds and thousands of dendrograms, and it is computationally infeasible to calculate the evidence over all of them.
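The r1 and r0 numbers from the example can be reproduced directly. A small sketch, using the toy A-B-C-D path graph from the slide:

```python
# Observed edges of the toy graph: the path A-B, B-C, C-D.
edges = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D")]}

def internal_probability(left, right):
    """p_r = (actual edges crossing the split) / (possible edges across it)."""
    actual = sum(1 for u in left for v in right if frozenset((u, v)) in edges)
    possible = len(left) * len(right)
    return actual / possible

r1 = internal_probability({"A"}, {"B"})             # the node joining A and B
r0 = internal_probability({"A", "B"}, {"C", "D"})   # the root split
print(r1, r0)  # 1.0 0.25
```

These per-node ratios are the maximum-likelihood estimates of the connection probabilities in the hierarchical random graph model, which is why they feed directly into the tree likelihood discussed next.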
So we have to use a sampling technique. What if we go for direct sampling? Direct sampling constructs thousands or millions of dendrograms, calculates the likelihood of each dendrogram obtained, and then checks its feasibility. But again, that is not feasible: constructing so many dendrograms and then calculating the likelihood of each takes too much time, it is computationally expensive, it has memory issues, and it is completely impossible for large networks. Therefore we switch to Monte Carlo, which is an approximate sampling technique. Here you devise a new dendrogram depending on the previous state and calculate the likelihood ratios for the internal nodes and the complete trees. But plain Monte Carlo is random in nature, so it will most probably spend more time in the low-probability regions; it does not know that higher-probability regions also exist.
It will keep circulating in those low-probability regions, and we may end up with dendrograms that are not good enough for inferring the networks.

So there is something called MCMC sampling, which uses Metropolis-Hastings. Here, from every dendrogram we generate a new dendrogram depending on the previous state, and then we have an acceptance probability, where we check whether this dendrogram is suitable for our link prediction or not. If it is suitable, we select it and move to the next step. There is also a burn-in period, during which we are not yet sampling from the true distribution; we keep generating dendrograms until we reach it, and that initial stretch is known as the burn-in period. After a certain time the generated dendrograms become stable and follow the same distribution, and that is how we are able to infer the posterior distribution for Bayes' theorem using this kind of sampling.
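The accept/reject loop just described is ordinary Metropolis-Hastings. Here is a generic sketch on a toy three-state target rather than on dendrograms; the state space, the weights, and the symmetric neighbour proposal are all illustrative stand-ins:

```python
import math
import random

def metropolis_hastings(log_target, propose, start, n_samples,
                        burn_in=1000, rng=random):
    """Metropolis-Hastings with a symmetric proposal: uphill moves are always
    accepted, downhill moves with probability equal to the likelihood ratio."""
    state, samples = start, []
    for step in range(burn_in + n_samples):
        candidate = propose(state, rng)
        delta = log_target(candidate) - log_target(state)
        if delta >= 0 or rng.random() < math.exp(delta):
            state = candidate
        if step >= burn_in:               # discard the burn-in stretch
            samples.append(state)
    return samples

# Toy unnormalised target over three states; state 1 dominates.
weights = {0: 1.0, 1: 100.0, 2: 1.0}
samples = metropolis_hastings(
    log_target=lambda s: math.log(weights[s]),
    propose=lambda s, r: r.choice([x for x in weights if x != s]),
    start=0,
    n_samples=2000,
    rng=random.Random(0),
)
```

After burn-in the chain spends most of its time in the high-weight state, which mirrors how the dendrogram sampler concentrates on high-likelihood trees instead of wandering the low-probability regions.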
To recap: we pick an initial state, generate a candidate for the next state, evaluate the move by comparing the likelihoods under the target distribution, accept the move if it is more likely, otherwise accept it with some probability, and iterate over time. That gives us samples from the probability distribution, which we can feed into Bayes' theorem and calculate.

Now there is one step further. Since we want to predict the missing links, we cannot stop at Bayes' theorem; we want to predict the probability of the candidate missing edges. For that we use a ranking and averaging method: given the posterior probability distributions over dendrograms, we take the average across the samples, obtain a consensus dendrogram, and pick the candidate with the highest average probability. I calculated this for A-D, A-C, and B-D, and A-D had the maximum average probability, 0.385.
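That ranking-and-averaging step is just a mean over the sampled dendrograms' connection probabilities. A sketch: the three per-sample probability dictionaries below are invented for illustration (only the 0.385 average for A-D echoes the talk's number):

```python
def average_link_scores(candidates, sampled_probs):
    """Mean connection probability of each candidate edge over MCMC samples."""
    n = len(sampled_probs)
    return {e: sum(p[e] for p in sampled_probs) / n for e in candidates}

candidates = [("A", "D"), ("A", "C"), ("B", "D")]
# One dict per sampled dendrogram (values invented for illustration).
sampled_probs = [
    {("A", "D"): 0.400, ("A", "C"): 0.30, ("B", "D"): 0.10},
    {("A", "D"): 0.390, ("A", "C"): 0.20, ("B", "D"): 0.20},
    {("A", "D"): 0.365, ("A", "C"): 0.25, ("B", "D"): 0.15},
]
scores = average_link_scores(candidates, sampled_probs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Ranking the averaged scores picks A-D as the most likely missing link, which is the consensus verdict described in the talk.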
So we can say that the missing link exists between A and D, and there is no missing link between A and C or between B and D.

The advantage of this method is that when we compared it against other methods, such as the Jaccard coefficient, the degree product, and shortest paths, which are similarity-based methods, and other baselines like pure chance, which is random, it could capture both assortative and disassortative relations. What that means is that we are not fixing whether the probability at the root is higher, increasing, or decreasing; it can vary either way, so the method captures both kinds of relation, and it is not fixed that the root will have the higher or the lower probability. It is also flexible: it can accommodate different kinds of distributions. There is a variation of this algorithm which has been used to capture the local structure, while this algorithm captures the global structure.
But there's a 471 00:18:30,080 --> 00:18:33,600 variation of this which is used for 472 00:18:31,440 --> 00:18:35,440 capturing the local structure, which 473 00:18:33,600 --> 00:18:37,200 is hierarchical exponential random 474 00:18:35,440 --> 00:18:39,280 graphs. So we just change the 475 00:18:37,200 --> 00:18:41,679 distribution, and it's flexible enough to 476 00:18:39,280 --> 00:18:44,240 accommodate any kind of network. It has 477 00:18:41,679 --> 00:18:46,240 a better AUC statistic than any of the other 478 00:18:44,240 --> 00:18:48,400 methods compared over here. That's 479 00:18:46,240 --> 00:18:50,320 why this method has been tested on many 480 00:18:48,400 --> 00:18:54,640 networks, and it resulted in 481 00:18:50,320 --> 00:18:56,880 better AUC statistics and R squared. I 482 00:18:54,640 --> 00:18:58,640 applied this to the protein-protein 483 00:18:56,880 --> 00:19:00,880 interaction networks which I'm 484 00:18:58,640 --> 00:19:03,760 currently working on for the development 485 00:19:00,880 --> 00:19:06,640 of the package.
So when I applied it, 486 00:19:03,760 --> 00:19:08,880 there's a library called pyHRG; 487 00:19:06,640 --> 00:19:11,440 it's a partial implementation of this 488 00:19:08,880 --> 00:19:13,600 algorithm which is available in Python. 489 00:19:11,440 --> 00:19:15,200 If you want the full implementation, 490 00:19:13,600 --> 00:19:18,880 that is available in other languages 491 00:19:15,200 --> 00:19:21,200 also, but for Python only 492 00:19:18,880 --> 00:19:22,720 the partial implementation of this 493 00:19:21,200 --> 00:19:26,080 is available, and there's a package 494 00:19:22,720 --> 00:19:28,320 called HRG. So first this package does 495 00:19:26,080 --> 00:19:30,720 the HRG fit and converts it into a 496 00:19:28,320 --> 00:19:32,559 dendrogram, and then, if you see, it 497 00:19:30,720 --> 00:19:36,240 calculates the probabilities 498 00:19:32,559 --> 00:19:39,760 of the graph using the actual edges 499 00:19:36,240 --> 00:19:41,039 and the existing edges. So this is 500 00:19:39,760 --> 00:19:42,559 how you are able to predict the 501 00:19:41,039 --> 00:19:44,640 probability over here, and through this 502 00:19:42,559 --> 00:19:46,559 you are able to construct the consensus 503 00:19:44,640 --> 00:19:48,720 dendrogram and predict 504 00:19:46,559 --> 00:19:50,400 which of the missing links should 505 00:19:48,720 --> 00:19:53,200 exist in the graph. This is 506 00:19:50,400 --> 00:19:56,320 really helpful because these networks 507 00:19:53,200 --> 00:19:58,720 are 508 00:19:56,320 --> 00:20:00,559 incomplete in nature and it is difficult 509 00:19:58,720 --> 00:20:03,280 to predict the pathways. Proteins 510 00:20:00,559 --> 00:20:05,760 have pathways, and they help 511 00:20:03,280 --> 00:20:07,360 in the detection of any kind of disease.
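The slide doesn't show pyHRG's exact function names, so rather than guess its API, here is a from-scratch sketch of the core quantity the talk describes: each internal split of the dendrogram gets a probability equal to observed crossing edges divided by possible crossing edges, and a candidate missing link is scored by averaging that probability over sampled dendrograms. All names and the dendrogram encoding below are my own simplification:

```python
import networkx as nx

def internal_probability(G, left_leaves, right_leaves):
    """HRG probability at an internal node: edges observed between the
    two sides of the split, divided by the possible edges L_r * R_r."""
    crossing = sum(1 for u in left_leaves for v in right_leaves
                   if G.has_edge(u, v))
    return crossing / (len(left_leaves) * len(right_leaves))

def score_missing_edge(G, dendrograms, u, v):
    """Ranking-and-averaging step: for each sampled dendrogram, take the
    probability at the lowest split separating u and v, then average."""
    probs = []
    for splits in dendrograms:
        # each dendrogram is encoded as (left, right) leaf-set pairs,
        # ordered from the lowest split upward; use the first separating one
        for left, right in splits:
            if (u in left and v in right) or (u in right and v in left):
                probs.append(internal_probability(G, left, right))
                break
    return sum(probs) / len(probs)
```

Candidate non-edges can then be ranked by this averaged score, which is how the talk's A–D pair comes out on top of A–C and B–D.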
Uh 512 00:20:05,760 --> 00:20:09,120 for example, if I want to know whether a 513 00:20:07,360 --> 00:20:11,280 person is suffering from schizophrenia 514 00:20:09,120 --> 00:20:13,200 or anything, I will study the 515 00:20:11,280 --> 00:20:15,280 metabolic pathways and regulatory mechanisms 516 00:20:13,200 --> 00:20:16,880 through these kinds of networks, and that 517 00:20:15,280 --> 00:20:18,640 becomes completely invisible if there 518 00:20:16,880 --> 00:20:20,960 are missing proteins or missing links in 519 00:20:18,640 --> 00:20:22,480 between them, and we are not able to 520 00:20:20,960 --> 00:20:24,240 get the correct predictions from them. 521 00:20:22,480 --> 00:20:26,320 So this algorithm really helped me to 522 00:20:24,240 --> 00:20:30,600 get the correct predictions and to get 523 00:20:26,320 --> 00:20:30,600 better statistics for the disease. 524 00:20:30,640 --> 00:20:35,280 The fact is that there's only a partial 525 00:20:33,360 --> 00:20:37,120 implementation for this. It supports 526 00:20:35,280 --> 00:20:40,000 only fitting an HRG model to the 527 00:20:37,120 --> 00:20:42,000 network, finding a consensus dendrogram, and 528 00:20:40,000 --> 00:20:44,559 merging multiple consensus dendrograms 529 00:20:42,000 --> 00:20:46,880 into a new consensus dendrogram. pyHRG is 530 00:20:44,559 --> 00:20:49,760 based on NetworkX and Matplotlib. 531 00:20:46,880 --> 00:20:52,159 So this package, pyHRG, has been 532 00:20:49,760 --> 00:20:54,159 developed using the NetworkX library, which 533 00:20:52,159 --> 00:20:56,240 already exists, but as I 534 00:20:54,159 --> 00:20:58,000 already said, it is not for 535 00:20:56,240 --> 00:20:59,600 comprehensive analysis. It can manage 536 00:20:58,000 --> 00:21:01,679 simple networks, but for 537 00:20:59,600 --> 00:21:04,240 comprehensive network analysis we do 538 00:21:01,679 --> 00:21:06,400 need such algorithms, though they are based 539 00:21:04,240 --> 00:21:09,120 on these libraries.
540 00:21:06,400 --> 00:21:11,919 The GitHub repo for this algorithm can 541 00:21:09,120 --> 00:21:15,039 be found over here, and you can find 542 00:21:11,919 --> 00:21:17,840 three files over here, which are the HRG 543 00:21:15,039 --> 00:21:20,320 fit, the dendrogram, and the missing 544 00:21:17,840 --> 00:21:22,530 link prediction. 545 00:21:20,320 --> 00:21:29,360 Thank you. 546 00:21:22,530 --> 00:21:29,360 [Applause]