Welcome back, friends, to the data and AI specialist track at PyCon AU 2025. I am very excited to introduce the last speaker in this block of talks. Please give a very big round of applause to Niha and her talk, "Hierarchical clustering: finding the awkward reunions." Big round of applause, everybody.

First of all, I would like to thank you all. This is my first time attending the PyCon conference, and I am greatly privileged to be here and to have been selected as a speaker. I want to give you a brief introduction of myself. I am a second-year PhD student at the University of Newcastle, Australia. I work at the intersection of computer science and healthcare, where I develop software solutions, primarily in Python, for complex diseases and disorders. Before moving to Australia, I gained experience as a systems engineer at Tata Consultancy Services and HCL Technologies.
My research interests include medical software engineering, machine learning in healthcare, and Python for health informatics. I also work with communities and NGOs which provide AI and Python training in various underprivileged schools across Australia. We also supply laptops to the students, and I would like to say that nowadays technology has advanced so much that students are willing to learn these things at such a young age. I was surprised to find that they know about even the advanced things happening in Python and AI. Besides that, I have provided my LinkedIn profile here.

As a PhD student, I have developed a framework for oncology which makes predictions on the data, performs survival analysis, and generates various kinds of reports and visualizations.
The other thing I am working on is developing a Python package for complex network analysis. Currently there are libraries like NetworkX and Matplotlib which can be used for network analysis, but there is no comprehensive library for biological network analysis. Complex network analysis really helps in predicting diseases and disease outcomes, but unfortunately there was no comprehensive package for that.

So today I am going to present a generative model called hierarchical random graphs. It is derived from hierarchical clustering, an algorithm for extracting the hierarchies from networks, and the generative model is based on Monte Carlo simulation and probabilistic inference. The first question that comes to mind is: why do we even need to find the reunions, that is, why do we even need to predict the missing links? Whatever networks we observe in real life are incomplete, or only partially observed.
Hence, whatever platform you use, most of the links are missing from the networks you extract from it, and most real-world networks are part of complex systems. Hidden links can significantly influence the system behaviour. If we are able to predict the missing links, it improves our understanding of the system and leads to optimization of the network. When there are missing links, the network becomes fragile; it is not robust enough, you are not able to do any kind of analysis on it, and it leads to misleading metrics.
For example, if you want to know the degree of a node, you analyze the node to check its influence; if there are missing links, you cannot compute the correct degree of the node or tell whether it is influential or not. Then there are faulty predictions, because you are not able to infer enough from those networks, and there is reduced connectivity.

So what I did was analyze a random network using a metric called the Adjusted Rand Index (ARI). It compares the community structures of the network, so as you remove links from it fraction by fraction, you can see the score decreasing: first it was 0.11, then it became 0.10 and 0.07. That means when links are missing, the community structure of the network keeps depleting. The ARI ranges between −1 and 1, where 1 means perfect agreement between the clusterings and −1 means disagreement.
So as the ARI decreases, the network structure is also depleting.

Here is an overview of the various missing-link prediction methods available today. There are experimental methods, where you do not infer the missing links but instead validate, on the basis of experimental approaches, that there should be a link between two nodes. Then there are similarity-based methods, where you infer from the structure of the nodes, on the basis of the node degree, the average path length, or the kinds of bridges it has, that certain nodes should be linked and those links are missing. Then there are matrix and latent-factor methods, which convert the network into an adjacency matrix, decompose the matrix into subcomponents, and during reconstruction infer which links are missing. We have used probabilistic methods because, compared to the other methods, they are more interpretable.
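The Adjusted Rand Index used in that experiment can be computed by pair counting. Here is a minimal pure-Python sketch (in practice you would likely reach for `sklearn.metrics.adjusted_rand_score`); the example labelings are illustrative, not the talk's data:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two flat clusterings of the same items, via pair counting."""
    n = len(labels_a)
    # Pairs grouped together in both clusterings, in A alone, and in B alone.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    return (together - expected) / (max_index - expected)

# Identical partitions (up to relabelling) score a perfect 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Comparing the communities detected before and after link removal with a function like this is how the depletion from 0.11 down to 0.07 would show up.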
The other methods can predict the missing links, but you would not know how they actually predicted them: is that missing link really true, does it really exist? The other methods also treat the network as homogeneous, meaning they assume all the nodes in the network are of the same type. But consider a research network, where the nodes can represent researchers, authors, and citation indexes: the nodes are different, heterogeneous in nature. When you use those approaches, they are not able to account for the heterogeneity of the network. They only capture the local structure, while the probabilistic methods are able to capture the global structure as well. The previous methods are also exhaustive in nature, whereas the probabilistic methods only require a probability distribution, from which they infer the parameters; that is why they are non-exhaustive.

So, networks are a representation of complex systems.
A network means nodes connected by edges. The complex systems which exist in real life are collections of many interacting components. Examples of complex systems include biological networks, technological networks, social networks like the Facebook connections, and ecological networks, and believe me, most existing networks are complex in nature. Even if you consider your brain, how the neurons work, that too is a complex system. All complex systems are networks, but not all networks are complex systems. Complex systems have certain properties which distinguish them from plain networks. They are nonlinear: a small input change can lead to a big change. They are spontaneous: they do not require any kind of central leader and can self-organize. For example, a flock of birds does not need a leader; the birds self-organize into groups and fly.
They are adaptive: they are influenced by the environment and use feedback, changing according to the output. And they are emergent: they function as a whole, not as simple singular components, but as multiple components working together. There are various types of complex systems, but we are specifically focusing on hierarchic systems. Why? Because complex adaptive, polycentric, and disorganized systems all exist in combination. There is no single system which is completely hierarchic, completely disorganized, or completely polycentric; they exist in combination, and it has been inferred that most existing networks are hierarchical in nature. Complex adaptive systems are fully adaptive: they can change according to the environment.
Polycentric means they have different centres of authority, so even if you remove one of the components they would still function as a network. Disorganized systems are random in nature but still follow some rules. And then there are cybernetic systems, which take the output, adapt to it, and change their function. Hierarchical systems are of three types: ordered, nested, and flow. We are considering the nested systems because, as the graphs show, complex systems can be divided into subunits which can be further divided into subunits. Take any example, even a Facebook network: you can divide it into groups of friends, and those can be divided into further groups. So most of the networks we consider here are nested in nature.

So how do hierarchies emerge? This has been wonderfully explained by the Bonabeau model. Nodes are represented on a lattice.
One node tries to interact with the neighbouring nodes; if there is an empty space, it shifts there, but if a node already exists there, there is a fight between them on the basis of the fitness they have, and whichever node wins takes that place while the other node shifts back. Over time the network ages, and nodes either emerge as winners or, more often, consistently end up as losers, and that is how hierarchies arrange themselves: the winner nodes form one level, the loser nodes form the lower level, and that is why most networks have hierarchies. If you watch this lattice, the nodes randomly keep moving and swapping places, and that is why nature has so many hierarchies in its networks.

So how do we extract the hierarchies from the networks? We need clustering: we perform clustering on the networks and extract the hierarchies.
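That winner/loser dynamic can be played out in a few lines. This is only a toy sketch of a Bonabeau-style model on a one-dimensional ring, not the exact model from the slide; the sigmoid steepness `eta` and the +1/−1 fitness updates are illustrative choices:

```python
import math
import random

def bonabeau_step(grid, fitness, eta=1.0, rng=random):
    """One interaction: an agent moves into empty space, or fights its neighbour."""
    n = len(grid)
    i = rng.randrange(n)
    j = (i + 1) % n                       # the neighbouring site on the ring
    a, b = grid[i], grid[j]
    if a is None or b is None:            # empty space: just shift over
        grid[i], grid[j] = b, a
        return
    # Chance that the mover a beats b grows with the fitness gap (a sigmoid);
    # the gap is clamped so math.exp never overflows.
    gap = max(-60.0, min(60.0, eta * (fitness[a] - fitness[b])))
    p_win = 1.0 / (1.0 + math.exp(-gap))
    if rng.random() < p_win:
        fitness[a] += 1                   # winner climbs ...
        fitness[b] -= 1                   # ... loser sinks
        grid[i], grid[j] = b, a           # and a takes b's place
    else:
        fitness[b] += 1
        fitness[a] -= 1                   # a loses the fight and stays put
```

Run many steps and the fitness values spread apart: a few agents accumulate wins while the rest settle into lower levels, which is the hierarchy formation described above.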
The hierarchies are visually represented through dendrograms. You must all have seen dendrograms in various research papers, where graphs are represented through them, but did we really understand them? A dendrogram represents how one point is linked to another through a distance, or dissimilarity, and the linkage decides how one cluster is joined to another. Various metrics can be used for the distance, such as Euclidean distance or Manhattan distance, and linkage likewise has various methods; you can use any of them to extract the clusters. But here we make some assumptions. The clusters are non-overlapping: one point cannot belong to two clusters. It is not like fuzzy clustering, where we assign a point a degree of membership in two clusters together; here a point belongs to one cluster only. And the dendrogram is assumed to be binary.
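In practice this clustering step is usually `scipy.cluster.hierarchy.linkage` plus `dendrogram`, but the idea fits in a short sketch. Below is a minimal single-linkage agglomerative pass over a precomputed distance matrix; the tiny matrix is made up for illustration:

```python
def single_linkage(dist):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    `dist` is a symmetric matrix of pairwise distances. Returns the merge
    history [(cluster_a, cluster_b, distance), ...], i.e. a dendrogram.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = the closest pair across them.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda xy: min(dist[p][q] for p in xy[0] for q in xy[1]),
        )
        d = min(dist[p][q] for p in a for q in b)
        merges.append((set(a), set(b), d))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Two tight pairs far apart: points 0,1 are close, and so are 2,3.
dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [9, 9, 2, 0]]
print(single_linkage(dist))
```

Because every step merges exactly two clusters, the merge history is a binary tree, which is the binary-dendrogram assumption mentioned above.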
That assumption holds because hierarchical clustering is of two types, agglomerative and divisive. Agglomerative is the simpler clustering algorithm, with a complexity of only n² log n, while divisive is exponential in nature, 2 to the power n. So we only consider agglomerative clustering, and in agglomerative clustering we only ever combine two clusters at a time; since we always merge exactly two clusters, the result is a binary tree, which is why we represent it as a binary tree here.

Now we move one step ahead: having obtained the clusters and the dendrogram, we want to predict the missing links. We assign probabilities to the internal nodes of the dendrogram by dividing the number of actual edges by the number of possible edges. So consider the simple sample graph with nodes A, B, C, D, where the candidates for missing links are A-C, A-D, and B-D. How do we get the root r0 as 0.25 and r1 as one?
If you look, there is only one actual edge between A and B, and the number of possible edges between A and B is also one, so the probability is 1/1, which equals one. Similarly for r0: the number of possible edges between {A, B} and {C, D} is four, but there is only one actual edge between them, since they are connected only through the link between B and C, so it becomes 1/4, which is 0.25.

Now we have to infer the posterior probability for the distribution. We calculate the likelihood for each internal node and then the likelihood for the entire tree. The prior is the uniform prior, because we have no information about the dendrogram beforehand, so we use a non-informative, uniform prior over dendrograms. But if you look at P(B), the evidence, it is difficult to calculate. Here I have used only three dendrograms, but in reality we are dealing with hundreds and thousands of dendrograms, and it is computationally infeasible to calculate the evidence over all of them.
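The r1 and r0 numbers from the example can be reproduced directly. A small sketch, using the toy A-B-C-D path graph from the slide:

```python
# Observed edges of the toy graph: the path A-B, B-C, C-D.
edges = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D")]}

def internal_probability(left, right):
    """p_r = (actual edges crossing the split) / (possible edges across it)."""
    actual = sum(1 for u in left for v in right if frozenset((u, v)) in edges)
    possible = len(left) * len(right)
    return actual / possible

r1 = internal_probability({"A"}, {"B"})             # the node joining A and B
r0 = internal_probability({"A", "B"}, {"C", "D"})   # the root split
print(r1, r0)  # 1.0 0.25
```

These per-node ratios are the maximum-likelihood estimates of the connection probabilities in the hierarchical random graph model, which is why they feed directly into the tree likelihood discussed next.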
So we have to use a sampling technique. What if we go for direct sampling? Direct sampling constructs thousands or millions of dendrograms, calculates the likelihood of each dendrogram obtained, and then checks its feasibility. But again, that is not feasible: constructing so many dendrograms and then calculating the likelihood of each takes too much time, it is computationally expensive, it has memory issues, and it is completely impossible for large networks. Therefore we switch to Monte Carlo, which is an approximate sampling technique. Here you devise a new dendrogram depending on the previous state and calculate the likelihood ratios for the internal nodes and the complete trees. But plain Monte Carlo is random in nature, so it will most probably spend more time in the low-probability regions; it does not know that higher-probability regions also exist.
It will keep circulating in those low-probability regions, and we may end up with dendrograms that are not good enough for inferring the networks.

So there is something called MCMC sampling, which uses Metropolis-Hastings. Here, from every dendrogram we generate a new dendrogram depending on the previous state, and then we have an acceptance probability, where we check whether this dendrogram is suitable for our link prediction or not. If it is suitable, we select it and move to the next step. There is also a burn-in period, during which we are not yet sampling from the true distribution; we keep generating dendrograms until we reach it, and that initial stretch is known as the burn-in period. After a certain time the generated dendrograms become stable and follow the same distribution, and that is how we are able to infer the posterior distribution for Bayes' theorem using this kind of sampling.
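The accept/reject loop just described is ordinary Metropolis-Hastings. Here is a generic sketch on a toy three-state target rather than on dendrograms; the state space, the weights, and the symmetric neighbour proposal are all illustrative stand-ins:

```python
import math
import random

def metropolis_hastings(log_target, propose, start, n_samples,
                        burn_in=1000, rng=random):
    """Metropolis-Hastings with a symmetric proposal: uphill moves are always
    accepted, downhill moves with probability equal to the likelihood ratio."""
    state, samples = start, []
    for step in range(burn_in + n_samples):
        candidate = propose(state, rng)
        delta = log_target(candidate) - log_target(state)
        if delta >= 0 or rng.random() < math.exp(delta):
            state = candidate
        if step >= burn_in:               # discard the burn-in stretch
            samples.append(state)
    return samples

# Toy unnormalised target over three states; state 1 dominates.
weights = {0: 1.0, 1: 100.0, 2: 1.0}
samples = metropolis_hastings(
    log_target=lambda s: math.log(weights[s]),
    propose=lambda s, r: r.choice([x for x in weights if x != s]),
    start=0,
    n_samples=2000,
    rng=random.Random(0),
)
```

After burn-in the chain spends most of its time in the high-weight state, which mirrors how the dendrogram sampler concentrates on high-likelihood trees instead of wandering the low-probability regions.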
To recap: we pick an initial state, generate a candidate for the next state, evaluate the move by comparing the likelihoods under the target distribution, accept the move if it is more likely, otherwise accept it with some probability, and iterate over time. That gives us samples from the probability distribution, which we can feed into Bayes' theorem and calculate.

Now there is one step further. Since we want to predict the missing links, we cannot stop at Bayes' theorem; we want to predict the probability of the candidate missing edges. For that we use a ranking and averaging method: given the posterior probability distributions over dendrograms, we take the average across the samples, obtain a consensus dendrogram, and pick the candidate with the highest average probability. I calculated this for A-D, A-C, and B-D, and A-D had the maximum average probability, 0.385.
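That ranking-and-averaging step is just a mean over the sampled dendrograms' connection probabilities. A sketch: the three per-sample probability dictionaries below are invented for illustration (only the 0.385 average for A-D echoes the talk's number):

```python
def average_link_scores(candidates, sampled_probs):
    """Mean connection probability of each candidate edge over MCMC samples."""
    n = len(sampled_probs)
    return {e: sum(p[e] for p in sampled_probs) / n for e in candidates}

candidates = [("A", "D"), ("A", "C"), ("B", "D")]
# One dict per sampled dendrogram (values invented for illustration).
sampled_probs = [
    {("A", "D"): 0.400, ("A", "C"): 0.30, ("B", "D"): 0.10},
    {("A", "D"): 0.390, ("A", "C"): 0.20, ("B", "D"): 0.20},
    {("A", "D"): 0.365, ("A", "C"): 0.25, ("B", "D"): 0.15},
]
scores = average_link_scores(candidates, sampled_probs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Ranking the averaged scores picks A-D as the most likely missing link, which is the consensus verdict described in the talk.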
So we can say that the missing link exists between A and D, and there is no missing link between A and C or between B and D.

The advantage of this method is that when we compared it against other methods, such as the Jaccard coefficient, the degree product, and shortest paths, which are similarity-based methods, and other baselines like pure chance, which is random, it could capture both assortative and disassortative relations. What that means is that we are not fixing whether the probability at the root is higher, increasing, or decreasing; it can vary either way, so the method captures both kinds of relation, and it is not fixed that the root will have the higher or the lower probability. It is also flexible: it can accommodate different kinds of distributions. There is a variation of this algorithm which has been used to capture the local structure, while this algorithm captures the global structure.
But there's a 471 00:18:30,080 --> 00:18:33,600 variation of this which is used for 472 00:18:31,440 --> 00:18:35,440 capturing the local structure, which 473 00:18:33,600 --> 00:18:37,200 is hierarchical exponential random 474 00:18:35,440 --> 00:18:39,280 graphs. So we just change the 475 00:18:37,200 --> 00:18:41,679 distribution, and it's flexible enough to 476 00:18:39,280 --> 00:18:44,240 accommodate any kind of network. It has 477 00:18:41,679 --> 00:18:46,240 a better AUC statistic than any of the other 478 00:18:44,240 --> 00:18:48,400 methods compared over here. That's 479 00:18:46,240 --> 00:18:50,320 why this method has been tested on many 480 00:18:48,400 --> 00:18:54,640 networks, and it resulted in 481 00:18:50,320 --> 00:18:56,880 better AUC statistics and R squared. I 482 00:18:54,640 --> 00:18:58,640 applied this to the protein-protein 483 00:18:56,880 --> 00:19:00,880 interaction networks which I'm 484 00:18:58,640 --> 00:19:03,760 currently working on for the development 485 00:19:00,880 --> 00:19:06,640 of the package.
So when I applied it, 486 00:19:03,760 --> 00:19:08,880 there's a library called pyHRG; 487 00:19:06,640 --> 00:19:11,440 it's a partial implementation of this 488 00:19:08,880 --> 00:19:13,600 algorithm which is available in Python. 489 00:19:11,440 --> 00:19:15,200 If you want the full implementation, 490 00:19:13,600 --> 00:19:18,880 that is available in other languages 491 00:19:15,200 --> 00:19:21,200 also, but for Python only 492 00:19:18,880 --> 00:19:22,720 the partial implementation of this 493 00:19:21,200 --> 00:19:26,080 is available, and there's a package 494 00:19:22,720 --> 00:19:28,320 called HRG. So first this package does 495 00:19:26,080 --> 00:19:30,720 the HRG fit and converts it into a 496 00:19:28,320 --> 00:19:32,559 dendrogram, and then, if you see, it 497 00:19:30,720 --> 00:19:36,240 calculates the probabilities 498 00:19:32,559 --> 00:19:39,760 of the graph using the actual edges 499 00:19:36,240 --> 00:19:41,039 and the existing edges. So this is 500 00:19:39,760 --> 00:19:42,559 how you are able to predict the 501 00:19:41,039 --> 00:19:44,640 probability over here, and through this 502 00:19:42,559 --> 00:19:46,559 you are able to construct the consensus 503 00:19:44,640 --> 00:19:48,720 dendrogram and predict 504 00:19:46,559 --> 00:19:50,400 which of the missing links should 505 00:19:48,720 --> 00:19:53,200 exist in the graph. This is 506 00:19:50,400 --> 00:19:56,320 really helpful because these networks 507 00:19:53,200 --> 00:19:58,720 are 508 00:19:56,320 --> 00:20:00,559 incomplete in nature and it is difficult 509 00:19:58,720 --> 00:20:03,280 to predict the pathways. Proteins 510 00:20:00,559 --> 00:20:05,760 have pathways, and they help 511 00:20:03,280 --> 00:20:07,360 in the detection of any kind of disease.
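The slide doesn't show pyHRG's exact function names, so rather than guess its API, here is a from-scratch sketch of the core quantity the talk describes: each internal split of the dendrogram gets a probability equal to observed crossing edges divided by possible crossing edges, and a candidate missing link is scored by averaging that probability over sampled dendrograms. All names and the dendrogram encoding below are my own simplification:

```python
import networkx as nx

def internal_probability(G, left_leaves, right_leaves):
    """HRG probability at an internal node: edges observed between the
    two sides of the split, divided by the possible edges L_r * R_r."""
    crossing = sum(1 for u in left_leaves for v in right_leaves
                   if G.has_edge(u, v))
    return crossing / (len(left_leaves) * len(right_leaves))

def score_missing_edge(G, dendrograms, u, v):
    """Ranking-and-averaging step: for each sampled dendrogram, take the
    probability at the lowest split separating u and v, then average."""
    probs = []
    for splits in dendrograms:
        # each dendrogram is encoded as (left, right) leaf-set pairs,
        # ordered from the lowest split upward; use the first separating one
        for left, right in splits:
            if (u in left and v in right) or (u in right and v in left):
                probs.append(internal_probability(G, left, right))
                break
    return sum(probs) / len(probs)
```

Candidate non-edges can then be ranked by this averaged score, which is how the talk's A–D pair comes out on top of A–C and B–D.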
Uh 512 00:20:05,760 --> 00:20:09,120 for example, if I want to know whether a 513 00:20:07,360 --> 00:20:11,280 person is suffering from schizophrenia 514 00:20:09,120 --> 00:20:13,200 or anything, I will study the 515 00:20:11,280 --> 00:20:15,280 metabolic pathways and regulatory mechanisms 516 00:20:13,200 --> 00:20:16,880 through these kinds of networks, and that 517 00:20:15,280 --> 00:20:18,640 becomes completely invisible if there 518 00:20:16,880 --> 00:20:20,960 are missing proteins or missing links in 519 00:20:18,640 --> 00:20:22,480 between them, and we are not able to 520 00:20:20,960 --> 00:20:24,240 get the correct predictions from them. 521 00:20:22,480 --> 00:20:26,320 So this algorithm really helped me to 522 00:20:24,240 --> 00:20:30,600 get the correct predictions and to get 523 00:20:26,320 --> 00:20:30,600 better statistics for the disease. 524 00:20:30,640 --> 00:20:35,280 The fact is that there's only a partial 525 00:20:33,360 --> 00:20:37,120 implementation for this. It supports 526 00:20:35,280 --> 00:20:40,000 only fitting an HRG model to the 527 00:20:37,120 --> 00:20:42,000 network, finding a consensus dendrogram, and 528 00:20:40,000 --> 00:20:44,559 merging multiple consensus dendrograms 529 00:20:42,000 --> 00:20:46,880 into a new consensus dendrogram. pyHRG is 530 00:20:44,559 --> 00:20:49,760 based on NetworkX and Matplotlib. 531 00:20:46,880 --> 00:20:52,159 So this package, pyHRG, has been 532 00:20:49,760 --> 00:20:54,159 developed using the NetworkX library, which 533 00:20:52,159 --> 00:20:56,240 already exists, but as I 534 00:20:54,159 --> 00:20:58,000 already said, it is not for 535 00:20:56,240 --> 00:20:59,600 comprehensive analysis. It can manage 536 00:20:58,000 --> 00:21:01,679 simple networks, but for 537 00:20:59,600 --> 00:21:04,240 comprehensive network analysis we do 538 00:21:01,679 --> 00:21:06,400 need such algorithms, though they are based 539 00:21:04,240 --> 00:21:09,120 on these libraries.
540 00:21:06,400 --> 00:21:11,919 The GitHub repo for this algorithm can 541 00:21:09,120 --> 00:21:15,039 be found over here, and you can find 542 00:21:11,919 --> 00:21:17,840 three files over here, which are the HRG 543 00:21:15,039 --> 00:21:20,320 fit, the dendrogram, and the missing 544 00:21:17,840 --> 00:21:22,530 link prediction. 545 00:21:20,320 --> 00:21:29,360 Thank you. 546 00:21:22,530 --> 00:21:29,360 [Applause]