Hello, welcome back, and welcome all the way from California, Jyotika Singh. She's the VP of Data Science at ICX Media, where she and her team work on natural language processing, feature engineering, and all kinds of machine learning. Jyotika has a master's degree from the University of California, Los Angeles, where, among other research topics, she worked on signal and speech processing and developed new approaches to remove noise from speech. She shares her findings via her open source projects on GitHub, such as pyYouTubeAnalysis and pyAudioProcessing, and that's what Jyotika is talking about today: audio. What is audio data? How do you build features and classification models on audio? How do you solve these problems in Python? Now is where we find out from the author of pyAudioProcessing herself. So please join me in a round of virtual applause for Jyotika Singh and "Classifying audio into types using Python".

Thank you so much for the wonderful introduction. Like you mentioned, I'll be talking about classifying audio into types using Python. Before diving right in, I just quickly want to introduce myself. I work as VP of Data Science at ICX Media; it's a content and audience intelligence company based in Washington, DC. I'm attaching my social media handles there because I'm going to be posting the slide deck on my Twitter account after the talk. Also, in case anybody has any questions that you are unable to ask me during the conference, you can shoot a note out to me on Twitter. I'm also attaching my LinkedIn and GitHub accounts for reference. There's also an upcoming book, in the summer or fall of 2022, that I'm working on authoring. It's on natural language processing in the real world, and it contains descriptions
of how natural language processing is used across several industry verticals, and how to actually implement it using Python.

So, without further ado: this talk contains a few sections, starting from what audio is, then machine learning at a high level, audio features, tools and how to use some of those tools, and then, towards the end, classification examples that classify audio into different types and across different genres using the tools we discussed previously.

So what is audio? It's essentially a signal that vibrates in the audible frequency range. What does that mean? Well, when I'm talking now and you can hear me through the speakers, those sounds create air pressure waves that are then received by our ears, and those pressure signals are converted into responses that our brain can understand, so that it finally recognizes the audio as having a particular meaning.
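A minimal sketch of what that digitized signal looks like in Python, using scipy (not the library discussed in the talk, just a common starting point; "speech.wav" is a hypothetical local file):

```python
from scipy.io import wavfile

# A digitized audio signal is a sample rate (Hz) plus an array of samples.
# "speech.wav" is a hypothetical local file.
sample_rate, samples = wavfile.read("speech.wav")

duration = len(samples) / sample_rate  # length in seconds
print(f"{sample_rate} Hz, {duration:.2f} s, dtype={samples.dtype}")
```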
There are so many great MATLAB tools for digital signal processing, speech processing, and audio processing; for a lot of the research that goes on, the first place we can actually see its effects is in MATLAB. But given that Python is the language of choice for machine learning and for building classification models, there were a few gaps I noticed in the community when I was trying to build audio classification models and needed to extract particular features. In that attempt, some open source libraries were created to do audio processing in Python, and one of them is pyAudioProcessing, whose usage I'll be talking about in a little bit.

So what is machine learning at a high level? We can imagine it as divided into three phases: a data phase, a training phase, and an evaluation phase. The data phase has everything to do with data, from data collection — whether you are scraping data, you have data from some resource, or you're leveraging publicly available datasets — to cleaning of the data, because oftentimes the data is not in the perfect shape to be ready for feature extraction. Once you have cleaned the data, we transform it so that it is in a numerical representation, which goes as input to the machine learning model that you're training. The evaluation of the model then further influences what else you can do in the data phase: do you need more data, do you need to clean it differently, do you need to use other data transformation techniques?

So, as mentioned, features are numerical representations of data, usable by machine learning models, but they highly depend on the data type. For instance, if you have a text corpus, using word2vec to represent the phrases or words within the corpus as numerical representations works perfectly. But if you were to pass just random numbers through word2vec, we wouldn't expect to get anything meaningful. So there are different feature generation methods that are suitable for different types of data.

This gets us to audio features. There are so many different audio features, and we are not going to talk about all of them, but I wanted to mention a lot of them on one slide, so if anybody is curious and wants to look up other things for reference, it is all there in one place.
Let's start with two important things when we're talking about audio features: the spectrum and the cepstrum. What is a spectrum? When the audio signal is passed through a Fourier transform, what results is a spectrum. But what is it, essentially? It is the audio signal in the frequency domain, and we compute it using a Fourier transform. If people are aware of the Fourier series — and even if not — it is just a way to represent your signal in terms of sines and cosines. That is your Fourier transform, and it helps us get the signal in the frequency domain; that is called the spectrum.

Now, if we take the log magnitude of the spectrum, to reduce amplitude differences, and then take the inverse Fourier transform, what results is a cepstrum. The cepstrum is neither in the time domain nor in the frequency domain. Why is it not in the time domain, even though we took an inverse Fourier transform? Because of the log magnitude step. And because we took the inverse Fourier transform, it's not in the frequency domain either. Oftentimes people refer to this as the "quefrency" domain.

To show how these different representations look visually, there's the waveform of a simple vowel, then the spectrum, followed by the cepstrum, and then the first 20 cepstral coefficients. One thing that's great about the cepstrum is that the first few coefficients — sometimes 13, sometimes 20 — make great features for building machine learning models.
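A minimal sketch of the two transforms just described, in plain numpy (the small epsilon guards the log against zero magnitudes; the synthetic 440 Hz tone is a stand-in for a real recording):

```python
import numpy as np

def spectrum_and_cepstrum(signal):
    # Spectrum: the signal in the frequency domain, via the Fourier transform
    spectrum = np.fft.fft(signal)
    # Log magnitude reduces amplitude differences...
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)
    # ...and its inverse Fourier transform is the cepstrum ("quefrency" domain)
    cepstrum = np.fft.ifft(log_magnitude).real
    return spectrum, cepstrum

# Example on a synthetic 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec, ceps = spectrum_and_cepstrum(np.sin(2 * np.pi * 440 * t))
first_20 = ceps[:20]  # the low-quefrency coefficients often used as features
```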
But why do we care about the frequency domain? Why is it that we are talking about the spectrum and cepstrum and features associated with them? The inspiration is biology, especially if you consider how we even see an image. What it looks like to the eye is a certain thing, but a computer has to process it differently: when our eyes see something, we see a color like blue or green, but a computer can represent it as the pixel values behind the image. Similarly, when we hear, there's a whole process that goes on, and inspired by that is how we generate features from audio.

There's a spiral, fluid-filled structure in the ear called the cochlea. It has thousands of tiny hairs of different lengths. The longer hairs resonate with sounds of lower frequencies, and the shorter hairs resonate with higher frequencies. So, because of the way the signal is processed, our ear is almost considered a natural Fourier transform analyzer, and this is why the spectrum and cepstrum are of great interest to us.

Coming from the cepstrum, there are a few features that are of great importance in a lot of machine learning applications, called the mel-frequency cepstral coefficients (MFCC). Behind them is what we call the mel filter bank, which is just those triangles you see on the screen. You'll notice the triangles keep getting wider, and this is because the human ear is less frequency-selective after one kilohertz, so we want to grab less and less as we go forward. The aim of this filter is to closely represent how human hearing works. Mathematically, the spectrum of the signal passes through the mel-scale filter bank — the filter you see on the screen — then a log magnitude, followed by a discrete cosine transform, which results in the MFCC features. The discrete cosine transform also finds application in things like JPEG compression, because its job is to capture the shape of the signal rather than the sharper peaks; it's known that those sharper, smaller peaks are just noise.
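The talk extracts these with pyAudioProcessing; as an illustration of the same mel filter bank, log magnitude, and DCT pipeline, here is a sketch using librosa (an alternative library, not the speaker's tool), which wraps all three steps into one call:

```python
import librosa

# "speech.wav" is a hypothetical file; sr=None keeps the native sample rate
y, sr = librosa.load("speech.wav", sr=None)

# 13 MFCCs per analysis frame: mel filter bank -> log -> discrete cosine transform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```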
Another cepstral coefficient is the gammatone frequency cepstral coefficients (GFCC). It's a little bit different from MFCC: as you see, the filter now has a softer slope and soft edges. The inspiration here is, again, how human hearing works. This gammatone filter bank is known to be a simulation of the front end of the cochlea, so it again more closely represents how we hear. The computation is very similar, in that the spectrum passes through this filter bank, then there are steps for downsampling and loudness compression, followed by a discrete cosine transform, and that gets us to our GFCC features.

Visually, MFCC and GFCC look like what you can see on the screen for the same audio signal. We see that these differently processed features do look different; they convey different information. It's not that one is a derivative of the other — each is produced through a different process, and each of them conveys a great deal of information when used in machine learning models. They have been used individually, and a combination of the two features can be used to build machine learning models as well.

There are some other features that have proved to be of great importance, especially in applications of speech processing: linear prediction cepstral coefficients (LPCC), bark-frequency cepstral coefficients (BFCC), and power-normalized cepstral coefficients (PNCC). Then there are some features related to the spectrum of the signal, like spectral entropy, spectral flux, and the zero-crossing rate — how many times the signal crosses zero. And then there are chroma features, which essentially represent the tonal content of a musical audio signal, so they can be a very useful feature when classifying music-related content.
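Two of those spectral features are easy to compute directly; a sketch in plain numpy for a single frame of samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose sign changes
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_entropy(frame):
    # Shannon entropy of the normalized power spectrum
    power = np.abs(np.fft.rfft(frame)) ** 2
    p = power / (np.sum(power) + 1e-10)
    return -np.sum(p * np.log2(p + 1e-10))
```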
There are many tools one can leverage in Python for audio processing and for building audio machine learning classifiers, and I just wanted to list them all in one place, so if anybody is interested in different types of audio processing, please check these out.

Coming to the library pyAudioProcessing: it's essentially a Python library for audio analysis and classification. There are a bunch of different functions that can be performed using this library, starting with audio format conversion — a lot of the methods out there work on the WAV format, but your audio can be in very different formats, so this converts from those formats to WAV. There's audio visualization: sometimes you may want to visualize your audio with or without building a model, and that's something you can do with this library as well. There are audio cleaning techniques that help you remove silence or low-activity segments from your signal before you pass it into any further processing. Then there is audio feature extraction: the MFCC we spoke about, GFCC, spectral features, and the chroma features as well.

I want to say that when I was working on this project and wanted to use GFCC, I was having a hard time finding a Python implementation. That's what motivated me to create this library: I took the MATLAB code that I had and converted it to Python, and that's where this comes from.

Furthermore, once you have built your features, you can use existing scikit-learn classifiers with automatic hyperparameter tuning using this library. If you want, you can also use it without the scikit-learn classifiers, with your own custom backend. There are also three pre-trained audio classification models provided with this library that can help you establish a baseline if you're working on similar problems of classifying audio.
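A sketch of the train-and-classify entry point, following the pattern in the project README (the module path and argument order are from the README around the time of the talk and may differ between versions; the folder layout, with one sub-folder of audio files per class, is hypothetical):

```python
from pyAudioProcessing.run_classification import train_and_classify

# Train an SVM on MFCC features extracted from folders of audio, one folder per class
train_and_classify("data_samples/training", "train", ["mfcc"], "svm", "svm_clf")

# Run the saved classifier over unseen audio files
train_and_classify("data_samples/testing", "classify", ["mfcc"], "svm", "svm_clf")
```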
Remember the diagram we looked at in the beginning, which talked about machine learning at a high level. Converting that to machine learning for audio signals, the different components can be related to what we've spoken about. In terms of data collection, you can use your own dataset if you have one; if you don't, there are many publicly available datasets, which I've attached in a resources slide towards the end — like I said, I'll be sharing the slide deck, so feel free to access that resource there. Secondly, we have data cleaning, which can consist of converting audio formats, but also of cleaning and removing the silence segments from the audio; that can be done using pyAudioProcessing as well. Then the transformation step is feature formation, which can be done using pyAudioProcessing's extract-features module. These features can be extracted for use with your own backend, or they can be used with existing scikit-learn models through the run-classification module, which can help you train and classify your signals and also give you statistics like the confusion matrix — essentially, how your classifications have run on your evaluation dataset.
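pyAudioProcessing ships its own helper for the format-conversion step just mentioned; as a generic alternative, a one-line sketch with pydub (an alternative library, not the one from the talk; it needs ffmpeg on the system path, and "clip.mp3" is a hypothetical input):

```python
from pydub import AudioSegment

# Convert a hypothetical mp3 to WAV; requires ffmpeg on the system path
AudioSegment.from_file("clip.mp3", format="mp3").export("clip.wav", format="wav")
```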
If you're thinking of starting with such a problem, let's talk about the flow — the kind of questions you would ask if you want to do something with audio, analyze it, or create a classification model. So, let's say you have an audio file. Does it need to be converted to WAV? If so, we can do that with a module present in pyAudioProcessing. Does it need to be cleaned? If so, we can use pyAudioProcessing's clean module. Do you need to build a scikit-learn classifier? That can be done as well, using train and classify. If not, do you want to just extract the features to use with your own custom model? That can be done as well, using the extract-features module (sketched below). If not, do you just want to use a pre-trained model to classify audio that you have? That can be done too, and there are instructions in the README for exactly how to do that. If none of that, and you just want to visualize your audio, that can be done as well, using pyAudioProcessing's plot module. And if none of that: please help us by creating an issue on GitHub and mention the things you want to do in Python for audio that you're not able to do. Please create these issues, and please feel free to contribute by working on some of the existing issues as well. It's an open source project, and we very much welcome everybody's input — the community's input.
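For the extract-features-for-your-own-model branch of that flow, the downstream step is ordinary scikit-learn; a sketch with placeholder arrays (X and y here are hypothetical stand-ins for per-file feature vectors and class labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: 150 files x 40-dimensional feature vectors, three classes
X = np.random.rand(150, 40)
y = np.repeat(["speech", "music", "birds"], 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out files
```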
Now, coming to audio classification: we have talked about the library, we've talked about some features, we've talked about what audio is, so let's bring it all together by discussing some audio classification examples that have been built using pyAudioProcessing.

The first one is audio type classification. In this problem we'll be classifying audio into three possible classes: speech, music, and birds. The first thing to consider is, of course, the data. In this case we are using 50 samples per class, so a total of 150 samples for training, and then for testing there are 14 samples for each class. The first thing we do is train an MFCC model, keeping the classifier constant at SVM. The MFCC model generated the training confusion matrix that can be seen on the top, and it looks like it's doing pretty well. When we pass the test data through this model that was created using the MFCC feature, we see music getting classified correctly 13 out of 14 times, and speech and birds 14 out of 14. So this is a good model, and it looks like MFCC definitely has parts to it that help the machine learning model decipher between these three classes.

This is what a representation of the features of a speech, music, and bird signal looks like. We can see how the feature looks, and there are different patterns associated with the different types of signals.

Now, just for experimentation purposes, let's try a GFCC feature model and see if that makes any difference. The training confusion matrix still looks good, and when we test it, the results are also pretty good: we have 14 out of 14 for music and for birds, but 12 out of 14 for speech, which is a little bit different from what we had when training with the MFCC feature. So it looks like standalone GFCC is also contributing something to the model that helps it decipher clear patterns between how music, speech, and birds look.

Here's a comparison between the MFCC feature and the GFCC feature for a speech, music, and bird sample, and we can see from the plots in front of us that they're relaying different information.

Because they're relaying different information, one last experiment I wanted to do was to combine these two features — use them in conjunction. Again, the training confusion matrix looks good, and the testing one also looks good; it's pretty much similar to how MFCC was performing. Further testing could be done using new samples of speech, music, and birds to evaluate these models, but this is mainly to show the capability of the features themselves while keeping the classifier constant.
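Using the two feature sets in conjunction amounts to concatenating each file's vectors before training; a sketch (the arrays are hypothetical per-file feature matrices):

```python
import numpy as np

# Hypothetical per-file feature matrices: (n_files, n_mfcc) and (n_files, n_gfcc)
mfcc_vecs = np.random.rand(150, 13)
gfcc_vecs = np.random.rand(150, 13)

# One combined vector per file, fed to the same SVM as before
combined = np.hstack([mfcc_vecs, gfcc_vecs])  # shape (150, 26)
```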
If one wanted to invest even further into this model and create an even better one, one could look at more samples, the quality of the data, the quantity of the data, and then different classification backends.

The second example I want to talk about is music genre classification. In this case we have 10 music genres: pop, metal, disco, blues, reggae, classical, rock, hip-hop, country, and jazz. There are 80 samples per class for training the model, and then 20 samples per class for testing. There's a paper published in 2002 that used MFCC features for performing this genre classification, and I've linked it in the resources slide as well. So let's use the MFCC feature again, keeping the classifier constant at SVM, and see how this performs. It can be seen from the confusion matrix on the training side that some of the classes, like metal and classical, are showing good numbers and doing well, but there are some, like country and disco, that are not doing so great. When we run our testing samples through this classifier trained using the MFCC feature, the results are again mixed: classical has 18 out of 20 correctly classified, but disco, blues, rock, and reggae all show lower numbers.

So let's see what happens if we add features. Earlier it was just MFCC; now we use MFCC, GFCC, spectral, as well as chroma features. That improves our training confusion matrix significantly — all the numbers have gone up — and we notice the same thing for the 20 testing samples per class: pop has gone up by five more correctly classified, disco by nine, and country by seven. Everything has improved, so adding these features certainly added something to our model that helped it decipher between these classes better.
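With pyAudioProcessing, that feature expansion is just a longer feature list in the same call (same caveat as before: the call pattern follows the project README of the time, and exact signatures may differ; the genre folder layout is hypothetical):

```python
from pyAudioProcessing.run_classification import train_and_classify

# One sub-folder per genre under "genres/train", 80 files each
train_and_classify(
    "genres/train",
    "train",
    ["mfcc", "gfcc", "spectral", "chroma"],
    "svm",
    "svm_genre_clf",
)
```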
Again, the classifier was kept constant at SVM; if the classifier is experimented with as well, the model can be further improved.

To further see where the model is failing or not, the testing data can be elaborated on using confusion matrices and by getting the precision, recall, and F1 score. This helps you see which class is going wrong, and exactly where. For example, in this case we see that reggae in particular is getting incorrectly classified as hip-hop a lot. If that was somewhere we wanted to invest time, we could check the data samples that exist, the data quantity, and the data quality. It also really depends on what your goal is: if your goal is mainly to be able to classify pop, metal, and maybe, let's say, classical, then you already have a model that does a decent job for those particular classes.

This is just to showcase the capability of extracting features and using classifiers, but the things that could be tried are: experimenting with the data quantity and seeing what the data sizes look like; the data quality in particular, such as whether there are any noisy samples; and other features that can be used. Another consideration is that some of these genres have music with vocals, so maybe some sort of detection that way. And then other classifier backends can be experimented with.
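The per-class breakdown described above comes straight out of scikit-learn; a sketch with hypothetical true and predicted labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for a handful of evaluation files
y_true = ["reggae", "reggae", "hiphop", "classical", "pop", "reggae"]
y_pred = ["hiphop", "reggae", "hiphop", "classical", "pop", "hiphop"]

print(confusion_matrix(y_true, y_pred, labels=["classical", "hiphop", "pop", "reggae"]))
print(classification_report(y_true, y_pred))  # precision, recall, f1 per class
```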
Lastly, I'm going to discuss a location name classification problem: classifying audio of spoken location names and seeing if we are able to decipher them. Consider a very, very basic example — two spoken location names, London and Boston. These have similarities: the number of characters that form the words, and the number of syllables that form them. In the representation right in front of us, there is something that looks different between the plots for London and Boston, so that makes us feel like, okay, easy, right? It looks very differentiable — why wouldn't a model be able to do that? But then we compare three different spoken representations of London with three of Boston — everything on your left is London, and everything on your right is Boston — and these charts very quickly look different from each other, because they're spoken by different people: there are different styles in which one can say the same name, different accents, different pause locations. So there's a lot of variety in how one speaks.

So here we're conducting two experiments. In the first, we train only on female voice samples and then test on male voice samples. For training we have 23 samples for London and 23 samples for Boston, and for testing we have 17 samples each for London and Boston — but we're testing on only male voices and training on only female voices. Let's see if our model can do that. We try the MFCC feature and get a confusion matrix: on testing, we have 9 out of 17 for Boston correctly classified and 8 out of 17 for London. Let's see if we can improve that. We tried a GFCC feature and trained the model, and now we have 13 out of 17 correctly classified for London, and the same for Boston. Trying to improve it further by adding spectral features to GFCC, we now have 15 out of 17 correctly classified for London and 14 out of 17 for Boston.

Now, there's a lot going on here, in the sense that our training is only female voices and our testing is only male voices, and there is a difference: males and females have different lengths of the vocal tract, which leads to voices at different pitches. So there is that difference as well, and we are hoping that the model is still able to pick up on the spoken representations and get past the differences between the training and the testing samples.

Now, if we combine these samples and shuffle them up, so that we're training on female and male voices and also testing on female and male voices, our data becomes more representative when we're training the model. We see that even MFCC is now doing better than it was before, when we were training on just female samples and testing on male voice samples. With this different representation, all the models are doing better, and the training confusion matrix looks much better because of the representation that we have. Both of these experiments were done using the SVM classifier as well; by keeping the classifier constant, we were able to compare the effects of the different features.

Here's the very much promised resources slide. It has several links to features that we did not talk about, some of the papers that I mentioned — the music genre classification one — and the audio datasets, where you can find publicly available open source datasets.

Finally, I want to thank everyone for tuning in. It's been a pleasure. Thank you so much.

Thank you, Jyotika, and we have some questions for you. The first one is: do you think this toolkit could be useful for other vibrational analysis? The example they give is seismic signal processing — earthquakes.

That is a very, very interesting thought. Because this library deals mainly with audio, if you're talking about any audio effects of these signals, or any patterns that are audible, that you can hear, then I think it's definitely worth a try. I have personally not heard of that application before, but it sounds very interesting, and if there's any audible component to it, I would definitely give it a shot.
There's another question that asks: is pyAudioProcessing language agnostic, or is it just for English?

Oh, well, it actually doesn't matter. The pyAudioProcessing library is essentially written in Python, but what language your audio is in does not matter, because it's going to train on the samples you provide. If your samples are in English, it's going to train on that data. So it really depends on the data that you pass in rather than on the library; the library should be expected to behave similarly.

And how do you deal with noisy data? Have you considered automated noise classification and cleaning to any degree?

Yeah, that's a very good question. Noise is a very — I would say frustrating — component of dealing with anything involving classification and data. One thing about noise is that some cases are very basic and simple, where you just have a signal whose interesting content is in a few particular segments and the audio itself is long; for that there are simpler solutions, like removing the silent segments. But then there are other approaches to removing noise, like spectral subtraction, in which you take the full audio, look at the less content-filled portions, see what the signal looks like there, and subtract that from the entire signal, to remove noises such as something going on in the background — train noise, car noise. So I think it's a very interesting application, and MFCC features have been used quite a bit to remove noise from data, and GFCC features as well: especially in speaker identification, when noisy samples are provided, these features help you clean them up as well.
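A sketch of the spectral-subtraction idea just described, using scipy's STFT and assuming the first half second of the recording is background noise only (a simplification; a real pipeline would estimate the noise from detected low-activity segments):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_seconds=0.5, nperseg=512):
    # Short-time Fourier transform: magnitude and phase per time-frequency bin
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    magnitude, phase = np.abs(Z), np.angle(Z)
    # Estimate the noise spectrum from the (assumed) signal-free opening frames
    hop = nperseg // 2
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the estimate everywhere and floor negative magnitudes at zero
    clean_magnitude = np.maximum(magnitude - noise_estimate, 0.0)
    _, cleaned = istft(clean_magnitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return cleaned
```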
We have a lot of questions on Venueless, and the next one is: you've mentioned MFCC and GFCC features — what are the implications for classifying speech features in people, like accents, the age of the speaker, and maybe other attributes?

Very good question as well. There are a lot of applications of these features in, for example, gender classification and so on. GFCC specifically is very useful for speaker identification, so if you have a task where you want to differentiate between people, including across different age groups, I would definitely give that a shot. And then there are PNCC, BFCC — other cepstral coefficients that really help break down your signal and classify it into types, especially when it carries speaker information.

What's the visualization and plotting stack based on, for pyAudioProcessing? Here's another question from the audience.

Can you repeat that, sorry? What's the visualization and plotting stack based on, for pyAudioProcessing? Yes, it's a good question. It's mainly scikit-learn and matplotlib. Essentially, if you have your features, or any data that you want to visualize, once you have the data, visualizing can be done with anything like matplotlib or seaborn — any of your favorite visualization libraries.
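Once the samples are loaded, that plotting is plain matplotlib; a sketch (again assuming a hypothetical mono "speech.wav"):

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

sample_rate, samples = wavfile.read("speech.wav")  # hypothetical mono file

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(samples)                      # time-domain waveform
ax1.set_title("Waveform")
ax2.specgram(samples, Fs=sample_rate)  # time-frequency view
ax2.set_title("Spectrogram")
plt.tight_layout()
plt.show()
```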
And the final question — we still have time for it. People are asking: were the music samples entire songs, or were you only clipping part of the song? And if you did both, does the length of the sample have an effect on model performance?

That's a very good question. The length of the sample — unless it's really short and does not convey much information — would not have a very significant impact, because you're windowing the signal, which is where you're extracting features, and then averaging them out over the signal. The dataset I used specifically consisted of entire audios; it's called the GTZAN dataset, and I've attached a link to it in the resources slide as well. It contains all these genres, so if you want to explore that particular dataset, it's going to be right there.

Thank you so much, Jyotika Singh. Thank you for speaking at PyCon AU 2021.

Thank you so much, it's been a pleasure. Thank you very much for organizing the whole thing.

And for our audience at home: we now have a bit of a break, and we'll be back at — let me check — 1:30 Melbourne time, with graph data science and Paco Nathan. See you then.