Hello, welcome back, and welcome all the way from California, Jyotika Singh. She's the VP of Data Science at ICX Media, where she and her team work on natural language processing, feature engineering, and all kinds of machine learning. Jyotika has a master's degree from the University of California, Los Angeles, where, among other research topics, she worked on signal and speech processing and developed new approaches to remove noise from speech. She shares her findings via her open source projects on GitHub, such as pyYouTubeAnalysis and pyAudioProcessing, and that's what Jyotika is talking about today: audio. What is audio data? How do you build features and classification models on audio? How do you solve these problems in Python? Now is where we find out from the author of pyAudioProcessing herself. So please join me in a round of virtual applause for Jyotika Singh and "Classifying audio into types using Python".

Thank you so much for the wonderful introduction. Like you mentioned, I'll be talking about classifying audio into types using Python. Before diving right in, I just quickly want to introduce myself. I work as VP of Data Science at ICX Media; it's a content and audience intelligence company based in Washington, DC. I'm attaching my social media handles there because I'm going to be posting the slide deck on my Twitter account after the talk. Also, in case anybody has any questions that you are unable to ask me during the conference, you can shoot a note out to me on Twitter. I'm also attaching my LinkedIn and GitHub accounts for reference. There's also an upcoming book, in the summer or fall of 2022, that I'm working on authoring. It's on natural language processing in the real world, and it contains descriptions
of how natural language processing is used across several industry verticals, and how to actually implement it using Python.

So, without further ado: this talk contains a few sections, starting from what audio is, then machine learning at a high level, audio features, tools and how to use some of those tools, and then, towards the end, classification examples that classify audio into different types and across different genres using the tools we discussed previously.

So what is audio? It's essentially a signal that vibrates in the audible frequency range. What does that mean? Well, when I'm talking now and you can hear me through the speakers, those sounds create air pressure waves that are then received by our ears, and those pressure signals are converted into responses that our brain can understand, so that it finally recognizes the audio as having a particular meaning.
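A minimal sketch of what that digitized signal looks like in Python, using scipy (not the library discussed in the talk, just a common starting point; "speech.wav" is a hypothetical local file):

```python
from scipy.io import wavfile

# A digitized audio signal is a sample rate (Hz) plus an array of samples.
# "speech.wav" is a hypothetical local file.
sample_rate, samples = wavfile.read("speech.wav")

duration = len(samples) / sample_rate  # length in seconds
print(f"{sample_rate} Hz, {duration:.2f} s, dtype={samples.dtype}")
```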
There are so many great MATLAB tools for digital signal processing, speech processing, and audio processing; for a lot of the research that goes on, the first place we can actually see its effects is in MATLAB. But given that Python is the language of choice for machine learning and for building classification models, there were a few gaps I noticed in the community when I was trying to build audio classification models and needed to extract particular features. In that attempt, some open source libraries were created to do audio processing in Python, and one of them is pyAudioProcessing, whose usage I'll be talking about in a little bit.

So what is machine learning at a high level? We can imagine it as divided into three phases: a data phase, a training phase, and an evaluation phase. The data phase has everything to do with data, from data collection — whether you are scraping data, you have data from some resource, or you're leveraging publicly available datasets — to cleaning of the data, because oftentimes the data is not in the perfect shape to be ready for feature extraction. Once you have cleaned the data, we transform it so that it is in a numerical representation, which goes as input to the machine learning model that you're training. The evaluation of the model then further influences what else you can do in the data phase: do you need more data, do you need to clean it differently, do you need to use other data transformation techniques?

So, as mentioned, features are numerical representations of data, usable by machine learning models, but they highly depend on the data type. For instance, if you have a text corpus, using word2vec to represent the phrases or words within the corpus as numerical representations works perfectly. But if you were to pass just random numbers through word2vec, we wouldn't expect to get anything meaningful. So there are different feature generation methods that are suitable for different types of data.

This gets us to audio features. There are so many different audio features, and we are not going to talk about all of them, but I wanted to mention a lot of them on one slide, so if anybody is curious and wants to look up other things for reference, it is all there in one place.
Let's start with two important things when we're talking about audio features: the spectrum and the cepstrum. What is a spectrum? When the audio signal is passed through a Fourier transform, what results is a spectrum. But what is it, essentially? It is the audio signal in the frequency domain, and we compute it using a Fourier transform. If people are aware of the Fourier series — and even if not — it is just a way to represent your signal in terms of sines and cosines. That is your Fourier transform, and it helps us get the signal in the frequency domain; that is called the spectrum.

Now, if we take the log magnitude of the spectrum, to reduce amplitude differences, and then take the inverse Fourier transform, what results is a cepstrum. The cepstrum is neither in the time domain nor in the frequency domain. Why is it not in the time domain, even though we took an inverse Fourier transform? Because of the log magnitude step. And because we took the inverse Fourier transform, it's not in the frequency domain either. Oftentimes people refer to this as the "quefrency" domain.

To show how these different representations look visually, there's the waveform of a simple vowel, then the spectrum, followed by the cepstrum, and then the first 20 cepstral coefficients. One thing that's great about the cepstrum is that the first few coefficients — sometimes 13, sometimes 20 — make great features for building machine learning models.
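A minimal sketch of the two transforms just described, in plain numpy (the small epsilon guards the log against zero magnitudes; the synthetic 440 Hz tone is a stand-in for a real recording):

```python
import numpy as np

def spectrum_and_cepstrum(signal):
    # Spectrum: the signal in the frequency domain, via the Fourier transform
    spectrum = np.fft.fft(signal)
    # Log magnitude reduces amplitude differences...
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)
    # ...and its inverse Fourier transform is the cepstrum ("quefrency" domain)
    cepstrum = np.fft.ifft(log_magnitude).real
    return spectrum, cepstrum

# Example on a synthetic 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec, ceps = spectrum_and_cepstrum(np.sin(2 * np.pi * 440 * t))
first_20 = ceps[:20]  # the low-quefrency coefficients often used as features
```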
But why do we care about the frequency domain? Why is it that we are talking about the spectrum and cepstrum and features associated with them? The inspiration is biology, especially if you consider how we even see an image. What it looks like to the eye is a certain thing, but a computer has to process it differently: when our eyes see something, we see a color like blue or green, but a computer can represent it as the pixel values behind the image. Similarly, when we hear, there's a whole process that goes on, and inspired by that is how we generate features from audio.

There's a spiral, fluid-filled structure in the ear called the cochlea. It has thousands of tiny hairs of different lengths. The longer hairs resonate with sounds of lower frequencies, and the shorter hairs resonate with higher frequencies. So, because of the way the signal is processed, our ear is almost considered a natural Fourier transform analyzer, and this is why the spectrum and cepstrum are of great interest to us.

Coming from the cepstrum, there are a few features that are of great importance in a lot of machine learning applications, called the mel-frequency cepstral coefficients (MFCC). Behind them is what we call the mel filter bank, which is just those triangles you see on the screen. You'll notice the triangles keep getting wider, and this is because the human ear is less frequency-selective after one kilohertz, so we want to grab less and less as we go forward. The aim of this filter is to closely represent how human hearing works. Mathematically, the spectrum of the signal passes through the mel-scale filter bank — the filter you see on the screen — then a log magnitude, followed by a discrete cosine transform, which results in the MFCC features. The discrete cosine transform also finds application in things like JPEG compression, because its job is to capture the shape of the signal rather than the sharper peaks; it's known that those sharper, smaller peaks are just noise.
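The talk extracts these with pyAudioProcessing; as an illustration of the same mel filter bank, log magnitude, and DCT pipeline, here is a sketch using librosa (an alternative library, not the speaker's tool), which wraps all three steps into one call:

```python
import librosa

# "speech.wav" is a hypothetical file; sr=None keeps the native sample rate
y, sr = librosa.load("speech.wav", sr=None)

# 13 MFCCs per analysis frame: mel filter bank -> log -> discrete cosine transform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```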
Another cepstral coefficient is the gammatone frequency cepstral coefficients (GFCC). It's a little bit different from MFCC: as you see, the filter now has a softer slope and soft edges. The inspiration here is, again, how human hearing works. This gammatone filter bank is known to be a simulation of the front end of the cochlea, so it again more closely represents how we hear. The computation is very similar, in that the spectrum passes through this filter bank, then there are steps for downsampling and loudness compression, followed by a discrete cosine transform, and that gets us to our GFCC features.

Visually, MFCC and GFCC look like what you can see on the screen for the same audio signal. We see that these differently processed features do look different; they convey different information. It's not that one is a derivative of the other — each is produced through a different process, and each of them conveys a great deal of information when used in machine learning models. They have been used individually, and a combination of the two features can be used to build machine learning models as well.

There are some other features that have proved to be of great importance, especially in applications of speech processing: linear prediction cepstral coefficients (LPCC), bark-frequency cepstral coefficients (BFCC), and power-normalized cepstral coefficients (PNCC). Then there are some features related to the spectrum of the signal, like spectral entropy, spectral flux, and the zero-crossing rate — how many times the signal crosses zero. And then there are chroma features, which essentially represent the tonal content of a musical audio signal, so they can be a very useful feature when classifying music-related content.
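Two of those spectral features are easy to compute directly; a sketch in plain numpy for a single frame of samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose sign changes
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_entropy(frame):
    # Shannon entropy of the normalized power spectrum
    power = np.abs(np.fft.rfft(frame)) ** 2
    p = power / (np.sum(power) + 1e-10)
    return -np.sum(p * np.log2(p + 1e-10))
```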
There are many tools one can leverage in Python for audio processing and for building audio machine learning classifiers, and I just wanted to list them all in one place, so if anybody is interested in different types of audio processing, please check these out.

Coming to the library pyAudioProcessing: it's essentially a Python library for audio analysis and classification. There are a bunch of different functions that can be performed using this library, starting with audio format conversion — a lot of the methods out there work on the WAV format, but your audio can be in very different formats, so this converts from those formats to WAV. There's audio visualization: sometimes you may want to visualize your audio with or without building a model, and that's something you can do with this library as well. There are audio cleaning techniques that help you remove silence or low-activity segments from your signal before you pass it into any further processing. Then there is audio feature extraction: the MFCC we spoke about, GFCC, spectral features, and the chroma features as well.

I want to say that when I was working on this project and wanted to use GFCC, I was having a hard time finding a Python implementation. That's what motivated me to create this library: I took the MATLAB code that I had and converted it to Python, and that's where this comes from.

Furthermore, once you have built your features, you can use existing scikit-learn classifiers with automatic hyperparameter tuning using this library. If you want, you can also use it without the scikit-learn classifiers, with your own custom backend. There are also three pre-trained audio classification models provided with this library that can help you establish a baseline if you're working on similar problems of classifying audio.
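A sketch of the train-and-classify entry point, following the pattern in the project README (the module path and argument order are from the README around the time of the talk and may differ between versions; the folder layout, with one sub-folder of audio files per class, is hypothetical):

```python
from pyAudioProcessing.run_classification import train_and_classify

# Train an SVM on MFCC features extracted from folders of audio, one folder per class
train_and_classify("data_samples/training", "train", ["mfcc"], "svm", "svm_clf")

# Run the saved classifier over unseen audio files
train_and_classify("data_samples/testing", "classify", ["mfcc"], "svm", "svm_clf")
```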
Remember the diagram we looked at in the beginning, which talked about machine learning at a high level. Converting that to machine learning for audio signals, the different components can be related to what we've spoken about. In terms of data collection, you can use your own dataset if you have one; if you don't, there are many publicly available datasets, which I've attached in a resources slide towards the end — like I said, I'll be sharing the slide deck, so feel free to access that resource there. Secondly, we have data cleaning, which can consist of converting audio formats, but also of cleaning and removing the silence segments from the audio; that can be done using pyAudioProcessing as well. Then the transformation step is feature formation, which can be done using pyAudioProcessing's extract-features module. These features can be extracted for use with your own backend, or they can be used with existing scikit-learn models through the run-classification module, which can help you train and classify your signals and also give you statistics like the confusion matrix — essentially, how your classifications have run on your evaluation dataset.
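pyAudioProcessing ships its own helper for the format-conversion step just mentioned; as a generic alternative, a one-line sketch with pydub (an alternative library, not the one from the talk; it needs ffmpeg on the system path, and "clip.mp3" is a hypothetical input):

```python
from pydub import AudioSegment

# Convert a hypothetical mp3 to WAV; requires ffmpeg on the system path
AudioSegment.from_file("clip.mp3", format="mp3").export("clip.wav", format="wav")
```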
If you're thinking of starting with such a problem, let's talk about the flow — the kind of questions you would ask if you want to do something with audio, analyze it, or create a classification model. So, let's say you have an audio file. Does it need to be converted to WAV? If so, we can do that with a module present in pyAudioProcessing. Does it need to be cleaned? If so, we can use pyAudioProcessing's clean module. Do you need to build a scikit-learn classifier? That can be done as well, using train and classify. If not, do you want to just extract the features to use with your own custom model? That can be done as well, using the extract-features module (sketched below). If not, do you just want to use a pre-trained model to classify audio that you have? That can be done too, and there are instructions in the README for exactly how to do that. If none of that, and you just want to visualize your audio, that can be done as well, using pyAudioProcessing's plot module. And if none of that: please help us by creating an issue on GitHub and mention the things you want to do in Python for audio that you're not able to do. Please create these issues, and please feel free to contribute by working on some of the existing issues as well. It's an open source project, and we very much welcome everybody's input — the community's input.
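For the extract-features-for-your-own-model branch of that flow, the downstream step is ordinary scikit-learn; a sketch with placeholder arrays (X and y here are hypothetical stand-ins for per-file feature vectors and class labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: 150 files x 40-dimensional feature vectors, three classes
X = np.random.rand(150, 40)
y = np.repeat(["speech", "music", "birds"], 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out files
```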
Now, coming to audio classification: we have talked about the library, we've talked about some features, we've talked about what audio is, so let's bring it all together by discussing some audio classification examples that have been built using pyAudioProcessing.

The first one is audio type classification. In this problem we'll be classifying audio into three possible classes: speech, music, and birds. The first thing to consider is, of course, the data. In this case we are using 50 samples per class, so a total of 150 samples for training, and then for testing there are 14 samples for each class. The first thing we do is train an MFCC model, keeping the classifier constant at SVM. The MFCC model generated the training confusion matrix that can be seen on the top, and it looks like it's doing pretty well. When we pass the test data through this model that was created using the MFCC feature, we see music getting classified correctly 13 out of 14 times, and speech and birds 14 out of 14. So this is a good model, and it looks like MFCC definitely has parts to it that help the machine learning model decipher between these three classes.

This is what a representation of the features of a speech, music, and bird signal looks like. We can see how the feature looks, and there are different patterns associated with the different types of signals.

Now, just for experimentation purposes, let's try a GFCC feature model and see if that makes any difference. The training confusion matrix still looks good, and when we test it, the results are also pretty good: we have 14 out of 14 for music and for birds, but 12 out of 14 for speech, which is a little bit different from what we had when training with the MFCC feature. So it looks like standalone GFCC is also contributing something to the model that helps it decipher clear patterns between how music, speech, and birds look.

Here's a comparison between the MFCC feature and the GFCC feature for a speech, music, and bird sample, and we can see from the plots in front of us that they're relaying different information.

Because they're relaying different information, one last experiment I wanted to do was to combine these two features — use them in conjunction. Again, the training confusion matrix looks good, and the testing one also looks good; it's pretty much similar to how MFCC was performing. Further testing could be done using new samples of speech, music, and birds to evaluate these models, but this is mainly to show the capability of the features themselves while keeping the classifier constant.
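Using the two feature sets in conjunction amounts to concatenating each file's vectors before training; a sketch (the arrays are hypothetical per-file feature matrices):

```python
import numpy as np

# Hypothetical per-file feature matrices: (n_files, n_mfcc) and (n_files, n_gfcc)
mfcc_vecs = np.random.rand(150, 13)
gfcc_vecs = np.random.rand(150, 13)

# One combined vector per file, fed to the same SVM as before
combined = np.hstack([mfcc_vecs, gfcc_vecs])  # shape (150, 26)
```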
If one wanted to invest even further into this model and create an even better one, one could look at more samples, the quality of the data, the quantity of the data, and then different classification backends.

The second example I want to talk about is music genre classification. In this case we have 10 music genres: pop, metal, disco, blues, reggae, classical, rock, hip-hop, country, and jazz. There are 80 samples per class for training the model, and then 20 samples per class for testing. There's a paper published in 2002 that used MFCC features for performing this genre classification, and I've linked it in the resources slide as well. So let's use the MFCC feature again, keeping the classifier constant at SVM, and see how this performs. It can be seen from the confusion matrix on the training side that some of the classes, like metal and classical, are showing good numbers and doing well, but there are some, like country and disco, that are not doing so great. When we run our testing samples through this classifier trained using the MFCC feature, the results are again mixed: classical has 18 out of 20 correctly classified, but disco, blues, rock, and reggae all show lower numbers.

So let's see what happens if we add features. Earlier it was just MFCC; now we use MFCC, GFCC, spectral, as well as chroma features. That improves our training confusion matrix significantly — all the numbers have gone up — and we notice the same thing for the 20 testing samples per class: pop has gone up by five more correctly classified, disco by nine, and country by seven. Everything has improved, so adding these features certainly added something to our model that helped it decipher between these classes better.
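With pyAudioProcessing, that feature expansion is just a longer feature list in the same call (same caveat as before: the call pattern follows the project README of the time, and exact signatures may differ; the genre folder layout is hypothetical):

```python
from pyAudioProcessing.run_classification import train_and_classify

# One sub-folder per genre under "genres/train", 80 files each
train_and_classify(
    "genres/train",
    "train",
    ["mfcc", "gfcc", "spectral", "chroma"],
    "svm",
    "svm_genre_clf",
)
```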
Again, the classifier was kept constant at SVM; if the classifier is experimented with as well, the model can be further improved.

To further see where the model is failing or not, the testing data can be elaborated on using confusion matrices and by getting the precision, recall, and F1 score. This helps you see which class is going wrong, and exactly where. For example, in this case we see that reggae in particular is getting incorrectly classified as hip-hop a lot. If that was somewhere we wanted to invest time, we could check the data samples that exist, the data quantity, and the data quality. It also really depends on what your goal is: if your goal is mainly to be able to classify pop, metal, and maybe, let's say, classical, then you already have a model that does a decent job for those particular classes.

This is just to showcase the capability of extracting features and using classifiers, but the things that could be tried are: experimenting with the data quantity and seeing what the data sizes look like; the data quality in particular, such as whether there are any noisy samples; and other features that can be used. Another consideration is that some of these genres have music with vocals, so maybe some sort of detection that way. And then other classifier backends can be experimented with.
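The per-class breakdown described above comes straight out of scikit-learn; a sketch with hypothetical true and predicted labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for a handful of evaluation files
y_true = ["reggae", "reggae", "hiphop", "classical", "pop", "reggae"]
y_pred = ["hiphop", "reggae", "hiphop", "classical", "pop", "hiphop"]

print(confusion_matrix(y_true, y_pred, labels=["classical", "hiphop", "pop", "reggae"]))
print(classification_report(y_true, y_pred))  # precision, recall, f1 per class
```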
Lastly, I'm going to discuss a location name classification problem: classifying audio of spoken location names and seeing if we are able to decipher them. Consider a very, very basic example — two spoken location names, London and Boston. These have similarities: the number of characters that form the words, and the number of syllables that form them. In the representation right in front of us, there is something that looks different between the plots for London and Boston, so that makes us feel like, okay, easy, right? It looks very differentiable — why wouldn't a model be able to do that? But then we compare three different spoken representations of London with three of Boston — everything on your left is London, and everything on your right is Boston — and these charts very quickly look different from each other, because they're spoken by different people: there are different styles in which one can say the same name, different accents, different pause locations. So there's a lot of variety in how one speaks.

So here we're conducting two experiments. In the first, we train only on female voice samples and then test on male voice samples. For training we have 23 samples for London and 23 samples for Boston, and for testing we have 17 samples each for London and Boston — but we're testing on only male voices and training on only female voices. Let's see if our model can do that. We try the MFCC feature and get a confusion matrix: on testing, we have 9 out of 17 for Boston correctly classified and 8 out of 17 for London. Let's see if we can improve that. We tried a GFCC feature and trained the model, and now we have 13 out of 17 correctly classified for London, and the same for Boston. Trying to improve it further by adding spectral features to GFCC, we now have 15 out of 17 correctly classified for London and 14 out of 17 for Boston.

Now, there's a lot going on here, in the sense that our training is only female voices and our testing is only male voices, and there is a difference: males and females have different lengths of the vocal tract, which leads to voices at different pitches. So there is that difference as well, and we are hoping that the model is still able to pick up on the spoken representations and get past the differences between the training and the testing samples.

Now, if we combine these samples and shuffle them up, so that we're training on female and male voices and also testing on female and male voices, our data becomes more representative when we're training the model. We see that even MFCC is now doing better than it was before, when we were training on just female samples and testing on male voice samples. With this different representation, all the models are doing better, and the training confusion matrix looks much better because of the representation that we have. Both of these experiments were done using the SVM classifier as well; by keeping the classifier constant, we were able to compare the effects of the different features.

Here's the very much promised resources slide. It has several links to features that we did not talk about, some of the papers that I mentioned — the music genre classification one — and the audio datasets, where you can find publicly available open source datasets.

Finally, I want to thank everyone for tuning in. It's been a pleasure. Thank you so much.

Thank you, Jyotika, and we have some questions for you. The first one is: do you think this toolkit could be useful for other vibrational analysis? The example they give is seismic signal processing — earthquakes.

That is a very, very interesting thought. Because this library deals mainly with audio, if you're talking about any audio effects of these signals, or any patterns that are audible, that you can hear, then I think it's definitely worth a try. I have personally not heard of that application before, but it sounds very interesting, and if there's any audible component to it, I would definitely give it a shot.
There's another question that asks: is pyAudioProcessing language agnostic, or is it just for English?

Oh, well, it actually doesn't matter. The pyAudioProcessing library is essentially written in Python, but what language your audio is in does not matter, because it's going to train on the samples you provide. If your samples are in English, it's going to train on that data. So it really depends on the data that you pass in rather than on the library; the library should be expected to behave similarly.

And how do you deal with noisy data? Have you considered automated noise classification and cleaning to any degree?

Yeah, that's a very good question. Noise is a very — I would say frustrating — component of dealing with anything involving classification and data. One thing about noise is that some cases are very basic and simple, where you just have a signal whose interesting content is in a few particular segments and the audio itself is long; for that there are simpler solutions, like removing the silent segments. But then there are other approaches to removing noise, like spectral subtraction, in which you take the full audio, look at the less content-filled portions, see what the signal looks like there, and subtract that from the entire signal, to remove noises such as something going on in the background — train noise, car noise. So I think it's a very interesting application, and MFCC features have been used quite a bit to remove noise from data, and GFCC features as well: especially in speaker identification, when noisy samples are provided, these features help you clean them up as well.
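A sketch of the spectral-subtraction idea just described, using scipy's STFT and assuming the first half second of the recording is background noise only (a simplification; a real pipeline would estimate the noise from detected low-activity segments):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_seconds=0.5, nperseg=512):
    # Short-time Fourier transform: magnitude and phase per time-frequency bin
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    magnitude, phase = np.abs(Z), np.angle(Z)
    # Estimate the noise spectrum from the (assumed) signal-free opening frames
    hop = nperseg // 2
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the estimate everywhere and floor negative magnitudes at zero
    clean_magnitude = np.maximum(magnitude - noise_estimate, 0.0)
    _, cleaned = istft(clean_magnitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return cleaned
```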
We have a lot of questions on Venueless, and the next one is: you've mentioned MFCC and GFCC features — what are the implications for classifying speech features in people, like accents, the age of the speaker, and maybe other attributes?

Very good question as well. There are a lot of applications of these features in, for example, gender classification and so on. GFCC specifically is very useful for speaker identification, so if you have a task where you want to differentiate between people, including across different age groups, I would definitely give that a shot. And then there are PNCC, BFCC — other cepstral coefficients that really help break down your signal and classify it into types, especially when it carries speaker information.

What's the visualization and plotting stack based on, for pyAudioProcessing? Here's another question from the audience.

Can you repeat that, sorry? What's the visualization and plotting stack based on, for pyAudioProcessing? Yes, it's a good question. It's mainly scikit-learn and matplotlib. Essentially, if you have your features, or any data that you want to visualize, once you have the data, visualizing can be done with anything like matplotlib or seaborn — any of your favorite visualization libraries.
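Once the samples are loaded, that plotting is plain matplotlib; a sketch (again assuming a hypothetical mono "speech.wav"):

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

sample_rate, samples = wavfile.read("speech.wav")  # hypothetical mono file

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(samples)                      # time-domain waveform
ax1.set_title("Waveform")
ax2.specgram(samples, Fs=sample_rate)  # time-frequency view
ax2.set_title("Spectrogram")
plt.tight_layout()
plt.show()
```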
And the final question — we still have time for it. People are asking: were the music samples entire songs, or were you only clipping part of the song? And if you did both, does the length of the sample have an effect on model performance?

That's a very good question. The length of the sample — unless it's really short and does not convey much information — would not have a very significant impact, because you're windowing the signal, which is where you're extracting features, and then averaging them out over the signal. The dataset I used specifically consisted of entire audios; it's called the GTZAN dataset, and I've attached a link to it in the resources slide as well. It contains all these genres, so if you want to explore that particular dataset, it's going to be right there.

Thank you so much, Jyotika Singh. Thank you for speaking at PyCon AU 2021.

Thank you so much, it's been a pleasure. Thank you very much for organizing the whole thing.

And for our audience at home: we now have a bit of a break, and we'll be back at — let me check — 1:30 Melbourne time, with graph data science and Paco Nathan. See you then.