1 00:00:11,679 --> 00:00:15,920 hello and welcome back 2 00:00:14,000 --> 00:00:18,000 our last speaker for the science data 3 00:00:15,920 --> 00:00:20,320 and analytics track today 4 00:00:18,000 --> 00:00:22,240 is now based in geneva but he used to 5 00:00:20,320 --> 00:00:24,160 attend the melbourne python users group 6 00:00:22,240 --> 00:00:25,840 back when he lived here 7 00:00:24,160 --> 00:00:27,599 it feels really good to be able to 8 00:00:25,840 --> 00:00:28,800 introduce him to the pycon australia 9 00:00:27,599 --> 00:00:30,800 audience 10 00:00:28,800 --> 00:00:33,120 lars yencken is a swedish australian 11 00:00:30,800 --> 00:00:35,840 engineering leader and data scientist 12 00:00:33,120 --> 00:00:37,600 currently leading the tech team at our 13 00:00:35,840 --> 00:00:40,320 world in data where he tries to wrangle 14 00:00:37,600 --> 00:00:42,320 the world's data sets into shape and share 15 00:00:40,320 --> 00:00:44,800 insights into how we can make the world 16 00:00:42,320 --> 00:00:46,800 better using that data 17 00:00:44,800 --> 00:00:49,440 however most of the world's data sets are 18 00:00:46,800 --> 00:00:51,520 scattered to the wind or worse 19 00:00:49,440 --> 00:00:53,600 stuck in excel files of massive 20 00:00:51,520 --> 00:00:55,039 organizations in tiny addendums to 21 00:00:53,600 --> 00:00:57,920 research papers 22 00:00:55,039 --> 00:00:59,280 in pdf format and even in government 23 00:00:57,920 --> 00:01:01,199 tweets 24 00:00:59,280 --> 00:01:04,159 lars will take us for a dive into 25 00:01:01,199 --> 00:01:06,960 initiatives by our world in data 26 00:01:04,159 --> 00:01:10,080 to unlock this information he'll share 27 00:01:06,960 --> 00:01:12,960 the joys and pains of data harmonization 28 00:01:10,080 --> 00:01:16,479 and ask how meta should data really be 29 00:01:12,960 --> 00:01:18,240 so please give a warm virtual applause 30 00:01:16,479 --> 00:01:21,040 to lars yencken 31 00:01:18,240 --> 00:01:22,880 and to his talk 
systems of the world 32 00:01:21,040 --> 00:01:24,960 cataloging the world's data for great 33 00:01:22,880 --> 00:01:26,720 good 34 00:01:24,960 --> 00:01:28,880 thanks so much fabio 35 00:01:26,720 --> 00:01:33,600 um and i should say yeah it's really fun 36 00:01:28,880 --> 00:01:36,720 to be able to uh join remotely uh 37 00:01:33,600 --> 00:01:39,680 having you know being an old melbourner 38 00:01:36,720 --> 00:01:41,360 so as having mentioned uh i'm from our 39 00:01:39,680 --> 00:01:43,200 world in data i'm going to be talking to 40 00:01:41,360 --> 00:01:45,920 you about our efforts to catalog the 41 00:01:43,200 --> 00:01:47,840 world's data um 42 00:01:45,920 --> 00:01:50,399 a little bit about me you know i 43 00:01:47,840 --> 00:01:52,720 actually i'm born in melbourne but i'm 44 00:01:50,399 --> 00:01:54,799 also part swedish 45 00:01:52,720 --> 00:01:57,600 i've spent the last five or six years 46 00:01:54,799 --> 00:01:59,119 living in stockholm in sweden i had a 47 00:01:57,600 --> 00:02:01,680 little boy there we moved our little 48 00:01:59,119 --> 00:02:03,920 family to geneva recently and recently 49 00:02:01,680 --> 00:02:05,040 i've been working for our world in data 50 00:02:03,920 --> 00:02:06,320 which is what i want to talk to you 51 00:02:05,040 --> 00:02:07,600 about today 52 00:02:06,320 --> 00:02:09,679 um 53 00:02:07,600 --> 00:02:12,400 most of my kind of working career has 54 00:02:09,679 --> 00:02:13,920 been in data science or in 55 00:02:12,400 --> 00:02:17,120 leading engineering teams in the health 56 00:02:13,920 --> 00:02:19,440 space but at our world i should say even 57 00:02:17,120 --> 00:02:21,520 the short experience so far at our world 58 00:02:19,440 --> 00:02:23,680 in data is quite different from these 59 00:02:21,520 --> 00:02:25,760 other data environments and i thought 60 00:02:23,680 --> 00:02:28,080 this would be something people might 61 00:02:25,760 --> 00:02:28,080 enjoy 62 00:02:28,239 --> 00:02:32,000 
today you're going to learn a bit about 63 00:02:29,599 --> 00:02:33,360 our world in data and our mission 64 00:02:32,000 --> 00:02:36,000 you're going to learn about what's 65 00:02:33,360 --> 00:02:37,680 involved in cataloging the world's data 66 00:02:36,000 --> 00:02:39,920 and you'll learn how to find and query 67 00:02:37,680 --> 00:02:43,040 some of this data for yourself 68 00:02:39,920 --> 00:02:43,040 so let's dive in 69 00:02:43,360 --> 00:02:49,760 our world in data is a small non-profit 70 00:02:46,560 --> 00:02:53,040 it's spun out of oxford university 71 00:02:49,760 --> 00:02:54,640 by our founder max roser and you know 72 00:02:53,040 --> 00:02:57,120 it's basically like a handful of 73 00:02:54,640 --> 00:02:58,800 researchers a handful of data managers 74 00:02:57,120 --> 00:03:02,080 and a handful of developers who are 75 00:02:58,800 --> 00:03:03,440 working together to try to 76 00:03:02,080 --> 00:03:04,879 shed some light on the world's big 77 00:03:03,440 --> 00:03:06,560 problems 78 00:03:04,879 --> 00:03:09,280 and our mission is really to make the 79 00:03:06,560 --> 00:03:12,080 data on these really big problems 80 00:03:09,280 --> 00:03:14,000 understandable and accessible 81 00:03:12,080 --> 00:03:17,760 we do that by 82 00:03:14,000 --> 00:03:19,519 writing articles that try to explain a 83 00:03:17,760 --> 00:03:21,040 lot of detail about these problems and 84 00:03:19,519 --> 00:03:23,440 make it accessible 85 00:03:21,040 --> 00:03:24,640 and we write about 86 00:03:23,440 --> 00:03:27,280 poverty 87 00:03:24,640 --> 00:03:30,640 and inequality 88 00:03:27,280 --> 00:03:33,519 we write about hunger 89 00:03:30,640 --> 00:03:35,280 about disease 90 00:03:33,519 --> 00:03:38,000 and about climate change 91 00:03:35,280 --> 00:03:39,120 amongst a range of other topics 92 00:03:38,000 --> 00:03:41,440 and now 93 00:03:39,120 --> 00:03:43,840 one thing that really distinguishes 94 00:03:41,440 --> 00:03:46,879 the writing here 
other than the kind of 95 00:03:43,840 --> 00:03:48,879 depth of research behind it is that 96 00:03:46,879 --> 00:03:52,080 everything is backed up with these 97 00:03:48,879 --> 00:03:54,319 really rich interactive charts 98 00:03:52,080 --> 00:03:56,319 so for example 99 00:03:54,319 --> 00:03:57,840 here's a chart of 100 00:03:56,319 --> 00:04:00,640 covid vaccine 101 00:03:57,840 --> 00:04:01,840 covid-19 vaccinations over time 102 00:04:00,640 --> 00:04:04,319 and 103 00:04:01,840 --> 00:04:06,080 this chart you know you can 104 00:04:04,319 --> 00:04:06,879 change the time frame you're interested 105 00:04:06,080 --> 00:04:09,360 in 106 00:04:06,879 --> 00:04:09,360 you can 107 00:04:10,319 --> 00:04:14,720 you can 108 00:04:11,920 --> 00:04:16,400 add countries so we can add australia 109 00:04:14,720 --> 00:04:18,400 and i know people are suffering in 110 00:04:16,400 --> 00:04:20,160 lockdown and trying to vaccinate as fast 111 00:04:18,400 --> 00:04:22,960 as possible in australia 112 00:04:20,160 --> 00:04:25,680 so i um but you can see that this curve 113 00:04:22,960 --> 00:04:28,479 is definitely on the rise 114 00:04:25,680 --> 00:04:29,199 it also does interesting things like if 115 00:04:28,479 --> 00:04:30,880 you 116 00:04:29,199 --> 00:04:32,400 click the time frame 117 00:04:30,880 --> 00:04:36,000 you know you get an appropriate 118 00:04:32,400 --> 00:04:36,000 visualization like a bar chart 119 00:04:36,080 --> 00:04:39,440 you can 120 00:04:37,280 --> 00:04:41,280 visualize things in maps you get a lot 121 00:04:39,440 --> 00:04:44,800 of source information and you can easily 122 00:04:41,280 --> 00:04:44,800 get to the underlying data too 123 00:04:46,880 --> 00:04:50,400 in order to 124 00:04:48,960 --> 00:04:52,000 you know in order to write about these 125 00:04:50,400 --> 00:04:55,120 topics and have a lot of these rich 126 00:04:52,000 --> 00:04:57,680 charts then actually we need a really 127 00:04:55,120 --> 00:05:00,320 solid data 
catalog behind them 128 00:04:57,680 --> 00:05:02,080 so let's take a look at 129 00:05:00,320 --> 00:05:04,479 what's involved in building this data 130 00:05:02,080 --> 00:05:04,479 catalog 131 00:05:05,199 --> 00:05:08,160 so 132 00:05:05,919 --> 00:05:10,479 we want to kind of help promote this 133 00:05:08,160 --> 00:05:13,120 evidence-based world view on a really 134 00:05:10,479 --> 00:05:14,880 wide range of global issues and if we 135 00:05:13,120 --> 00:05:18,800 want to do that then we need really good 136 00:05:14,880 --> 00:05:18,800 data from a wide range of sources 137 00:05:19,680 --> 00:05:22,880 if you've tried to pull data from 138 00:05:21,039 --> 00:05:25,680 different places you will notice that 139 00:05:22,880 --> 00:05:27,680 all sources use their own special format 140 00:05:25,680 --> 00:05:30,400 generally speaking i would say things 141 00:05:27,680 --> 00:05:32,479 are getting better year by year 142 00:05:30,400 --> 00:05:34,160 but you know the world doesn't share a 143 00:05:32,479 --> 00:05:37,039 single uh 144 00:05:34,160 --> 00:05:38,560 a single format for exchanging data 145 00:05:37,039 --> 00:05:40,720 and 146 00:05:38,560 --> 00:05:42,560 so what that means is that you have 147 00:05:40,720 --> 00:05:44,080 to deal with a huge amount of variety 148 00:05:42,560 --> 00:05:46,720 from different institutions and 149 00:05:44,080 --> 00:05:49,039 different researchers 150 00:05:46,720 --> 00:05:51,120 you'll also find that uh every 151 00:05:49,039 --> 00:05:53,280 institution has its own view on the 152 00:05:51,120 --> 00:05:55,280 world even if we just talk about 153 00:05:53,280 --> 00:05:57,759 countries so 154 00:05:55,280 --> 00:06:00,319 institutions often report countries by 155 00:05:57,759 --> 00:06:02,400 name rather than by iso code and they 156 00:06:00,319 --> 00:06:03,440 also pick they use their own names 157 00:06:02,400 --> 00:06:05,440 sometimes 158 00:06:03,440 --> 00:06:07,600 it's uh 159 00:06:05,440 
--> 00:06:10,479 uh it's kind of semi-political which 160 00:06:07,600 --> 00:06:12,400 name is used for a country 161 00:06:10,479 --> 00:06:14,400 countries have disputed borders and 162 00:06:12,400 --> 00:06:15,840 different organizations decide what's in 163 00:06:14,400 --> 00:06:17,919 or out of a country 164 00:06:15,840 --> 00:06:19,520 um 165 00:06:17,919 --> 00:06:22,000 they also 166 00:06:19,520 --> 00:06:24,080 based on the slice of global 167 00:06:22,000 --> 00:06:26,560 problems that they're looking at then 168 00:06:24,080 --> 00:06:27,680 the organizations also will pick like 169 00:06:26,560 --> 00:06:29,280 their own 170 00:06:27,680 --> 00:06:31,199 regions for the world 171 00:06:29,280 --> 00:06:33,759 so we might think of the world as being 172 00:06:31,199 --> 00:06:35,759 neatly split up into different regions 173 00:06:33,759 --> 00:06:37,360 like europe and 174 00:06:35,759 --> 00:06:39,520 asia and north america and things like 175 00:06:37,360 --> 00:06:41,199 that but actually the organizations cut 176 00:06:39,520 --> 00:06:41,919 the world up into special categories 177 00:06:41,199 --> 00:06:43,600 like 178 00:06:41,919 --> 00:06:45,039 low-income countries middle income 179 00:06:43,600 --> 00:06:47,520 countries 180 00:06:45,039 --> 00:06:50,800 east asia excluding china south asia 181 00:06:47,520 --> 00:06:51,919 excluding india things like that 182 00:06:50,800 --> 00:06:53,840 if you want to work with all this 183 00:06:51,919 --> 00:06:55,919 variety you need to bring everything 184 00:06:53,840 --> 00:06:57,039 into this sort of common format 185 00:06:55,919 --> 00:06:58,479 and then 186 00:06:57,039 --> 00:07:00,479 think about it from like a database 187 00:06:58,479 --> 00:07:02,240 perspective if you want to if you want 188 00:07:00,479 --> 00:07:04,000 to be able to merge or compare things 189 00:07:02,240 --> 00:07:05,360 from different data sets anything that 190 00:07:04,000 --> 00:07:06,720 you're 
going to join on you need to 191 00:07:05,360 --> 00:07:08,720 harmonize 192 00:07:06,720 --> 00:07:10,479 so uh that's especially things like 193 00:07:08,720 --> 00:07:12,960 countries but it could also be things 194 00:07:10,479 --> 00:07:16,919 like diseases and the words they use for 195 00:07:12,960 --> 00:07:16,919 genders and things like that 196 00:07:17,759 --> 00:07:22,319 now 197 00:07:19,120 --> 00:07:25,759 the covid-19 data set that we collect 198 00:07:22,319 --> 00:07:28,000 and republish i mean this is actually um 199 00:07:25,759 --> 00:07:30,800 an extreme case of 200 00:07:28,000 --> 00:07:32,800 what our world in data goes through to 201 00:07:30,800 --> 00:07:34,639 collect data 202 00:07:32,800 --> 00:07:37,199 roll back a year the world was just 203 00:07:34,639 --> 00:07:39,120 entering this pandemic 204 00:07:37,199 --> 00:07:40,639 governments are scrambling even to 205 00:07:39,120 --> 00:07:41,599 accurately they're scrambling just to 206 00:07:40,639 --> 00:07:44,720 test 207 00:07:41,599 --> 00:07:45,680 let alone report their testing 208 00:07:44,720 --> 00:07:47,680 so 209 00:07:45,680 --> 00:07:49,360 the data reporting is only just 210 00:07:47,680 --> 00:07:50,479 beginning to come online for different 211 00:07:49,360 --> 00:07:52,319 countries 212 00:07:50,479 --> 00:07:54,400 and 213 00:07:52,319 --> 00:07:56,240 the really big institutions like the 214 00:07:54,400 --> 00:07:57,280 world health organization that we might 215 00:07:56,240 --> 00:07:59,840 expect 216 00:07:57,280 --> 00:08:01,280 to collect and publish this data they it 217 00:07:59,840 --> 00:08:02,879 took them actually quite a long time to 218 00:08:01,280 --> 00:08:04,479 come to the table 219 00:08:02,879 --> 00:08:07,680 so the our world in data team had a 220 00:08:04,479 --> 00:08:09,840 tricky pivot to decide on 221 00:08:07,680 --> 00:08:11,919 most of the big problems in the world 222 00:08:09,840 --> 00:08:13,759 like child mortality you know 
they don't 223 00:08:11,919 --> 00:08:15,520 change day by day 224 00:08:13,759 --> 00:08:17,280 these big problems you know the data 225 00:08:15,520 --> 00:08:19,599 sets on them they're updated a few times 226 00:08:17,280 --> 00:08:21,599 a year or maybe even once a year 227 00:08:19,599 --> 00:08:22,960 but covid-19 data was coming in every 228 00:08:21,599 --> 00:08:24,800 single day 229 00:08:22,960 --> 00:08:26,800 but the team basically leaned in to try 230 00:08:24,800 --> 00:08:28,639 and fill this gap 231 00:08:26,800 --> 00:08:32,240 and that involved an immense amount of 232 00:08:28,639 --> 00:08:33,919 work but that work enabled 233 00:08:32,240 --> 00:08:38,320 a lot of 234 00:08:33,919 --> 00:08:39,760 countries to be able to respond to or to 235 00:08:38,320 --> 00:08:42,719 have better evidence to base their 236 00:08:39,760 --> 00:08:44,159 policy responses on 237 00:08:42,719 --> 00:08:47,120 the 238 00:08:44,159 --> 00:08:48,480 one area or just generally speaking all 239 00:08:47,120 --> 00:08:50,080 this country reporting is getting better 240 00:08:48,480 --> 00:08:51,200 and better big organizations have come 241 00:08:50,080 --> 00:08:53,839 to the table 242 00:08:51,200 --> 00:08:56,720 so a lot of the things that used to be 243 00:08:53,839 --> 00:08:59,680 hard are a lot easier like 244 00:08:56,720 --> 00:09:02,080 testing and cases and deaths 245 00:08:59,680 --> 00:09:04,720 but vaccinations is still a really 246 00:09:02,080 --> 00:09:05,839 interesting and difficult one 247 00:09:04,720 --> 00:09:07,360 so 248 00:09:05,839 --> 00:09:10,399 our world in data has a nature paper 249 00:09:07,360 --> 00:09:12,959 where we discuss uh the assembly of a 250 00:09:10,399 --> 00:09:14,800 vaccination data set but when it comes 251 00:09:12,959 --> 00:09:16,880 down to actually pulling it together 252 00:09:14,800 --> 00:09:19,040 there's like all the usual suspects that 253 00:09:16,880 --> 00:09:22,240 you can imagine when you're trying 
to 254 00:09:19,040 --> 00:09:24,320 pull data in from wherever you can get 255 00:09:22,240 --> 00:09:24,320 it 256 00:09:24,959 --> 00:09:29,600 and 257 00:09:26,000 --> 00:09:31,680 you know there's apis there's 258 00:09:29,600 --> 00:09:34,080 csv and excel neither of which are that 259 00:09:31,680 --> 00:09:36,000 terrible pdf was actually kind of 260 00:09:34,080 --> 00:09:37,360 prominent early on in the pandemic like 261 00:09:36,000 --> 00:09:40,399 the world health organization was 262 00:09:37,360 --> 00:09:41,200 publishing data in pdf 263 00:09:40,399 --> 00:09:42,560 and 264 00:09:41,200 --> 00:09:43,920 you know also 265 00:09:42,560 --> 00:09:45,600 numbers embedded in images and 266 00:09:43,920 --> 00:09:47,600 infographics 267 00:09:45,600 --> 00:09:49,680 and actually we still have them 268 00:09:47,600 --> 00:09:50,720 but you might be surprised to realize 269 00:09:49,680 --> 00:09:53,040 that 270 00:09:50,720 --> 00:09:55,440 you know in this data set there's also 271 00:09:53,040 --> 00:09:56,560 daily figures that are only given by 272 00:09:55,440 --> 00:09:58,640 video 273 00:09:56,560 --> 00:10:01,440 or daily figures that are posted to 274 00:09:58,640 --> 00:10:03,279 facebook only or daily figures that are 275 00:10:01,440 --> 00:10:05,839 only posted to twitter 276 00:10:03,279 --> 00:10:08,000 and uh there's still 12 countries that 277 00:10:05,839 --> 00:10:10,959 our world in data fetches the daily 278 00:10:08,000 --> 00:10:13,200 figures from directly from 279 00:10:10,959 --> 00:10:14,640 twitter 280 00:10:13,200 --> 00:10:17,200 interestingly as well as like this 281 00:10:14,640 --> 00:10:20,000 difficulty getting access to data there 282 00:10:17,200 --> 00:10:22,000 was one interesting case of like too 283 00:10:20,000 --> 00:10:23,680 much data 284 00:10:22,000 --> 00:10:25,120 the government of paraguay actually 285 00:10:23,680 --> 00:10:27,600 published 286 00:10:25,120 --> 00:10:29,440 vaccination status for 
every single 287 00:10:27,600 --> 00:10:30,959 citizen by name 288 00:10:29,440 --> 00:10:32,880 so if you want to know how they're doing 289 00:10:30,959 --> 00:10:33,920 with vaccinations you used to have to 290 00:10:32,880 --> 00:10:36,320 actually 291 00:10:33,920 --> 00:10:38,399 fetch this entire data set and then of 292 00:10:36,320 --> 00:10:41,279 all like you know people's names stuff 293 00:10:38,399 --> 00:10:43,600 you don't want to know but actually then 294 00:10:41,279 --> 00:10:45,519 roll that up into summary statistics 295 00:10:43,600 --> 00:10:47,920 thankfully now they do calculate those 296 00:10:45,519 --> 00:10:50,160 summary statistics so we don't need to 297 00:10:47,920 --> 00:10:53,279 be fetching all of this very personal 298 00:10:50,160 --> 00:10:53,279 data on their citizens 299 00:10:54,880 --> 00:10:58,640 so 300 00:10:56,800 --> 00:11:00,160 we've got data from these crazy sources 301 00:10:58,640 --> 00:11:01,600 and we're trying to bring it in and what 302 00:11:00,160 --> 00:11:03,680 does it look like what are we trying to 303 00:11:01,600 --> 00:11:06,079 bring it into 304 00:11:03,680 --> 00:11:09,600 our core data model is a tuple 305 00:11:06,079 --> 00:11:09,600 of four elements 306 00:11:10,320 --> 00:11:14,320 an entity which is often a country but 307 00:11:12,959 --> 00:11:16,079 it could be something else it could be a 308 00:11:14,320 --> 00:11:17,600 disease 309 00:11:16,079 --> 00:11:19,760 a variable 310 00:11:17,600 --> 00:11:22,959 a time point which is used at either a 311 00:11:19,760 --> 00:11:25,120 year or a date and a value 312 00:11:22,959 --> 00:11:27,440 so for example here we could we could 313 00:11:25,120 --> 00:11:31,200 read this tuple as 314 00:11:27,440 --> 00:11:33,600 australia had a life expectancy in 1921 315 00:11:31,200 --> 00:11:35,519 of 61 years 316 00:11:33,600 --> 00:11:39,200 and if you're curious the life 317 00:11:35,519 --> 00:11:41,680 expectancy today is more like 80 to 83 318 
00:11:39,200 --> 00:11:45,440 years so australians are living 20 years 319 00:11:41,680 --> 00:11:45,440 longer than they used to 100 years ago 320 00:11:46,880 --> 00:11:50,160 so 321 00:11:47,680 --> 00:11:51,360 when i came to our world in data i had 322 00:11:50,160 --> 00:11:53,519 you know i've been working in these 323 00:11:51,360 --> 00:11:54,560 commercial environments doing data 324 00:11:53,519 --> 00:11:55,360 science 325 00:11:54,560 --> 00:11:57,200 and 326 00:11:55,360 --> 00:11:59,360 there was a couple of things though that 327 00:11:57,200 --> 00:12:01,360 were really really different about the 328 00:11:59,360 --> 00:12:02,959 data we're working with 329 00:12:01,360 --> 00:12:04,560 in a commercial environment let's say 330 00:12:02,959 --> 00:12:06,720 you're a successful company you might 331 00:12:04,560 --> 00:12:07,600 have millions of customers 332 00:12:06,720 --> 00:12:08,480 um 333 00:12:07,600 --> 00:12:10,399 and 334 00:12:08,480 --> 00:12:12,000 for these customers 335 00:12:10,399 --> 00:12:14,560 you probably 336 00:12:12,000 --> 00:12:16,320 have a lot of data terabytes of data 337 00:12:14,560 --> 00:12:18,240 uh it will be split in a few different 338 00:12:16,320 --> 00:12:19,760 categories some of it is like the direct 339 00:12:18,240 --> 00:12:21,600 stuff that 340 00:12:19,760 --> 00:12:23,200 is about your product 341 00:12:21,600 --> 00:12:24,959 some of it will be like marketing 342 00:12:23,200 --> 00:12:27,120 acquisition data some of it will be 343 00:12:24,959 --> 00:12:29,040 automatic machine logs 344 00:12:27,120 --> 00:12:30,800 especially for debugging 345 00:12:29,040 --> 00:12:32,880 but so you have this massive data set 346 00:12:30,800 --> 00:12:35,040 but a ton of it is machine generated 347 00:12:32,880 --> 00:12:36,720 automatically captured 348 00:12:35,040 --> 00:12:39,360 and probably it's only a few hundred 349 00:12:36,720 --> 00:12:42,000 variables actually but a huge number of 350 00:12:39,360 
--> 00:12:43,519 time points a huge number of recordings 351 00:12:42,000 --> 00:12:45,600 at our world in data you know there's 352 00:12:43,519 --> 00:12:46,880 not that many entities we care about 353 00:12:45,600 --> 00:12:48,240 like we're especially looking at 354 00:12:46,880 --> 00:12:49,519 countries there's only a few hundred 355 00:12:48,240 --> 00:12:51,839 countries 356 00:12:49,519 --> 00:12:53,920 you add in some debian diseases and 357 00:12:51,839 --> 00:12:55,680 types of fish and different things you 358 00:12:53,920 --> 00:12:57,839 know still the number of entities are 359 00:12:55,680 --> 00:12:58,959 small the number of time points will be 360 00:12:57,839 --> 00:13:01,920 as small 361 00:12:58,959 --> 00:13:05,040 but the data set has like over 100 000 362 00:13:01,920 --> 00:13:07,920 different variables and this is by far 363 00:13:05,040 --> 00:13:09,120 the most diverse data set that i've ever 364 00:13:07,920 --> 00:13:11,279 worked with 365 00:13:09,120 --> 00:13:12,240 every one of those variables is not 366 00:13:11,279 --> 00:13:14,639 machine 367 00:13:12,240 --> 00:13:15,440 captured but is instead something that 368 00:13:14,639 --> 00:13:17,920 is 369 00:13:15,440 --> 00:13:20,000 like a research team has painstakingly 370 00:13:17,920 --> 00:13:22,240 put together and people have had to go 371 00:13:20,000 --> 00:13:24,399 out and collect and estimate so there's 372 00:13:22,240 --> 00:13:26,320 a huge value in this data set as well 373 00:13:24,399 --> 00:13:29,360 it's completely different shape to a 374 00:13:26,320 --> 00:13:29,360 commercial data set 375 00:13:30,000 --> 00:13:35,279 so if we've got 100 000 variables then 376 00:13:32,800 --> 00:13:37,360 uh you know we need more structure on 377 00:13:35,279 --> 00:13:39,600 them than that to even navigate our own 378 00:13:37,360 --> 00:13:41,920 data set and in fact we do have more 379 00:13:39,600 --> 00:13:44,160 structure we organize these variables 380 00:13:41,920 
--> 00:13:45,199 into data sets and data sets into 381 00:13:44,160 --> 00:13:46,880 namespaces 382 00:13:45,199 --> 00:13:49,199 so for example 383 00:13:46,880 --> 00:13:50,880 the one namespace might be the world 384 00:13:49,199 --> 00:13:54,720 health organization 385 00:13:50,880 --> 00:13:56,160 and they release for example uh data 386 00:13:54,720 --> 00:13:58,720 one data set they release is their 387 00:13:56,160 --> 00:14:01,519 global health observatory and so that 388 00:13:58,720 --> 00:14:03,519 might have that might be you know a 389 00:14:01,519 --> 00:14:05,920 gigabyte or two and that will have 390 00:14:03,519 --> 00:14:09,040 thousands of variables and all of that 391 00:14:05,920 --> 00:14:09,040 will kind of come together 392 00:14:09,600 --> 00:14:13,760 something else that's very different 393 00:14:11,760 --> 00:14:16,399 compared to commercial data sets is the 394 00:14:13,760 --> 00:14:18,240 amount of metadata that we keep 395 00:14:16,399 --> 00:14:20,480 and we keep that metadata for two 396 00:14:18,240 --> 00:14:23,199 reasons one is for provenance and the 397 00:14:20,480 --> 00:14:25,519 other one is for visualization 398 00:14:23,199 --> 00:14:26,959 first start with provenance you know you 399 00:14:25,519 --> 00:14:28,800 might have been reading about these 400 00:14:26,959 --> 00:14:31,040 scandals in behavioral economics or 401 00:14:28,800 --> 00:14:32,560 social science recently 402 00:14:31,040 --> 00:14:34,480 and this kind of 403 00:14:32,560 --> 00:14:36,399 dog ate my homework excuses that the 404 00:14:34,480 --> 00:14:38,160 data's missing and 405 00:14:36,399 --> 00:14:40,639 you know i'm sorry 406 00:14:38,160 --> 00:14:42,720 but actually we know that in science 407 00:14:40,639 --> 00:14:44,560 you should never have to take anyone's 408 00:14:42,720 --> 00:14:47,199 word for anything you should be able to 409 00:14:44,560 --> 00:14:48,800 go back to that raw data at any time 410 00:14:47,199 --> 00:14:51,040 
it's exactly the same thing with our 411 00:14:48,800 --> 00:14:53,199 articles and with our charts you should 412 00:14:51,040 --> 00:14:55,760 never have to take our word for anything 413 00:14:53,199 --> 00:14:58,000 at all you should always be able to get 414 00:14:55,760 --> 00:15:00,240 straight back to you should be able to 415 00:14:58,000 --> 00:15:01,839 get the raw data from our chart in two 416 00:15:00,240 --> 00:15:04,399 clicks and you should be able to get to 417 00:15:01,839 --> 00:15:05,839 the original source of that data as well 418 00:15:04,399 --> 00:15:07,680 in two clicks 419 00:15:05,839 --> 00:15:10,320 and in that way 420 00:15:07,680 --> 00:15:12,079 you know in that way you can come to 421 00:15:10,320 --> 00:15:13,440 know that you can trust our charts 422 00:15:12,079 --> 00:15:16,160 because 423 00:15:13,440 --> 00:15:17,839 the thing the data that we show you is 424 00:15:16,160 --> 00:15:19,440 really transparent about where it came 425 00:15:17,839 --> 00:15:22,399 from 426 00:15:19,440 --> 00:15:25,360 so we keep this at we keep a lot of 427 00:15:22,399 --> 00:15:27,199 metadata around provenance at this level 428 00:15:25,360 --> 00:15:28,639 the other side of metadata is around 429 00:15:27,199 --> 00:15:29,440 visualization 430 00:15:28,639 --> 00:15:30,880 so 431 00:15:29,440 --> 00:15:32,800 think about it this way when was the 432 00:15:30,880 --> 00:15:34,800 last time you just took a data frame and 433 00:15:32,800 --> 00:15:36,720 you plotted that data frame and you were 434 00:15:34,800 --> 00:15:38,959 like yes this is a chart i can show 435 00:15:36,720 --> 00:15:40,959 people this is beautiful 436 00:15:38,959 --> 00:15:42,720 it almost never happens 437 00:15:40,959 --> 00:15:45,519 so you end up like changing axis 438 00:15:42,720 --> 00:15:47,680 labels and titles and adding all of this 439 00:15:45,519 --> 00:15:49,600 stuff all of that stuff is actually 440 00:15:47,680 --> 00:15:52,839 information about the 
data that you're 441 00:15:49,600 --> 00:15:56,480 kind of adding later 442 00:15:52,839 --> 00:15:58,880 so we have a lot of rich metadata about 443 00:15:56,480 --> 00:16:01,680 entities and variables that let us 444 00:15:58,880 --> 00:16:03,759 visualize them much more smoothly 445 00:16:01,680 --> 00:16:06,160 so that we have to do 446 00:16:03,759 --> 00:16:08,560 we do more work when we get the data in 447 00:16:06,160 --> 00:16:11,279 but we do much less work when we have to 448 00:16:08,560 --> 00:16:13,279 visualize that data 449 00:16:11,279 --> 00:16:15,759 just as an example there if i take a 450 00:16:13,279 --> 00:16:17,440 look at life expectancy 451 00:16:15,759 --> 00:16:19,680 here we can actually see that for some 452 00:16:17,440 --> 00:16:21,360 countries life expectancy 453 00:16:19,680 --> 00:16:24,000 goes all the way to all the way back to 454 00:16:21,360 --> 00:16:25,680 1543. 455 00:16:24,000 --> 00:16:28,079 um 456 00:16:25,680 --> 00:16:30,480 things like the title of the chart is 457 00:16:28,079 --> 00:16:32,880 metadata around a variable things like 458 00:16:30,480 --> 00:16:34,320 the units in years 459 00:16:32,880 --> 00:16:36,639 it might be really hard to see but 460 00:16:34,320 --> 00:16:38,320 there's a little source description here 461 00:16:36,639 --> 00:16:40,639 that says 462 00:16:38,320 --> 00:16:42,480 where this data came from and a note 463 00:16:40,639 --> 00:16:44,800 that explains 464 00:16:42,480 --> 00:16:47,040 what life expectancy at birth or what it 465 00:16:44,800 --> 00:16:48,800 sort of means 466 00:16:47,040 --> 00:16:50,720 and one way you can think about it is 467 00:16:48,800 --> 00:16:53,920 that this is a chart that's designed to 468 00:16:50,720 --> 00:16:56,079 be shared like in a tweet out of context 469 00:16:53,920 --> 00:16:59,040 and it should be pretty hard to it 470 00:16:56,079 --> 00:17:02,959 should be pretty hard to misinterpret it 471 00:16:59,040 --> 00:17:02,959 if even if it's 
shared out of context 472 00:17:03,040 --> 00:17:05,439 other metadata that we keep is of 473 00:17:04,559 --> 00:17:07,439 course 474 00:17:05,439 --> 00:17:10,559 this provenance metadata and in this 475 00:17:07,439 --> 00:17:12,720 case for life expectancy there's quite a 476 00:17:10,559 --> 00:17:14,880 detailed explanation of how this data 477 00:17:12,720 --> 00:17:19,240 comes together and that's because it's 478 00:17:14,880 --> 00:17:19,240 stitched from several data sets 479 00:17:22,559 --> 00:17:27,919 i'll just talk to you a little bit about 480 00:17:24,959 --> 00:17:29,520 our architecture that we use 481 00:17:27,919 --> 00:17:30,559 under the hood 482 00:17:29,520 --> 00:17:35,679 so 483 00:17:30,559 --> 00:17:37,440 we keep an internal data catalog 484 00:17:35,679 --> 00:17:40,000 in mysql and this is our kind of 485 00:17:37,440 --> 00:17:41,840 source of truth for the world 486 00:17:40,000 --> 00:17:43,840 at the moment we publish over a thousand 487 00:17:41,840 --> 00:17:47,520 datasets from that catalog we republish 488 00:17:43,840 --> 00:17:50,000 them onto github but uh 489 00:17:47,520 --> 00:17:52,000 our goal actually in the coming year is 490 00:17:50,000 --> 00:17:55,120 to end up republishing 491 00:17:52,000 --> 00:17:56,320 more like 90% of that nearly everything 492 00:17:55,120 --> 00:17:58,160 the only things that we wouldn't 493 00:17:56,320 --> 00:18:00,400 republish would be things that are like 494 00:17:58,160 --> 00:18:02,160 embar data sets that are embargoed or 495 00:18:00,400 --> 00:18:06,000 data sets that are 496 00:18:02,160 --> 00:18:06,960 uh that we're unable to share publicly 497 00:18:06,000 --> 00:18:09,679 so 498 00:18:06,960 --> 00:18:11,919 uh our world in data is actually a 499 00:18:09,679 --> 00:18:13,360 jamstack site it's statically rendered the 500 00:18:11,919 --> 00:18:15,760 entire site 501 00:18:13,360 --> 00:18:18,240 so that means that all the interactive 502 00:18:15,760 --> 
00:18:20,240 charts you see are actually 503 00:18:18,240 --> 00:18:21,440 rendered in a kind of baking or compile 504 00:18:20,240 --> 00:18:23,520 process 505 00:18:21,440 --> 00:18:25,679 into the live into the site that you see 506 00:18:23,520 --> 00:18:27,679 live but this data set this 507 00:18:25,679 --> 00:18:31,120 data catalog is never used 508 00:18:27,679 --> 00:18:31,120 when you use our live site 509 00:18:31,280 --> 00:18:35,520 there's three channels for data to get 510 00:18:33,039 --> 00:18:37,919 into the data catalog one of them is 511 00:18:35,520 --> 00:18:40,840 around covid because covid is just such 512 00:18:37,919 --> 00:18:42,880 a special situation for the world 513 00:18:40,840 --> 00:18:45,120 um and 514 00:18:42,880 --> 00:18:46,880 we combined the automatic channels of 515 00:18:45,120 --> 00:18:48,799 reliable data that we know about 516 00:18:46,880 --> 00:18:51,120 including johns hopkins university 517 00:18:48,799 --> 00:18:53,120 perhaps the world health organization 518 00:18:51,120 --> 00:18:54,640 along with some of the manual things 519 00:18:53,120 --> 00:18:57,440 that i showed you 520 00:18:54,640 --> 00:19:00,240 pulling official data from tweets 521 00:18:57,440 --> 00:19:02,480 facebook posts video things like that 522 00:19:00,240 --> 00:19:05,280 and we combine all of this in to make a 523 00:19:02,480 --> 00:19:07,760 reference covid-19 data set 524 00:19:05,280 --> 00:19:09,200 and we published that on github 525 00:19:07,760 --> 00:19:10,960 we have published a data explorer for 526 00:19:09,200 --> 00:19:13,679 that on our site 527 00:19:10,960 --> 00:19:14,880 and um 528 00:19:13,679 --> 00:19:16,880 yeah 529 00:19:14,880 --> 00:19:19,440 you will if you see global data being 530 00:19:16,880 --> 00:19:21,200 compared if you you'll see it in your in 531 00:19:19,440 --> 00:19:22,559 a newspaper perhaps if you read the age 532 00:19:21,200 --> 00:19:25,840 or sydney morning herald or something 533 
00:19:22,559 --> 00:19:29,039 similar you often see references to this 534 00:19:25,840 --> 00:19:29,039 particular data set 535 00:19:29,360 --> 00:19:34,400 i also mentioned that there are these 536 00:19:30,880 --> 00:19:35,840 big institutions like the who 537 00:19:34,400 --> 00:19:38,960 different departments of the united 538 00:19:35,840 --> 00:19:40,880 nations they go to immense effort to try 539 00:19:38,960 --> 00:19:42,720 to collect 540 00:19:40,880 --> 00:19:45,360 painstakingly 541 00:19:42,720 --> 00:19:48,000 information about global progress 542 00:19:45,360 --> 00:19:49,120 so they usually release that a few times 543 00:19:48,000 --> 00:19:53,360 a year 544 00:19:49,120 --> 00:19:56,080 and we do a process of bulk cleaning and 545 00:19:53,360 --> 00:19:57,360 uploading that anytime they do these 546 00:19:56,080 --> 00:20:00,080 releases 547 00:19:57,360 --> 00:20:04,240 and this work is actually the work of 548 00:20:00,080 --> 00:20:08,480 four full-time people in a data team 549 00:20:04,240 --> 00:20:10,159 finally if you are interested in uh 550 00:20:08,480 --> 00:20:12,320 well finally i should say if you're 551 00:20:10,159 --> 00:20:14,960 looking at these big global problems 552 00:20:12,320 --> 00:20:16,159 there are a lot of tiny facets to these 553 00:20:14,960 --> 00:20:17,600 problems 554 00:20:16,159 --> 00:20:20,640 for example 555 00:20:17,600 --> 00:20:22,640 there are um 556 00:20:20,640 --> 00:20:23,840 you know there was new data that came 557 00:20:22,640 --> 00:20:26,400 out on 558 00:20:23,840 --> 00:20:29,120 diet affordability which offered a kind 559 00:20:26,400 --> 00:20:31,120 of different perspective on 560 00:20:29,120 --> 00:20:33,039 global hunger 561 00:20:31,120 --> 00:20:35,280 and this sort of research data set 562 00:20:33,039 --> 00:20:37,120 actually our researchers themselves have 563 00:20:35,280 --> 00:20:40,640 a process an admin that they can 564 00:20:37,120 --> 00:20:40,640 harmonize and bring 
this data in 565 00:20:43,200 --> 00:20:45,600 so 566 00:20:44,400 --> 00:20:48,240 let me 567 00:20:45,600 --> 00:20:51,280 talk to you a little bit about getting 568 00:20:48,240 --> 00:20:54,799 your hands on this kind of global data 569 00:20:51,280 --> 00:20:56,240 so that you can actually um you know 570 00:20:54,799 --> 00:20:58,480 so that you can dig into these issues 571 00:20:56,240 --> 00:21:00,240 yourself if there's a major global issue 572 00:20:58,480 --> 00:21:02,960 and you want to get your hands on that 573 00:21:00,240 --> 00:21:04,320 data how can you do it 574 00:21:02,960 --> 00:21:06,880 so 575 00:21:04,320 --> 00:21:09,440 like everything else these days search 576 00:21:06,880 --> 00:21:12,799 is still your friend 577 00:21:09,440 --> 00:21:15,039 if you do a search i would say you know 578 00:21:12,799 --> 00:21:17,039 often you're hoping to get data from a 579 00:21:15,039 --> 00:21:18,559 big institution because you know that 580 00:21:17,039 --> 00:21:20,320 their data set is going to be pretty 581 00:21:18,559 --> 00:21:23,280 reliable that they stake their 582 00:21:20,320 --> 00:21:23,280 credibility on it 583 00:21:23,360 --> 00:21:26,320 but there 584 00:21:24,799 --> 00:21:28,400 if you see our world in data in the 585 00:21:26,320 --> 00:21:32,880 search results even if you don't care at 586 00:21:28,400 --> 00:21:35,120 all about our articles there you can 587 00:21:32,880 --> 00:21:37,039 it's still worth coming to us to try and 588 00:21:35,120 --> 00:21:39,520 get that data out and i can explain to 589 00:21:37,039 --> 00:21:40,720 you a little bit why 590 00:21:39,520 --> 00:21:42,480 let's take a look at indoor air 591 00:21:40,720 --> 00:21:44,080 pollution 592 00:21:42,480 --> 00:21:45,440 we can do a google search for indoor air 593 00:21:44,080 --> 00:21:48,320 pollution 594 00:21:45,440 --> 00:21:50,880 uh and it says that oh this is uh dust 595 00:21:48,320 --> 00:21:52,559 dirt and gases in the air and that could 
596 00:21:50,880 --> 00:21:54,400 be harmful to you 597 00:21:52,559 --> 00:21:56,159 you might think wait indoor air 598 00:21:54,400 --> 00:21:58,000 pollution is this one of the world's big 599 00:21:56,159 --> 00:22:00,799 global problems 600 00:21:58,000 --> 00:22:03,520 and it could be surprising to you to 601 00:22:00,799 --> 00:22:05,679 discover that you know this kills 3.6 602 00:22:03,520 --> 00:22:07,440 million people every year which is six 603 00:22:05,679 --> 00:22:09,840 times more than all 604 00:22:07,440 --> 00:22:11,520 deaths from war terror attacks and 605 00:22:09,840 --> 00:22:14,720 murders combined 606 00:22:11,520 --> 00:22:16,960 so it's definitely a major cause of 607 00:22:14,720 --> 00:22:18,960 death in the world 608 00:22:16,960 --> 00:22:21,280 and 609 00:22:18,960 --> 00:22:23,280 well let's keep let's look down and see 610 00:22:21,280 --> 00:22:24,720 what we find in these search results we 611 00:22:23,280 --> 00:22:26,960 do see some stuff from the world health 612 00:22:24,720 --> 00:22:28,720 organization and this is probably one 613 00:22:26,960 --> 00:22:30,559 channel that you can go and get data on 614 00:22:28,720 --> 00:22:33,360 this problem 615 00:22:30,559 --> 00:22:36,080 but we also see an article here in our 616 00:22:33,360 --> 00:22:36,080 world in data 617 00:22:36,320 --> 00:22:41,039 now 618 00:22:38,400 --> 00:22:43,200 even if you are 619 00:22:41,039 --> 00:22:44,640 not interested in the article you can 620 00:22:43,200 --> 00:22:46,960 actually come down 621 00:22:44,640 --> 00:22:49,679 and look at the charts each chart is 622 00:22:46,960 --> 00:22:50,559 kind of representing a little small data 623 00:22:49,679 --> 00:22:52,720 set 624 00:22:50,559 --> 00:22:55,679 and 625 00:22:52,720 --> 00:22:58,720 you can download this data 626 00:22:55,679 --> 00:22:58,720 directly if you want to 627 00:23:01,360 --> 00:23:06,000 and here we see that uh outdoor and 628 00:23:03,679 --> 00:23:07,200 indoor 
pollution air pollution together 629 00:23:06,000 --> 00:23:09,600 is 630 00:23:07,200 --> 00:23:12,159 the fourth highest risk factor for death 631 00:23:09,600 --> 00:23:13,840 in 2017. 632 00:23:12,159 --> 00:23:15,520 and maybe just a reminder for indoor air 633 00:23:13,840 --> 00:23:17,120 pollution 634 00:23:15,520 --> 00:23:20,640 the reason indoor air pollution is a 635 00:23:17,120 --> 00:23:23,600 problem it's for people who do not have 636 00:23:20,640 --> 00:23:25,280 good clean energy sources to 637 00:23:23,600 --> 00:23:27,760 feed or to 638 00:23:25,280 --> 00:23:29,520 to cook food and to heat their homes and 639 00:23:27,760 --> 00:23:31,280 that means they have to burn solid fuels 640 00:23:29,520 --> 00:23:34,000 and then you have this like 641 00:23:31,280 --> 00:23:35,760 bad smoky environment in the home 642 00:23:34,000 --> 00:23:38,480 actually it affects 3 billion people 643 00:23:35,760 --> 00:23:41,679 worldwide we can see that especially it 644 00:23:38,480 --> 00:23:43,840 affects sub-saharan africa east asia 645 00:23:41,679 --> 00:23:45,679 and you might not be aware that 646 00:23:43,840 --> 00:23:47,919 nearly 11 percent of our neighbors in 647 00:23:45,679 --> 00:23:49,360 papua new guinea die because of uh 648 00:23:47,919 --> 00:23:51,679 indoor air pollution 649 00:23:49,360 --> 00:23:53,679 so it's actually a reasonably 650 00:23:51,679 --> 00:23:55,120 significant problem 651 00:23:53,679 --> 00:23:58,080 again 652 00:23:55,120 --> 00:24:00,240 uh to get the data on this 653 00:23:58,080 --> 00:24:02,799 it's very simple just to click and get 654 00:24:00,240 --> 00:24:05,440 a csv and this is easy to work with and to 655 00:24:02,799 --> 00:24:09,200 use 656 00:24:05,440 --> 00:24:10,159 suppose you do this search and you don't 657 00:24:09,200 --> 00:24:12,080 either 658 00:24:10,159 --> 00:24:12,960 well for whatever reason you come up dry 659 00:24:12,080 --> 00:24:15,120 you 660 00:24:12,960 --> 00:24:16,720 don't find an
article from us that lets 661 00:24:15,120 --> 00:24:19,039 you quickly get into a chart and get to 662 00:24:16,720 --> 00:24:21,760 a data set or you get to an 663 00:24:19,039 --> 00:24:23,520 international institution but you end up 664 00:24:21,760 --> 00:24:25,200 chasing your tail in circles because 665 00:24:23,520 --> 00:24:26,400 they don't make the data super easy to 666 00:24:25,200 --> 00:24:28,320 get 667 00:24:26,400 --> 00:24:31,279 so then there's two really good open 668 00:24:28,320 --> 00:24:34,640 data catalogs published on github one of 669 00:24:31,279 --> 00:24:36,880 them is from us at our world in data 670 00:24:34,640 --> 00:24:38,720 our owid-datasets catalog the other one 671 00:24:36,880 --> 00:24:40,720 is from gapminder 672 00:24:38,720 --> 00:24:42,559 and gapminder you might especially know 673 00:24:40,720 --> 00:24:45,760 from the late hans rosling who used to 674 00:24:42,559 --> 00:24:48,159 do really excellent ted talks 675 00:24:45,760 --> 00:24:51,039 their mission is to combat devastating 676 00:24:48,159 --> 00:24:55,279 ignorance in the world and they are also 677 00:24:51,039 --> 00:24:59,960 trying to collect and republish global 678 00:24:55,279 --> 00:24:59,960 progress data in a really nice way 679 00:25:00,400 --> 00:25:05,120 both of these data sets both us 680 00:25:03,360 --> 00:25:07,679 and gapminder we actually publish in a 681 00:25:05,120 --> 00:25:09,200 format called frictionless data now i 682 00:25:07,679 --> 00:25:11,360 had never come across frictionless data 683 00:25:09,200 --> 00:25:12,720 until recently so this was kind of news 684 00:25:11,360 --> 00:25:14,480 to me and it's interesting when you've 685 00:25:12,720 --> 00:25:16,799 worked in data science for ages but you 686 00:25:14,480 --> 00:25:18,640 haven't had a chance to actually 687 00:25:16,799 --> 00:25:22,320 where you discover a format that people 688 00:25:18,640 --> 00:25:22,320 are using that you've never used before 689
00:25:22,480 --> 00:25:27,039 um i mentioned all sorts of like weird 690 00:25:25,200 --> 00:25:28,799 ways of getting data in like well i 691 00:25:27,039 --> 00:25:30,480 imagine we have to check videos and 692 00:25:28,799 --> 00:25:34,559 tweets and facebook 693 00:25:30,480 --> 00:25:37,360 csvs are definitely not the worst format 694 00:25:34,559 --> 00:25:37,360 on the other hand 695 00:25:37,600 --> 00:25:43,440 when you think about the last time you 696 00:25:39,679 --> 00:25:45,279 used a csv file to do some serious work 697 00:25:43,440 --> 00:25:47,039 first you load it in super easy but then 698 00:25:45,279 --> 00:25:48,480 the first thing you have to do is fix 699 00:25:47,039 --> 00:25:50,320 all the types 700 00:25:48,480 --> 00:25:52,080 and 701 00:25:50,320 --> 00:25:54,640 this is a problem with csv that it's 702 00:25:52,080 --> 00:25:57,039 kind of lossy with respect to types or 703 00:25:54,640 --> 00:25:58,720 the schema if you save your csv file you 704 00:25:57,039 --> 00:26:01,120 load it back into a data frame it's not 705 00:25:58,720 --> 00:26:03,520 going to be the same data frame 706 00:26:01,120 --> 00:26:04,880 so this is one problem csvs don't have a 707 00:26:03,520 --> 00:26:06,880 schema 708 00:26:04,880 --> 00:26:09,120 the other problem with csvs is that they 709 00:26:06,880 --> 00:26:11,760 don't have any room for the type of 710 00:26:09,120 --> 00:26:13,600 metadata i described to you before 711 00:26:11,760 --> 00:26:16,159 like where did this column come from 712 00:26:13,600 --> 00:26:18,799 what does it mean what units are in 713 00:26:16,159 --> 00:26:18,799 things like that 714 00:26:18,960 --> 00:26:23,440 i think of the frictionless format as 715 00:26:20,880 --> 00:26:27,440 being csvs without shame 716 00:26:23,440 --> 00:26:29,120 so you've got csv files they're actually 717 00:26:27,440 --> 00:26:30,880 there's some nice things about them like 718 00:26:29,120 --> 00:26:31,919 you can really easily look at 
them as a 719 00:26:30,880 --> 00:26:34,000 human 720 00:26:31,919 --> 00:26:36,000 but they lack this kind of machine 721 00:26:34,000 --> 00:26:37,360 readability and they lack this 722 00:26:36,000 --> 00:26:40,159 metadata 723 00:26:37,360 --> 00:26:41,559 so if you basically add a json sidecar 724 00:26:40,159 --> 00:26:44,080 file 725 00:26:41,559 --> 00:26:46,080 datapackage.json that includes these 726 00:26:44,080 --> 00:26:48,880 things you've more or less got this 727 00:26:46,080 --> 00:26:50,320 frictionless format 728 00:26:48,880 --> 00:26:52,080 um 729 00:26:50,320 --> 00:26:55,840 yeah so maybe we can take a look at that 730 00:26:52,080 --> 00:26:58,000 actually so if i go to our 731 00:26:55,840 --> 00:27:00,799 github i'm in our world in data and our 732 00:26:58,000 --> 00:27:02,720 world in data data sets 733 00:27:00,799 --> 00:27:05,120 you can 734 00:27:02,720 --> 00:27:07,200 uh we can click straight into data sets 735 00:27:05,120 --> 00:27:08,880 and we see that uh 736 00:27:07,200 --> 00:27:11,760 well there's over a thousand data sets 737 00:27:08,880 --> 00:27:12,559 here covering all sorts of things work 738 00:27:11,760 --> 00:27:15,279 and 739 00:27:12,559 --> 00:27:18,480 work and leisure access to electricity 740 00:27:15,279 --> 00:27:20,720 literacy tuberculosis 741 00:27:18,480 --> 00:27:23,120 let's take a look at the affordability 742 00:27:20,720 --> 00:27:25,279 of diets 743 00:27:23,120 --> 00:27:26,320 again a nice thing about frictionless 744 00:27:25,279 --> 00:27:29,520 format 745 00:27:26,320 --> 00:27:31,279 is that the csv is still there you can 746 00:27:29,520 --> 00:27:32,399 still just go and look at it 747 00:27:31,279 --> 00:27:35,840 um 748 00:27:32,399 --> 00:27:35,840 and here we are 749 00:27:36,080 --> 00:27:42,399 we're able to see various uh costs of 750 00:27:39,919 --> 00:27:45,440 eating enough calories versus the sort 751 00:27:42,399 --> 00:27:48,159 of diet that your body
needs 752 00:27:45,440 --> 00:27:50,880 but you can also 753 00:27:48,159 --> 00:27:53,120 look in the datapackage.json file and 754 00:27:50,880 --> 00:27:56,640 here then you would see a lot of this 755 00:27:53,120 --> 00:27:58,559 metadata where did the data come from 756 00:27:56,640 --> 00:27:59,440 you see a schema 757 00:27:58,559 --> 00:28:03,360 and 758 00:27:59,440 --> 00:28:04,240 including descriptions for different uh 759 00:28:03,360 --> 00:28:06,559 for 760 00:28:04,240 --> 00:28:09,200 different columns and also some of this 761 00:28:06,559 --> 00:28:10,960 display data like the cost of a calorie 762 00:28:09,200 --> 00:28:13,840 sufficient data 763 00:28:10,960 --> 00:28:16,159 diet the short unit here is we if we 764 00:28:13,840 --> 00:28:17,840 don't have space we call it a dollar but 765 00:28:16,159 --> 00:28:21,120 if we've got more space we say this is 766 00:28:17,840 --> 00:28:23,600 actually technically a 2017 us dollar 767 00:28:21,120 --> 00:28:23,600 per day 768 00:28:24,799 --> 00:28:28,159 um 769 00:28:26,320 --> 00:28:30,640 there's a frictionless package in python 770 00:28:28,159 --> 00:28:32,399 that makes it super easy to load in this 771 00:28:30,640 --> 00:28:34,159 data and to 772 00:28:32,399 --> 00:28:36,240 get to both the metadata and the data 773 00:28:34,159 --> 00:28:39,679 frame out and so i recommend that if 774 00:28:36,240 --> 00:28:39,679 you're interested in that format 775 00:28:40,960 --> 00:28:46,480 just some like thoughts to close here um 776 00:28:44,799 --> 00:28:47,840 so 777 00:28:46,480 --> 00:28:50,720 at our world in data we talk about 778 00:28:47,840 --> 00:28:52,880 massive global problems and 779 00:28:50,720 --> 00:28:54,720 the world has really big problems 780 00:28:52,880 --> 00:28:56,799 and on the other hand when people get 781 00:28:54,720 --> 00:28:59,440 lonely you look at the news the news is 782 00:28:56,799 --> 00:29:00,960 also kind of giving you 783 00:28:59,440 --> 00:29:02,159 
giving you all of these problems showing 784 00:29:00,960 --> 00:29:03,919 you all these things that are wrong with 785 00:29:02,159 --> 00:29:06,000 the world every single day 786 00:29:03,919 --> 00:29:08,960 and you combine these things and it's 787 00:29:06,000 --> 00:29:10,720 really easy for people to get cynical 788 00:29:08,960 --> 00:29:13,039 and to feel like 789 00:29:10,720 --> 00:29:16,000 there's not much 790 00:29:13,039 --> 00:29:17,520 to feel like there's not much uh worth 791 00:29:16,000 --> 00:29:19,120 in putting their efforts into trying to 792 00:29:17,520 --> 00:29:21,279 make the world better 793 00:29:19,120 --> 00:29:22,640 and surveys of young people show that a 794 00:29:21,279 --> 00:29:25,120 lot of them feel like the world is in 795 00:29:22,640 --> 00:29:25,120 decline 796 00:29:25,600 --> 00:29:29,520 at our world in data we think that the 797 00:29:27,039 --> 00:29:32,559 really big picture that we should be 798 00:29:29,520 --> 00:29:34,399 considering is both the massive global 799 00:29:32,559 --> 00:29:36,320 challenges and the massive global 800 00:29:34,399 --> 00:29:37,600 achievements that we've had 801 00:29:36,320 --> 00:29:38,880 if we talk about some of these 802 00:29:37,600 --> 00:29:40,480 achievements 803 00:29:38,880 --> 00:29:42,960 they're things like 804 00:29:40,480 --> 00:29:46,080 moving from a world where nearly 805 00:29:42,960 --> 00:29:47,279 everyone lived in extreme poverty 200 806 00:29:46,080 --> 00:29:49,039 years ago 807 00:29:47,279 --> 00:29:51,919 to a world where we've been able to 808 00:29:49,039 --> 00:29:53,760 close and close and close this gap and 809 00:29:51,919 --> 00:29:56,080 we can really foresee that we'll end 810 00:29:53,760 --> 00:29:57,520 extreme poverty 811 00:29:56,080 --> 00:30:00,960 you know 812 00:29:57,520 --> 00:30:03,600 likely in the next 50 years 813 00:30:00,960 --> 00:30:05,679 we lived in a world where 814 00:30:03,600 --> 00:30:07,520 more than 40 percent of
children used to die 815 00:30:05,679 --> 00:30:09,279 before the age of five 816 00:30:07,520 --> 00:30:10,960 and we've been able to really really 817 00:30:09,279 --> 00:30:13,520 drastically cut child mortality 818 00:30:10,960 --> 00:30:16,960 especially in the last hundred years 819 00:30:13,520 --> 00:30:18,720 uh so that more and like neil yeah 820 00:30:16,960 --> 00:30:21,360 even this gap that's still left it's 821 00:30:18,720 --> 00:30:24,720 still unacceptable but yet we're doing 822 00:30:21,360 --> 00:30:27,120 so much better now than we were before 823 00:30:24,720 --> 00:30:28,399 more and more people can read get access 824 00:30:27,120 --> 00:30:31,120 to ideas 825 00:30:28,399 --> 00:30:33,760 uh navigate their 826 00:30:31,120 --> 00:30:35,600 local communities better 827 00:30:33,760 --> 00:30:37,440 and this corresponds with a massive 828 00:30:35,600 --> 00:30:40,000 increase in the global level of 829 00:30:37,440 --> 00:30:41,919 education something that's projected to 830 00:30:40,000 --> 00:30:44,399 continue well through the end of the 831 00:30:41,919 --> 00:30:44,399 century 832 00:30:44,880 --> 00:30:49,039 and people are also living in greater 833 00:30:47,840 --> 00:30:51,360 and greater 834 00:30:49,039 --> 00:30:53,760 political freedom than they have in the 835 00:30:51,360 --> 00:30:57,360 past and especially the last hundred 836 00:30:53,760 --> 00:30:58,240 years we see the end to the colonial era 837 00:30:57,360 --> 00:30:59,760 so 838 00:30:58,240 --> 00:31:01,440 you know you can understand someone 839 00:30:59,760 --> 00:31:05,120 writing an article like this this has 840 00:31:01,440 --> 00:31:06,720 been the best year ever except 841 00:31:05,120 --> 00:31:08,880 you know the moment they wrote it the 842 00:31:06,720 --> 00:31:10,640 global pandemic was beginning and i know 843 00:31:08,880 --> 00:31:13,360 you're all in lockdown so you'll be like 844 00:31:10,640 --> 00:31:15,120 lars this is not the best year ever 
845 00:31:13,360 --> 00:31:17,600 but 846 00:31:15,120 --> 00:31:19,440 i really believe that if we put our 847 00:31:17,600 --> 00:31:21,120 energies into these massive global 848 00:31:19,440 --> 00:31:22,799 problems you know this pandemic we're 849 00:31:21,120 --> 00:31:24,559 going to see our way through it 850 00:31:22,799 --> 00:31:26,480 on the other side 851 00:31:24,559 --> 00:31:28,159 soon i think we'll be able to say again 852 00:31:26,480 --> 00:31:30,559 that each year will have been the best 853 00:31:28,159 --> 00:31:31,519 year ever 854 00:31:30,559 --> 00:31:35,519 and 855 00:31:31,519 --> 00:31:35,519 yeah that's me thank you 856 00:31:36,000 --> 00:31:42,480 thank you lars 857 00:31:38,159 --> 00:31:44,559 this has been um sovereign talk and uh 858 00:31:42,480 --> 00:31:47,200 people on the back channels were asking 859 00:31:44,559 --> 00:31:48,480 lots of questions we have time for a 860 00:31:47,200 --> 00:31:49,440 couple 861 00:31:48,480 --> 00:31:51,679 so 862 00:31:49,440 --> 00:31:54,159 the first one is the most voted by the 863 00:31:51,679 --> 00:31:56,320 audience how does our world in data relate 864 00:31:54,159 --> 00:31:59,120 to fair data principles 865 00:31:56,320 --> 00:32:01,840 and what would make your work easier in 866 00:31:59,120 --> 00:32:01,840 that regard 867 00:32:02,000 --> 00:32:04,399 um 868 00:32:04,480 --> 00:32:10,320 so i if there is a kind of technical 869 00:32:08,000 --> 00:32:12,240 concept of fair data principles then 870 00:32:10,320 --> 00:32:15,679 unfortunately it's still something for 871 00:32:12,240 --> 00:32:17,360 me to learn but i would say 872 00:32:15,679 --> 00:32:19,200 there's two things that would make our 873 00:32:17,360 --> 00:32:22,000 work a lot easier 874 00:32:19,200 --> 00:32:23,440 one would be a really nice standardized 875 00:32:22,000 --> 00:32:23,700 format for 876 00:32:23,440 --> 00:32:25,200 uh 877 00:32:23,700 --> 00:32:27,279 [Music] 878 00:32:25,200 --> 00:32:30,720
that big institutions and researchers 879 00:32:27,279 --> 00:32:32,640 both publish in that would be super nice 880 00:32:30,720 --> 00:32:35,440 um 881 00:32:32,640 --> 00:32:36,880 maybe on the other hand if i if if i 882 00:32:35,440 --> 00:32:38,480 instead of talking about what my gal 883 00:32:36,880 --> 00:32:40,840 work is if i just talk a little bit 884 00:32:38,480 --> 00:32:44,000 about a fairness angle 885 00:32:40,840 --> 00:32:46,880 um our world in data now gets quite a 886 00:32:44,000 --> 00:32:48,720 lot of traffic about a lot of issues and 887 00:32:46,880 --> 00:32:50,880 when i was a researcher you know when i 888 00:32:48,720 --> 00:32:52,640 collected data i was always terrified 889 00:32:50,880 --> 00:32:54,240 about sharing that data because i was 890 00:32:52,640 --> 00:32:55,919 sure that someone else would scoop me 891 00:32:54,240 --> 00:32:57,919 they would publish something they would 892 00:32:55,919 --> 00:32:59,440 find something amazing in my data that i 893 00:32:57,919 --> 00:33:01,840 didn't 894 00:32:59,440 --> 00:33:04,240 uh at our world in data we try and be 895 00:33:01,840 --> 00:33:06,880 super super good about 896 00:33:04,240 --> 00:33:08,960 any time we grab data from researchers 897 00:33:06,880 --> 00:33:11,519 uh we'll firstly talk to the researchers 898 00:33:08,960 --> 00:33:14,080 but we also make sure to try and 899 00:33:11,519 --> 00:33:16,000 direct all credit and we direct people 900 00:33:14,080 --> 00:33:18,159 back to the researchers because that's 901 00:33:16,000 --> 00:33:20,640 definitely what's needed 902 00:33:18,159 --> 00:33:21,760 to get more people to be willing to 903 00:33:20,640 --> 00:33:26,240 share 904 00:33:21,760 --> 00:33:28,960 their data and i think uh 905 00:33:26,240 --> 00:33:30,640 yeah it's just like it's it's uh 906 00:33:28,960 --> 00:33:32,559 the more people who are able and willing 907 00:33:30,640 --> 00:33:35,840 to share what they have i think the 908 00:33:32,559 -->
00:33:35,840 better it is for the community 909 00:33:36,240 --> 00:33:39,840 um people are asking also the 910 00:33:37,600 --> 00:33:41,919 visualizations look great what tools do 911 00:33:39,840 --> 00:33:44,240 the team use 912 00:33:41,919 --> 00:33:47,279 we developed an in-house tool uh which 913 00:33:44,240 --> 00:33:49,279 we call grapher and that's i've actually 914 00:33:47,279 --> 00:33:50,960 been doing some work to try and port 915 00:33:49,279 --> 00:33:52,880 that to a nice 916 00:33:50,960 --> 00:33:55,679 python interface that you could use in a 917 00:33:52,880 --> 00:33:55,679 jupyter notebook 918 00:33:56,080 --> 00:33:59,360 in some ways it felt 919 00:33:58,000 --> 00:34:01,279 you know you could say well there's so 920 00:33:59,360 --> 00:34:02,880 many good data visualization tools it's 921 00:34:01,279 --> 00:34:04,159 crazy to write your own why would the 922 00:34:02,880 --> 00:34:07,440 team do that 923 00:34:04,159 --> 00:34:09,679 i think a big piece of this is the 924 00:34:07,440 --> 00:34:12,159 need to support all of this metadata 925 00:34:09,679 --> 00:34:13,440 like with the right metadata 926 00:34:12,159 --> 00:34:15,280 the charting 927 00:34:13,440 --> 00:34:17,119 of these issues should become really 928 00:34:15,280 --> 00:34:19,679 easy and you should be able to get a 929 00:34:17,119 --> 00:34:20,800 really rich interactive chart in just 930 00:34:19,679 --> 00:34:22,639 like 931 00:34:20,800 --> 00:34:26,639 uh 932 00:34:22,639 --> 00:34:28,320 you know just a few lines of code but um 933 00:34:26,639 --> 00:34:29,839 the existing charting tools they really 934 00:34:28,320 --> 00:34:32,639 require you 935 00:34:29,839 --> 00:34:34,960 to specify everything about a chart so 936 00:34:32,639 --> 00:34:36,480 that's a massive difference in the kind 937 00:34:34,960 --> 00:34:39,280 of open source tools that are out there 938 00:34:36,480 --> 00:34:41,440 today and the tool we're using 939 00:34:39,280 --> 00:34:44,560 
i should mention that grapher is open 940 00:34:41,440 --> 00:34:46,320 source but i would say it's not 941 00:34:44,560 --> 00:34:49,280 split out 942 00:34:46,320 --> 00:34:52,560 into a code base in a nice enough way 943 00:34:49,280 --> 00:34:53,520 for external contributions at the moment 944 00:34:52,560 --> 00:34:55,839 thank you 945 00:34:53,520 --> 00:34:57,839 uh we have many more good questions and 946 00:34:55,839 --> 00:34:59,520 i invite you to go on venueless and 947 00:34:57,839 --> 00:35:02,079 interact directly with the audience if 948 00:34:59,520 --> 00:35:04,320 you have a moment and i also want to 949 00:35:02,079 --> 00:35:07,359 thank you thank you lars yencken for 950 00:35:04,320 --> 00:35:10,960 being the closing speaker at the pycon 951 00:35:07,359 --> 00:35:14,320 2021 data science and analytics track 952 00:35:10,960 --> 00:35:15,599 thanks so much everyone thanks fabio 953 00:35:14,320 --> 00:35:18,240 bye 954 00:35:15,599 --> 00:35:20,000 and for everyone who's still following 955 00:35:18,240 --> 00:35:22,240 in the we thank you also for your 956 00:35:20,000 --> 00:35:25,040 attention and we would like to give you 957 00:35:22,240 --> 00:35:30,440 a small goodbye in 10 minutes from now 958 00:35:25,040 --> 00:35:30,440 at 5:45 melbourne time see you then 959 00:35:32,880 --> 00:35:34,960 you
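
[editor's sketch, not part of the spoken talk] The talk's point that a csv loses its types on a round trip, and that a json sidecar file can carry the schema and metadata, can be illustrated roughly as below. This is a minimal sketch using only the python standard library. The rows, column names, the `unit` key and the `apply_schema` helper are assumptions made up for illustration, and the sidecar is only loosely modelled on a datapackage.json descriptor, not a complete or validated one.

```python
import csv
import io
import json

# Hypothetical rows; the column names and values are invented for illustration.
rows = [
    {"country": "Australia", "year": 2017, "cost_usd_per_day": 0.75},
    {"country": "Sweden", "year": 2017, "cost_usd_per_day": 1.25},
]

# Write the rows out as csv text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Reading the csv back yields strings everywhere: the types are gone.
raw = list(csv.DictReader(io.StringIO(csv_text)))

# A simplified sidecar, loosely modelled on a datapackage.json descriptor.
# "unit" is a custom key assumed here, not a guaranteed part of the format.
sidecar = json.dumps({
    "name": "diet-affordability-example",
    "resources": [{
        "path": "data.csv",
        "schema": {"fields": [
            {"name": "country", "type": "string"},
            {"name": "year", "type": "integer"},
            {"name": "cost_usd_per_day", "type": "number",
             "unit": "2017 US$ per day"},
        ]},
    }],
})

# Map the declared field types onto python constructors.
CASTS = {"string": str, "integer": int, "number": float}


def apply_schema(records, descriptor_json):
    """Cast each csv string back to the type declared in the sidecar."""
    fields = json.loads(descriptor_json)["resources"][0]["schema"]["fields"]
    casts = {field["name"]: CASTS[field["type"]] for field in fields}
    return [{key: casts[key](value) for key, value in rec.items()}
            for rec in records]


restored = apply_schema(raw, sidecar)
```

Here `raw` holds every value as a string, while `restored` has integers and floats again because the sidecar says what each column is. A real workflow would more likely use the frictionless package mentioned in the talk than a hand-rolled helper like this one.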