[Music]

Hello everyone. We're going to get started with our next session. So today we've got Tennessee Leeuwenburg, and they'll be presenting on PyEarthTools: machine learning for Earth system sciences.

Thanks, Joe.

I know it's already been done earlier this morning, but just a quick Acknowledgement of Country as well, to add my support to those statements. So, a little bit about me. My name is Tennessee Leeuwenburg. I've worked at the Australian Bureau of Meteorology for 22 years, mostly in the research program as a data scientist and software engineer. I'm the maintainer of two open source packages: PyEarthTools, the subject of today's conversation, and also scores. My main focus is the development of machine learning models for nowcasting, which is the time frame from right now out to about three or four hours.

So I think for this audience, we'll add a bit of a description of what Earth system science actually is, then we'll see some examples, and then ease into what the application of machine learning to that context looks like. Earth system science models essentially refer to modeling the processes and state of our natural environment: our planet, our atmosphere. Traditionally they've been process-based or rule-based systems developed by academics and researchers on the basis of physical rules. Think force equals mass times acceleration, or the effect of gravity on the atmosphere, things like that.

They're amazing, but they're also very computationally expensive, which is why we have supercomputers, which we don't all have in our back pockets. So prior to 2022, the general understanding was that the complexity of that task would have greatly exceeded any feasible machine learning approach for modeling the entire system anyway.
However, since then, machine learning models have come out and proved themselves very competent at the entire task. There's still a scope to what they can do, but there's really been a change in the state of the field. Some of these machine learning models are entirely machine learning based, and there's no attempt to explicitly simulate the physical and chemical processes that are occurring; it's the neural net doing what the neural net does. Others are hybrid, with both a neural net aspect and a rules-based aspect. And they're not all neural networks, but a lot of the big ones that are very interesting are neural networks.

So let's work it through with this example. Here's the example of a weather forecast. I work at the weather bureau; I kind of can't help making all of my examples weather. Everyone understands the weather, but the system can do a lot more than the weather, right? And there are a lot of pieces of information in here. So where do these pieces of information come from? There's a certain amount of human value-add and quality assurance in the way the weather bureau makes them, but behind that are these simulation models that drive the prediction of these variables. So this is an animation of a weather variable evolving over time in a rules-based, physics-based global model. That's an Earth system model, that's Earth system science, and what we're talking about is a neural network that can do that.

So here is the code for a model that I trained that does that, to a greater or lesser degree, and it produced that. So, basic proof by example: that's the kind of thing we're talking about, and that's what PyEarthTools is for.
It is for the construction of neural network and other machine learning based models which are capable of performing useful Earth system science simulations, roughly to the standard of, or in some cases better than, the physical models. Again, scope and caveats apply, and this isn't the deep dive on the science of them.

So back in 2023, it was possible to actually go and find all of the machine learning models. A colleague and I went through that process within a particular scope. There are a lot of caveats around what is and isn't a weather and climate model and so forth, but we settled the scope, and you can see in the timeline when it is that machine learning really starts taking off in the field. Since 2023, there's just been an absolute explosion in the amount of research findings in this particular field, and it would probably no longer be reasonable to try to produce these diagrams now. But a few things from this still very much apply. Firstly, fully connected neural networks, as opposed to more advanced architectures, are still very much relevant in the field, particularly for very specific process problems, and we are talking about things that don't just include neural networks. On the other hand, since then, a great many of the most significant results have come from the application of transformers and graph neural networks, especially to predictive models, and also diffusion models, which are perhaps a little more focused on the problem of downscaling, or super-resolution as it might be called from a more computer vision perspective.

So this slide wasn't produced by me.
This was produced by Andy Brown, the director of research at the European Centre for Medium-Range Weather Forecasts (ECMWF). And this is where we have specific examples of these models performing to a higher standard of accuracy than the physical models. Again, specific metrics, specific variables, various scopes and caveats apply, but they very much held their own in terms of the headline statistics on some of these variables. So it's very much not hype, and it's very much transforming the field.

So, now that we've got that scene setting out of the way, let's talk about PyEarthTools itself: how it came to be, where it sits in the field, and what you might like to know about it. Back in around 2022, we could see that change was on the way, and we internally started work on this tool for being able to deploy and run various state-of-the-art published architectures, run them against our data sets, and so on. In March 2023, a framework called ClimateLearn was released. I believe that's no longer being maintained, but that's when that was released. Around the middle of 2024, a number of the bureau's research partners, so the Met Office in the UK, NIWA, which is now Earth Sciences New Zealand, and ACCESS-NRI, which is a research group here in Australia, joined as PyEarthTools collaborators, and we set the intention of creating an open-source framework so that we could share our work and collaborate more effectively across that partnership, and with anyone else who might like to join us. In August 2024, a framework called Anemoi was released by the ECMWF. This is a very serious framework, and there are very compelling results coming out of the models that come from that framework.
In December of that year, hot on the heels of one another, PyEarthTools was released as open source, shortly after the MILES CREDIT framework was released, MILES CREDIT being from NCAR. So that's how we got where we got, and I'd just like to pause and acknowledge the PyEarthTools contributors that have gotten us to this point. In particular, Harrison Cook, who very much led the technical development of the framework for the first couple of years and has now gone to work at the European Centre on the Anemoi framework, helping them in their journey; but also fully recognizing and valuing the contribution of everyone else as well. And of course it's open source, so your name could be here next year if you like.

So what is PyEarthTools? It has source code and documentation. It is a modular, open-source Python framework with modules for data (one line on the slide, but no small task) and for pipelines. In this case, think less orchestration pipelines and scheduling, and think more declarative pipelines of the transformations that need to occur to the data. A lot of models in a singular code base closely couple the pipeline and the model itself; PyEarthTools takes the perspective that these things should be better modularized, to allow more fine-grained research to occur and more reuse of standardized, scientifically validated pipelines that data scientists can more easily pick up and work with, so that each person can focus on the core skills they're bringing to the equation. Examples of what's in the pipeline: reprojecting, resampling, regridding, re-aggregating, normalizing, denormalizing, standardizing, adding and converting units, and a great many other things.
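To make the declarative-pipeline idea concrete, here is a minimal sketch in plain Python and xarray. The step names and the list-of-callables structure are illustrative assumptions, not PyEarthTools' actual API.

```python
# A minimal sketch of a declarative data pipeline, in plain xarray/Python.
# The step names and structure are illustrative, not the real PyEarthTools API.
import xarray as xr

def reorder_dims(ds: xr.Dataset) -> xr.Dataset:
    # Put dimensions into a standard (time, latitude, longitude) order.
    return ds.transpose("time", "latitude", "longitude")

def coarsen_grid(ds: xr.Dataset, factor: int = 2) -> xr.Dataset:
    # Stand-in for real regridding: block-average to a coarser grid.
    return ds.coarsen(latitude=factor, longitude=factor, boundary="trim").mean()

def normalize(ds: xr.Dataset) -> xr.Dataset:
    # Z-score so all variables are on comparable scales for the network.
    return (ds - ds.mean()) / ds.std()

# The "pipeline" is just an ordered list of transformations, so it can be
# inspected, versioned, and shared independently of any model.
PIPELINE = [reorder_dims, coarsen_grid, normalize]

def run_pipeline(ds: xr.Dataset, steps=PIPELINE) -> xr.Dataset:
    for step in steps:
        ds = step(ds)
    return ds
```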
And all of that ultimately turns itself into a fairly basic data structure: a Python iterable. That then proceeds to the models. PyEarthTools also has a good number of packages and utilities in it for working with predefined models. So if there's a published architecture that we have a supported implementation for, you can pipe new or different data into a predefined model architecture, for people who want to focus more on the pipelines. And for people who are more focused on the modeling side of the equation, they can define their own models; if something's successful, or they want to publish it, then they can put it back into the main package for wider reuse. And there's a certain amount of code on our road map, coming soon, for evaluation and standard scorecards for model intercomparison, to support the research journey.

So to recap, PyEarthTools is for doing this kind of thing.

So let's pause and talk a little bit about the data. One of the things that maybe differentiates the Earth system science world a little is that we don't need to deal with as much, say, conflicting user-generated data as if you were training an LLM, but we do have to deal with a similarly large volume of data. We're talking in the hundreds of terabytes to petabytes range for the largest problems, and even for the smaller problems it's still a decent whack of data. And it has to be brought together from a great many different formats, which are often specific to Earth system science, and I've always really wanted to hide all of that behind an API. That's what you sort of have to do to get data into a machine learning model anyway.
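As a rough illustration of that hand-off, here's what a plain Python iterable of arrays feeding an ordinary PyTorch training step can look like. The shapes, names, and random stand-in data are hypothetical; this is not the framework's real interface.

```python
# Sketch: pipeline output as a Python iterable of (input, target) numpy
# arrays, consumed by a standard PyTorch training loop. Hypothetical shapes.
import numpy as np
import torch
import torch.nn as nn

def batches(n_batches=8, batch=4, channels=3, lat=32, lon=64):
    # Stand-in for the real pipeline: yield random pairs shaped
    # (batch, channels, lat, lon).
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        x = rng.standard_normal((batch, channels, lat, lon), dtype=np.float32)
        y = rng.standard_normal((batch, channels, lat, lon), dtype=np.float32)
        yield x, y

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # trivially small "model"
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for x, y in batches():
    optim.zero_grad()
    pred = model(torch.from_numpy(x))
    loss = loss_fn(pred, torch.from_numpy(y))
    loss.backward()
    optim.step()
```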
So this is a mechanism to standardize, in a coded way, how to do that in a way that's a bit more transferable. And these data sources are not always fully documented for an ordinary user, either. Often they require a certain depth of science understanding to actually properly unpack and process the data itself.

So here are some examples of how PyEarthTools might connect data through to a machine learning model, and some of the processing steps it goes through; we'll talk about the processing in a little more detail in a moment. Here we've got three separate pipeline graphs. The first one is essentially just a stick, which connects the WeatherBench 2 based ERA5 data set through the various transformations that are required. The steps here being: you need to do the dimension ordering and the variable sorting. So they'll often be latitude, longitude, height, time, and the various data sets you want to work with often have those in a different order, so there's a certain amount of just reshuffling of data that needs to occur. A certain amount of data is 0 to 360 in longitude; some is minus 180 to 180. And it's going to vary by not just the source, but how the person who archived your copy of that source decided to do it at the point in time they did it, which has persisted until now. So there's a lot of pre-thinking in the PyEarthTools framework for the major sources of data and sites that we optimize for. That takes that burden off the user to some degree, and they can use a pre-named pipeline to get a bit of a head start on working with the data.
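Reconciling those two longitude conventions, for instance, is a small operation once you know it has to happen. A generic xarray sketch of the idea (not a PyEarthTools built-in):

```python
# Sketch: convert a dataset's longitudes from [0, 360) to [-180, 180)
# and re-sort, so differently archived sources line up on the same grid.
import xarray as xr

def to_pm180(ds: xr.Dataset, lon: str = "longitude") -> xr.Dataset:
    wrapped = ((ds[lon] + 180) % 360) - 180  # 0..360 -> -180..180
    return ds.assign_coords({lon: wrapped}).sortby(lon)
```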
Then there's temporal retrieval, to turn it into the appropriate sequence-to-sequence pattern that we're talking about, because there's a history to the training, the prior examples, but then there's also the number of sequences that you want to generate in your forecast period, or lead time. So that reshapes the data for presentation to the network. Convert it to numpy, play with the dimensionality a bit more, work out the z-scaling and standard deviation, cache it (because that was a lot of work), and then be ready to start running epochs over the data. And on the top right we have a similar example, but in this case it's being utilized to produce the El Niño index. So the right-hand side goes down to a single value, which is that index value, but on the left we're preserving latitude and longitude dimensionality in the input data. A lot of this pipeline stuff is around how to handle appropriate data transformations, and it's the source of a lot of mistakes. A recent example: someone was working with hail data, sorry, lightning data. Lightning strikes are relatively sparse. They were preserving latitude and longitude when calculating their normalization factors, and there were a lot of missing data and NaN values, and because they hadn't plotted it, they weren't aware of it. These sorts of things allow the scientists involved to get in there and do things like inspect normalization weights and understand data representations. Quick plot, quickly identify the issue, let's just collapse it to a single standard deviation rather than something that's spatially varying. Happy days.
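In code, the difference between the spatially varying statistics that caused the trouble and the single collapsed value looks something like the following. This is a generic numpy/xarray sketch of the pitfall and the fix, not the exact code involved.

```python
# Sketch of the lightning-normalization pitfall: per-grid-cell statistics
# keep NaN holes wherever a cell never observed a strike, while one
# NaN-aware scalar is safe. Synthetic data, for illustration only.
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
strikes = xr.DataArray(
    rng.exponential(1.0, size=(100, 8, 8)),
    dims=("time", "latitude", "longitude"),
)
strikes = strikes.where(strikes > 4.0)  # sparse field: most cells become NaN

# Spatially varying statistics: one std per (lat, lon) cell. Cells that
# never saw a strike yield NaN, silently poisoning the normalization.
std_map = strikes.std(dim="time")
print("NaN cells in spatial std:", int(std_map.isnull().sum()))

# The fix from the talk: collapse to a single scalar std. xarray
# reductions skip NaNs by default, so this is always finite here.
std_scalar = strikes.std()
normalized = strikes / std_scalar
```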
So what are the strengths of PyEarthTools? We do take the hardware considerations into account. Running one of the global-scale physical models is not something you can do on an ordinary laptop or computer. Running a global machine learning, neural network based weather model is something you can do on your own laptop or computer, for many laptops and computers. Training them is something you might need a very high powered GPU for; more often it's the data volumes that are prohibitive. But training a low-resolution one is not prohibitive and can be done, and we've got that in our tutorials, things that you can do for yourself. So one of the advantages, I think, is that from a learning perspective this puts the ability to work with real-world science onto a more consumer class of hardware, a more commonly available category that people are going to be able to do more experimentation with. So it's been run on laptops, workstations and supercomputers. It supports standardization and repeatability. It has lots of documentation, and it's a forever job to make that as good as it can be, but we're very much on that journey. And it's independent with respect to the machine learning architectures, so it can work with any machine learning architecture or framework. The three that we've integrated it with thus far in the tutorial examples would be XGBoost, PyTorch, and Apple's MLX framework, but it will function very happily together with JAX or TensorFlow or any other one that you might care to add. So that's also a feature, and that's why we separate the pipelines from the models: because the feed of the pipeline is something that's typically going to be numpy based, a simple array, very numerical.
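Because the feed is just arrays, the same data can be pointed at very different libraries. A small sketch of that idea, with synthetic stand-in arrays (illustration only):

```python
# Sketch: one array-based feed, two unrelated ML stacks. The arrays are
# synthetic stand-ins for pipeline output.
import numpy as np
import torch
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10)).astype(np.float32)  # features
y = rng.standard_normal(256).astype(np.float32)        # target

# Gradient-boosted trees consume the arrays directly...
booster = xgb.XGBRegressor(n_estimators=20).fit(X, y)

# ...and the very same arrays feed a PyTorch network.
net = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1))
pred = net(torch.from_numpy(X))
```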
Having that as something that can be done in a standardized, xarray-based open-source stack, separate from the model itself, allows you to be more modular and objective with regards to your choice of actual neural network stack. So when I say network here, I mean a neural network, not a cable network.

So what are the motivations for continuing to do PyEarthTools, and why is it of interest to others? Much of machine learning research in Earth system science uses a custom stack. There are sort of a few categories. There's Anemoi, which has a particular family of models in it, and it's probably fair to say it's best of breed for that particular family and class of problems. But there are equally as many amazing models that are built in pure PyTorch or pure TensorFlow or pure JAX that don't particularly connect to one another. So with PyEarthTools, we aim to be able to bundle more model architectures, give people a leg up, give people predefined architectures, give them a starting point, unburden some of the interface into the supercomputing facility and the complexity of the data sources, and move a bit closer towards modular, reproducible science. So that's sort of the purpose. It's accessible: it's on PyPI, it's on GitHub, you can just get it. There are no overheads or barriers, and it will work on your computer.

Flipping the equation, it allows you to document and share your workflow back with the community effectively. So if people choose to adopt the framework, they're going to be able to collaborate with their partners on model versions and model upgrades and differentiation and pipeline versions more effectively than if they were gathered around specific models.
So that's the basic principle, and it gives people a way to give back as well, which is also a motivation for many. So let's look at some things that you can do for yourself. We have four specific tutorials which are focused on requiring no more than a 4 GB GPU and an amount of data that you can reasonably just download over the network in real time. The data volumes here top out at around 10 gigabytes; maybe you don't even need more than three. And many workstations and laptops would be able to do not just the inference but the training. So the emphasis here is very much on putting the ability to train back into the hands of as many people as possible, with as few barriers to entry as possible, pardon me. And we have a long and interesting tutorial gallery full of lots of worked examples.

It's very easy to get involved, and community engagement and community contributions are warmly welcomed. I'll be here for the sprints, so please do consider coming along to the PyCon AU sprints, and there'll be some other developers online who can help orient you and get you started as well. Run through the tutorials, and there are a few other ideas. And if you just want to keep in touch with PyEarthTools and see what's going on, the ACCESS-NRI Machine Learning for Climate and Weather working group is the place where most of the conversation about the ongoing development of PyEarthTools is occurring at the moment. In terms of our road map, probably the next major objective is adding standardized scorecards, so that we can do far more straightforward model intercomparison between the different architectures, and so that, again, students or people starting new projects or early-career people have ready access to see how what they've trained stacks up against baseline stats from other models.
Then: add a further range of benchmark model implementations for more classes of problems, so that the framework is open and of interest to more different researchers from different backgrounds on different problems. Bundle more kinds of basic model archetypes. To me, the difference between an archetype and an architecture is that an architecture is like this particular network with this particular depth and this particular size and this many layers, so a lot of the choices are made, whereas an archetype is like a graph neural network, a CNN, etc. So that we've got more worked examples for some of the intervening steps between starting with a blank sheet of paper and a functioning model. And, as always, a bit of refactoring and simplification.

So I'm not quite sure where that leaves us for time, but that has run me out of slides. Can I get a quick time check? Ten minutes. Well, I think that's the goal, isn't it?

Great. So we can stop there for questions. And if people don't have any questions, I can fire up the tutorials if I can get the network working.

[Applause] [Music]

Thank you, Tennessee. That was a really awesome talk. It sounds like a really neat tool. So, do we have any questions?

Hello. First off, thank you, great talk. So I'm a platform engineering person, and I'm wondering if you could say something about how this is being made available, and how it's hosted for analysts and researchers to consume within the bureau.

Sure. So, I mean, you'll have to find someone to install it in any particular organizational context.
The main use of this by the bureau and the community is in research facilities that are for that particular purpose, because most of the focus is on training and reproducibility at this stage. There is, I think very wisely, a lot of scrutiny being put on these before they would go behind any kind of production service, and frankly we're not there yet. So the emphasis is more on the optimization at NCI. There is a pre-built, what's called a "modules" environment there, which would be of use; that's fairly generally open. I don't know exactly how open, but if you're a researcher in the field, I don't see why you wouldn't be able to access it there. But it's also pretty straightforward to pip install. There are a lot of packages, but other than there being a lot of packages, I run it on my machine, and most of the developers work with both their facility and an individual instance on a laptop as well. It doesn't require much. It's Python processes, beginning to end, a bit of Dask, a bit of PyTorch. You don't need an MLflow server or this server or a database server or anything like that. It can kind of go where you want to take it.

Hey, I'm just interested in a bit of a deep dive into the actual models and the machine learning side of that. Right at the start you showed a very dense slide of code that was the machine learning version of a particular model, and I'm wondering what the data scientists and the data modelers are learning from the machine learning that is producing a synthesis of those ideas.

Yeah, that's a great question.
I don't know whether this dense slide, oh, it stopped sharing, whether the dense slide full of code is going to help an enormous amount. Let me reframe what people are learning in terms of the questions they're asking. So some of the questions they were asking are: do the machine learning models fail differently, or do they fail the same? Do we have things like model collapse occurring? Are they effective outside of the training envelope? How do they handle rare and extreme events? How do they work in terms of the probabilities of those events occurring? And those are questions which are kind of natural to ask from a scientific perspective. So a lot of what they're doing is, like, you might look at, I'm going to get the words on the tip of my tongue, but essentially the spatial sensitivity map of the model to the data. So they're looking to find physical regions that they can understand as a scientist, which are influencing and resulting in those model predictions. And then there's the probability curve. Most physical modeling now is what they call ensemble modeling, so there's an increasing emphasis, not so much on saying there will be exactly 20 degrees of temperature, and more on the odds of exceedance of a variety of thresholds being such and such. And they do that at the moment through a kind of initialization process, with carefully managed noise, of a fleet of ensemble members, and they're reconsidering how to do that in the light of things like a diffusion process, which is more like a Gaussian noise model and less like a physical noise model.
And so there's a lot of interplay as the physicists and modelers with that sort of deep physical understanding are framing their understanding of the machine learning models to try to produce those similar outcomes. At some level, what they're finding is that the machine learning models have quite similar strengths and weaknesses to the physical models. They're trained against something called reanalysis. A reanalysis is a mixture of observations, but away from the observations it's model data; it's a modeled estimate to fill in the sparse data. So to some degree the models are trained on emulating a physical model. There's a really interesting line of research on direct-from-observations training, and I think a lot of the focus of the cutting edge is really moving into direct from observations. That definitely gets me out of bed in the morning; that's super exciting. So I'm not entirely sure if I answered that; I just tried to give you a flavor.

Thank you, Tennessee. Great presentation. Just on the background of this package: all the weather variables are built into it, so we don't need to know where they come from. And my question goes to: can we use things like that airport example you showed, the three to four hour, say, wind, so we could do a wind forecast, and can we use that then with other products to, say, do fire modeling in the short term? We've got a bushfire; we then model what that wind's doing with some topographic effects. Can we predict that fire? Is that the intent for a package like this?

Yeah, that's certainly the intent. So let's talk about this.
Can we do it, like, in production, today, right now, within easy grasp? Not yet. In principle, using machine learning with a data-driven approach, in research? Yes, definitely. So the intent there is to provide that, and the focus again is still on training. In a production setting, you've got a whole IT architecture to consider. In a research setting, it's: can we make the models do that job effectively? And I'm very interested in things like whether we can take a good foundation weather model and then add a field from another discipline or area, such as fire spread or some other well-stored data set, and then learn those relationships without having to do the full retrain. Up until now, as you can see, it's only been a small amount of time since those models went from kind of close to par with the atmospheric model to genuinely competitive atmospheric modeling. I think we're on the cusp of all of those sorts of changes, and, yeah, if you're interested, tap me and I'll send you a bunch of links.

Okay, really quick question. This might be tangential to PyEarthTools, but have machine learning models been used to characterize the statistical distribution and spread of weather ensemble models, in terms of trying to optimize the number of ensemble members and things like that?

Not in terms of optimizing physical ensemble members. In a lot of ways, the machine learning has happened so quickly that the physical models have only barely had time to react to this change in the field. So they'll try to work out how many machine learning ensemble members to run, but they're quite cheap to run, so you can more or less run just as many as you want.
So there's not the same kind of constraint, and people haven't really sought to back-connect to the physical modeling world. There are lots of reasons to maybe do that, like guard rails, explainability, fallbacks, etc. But, yeah, interesting times.

Oh, can everyone thank Tennessee again? And we've got a mug for you.

Thank you very much.