1 00:00:06,320 --> 00:00:11,499 [Music] 2 00:00:16,080 --> 00:00:20,480 hey everyone my name is dan draper and 3 00:00:18,000 --> 00:00:22,080 i'm the ceo and founder at cipherstash and 4 00:00:20,480 --> 00:00:26,080 today i'm going to be talking about 5 00:00:22,080 --> 00:00:27,840 practical attacks on encrypted databases 6 00:00:26,080 --> 00:00:30,160 so the first question you might ask 7 00:00:27,840 --> 00:00:31,679 yourself is why do we actually bother to 8 00:00:30,160 --> 00:00:34,880 encrypt the database what kinds of 9 00:00:31,679 --> 00:00:37,360 things are we protecting ourselves from 10 00:00:34,880 --> 00:00:38,960 so there are some threats to consider 11 00:00:37,360 --> 00:00:40,320 as to why we would want to encrypt the 12 00:00:38,960 --> 00:00:41,760 database 13 00:00:40,320 --> 00:00:43,360 number one is 14 00:00:41,760 --> 00:00:45,440 let's say we're afraid of an attacker 15 00:00:43,360 --> 00:00:46,960 who learns the database connection 16 00:00:45,440 --> 00:00:48,239 string there's only one database 17 00:00:46,960 --> 00:00:51,039 connection string for a database 18 00:00:48,239 --> 00:00:52,719 typically and if that's breached then a 19 00:00:51,039 --> 00:00:54,640 user could potentially get access to the 20 00:00:52,719 --> 00:00:56,320 database we might be concerned about an 21 00:00:54,640 --> 00:00:57,840 attacker who gains access to the host 22 00:00:56,320 --> 00:01:00,239 server the actual machine where the 23 00:00:57,840 --> 00:01:02,079 database is running 24 00:01:00,239 --> 00:01:03,920 we might be afraid of an attacker who 25 00:01:02,079 --> 00:01:05,680 retrieves a database dump or a backup 26 00:01:03,920 --> 00:01:07,360 that's actually relatively easy to 27 00:01:05,680 --> 00:01:08,960 achieve in many cases and quite 28 00:01:07,360 --> 00:01:11,119 common 29 00:01:08,960 --> 00:01:13,439 and one final one which we often don't 30 00:01:11,119 --> 00:01:16,159 tend to think about is a trusted user an 31 00:01:13,439
--> 00:01:18,799 insider in this case a trusted database 32 00:01:16,159 --> 00:01:20,479 administrator who maybe we want to be 33 00:01:18,799 --> 00:01:22,720 able to access the database but not 34 00:01:20,479 --> 00:01:26,680 necessarily be able to read all of the 35 00:01:22,720 --> 00:01:26,680 data stored in the tables 36 00:01:26,880 --> 00:01:30,000 before we go on i wanted to make a quick 37 00:01:28,799 --> 00:01:32,880 mention of 38 00:01:30,000 --> 00:01:35,600 assistant professor paul grubbs who 39 00:01:32,880 --> 00:01:37,920 was the co-author of a paper from back 40 00:01:35,600 --> 00:01:40,640 in 2017 called why your encrypted 41 00:01:37,920 --> 00:01:43,680 database is not secure a lot of the work 42 00:01:40,640 --> 00:01:45,280 in this talk is based on that paper 43 00:01:43,680 --> 00:01:47,840 and trying to take a more practical 44 00:01:45,280 --> 00:01:49,200 approach so that it's accessible to a 45 00:01:47,840 --> 00:01:51,840 wider audience 46 00:01:49,200 --> 00:01:54,000 but certainly a big thanks goes out to 47 00:01:51,840 --> 00:01:56,240 paul and his co-authors for writing that 48 00:01:54,000 --> 00:02:00,799 paper 49 00:01:56,240 --> 00:02:02,399 so given this what leads to exploits 50 00:02:00,799 --> 00:02:04,960 certainly one of the major things is 51 00:02:02,399 --> 00:02:07,680 this idea of a query cache as 52 00:02:04,960 --> 00:02:08,959 things like mysql and postgresql use 53 00:02:07,680 --> 00:02:10,560 query caches quite heavily for 54 00:02:08,959 --> 00:02:13,360 performance but that can leak 55 00:02:10,560 --> 00:02:16,160 information leading to an attack 56 00:02:13,360 --> 00:02:18,319 logs query logs system logs 57 00:02:16,160 --> 00:02:20,480 write-ahead logs many databases use this 58 00:02:18,319 --> 00:02:22,400 idea of a write-ahead log to ensure 59 00:02:20,480 --> 00:02:24,879 consistency and reliability 60 00:02:22,400 --> 00:02:27,360 but that comes with the potential to 61 00:02:24,879 -->
00:02:29,200 reveal information to an attacker 62 00:02:27,360 --> 00:02:31,440 an obvious one but worth stating 63 00:02:29,200 --> 00:02:33,519 nonetheless is poor key management poor 64 00:02:31,440 --> 00:02:35,120 key management is a problem for any 65 00:02:33,519 --> 00:02:38,319 cryptographic system 66 00:02:35,120 --> 00:02:40,160 and no less in this case 67 00:02:38,319 --> 00:02:42,000 shortcomings in the cryptography itself 68 00:02:40,160 --> 00:02:44,080 and we'll talk more about that 69 00:02:42,000 --> 00:02:44,959 and some examples of shortcomings like 70 00:02:44,080 --> 00:02:46,400 those 71 00:02:44,959 --> 00:02:48,480 in this talk 72 00:02:46,400 --> 00:02:50,560 certainly that's a major part of why 73 00:02:48,480 --> 00:02:52,800 data is lost 74 00:02:50,560 --> 00:02:56,000 and finally poor assumptions about data 75 00:02:52,800 --> 00:02:58,159 sensitivity what kinds of data 76 00:02:56,000 --> 00:03:00,480 should we actually consider to be secret 77 00:02:58,159 --> 00:03:03,360 what kinds of data might we be thinking 78 00:03:00,480 --> 00:03:06,400 of as quite innocuous but could lead to 79 00:03:03,360 --> 00:03:08,159 leverage for an attacker 80 00:03:06,400 --> 00:03:10,800 and i would also like to mention that 81 00:03:08,159 --> 00:03:13,120 the techniques and attacks in this talk 82 00:03:10,800 --> 00:03:16,080 please use for good 83 00:03:13,120 --> 00:03:18,640 don't just go start attacking random 84 00:03:16,080 --> 00:03:20,879 servers or databases on the internet 85 00:03:18,640 --> 00:03:23,280 it's probably illegal and not really a 86 00:03:20,879 --> 00:03:25,120 nice thing to do so use this just for 87 00:03:23,280 --> 00:03:27,519 your own educational purposes or in test 88 00:03:25,120 --> 00:03:29,280 environments only please 89 00:03:27,519 --> 00:03:30,879 all right let's get started with the 90 00:03:29,280 --> 00:03:32,080 first kind of encryption that i want to 91 00:03:30,879 --> 00:03:34,720 talk about today which is
called 92 00:03:32,080 --> 00:03:37,280 transparent data encryption now this is 93 00:03:34,720 --> 00:03:39,760 the formal name for what we often refer 94 00:03:37,280 --> 00:03:41,760 to as encryption at rest so what is 95 00:03:39,760 --> 00:03:43,040 encryption at rest or transparent data 96 00:03:41,760 --> 00:03:44,799 encryption 97 00:03:43,040 --> 00:03:47,519 transparent data encryption is 98 00:03:44,799 --> 00:03:50,159 encrypting the file system for the 99 00:03:47,519 --> 00:03:52,319 machine where the database is running 100 00:03:50,159 --> 00:03:54,319 and so because when the database boots 101 00:03:52,319 --> 00:03:57,040 up it needs to get access to the data it 102 00:03:54,319 --> 00:03:58,480 also means that the keys to perform the 103 00:03:57,040 --> 00:03:59,840 decryption need to be on the same 104 00:03:58,480 --> 00:04:02,400 machine 105 00:03:59,840 --> 00:04:05,360 and actually all the data that the database 106 00:04:02,400 --> 00:04:07,519 needs to manage gets loaded into memory 107 00:04:05,360 --> 00:04:08,879 and is available in the clear so 108 00:04:07,519 --> 00:04:11,680 transparent data encryption actually 109 00:04:08,879 --> 00:04:13,360 doesn't protect us from a whole lot 110 00:04:11,680 --> 00:04:15,439 certainly if an 111 00:04:13,360 --> 00:04:16,400 attacker gets access to 112 00:04:15,439 --> 00:04:19,519 um 113 00:04:16,400 --> 00:04:21,280 the machine itself the host or even 114 00:04:19,519 --> 00:04:23,280 the application 115 00:04:21,280 --> 00:04:24,560 that's accessing that database 116 00:04:23,280 --> 00:04:26,400 transparent data encryption doesn't 117 00:04:24,560 --> 00:04:28,240 protect us 118 00:04:26,400 --> 00:04:30,639 and what i wanted to do is just show you 119 00:04:28,240 --> 00:04:35,280 a very simple attack that demonstrates 120 00:04:30,639 --> 00:04:36,880 really how useless tde is in practice 121 00:04:35,280 --> 00:04:39,440 and that is through this idea of sql 122
00:04:36,880 --> 00:04:41,280 injection so really nice simple easy 123 00:04:39,440 --> 00:04:42,880 attack to get us started you've probably 124 00:04:41,280 --> 00:04:44,720 heard of sql injection if you haven't 125 00:04:42,880 --> 00:04:45,840 i'm about to show you what it is 126 00:04:44,720 --> 00:04:47,840 we'll get into some more interesting 127 00:04:45,840 --> 00:04:50,240 attacks and advanced attacks later in 128 00:04:47,840 --> 00:04:50,240 the talk 129 00:04:50,960 --> 00:04:54,720 with a sql injection attack i 130 00:04:52,400 --> 00:04:55,600 actually don't need access to 131 00:04:54,720 --> 00:04:57,520 the 132 00:04:55,600 --> 00:04:59,199 client or application 133 00:04:57,520 --> 00:05:02,080 and i don't need access to the database 134 00:04:59,199 --> 00:05:04,479 host all i need is the ability to 135 00:05:02,080 --> 00:05:06,320 enter specially crafted queries and 136 00:05:04,479 --> 00:05:07,759 for the developer of the application 137 00:05:06,320 --> 00:05:09,840 that i'm querying 138 00:05:07,759 --> 00:05:12,160 to have not sanitized their inputs 139 00:05:09,840 --> 00:05:13,759 properly and this is sadly 140 00:05:12,160 --> 00:05:15,680 still a very common problem in our 141 00:05:13,759 --> 00:05:18,080 industry 142 00:05:15,680 --> 00:05:20,400 so i've created a postgresql instance in 143 00:05:18,080 --> 00:05:21,680 amazon's rds it's their relational database 144 00:05:20,400 --> 00:05:23,280 service 145 00:05:21,680 --> 00:05:26,080 and you can see from the configuration 146 00:05:23,280 --> 00:05:28,720 here that i've enabled encryption this 147 00:05:26,080 --> 00:05:30,400 configuration is in the storage section so 148 00:05:28,720 --> 00:05:32,479 amazon are trying to make it clear that 149 00:05:30,400 --> 00:05:35,360 this is only applying to the file system 150 00:05:32,479 --> 00:05:37,120 and not to all of the data stored 151 00:05:35,360 --> 00:05:39,280 nonetheless a lot of people
assume that 152 00:05:37,120 --> 00:05:42,080 tde is going to protect them from a whole 153 00:05:39,280 --> 00:05:44,320 bunch of things which it won't 154 00:05:42,080 --> 00:05:46,720 so if we jump into the database console 155 00:05:44,320 --> 00:05:49,600 now using psql 156 00:05:46,720 --> 00:05:53,199 i can see i've got a few tables here 157 00:05:49,600 --> 00:05:55,280 and i can just do selects on the data 158 00:05:53,199 --> 00:05:56,400 like there's no encryption whatsoever 159 00:05:55,280 --> 00:05:59,199 this is a 160 00:05:56,400 --> 00:06:01,120 products database that i've created 161 00:05:59,199 --> 00:06:04,639 just for a simple you know e-commerce 162 00:06:01,120 --> 00:06:06,960 application written in ruby on rails 163 00:06:04,639 --> 00:06:09,600 and i've got a products table as you can 164 00:06:06,960 --> 00:06:11,919 see and a customers table 165 00:06:09,600 --> 00:06:14,639 one feature in this application 166 00:06:11,919 --> 00:06:16,479 is a search capability very common for 167 00:06:14,639 --> 00:06:18,240 an e-commerce application 168 00:06:16,479 --> 00:06:19,919 i've got some products in here that i 169 00:06:18,240 --> 00:06:21,919 want to search i'm a big fan of sony 170 00:06:19,919 --> 00:06:23,919 cameras so i'm going to search for sony 171 00:06:21,919 --> 00:06:26,400 and you can see the results come back 172 00:06:23,919 --> 00:06:30,000 now i'm going to exploit this 173 00:06:26,400 --> 00:06:31,919 search bar to do a sql injection attack 174 00:06:30,000 --> 00:06:34,479 first thing i want to do is find a query 175 00:06:31,919 --> 00:06:36,000 that prevents any 176 00:06:34,479 --> 00:06:37,520 product search results 177 00:06:36,000 --> 00:06:40,240 coming back so i'm going to find a query 178 00:06:37,520 --> 00:06:42,000 that matches nothing and that's xx that 179 00:06:40,240 --> 00:06:42,960 gives me no results so let's start with 180 00:06:42,000 --> 00:06:44,560 that 181 00:06:42,960 -->
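As a rough illustration of the unsanitized search input described above, here is a minimal sketch. It assumes an in-memory SQLite database in place of the talk's PostgreSQL instance on RDS; the rows, the `search_unsafe` and `search_safe` helpers, and the payload string are all made up for the example and are not taken from the demo application.

```python
import sqlite3

# Hypothetical stand-in for the demo shop: in-memory SQLite, not PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products  (name TEXT, description TEXT);
    CREATE TABLE customers (name TEXT, email TEXT);
    INSERT INTO products  VALUES ('sony a7', 'mirrorless camera');
    INSERT INTO customers VALUES ('homer simpson', 'homer@springfield.com');
""")

def search_unsafe(term):
    # Vulnerable: the search term is pasted straight into the SQL text.
    sql = f"SELECT name, description FROM products WHERE name LIKE '%{term}%'"
    return conn.execute(sql).fetchall()

def search_safe(term):
    # Parameterized: the driver passes the term as data, never as SQL.
    sql = "SELECT name, description FROM products WHERE name LIKE ?"
    return conn.execute(sql, (f"%{term}%",)).fetchall()

# "xx" matches no product, the quote closes the LIKE string, the UNION
# pulls rows of the same shape from another table, and "--" comments out
# whatever the application appended after the input.
payload = "xx' UNION SELECT name, email AS description FROM customers --"

print(search_unsafe("sony"))   # ordinary product search
print(search_unsafe(payload))  # returns customer rows instead
print(search_safe(payload))    # payload is treated as a plain search string
```

Encryption of the underlying file system plays no part in either path, which is the point the talk is making: the database decrypts the data before the query ever runs.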
00:06:45,600 then what i'm going to do 182 00:06:44,560 --> 00:06:47,520 is 183 00:06:45,600 --> 00:06:48,960 close the quote 184 00:06:47,520 --> 00:06:50,880 so 185 00:06:48,960 --> 00:06:53,039 that signals to the database that that's 186 00:06:50,880 --> 00:06:56,400 the end of the constraint 187 00:06:53,039 --> 00:06:58,639 i'm going to union those results with a 188 00:06:56,400 --> 00:07:00,000 query that is the same shape as the 189 00:06:58,639 --> 00:07:01,599 products query 190 00:07:00,000 --> 00:07:04,080 but this time it's going to come from 191 00:07:01,599 --> 00:07:06,479 a different table so i'm going to select 192 00:07:04,080 --> 00:07:06,479 name 193 00:07:07,280 --> 00:07:11,199 and email and i'm going to say as 194 00:07:09,759 --> 00:07:12,880 description 195 00:07:11,199 --> 00:07:16,880 so that the union query thinks it's the 196 00:07:12,880 --> 00:07:18,000 same shape of data 197 00:07:16,880 --> 00:07:19,599 from 198 00:07:18,000 --> 00:07:21,120 customers 199 00:07:19,599 --> 00:07:22,960 i'm going to end that query as i 200 00:07:21,120 --> 00:07:25,199 normally would and then i'm going to use 201 00:07:22,960 --> 00:07:27,120 this dash dash 202 00:07:25,199 --> 00:07:29,599 syntax which signals to the database 203 00:07:27,120 --> 00:07:30,720 query parser that everything after that 204 00:07:29,599 --> 00:07:32,479 should be treated as a comment and 205 00:07:30,720 --> 00:07:34,080 essentially ignored and so that means 206 00:07:32,479 --> 00:07:37,199 that if there was 207 00:07:34,080 --> 00:07:39,120 anything else left in the query after 208 00:07:37,199 --> 00:07:41,680 this input string it'll basically get 209 00:07:39,120 --> 00:07:43,199 ignored so my code here will 210 00:07:41,680 --> 00:07:44,479 take precedence so let's see what 211 00:07:43,199 --> 00:07:47,280 happens 212 00:07:44,479 --> 00:07:48,720 okay i run that and i get 213 00:07:47,280 --> 00:07:51,120 all
of the customers instead of the 214 00:07:48,720 --> 00:07:52,319 products and as you can see this is a 215 00:07:51,120 --> 00:07:54,720 very simple 216 00:07:52,319 --> 00:07:56,960 example of a sql injection attack but 217 00:07:54,720 --> 00:07:59,039 the point here is that transparent data 218 00:07:56,960 --> 00:08:01,520 encryption protected us from nothing 219 00:07:59,039 --> 00:08:03,520 okay so what if an attacker does have 220 00:08:01,520 --> 00:08:05,120 access to the host server how can we 221 00:08:03,520 --> 00:08:07,599 protect ourselves maybe they've got 222 00:08:05,120 --> 00:08:09,520 access through a remote code execution 223 00:08:07,599 --> 00:08:12,960 or maybe they're even a trusted insider 224 00:08:09,520 --> 00:08:14,879 or a semi-trusted insider 225 00:08:12,960 --> 00:08:18,080 one technique we can use in the 226 00:08:14,879 --> 00:08:19,199 postgresql ecosystem is called pgcrypto 227 00:08:18,080 --> 00:08:20,720 it's an extension that adds some 228 00:08:19,199 --> 00:08:23,120 additional functions 229 00:08:20,720 --> 00:08:24,720 so how does it work 230 00:08:23,120 --> 00:08:26,960 pgcrypto provides a whole bunch of 231 00:08:24,720 --> 00:08:29,759 functions but one example 232 00:08:26,960 --> 00:08:32,640 is the pgp_sym_encrypt function 233 00:08:29,759 --> 00:08:34,640 pgp_sym_encrypt uses the 234 00:08:32,640 --> 00:08:35,519 pretty good privacy library underneath 235 00:08:34,640 --> 00:08:38,479 the hood and 236 00:08:35,519 --> 00:08:41,279 performs a symmetric encryption using a 237 00:08:38,479 --> 00:08:43,039 pre-shared key and in this case i'm 238 00:08:41,279 --> 00:08:43,760 using it to encrypt 239 00:08:43,039 --> 00:08:45,680 my 240 00:08:43,760 --> 00:08:47,360 name and email address for the data that 241 00:08:45,680 --> 00:08:49,040 i'm storing in the table 242 00:08:47,360 --> 00:08:50,720 so i've created a little demonstration 243 00:08:49,040 --> 00:08:53,720 database here called 244 00:08:50,720 -->
00:08:53,720 pgcryptodemo 245 00:08:54,320 --> 00:08:58,720 just has a users table inside it and 246 00:08:58,880 --> 00:09:02,480 when i query the users table you can see 247 00:09:00,640 --> 00:09:04,320 that everything inside that table has 248 00:09:02,480 --> 00:09:06,560 been encrypted so if an attacker was 249 00:09:04,320 --> 00:09:07,600 able to get access to this database in 250 00:09:06,560 --> 00:09:09,040 theory they wouldn't be able to get 251 00:09:07,600 --> 00:09:10,560 access to any of the underlying 252 00:09:09,040 --> 00:09:12,800 information it's been fully encrypted by 253 00:09:10,560 --> 00:09:14,320 this pgcrypto library 254 00:09:12,800 --> 00:09:15,920 but i'm going to show you how to attack 255 00:09:14,320 --> 00:09:17,680 it 256 00:09:15,920 --> 00:09:21,200 so attack two 257 00:09:17,680 --> 00:09:21,200 direct memory access 258 00:09:21,760 --> 00:09:25,760 as mentioned this attack relies 259 00:09:24,080 --> 00:09:28,160 on the attacker getting access to the 260 00:09:25,760 --> 00:09:30,800 database host that's been compromised 261 00:09:28,160 --> 00:09:33,760 so to demonstrate this attack i've set 262 00:09:30,800 --> 00:09:35,040 up my mac on my right hand side here and 263 00:09:33,760 --> 00:09:36,560 i'm going to connect to the database 264 00:09:35,040 --> 00:09:38,880 running on my linux box which is on my 265 00:09:36,560 --> 00:09:41,600 left hand side and show you that i can 266 00:09:38,880 --> 00:09:44,160 actually extract some of the information 267 00:09:41,600 --> 00:09:45,600 over the wire as it's being inserted 268 00:09:44,160 --> 00:09:48,560 to do that i'm going to use a tool 269 00:09:45,600 --> 00:09:50,320 called avml or acquire volatile memory 270 00:09:48,560 --> 00:09:52,399 for linux ironically this tool has 271 00:09:50,320 --> 00:09:54,480 actually been created by microsoft 272 00:09:52,399 --> 00:09:55,839 it's been very useful in these kinds of 273 00:09:54,480 --> 00:09:57,200 tests 274
00:09:55,839 --> 00:09:59,360 so the first thing i'm going to do is go 275 00:09:57,200 --> 00:10:01,920 over to my mac here 276 00:09:59,360 --> 00:10:03,200 and connect to my linux box on my local 277 00:10:01,920 --> 00:10:05,440 network 278 00:10:03,200 --> 00:10:07,200 and connect to the pgcryptodemo 279 00:10:05,440 --> 00:10:08,160 database 280 00:10:07,200 --> 00:10:09,680 okay 281 00:10:08,160 --> 00:10:10,959 you can see it's got the same users 282 00:10:09,680 --> 00:10:12,480 table in it 283 00:10:10,959 --> 00:10:14,399 so i'm going to run a query that i ran 284 00:10:12,480 --> 00:10:17,040 before and this is going to insert into 285 00:10:14,399 --> 00:10:19,440 the users table name and email 286 00:10:17,040 --> 00:10:21,040 it's going to use that pgp_sym_encrypt 287 00:10:19,440 --> 00:10:23,200 function that we talked about before 288 00:10:21,040 --> 00:10:24,959 it's just going to encrypt the 289 00:10:23,200 --> 00:10:25,760 name here homer simpson with the same 290 00:10:24,959 --> 00:10:28,640 key 291 00:10:25,760 --> 00:10:30,320 and the email homer at springfield.com 292 00:10:28,640 --> 00:10:32,079 once again with the same key so i'm 293 00:10:30,320 --> 00:10:33,519 going to run that 294 00:10:32,079 --> 00:10:36,000 insert 295 00:10:33,519 --> 00:10:37,920 and then when we select from the users 296 00:10:36,000 --> 00:10:39,440 table you can see that now there are two 297 00:10:37,920 --> 00:10:41,120 rows and they're fully encrypted i can't 298 00:10:39,440 --> 00:10:43,519 see any information now what i'm going 299 00:10:41,120 --> 00:10:46,320 to do is take advantage of the fact that 300 00:10:43,519 --> 00:10:48,160 postgres stores some of these queries in 301 00:10:46,320 --> 00:10:51,040 a query cache 302 00:10:48,160 --> 00:10:53,040 so using the avml tool on my linux box 303 00:10:51,040 --> 00:10:54,240 which is hosting the database 304 00:10:53,040 --> 00:10:56,079 i'm going to run 305 00:10:54,240 --> 00:10:57,920 as root
using sudo 306 00:10:56,079 --> 00:10:59,680 avml 307 00:10:57,920 --> 00:11:02,000 and then i'm going to 308 00:10:59,680 --> 00:11:04,040 push the output onto one of my local 309 00:11:02,000 --> 00:11:05,920 drives here 310 00:11:04,040 --> 00:11:08,000 avml.lime 311 00:11:05,920 --> 00:11:09,440 it's going to take a while to run 312 00:11:08,000 --> 00:11:12,000 thankfully this is pre-recorded so we 313 00:11:09,440 --> 00:11:13,920 can skip ahead to the end 314 00:11:12,000 --> 00:11:17,440 okay so that's just finished my 315 00:11:13,920 --> 00:11:19,600 machine here has 64 gig of memory so it 316 00:11:17,440 --> 00:11:21,839 took about six or seven minutes 317 00:11:19,600 --> 00:11:23,680 now i have this file on my b volume here 318 00:11:21,839 --> 00:11:25,519 called avml.lime 319 00:11:23,680 --> 00:11:27,040 lime is a particular format and there 320 00:11:25,519 --> 00:11:29,040 are tools that can read it but we 321 00:11:27,040 --> 00:11:30,800 actually don't need to use any of those 322 00:11:29,040 --> 00:11:33,680 tools we can do something really 323 00:11:30,800 --> 00:11:35,920 just brutal and nasty 324 00:11:33,680 --> 00:11:36,959 so i'm actually just going to cat and 325 00:11:35,920 --> 00:11:40,240 grep 326 00:11:36,959 --> 00:11:40,240 this avml file 327 00:11:40,959 --> 00:11:45,680 i'm using a grep alternative here called 328 00:11:43,200 --> 00:11:47,279 the silver searcher which is a bit faster but 329 00:11:45,680 --> 00:11:49,440 same basic idea 330 00:11:47,279 --> 00:11:51,839 and i'm actually going to search just 331 00:11:49,440 --> 00:11:53,920 for illustrative purposes 332 00:11:51,839 --> 00:11:55,519 for the 333 00:11:53,920 --> 00:11:58,000 query that we inserted or the data that 334 00:11:55,519 --> 00:12:00,160 we inserted before which was homer 335 00:11:58,000 --> 00:12:00,160 at 336 00:12:00,639 --> 00:12:03,639 springfield.com 337 00:12:03,839 --> 00:12:08,880 now remember that this is running on the 338 00:12:06,160 -->
00:12:12,560 database machine and the query was 339 00:12:08,880 --> 00:12:12,560 actually performed from my mac 340 00:12:16,959 --> 00:12:22,399 i'm just going to kill that you can see here 341 00:12:18,720 --> 00:12:24,800 that the exact command i used 342 00:12:22,399 --> 00:12:27,120 on the mac to do that insertion 343 00:12:24,800 --> 00:12:29,440 was stored in a query cache 344 00:12:27,120 --> 00:12:31,120 in postgres on the server so 345 00:12:29,440 --> 00:12:32,240 immediately i've now got access to the 346 00:12:31,120 --> 00:12:34,000 key 347 00:12:32,240 --> 00:12:35,600 now if i didn't know what data to grep 348 00:12:34,000 --> 00:12:37,839 for i didn't know what had been 349 00:12:35,600 --> 00:12:41,440 inserted i could actually just grep for 350 00:12:37,839 --> 00:12:43,839 say a regex of a hex encoded string 351 00:12:41,440 --> 00:12:44,880 which might give me a key 352 00:12:43,839 --> 00:12:48,399 there's all kinds of different things 353 00:12:44,880 --> 00:12:51,519 you can do so i can search for 354 00:12:48,399 --> 00:12:51,519 let's go 355 00:12:51,680 --> 00:12:54,480 pgp_sym_encrypt 356 00:13:02,079 --> 00:13:05,680 and as you can see there's all of the 357 00:13:04,399 --> 00:13:09,040 previous attempts at getting this 358 00:13:05,680 --> 00:13:09,040 working coming through in the logs 359 00:13:10,639 --> 00:13:14,639 so really this is quite terrifying and a 360 00:13:12,880 --> 00:13:16,399 pretty clear demonstration that tools 361 00:13:14,639 --> 00:13:18,880 like pgcrypto 362 00:13:16,399 --> 00:13:20,320 are actually not as safe and secure as 363 00:13:18,880 --> 00:13:22,959 you might think and that's because 364 00:13:20,320 --> 00:13:25,519 you're sending the keys across the wire 365 00:13:22,959 --> 00:13:26,560 via psql to be processed on the server 366 00:13:25,519 --> 00:13:29,200 side 367 00:13:26,560 --> 00:13:31,040 and because of the way postgres query 368 00:13:29,200 --> 00:13:33,200 caching works a lot of
that data gets 369 00:13:31,040 --> 00:13:34,720 stored and potentially for really quite 370 00:13:33,200 --> 00:13:37,519 a long time 371 00:13:34,720 --> 00:13:39,519 so we need to do something better 372 00:13:37,519 --> 00:13:41,440 all right so let's look at a third class 373 00:13:39,519 --> 00:13:43,360 of encryption and this is called 374 00:13:41,440 --> 00:13:45,360 deterministic encryption so 375 00:13:43,360 --> 00:13:47,279 deterministic encryption is essentially 376 00:13:45,360 --> 00:13:49,839 the idea that every time we encrypt a 377 00:13:47,279 --> 00:13:51,760 particular value or a plaintext if you 378 00:13:49,839 --> 00:13:54,160 like with a given key we'll always get 379 00:13:51,760 --> 00:13:57,040 the same result so in this case i've got 380 00:13:54,160 --> 00:13:59,920 the encryption of cat under the key k 381 00:13:57,040 --> 00:14:01,519 and i get that ciphertext 382 00:13:59,920 --> 00:14:03,040 dog gives me a different one obviously 383 00:14:01,519 --> 00:14:04,240 then i encrypt cat again and i get the 384 00:14:03,040 --> 00:14:05,920 same 385 00:14:04,240 --> 00:14:07,279 ciphertext output and so that's what 386 00:14:05,920 --> 00:14:09,360 makes it deterministic it's always going 387 00:14:07,279 --> 00:14:11,440 to be the same output value 388 00:14:09,360 --> 00:14:13,680 now why would we do that why is that a 389 00:14:11,440 --> 00:14:16,399 good thing why is it useful 390 00:14:13,680 --> 00:14:18,079 let's say we want to look at a database 391 00:14:16,399 --> 00:14:19,760 like this where we're keeping 392 00:14:18,079 --> 00:14:22,000 email addresses stored using 393 00:14:19,760 --> 00:14:24,000 deterministic encryption in the table 394 00:14:22,000 --> 00:14:25,760 let's say we want to look up a person 395 00:14:24,000 --> 00:14:27,519 based on the email address we don't want 396 00:14:25,760 --> 00:14:29,360 to send the email address to the server 397 00:14:27,519 --> 00:14:31,360 as we've seen in the previous
attack 398 00:14:29,360 --> 00:14:33,519 that can be problematic 399 00:14:31,360 --> 00:14:34,880 so we actually want to hash or use a 400 00:14:33,519 --> 00:14:38,000 deterministic encryption scheme to 401 00:14:34,880 --> 00:14:40,079 encrypt the email address first 402 00:14:38,000 --> 00:14:42,160 before performing the query and just do 403 00:14:40,079 --> 00:14:44,079 the query on the ciphertext rather than 404 00:14:42,160 --> 00:14:46,160 on the plaintext so it can be very 405 00:14:44,079 --> 00:14:48,160 useful 406 00:14:46,160 --> 00:14:50,399 however 407 00:14:48,160 --> 00:14:53,279 deterministic encryption has some pretty 408 00:14:50,399 --> 00:14:56,240 major drawbacks there was a study from 409 00:14:53,279 --> 00:14:57,839 some researchers in 2015 naveed et al 410 00:14:56,240 --> 00:14:59,440 and i'll provide a link at the end 411 00:14:57,839 --> 00:15:01,519 of the talk that showed that 412 00:14:59,440 --> 00:15:03,199 deterministic encryption schemes 413 00:15:01,519 --> 00:15:05,600 are vulnerable to what's known as 414 00:15:03,199 --> 00:15:08,079 inference attacks so that's when the 415 00:15:05,600 --> 00:15:09,839 distribution of the ciphertexts mirrors 416 00:15:08,079 --> 00:15:12,079 or is closely related to the 417 00:15:09,839 --> 00:15:13,760 distribution of the plaintexts so you 418 00:15:12,079 --> 00:15:16,639 can see in this 419 00:15:13,760 --> 00:15:19,839 bottom example that for a gaussian 420 00:15:16,639 --> 00:15:21,680 distribution of data the output 421 00:15:19,839 --> 00:15:22,720 domain looks very very similar to the 422 00:15:21,680 --> 00:15:24,639 input domain in terms of the 423 00:15:22,720 --> 00:15:27,199 distribution 424 00:15:24,639 --> 00:15:29,279 a really simple way to visualize that if 425 00:15:27,199 --> 00:15:31,040 you've done any cryptography before i'm 426 00:15:29,279 --> 00:15:32,079 sure you've seen this image 427 00:15:31,040 --> 00:15:34,880 um 428 00:15:32,079 -->
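The cat, dog, cat example and the point about matching distributions can be sketched roughly as follows. This is an illustration under stated assumptions, not the scheme from the slides: a keyed HMAC from the Python standard library stands in for a deterministic cipher (it cannot be decrypted), and `KEY`, `det_token`, and the sample column are hypothetical.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"demo key, not a real secret"  # hypothetical key for illustration

def det_token(plaintext: str) -> str:
    # Deterministic stand-in for a ciphertext: the same key and the same
    # input always give the same output. An HMAC is used only because the
    # standard library ships no block cipher.
    return hmac.new(KEY, plaintext.encode(), hashlib.sha256).hexdigest()[:16]

column = ["cat", "dog", "cat", "cat", "bird", "dog"]
tokens = [det_token(value) for value in column]

# Equal plaintexts produce equal tokens, so the token histogram has the
# same shape as the plaintext histogram even though the values are hidden.
plain_counts = sorted(Counter(column).values(), reverse=True)
token_counts = sorted(Counter(tokens).values(), reverse=True)

print(det_token("cat") == det_token("cat"))
print(plain_counts, token_counts)
```

An attacker who knows roughly how common each value is in the real world can line the two histograms up, which is the intuition behind an inference attack.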
00:15:37,040 when you have an output distribution 429 00:15:34,880 --> 00:15:39,120 that has the same shape or a similar 430 00:15:37,040 --> 00:15:41,199 distribution to the input distribution 431 00:15:39,120 --> 00:15:43,199 it doesn't really hide anything 432 00:15:41,199 --> 00:15:44,800 you can use statistical methods and in 433 00:15:43,199 --> 00:15:47,279 this case just visual inspection to see 434 00:15:44,800 --> 00:15:48,880 that the encryption hasn't really worked 435 00:15:47,279 --> 00:15:51,519 what we want is randomized encryption 436 00:15:48,880 --> 00:15:53,839 but we'll come to that later 437 00:15:51,519 --> 00:15:55,360 so attack number three but actually i'm 438 00:15:53,839 --> 00:15:57,040 not going to do an inference attack i'm 439 00:15:55,360 --> 00:15:58,800 going to do another kind of attack on 440 00:15:57,040 --> 00:16:02,800 deterministic encryption schemes called 441 00:15:58,800 --> 00:16:02,800 a chosen plaintext attack 442 00:16:03,440 --> 00:16:08,399 this kind of attack relies on 443 00:16:06,079 --> 00:16:11,199 the attacker being able to see the 444 00:16:08,399 --> 00:16:13,040 results of the system encrypting 445 00:16:11,199 --> 00:16:15,519 particular plaintexts or particular 446 00:16:13,040 --> 00:16:17,279 values that we want to experiment with 447 00:16:15,519 --> 00:16:18,560 that we want to test 448 00:16:17,279 --> 00:16:20,639 there's lots of different ways you could 449 00:16:18,560 --> 00:16:22,800 do that certainly if an attacker got 450 00:16:20,639 --> 00:16:24,639 access to the database host itself they 451 00:16:22,800 --> 00:16:26,000 could do that but i'm actually going to 452 00:16:24,639 --> 00:16:27,199 use the 453 00:16:26,000 --> 00:16:28,880 database 454 00:16:27,199 --> 00:16:31,759 write-ahead log 455 00:16:28,880 --> 00:16:33,920 and an older database 456 00:16:31,759 --> 00:16:37,839 backup that i've managed to extract from 457 00:16:33,920 --> 00:16:37,839 a
compromised nfs server 458 00:16:38,639 --> 00:16:43,279 so here's my copy of the 459 00:16:41,600 --> 00:16:45,040 database dump it's for a medical 460 00:16:43,279 --> 00:16:47,600 application 461 00:16:45,040 --> 00:16:48,959 i've already imported that into a 462 00:16:47,600 --> 00:16:51,120 database here 463 00:16:48,959 --> 00:16:52,639 called medical app development 464 00:16:51,120 --> 00:16:53,920 you can see it's got quite a bit of data 465 00:16:52,639 --> 00:16:55,920 in it 466 00:16:53,920 --> 00:16:58,560 and 467 00:16:55,920 --> 00:17:01,199 let's take a look at the 468 00:16:58,560 --> 00:17:03,040 patients table 469 00:17:01,199 --> 00:17:05,520 as you can see all of that data is 470 00:17:03,040 --> 00:17:08,160 encrypted it's unintelligible so 471 00:17:05,520 --> 00:17:10,319 all of this data was encrypted 472 00:17:08,160 --> 00:17:13,280 in a ruby on rails application using 473 00:17:10,319 --> 00:17:15,280 the encrypted record function so that 474 00:17:13,280 --> 00:17:17,919 has a particular format that wraps 475 00:17:15,280 --> 00:17:21,439 the ciphertext and the related 476 00:17:17,919 --> 00:17:21,439 information inside json 477 00:17:22,959 --> 00:17:26,480 you can also see that there's a 478 00:17:25,039 --> 00:17:28,559 prescriptions table now the 479 00:17:26,480 --> 00:17:30,960 prescriptions table is not encrypted 480 00:17:28,559 --> 00:17:32,960 instead what this does is link 481 00:17:30,960 --> 00:17:35,360 a patient id which has obviously 482 00:17:32,960 --> 00:17:36,640 encrypted information associated with it 483 00:17:35,360 --> 00:17:37,919 just to a 484 00:17:36,640 --> 00:17:41,120 drug name 485 00:17:37,919 --> 00:17:43,840 so we want to see if we can work out 486 00:17:41,120 --> 00:17:47,120 if a particular individual is taking a 487 00:17:43,840 --> 00:17:49,760 particular kind of drug 488 00:17:47,120 --> 00:17:52,799 okay so now to do this i'm actually 489 00:17:49,760
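The chosen plaintext idea set up above can be sketched in a few lines. This is a simplified illustration under loud assumptions: a keyed HMAC stands in for the application's deterministic encryption (it is not the Rails storage format), the match is done on the whole ciphertext, and every name, email address, id, and drug label below is made up.

```python
import hashlib
import hmac

def det_encrypt(key: bytes, plaintext: str) -> str:
    # Hypothetical deterministic scheme (HMAC stand-in, not Rails' format).
    return hmac.new(key, plaintext.encode(), hashlib.sha256).hexdigest()

# Inside the victim application. The attacker never sees this key.
_APP_KEY = b"server-side key"

def app_create_appointment(email: str) -> str:
    # The app encrypts whatever email the public form submits; the result
    # ends up somewhere the attacker can read, such as a log.
    return det_encrypt(_APP_KEY, email)

# Stolen, older database dump (made-up rows). Patients are encrypted,
# prescriptions are linked in the clear by patient id.
patients = {
    101: det_encrypt(_APP_KEY, "alice@example.com"),
    102: det_encrypt(_APP_KEY, "bob@example.com"),
}
prescriptions = {101: "drug-a", 102: "drug-b"}

def attack(target_email: str):
    observed = app_create_appointment(target_email)  # chosen plaintext
    for patient_id, ciphertext in patients.items():  # match it in the dump
        if hmac.compare_digest(ciphertext, observed):
            return prescriptions.get(patient_id)
    return None

print(attack("bob@example.com"))
```

The attacker never learns the key. Determinism alone is enough to link a known email address to a row in the dump, and the unencrypted join table does the rest.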
--> 00:17:55,840 going to go to the public part of this 490 00:17:52,799 --> 00:17:57,280 medical application it's running on my 491 00:17:55,840 --> 00:17:59,039 local machine here for demonstration 492 00:17:57,280 --> 00:18:00,559 purposes but in practice this would be 493 00:17:59,039 --> 00:18:02,080 running on a public website 494 00:18:00,559 --> 00:18:05,280 somewhere and i'm going to go and create 495 00:18:02,080 --> 00:18:06,559 an appointment for the user that i'm 496 00:18:05,280 --> 00:18:08,799 interested in learning more information 497 00:18:06,559 --> 00:18:08,799 about 498 00:18:09,440 --> 00:18:12,080 so on the new appointment screen i'm 499 00:18:10,960 --> 00:18:13,520 going to 500 00:18:12,080 --> 00:18:16,000 enter 501 00:18:13,520 --> 00:18:17,440 the email address of the user 502 00:18:16,000 --> 00:18:19,280 and i'm going to send an appointment 503 00:18:17,440 --> 00:18:22,720 doesn't matter when 504 00:18:19,280 --> 00:18:27,039 let's do it at 2:25 in the morning 505 00:18:22,720 --> 00:18:27,039 and go ahead and create this appointment 506 00:18:27,120 --> 00:18:31,840 now as i previously mentioned i've been 507 00:18:29,679 --> 00:18:32,720 able to get access to the write-ahead 508 00:18:31,840 --> 00:18:34,799 logs 509 00:18:32,720 --> 00:18:37,200 for the database that's powering this 510 00:18:34,799 --> 00:18:38,559 application 511 00:18:37,200 --> 00:18:40,799 so i'm going to take advantage of that 512 00:18:38,559 --> 00:18:42,240 fact now 513 00:18:40,799 --> 00:18:45,039 i know that because this is a rails 514 00:18:42,240 --> 00:18:47,039 application and it's using encrypted 515 00:18:45,039 --> 00:18:49,280 record i know the format for the 516 00:18:47,039 --> 00:18:51,600 ciphertext or at least for the iv part 517 00:18:49,280 --> 00:18:54,240 of the ciphertext is base64 encoded so 518 00:18:51,600 --> 00:18:57,440 i'm going to grep for base64 encoded 519 00:18:54,240 -->
00:18:58,960 strings inside the write-ahead log 520 00:18:57,440 --> 00:19:01,760 so i've got my write-ahead logs 521 00:18:58,960 --> 00:19:01,760 available here 522 00:19:04,000 --> 00:19:06,640 i'm going to 523 00:19:07,600 --> 00:19:10,960 cat 524 00:19:08,840 --> 00:19:13,600 them i'm going to grep for a regular 525 00:19:10,960 --> 00:19:15,039 expression that should match a 526 00:19:13,600 --> 00:19:16,400 base64 527 00:19:15,039 --> 00:19:20,160 encoded string 528 00:19:16,400 --> 00:19:20,160 at least 12 characters long 529 00:19:21,520 --> 00:19:27,360 and there we go we've got a few of them 530 00:19:24,080 --> 00:19:27,360 now the latest one 531 00:19:28,559 --> 00:19:34,000 we're looking for the iv 532 00:19:31,280 --> 00:19:36,240 that's the initialization vector 533 00:19:34,000 --> 00:19:38,799 so i'm going to use that iv value 534 00:19:36,240 --> 00:19:40,080 which should be unique to see if i 535 00:19:38,799 --> 00:19:42,799 can find 536 00:19:40,080 --> 00:19:44,400 the associated patient record and thus 537 00:19:42,799 --> 00:19:46,400 the medications that they're taking from 538 00:19:44,400 --> 00:19:47,919 my medical database 539 00:19:46,400 --> 00:19:51,720 so to do that i'm going to use a query 540 00:19:47,919 --> 00:19:51,720 that looks a bit like this 541 00:19:52,400 --> 00:19:56,640 so we know that 542 00:19:54,080 --> 00:19:57,760 the data is going to be stored as a json 543 00:19:56,640 --> 00:19:59,600 string 544 00:19:57,760 --> 00:20:02,400 it's how rails 545 00:19:59,600 --> 00:20:02,400 stores the data 546 00:20:02,720 --> 00:20:07,280 so we're going to use postgresql 547 00:20:05,360 --> 00:20:08,480 json functions here we're going to 548 00:20:07,280 --> 00:20:11,440 extract 549 00:20:08,480 --> 00:20:13,200 the initialization vector 550 00:20:11,440 --> 00:20:14,720 out of that json payload we're going to 551 00:20:13,200 --> 00:20:16,080 compare it to the string we found in the 552 00:20:14,720 --> 
00:20:18,640 write-ahead log 553 00:20:16,080 --> 00:20:20,960 so remember we're comparing this to 554 00:20:18,640 --> 00:20:24,960 the database dump that i've managed to 555 00:20:20,960 --> 00:20:24,960 steal it was a few days old 556 00:20:26,559 --> 00:20:31,520 and yes we found a matching record so 557 00:20:28,480 --> 00:20:33,919 you can see the json here the way that 558 00:20:31,520 --> 00:20:36,320 encrypted record stores it 559 00:20:33,919 --> 00:20:37,840 it's got this hash key with the iv 560 00:20:36,320 --> 00:20:40,080 and that's the iv that we found in the 561 00:20:37,840 --> 00:20:45,360 write-ahead log and so now all i need to 562 00:20:40,080 --> 00:20:45,360 do is find the prescription for the user 563 00:20:46,240 --> 00:20:51,280 with that id 564 00:20:47,679 --> 00:20:53,520 inside the prescriptions table 565 00:20:51,280 --> 00:20:56,159 so two seven 566 00:20:53,520 --> 00:20:56,159 four nine 567 00:20:56,320 --> 00:20:59,679 and there we have it 568 00:20:57,919 --> 00:21:02,720 so we can see this patient's been 569 00:20:59,679 --> 00:21:05,200 prescribed disulfiram which if you 570 00:21:02,720 --> 00:21:06,880 look it up is actually a drug 571 00:21:05,200 --> 00:21:08,000 that helps people with alcohol 572 00:21:06,880 --> 00:21:09,760 dependency 573 00:21:08,000 --> 00:21:12,159 so that would be considered pretty 574 00:21:09,760 --> 00:21:14,400 sensitive information and we've just 575 00:21:12,159 --> 00:21:16,960 managed to get it despite the fact that 576 00:21:14,400 --> 00:21:19,360 the database has been fully encrypted 577 00:21:16,960 --> 00:21:21,360 and has been encrypted 578 00:21:19,360 --> 00:21:22,960 before it's been sent to the database 579 00:21:21,360 --> 00:21:25,360 server 580 00:21:22,960 --> 00:21:28,159 i should mention that the way that rails 581 00:21:25,360 --> 00:21:30,880 encryption does deterministic encryption 582 00:21:28,159 --> 00:21:32,720 is it uses a fixed iv 583 00:21:30,880 --> 
00:21:33,520 that's the initialization vector 584 00:21:32,720 --> 00:21:35,120 here 585 00:21:33,520 --> 00:21:37,600 and that's what i was searching for 586 00:21:35,120 --> 00:21:37,600 in this case 587 00:21:37,840 --> 00:21:41,520 all right so we mentioned randomized 588 00:21:40,080 --> 00:21:45,679 encryption before deterministic 589 00:21:41,520 --> 00:21:48,080 encryption has issues so let's go all in 590 00:21:45,679 --> 00:21:49,440 let's see how we can attack randomized 591 00:21:48,080 --> 00:21:52,240 encryption so that is essentially the 592 00:21:49,440 --> 00:21:53,679 best possible encryption 593 00:21:52,240 --> 00:21:55,360 so attack 4 594 00:21:53,679 --> 00:21:56,640 we're going to use a re-identification 595 00:21:55,360 --> 00:21:58,720 attack 596 00:21:56,640 --> 00:22:00,720 to perform this attack we don't need 597 00:21:58,720 --> 00:22:02,240 access to either the client application 598 00:22:00,720 --> 00:22:05,360 or the db host 599 00:22:02,240 --> 00:22:07,600 all we need is a relatively recent dump 600 00:22:05,360 --> 00:22:10,159 of the database like a backup and 601 00:22:07,600 --> 00:22:12,799 actually this would seem fairly feasible 602 00:22:10,159 --> 00:22:15,120 because if the database has been 603 00:22:12,799 --> 00:22:17,919 encrypted using randomized encryption 604 00:22:15,120 --> 00:22:19,679 the dba may have just decided that an 605 00:22:17,919 --> 00:22:21,760 additional encryption step wasn't 606 00:22:19,679 --> 00:22:24,159 necessary because the data was already 607 00:22:21,760 --> 00:22:26,159 encrypted in other words the 608 00:22:24,159 --> 00:22:28,159 database dump may not necessarily be 609 00:22:26,159 --> 00:22:30,320 considered sensitive 610 00:22:28,159 --> 00:22:32,240 so in this example we're going to use a 611 00:22:30,320 --> 00:22:33,600 check-in app something that 612 00:22:32,240 --> 00:22:36,240 everyone's probably very 613 00:22:33,600 --> 00:22:38,960 painfully
familiar with at the moment 614 00:22:36,240 --> 00:22:40,559 and in this case we've got two data 615 00:22:38,960 --> 00:22:42,559 types 616 00:22:40,559 --> 00:22:44,880 one called a person which is fully 617 00:22:42,559 --> 00:22:46,400 encrypted and one called a location 618 00:22:44,880 --> 00:22:48,720 which is also encrypted with a small 619 00:22:46,400 --> 00:22:50,000 caveat which i'll come to in a moment 620 00:22:48,720 --> 00:22:52,320 and to 621 00:22:50,000 --> 00:22:55,360 manage check-ins we're literally just 622 00:22:52,320 --> 00:22:58,240 tracking in a check-in join table 623 00:22:55,360 --> 00:22:59,360 the person id of the person who 624 00:22:58,240 --> 00:23:00,640 checked in 625 00:22:59,360 --> 00:23:03,120 and the location to which they're 626 00:23:00,640 --> 00:23:03,120 checking in 627 00:23:04,159 --> 00:23:08,400 so the data might look something like 628 00:23:05,679 --> 00:23:10,400 this but there is a big problem here and 629 00:23:08,400 --> 00:23:12,159 the database designer or the application 630 00:23:10,400 --> 00:23:14,000 designer may never have realized how 631 00:23:12,159 --> 00:23:15,600 much of a problem this was 632 00:23:14,000 --> 00:23:17,600 and it's a common problem what's 633 00:23:15,600 --> 00:23:20,000 happened here is we've fully 634 00:23:17,600 --> 00:23:22,480 encrypted using a non-deterministic or 635 00:23:20,000 --> 00:23:25,440 randomized scheme the name of the 636 00:23:22,480 --> 00:23:28,320 location as we have done for all the 637 00:23:25,440 --> 00:23:30,640 information we store about a person 638 00:23:28,320 --> 00:23:32,400 but because we want to be able to do 639 00:23:30,640 --> 00:23:34,960 geospatial queries 640 00:23:32,400 --> 00:23:37,760 the location table also has longitudes 641 00:23:34,960 --> 00:23:39,440 and latitudes stored in plain text 642 00:23:37,760 --> 00:23:41,120 and if we were to encrypt that data we 643 00:23:39,440 --> 00:23:42,559 would no longer
be able to do geospatial 644 00:23:41,120 --> 00:23:44,640 searches so 645 00:23:42,559 --> 00:23:47,840 for example you've got an outbreak of 646 00:23:44,640 --> 00:23:49,360 covid and you want to find 647 00:23:47,840 --> 00:23:51,440 all of the 648 00:23:49,360 --> 00:23:54,320 people that have checked into venues 649 00:23:51,440 --> 00:23:55,520 within say a 200 meter radius 650 00:23:54,320 --> 00:23:57,279 i'm not sure if that's actually what 651 00:23:55,520 --> 00:23:59,440 health professionals do but i can 652 00:23:57,279 --> 00:24:02,320 imagine that would be a 653 00:23:59,440 --> 00:24:02,320 useful kind of query 654 00:24:03,279 --> 00:24:07,200 and actually this idea of storing 655 00:24:06,240 --> 00:24:09,039 uh 656 00:24:07,200 --> 00:24:11,760 what seems to be 657 00:24:09,039 --> 00:24:12,960 non-sensitive data in a database 658 00:24:11,760 --> 00:24:14,960 alongside 659 00:24:12,960 --> 00:24:18,799 sensitive data that's been encrypted 660 00:24:14,960 --> 00:24:20,320 is actually very common but there's a 661 00:24:18,799 --> 00:24:22,480 major issue when it comes to location 662 00:24:20,320 --> 00:24:25,039 data in particular there was a study 663 00:24:22,480 --> 00:24:28,320 back in 2013 that showed that 664 00:24:25,039 --> 00:24:31,200 only four unique points what they call a 665 00:24:28,320 --> 00:24:33,840 trajectory for a person are required to 666 00:24:31,200 --> 00:24:36,720 identify a unique individual 667 00:24:33,840 --> 00:24:38,320 they can do that with 95 percent confidence and 668 00:24:36,720 --> 00:24:41,200 so we're going to take advantage of that 669 00:24:38,320 --> 00:24:43,600 fact in this attack 670 00:24:41,200 --> 00:24:46,799 now the way that we do that is by using 671 00:24:43,600 --> 00:24:48,720 a related data set so actually i happen 672 00:24:46,799 --> 00:24:50,559 to this is all hypothetical of course 673 00:24:48,720 --> 00:24:52,240 and not real but 674 00:24:50,559 --> 00:24:54,799 i've in this
example in this 675 00:24:52,240 --> 00:24:57,600 demonstration i've come across 676 00:24:54,799 --> 00:24:59,679 a data set that has been leaked from an 677 00:24:57,600 --> 00:25:01,600 online dating app that online dating app 678 00:24:59,679 --> 00:25:02,720 was tracking the location of all of its 679 00:25:01,600 --> 00:25:04,720 users 680 00:25:02,720 --> 00:25:06,320 it wasn't encrypting the data so now i 681 00:25:04,720 --> 00:25:08,159 have a data set that's been leaked to 682 00:25:06,320 --> 00:25:10,000 the dark web 683 00:25:08,159 --> 00:25:12,640 with a 684 00:25:10,000 --> 00:25:14,640 person's email address and a whole set 685 00:25:12,640 --> 00:25:17,039 of longitudes and latitudes 686 00:25:14,640 --> 00:25:19,360 their trajectories if you like now while 687 00:25:17,039 --> 00:25:22,159 this example may be fabricated it 688 00:25:19,360 --> 00:25:24,720 is based on many real world attacks for 689 00:25:22,159 --> 00:25:26,880 example there was a new york times 690 00:25:24,720 --> 00:25:29,520 journalist a few years ago who was able 691 00:25:26,880 --> 00:25:31,760 to use an attack very similar to this to 692 00:25:29,520 --> 00:25:34,480 learn an enormous amount of information 693 00:25:31,760 --> 00:25:36,240 about people that they were profiling 694 00:25:34,480 --> 00:25:37,279 some of whom were even 695 00:25:36,240 --> 00:25:38,480 in the 696 00:25:37,279 --> 00:25:41,120 office of the president of the united 697 00:25:38,480 --> 00:25:42,880 states so it's a very real attack this is 698 00:25:41,120 --> 00:25:44,640 just a demonstration showing something 699 00:25:42,880 --> 00:25:46,720 very similar 700 00:25:44,640 --> 00:25:49,039 so as i mentioned i was able to get a 701 00:25:46,720 --> 00:25:50,400 copy of the check-in 702 00:25:49,039 --> 00:25:52,159 app 703 00:25:50,400 --> 00:25:54,880 database dump and so i've imported it 704 00:25:52,159 --> 00:25:56,400 into a database on my local machine here 705 00:25:54,880
--> 00:25:58,159 you can see there's a few tables you can 706 00:25:56,400 --> 00:26:00,960 see the location table and people table 707 00:25:58,159 --> 00:26:04,000 and the check-ins table 708 00:26:00,960 --> 00:26:05,840 if i have a look at the people table 709 00:26:04,000 --> 00:26:07,679 you can see that all of that data is 710 00:26:05,840 --> 00:26:08,880 encrypted name and email address are 711 00:26:07,679 --> 00:26:10,480 both encrypted 712 00:26:08,880 --> 00:26:12,559 it can be a little bit tricky to see if 713 00:26:10,480 --> 00:26:14,000 this is deterministic or randomized 714 00:26:12,559 --> 00:26:15,919 encryption generally speaking if 715 00:26:14,000 --> 00:26:18,080 something is randomized you should never 716 00:26:15,919 --> 00:26:20,000 see any repeating values 717 00:26:18,080 --> 00:26:22,000 or whereas deterministic you sometimes 718 00:26:20,000 --> 00:26:24,240 will see repeating values if they're 719 00:26:22,000 --> 00:26:26,960 repeating plain text in the database for 720 00:26:24,240 --> 00:26:28,720 the locations table 721 00:26:26,960 --> 00:26:30,240 you can see that the location name is 722 00:26:28,720 --> 00:26:31,440 encrypted 723 00:26:30,240 --> 00:26:33,120 but the 724 00:26:31,440 --> 00:26:35,120 long and latitude longitudes and 725 00:26:33,120 --> 00:26:36,640 latitudes are not 726 00:26:35,120 --> 00:26:38,640 now as i mentioned 727 00:26:36,640 --> 00:26:41,520 i also have a 728 00:26:38,640 --> 00:26:44,159 reference data set from this 729 00:26:41,520 --> 00:26:47,120 dating site 730 00:26:44,159 --> 00:26:49,919 you can see that this has just got a csv 731 00:26:47,120 --> 00:26:52,400 of longitudes and latitudes 732 00:26:49,919 --> 00:26:54,159 mapped to email addresses of the of the 733 00:26:52,400 --> 00:26:55,760 user 734 00:26:54,159 --> 00:26:56,720 first thing we might want to try is look 735 00:26:55,760 --> 00:26:58,720 for 736 00:26:56,720 --> 00:27:00,480 common data points from our reference 737 
00:26:58,720 --> 00:27:03,200 csv file 738 00:27:00,480 --> 00:27:05,840 in the check-in database 739 00:27:03,200 --> 00:27:07,919 dump restore 740 00:27:05,840 --> 00:27:11,480 so let's try one let's try this 741 00:27:07,919 --> 00:27:11,480 first one here 742 00:27:24,880 --> 00:27:30,640 and set the lat and the long values from 743 00:27:27,760 --> 00:27:32,799 the reference csv row 744 00:27:30,640 --> 00:27:35,039 and we don't find any values 745 00:27:32,799 --> 00:27:37,279 now i'll let you in on a secret there 746 00:27:35,039 --> 00:27:39,679 are no longs and lats here that match 747 00:27:37,279 --> 00:27:41,679 exactly and this is actually quite a 748 00:27:39,679 --> 00:27:43,279 problem for longitudes and latitudes 749 00:27:41,679 --> 00:27:46,000 because they are 750 00:27:43,279 --> 00:27:48,480 very very high precision numbers and to 751 00:27:46,000 --> 00:27:50,080 find something that matches exactly can 752 00:27:48,480 --> 00:27:52,399 be very challenging 753 00:27:50,080 --> 00:27:54,640 so we're going to use a concept called a 754 00:27:52,399 --> 00:27:56,000 geohash and what a geohash does is it 755 00:27:54,640 --> 00:27:57,760 buckets 756 00:27:56,000 --> 00:28:01,039 a longitude and latitude down into a 757 00:27:57,760 --> 00:28:03,039 particular kind of cell 758 00:28:01,039 --> 00:28:06,480 of a certain precision depending on 759 00:28:03,039 --> 00:28:08,480 how wide a range of locations we want 760 00:28:06,480 --> 00:28:10,559 to assess so we're going to 761 00:28:08,480 --> 00:28:12,159 see if there are longs and lats from 762 00:28:10,559 --> 00:28:13,840 the reference data set and the 763 00:28:12,159 --> 00:28:16,480 target data set that fit into the same 764 00:28:13,840 --> 00:28:19,279 geohash and use that as a proxy for 765 00:28:16,480 --> 00:28:22,159 being the same location 766 00:28:19,279 --> 00:28:23,279 so i've written a little script in ruby here 767 00:28:22,159 --> 00:28:27,039 
which 768 00:28:23,279 --> 00:28:28,720 loads the reference data set 769 00:28:27,039 --> 00:28:30,399 and creates a 770 00:28:28,720 --> 00:28:32,159 hash of the 771 00:28:30,399 --> 00:28:34,960 reference email addresses along with all 772 00:28:32,159 --> 00:28:37,200 of their points i.e. their trajectory 773 00:28:34,960 --> 00:28:39,600 then it iterates over each person 774 00:28:37,200 --> 00:28:41,600 in my target database the check-in app 775 00:28:39,600 --> 00:28:43,919 database 776 00:28:41,600 --> 00:28:47,120 and tries to 777 00:28:43,919 --> 00:28:49,679 find a person that matches some of 778 00:28:47,120 --> 00:28:51,520 those points ideally four or five of 779 00:28:49,679 --> 00:28:52,720 those points before we have the sort of 780 00:28:51,520 --> 00:28:54,559 confidence 781 00:28:52,720 --> 00:28:56,960 relative confidence that that 782 00:28:54,559 --> 00:28:59,919 person is the same person 783 00:28:56,960 --> 00:29:02,799 i'm just using this person similarity 784 00:28:59,919 --> 00:29:05,679 function which internally uses 785 00:29:02,799 --> 00:29:07,039 the geohashing technique to try and find 786 00:29:05,679 --> 00:29:08,799 matching points 787 00:29:07,039 --> 00:29:10,080 happy to share the source code for 788 00:29:08,799 --> 00:29:11,679 anyone who might be interested in having 789 00:29:10,080 --> 00:29:12,559 a look at it so let's go ahead and run 790 00:29:11,679 --> 00:29:14,880 this code 791 00:29:12,559 --> 00:29:16,399 so i'm going to call the correlate 792 00:29:14,880 --> 00:29:17,840 function i'm going to call this 793 00:29:16,399 --> 00:29:20,399 correlate script and i'm going to pass 794 00:29:17,840 --> 00:29:25,480 it the reference csv which remember was 795 00:29:20,399 --> 00:29:25,480 the data from the leaked dating app 796 00:29:34,880 --> 00:29:38,640 as you can see it's starting to 797 00:29:36,880 --> 00:29:43,120 work through each of the records in the 798 00:29:38,640 --> 00:29:44,960 target
database and we can see that 799 00:29:43,120 --> 00:29:46,480 for record id one there was no match so 800 00:29:44,960 --> 00:29:49,440 we weren't able to learn who that 801 00:29:46,480 --> 00:29:50,960 person was for id two however we found three 802 00:29:49,440 --> 00:29:52,640 matching points so that's a reasonable 803 00:29:50,960 --> 00:29:53,760 confidence level it's probably 70 percent 804 00:29:52,640 --> 00:29:55,840 confident 805 00:29:53,760 --> 00:29:56,960 that's this person here further down 806 00:29:55,840 --> 00:29:59,360 we've got 807 00:29:56,960 --> 00:30:01,200 another person record id four where we found 808 00:29:59,360 --> 00:30:02,799 four matching points so that's really a 809 00:30:01,200 --> 00:30:05,039 very high confidence level probably 95 810 00:30:02,799 --> 00:30:06,640 percent and it continues to go 811 00:30:05,039 --> 00:30:09,279 through you can see that a few were not 812 00:30:06,640 --> 00:30:11,279 found at all some had reasonable 813 00:30:09,279 --> 00:30:13,120 confidence a few more had very high 814 00:30:11,279 --> 00:30:14,960 confidence this one had five 815 00:30:13,120 --> 00:30:16,960 and so on and so forth 816 00:30:14,960 --> 00:30:18,640 so this is a very naive approach to go 817 00:30:16,960 --> 00:30:20,000 through a thousand records would take 818 00:30:18,640 --> 00:30:21,840 quite a while certainly if you've got 819 00:30:20,000 --> 00:30:23,600 databases of hundreds of thousands or 820 00:30:21,840 --> 00:30:25,440 millions of records this would be a very 821 00:30:23,600 --> 00:30:27,039 inefficient way to do it 822 00:30:25,440 --> 00:30:28,320 there are much smarter techniques for 823 00:30:27,039 --> 00:30:30,960 example there's an algorithm called the 824 00:30:28,320 --> 00:30:33,279 hungarian algorithm which is from 825 00:30:30,960 --> 00:30:34,960 combinatorics and that allows you to do 826 00:30:33,279 --> 00:30:36,960 things um 827 00:30:34,960 --> 00:30:38,799 that allows you to do tests like this 828 
00:30:36,960 --> 00:30:40,399 much much faster 829 00:30:38,799 --> 00:30:42,880 there are also more sophisticated ways 830 00:30:40,399 --> 00:30:45,279 to assess similarity i can look at uh 831 00:30:42,880 --> 00:30:47,279 not just the locations that a person has 832 00:30:45,279 --> 00:30:49,200 been but what times of day they've been 833 00:30:47,279 --> 00:30:52,159 what days of the week 834 00:30:49,200 --> 00:30:53,120 how recently they've been versus 835 00:30:52,159 --> 00:30:55,200 you know 836 00:30:53,120 --> 00:30:56,399 somebody that's been very recently to a 837 00:30:55,200 --> 00:30:57,760 location versus they haven't been there 838 00:30:56,399 --> 00:30:59,120 for a while may indicate something about 839 00:30:57,760 --> 00:31:00,880 that person so there are quite 840 00:30:59,120 --> 00:31:02,840 sophisticated techniques that that 841 00:31:00,880 --> 00:31:06,000 really motivated attackers can start to 842 00:31:02,840 --> 00:31:08,080 apply and we've taken advantage here of 843 00:31:06,000 --> 00:31:10,480 the fact that a very small amount of 844 00:31:08,080 --> 00:31:12,720 seemingly innocuous data 845 00:31:10,480 --> 00:31:14,240 was included in the database 846 00:31:12,720 --> 00:31:16,320 despite the fact that all of the other 847 00:31:14,240 --> 00:31:18,159 data was encrypted 848 00:31:16,320 --> 00:31:19,919 so what have we learned from these four 849 00:31:18,159 --> 00:31:21,760 attacks 850 00:31:19,919 --> 00:31:24,799 number one i think it's pretty clear 851 00:31:21,760 --> 00:31:26,720 encryption is not a silver bullet 852 00:31:24,799 --> 00:31:29,039 and that even a small advantage matters 853 00:31:26,720 --> 00:31:31,919 any small opportunity that an attacker 854 00:31:29,039 --> 00:31:32,720 has to gain an advantage 855 00:31:31,919 --> 00:31:36,000 will 856 00:31:32,720 --> 00:31:37,519 potentially give them access to data 857 00:31:36,000 --> 00:31:39,039 also what's really interesting going 858 00:31:37,519 --> 
00:31:41,519 through exercises like this is you 859 00:31:39,039 --> 00:31:43,120 realize the security of your data can be 860 00:31:41,519 --> 00:31:44,720 limited by the security of systems you 861 00:31:43,120 --> 00:31:46,720 don't control as soon as you have any 862 00:31:44,720 --> 00:31:48,799 information that is available to an 863 00:31:46,720 --> 00:31:51,039 attacker they can cross-reference that 864 00:31:48,799 --> 00:31:54,720 with data that's been leaked from 865 00:31:51,039 --> 00:31:56,720 systems much less secure than yours 866 00:31:54,720 --> 00:31:58,159 so how do we solve these problems 867 00:31:56,720 --> 00:32:00,000 well number one 868 00:31:58,159 --> 00:32:02,480 generally speaking the rule of thumb is 869 00:32:00,000 --> 00:32:04,240 always encrypt everything and i mean 870 00:32:02,480 --> 00:32:05,919 everything everything you possibly can 871 00:32:04,240 --> 00:32:08,600 with randomized encryption so for 872 00:32:05,919 --> 00:32:11,039 example aes running in gcm mode with a 873 00:32:08,600 --> 00:32:13,120 256-bit key that is really the best 874 00:32:11,039 --> 00:32:15,279 possible encryption available we can use 875 00:32:13,120 --> 00:32:16,720 things like strong searchable encryption 876 00:32:15,279 --> 00:32:19,760 and what i mean by strong is it should 877 00:32:16,720 --> 00:32:21,039 be randomized non-deterministic 878 00:32:19,760 --> 00:32:23,200 but that's actually 879 00:32:21,039 --> 00:32:24,799 really a very new technology 880 00:32:23,200 --> 00:32:25,919 it's partly what we're working on at 881 00:32:24,799 --> 00:32:27,840 cipherstash 882 00:32:25,919 --> 00:32:29,440 and isn't really generally 883 00:32:27,840 --> 00:32:30,640 available yet but certainly 884 00:32:29,440 --> 00:32:32,320 these kinds of technologies are 885 00:32:30,640 --> 00:32:33,840 coming 886 00:32:32,320 --> 00:32:35,519 and similarly with order-revealing 887 00:32:33,840 --> 00:32:38,880 encryption and homomorphic
encryption 888 00:32:35,519 --> 00:32:40,240 these technologies promise to 889 00:32:38,880 --> 00:32:42,640 allow us to 890 00:32:40,240 --> 00:32:44,159 add encryption to systems 891 00:32:42,640 --> 00:32:46,000 in ways that we haven't been able to 892 00:32:44,159 --> 00:32:48,559 before and in particular 893 00:32:46,000 --> 00:32:50,159 for example the case of the longitudes 894 00:32:48,559 --> 00:32:51,679 and latitudes that we left unencrypted 895 00:32:50,159 --> 00:32:53,679 because we wanted to do geospatial 896 00:32:51,679 --> 00:32:55,919 searches on them things like homomorphic 897 00:32:53,679 --> 00:32:58,240 and order-revealing encryption 898 00:32:55,919 --> 00:33:00,240 allow us to fully encrypt those data 899 00:32:58,240 --> 00:33:03,120 points as well so we really get the 900 00:33:00,240 --> 00:33:04,960 best of both worlds 901 00:33:03,120 --> 00:33:06,880 and it really goes without saying but 902 00:33:04,960 --> 00:33:08,960 i'll repeat it nonetheless 903 00:33:06,880 --> 00:33:10,240 always use a threat model and get pen 904 00:33:08,960 --> 00:33:13,200 tests regularly from people who 905 00:33:10,240 --> 00:33:13,200 understand this space 906 00:33:14,480 --> 00:33:17,600 so finally i'll leave you with some 907 00:33:15,840 --> 00:33:19,919 resources 908 00:33:17,600 --> 00:33:22,320 there's the paper from paul grubbs and 909 00:33:19,919 --> 00:33:25,279 his co-authors there's a really 910 00:33:22,320 --> 00:33:27,519 interesting relatively digestible video 911 00:33:25,279 --> 00:33:30,399 by a channel called computerphile on 912 00:33:27,519 --> 00:33:32,240 youtube which talks about this 913 00:33:30,399 --> 00:33:33,440 re-identification problem using location 914 00:33:32,240 --> 00:33:36,320 data 915 00:33:33,440 --> 00:33:38,399 and then another paper by naveed et al 916 00:33:36,320 --> 00:33:39,840 from brown it talks about inference 917 00:33:38,399 --> 00:33:41,279 attacks it's a really interesting paper 918 
00:33:39,840 --> 00:33:42,320 quite heavy mathematically but if you're 919 00:33:41,279 --> 00:33:45,360 interested in this kind of thing it's a 920 00:33:42,320 --> 00:33:47,760 really good place to start 921 00:33:45,360 --> 00:33:49,279 so thank you i hope you enjoyed this 922 00:33:47,760 --> 00:33:50,880 talk and learned a few things about 923 00:33:49,279 --> 00:33:53,200 encrypted databases and some of the 924 00:33:50,880 --> 00:33:55,760 pitfalls you can follow me on twitter 925 00:33:53,200 --> 00:33:57,200 danieldraper feel free to email me if 926 00:33:55,760 --> 00:33:59,440 you're interested in talking about this 927 00:33:57,200 --> 00:34:01,440 topic or if you want to learn more 928 00:33:59,440 --> 00:34:02,960 and you can check me out at my workplace 929 00:34:01,440 --> 00:34:04,080 at cipherstash.com 930 00:34:02,960 --> 00:34:07,080 thanks for listening and talk to you 931 00:34:04,080 --> 00:34:07,080 soon
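
[note on the attack 4 demo] the correlate script itself is not shown in the talk, so what follows is only a minimal sketch of the idea in ruby, the language the speaker says the script was written in. it buckets each latitude and longitude into a geohash cell, then counts how many cells a leaked reference trajectory shares with each person in the target dump. the names geohash, matching_cells and correlate, the precision of 7 and the threshold of 4 matches are all assumptions made for illustration, not the speaker's code.

```ruby
require "set"

# Standard geohash base32 alphabet (omits a, i, l and o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz".freeze

# Encode a latitude/longitude pair as a geohash string.
# Bits alternate longitude, latitude, longitude, ... and every
# five bits become one base32 character.
def geohash(lat, lon, precision = 7)
  lat_range = [-90.0, 90.0]
  lon_range = [-180.0, 180.0]
  hash = String.new
  bits = 0
  bit_count = 0
  use_lon = true

  while hash.length < precision
    range, value = use_lon ? [lon_range, lon] : [lat_range, lat]
    mid = (range[0] + range[1]) / 2.0
    if value >= mid
      bits = (bits << 1) | 1
      range[0] = mid          # keep the upper half
    else
      bits <<= 1
      range[1] = mid          # keep the lower half
    end
    use_lon = !use_lon
    bit_count += 1

    if bit_count == 5
      hash << BASE32[bits]
      bits = 0
      bit_count = 0
    end
  end
  hash
end

# Number of geohash cells two trajectories have in common.
# Hypothetical stand-in for the "person similarity" function in the talk.
def matching_cells(reference_points, target_points, precision = 7)
  ref = reference_points.map { |lat, lon| geohash(lat, lon, precision) }.to_set
  tgt = target_points.map { |lat, lon| geohash(lat, lon, precision) }.to_set
  (ref & tgt).size
end

# reference: { email => [[lat, lon], ...] } from the leaked data set
# targets:   { person_id => [[lat, lon], ...] } from the check-in dump
# Returns { person_id => email } for people with enough shared cells.
def correlate(reference, targets, min_matches: 4, precision: 7)
  targets.each_with_object({}) do |(person_id, points), found|
    email, score = reference
      .map { |e, ref_points| [e, matching_cells(ref_points, points, precision)] }
      .max_by { |_, s| s }
    found[person_id] = email if score && score >= min_matches
  end
end
```

this compares every target person against every reference trajectory, which is the naive approach the talk describes. two nearby points can also fall on either side of a cell boundary and be missed, so a more careful version would probably check neighbouring cells as well.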