1 00:00:00,000 --> 00:00:08,469 foreign 2 00:00:00,500 --> 00:00:08,469 [Music] 3 00:00:11,219 --> 00:00:16,139 good afternoon so here this afternoon we 4 00:00:14,160 --> 00:00:17,940 have Paul Weber and he's here to talk to 5 00:00:16,139 --> 00:00:19,140 us about improving old compression 6 00:00:17,940 --> 00:00:21,660 algorithms 7 00:00:19,140 --> 00:00:23,760 Paul started writing code for a zx80 on 8 00:00:21,660 --> 00:00:25,800 the typewriter and things improved a bit 9 00:00:23,760 --> 00:00:27,720 since then they currently work for Red 10 00:00:25,800 --> 00:00:30,119 Hat trying to automate so that other 11 00:00:27,720 --> 00:00:31,980 people don't have to today Paul will 12 00:00:30,119 --> 00:00:34,739 discuss this Research into improving the 13 00:00:31,980 --> 00:00:37,079 lcw algorithm to get a demonstration on 14 00:00:34,739 --> 00:00:38,880 encoding and decoding and compare his 15 00:00:37,079 --> 00:00:41,380 compression ratios to other more modern 16 00:00:38,880 --> 00:00:44,719 compression methods please welcome Paul 17 00:00:41,380 --> 00:00:44,719 [Applause] 18 00:00:46,079 --> 00:00:51,180 thank you very much and greetings from 19 00:00:49,140 --> 00:00:53,039 nunawal country where I come from in 20 00:00:51,180 --> 00:00:55,800 Canberra 21 00:00:53,039 --> 00:00:57,420 so compression algorithms has been 22 00:00:55,800 --> 00:01:00,360 something I've been really interested in 23 00:00:57,420 --> 00:01:03,300 for a long time so I'm going to give you 24 00:01:00,360 --> 00:01:04,920 a bit of a history of compression uh 25 00:01:03,300 --> 00:01:07,200 some of the theory behind it some of the 26 00:01:04,920 --> 00:01:10,200 algorithm work that I've been doing and 27 00:01:07,200 --> 00:01:13,920 sort of look into the future so I want 28 00:01:10,200 --> 00:01:15,740 to take you back to 1977. who wasn't 29 00:01:13,920 --> 00:01:18,600 alive in 30 00:01:15,740 --> 00:01:21,180 1977. okay there's a few so so I'm going 31 00:01:18,600 --> 00:01:22,580 to have to recap some stuff 32 00:01:21,180 --> 00:01:26,100 um 33 00:01:22,580 --> 00:01:29,580 some some bad things happened uh some 34 00:01:26,100 --> 00:01:30,960 good things happened uh the first hints 35 00:01:29,580 --> 00:01:34,500 of an Internet 36 00:01:30,960 --> 00:01:38,340 started appearing something particularly 37 00:01:34,500 --> 00:01:41,579 important for a lot of people here that 38 00:01:38,340 --> 00:01:43,500 were alive during this time was these 39 00:01:41,579 --> 00:01:48,079 three computers appeared on the scene 40 00:01:43,500 --> 00:01:51,320 the pet the Apple II and the TRS-80 and 41 00:01:48,079 --> 00:01:54,780 those along with a number of other 42 00:01:51,320 --> 00:01:57,840 personal computers really took Computing 43 00:01:54,780 --> 00:02:01,140 from stuff that you did maybe at your 44 00:01:57,840 --> 00:02:04,619 parents work or in a library into the 45 00:02:01,140 --> 00:02:07,619 home that you could play with yourself 46 00:02:04,619 --> 00:02:10,979 the other important thing that is for my 47 00:02:07,619 --> 00:02:13,080 talk uh in that happened in 1977 was 48 00:02:10,979 --> 00:02:16,260 that these two gentlemen Abraham Lincoln 49 00:02:13,080 --> 00:02:18,660 on the left left who's sadly passed away 50 00:02:16,260 --> 00:02:22,620 so this talk is kind of dedicated to him 51 00:02:18,660 --> 00:02:27,239 and Jacob ziv on the right uh introduced 52 00:02:22,620 --> 00:02:29,580 a compression algorithm called lz77 53 00:02:27,239 --> 00:02:34,319 um strange title but there you go 54 00:02:29,580 --> 00:02:37,200 um the way this works is that we look at 55 00:02:34,319 --> 00:02:39,840 the output we're either outputting sort 56 00:02:37,200 --> 00:02:43,019 of literal data or we have a way of 57 00:02:39,840 --> 00:02:45,660 saying there's a repeat that's in the 58 00:02:43,019 --> 00:02:49,260 output stream that is already in the 59 00:02:45,660 --> 00:02:50,940 output and so we refer back to this we 60 00:02:49,260 --> 00:02:54,120 have this sort of Tuple in the original 61 00:02:50,940 --> 00:02:56,220 paper a bit the distance back to the 62 00:02:54,120 --> 00:02:57,720 last match the length of that match and 63 00:02:56,220 --> 00:02:58,920 then the next character that we're going 64 00:02:57,720 --> 00:03:02,280 to Output 65 00:02:58,920 --> 00:03:05,879 if we have no match then we just output 66 00:03:02,280 --> 00:03:07,440 zeros and we output characters one at a 67 00:03:05,879 --> 00:03:10,920 time until we've got enough to actually 68 00:03:07,440 --> 00:03:14,040 form a match so this basically looks 69 00:03:10,920 --> 00:03:18,560 back into a sliding window over the past 70 00:03:14,040 --> 00:03:23,280 output to look for matches that the 71 00:03:18,560 --> 00:03:26,640 encoder can efficiently uh 72 00:03:23,280 --> 00:03:28,980 is transmit and the decoder then knows 73 00:03:26,640 --> 00:03:31,980 to find in that window 74 00:03:28,980 --> 00:03:34,080 like it's impossible to understate the 75 00:03:31,980 --> 00:03:37,019 importance of this algorithm these are 76 00:03:34,080 --> 00:03:40,799 the tools just some of them that have 77 00:03:37,019 --> 00:03:43,560 been influenced by lz77 not just the 78 00:03:40,799 --> 00:03:46,799 tools but the algorithms as well as ltma 79 00:03:43,560 --> 00:03:49,879 obviously is one of them but there are 80 00:03:46,799 --> 00:03:52,500 decompressors for different 81 00:03:49,879 --> 00:03:55,280 architectures there's so when your 82 00:03:52,500 --> 00:03:58,500 memory is very constrained there are 83 00:03:55,280 --> 00:04:02,540 there are algorithms that are optimized 84 00:03:58,500 --> 00:04:02,540 particularly for decompression speed 85 00:04:02,900 --> 00:04:09,659 it it they have just been huge so 86 00:04:07,379 --> 00:04:13,439 uh we're going to move along on to 87 00:04:09,659 --> 00:04:14,340 another uh the next year 1978 88 00:04:13,439 --> 00:04:16,799 um 89 00:04:14,340 --> 00:04:18,419 a bunch of bad things happened a bunch 90 00:04:16,799 --> 00:04:19,079 of good things happened 91 00:04:18,419 --> 00:04:22,620 um 92 00:04:19,079 --> 00:04:25,380 and these two gentlemen not satisfied 93 00:04:22,620 --> 00:04:28,979 with releasing one algorithm released a 94 00:04:25,380 --> 00:04:28,979 second called lz78 95 00:04:29,220 --> 00:04:34,199 now this I know once you're on a theme 96 00:04:32,520 --> 00:04:37,139 stick to it right 97 00:04:34,199 --> 00:04:41,699 um so uh this works slightly differently 98 00:04:37,139 --> 00:04:45,180 what it does is it learns sequences of 99 00:04:41,699 --> 00:04:49,440 adjacent characters or symbols in the 100 00:04:45,180 --> 00:04:50,240 input and put them into a dictionary and 101 00:04:49,440 --> 00:04:55,259 then 102 00:04:50,240 --> 00:04:58,259 the output is the longest dictionary 103 00:04:55,259 --> 00:05:00,440 entry that matches that's in the 104 00:04:58,259 --> 00:05:03,060 dictionary at the time 105 00:05:00,440 --> 00:05:05,220 I have to take you slightly further 106 00:05:03,060 --> 00:05:06,900 forward to give you a good example of 107 00:05:05,220 --> 00:05:08,419 this I'm going I need to go forward to 108 00:05:06,900 --> 00:05:12,560 1984 109 00:05:08,419 --> 00:05:16,860 we were at war with East Asia again 110 00:05:12,560 --> 00:05:18,540 there was a lot of sex crime I was 13 at 111 00:05:16,860 --> 00:05:19,740 the moment so I didn't have any of that 112 00:05:18,540 --> 00:05:20,639 but 113 00:05:19,740 --> 00:05:23,160 um 114 00:05:20,639 --> 00:05:27,660 Terry Welch who I don't have a photo of 115 00:05:23,160 --> 00:05:30,020 uh released an improvement to lz78 and 116 00:05:27,660 --> 00:05:33,900 it's commonly called lzw 117 00:05:30,020 --> 00:05:38,580 the way this works which is conceptually 118 00:05:33,900 --> 00:05:41,100 a lot easier than lz78 is that we start 119 00:05:38,580 --> 00:05:44,039 with a symbol table that contains the 120 00:05:41,100 --> 00:05:45,479 first the basically all variations of an 121 00:05:44,039 --> 00:05:48,060 8-bit byte 122 00:05:45,479 --> 00:05:50,460 uh there's also a flush table symbol and 123 00:05:48,060 --> 00:05:51,419 there's a stop input symbol 124 00:05:50,460 --> 00:05:56,340 um 125 00:05:51,419 --> 00:05:57,419 and we basically add input 126 00:05:56,340 --> 00:06:00,960 um 127 00:05:57,419 --> 00:06:03,720 to sort of our buffer until that buffer 128 00:06:00,960 --> 00:06:05,639 no longer is matches an entry in the 129 00:06:03,720 --> 00:06:08,100 symbol table 130 00:06:05,639 --> 00:06:09,120 at that point we know we've got a new 131 00:06:08,100 --> 00:06:12,960 symbol 132 00:06:09,120 --> 00:06:16,620 so we add the last symbol that matched 133 00:06:12,960 --> 00:06:18,120 we remember the next character and we 134 00:06:16,620 --> 00:06:20,840 start looking again 135 00:06:18,120 --> 00:06:23,160 so I'm going to give you an example 136 00:06:20,840 --> 00:06:25,919 of how this works we're going to be 137 00:06:23,160 --> 00:06:27,900 encoding this admittedly very 138 00:06:25,919 --> 00:06:30,840 repetitious string but it's good for my 139 00:06:27,900 --> 00:06:34,199 purposes here and I'm going to ask you 140 00:06:30,840 --> 00:06:36,900 to bear with me for one bit because I'm 141 00:06:34,199 --> 00:06:38,280 not going to write out all 256 symbol 142 00:06:36,900 --> 00:06:40,740 entries here I just want you to imagine 143 00:06:38,280 --> 00:06:42,900 them I'm just going to number the 144 00:06:40,740 --> 00:06:45,780 symbols that we've got here at the 145 00:06:42,900 --> 00:06:48,060 moment we say we start with those the 146 00:06:45,780 --> 00:06:51,240 four characters that appear in this 147 00:06:48,060 --> 00:06:52,979 string as the first four entries in our 148 00:06:51,240 --> 00:06:57,180 symbol table 149 00:06:52,979 --> 00:07:01,139 so we read the first B we're looking 150 00:06:57,180 --> 00:07:04,319 ahead to a there's no ba in the symbol 151 00:07:01,139 --> 00:07:05,819 table so we add that and then we write 152 00:07:04,319 --> 00:07:09,300 the symbol B 153 00:07:05,819 --> 00:07:13,560 uh we'd go we've got the a we're looking 154 00:07:09,300 --> 00:07:16,680 ahead to n there's no a n symbol uh we 155 00:07:13,560 --> 00:07:18,660 add that and we write the a you can see 156 00:07:16,680 --> 00:07:23,880 how we're building up this symbol table 157 00:07:18,660 --> 00:07:27,300 here so we add n a that's good 158 00:07:23,880 --> 00:07:29,160 and now we have our first match a n is 159 00:07:27,300 --> 00:07:31,680 already in the symbol table so we can 160 00:07:29,160 --> 00:07:34,919 look ahead to a there's no a and a 161 00:07:31,680 --> 00:07:37,199 symbol but now we're outputting 162 00:07:34,919 --> 00:07:40,160 essentially a two character symbol 163 00:07:37,199 --> 00:07:42,960 versus just a one character symbol 164 00:07:40,160 --> 00:07:46,860 there's no we're looking ahead through 165 00:07:42,960 --> 00:07:50,460 the space so there's no a space symbol 166 00:07:46,860 --> 00:07:53,039 so we add that there's no space b symbol 167 00:07:50,460 --> 00:07:55,860 so we add that but now obviously we're 168 00:07:53,039 --> 00:07:59,880 going to start to get these larger 169 00:07:55,860 --> 00:08:02,280 symbols building up out of 170 00:07:59,880 --> 00:08:06,060 symbols that we've added to the symbol 171 00:08:02,280 --> 00:08:08,940 Table after we started compressing we've 172 00:08:06,060 --> 00:08:11,340 got an N A here and that produces the 173 00:08:08,940 --> 00:08:15,180 Nan because we're looking at forward to 174 00:08:11,340 --> 00:08:19,259 the Outlook ahead is to the N symbol but 175 00:08:15,180 --> 00:08:20,720 then we can write n a again and we know 176 00:08:19,259 --> 00:08:24,539 that the 177 00:08:20,720 --> 00:08:28,819 we're looking forward to in a space the 178 00:08:24,539 --> 00:08:32,039 decompressor here basically starts with 179 00:08:28,819 --> 00:08:35,159 the B symbol and it it knows that it's 180 00:08:32,039 --> 00:08:37,380 going to write a b something symbol into 181 00:08:35,159 --> 00:08:40,919 the table at some point when it gets the 182 00:08:37,380 --> 00:08:44,099 a symbol it knows that the a there was 183 00:08:40,919 --> 00:08:47,160 the first character of that symbol that 184 00:08:44,099 --> 00:08:48,959 it needed to write and so the 185 00:08:47,160 --> 00:08:51,420 decompressor and the compressor can work 186 00:08:48,959 --> 00:08:54,600 in lockstep building up this same symbol 187 00:08:51,420 --> 00:08:56,240 table so that's the shared knowledge and 188 00:08:54,600 --> 00:08:59,640 we start getting 189 00:08:56,240 --> 00:09:01,279 encoding that final banana 190 00:08:59,640 --> 00:09:05,760 um 191 00:09:01,279 --> 00:09:09,080 we and we're outputting symbols that are 192 00:09:05,760 --> 00:09:12,839 now three characters in length 193 00:09:09,080 --> 00:09:14,399 so and we finally basically finish up 194 00:09:12,839 --> 00:09:17,100 our input and I've just said okay 195 00:09:14,399 --> 00:09:19,140 there's no more work to be done 196 00:09:17,100 --> 00:09:21,060 so how does that compare with our 197 00:09:19,140 --> 00:09:23,519 original data 198 00:09:21,060 --> 00:09:27,660 well we started with 199 00:09:23,519 --> 00:09:31,339 four six byte words and three spaces so 200 00:09:27,660 --> 00:09:34,500 we've got 27 characters 327 bytes 201 00:09:31,339 --> 00:09:35,760 we have those symbols that we're needing 202 00:09:34,500 --> 00:09:41,279 to Output 203 00:09:35,760 --> 00:09:45,779 but in the original lzw algorithm we 204 00:09:41,279 --> 00:09:49,200 have a 12 bit a 4096 entry symbol table 205 00:09:45,779 --> 00:09:52,080 and so each of those is a 12 bit number 206 00:09:49,200 --> 00:09:53,820 so we multiply by 1.5 to get 21 bytes 207 00:09:52,080 --> 00:09:55,980 but we've still managed to get some 208 00:09:53,820 --> 00:09:58,820 compression even in this relatively 209 00:09:55,980 --> 00:09:58,820 short example 210 00:09:58,860 --> 00:10:04,019 so let's have a look at what else are W 211 00:10:02,339 --> 00:10:06,660 is giving us and what we could maybe 212 00:10:04,019 --> 00:10:10,260 improve 213 00:10:06,660 --> 00:10:13,620 we have this incremental adaptation 214 00:10:10,260 --> 00:10:16,260 which is good because as the symbol as 215 00:10:13,620 --> 00:10:19,440 the input changes you get new symbols 216 00:10:16,260 --> 00:10:21,200 you start getting longer symbols being 217 00:10:19,440 --> 00:10:23,540 made up 218 00:10:21,200 --> 00:10:26,459 but there 219 00:10:23,540 --> 00:10:28,200 we're only adding one more character 220 00:10:26,459 --> 00:10:31,560 each time and I think it would be good 221 00:10:28,200 --> 00:10:35,580 if we can find a way to expand that 222 00:10:31,560 --> 00:10:37,200 to to increase the the size of the 223 00:10:35,580 --> 00:10:41,279 symbols we're putting into the symbol 224 00:10:37,200 --> 00:10:44,940 table sort of faster than that other 225 00:10:41,279 --> 00:10:47,579 problem is that when we're writing data 226 00:10:44,940 --> 00:10:49,860 out initially 227 00:10:47,579 --> 00:10:54,300 um we only actually need eight bytes 228 00:10:49,860 --> 00:10:55,019 sorry eight bits but we're writing 12. 229 00:10:54,300 --> 00:10:58,339 um 230 00:10:55,019 --> 00:11:04,019 and that's a that's a big overhead right 231 00:10:58,339 --> 00:11:09,060 we're also Limited in this 4096 entry 232 00:11:04,019 --> 00:11:11,760 symbol table uh when we reach the 4096 233 00:11:09,060 --> 00:11:14,820 symbol we have to literally say forget 234 00:11:11,760 --> 00:11:19,339 everything you ever know about the input 235 00:11:14,820 --> 00:11:19,339 stream and start again from scratch 236 00:11:20,100 --> 00:11:26,839 despite all this uh this the lzw 237 00:11:24,079 --> 00:11:30,779 algorithm is actually used in the GIF 238 00:11:26,839 --> 00:11:32,880 187a and the GIF 89A standards which is 239 00:11:30,779 --> 00:11:36,079 where I need to mention 240 00:11:32,880 --> 00:11:38,880 the patents 241 00:11:36,079 --> 00:11:41,040 ordinarily there's a role of ominous 242 00:11:38,880 --> 00:11:43,800 Thunder at that point but 243 00:11:41,040 --> 00:11:45,839 um what happened here was Terry Welch 244 00:11:43,800 --> 00:11:47,820 was working for Sperrys Barry got bought 245 00:11:45,839 --> 00:11:50,160 by Unisys Unisys decided to start 246 00:11:47,820 --> 00:11:53,779 enforcing the patents on people who 247 00:11:50,160 --> 00:11:58,200 wrote algorithms to produce GIF images 248 00:11:53,779 --> 00:12:01,260 uh the what was the internet then uh 249 00:11:58,200 --> 00:12:03,899 kind of didn't like this companies 250 00:12:01,260 --> 00:12:07,200 didn't like this uh companies fought 251 00:12:03,899 --> 00:12:09,779 back a bit it was unpopular Unisys 252 00:12:07,200 --> 00:12:12,540 decided not to and then I decided to do 253 00:12:09,779 --> 00:12:16,980 it again and it was unpopular again 254 00:12:12,540 --> 00:12:22,040 uh the patents expired in 2004 but by 255 00:12:16,980 --> 00:12:25,200 that time uh in sort of early in the 90s 256 00:12:22,040 --> 00:12:27,240 the graphics producing community in the 257 00:12:25,200 --> 00:12:29,220 web Community as it was at the time 258 00:12:27,240 --> 00:12:32,040 decided that the thing that we really 259 00:12:29,220 --> 00:12:35,579 needed was a better standard gift was 260 00:12:32,040 --> 00:12:39,000 also limited to 256 byte car so 256 261 00:12:35,579 --> 00:12:41,160 color simple color tables so we needed 262 00:12:39,000 --> 00:12:43,500 true color and better grayscale and 263 00:12:41,160 --> 00:12:46,380 things like that so they invent the 264 00:12:43,500 --> 00:12:47,899 graphics Community invented PNG which in 265 00:12:46,380 --> 00:12:51,240 which used 266 00:12:47,899 --> 00:12:55,019 lz77 and zedlib to do its compression 267 00:12:51,240 --> 00:12:57,019 which was patent well patent free 268 00:12:55,019 --> 00:13:00,720 so 269 00:12:57,019 --> 00:13:03,480 that's kind of why I think that's the 270 00:13:00,720 --> 00:13:06,839 other reason why I think uh Elder W 271 00:13:03,480 --> 00:13:09,720 doesn't get as much scrutiny these days 272 00:13:06,839 --> 00:13:13,800 for what we could do to improve it as 273 00:13:09,720 --> 00:13:17,880 maybe uh lz77 has 274 00:13:13,800 --> 00:13:21,959 so what can we do to improve lzw 275 00:13:17,880 --> 00:13:24,240 well I reckon firstly we can find a more 276 00:13:21,959 --> 00:13:29,639 efficient way to write 277 00:13:24,240 --> 00:13:32,040 the initial data that is starting to add 278 00:13:29,639 --> 00:13:33,480 to our symbol table 279 00:13:32,040 --> 00:13:36,180 secondly 280 00:13:33,480 --> 00:13:39,240 the the one of the big problems is you 281 00:13:36,180 --> 00:13:44,339 can probably imagine if you're writing 282 00:13:39,240 --> 00:13:47,180 just say text file you're only using 96 283 00:13:44,339 --> 00:13:50,700 of the 256 characters 284 00:13:47,180 --> 00:13:53,040 and that's a whole bunch of symbol table 285 00:13:50,700 --> 00:13:54,779 that you have effectively wasted because 286 00:13:53,040 --> 00:13:58,380 you're never using it 287 00:13:54,779 --> 00:14:00,899 so we should start with as as little as 288 00:13:58,380 --> 00:14:01,800 possible in the symbol table 289 00:14:00,899 --> 00:14:05,279 um 290 00:14:01,800 --> 00:14:08,579 we also want to try to use variable 291 00:14:05,279 --> 00:14:12,120 length coding so one thing that an 292 00:14:08,579 --> 00:14:15,060 improvement on Terry uh of the W that I 293 00:14:12,120 --> 00:14:18,180 think Terry watch suggested was that if 294 00:14:15,060 --> 00:14:22,019 we're output if we're less than symbol 295 00:14:18,180 --> 00:14:23,160 512 we only out need to Output nine bit 296 00:14:22,019 --> 00:14:25,920 symbols 297 00:14:23,160 --> 00:14:28,079 we have already more than eight bit 298 00:14:25,920 --> 00:14:30,240 eight bits of symbol in the simple table 299 00:14:28,079 --> 00:14:32,420 but less than nine bits so we can always 300 00:14:30,240 --> 00:14:37,139 represent those with nine bit numbers 301 00:14:32,420 --> 00:14:39,060 when the compressor adds the 512 symbol 302 00:14:37,139 --> 00:14:41,519 it goes up to 10 bits when the 303 00:14:39,060 --> 00:14:44,639 decompressor recognizes that it is on 304 00:14:41,519 --> 00:14:46,680 the FI the 512th symbol it knows the 305 00:14:44,639 --> 00:14:48,860 next symbol is going to be 10 bits so 306 00:14:46,680 --> 00:14:52,860 they're synchronized still so we can 307 00:14:48,860 --> 00:14:55,800 shorten how much data work to go into 308 00:14:52,860 --> 00:14:59,639 um send per symbol 309 00:14:55,800 --> 00:15:01,079 I also want to look at ways that we can 310 00:14:59,639 --> 00:15:04,079 learn 311 00:15:01,079 --> 00:15:05,760 the patterns in the data faster and this 312 00:15:04,079 --> 00:15:09,420 is essential essentially what all of the 313 00:15:05,760 --> 00:15:09,420 improvements in 314 00:15:10,339 --> 00:15:17,760 lz77 over the years have done they have 315 00:15:13,560 --> 00:15:21,600 found ways of getting finding better 316 00:15:17,760 --> 00:15:23,720 matches which also means we need a much 317 00:15:21,600 --> 00:15:27,060 larger symbol table I mean 318 00:15:23,720 --> 00:15:29,660 lzma is typically working with up to 319 00:15:27,060 --> 00:15:33,720 four gigabytes of memory 320 00:15:29,660 --> 00:15:36,839 4096 symbols is just not enough right 321 00:15:33,720 --> 00:15:41,639 the last thing that I want to kind of do 322 00:15:36,839 --> 00:15:44,100 in my thoughts about improving uh lzw is 323 00:15:41,639 --> 00:15:45,720 still I wanted to stay true to that idea 324 00:15:44,100 --> 00:15:48,540 that we are adding symbols to a 325 00:15:45,720 --> 00:15:52,860 dictionary to a table 326 00:15:48,540 --> 00:15:55,380 um from how we've built up the input 327 00:15:52,860 --> 00:15:58,260 it would be nice if we could say okay 328 00:15:55,380 --> 00:16:00,600 this symbol occurred here like you know 329 00:15:58,260 --> 00:16:03,000 you're reading a Python program it uses 330 00:16:00,600 --> 00:16:07,019 the word import so we're going to use 331 00:16:03,000 --> 00:16:09,180 we're going to say directly write up the 332 00:16:07,019 --> 00:16:11,339 import the word import is going to be a 333 00:16:09,180 --> 00:16:12,440 symbol that you're going to need in the 334 00:16:11,339 --> 00:16:15,120 future 335 00:16:12,440 --> 00:16:18,060 and the problem is that that starts 336 00:16:15,120 --> 00:16:20,279 looking awfully a lot a lot like lz77 337 00:16:18,060 --> 00:16:22,740 where we said this amount of data was 338 00:16:20,279 --> 00:16:25,560 repeated and I feel like 339 00:16:22,740 --> 00:16:28,500 that's kind of a crossover here I'd like 340 00:16:25,560 --> 00:16:30,300 to stay true to that lzw idea 341 00:16:28,500 --> 00:16:33,779 Okay so 342 00:16:30,300 --> 00:16:35,040 uh the LZ PW algorithm that I've been 343 00:16:33,779 --> 00:16:38,519 working on 344 00:16:35,040 --> 00:16:42,120 its main I think it's main Improvement 345 00:16:38,519 --> 00:16:44,220 is that we alternate lists of literal 346 00:16:42,120 --> 00:16:47,339 characters 347 00:16:44,220 --> 00:16:50,040 um and then symbols we have a different 348 00:16:47,339 --> 00:16:52,380 way of writing each of those and we will 349 00:16:50,040 --> 00:16:55,920 basically say you will write as many 350 00:16:52,380 --> 00:16:58,500 literal characters as you can until and 351 00:16:55,920 --> 00:17:01,019 until there's a symbol that you you can 352 00:16:58,500 --> 00:17:03,180 match and then you write symbols for as 353 00:17:01,019 --> 00:17:05,900 many symbols as you can match until you 354 00:17:03,180 --> 00:17:08,579 have to go back to literals 355 00:17:05,900 --> 00:17:11,880 the symbols we're going to form similar 356 00:17:08,579 --> 00:17:14,220 to the way that lzw forms them so that 357 00:17:11,880 --> 00:17:17,880 when we read the literal string a b c d 358 00:17:14,220 --> 00:17:20,179 we'll form these three symbols A B B C 359 00:17:17,880 --> 00:17:23,520 and C D 360 00:17:20,179 --> 00:17:26,819 the at the moment the way the code works 361 00:17:23,520 --> 00:17:28,679 is that when we read the symbols a b and 362 00:17:26,819 --> 00:17:33,720 c d we're also going to create the 363 00:17:28,679 --> 00:17:36,059 symbol ABC now I that what I'm trying to 364 00:17:33,720 --> 00:17:39,480 experiment with and I literally had a 365 00:17:36,059 --> 00:17:43,260 brainwave as I was at the end of 366 00:17:39,480 --> 00:17:46,860 um Seb Chan's keynote I don't know why 367 00:17:43,260 --> 00:17:51,600 then but I realized how I can get it to 368 00:17:46,860 --> 00:17:54,299 not only add ABC but in fact at ABCD as 369 00:17:51,600 --> 00:17:58,200 a symbol and if we can do that then now 370 00:17:54,299 --> 00:17:59,880 we're going to expand our the symbols in 371 00:17:58,200 --> 00:18:03,260 our symbol table are going to start 372 00:17:59,880 --> 00:18:03,260 expanding much quicker 373 00:18:03,299 --> 00:18:08,760 um we're also going to store those 374 00:18:06,840 --> 00:18:10,200 symbols obviously in a separate table 375 00:18:08,760 --> 00:18:11,760 and we're going to start numbering from 376 00:18:10,200 --> 00:18:14,360 zero we're going to literally start with 377 00:18:11,760 --> 00:18:17,280 zero symbols in the table 378 00:18:14,360 --> 00:18:18,780 and each of these lists literals or 379 00:18:17,280 --> 00:18:20,400 symbols is going to be prefixed by 380 00:18:18,780 --> 00:18:22,799 lengths and lengths so we're going to be 381 00:18:20,400 --> 00:18:25,020 written in Fibonacci coding 382 00:18:22,799 --> 00:18:27,179 what's Fibonacci coding I hear you cry 383 00:18:25,020 --> 00:18:27,900 well 384 00:18:27,179 --> 00:18:30,240 um 385 00:18:27,900 --> 00:18:33,660 this is a little bit of a diversion but 386 00:18:30,240 --> 00:18:35,640 number bases can be fun we all use base 387 00:18:33,660 --> 00:18:37,740 10 coding where you read from right to 388 00:18:35,640 --> 00:18:40,100 left you multiply by the bay the value 389 00:18:37,740 --> 00:18:43,679 of that column in the base 390 00:18:40,100 --> 00:18:46,080 add the the value multiplied by that to 391 00:18:43,679 --> 00:18:48,179 produce the number and we've obviously 392 00:18:46,080 --> 00:18:52,260 familiar with this with doing this with 393 00:18:48,179 --> 00:18:55,260 binary coding where we just add the 394 00:18:52,260 --> 00:18:58,460 values of each of those bases where the 395 00:18:55,260 --> 00:19:00,419 base increases right from right to left 396 00:18:58,460 --> 00:19:01,980 Fibonacci closing as you've probably 397 00:19:00,419 --> 00:19:05,340 guessed now 398 00:19:01,980 --> 00:19:09,660 um You the base values are the Fibonacci 399 00:19:05,340 --> 00:19:14,220 numbers so that number there is 34 plus 400 00:19:09,660 --> 00:19:15,780 1 plus 8 plus 3 plus 2. 401 00:19:14,220 --> 00:19:17,700 and you've probably figured out here 402 00:19:15,780 --> 00:19:19,919 hang on three plus two is five that's 403 00:19:17,700 --> 00:19:22,500 also a Fibonacci number so we should 404 00:19:19,919 --> 00:19:24,780 probably write that as 405 00:19:22,500 --> 00:19:27,600 um eight plus five oh that's another 406 00:19:24,780 --> 00:19:29,100 Fibonacci number that's 13. 407 00:19:27,600 --> 00:19:33,299 so 408 00:19:29,100 --> 00:19:35,280 here we can we can observe that if 409 00:19:33,299 --> 00:19:38,220 there's a kind of canonical way to write 410 00:19:35,280 --> 00:19:40,860 Fibonacci numbers you always you can 411 00:19:38,220 --> 00:19:43,020 always write to them using the highest 412 00:19:40,860 --> 00:19:47,400 Fibonacci number possible you'd never 413 00:19:43,020 --> 00:19:49,440 need to write two ones side by side 414 00:19:47,400 --> 00:19:51,419 and so the other clever thing about 415 00:19:49,440 --> 00:19:53,460 Fibonacci coding is that we're actually 416 00:19:51,419 --> 00:19:56,280 going to turn that around and write it 417 00:19:53,460 --> 00:19:59,640 left to right increasing 418 00:19:56,280 --> 00:20:01,799 and then add a one bit on the end and 419 00:19:59,640 --> 00:20:03,539 when we see two one bits we know the 420 00:20:01,799 --> 00:20:05,840 number has terminated 421 00:20:03,539 --> 00:20:09,660 okay 422 00:20:05,840 --> 00:20:14,460 so Fibonacci number Fibonacci coding is 423 00:20:09,660 --> 00:20:16,380 self-terminating it also in some ways is 424 00:20:14,460 --> 00:20:17,940 resistant to bit errors which is kind of 425 00:20:16,380 --> 00:20:20,700 nice 426 00:20:17,940 --> 00:20:23,820 it's easy to decode because we simply go 427 00:20:20,700 --> 00:20:26,220 up the Fibonacci sequence it's a greedy 428 00:20:23,820 --> 00:20:29,360 encoding algorithm which is as soon as 429 00:20:26,220 --> 00:20:31,980 you know you can work out the 430 00:20:29,360 --> 00:20:34,860 Fibonacci number that's 431 00:20:31,980 --> 00:20:37,020 just below or equal to your current 432 00:20:34,860 --> 00:20:38,760 number you subtract that you write a 1 433 00:20:37,020 --> 00:20:42,120 and you write zeros until you get the 434 00:20:38,760 --> 00:20:45,000 next Fibonacci number that is is uh 435 00:20:42,120 --> 00:20:47,640 greater than that and 436 00:20:45,000 --> 00:20:49,559 um so just less than that or something 437 00:20:47,640 --> 00:20:52,140 like that yeah 438 00:20:49,559 --> 00:20:55,200 because we're going down it's less than 439 00:20:52,140 --> 00:20:56,880 um and you keep on doing that until you 440 00:20:55,200 --> 00:20:58,500 run out of zeros and then chuckle one on 441 00:20:56,880 --> 00:21:00,799 the front you reverse it around and 442 00:20:58,500 --> 00:21:04,380 there's your encoding 443 00:21:00,799 --> 00:21:06,299 there are schemes of Fibonacci coding 444 00:21:04,380 --> 00:21:08,400 that can cope with writing negative 445 00:21:06,299 --> 00:21:10,620 numbers and zeros and things like that 446 00:21:08,400 --> 00:21:12,900 I'm just we only need to use the 447 00:21:10,620 --> 00:21:15,360 simplest here because we're encoding 448 00:21:12,900 --> 00:21:17,220 lengths and we never need to write a 449 00:21:15,360 --> 00:21:20,039 zero length thing 450 00:21:17,220 --> 00:21:22,679 okay so that's enough Theory 451 00:21:20,039 --> 00:21:25,860 um the last thing I will say about the 452 00:21:22,679 --> 00:21:27,840 Fibonacci coding is if you look at the 453 00:21:25,860 --> 00:21:31,140 other variable length bitcoatings like 454 00:21:27,840 --> 00:21:35,220 inlias gamma coding and Lys Delta coding 455 00:21:31,140 --> 00:21:39,059 they are no more efficient than this but 456 00:21:35,220 --> 00:21:44,039 they require a lot more logic to decode 457 00:21:39,059 --> 00:21:47,039 um and you know we're really kind of at 458 00:21:44,039 --> 00:21:49,679 that theoretical limit we could write we 459 00:21:47,039 --> 00:21:52,440 could have Huffman tables but in order 460 00:21:49,679 --> 00:21:54,240 to have a Huffman table you need to have 461 00:21:52,440 --> 00:21:56,340 pre-encoded it or have some way of 462 00:21:54,240 --> 00:21:59,340 transmitting it and that adds more bytes 463 00:21:56,340 --> 00:22:00,600 this is kind of efficient right okay so 464 00:21:59,340 --> 00:22:02,820 links are going to be stored in 465 00:22:00,600 --> 00:22:05,460 Fibonacci coding 466 00:22:02,820 --> 00:22:09,299 so let's have a look at what we do with 467 00:22:05,460 --> 00:22:11,940 our example in coding this this again 468 00:22:09,299 --> 00:22:14,580 so we're going to start out with the 469 00:22:11,940 --> 00:22:16,320 literal ban we never start in simple 470 00:22:14,580 --> 00:22:17,220 coding we always start in literal coding 471 00:22:16,320 --> 00:22:18,419 because we have nothing in the simple 472 00:22:17,220 --> 00:22:20,640 table 473 00:22:18,419 --> 00:22:23,400 and we write we add we're looking 474 00:22:20,640 --> 00:22:27,059 forward to the a so we can write the 475 00:22:23,400 --> 00:22:29,580 three symbols b a a n and n a 476 00:22:27,059 --> 00:22:32,039 and at the point where we're looking at 477 00:22:29,580 --> 00:22:34,740 the A and we're looking forward to the N 478 00:22:32,039 --> 00:22:37,320 we can say oh that's a symbol entry we 479 00:22:34,740 --> 00:22:40,740 know there might be more than some more 480 00:22:37,320 --> 00:22:44,400 symbols but we've got a n we're looking 481 00:22:40,740 --> 00:22:47,940 forward to a something 482 00:22:44,400 --> 00:22:50,100 um and here we need to know with it's uh 483 00:22:47,940 --> 00:22:54,059 useful or it's sort of important in the 484 00:22:50,100 --> 00:22:56,340 algorithm to know that a is a prefix of 485 00:22:54,059 --> 00:23:01,559 a symbol somewhere we've seen it before 486 00:22:56,340 --> 00:23:04,860 in our input if we saw a q there 487 00:23:01,559 --> 00:23:08,100 um then we'd know no symbol starts with 488 00:23:04,860 --> 00:23:11,880 Q so we it's impossible for us to have a 489 00:23:08,100 --> 00:23:13,500 symbol with q that had is Q something in 490 00:23:11,880 --> 00:23:15,900 it and therefore we can just go straight 491 00:23:13,500 --> 00:23:18,780 back to literal coding but we might be 492 00:23:15,900 --> 00:23:20,820 encoding something here that is a symbol 493 00:23:18,780 --> 00:23:23,460 because it starts with a but then we 494 00:23:20,820 --> 00:23:25,559 look forward to the space and say no 495 00:23:23,460 --> 00:23:28,919 there's no a space symbol 496 00:23:25,559 --> 00:23:31,559 at that point we need to Output the A 497 00:23:28,919 --> 00:23:33,600 and the space so we've written two 498 00:23:31,559 --> 00:23:35,940 literals 499 00:23:33,600 --> 00:23:38,220 and we've we're looking forward to the 500 00:23:35,940 --> 00:23:42,179 capital B so we've added those two more 501 00:23:38,220 --> 00:23:43,159 symbols important to note here that we 502 00:23:42,179 --> 00:23:45,900 have 503 00:23:43,159 --> 00:23:47,840 more than four symbols in the table at 504 00:23:45,900 --> 00:23:50,820 this point this will be important later 505 00:23:47,840 --> 00:23:52,500 but now we can start writing symbols and 506 00:23:50,820 --> 00:23:56,520 obviously this is kind of going to 507 00:23:52,500 --> 00:23:59,880 follow that process that 508 00:23:56,520 --> 00:24:02,220 um or the W uses to construct symbols we 509 00:23:59,880 --> 00:24:04,740 keep on seeing symbols we add the next 510 00:24:02,220 --> 00:24:05,340 character there 511 00:24:04,740 --> 00:24:07,740 um 512 00:24:05,340 --> 00:24:10,020 and we keep on building up this list of 513 00:24:07,740 --> 00:24:13,980 symbols that we're going to Output until 514 00:24:10,020 --> 00:24:15,720 finally we have written the whole input 515 00:24:13,980 --> 00:24:18,360 Okay so 516 00:24:15,720 --> 00:24:19,140 how does this do for 517 00:24:18,360 --> 00:24:21,900 um 518 00:24:19,140 --> 00:24:24,480 compat a size comparison compared to the 519 00:24:21,900 --> 00:24:26,340 original input well we're going to need 520 00:24:24,480 --> 00:24:27,380 to work in bits here as you've probably 521 00:24:26,340 --> 00:24:30,059 guessed 522 00:24:27,380 --> 00:24:33,919 and so we need to multiply it by eight 523 00:24:30,059 --> 00:24:36,659 so we've got 250 216 bits in the input 524 00:24:33,919 --> 00:24:39,419 so we're going to write a string of 525 00:24:36,659 --> 00:24:42,059 length 3 which in Fibonacci coding is 526 00:24:39,419 --> 00:24:46,760 zero zero one one and then we can write 527 00:24:42,059 --> 00:24:51,840 the the initial 8 bit characters 528 00:24:46,760 --> 00:24:55,140 then we write a a symbol run of length 529 00:24:51,840 --> 00:24:57,840 one that is 1 1 and then the symbol 530 00:24:55,140 --> 00:25:01,500 number we needed was 531 00:24:57,840 --> 00:25:03,780 001 there were five b five symbols in 532 00:25:01,500 --> 00:25:06,120 the table at that point 533 00:25:03,780 --> 00:25:10,380 Which is less than eight so we needed at 534 00:25:06,120 --> 00:25:11,039 least three bits to encode that number 535 00:25:10,380 --> 00:25:13,200 um 536 00:25:11,039 --> 00:25:14,120 so we've written a total of five bits 537 00:25:13,200 --> 00:25:18,059 there 538 00:25:14,120 --> 00:25:22,559 we then write two characters 539 00:25:18,059 --> 00:25:25,740 um and that takes 19 bits and then this 540 00:25:22,559 --> 00:25:26,580 sequence we've got less than 16 541 00:25:25,740 --> 00:25:28,860 um 542 00:25:26,580 --> 00:25:32,779 symbols in the symbol table when we're 543 00:25:28,860 --> 00:25:34,640 writing this so we can use 544 00:25:32,779 --> 00:25:37,400 4-bit 545 00:25:34,640 --> 00:25:41,760 4-bit symbol numbers 546 00:25:37,400 --> 00:25:47,460 prefixed by the length in six bits is 38 547 00:25:41,760 --> 00:25:50,159 bits for a total of 90 by 90 bits 548 00:25:47,460 --> 00:25:51,960 which I have I I think that's pretty 549 00:25:50,159 --> 00:25:52,740 cool right 550 00:25:51,960 --> 00:25:55,080 um 551 00:25:52,740 --> 00:25:59,279 how does that compare if we were doing 552 00:25:55,080 --> 00:26:02,220 that in lzw well we had 14 symbols and 553 00:25:59,279 --> 00:26:05,880 they're 12 bits each and so they're 168 554 00:26:02,220 --> 00:26:09,419 bits even if we go down to nine bits we 555 00:26:05,880 --> 00:26:11,640 still only get 126 bits out of lzw so 556 00:26:09,419 --> 00:26:13,500 this is actually Improvement I I feel 557 00:26:11,640 --> 00:26:14,580 like I'm I'm on the right track here 558 00:26:13,500 --> 00:26:16,080 somewhere 559 00:26:14,580 --> 00:26:18,840 okay 560 00:26:16,080 --> 00:26:23,159 so how does this do with more real world 561 00:26:18,840 --> 00:26:25,980 data so I ran it I ran this on its own 562 00:26:23,159 --> 00:26:28,580 source code and that got a ratio of 563 00:26:25,980 --> 00:26:31,740 about 2.56 to 1 564 00:26:28,580 --> 00:26:35,340 user shared words compressed down 565 00:26:31,740 --> 00:26:40,880 slightly less less okay how does that 566 00:26:35,340 --> 00:26:40,880 compare to something as old as gzip 567 00:26:41,880 --> 00:26:45,360 okay well we've still got work to do 568 00:26:44,039 --> 00:26:48,299 then 569 00:26:45,360 --> 00:26:51,720 um it's not surprising because I think 570 00:26:48,299 --> 00:26:53,700 there are a lot of improvements we that 571 00:26:51,720 --> 00:26:56,400 we can get but I would just want a 572 00:26:53,700 --> 00:27:00,600 couple cover a couple of things in the 573 00:26:56,400 --> 00:27:04,679 code uh which is at my GitHub address 574 00:27:00,600 --> 00:27:06,080 um before I sort of summarize on what 575 00:27:04,679 --> 00:27:08,820 um 576 00:27:06,080 --> 00:27:11,460 uh what what the current state of it is 577 00:27:08,820 --> 00:27:15,059 and where the future is going 578 00:27:11,460 --> 00:27:17,400 um so the code I'll firstly say is still 579 00:27:15,059 --> 00:27:21,480 extremely experimental there's a lot of 580 00:27:17,400 --> 00:27:23,580 ideas that are tried out that are that 581 00:27:21,480 --> 00:27:24,620 I've got you know 582 00:27:23,580 --> 00:27:28,260 um 583 00:27:24,620 --> 00:27:31,200 uh flags on to say if if we're using 584 00:27:28,260 --> 00:27:34,020 this this particular encoding try this 585 00:27:31,200 --> 00:27:36,299 that sort of stuff I'm trying to keep as 586 00:27:34,020 --> 00:27:39,299 many ideas in there because I want to 587 00:27:36,299 --> 00:27:41,039 find out which work what works and what 588 00:27:39,299 --> 00:27:45,360 what doesn't 589 00:27:41,039 --> 00:27:47,340 um and this is still early code so I I'm 590 00:27:45,360 --> 00:27:50,039 really still not sure which of these 591 00:27:47,340 --> 00:27:51,120 ideas might prove to be useful as as 592 00:27:50,039 --> 00:27:55,020 you'll see 593 00:27:51,120 --> 00:27:57,120 the second is uh people who know me will 594 00:27:55,020 --> 00:27:58,980 kind of recognize 595 00:27:57,120 --> 00:28:01,559 um my sense of humor here but there's 596 00:27:58,980 --> 00:28:04,559 the thing called the mabulate flag now 597 00:28:01,559 --> 00:28:08,220 mabulate is a word I invented it's what 598 00:28:04,559 --> 00:28:10,980 you do say uh if you've got a Rubik's 599 00:28:08,220 --> 00:28:13,559 Cube and you know that uh the way to 600 00:28:10,980 --> 00:28:16,740 solve it is to get all of the colors on 601 00:28:13,559 --> 00:28:20,580 each face all together and the way to do 602 00:28:16,740 --> 00:28:21,840 that is by rotating the faces but you 603 00:28:20,580 --> 00:28:23,640 don't know 604 00:28:21,840 --> 00:28:27,240 because we haven't learned yet the the 605 00:28:23,640 --> 00:28:29,640 specific algorithms to move things 606 00:28:27,240 --> 00:28:31,559 around but you think that if you keep on 607 00:28:29,640 --> 00:28:34,200 sort of playing with it something might 608 00:28:31,559 --> 00:28:36,059 end up in the right shape and you know 609 00:28:34,200 --> 00:28:37,440 and colors might be able to move 610 00:28:36,059 --> 00:28:39,600 together and you'll kind of gradually 611 00:28:37,440 --> 00:28:40,919 learn your way around but you're not 612 00:28:39,600 --> 00:28:42,360 quite sure what you're doing so that's 613 00:28:40,919 --> 00:28:46,740 mabulating 614 00:28:42,360 --> 00:28:49,980 so the mabulate flag works like this it 615 00:28:46,740 --> 00:28:52,620 is basically that process that we did 616 00:28:49,980 --> 00:28:57,360 we're where we're looking at whether 617 00:28:52,620 --> 00:28:59,340 this symbol is a prefix whether we've 618 00:28:57,360 --> 00:29:01,260 possibly got a another symbol because 619 00:28:59,340 --> 00:29:03,900 we've got the prefix 620 00:29:01,260 --> 00:29:06,240 If the symbol hasn't matched and the 621 00:29:03,900 --> 00:29:09,900 last character is a prefix then we set 622 00:29:06,240 --> 00:29:11,940 that flag and we that's to remember the 623 00:29:09,900 --> 00:29:13,620 character that we're currently on 624 00:29:11,940 --> 00:29:15,900 because it could be the start of the new 625 00:29:13,620 --> 00:29:19,679 the new symbol 626 00:29:15,900 --> 00:29:21,120 If the symbol hasn't matched and the 627 00:29:19,679 --> 00:29:23,159 last character 628 00:29:21,120 --> 00:29:25,980 or the last the symbol that we're 629 00:29:23,159 --> 00:29:32,100 looking for is not a symbol 630 00:29:25,980 --> 00:29:34,980 then we go we now need to go to literals 631 00:29:32,100 --> 00:29:36,840 and it's the nebula flag is set at that 632 00:29:34,980 --> 00:29:38,820 point it means that we previously looked 633 00:29:36,840 --> 00:29:41,940 at a character which might have been a 634 00:29:38,820 --> 00:29:43,440 prefix but turns out not to be and at 635 00:29:41,940 --> 00:29:46,260 that point we need to Output that 636 00:29:43,440 --> 00:29:50,100 character as well so it's a bit 637 00:29:46,260 --> 00:29:51,899 complicated and as I said I I scratched 638 00:29:50,100 --> 00:29:53,700 my head for a long time trying to find 639 00:29:51,899 --> 00:29:55,260 the right word for this and I just 640 00:29:53,700 --> 00:29:58,380 couldn't so I used mabulate because 641 00:29:55,260 --> 00:30:00,539 that's me okay so the other thing you 642 00:29:58,380 --> 00:30:02,760 need to know about the code is that 643 00:30:00,539 --> 00:30:06,419 there's this array this thing called 644 00:30:02,760 --> 00:30:08,640 variable called to write that if we're 645 00:30:06,419 --> 00:30:10,140 in literal mode is a string of the 646 00:30:08,640 --> 00:30:12,840 physical characters we're going to write 647 00:30:10,140 --> 00:30:16,260 out if it if we're in symbol mode that 648 00:30:12,840 --> 00:30:19,080 is a list of strings so don't enforce 649 00:30:16,260 --> 00:30:20,220 type checking on that because it's going 650 00:30:19,080 --> 00:30:25,500 to change 651 00:30:20,220 --> 00:30:27,299 okay so where are we uh with LZ PW the 652 00:30:25,500 --> 00:30:29,460 first thing I do need to share with you 653 00:30:27,299 --> 00:30:32,820 is the failures 654 00:30:29,460 --> 00:30:36,240 um the one really interesting thing I 655 00:30:32,820 --> 00:30:38,399 discovered almost by accident uh was 656 00:30:36,240 --> 00:30:42,059 that I had started on a totally wrong 657 00:30:38,399 --> 00:30:45,899 premise I thought that it would be more 658 00:30:42,059 --> 00:30:49,620 efficient to write the symbols in the 659 00:30:45,899 --> 00:30:52,799 symbol table using Fibonacci codes and 660 00:30:49,620 --> 00:30:55,020 if we wrote those in uh a if we used 661 00:30:52,799 --> 00:30:57,419 like a move to front algorithm or we 662 00:30:55,020 --> 00:31:01,980 favored the least 663 00:30:57,419 --> 00:31:03,960 so the yeah the most recent symbol uh 664 00:31:01,980 --> 00:31:06,500 the most recent symbol is the most 665 00:31:03,960 --> 00:31:09,600 complex and it's the most the one most 666 00:31:06,500 --> 00:31:14,520 recently found in the input so if we can 667 00:31:09,600 --> 00:31:15,960 match that using fewer bits then we we 668 00:31:14,520 --> 00:31:18,059 should actually 669 00:31:15,960 --> 00:31:21,059 like this is how I was thinking which 670 00:31:18,059 --> 00:31:23,880 that should give us better compression 671 00:31:21,059 --> 00:31:26,940 than if we're just using a fixed width 672 00:31:23,880 --> 00:31:28,919 number of bits based on the symbol table 673 00:31:26,940 --> 00:31:31,200 size at that point in time 674 00:31:28,919 --> 00:31:33,840 and I tried this out and I had spent 675 00:31:31,200 --> 00:31:35,580 months working on that and then I was 676 00:31:33,840 --> 00:31:37,440 going through tidying up the code trying 677 00:31:35,580 --> 00:31:38,840 to make it vaguely presentable for this 678 00:31:37,440 --> 00:31:42,720 conference 679 00:31:38,840 --> 00:31:46,080 and I uh one of the things that uh was 680 00:31:42,720 --> 00:31:48,539 set in there was the ability to use 681 00:31:46,080 --> 00:31:51,320 variable length Fibonacci coding for 682 00:31:48,539 --> 00:31:57,840 symbols or still use fixed length coding 683 00:31:51,320 --> 00:31:59,460 and in one of my tests I ran it and left 684 00:31:57,840 --> 00:32:02,820 the flag off 685 00:31:59,460 --> 00:32:04,260 so it was using fixed length symbols and 686 00:32:02,820 --> 00:32:07,500 it was shorter 687 00:32:04,260 --> 00:32:09,299 I'm like okay well that's that idea out 688 00:32:07,500 --> 00:32:11,520 the window why is that 689 00:32:09,299 --> 00:32:13,380 and then I realized 690 00:32:11,520 --> 00:32:16,080 that 691 00:32:13,380 --> 00:32:17,880 if you think about the idea of Fibonacci 692 00:32:16,080 --> 00:32:21,240 coding part of the reason of that final 693 00:32:17,880 --> 00:32:23,880 one is that it has to tell the decoder 694 00:32:21,240 --> 00:32:26,520 this is where you where you stop it 695 00:32:23,880 --> 00:32:29,580 needs there needs to be an explicit 696 00:32:26,520 --> 00:32:32,460 piece of information passed from the 697 00:32:29,580 --> 00:32:34,080 encoder to the decoder to say this is 698 00:32:32,460 --> 00:32:36,779 the length of that symbol 699 00:32:34,080 --> 00:32:40,140 but if we already know the symbol table 700 00:32:36,779 --> 00:32:42,360 size that's an implicit piece of data 701 00:32:40,140 --> 00:32:45,000 that both the encoder and the decoder 702 00:32:42,360 --> 00:32:47,399 know and therefore it doesn't need to be 703 00:32:45,000 --> 00:32:49,620 transmitted and there therefore we can 704 00:32:47,399 --> 00:32:51,779 actually use more of the the bit space 705 00:32:49,620 --> 00:32:54,899 to transmit information as opposed to 706 00:32:51,779 --> 00:32:57,120 avoiding writing two ones together 707 00:32:54,899 --> 00:33:00,299 so that was a discovery 708 00:32:57,120 --> 00:33:03,720 um but I'm glad I discovered it right 709 00:33:00,299 --> 00:33:06,600 um the other idea that I've had in there 710 00:33:03,720 --> 00:33:09,240 which the moment I pretty much removed 711 00:33:06,600 --> 00:33:10,500 but it's there because it you know you 712 00:33:09,240 --> 00:33:15,480 never know 713 00:33:10,500 --> 00:33:17,700 um is I the idea of having a header in 714 00:33:15,480 --> 00:33:20,640 front of literal blocks and in front of 715 00:33:17,700 --> 00:33:23,640 symbol blocks that would both indicate 716 00:33:20,640 --> 00:33:24,480 yes this is a little block or a symbol 717 00:33:23,640 --> 00:33:27,240 block 718 00:33:24,480 --> 00:33:29,880 and then might indicate some Flags about 719 00:33:27,240 --> 00:33:32,340 that I'll cover some of the reasons why 720 00:33:29,880 --> 00:33:34,380 I think this might be useful there is 721 00:33:32,340 --> 00:33:36,480 never a situation where you need to 722 00:33:34,380 --> 00:33:37,740 write two literal blocks one after the 723 00:33:36,480 --> 00:33:40,620 other 724 00:33:37,740 --> 00:33:42,720 because all you will always find it is 725 00:33:40,620 --> 00:33:43,799 more efficient to Simply join them 726 00:33:42,720 --> 00:33:46,140 together 727 00:33:43,799 --> 00:33:49,019 it's possible that we might need to 728 00:33:46,140 --> 00:33:51,539 write two symbol blocks one after the 729 00:33:49,019 --> 00:33:53,519 other and that would most likely be when 730 00:33:51,539 --> 00:33:56,039 we have we're just about to change 731 00:33:53,519 --> 00:33:58,799 symbol sizes 732 00:33:56,039 --> 00:34:01,559 and we want to write the all of the 733 00:33:58,799 --> 00:34:03,419 symbols currently in the buffer out in 734 00:34:01,559 --> 00:34:05,100 the shorter format and then move on to 735 00:34:03,419 --> 00:34:08,639 the longer format 736 00:34:05,100 --> 00:34:12,119 but I'm not sure yet if having an extra 737 00:34:08,639 --> 00:34:15,060 bit in the header of the symbol list 738 00:34:12,119 --> 00:34:17,820 is worth that overhead 739 00:34:15,060 --> 00:34:20,580 it we're going to experiment right okay 740 00:34:17,820 --> 00:34:23,879 so uh current features are basically 741 00:34:20,580 --> 00:34:26,339 that it does do efficient literal coding 742 00:34:23,879 --> 00:34:30,139 and it does do efficient variable coding 743 00:34:26,339 --> 00:34:33,599 of symbols and I'm really proud of that 744 00:34:30,139 --> 00:34:36,599 the future features one of the main ones 745 00:34:33,599 --> 00:34:38,879 that I'm working on is the ability to 746 00:34:36,599 --> 00:34:40,500 backtrack through how you're encoding 747 00:34:38,879 --> 00:34:43,260 the symbols at this point 748 00:34:40,500 --> 00:34:45,240 and the easiest way to demonstrate this 749 00:34:43,260 --> 00:34:47,940 I've found it's a bit of a boring string 750 00:34:45,240 --> 00:34:49,619 it's not as good as bananas but it is 751 00:34:47,940 --> 00:34:52,260 this 752 00:34:49,619 --> 00:34:55,080 so as we're reading through that to 753 00:34:52,260 --> 00:34:59,460 write it we write the first five 754 00:34:55,080 --> 00:35:02,640 characters the then have a b b c c d and 755 00:34:59,460 --> 00:35:05,520 d dot dot a 756 00:35:02,640 --> 00:35:09,599 then we can write a b c d and at that 757 00:35:05,520 --> 00:35:10,920 point we add the ABC symbol 758 00:35:09,599 --> 00:35:14,940 um 759 00:35:10,920 --> 00:35:18,420 we add we have to add a literal comma 760 00:35:14,940 --> 00:35:20,700 and then the symbol table has a b c d 761 00:35:18,420 --> 00:35:24,300 and the ABC symbol 762 00:35:20,700 --> 00:35:24,960 so we're about to read ABC 763 00:35:24,300 --> 00:35:29,460 um 764 00:35:24,960 --> 00:35:32,099 we get to ABC and we're looking at D we 765 00:35:29,460 --> 00:35:35,339 we're mapulating that we're looking 766 00:35:32,099 --> 00:35:37,980 ahead to C and there's no DC in the 767 00:35:35,339 --> 00:35:40,200 symbol table so we would have to write a 768 00:35:37,980 --> 00:35:42,180 literal D at that point 769 00:35:40,200 --> 00:35:44,420 and then move on 770 00:35:42,180 --> 00:35:48,180 but 771 00:35:44,420 --> 00:35:50,820 what if we could instead write that as 772 00:35:48,180 --> 00:35:54,060 the symbols a b c d 773 00:35:50,820 --> 00:35:57,180 we've rearranged these the symbols that 774 00:35:54,060 --> 00:36:00,599 we're encoding and then we can write a 775 00:35:57,180 --> 00:36:04,619 second CD and where that it is always 776 00:36:00,599 --> 00:36:11,339 more efficient like that uh 777 00:36:04,619 --> 00:36:13,380 second D there was 10 bits of stepping 778 00:36:11,339 --> 00:36:16,320 back into literals 779 00:36:13,380 --> 00:36:18,359 that's a that's a 10-bit symbol that we 780 00:36:16,320 --> 00:36:21,260 could put in there that's that's you 781 00:36:18,359 --> 00:36:24,420 know where if we're a 782 00:36:21,260 --> 00:36:25,980 1024 symbols in our table we're still 783 00:36:24,420 --> 00:36:27,540 going to be more efficient just to write 784 00:36:25,980 --> 00:36:29,400 that one character and we're probably 785 00:36:27,540 --> 00:36:33,000 going to write more than one which means 786 00:36:29,400 --> 00:36:37,320 we're probably saving 19 bits so 787 00:36:33,000 --> 00:36:39,900 um we've got that that option of uh 788 00:36:37,320 --> 00:36:44,579 thinking about how we encode those 789 00:36:39,900 --> 00:36:50,240 symbols it then does mean that it we may 790 00:36:44,579 --> 00:36:51,859 change what symbols uh the 791 00:36:50,240 --> 00:36:54,780 decoder 792 00:36:51,859 --> 00:36:57,720 sees in its symbol table and this is 793 00:36:54,780 --> 00:36:59,280 something that's really important this 794 00:36:57,720 --> 00:37:01,980 is where I'm going to have to really 795 00:36:59,280 --> 00:37:03,420 play with this to try and try and 796 00:37:01,980 --> 00:37:06,359 understand what the effects of that 797 00:37:03,420 --> 00:37:08,579 rearrangement are the other thing that I 798 00:37:06,359 --> 00:37:10,440 found when I was playing around with 799 00:37:08,579 --> 00:37:12,540 this it worked fine with some of my test 800 00:37:10,440 --> 00:37:15,599 examples and I gave it the code of 801 00:37:12,540 --> 00:37:18,180 symbolizer to encode and it used all 32 802 00:37:15,599 --> 00:37:20,160 gigs of my machine and I had to kill it 803 00:37:18,180 --> 00:37:22,619 because it was running on flat out on a 804 00:37:20,160 --> 00:37:26,280 CPU and the reason for that was that it 805 00:37:22,619 --> 00:37:29,280 was looking through a list of about a 806 00:37:26,280 --> 00:37:30,420 string probably in the order of 807 00:37:29,280 --> 00:37:32,700 um 808 00:37:30,420 --> 00:37:36,000 80 to 90 characters but it was doing 809 00:37:32,700 --> 00:37:39,180 this like thousands of rearrangements of 810 00:37:36,000 --> 00:37:41,880 that finding every possible way that you 811 00:37:39,180 --> 00:37:45,420 could rewrite that so I have some ideas 812 00:37:41,880 --> 00:37:49,619 on how to improve that the other way the 813 00:37:45,420 --> 00:37:52,020 other ideas that I think we I want to 814 00:37:49,619 --> 00:37:54,839 explore here are that 815 00:37:52,020 --> 00:37:57,720 uh it would be good rather than having 816 00:37:54,839 --> 00:37:59,760 some sort of signal to say throw away 817 00:37:57,720 --> 00:38:02,339 your symbol table and start again from 818 00:37:59,760 --> 00:38:05,220 scratch if we could throw away the 819 00:38:02,339 --> 00:38:06,619 symbols we never needed again or the 820 00:38:05,220 --> 00:38:10,260 symbols that just hadn't been used 821 00:38:06,619 --> 00:38:11,760 recently we could tell the decoder to 822 00:38:10,260 --> 00:38:15,660 forget those 823 00:38:11,760 --> 00:38:18,540 even if we used fixed length coding 824 00:38:15,660 --> 00:38:23,520 maybe we still need a move to front idea 825 00:38:18,540 --> 00:38:25,680 so that uh the the uh the later symbols 826 00:38:23,520 --> 00:38:28,859 the least recently used symbols can be 827 00:38:25,680 --> 00:38:32,700 thrown out rather than say going up to a 828 00:38:28,859 --> 00:38:35,820 higher symbol table a symbol length what 829 00:38:32,700 --> 00:38:37,859 if we had literals that don't create 830 00:38:35,820 --> 00:38:40,320 symbols we know we've got a header that 831 00:38:37,859 --> 00:38:41,640 we never ever see again in the input and 832 00:38:40,320 --> 00:38:45,180 we just say 833 00:38:41,640 --> 00:38:46,440 the decoder should not encode symbols 834 00:38:45,180 --> 00:38:47,700 from this because you're just placing 835 00:38:46,440 --> 00:38:50,099 symbol space 836 00:38:47,700 --> 00:38:52,680 what if we had uh literals that don't 837 00:38:50,099 --> 00:38:56,579 create output but create symbols we 838 00:38:52,680 --> 00:38:58,680 could say recognize some patterns uh the 839 00:38:56,579 --> 00:38:59,760 you know in the first sort of 840 00:38:58,680 --> 00:39:03,960 um 841 00:38:59,760 --> 00:39:07,200 a 10 20 100 symbols find the most 842 00:39:03,960 --> 00:39:10,980 efficient way to write those in a 843 00:39:07,200 --> 00:39:13,020 compact string and then transmit that to 844 00:39:10,980 --> 00:39:14,480 build the symbol table that we're then 845 00:39:13,020 --> 00:39:16,800 going to use 846 00:39:14,480 --> 00:39:19,020 to get 847 00:39:16,800 --> 00:39:22,099 um just sort of almost immediately go 848 00:39:19,020 --> 00:39:27,119 into symbol output 849 00:39:22,099 --> 00:39:29,160 uh what if we are encoding not just one 850 00:39:27,119 --> 00:39:31,020 stream but multiple streams like a tar 851 00:39:29,160 --> 00:39:32,720 file or something like that 852 00:39:31,020 --> 00:39:36,000 could be maybe 853 00:39:32,720 --> 00:39:39,300 sometimes share this symbol table with 854 00:39:36,000 --> 00:39:40,800 the next file or throw it away and the 855 00:39:39,300 --> 00:39:42,180 three we're reading something completely 856 00:39:40,800 --> 00:39:42,960 different 857 00:39:42,180 --> 00:39:46,920 um 858 00:39:42,960 --> 00:39:49,440 this last one was prompted from a friend 859 00:39:46,920 --> 00:39:50,540 of mine who asked me would it be how 860 00:39:49,440 --> 00:39:52,920 would it Go 861 00:39:50,540 --> 00:39:54,540 reading Chinese characters and I 862 00:39:52,920 --> 00:39:56,579 realized that actually at the moment 863 00:39:54,540 --> 00:39:58,500 I've been talking about bytes throughout 864 00:39:56,579 --> 00:40:02,700 my code but in fact my code is written 865 00:39:58,500 --> 00:40:07,200 around Python's interpretation of utf-8 866 00:40:02,700 --> 00:40:09,420 and that means that if the literals 867 00:40:07,200 --> 00:40:12,480 still contain UTF cat 8 characters which 868 00:40:09,420 --> 00:40:14,579 are perfectly readable in as an input 869 00:40:12,480 --> 00:40:15,540 stream they're still just bytes at the 870 00:40:14,579 --> 00:40:18,960 end of the day 871 00:40:15,540 --> 00:40:23,099 but the symbol table is now not built up 872 00:40:18,960 --> 00:40:26,579 of those bytes but of the characters 873 00:40:23,099 --> 00:40:29,579 that have more meaning in the text 874 00:40:26,579 --> 00:40:32,880 that too could be you know could give us 875 00:40:29,579 --> 00:40:36,020 more and Co more compression if we're 876 00:40:32,880 --> 00:40:39,119 reading something like wav files 877 00:40:36,020 --> 00:40:42,480 which are notoriously hard to compress 878 00:40:39,119 --> 00:40:47,339 because the uh if you look looking at 879 00:40:42,480 --> 00:40:50,640 say a 16-bit PCM encoded each sample is 880 00:40:47,339 --> 00:40:54,420 two bytes but there is no relationship 881 00:40:50,640 --> 00:40:57,540 between essentially between the the 882 00:40:54,420 --> 00:41:00,839 least the the lowest byte of sample one 883 00:40:57,540 --> 00:41:04,859 and the highest byte of sample two 884 00:41:00,839 --> 00:41:08,099 but as 16 bit words they are absolutely 885 00:41:04,859 --> 00:41:11,099 related and if we could encode them as 886 00:41:08,099 --> 00:41:13,380 symbols rather than simply treating it 887 00:41:11,099 --> 00:41:15,839 as a stream of bytes I think that would 888 00:41:13,380 --> 00:41:17,160 give us improvements as well so there's 889 00:41:15,839 --> 00:41:20,760 a whole bunch of ideas that I'm 890 00:41:17,160 --> 00:41:22,859 percolating away I'm still working on 891 00:41:20,760 --> 00:41:25,430 the code but I'm hoping to have some 892 00:41:22,859 --> 00:41:29,420 more results soon thank you 893 00:41:25,430 --> 00:41:31,740 [Applause] 894 00:41:29,420 --> 00:41:34,020 thank you 895 00:41:31,740 --> 00:41:36,119 any questions 896 00:41:34,020 --> 00:41:38,220 um 897 00:41:36,119 --> 00:41:42,240 unfortunately we're actually out of time 898 00:41:38,220 --> 00:41:44,460 right sorry and we're in room change so 899 00:41:42,240 --> 00:41:46,320 we we can't really go over time like we 900 00:41:44,460 --> 00:41:48,000 did do before lunch 901 00:41:46,320 --> 00:41:50,280 um so wondering how would you be okay 902 00:41:48,000 --> 00:41:52,560 with people come up and ask me any 903 00:41:50,280 --> 00:41:54,480 questions you like I'd love to have 904 00:41:52,560 --> 00:41:56,440 collaboration on this thank you thank 905 00:41:54,480 --> 00:41:59,579 you thank you once again 906 00:41:56,440 --> 00:41:59,579 [Applause]