1
00:00:00,000 --> 00:00:08,469
foreign

2
00:00:00,500 --> 00:00:08,469
[Music]

3
00:00:11,219 --> 00:00:16,139
good afternoon so here this afternoon we

4
00:00:14,160 --> 00:00:17,940
have Paul Weber and he's here to talk to

5
00:00:16,139 --> 00:00:19,140
us about improving old compression

6
00:00:17,940 --> 00:00:21,660
algorithms

7
00:00:19,140 --> 00:00:23,760
Paul started writing code for a zx80 on

8
00:00:21,660 --> 00:00:25,800
the typewriter and things improved a bit

9
00:00:23,760 --> 00:00:27,720
since then they currently work for Red

10
00:00:25,800 --> 00:00:30,119
Hat trying to automate so that other

11
00:00:27,720 --> 00:00:31,980
people don't have to today Paul will

12
00:00:30,119 --> 00:00:34,739
discuss this Research into improving the

13
00:00:31,980 --> 00:00:37,079
lcw algorithm to get a demonstration on

14
00:00:34,739 --> 00:00:38,880
encoding and decoding and compare his

15
00:00:37,079 --> 00:00:41,380
compression ratios to other more modern

16
00:00:38,880 --> 00:00:44,719
compression methods please welcome Paul

17
00:00:41,380 --> 00:00:44,719
[Applause]

18
00:00:46,079 --> 00:00:51,180
thank you very much and greetings from

19
00:00:49,140 --> 00:00:53,039
nunawal country where I come from in

20
00:00:51,180 --> 00:00:55,800
Canberra

21
00:00:53,039 --> 00:00:57,420
so compression algorithms has been

22
00:00:55,800 --> 00:01:00,360
something I've been really interested in

23
00:00:57,420 --> 00:01:03,300
for a long time so I'm going to give you

24
00:01:00,360 --> 00:01:04,920
a bit of a history of compression uh

25
00:01:03,300 --> 00:01:07,200
some of the theory behind it some of the

26
00:01:04,920 --> 00:01:10,200
algorithm work that I've been doing and

27
00:01:07,200 --> 00:01:13,920
sort of look into the future so I want

28
00:01:10,200 --> 00:01:15,740
to take you back to 1977. who wasn't

29
00:01:13,920 --> 00:01:18,600
alive in

30
00:01:15,740 --> 00:01:21,180
1977. okay there's a few so so I'm going

31
00:01:18,600 --> 00:01:22,580
to have to recap some stuff

32
00:01:21,180 --> 00:01:26,100
um

33
00:01:22,580 --> 00:01:29,580
some some bad things happened uh some

34
00:01:26,100 --> 00:01:30,960
good things happened uh the first hints

35
00:01:29,580 --> 00:01:34,500
of an Internet

36
00:01:30,960 --> 00:01:38,340
started appearing something particularly

37
00:01:34,500 --> 00:01:41,579
important for a lot of people here that

38
00:01:38,340 --> 00:01:43,500
were alive during this time was these

39
00:01:41,579 --> 00:01:48,079
three computers appeared on the scene

40
00:01:43,500 --> 00:01:51,320
the pet the Apple II and the TRS-80 and

41
00:01:48,079 --> 00:01:54,780
those along with a number of other

42
00:01:51,320 --> 00:01:57,840
personal computers really took Computing

43
00:01:54,780 --> 00:02:01,140
from stuff that you did maybe at your

44
00:01:57,840 --> 00:02:04,619
parents work or in a library into the

45
00:02:01,140 --> 00:02:07,619
home that you could play with yourself

46
00:02:04,619 --> 00:02:10,979
the other important thing that is for my

47
00:02:07,619 --> 00:02:13,080
talk uh in that happened in 1977 was

48
00:02:10,979 --> 00:02:16,260
that these two gentlemen Abraham Lincoln

49
00:02:13,080 --> 00:02:18,660
on the left left who's sadly passed away

50
00:02:16,260 --> 00:02:22,620
so this talk is kind of dedicated to him

51
00:02:18,660 --> 00:02:27,239
and Jacob ziv on the right uh introduced

52
00:02:22,620 --> 00:02:29,580
a compression algorithm called lz77

53
00:02:27,239 --> 00:02:34,319
um strange title but there you go

54
00:02:29,580 --> 00:02:37,200
um the way this works is that we look at

55
00:02:34,319 --> 00:02:39,840
the output we're either outputting sort

56
00:02:37,200 --> 00:02:43,019
of literal data or we have a way of

57
00:02:39,840 --> 00:02:45,660
saying there's a repeat that's in the

58
00:02:43,019 --> 00:02:49,260
output stream that is already in the

59
00:02:45,660 --> 00:02:50,940
output and so we refer back to this we

60
00:02:49,260 --> 00:02:54,120
have this sort of Tuple in the original

61
00:02:50,940 --> 00:02:56,220
paper a bit the distance back to the

62
00:02:54,120 --> 00:02:57,720
last match the length of that match and

63
00:02:56,220 --> 00:02:58,920
then the next character that we're going

64
00:02:57,720 --> 00:03:02,280
to Output

65
00:02:58,920 --> 00:03:05,879
if we have no match then we just output

66
00:03:02,280 --> 00:03:07,440
zeros and we output characters one at a

67
00:03:05,879 --> 00:03:10,920
time until we've got enough to actually

68
00:03:07,440 --> 00:03:14,040
form a match so this basically looks

69
00:03:10,920 --> 00:03:18,560
back into a sliding window over the past

70
00:03:14,040 --> 00:03:23,280
output to look for matches that the

71
00:03:18,560 --> 00:03:26,640
encoder can efficiently uh

72
00:03:23,280 --> 00:03:28,980
is transmit and the decoder then knows

73
00:03:26,640 --> 00:03:31,980
to find in that window

74
00:03:28,980 --> 00:03:34,080
like it's impossible to understate the

75
00:03:31,980 --> 00:03:37,019
importance of this algorithm these are

76
00:03:34,080 --> 00:03:40,799
the tools just some of them that have

77
00:03:37,019 --> 00:03:43,560
been influenced by lz77 not just the

78
00:03:40,799 --> 00:03:46,799
tools but the algorithms as well as ltma

79
00:03:43,560 --> 00:03:49,879
obviously is one of them but there are

80
00:03:46,799 --> 00:03:52,500
decompressors for different

81
00:03:49,879 --> 00:03:55,280
architectures there's so when your

82
00:03:52,500 --> 00:03:58,500
memory is very constrained there are

83
00:03:55,280 --> 00:04:02,540
there are algorithms that are optimized

84
00:03:58,500 --> 00:04:02,540
particularly for decompression speed

85
00:04:02,900 --> 00:04:09,659
it it they have just been huge so

86
00:04:07,379 --> 00:04:13,439
uh we're going to move along on to

87
00:04:09,659 --> 00:04:14,340
another uh the next year 1978

88
00:04:13,439 --> 00:04:16,799
um

89
00:04:14,340 --> 00:04:18,419
a bunch of bad things happened a bunch

90
00:04:16,799 --> 00:04:19,079
of good things happened

91
00:04:18,419 --> 00:04:22,620
um

92
00:04:19,079 --> 00:04:25,380
and these two gentlemen not satisfied

93
00:04:22,620 --> 00:04:28,979
with releasing one algorithm released a

94
00:04:25,380 --> 00:04:28,979
second called lz78

95
00:04:29,220 --> 00:04:34,199
now this I know once you're on a theme

96
00:04:32,520 --> 00:04:37,139
stick to it right

97
00:04:34,199 --> 00:04:41,699
um so uh this works slightly differently

98
00:04:37,139 --> 00:04:45,180
what it does is it learns sequences of

99
00:04:41,699 --> 00:04:49,440
adjacent characters or symbols in the

100
00:04:45,180 --> 00:04:50,240
input and put them into a dictionary and

101
00:04:49,440 --> 00:04:55,259
then

102
00:04:50,240 --> 00:04:58,259
the output is the longest dictionary

103
00:04:55,259 --> 00:05:00,440
entry that matches that's in the

104
00:04:58,259 --> 00:05:03,060
dictionary at the time

105
00:05:00,440 --> 00:05:05,220
I have to take you slightly further

106
00:05:03,060 --> 00:05:06,900
forward to give you a good example of

107
00:05:05,220 --> 00:05:08,419
this I'm going I need to go forward to

108
00:05:06,900 --> 00:05:12,560
1984

109
00:05:08,419 --> 00:05:16,860
we were at war with East Asia again

110
00:05:12,560 --> 00:05:18,540
there was a lot of sex crime I was 13 at

111
00:05:16,860 --> 00:05:19,740
the moment so I didn't have any of that

112
00:05:18,540 --> 00:05:20,639
but

113
00:05:19,740 --> 00:05:23,160
um

114
00:05:20,639 --> 00:05:27,660
Terry Welch who I don't have a photo of

115
00:05:23,160 --> 00:05:30,020
uh released an improvement to lz78 and

116
00:05:27,660 --> 00:05:33,900
it's commonly called lzw

117
00:05:30,020 --> 00:05:38,580
the way this works which is conceptually

118
00:05:33,900 --> 00:05:41,100
a lot easier than lz78 is that we start

119
00:05:38,580 --> 00:05:44,039
with a symbol table that contains the

120
00:05:41,100 --> 00:05:45,479
first the basically all variations of an

121
00:05:44,039 --> 00:05:48,060
8-bit byte

122
00:05:45,479 --> 00:05:50,460
uh there's also a flush table symbol and

123
00:05:48,060 --> 00:05:51,419
there's a stop input symbol

124
00:05:50,460 --> 00:05:56,340
um

125
00:05:51,419 --> 00:05:57,419
and we basically add input

126
00:05:56,340 --> 00:06:00,960
um

127
00:05:57,419 --> 00:06:03,720
to sort of our buffer until that buffer

128
00:06:00,960 --> 00:06:05,639
no longer is matches an entry in the

129
00:06:03,720 --> 00:06:08,100
symbol table

130
00:06:05,639 --> 00:06:09,120
at that point we know we've got a new

131
00:06:08,100 --> 00:06:12,960
symbol

132
00:06:09,120 --> 00:06:16,620
so we add the last symbol that matched

133
00:06:12,960 --> 00:06:18,120
we remember the next character and we

134
00:06:16,620 --> 00:06:20,840
start looking again

135
00:06:18,120 --> 00:06:23,160
so I'm going to give you an example

136
00:06:20,840 --> 00:06:25,919
of how this works we're going to be

137
00:06:23,160 --> 00:06:27,900
encoding this admittedly very

138
00:06:25,919 --> 00:06:30,840
repetitious string but it's good for my

139
00:06:27,900 --> 00:06:34,199
purposes here and I'm going to ask you

140
00:06:30,840 --> 00:06:36,900
to bear with me for one bit because I'm

141
00:06:34,199 --> 00:06:38,280
not going to write out all 256 symbol

142
00:06:36,900 --> 00:06:40,740
entries here I just want you to imagine

143
00:06:38,280 --> 00:06:42,900
them I'm just going to number the

144
00:06:40,740 --> 00:06:45,780
symbols that we've got here at the

145
00:06:42,900 --> 00:06:48,060
moment we say we start with those the

146
00:06:45,780 --> 00:06:51,240
four characters that appear in this

147
00:06:48,060 --> 00:06:52,979
string as the first four entries in our

148
00:06:51,240 --> 00:06:57,180
symbol table

149
00:06:52,979 --> 00:07:01,139
so we read the first B we're looking

150
00:06:57,180 --> 00:07:04,319
ahead to a there's no ba in the symbol

151
00:07:01,139 --> 00:07:05,819
table so we add that and then we write

152
00:07:04,319 --> 00:07:09,300
the symbol B

153
00:07:05,819 --> 00:07:13,560
uh we'd go we've got the a we're looking

154
00:07:09,300 --> 00:07:16,680
ahead to n there's no a n symbol uh we

155
00:07:13,560 --> 00:07:18,660
add that and we write the a you can see

156
00:07:16,680 --> 00:07:23,880
how we're building up this symbol table

157
00:07:18,660 --> 00:07:27,300
here so we add n a that's good

158
00:07:23,880 --> 00:07:29,160
and now we have our first match a n is

159
00:07:27,300 --> 00:07:31,680
already in the symbol table so we can

160
00:07:29,160 --> 00:07:34,919
look ahead to a there's no a and a

161
00:07:31,680 --> 00:07:37,199
symbol but now we're outputting

162
00:07:34,919 --> 00:07:40,160
essentially a two character symbol

163
00:07:37,199 --> 00:07:42,960
versus just a one character symbol

164
00:07:40,160 --> 00:07:46,860
there's no we're looking ahead through

165
00:07:42,960 --> 00:07:50,460
the space so there's no a space symbol

166
00:07:46,860 --> 00:07:53,039
so we add that there's no space b symbol

167
00:07:50,460 --> 00:07:55,860
so we add that but now obviously we're

168
00:07:53,039 --> 00:07:59,880
going to start to get these larger

169
00:07:55,860 --> 00:08:02,280
symbols building up out of

170
00:07:59,880 --> 00:08:06,060
symbols that we've added to the symbol

171
00:08:02,280 --> 00:08:08,940
Table after we started compressing we've

172
00:08:06,060 --> 00:08:11,340
got an N A here and that produces the

173
00:08:08,940 --> 00:08:15,180
Nan because we're looking at forward to

174
00:08:11,340 --> 00:08:19,259
the Outlook ahead is to the N symbol but

175
00:08:15,180 --> 00:08:20,720
then we can write n a again and we know

176
00:08:19,259 --> 00:08:24,539
that the

177
00:08:20,720 --> 00:08:28,819
we're looking forward to in a space the

178
00:08:24,539 --> 00:08:32,039
decompressor here basically starts with

179
00:08:28,819 --> 00:08:35,159
the B symbol and it it knows that it's

180
00:08:32,039 --> 00:08:37,380
going to write a b something symbol into

181
00:08:35,159 --> 00:08:40,919
the table at some point when it gets the

182
00:08:37,380 --> 00:08:44,099
a symbol it knows that the a there was

183
00:08:40,919 --> 00:08:47,160
the first character of that symbol that

184
00:08:44,099 --> 00:08:48,959
it needed to write and so the

185
00:08:47,160 --> 00:08:51,420
decompressor and the compressor can work

186
00:08:48,959 --> 00:08:54,600
in lockstep building up this same symbol

187
00:08:51,420 --> 00:08:56,240
table so that's the shared knowledge and

188
00:08:54,600 --> 00:08:59,640
we start getting

189
00:08:56,240 --> 00:09:01,279
encoding that final banana

190
00:08:59,640 --> 00:09:05,760
um

191
00:09:01,279 --> 00:09:09,080
we and we're outputting symbols that are

192
00:09:05,760 --> 00:09:12,839
now three characters in length

193
00:09:09,080 --> 00:09:14,399
so and we finally basically finish up

194
00:09:12,839 --> 00:09:17,100
our input and I've just said okay

195
00:09:14,399 --> 00:09:19,140
there's no more work to be done

196
00:09:17,100 --> 00:09:21,060
so how does that compare with our

197
00:09:19,140 --> 00:09:23,519
original data

198
00:09:21,060 --> 00:09:27,660
well we started with

199
00:09:23,519 --> 00:09:31,339
four six byte words and three spaces so

200
00:09:27,660 --> 00:09:34,500
we've got 27 characters 327 bytes

201
00:09:31,339 --> 00:09:35,760
we have those symbols that we're needing

202
00:09:34,500 --> 00:09:41,279
to Output

203
00:09:35,760 --> 00:09:45,779
but in the original lzw algorithm we

204
00:09:41,279 --> 00:09:49,200
have a 12 bit a 4096 entry symbol table

205
00:09:45,779 --> 00:09:52,080
and so each of those is a 12 bit number

206
00:09:49,200 --> 00:09:53,820
so we multiply by 1.5 to get 21 bytes

207
00:09:52,080 --> 00:09:55,980
but we've still managed to get some

208
00:09:53,820 --> 00:09:58,820
compression even in this relatively

209
00:09:55,980 --> 00:09:58,820
short example

210
00:09:58,860 --> 00:10:04,019
so let's have a look at what else are W

211
00:10:02,339 --> 00:10:06,660
is giving us and what we could maybe

212
00:10:04,019 --> 00:10:10,260
improve

213
00:10:06,660 --> 00:10:13,620
we have this incremental adaptation

214
00:10:10,260 --> 00:10:16,260
which is good because as the symbol as

215
00:10:13,620 --> 00:10:19,440
the input changes you get new symbols

216
00:10:16,260 --> 00:10:21,200
you start getting longer symbols being

217
00:10:19,440 --> 00:10:23,540
made up

218
00:10:21,200 --> 00:10:26,459
but there

219
00:10:23,540 --> 00:10:28,200
we're only adding one more character

220
00:10:26,459 --> 00:10:31,560
each time and I think it would be good

221
00:10:28,200 --> 00:10:35,580
if we can find a way to expand that

222
00:10:31,560 --> 00:10:37,200
to to increase the the size of the

223
00:10:35,580 --> 00:10:41,279
symbols we're putting into the symbol

224
00:10:37,200 --> 00:10:44,940
table sort of faster than that other

225
00:10:41,279 --> 00:10:47,579
problem is that when we're writing data

226
00:10:44,940 --> 00:10:49,860
out initially

227
00:10:47,579 --> 00:10:54,300
um we only actually need eight bytes

228
00:10:49,860 --> 00:10:55,019
sorry eight bits but we're writing 12.

229
00:10:54,300 --> 00:10:58,339
um

230
00:10:55,019 --> 00:11:04,019
and that's a that's a big overhead right

231
00:10:58,339 --> 00:11:09,060
we're also Limited in this 4096 entry

232
00:11:04,019 --> 00:11:11,760
symbol table uh when we reach the 4096

233
00:11:09,060 --> 00:11:14,820
symbol we have to literally say forget

234
00:11:11,760 --> 00:11:19,339
everything you ever know about the input

235
00:11:14,820 --> 00:11:19,339
stream and start again from scratch

236
00:11:20,100 --> 00:11:26,839
despite all this uh this the lzw

237
00:11:24,079 --> 00:11:30,779
algorithm is actually used in the GIF

238
00:11:26,839 --> 00:11:32,880
187a and the GIF 89A standards which is

239
00:11:30,779 --> 00:11:36,079
where I need to mention

240
00:11:32,880 --> 00:11:38,880
the patents

241
00:11:36,079 --> 00:11:41,040
ordinarily there's a role of ominous

242
00:11:38,880 --> 00:11:43,800
Thunder at that point but

243
00:11:41,040 --> 00:11:45,839
um what happened here was Terry Welch

244
00:11:43,800 --> 00:11:47,820
was working for Sperrys Barry got bought

245
00:11:45,839 --> 00:11:50,160
by Unisys Unisys decided to start

246
00:11:47,820 --> 00:11:53,779
enforcing the patents on people who

247
00:11:50,160 --> 00:11:58,200
wrote algorithms to produce GIF images

248
00:11:53,779 --> 00:12:01,260
uh the what was the internet then uh

249
00:11:58,200 --> 00:12:03,899
kind of didn't like this companies

250
00:12:01,260 --> 00:12:07,200
didn't like this uh companies fought

251
00:12:03,899 --> 00:12:09,779
back a bit it was unpopular Unisys

252
00:12:07,200 --> 00:12:12,540
decided not to and then I decided to do

253
00:12:09,779 --> 00:12:16,980
it again and it was unpopular again

254
00:12:12,540 --> 00:12:22,040
uh the patents expired in 2004 but by

255
00:12:16,980 --> 00:12:25,200
that time uh in sort of early in the 90s

256
00:12:22,040 --> 00:12:27,240
the graphics producing community in the

257
00:12:25,200 --> 00:12:29,220
web Community as it was at the time

258
00:12:27,240 --> 00:12:32,040
decided that the thing that we really

259
00:12:29,220 --> 00:12:35,579
needed was a better standard gift was

260
00:12:32,040 --> 00:12:39,000
also limited to 256 byte car so 256

261
00:12:35,579 --> 00:12:41,160
color simple color tables so we needed

262
00:12:39,000 --> 00:12:43,500
true color and better grayscale and

263
00:12:41,160 --> 00:12:46,380
things like that so they invent the

264
00:12:43,500 --> 00:12:47,899
graphics Community invented PNG which in

265
00:12:46,380 --> 00:12:51,240
which used

266
00:12:47,899 --> 00:12:55,019
lz77 and zedlib to do its compression

267
00:12:51,240 --> 00:12:57,019
which was patent well patent free

268
00:12:55,019 --> 00:13:00,720
so

269
00:12:57,019 --> 00:13:03,480
that's kind of why I think that's the

270
00:13:00,720 --> 00:13:06,839
other reason why I think uh Elder W

271
00:13:03,480 --> 00:13:09,720
doesn't get as much scrutiny these days

272
00:13:06,839 --> 00:13:13,800
for what we could do to improve it as

273
00:13:09,720 --> 00:13:17,880
maybe uh lz77 has

274
00:13:13,800 --> 00:13:21,959
so what can we do to improve lzw

275
00:13:17,880 --> 00:13:24,240
well I reckon firstly we can find a more

276
00:13:21,959 --> 00:13:29,639
efficient way to write

277
00:13:24,240 --> 00:13:32,040
the initial data that is starting to add

278
00:13:29,639 --> 00:13:33,480
to our symbol table

279
00:13:32,040 --> 00:13:36,180
secondly

280
00:13:33,480 --> 00:13:39,240
the the one of the big problems is you

281
00:13:36,180 --> 00:13:44,339
can probably imagine if you're writing

282
00:13:39,240 --> 00:13:47,180
just say text file you're only using 96

283
00:13:44,339 --> 00:13:50,700
of the 256 characters

284
00:13:47,180 --> 00:13:53,040
and that's a whole bunch of symbol table

285
00:13:50,700 --> 00:13:54,779
that you have effectively wasted because

286
00:13:53,040 --> 00:13:58,380
you're never using it

287
00:13:54,779 --> 00:14:00,899
so we should start with as as little as

288
00:13:58,380 --> 00:14:01,800
possible in the symbol table

289
00:14:00,899 --> 00:14:05,279
um

290
00:14:01,800 --> 00:14:08,579
we also want to try to use variable

291
00:14:05,279 --> 00:14:12,120
length coding so one thing that an

292
00:14:08,579 --> 00:14:15,060
improvement on Terry uh of the W that I

293
00:14:12,120 --> 00:14:18,180
think Terry watch suggested was that if

294
00:14:15,060 --> 00:14:22,019
we're output if we're less than symbol

295
00:14:18,180 --> 00:14:23,160
512 we only out need to Output nine bit

296
00:14:22,019 --> 00:14:25,920
symbols

297
00:14:23,160 --> 00:14:28,079
we have already more than eight bit

298
00:14:25,920 --> 00:14:30,240
eight bits of symbol in the simple table

299
00:14:28,079 --> 00:14:32,420
but less than nine bits so we can always

300
00:14:30,240 --> 00:14:37,139
represent those with nine bit numbers

301
00:14:32,420 --> 00:14:39,060
when the compressor adds the 512 symbol

302
00:14:37,139 --> 00:14:41,519
it goes up to 10 bits when the

303
00:14:39,060 --> 00:14:44,639
decompressor recognizes that it is on

304
00:14:41,519 --> 00:14:46,680
the FI the 512th symbol it knows the

305
00:14:44,639 --> 00:14:48,860
next symbol is going to be 10 bits so

306
00:14:46,680 --> 00:14:52,860
they're synchronized still so we can

307
00:14:48,860 --> 00:14:55,800
shorten how much data work to go into

308
00:14:52,860 --> 00:14:59,639
um send per symbol

309
00:14:55,800 --> 00:15:01,079
I also want to look at ways that we can

310
00:14:59,639 --> 00:15:04,079
learn

311
00:15:01,079 --> 00:15:05,760
the patterns in the data faster and this

312
00:15:04,079 --> 00:15:09,420
is essential essentially what all of the

313
00:15:05,760 --> 00:15:09,420
improvements in

314
00:15:10,339 --> 00:15:17,760
lz77 over the years have done they have

315
00:15:13,560 --> 00:15:21,600
found ways of getting finding better

316
00:15:17,760 --> 00:15:23,720
matches which also means we need a much

317
00:15:21,600 --> 00:15:27,060
larger symbol table I mean

318
00:15:23,720 --> 00:15:29,660
lzma is typically working with up to

319
00:15:27,060 --> 00:15:33,720
four gigabytes of memory

320
00:15:29,660 --> 00:15:36,839
4096 symbols is just not enough right

321
00:15:33,720 --> 00:15:41,639
the last thing that I want to kind of do

322
00:15:36,839 --> 00:15:44,100
in my thoughts about improving uh lzw is

323
00:15:41,639 --> 00:15:45,720
still I wanted to stay true to that idea

324
00:15:44,100 --> 00:15:48,540
that we are adding symbols to a

325
00:15:45,720 --> 00:15:52,860
dictionary to a table

326
00:15:48,540 --> 00:15:55,380
um from how we've built up the input

327
00:15:52,860 --> 00:15:58,260
it would be nice if we could say okay

328
00:15:55,380 --> 00:16:00,600
this symbol occurred here like you know

329
00:15:58,260 --> 00:16:03,000
you're reading a Python program it uses

330
00:16:00,600 --> 00:16:07,019
the word import so we're going to use

331
00:16:03,000 --> 00:16:09,180
we're going to say directly write up the

332
00:16:07,019 --> 00:16:11,339
import the word import is going to be a

333
00:16:09,180 --> 00:16:12,440
symbol that you're going to need in the

334
00:16:11,339 --> 00:16:15,120
future

335
00:16:12,440 --> 00:16:18,060
and the problem is that that starts

336
00:16:15,120 --> 00:16:20,279
looking awfully a lot a lot like lz77

337
00:16:18,060 --> 00:16:22,740
where we said this amount of data was

338
00:16:20,279 --> 00:16:25,560
repeated and I feel like

339
00:16:22,740 --> 00:16:28,500
that's kind of a crossover here I'd like

340
00:16:25,560 --> 00:16:30,300
to stay true to that lzw idea

341
00:16:28,500 --> 00:16:33,779
Okay so

342
00:16:30,300 --> 00:16:35,040
uh the LZ PW algorithm that I've been

343
00:16:33,779 --> 00:16:38,519
working on

344
00:16:35,040 --> 00:16:42,120
its main I think it's main Improvement

345
00:16:38,519 --> 00:16:44,220
is that we alternate lists of literal

346
00:16:42,120 --> 00:16:47,339
characters

347
00:16:44,220 --> 00:16:50,040
um and then symbols we have a different

348
00:16:47,339 --> 00:16:52,380
way of writing each of those and we will

349
00:16:50,040 --> 00:16:55,920
basically say you will write as many

350
00:16:52,380 --> 00:16:58,500
literal characters as you can until and

351
00:16:55,920 --> 00:17:01,019
until there's a symbol that you you can

352
00:16:58,500 --> 00:17:03,180
match and then you write symbols for as

353
00:17:01,019 --> 00:17:05,900
many symbols as you can match until you

354
00:17:03,180 --> 00:17:08,579
have to go back to literals

355
00:17:05,900 --> 00:17:11,880
the symbols we're going to form similar

356
00:17:08,579 --> 00:17:14,220
to the way that lzw forms them so that

357
00:17:11,880 --> 00:17:17,880
when we read the literal string a b c d

358
00:17:14,220 --> 00:17:20,179
we'll form these three symbols A B B C

359
00:17:17,880 --> 00:17:23,520
and C D

360
00:17:20,179 --> 00:17:26,819
the at the moment the way the code works

361
00:17:23,520 --> 00:17:28,679
is that when we read the symbols a b and

362
00:17:26,819 --> 00:17:33,720
c d we're also going to create the

363
00:17:28,679 --> 00:17:36,059
symbol ABC now I that what I'm trying to

364
00:17:33,720 --> 00:17:39,480
experiment with and I literally had a

365
00:17:36,059 --> 00:17:43,260
brainwave as I was at the end of

366
00:17:39,480 --> 00:17:46,860
um Seb Chan's keynote I don't know why

367
00:17:43,260 --> 00:17:51,600
then but I realized how I can get it to

368
00:17:46,860 --> 00:17:54,299
not only add ABC but in fact at ABCD as

369
00:17:51,600 --> 00:17:58,200
a symbol and if we can do that then now

370
00:17:54,299 --> 00:17:59,880
we're going to expand our the symbols in

371
00:17:58,200 --> 00:18:03,260
our symbol table are going to start

372
00:17:59,880 --> 00:18:03,260
expanding much quicker

373
00:18:03,299 --> 00:18:08,760
um we're also going to store those

374
00:18:06,840 --> 00:18:10,200
symbols obviously in a separate table

375
00:18:08,760 --> 00:18:11,760
and we're going to start numbering from

376
00:18:10,200 --> 00:18:14,360
zero we're going to literally start with

377
00:18:11,760 --> 00:18:17,280
zero symbols in the table

378
00:18:14,360 --> 00:18:18,780
and each of these lists literals or

379
00:18:17,280 --> 00:18:20,400
symbols is going to be prefixed by

380
00:18:18,780 --> 00:18:22,799
lengths and lengths so we're going to be

381
00:18:20,400 --> 00:18:25,020
written in Fibonacci coding

382
00:18:22,799 --> 00:18:27,179
what's Fibonacci coding I hear you cry

383
00:18:25,020 --> 00:18:27,900
well

384
00:18:27,179 --> 00:18:30,240
um

385
00:18:27,900 --> 00:18:33,660
this is a little bit of a diversion but

386
00:18:30,240 --> 00:18:35,640
number bases can be fun we all use base

387
00:18:33,660 --> 00:18:37,740
10 coding where you read from right to

388
00:18:35,640 --> 00:18:40,100
left you multiply by the bay the value

389
00:18:37,740 --> 00:18:43,679
of that column in the base

390
00:18:40,100 --> 00:18:46,080
add the the value multiplied by that to

391
00:18:43,679 --> 00:18:48,179
produce the number and we've obviously

392
00:18:46,080 --> 00:18:52,260
familiar with this with doing this with

393
00:18:48,179 --> 00:18:55,260
binary coding where we just add the

394
00:18:52,260 --> 00:18:58,460
values of each of those bases where the

395
00:18:55,260 --> 00:19:00,419
base increases right from right to left

396
00:18:58,460 --> 00:19:01,980
Fibonacci closing as you've probably

397
00:19:00,419 --> 00:19:05,340
guessed now

398
00:19:01,980 --> 00:19:09,660
um You the base values are the Fibonacci

399
00:19:05,340 --> 00:19:14,220
numbers so that number there is 34 plus

400
00:19:09,660 --> 00:19:15,780
1 plus 8 plus 3 plus 2.

401
00:19:14,220 --> 00:19:17,700
and you've probably figured out here

402
00:19:15,780 --> 00:19:19,919
hang on three plus two is five that's

403
00:19:17,700 --> 00:19:22,500
also a Fibonacci number so we should

404
00:19:19,919 --> 00:19:24,780
probably write that as

405
00:19:22,500 --> 00:19:27,600
um eight plus five oh that's another

406
00:19:24,780 --> 00:19:29,100
Fibonacci number that's 13.

407
00:19:27,600 --> 00:19:33,299
so

408
00:19:29,100 --> 00:19:35,280
here we can we can observe that if

409
00:19:33,299 --> 00:19:38,220
there's a kind of canonical way to write

410
00:19:35,280 --> 00:19:40,860
Fibonacci numbers you always you can

411
00:19:38,220 --> 00:19:43,020
always write to them using the highest

412
00:19:40,860 --> 00:19:47,400
Fibonacci number possible you'd never

413
00:19:43,020 --> 00:19:49,440
need to write two ones side by side

414
00:19:47,400 --> 00:19:51,419
and so the other clever thing about

415
00:19:49,440 --> 00:19:53,460
Fibonacci coding is that we're actually

416
00:19:51,419 --> 00:19:56,280
going to turn that around and write it

417
00:19:53,460 --> 00:19:59,640
left to right increasing

418
00:19:56,280 --> 00:20:01,799
and then add a one bit on the end and

419
00:19:59,640 --> 00:20:03,539
when we see two one bits we know the

420
00:20:01,799 --> 00:20:05,840
number has terminated

421
00:20:03,539 --> 00:20:09,660
okay

422
00:20:05,840 --> 00:20:14,460
so Fibonacci number Fibonacci coding is

423
00:20:09,660 --> 00:20:16,380
self-terminating it also in some ways is

424
00:20:14,460 --> 00:20:17,940
resistant to bit errors which is kind of

425
00:20:16,380 --> 00:20:20,700
nice

426
00:20:17,940 --> 00:20:23,820
it's easy to decode because we simply go

427
00:20:20,700 --> 00:20:26,220
up the Fibonacci sequence it's a greedy

428
00:20:23,820 --> 00:20:29,360
encoding algorithm which is as soon as

429
00:20:26,220 --> 00:20:31,980
you know you can work out the

430
00:20:29,360 --> 00:20:34,860
Fibonacci number that's

431
00:20:31,980 --> 00:20:37,020
just below or equal to your current

432
00:20:34,860 --> 00:20:38,760
number you subtract that you write a 1

433
00:20:37,020 --> 00:20:42,120
and you write zeros until you get the

434
00:20:38,760 --> 00:20:45,000
next Fibonacci number that is is uh

435
00:20:42,120 --> 00:20:47,640
greater than that and

436
00:20:45,000 --> 00:20:49,559
um so just less than that or something

437
00:20:47,640 --> 00:20:52,140
like that yeah

438
00:20:49,559 --> 00:20:55,200
because we're going down it's less than

439
00:20:52,140 --> 00:20:56,880
um and you keep on doing that until you

440
00:20:55,200 --> 00:20:58,500
run out of zeros and then chuckle one on

441
00:20:56,880 --> 00:21:00,799
the front you reverse it around and

442
00:20:58,500 --> 00:21:04,380
there's your encoding

443
00:21:00,799 --> 00:21:06,299
there are schemes of Fibonacci coding

444
00:21:04,380 --> 00:21:08,400
that can cope with writing negative

445
00:21:06,299 --> 00:21:10,620
numbers and zeros and things like that

446
00:21:08,400 --> 00:21:12,900
I'm just we only need to use the

447
00:21:10,620 --> 00:21:15,360
simplest here because we're encoding

448
00:21:12,900 --> 00:21:17,220
lengths and we never need to write a

449
00:21:15,360 --> 00:21:20,039
zero length thing

450
00:21:17,220 --> 00:21:22,679
okay so that's enough Theory

451
00:21:20,039 --> 00:21:25,860
um the last thing I will say about the

452
00:21:22,679 --> 00:21:27,840
Fibonacci coding is if you look at the

453
00:21:25,860 --> 00:21:31,140
other variable length bitcoatings like

454
00:21:27,840 --> 00:21:35,220
inlias gamma coding and Lys Delta coding

455
00:21:31,140 --> 00:21:39,059
they are no more efficient than this but

456
00:21:35,220 --> 00:21:44,039
they require a lot more logic to decode

457
00:21:39,059 --> 00:21:47,039
um and you know we're really kind of at

458
00:21:44,039 --> 00:21:49,679
that theoretical limit we could write we

459
00:21:47,039 --> 00:21:52,440
could have Huffman tables but in order

460
00:21:49,679 --> 00:21:54,240
to have a Huffman table you need to have

461
00:21:52,440 --> 00:21:56,340
pre-encoded it or have some way of

462
00:21:54,240 --> 00:21:59,340
transmitting it and that adds more bytes

463
00:21:56,340 --> 00:22:00,600
this is kind of efficient right okay so

464
00:21:59,340 --> 00:22:02,820
links are going to be stored in

465
00:22:00,600 --> 00:22:05,460
Fibonacci coding

466
00:22:02,820 --> 00:22:09,299
so let's have a look at what we do with

467
00:22:05,460 --> 00:22:11,940
our example in coding this this again

468
00:22:09,299 --> 00:22:14,580
so we're going to start out with the

469
00:22:11,940 --> 00:22:16,320
literal ban we never start in simple

470
00:22:14,580 --> 00:22:17,220
coding we always start in literal coding

471
00:22:16,320 --> 00:22:18,419
because we have nothing in the simple

472
00:22:17,220 --> 00:22:20,640
table

473
00:22:18,419 --> 00:22:23,400
and we write we add we're looking

474
00:22:20,640 --> 00:22:27,059
forward to the a so we can write the

475
00:22:23,400 --> 00:22:29,580
three symbols b a a n and n a

476
00:22:27,059 --> 00:22:32,039
and at the point where we're looking at

477
00:22:29,580 --> 00:22:34,740
the A and we're looking forward to the N

478
00:22:32,039 --> 00:22:37,320
we can say oh that's a symbol entry we

479
00:22:34,740 --> 00:22:40,740
know there might be more than some more

480
00:22:37,320 --> 00:22:44,400
symbols but we've got a n we're looking

481
00:22:40,740 --> 00:22:47,940
forward to a something

482
00:22:44,400 --> 00:22:50,100
um and here we need to know with it's uh

483
00:22:47,940 --> 00:22:54,059
useful or it's sort of important in the

484
00:22:50,100 --> 00:22:56,340
algorithm to know that a is a prefix of

485
00:22:54,059 --> 00:23:01,559
a symbol somewhere we've seen it before

486
00:22:56,340 --> 00:23:04,860
in our input if we saw a q there

487
00:23:01,559 --> 00:23:08,100
um then we'd know no symbol starts with

488
00:23:04,860 --> 00:23:11,880
Q so we it's impossible for us to have a

489
00:23:08,100 --> 00:23:13,500
symbol with q that had is Q something in

490
00:23:11,880 --> 00:23:15,900
it and therefore we can just go straight

491
00:23:13,500 --> 00:23:18,780
back to literal coding but we might be

492
00:23:15,900 --> 00:23:20,820
encoding something here that is a symbol

493
00:23:18,780 --> 00:23:23,460
because it starts with a but then we

494
00:23:20,820 --> 00:23:25,559
look forward to the space and say no

495
00:23:23,460 --> 00:23:28,919
there's no a space symbol

496
00:23:25,559 --> 00:23:31,559
at that point we need to Output the A

497
00:23:28,919 --> 00:23:33,600
and the space so we've written two

498
00:23:31,559 --> 00:23:35,940
literals

499
00:23:33,600 --> 00:23:38,220
and we've we're looking forward to the

500
00:23:35,940 --> 00:23:42,179
capital B so we've added those two more

501
00:23:38,220 --> 00:23:43,159
symbols important to note here that we

502
00:23:42,179 --> 00:23:45,900
have

503
00:23:43,159 --> 00:23:47,840
more than four symbols in the table at

504
00:23:45,900 --> 00:23:50,820
this point this will be important later

505
00:23:47,840 --> 00:23:52,500
but now we can start writing symbols and

506
00:23:50,820 --> 00:23:56,520
obviously this is kind of going to

507
00:23:52,500 --> 00:23:59,880
follow that process that

508
00:23:56,520 --> 00:24:02,220
um or the W uses to construct symbols we

509
00:23:59,880 --> 00:24:04,740
keep on seeing symbols we add the next

510
00:24:02,220 --> 00:24:05,340
character there

511
00:24:04,740 --> 00:24:07,740
um

512
00:24:05,340 --> 00:24:10,020
and we keep on building up this list of

513
00:24:07,740 --> 00:24:13,980
symbols that we're going to Output until

514
00:24:10,020 --> 00:24:15,720
finally we have written the whole input

515
00:24:13,980 --> 00:24:18,360
Okay so

516
00:24:15,720 --> 00:24:19,140
how does this do for

517
00:24:18,360 --> 00:24:21,900
um

518
00:24:19,140 --> 00:24:24,480
compat a size comparison compared to the

519
00:24:21,900 --> 00:24:26,340
original input well we're going to need

520
00:24:24,480 --> 00:24:27,380
to work in bits here as you've probably

521
00:24:26,340 --> 00:24:30,059
guessed

522
00:24:27,380 --> 00:24:33,919
and so we need to multiply it by eight

523
00:24:30,059 --> 00:24:36,659
so we've got 250 216 bits in the input

524
00:24:33,919 --> 00:24:39,419
so we're going to write a string of

525
00:24:36,659 --> 00:24:42,059
length 3 which in Fibonacci coding is

526
00:24:39,419 --> 00:24:46,760
zero zero one one and then we can write

527
00:24:42,059 --> 00:24:51,840
the the initial 8 bit characters

528
00:24:46,760 --> 00:24:55,140
then we write a a symbol run of length

529
00:24:51,840 --> 00:24:57,840
one that is 1 1 and then the symbol

530
00:24:55,140 --> 00:25:01,500
number we needed was

531
00:24:57,840 --> 00:25:03,780
001 there were five b five symbols in

532
00:25:01,500 --> 00:25:06,120
the table at that point

533
00:25:03,780 --> 00:25:10,380
Which is less than eight so we needed at

534
00:25:06,120 --> 00:25:11,039
least three bits to encode that number

535
00:25:10,380 --> 00:25:13,200
um

536
00:25:11,039 --> 00:25:14,120
so we've written a total of five bits

537
00:25:13,200 --> 00:25:18,059
there

538
00:25:14,120 --> 00:25:22,559
we then write two characters

539
00:25:18,059 --> 00:25:25,740
um and that takes 19 bits and then this

540
00:25:22,559 --> 00:25:26,580
sequence we've got less than 16

541
00:25:25,740 --> 00:25:28,860
um

542
00:25:26,580 --> 00:25:32,779
symbols in the symbol table when we're

543
00:25:28,860 --> 00:25:34,640
writing this so we can use

544
00:25:32,779 --> 00:25:37,400
4-bit

545
00:25:34,640 --> 00:25:41,760
4-bit symbol numbers

546
00:25:37,400 --> 00:25:47,460
prefixed by the length in six bits is 38

547
00:25:41,760 --> 00:25:50,159
bits for a total of 90 by 90 bits

548
00:25:47,460 --> 00:25:51,960
which I have I I think that's pretty

549
00:25:50,159 --> 00:25:52,740
cool right

550
00:25:51,960 --> 00:25:55,080
um

551
00:25:52,740 --> 00:25:59,279
how does that compare if we were doing

552
00:25:55,080 --> 00:26:02,220
that in lzw well we had 14 symbols and

553
00:25:59,279 --> 00:26:05,880
they're 12 bits each and so they're 168

554
00:26:02,220 --> 00:26:09,419
bits even if we go down to nine bits we

555
00:26:05,880 --> 00:26:11,640
still only get 126 bits out of lzw so

556
00:26:09,419 --> 00:26:13,500
this is actually Improvement I I feel

557
00:26:11,640 --> 00:26:14,580
like I'm I'm on the right track here

558
00:26:13,500 --> 00:26:16,080
somewhere

559
00:26:14,580 --> 00:26:18,840
okay

560
00:26:16,080 --> 00:26:23,159
so how does this do with more real world

561
00:26:18,840 --> 00:26:25,980
data so I ran it I ran this on its own

562
00:26:23,159 --> 00:26:28,580
source code and that got a ratio of

563
00:26:25,980 --> 00:26:31,740
about 2.56 to 1

564
00:26:28,580 --> 00:26:35,340
user shared words compressed down

565
00:26:31,740 --> 00:26:40,880
slightly less less okay how does that

566
00:26:35,340 --> 00:26:40,880
compare to something as old as gzip

567
00:26:41,880 --> 00:26:45,360
okay well we've still got work to do

568
00:26:44,039 --> 00:26:48,299
then

569
00:26:45,360 --> 00:26:51,720
um it's not surprising because I think

570
00:26:48,299 --> 00:26:53,700
there are a lot of improvements we that

571
00:26:51,720 --> 00:26:56,400
we can get but I would just want a

572
00:26:53,700 --> 00:27:00,600
couple cover a couple of things in the

573
00:26:56,400 --> 00:27:04,679
code uh which is at my GitHub address

574
00:27:00,600 --> 00:27:06,080
um before I sort of summarize on what

575
00:27:04,679 --> 00:27:08,820
um

576
00:27:06,080 --> 00:27:11,460
uh what what the current state of it is

577
00:27:08,820 --> 00:27:15,059
and where the future is going

578
00:27:11,460 --> 00:27:17,400
um so the code I'll firstly say is still

579
00:27:15,059 --> 00:27:21,480
extremely experimental there's a lot of

580
00:27:17,400 --> 00:27:23,580
ideas that are tried out that are that

581
00:27:21,480 --> 00:27:24,620
I've got you know

582
00:27:23,580 --> 00:27:28,260
um

583
00:27:24,620 --> 00:27:31,200
uh flags on to say if if we're using

584
00:27:28,260 --> 00:27:34,020
this this particular encoding try this

585
00:27:31,200 --> 00:27:36,299
that sort of stuff I'm trying to keep as

586
00:27:34,020 --> 00:27:39,299
many ideas in there because I want to

587
00:27:36,299 --> 00:27:41,039
find out which work what works and what

588
00:27:39,299 --> 00:27:45,360
what doesn't

589
00:27:41,039 --> 00:27:47,340
um and this is still early code so I I'm

590
00:27:45,360 --> 00:27:50,039
really still not sure which of these

591
00:27:47,340 --> 00:27:51,120
ideas might prove to be useful as as

592
00:27:50,039 --> 00:27:55,020
you'll see

593
00:27:51,120 --> 00:27:57,120
the second is uh people who know me will

594
00:27:55,020 --> 00:27:58,980
kind of recognize

595
00:27:57,120 --> 00:28:01,559
um my sense of humor here but there's

596
00:27:58,980 --> 00:28:04,559
the thing called the mabulate flag now

597
00:28:01,559 --> 00:28:08,220
mabulate is a word I invented it's what

598
00:28:04,559 --> 00:28:10,980
you do say uh if you've got a Rubik's

599
00:28:08,220 --> 00:28:13,559
Cube and you know that uh the way to

600
00:28:10,980 --> 00:28:16,740
solve it is to get all of the colors on

601
00:28:13,559 --> 00:28:20,580
each face all together and the way to do

602
00:28:16,740 --> 00:28:21,840
that is by rotating the faces but you

603
00:28:20,580 --> 00:28:23,640
don't know

604
00:28:21,840 --> 00:28:27,240
because we haven't learned yet the the

605
00:28:23,640 --> 00:28:29,640
specific algorithms to move things

606
00:28:27,240 --> 00:28:31,559
around but you think that if you keep on

607
00:28:29,640 --> 00:28:34,200
sort of playing with it something might

608
00:28:31,559 --> 00:28:36,059
end up in the right shape and you know

609
00:28:34,200 --> 00:28:37,440
and colors might be able to move

610
00:28:36,059 --> 00:28:39,600
together and you'll kind of gradually

611
00:28:37,440 --> 00:28:40,919
learn your way around but you're not

612
00:28:39,600 --> 00:28:42,360
quite sure what you're doing so that's

613
00:28:40,919 --> 00:28:46,740
mabulating

614
00:28:42,360 --> 00:28:49,980
so the mabulate flag works like this it

615
00:28:46,740 --> 00:28:52,620
is basically that process that we did

616
00:28:49,980 --> 00:28:57,360
we're where we're looking at whether

617
00:28:52,620 --> 00:28:59,340
this symbol is a prefix whether we've

618
00:28:57,360 --> 00:29:01,260
possibly got a another symbol because

619
00:28:59,340 --> 00:29:03,900
we've got the prefix

620
00:29:01,260 --> 00:29:06,240
If the symbol hasn't matched and the

621
00:29:03,900 --> 00:29:09,900
last character is a prefix then we set

622
00:29:06,240 --> 00:29:11,940
that flag and we that's to remember the

623
00:29:09,900 --> 00:29:13,620
character that we're currently on

624
00:29:11,940 --> 00:29:15,900
because it could be the start of the new

625
00:29:13,620 --> 00:29:19,679
the new symbol

626
00:29:15,900 --> 00:29:21,120
If the symbol hasn't matched and the

627
00:29:19,679 --> 00:29:23,159
last character

628
00:29:21,120 --> 00:29:25,980
or the last the symbol that we're

629
00:29:23,159 --> 00:29:32,100
looking for is not a symbol

630
00:29:25,980 --> 00:29:34,980
then we go we now need to go to literals

631
00:29:32,100 --> 00:29:36,840
and it's the nebula flag is set at that

632
00:29:34,980 --> 00:29:38,820
point it means that we previously looked

633
00:29:36,840 --> 00:29:41,940
at a character which might have been a

634
00:29:38,820 --> 00:29:43,440
prefix but turns out not to be and at

635
00:29:41,940 --> 00:29:46,260
that point we need to Output that

636
00:29:43,440 --> 00:29:50,100
character as well so it's a bit

637
00:29:46,260 --> 00:29:51,899
complicated and as I said I I scratched

638
00:29:50,100 --> 00:29:53,700
my head for a long time trying to find

639
00:29:51,899 --> 00:29:55,260
the right word for this and I just

640
00:29:53,700 --> 00:29:58,380
couldn't so I used mabulate because

641
00:29:55,260 --> 00:30:00,539
that's me okay so the other thing you

642
00:29:58,380 --> 00:30:02,760
need to know about the code is that

643
00:30:00,539 --> 00:30:06,419
there's this array this thing called

644
00:30:02,760 --> 00:30:08,640
variable called to write that if we're

645
00:30:06,419 --> 00:30:10,140
in literal mode is a string of the

646
00:30:08,640 --> 00:30:12,840
physical characters we're going to write

647
00:30:10,140 --> 00:30:16,260
out if it if we're in symbol mode that

648
00:30:12,840 --> 00:30:19,080
is a list of strings so don't enforce

649
00:30:16,260 --> 00:30:20,220
type checking on that because it's going

650
00:30:19,080 --> 00:30:25,500
to change

651
00:30:20,220 --> 00:30:27,299
okay so where are we uh with LZ PW the

652
00:30:25,500 --> 00:30:29,460
first thing I do need to share with you

653
00:30:27,299 --> 00:30:32,820
is the failures

654
00:30:29,460 --> 00:30:36,240
um the one really interesting thing I

655
00:30:32,820 --> 00:30:38,399
discovered almost by accident uh was

656
00:30:36,240 --> 00:30:42,059
that I had started on a totally wrong

657
00:30:38,399 --> 00:30:45,899
premise I thought that it would be more

658
00:30:42,059 --> 00:30:49,620
efficient to write the symbols in the

659
00:30:45,899 --> 00:30:52,799
symbol table using Fibonacci codes and

660
00:30:49,620 --> 00:30:55,020
if we wrote those in uh a if we used

661
00:30:52,799 --> 00:30:57,419
like a move to front algorithm or we

662
00:30:55,020 --> 00:31:01,980
favored the least

663
00:30:57,419 --> 00:31:03,960
so the yeah the most recent symbol uh

664
00:31:01,980 --> 00:31:06,500
the most recent symbol is the most

665
00:31:03,960 --> 00:31:09,600
complex and it's the most the one most

666
00:31:06,500 --> 00:31:14,520
recently found in the input so if we can

667
00:31:09,600 --> 00:31:15,960
match that using fewer bits then we we

668
00:31:14,520 --> 00:31:18,059
should actually

669
00:31:15,960 --> 00:31:21,059
like this is how I was thinking which

670
00:31:18,059 --> 00:31:23,880
that should give us better compression

671
00:31:21,059 --> 00:31:26,940
than if we're just using a fixed width

672
00:31:23,880 --> 00:31:28,919
number of bits based on the symbol table

673
00:31:26,940 --> 00:31:31,200
size at that point in time

674
00:31:28,919 --> 00:31:33,840
and I tried this out and I had spent

675
00:31:31,200 --> 00:31:35,580
months working on that and then I was

676
00:31:33,840 --> 00:31:37,440
going through tidying up the code trying

677
00:31:35,580 --> 00:31:38,840
to make it vaguely presentable for this

678
00:31:37,440 --> 00:31:42,720
conference

679
00:31:38,840 --> 00:31:46,080
and I uh one of the things that uh was

680
00:31:42,720 --> 00:31:48,539
set in there was the ability to use

681
00:31:46,080 --> 00:31:51,320
variable length Fibonacci coding for

682
00:31:48,539 --> 00:31:57,840
symbols or still use fixed length coding

683
00:31:51,320 --> 00:31:59,460
and in one of my tests I ran it and left

684
00:31:57,840 --> 00:32:02,820
the flag off

685
00:31:59,460 --> 00:32:04,260
so it was using fixed length symbols and

686
00:32:02,820 --> 00:32:07,500
it was shorter

687
00:32:04,260 --> 00:32:09,299
I'm like okay well that's that idea out

688
00:32:07,500 --> 00:32:11,520
the window why is that

689
00:32:09,299 --> 00:32:13,380
and then I realized

690
00:32:11,520 --> 00:32:16,080
that

691
00:32:13,380 --> 00:32:17,880
if you think about the idea of Fibonacci

692
00:32:16,080 --> 00:32:21,240
coding part of the reason of that final

693
00:32:17,880 --> 00:32:23,880
one is that it has to tell the decoder

694
00:32:21,240 --> 00:32:26,520
this is where you where you stop it

695
00:32:23,880 --> 00:32:29,580
needs there needs to be an explicit

696
00:32:26,520 --> 00:32:32,460
piece of information passed from the

697
00:32:29,580 --> 00:32:34,080
encoder to the decoder to say this is

698
00:32:32,460 --> 00:32:36,779
the length of that symbol

699
00:32:34,080 --> 00:32:40,140
but if we already know the symbol table

700
00:32:36,779 --> 00:32:42,360
size that's an implicit piece of data

701
00:32:40,140 --> 00:32:45,000
that both the encoder and the decoder

702
00:32:42,360 --> 00:32:47,399
know and therefore it doesn't need to be

703
00:32:45,000 --> 00:32:49,620
transmitted and there therefore we can

704
00:32:47,399 --> 00:32:51,779
actually use more of the the bit space

705
00:32:49,620 --> 00:32:54,899
to transmit information as opposed to

706
00:32:51,779 --> 00:32:57,120
avoiding writing two ones together

707
00:32:54,899 --> 00:33:00,299
so that was a discovery

708
00:32:57,120 --> 00:33:03,720
um but I'm glad I discovered it right

709
00:33:00,299 --> 00:33:06,600
um the other idea that I've had in there

710
00:33:03,720 --> 00:33:09,240
which the moment I pretty much removed

711
00:33:06,600 --> 00:33:10,500
but it's there because it you know you

712
00:33:09,240 --> 00:33:15,480
never know

713
00:33:10,500 --> 00:33:17,700
um is I the idea of having a header in

714
00:33:15,480 --> 00:33:20,640
front of literal blocks and in front of

715
00:33:17,700 --> 00:33:23,640
symbol blocks that would both indicate

716
00:33:20,640 --> 00:33:24,480
yes this is a little block or a symbol

717
00:33:23,640 --> 00:33:27,240
block

718
00:33:24,480 --> 00:33:29,880
and then might indicate some Flags about

719
00:33:27,240 --> 00:33:32,340
that I'll cover some of the reasons why

720
00:33:29,880 --> 00:33:34,380
I think this might be useful there is

721
00:33:32,340 --> 00:33:36,480
never a situation where you need to

722
00:33:34,380 --> 00:33:37,740
write two literal blocks one after the

723
00:33:36,480 --> 00:33:40,620
other

724
00:33:37,740 --> 00:33:42,720
because all you will always find it is

725
00:33:40,620 --> 00:33:43,799
more efficient to Simply join them

726
00:33:42,720 --> 00:33:46,140
together

727
00:33:43,799 --> 00:33:49,019
it's possible that we might need to

728
00:33:46,140 --> 00:33:51,539
write two symbol blocks one after the

729
00:33:49,019 --> 00:33:53,519
other and that would most likely be when

730
00:33:51,539 --> 00:33:56,039
we have we're just about to change

731
00:33:53,519 --> 00:33:58,799
symbol sizes

732
00:33:56,039 --> 00:34:01,559
and we want to write the all of the

733
00:33:58,799 --> 00:34:03,419
symbols currently in the buffer out in

734
00:34:01,559 --> 00:34:05,100
the shorter format and then move on to

735
00:34:03,419 --> 00:34:08,639
the longer format

736
00:34:05,100 --> 00:34:12,119
but I'm not sure yet if having an extra

737
00:34:08,639 --> 00:34:15,060
bit in the header of the symbol list

738
00:34:12,119 --> 00:34:17,820
is worth that overhead

739
00:34:15,060 --> 00:34:20,580
it we're going to experiment right okay

740
00:34:17,820 --> 00:34:23,879
so uh current features are basically

741
00:34:20,580 --> 00:34:26,339
that it does do efficient literal coding

742
00:34:23,879 --> 00:34:30,139
and it does do efficient variable coding

743
00:34:26,339 --> 00:34:33,599
of symbols and I'm really proud of that

744
00:34:30,139 --> 00:34:36,599
the future features one of the main ones

745
00:34:33,599 --> 00:34:38,879
that I'm working on is the ability to

746
00:34:36,599 --> 00:34:40,500
backtrack through how you're encoding

747
00:34:38,879 --> 00:34:43,260
the symbols at this point

748
00:34:40,500 --> 00:34:45,240
and the easiest way to demonstrate this

749
00:34:43,260 --> 00:34:47,940
I've found it's a bit of a boring string

750
00:34:45,240 --> 00:34:49,619
it's not as good as bananas but it is

751
00:34:47,940 --> 00:34:52,260
this

752
00:34:49,619 --> 00:34:55,080
so as we're reading through that to

753
00:34:52,260 --> 00:34:59,460
write it we write the first five

754
00:34:55,080 --> 00:35:02,640
characters the then have a b b c c d and

755
00:34:59,460 --> 00:35:05,520
d dot dot a

756
00:35:02,640 --> 00:35:09,599
then we can write a b c d and at that

757
00:35:05,520 --> 00:35:10,920
point we add the ABC symbol

758
00:35:09,599 --> 00:35:14,940
um

759
00:35:10,920 --> 00:35:18,420
we add we have to add a literal comma

760
00:35:14,940 --> 00:35:20,700
and then the symbol table has a b c d

761
00:35:18,420 --> 00:35:24,300
and the ABC symbol

762
00:35:20,700 --> 00:35:24,960
so we're about to read ABC

763
00:35:24,300 --> 00:35:29,460
um

764
00:35:24,960 --> 00:35:32,099
we get to ABC and we're looking at D we

765
00:35:29,460 --> 00:35:35,339
we're mapulating that we're looking

766
00:35:32,099 --> 00:35:37,980
ahead to C and there's no DC in the

767
00:35:35,339 --> 00:35:40,200
symbol table so we would have to write a

768
00:35:37,980 --> 00:35:42,180
literal D at that point

769
00:35:40,200 --> 00:35:44,420
and then move on

770
00:35:42,180 --> 00:35:48,180
but

771
00:35:44,420 --> 00:35:50,820
what if we could instead write that as

772
00:35:48,180 --> 00:35:54,060
the symbols a b c d

773
00:35:50,820 --> 00:35:57,180
we've rearranged these the symbols that

774
00:35:54,060 --> 00:36:00,599
we're encoding and then we can write a

775
00:35:57,180 --> 00:36:04,619
second CD and where that it is always

776
00:36:00,599 --> 00:36:11,339
more efficient like that uh

777
00:36:04,619 --> 00:36:13,380
second D there was 10 bits of stepping

778
00:36:11,339 --> 00:36:16,320
back into literals

779
00:36:13,380 --> 00:36:18,359
that's a that's a 10-bit symbol that we

780
00:36:16,320 --> 00:36:21,260
could put in there that's that's you

781
00:36:18,359 --> 00:36:24,420
know where if we're a

782
00:36:21,260 --> 00:36:25,980
1024 symbols in our table we're still

783
00:36:24,420 --> 00:36:27,540
going to be more efficient just to write

784
00:36:25,980 --> 00:36:29,400
that one character and we're probably

785
00:36:27,540 --> 00:36:33,000
going to write more than one which means

786
00:36:29,400 --> 00:36:37,320
we're probably saving 19 bits so

787
00:36:33,000 --> 00:36:39,900
um we've got that that option of uh

788
00:36:37,320 --> 00:36:44,579
thinking about how we encode those

789
00:36:39,900 --> 00:36:50,240
symbols it then does mean that it we may

790
00:36:44,579 --> 00:36:51,859
change what symbols uh the

791
00:36:50,240 --> 00:36:54,780
decoder

792
00:36:51,859 --> 00:36:57,720
sees in its symbol table and this is

793
00:36:54,780 --> 00:36:59,280
something that's really important this

794
00:36:57,720 --> 00:37:01,980
is where I'm going to have to really

795
00:36:59,280 --> 00:37:03,420
play with this to try and try and

796
00:37:01,980 --> 00:37:06,359
understand what the effects of that

797
00:37:03,420 --> 00:37:08,579
rearrangement are the other thing that I

798
00:37:06,359 --> 00:37:10,440
found when I was playing around with

799
00:37:08,579 --> 00:37:12,540
this it worked fine with some of my test

800
00:37:10,440 --> 00:37:15,599
examples and I gave it the code of

801
00:37:12,540 --> 00:37:18,180
symbolizer to encode and it used all 32

802
00:37:15,599 --> 00:37:20,160
gigs of my machine and I had to kill it

803
00:37:18,180 --> 00:37:22,619
because it was running on flat out on a

804
00:37:20,160 --> 00:37:26,280
CPU and the reason for that was that it

805
00:37:22,619 --> 00:37:29,280
was looking through a list of about a

806
00:37:26,280 --> 00:37:30,420
string probably in the order of

807
00:37:29,280 --> 00:37:32,700
um

808
00:37:30,420 --> 00:37:36,000
80 to 90 characters but it was doing

809
00:37:32,700 --> 00:37:39,180
this like thousands of rearrangements of

810
00:37:36,000 --> 00:37:41,880
that finding every possible way that you

811
00:37:39,180 --> 00:37:45,420
could rewrite that so I have some ideas

812
00:37:41,880 --> 00:37:49,619
on how to improve that the other way the

813
00:37:45,420 --> 00:37:52,020
other ideas that I think we I want to

814
00:37:49,619 --> 00:37:54,839
explore here are that

815
00:37:52,020 --> 00:37:57,720
uh it would be good rather than having

816
00:37:54,839 --> 00:37:59,760
some sort of signal to say throw away

817
00:37:57,720 --> 00:38:02,339
your symbol table and start again from

818
00:37:59,760 --> 00:38:05,220
scratch if we could throw away the

819
00:38:02,339 --> 00:38:06,619
symbols we never needed again or the

820
00:38:05,220 --> 00:38:10,260
symbols that just hadn't been used

821
00:38:06,619 --> 00:38:11,760
recently we could tell the decoder to

822
00:38:10,260 --> 00:38:15,660
forget those

823
00:38:11,760 --> 00:38:18,540
even if we used fixed length coding

824
00:38:15,660 --> 00:38:23,520
maybe we still need a move to front idea

825
00:38:18,540 --> 00:38:25,680
so that uh the the uh the later symbols

826
00:38:23,520 --> 00:38:28,859
the least recently used symbols can be

827
00:38:25,680 --> 00:38:32,700
thrown out rather than say going up to a

828
00:38:28,859 --> 00:38:35,820
higher symbol table a symbol length what

829
00:38:32,700 --> 00:38:37,859
if we had literals that don't create

830
00:38:35,820 --> 00:38:40,320
symbols we know we've got a header that

831
00:38:37,859 --> 00:38:41,640
we never ever see again in the input and

832
00:38:40,320 --> 00:38:45,180
we just say

833
00:38:41,640 --> 00:38:46,440
the decoder should not encode symbols

834
00:38:45,180 --> 00:38:47,700
from this because you're just placing

835
00:38:46,440 --> 00:38:50,099
symbol space

836
00:38:47,700 --> 00:38:52,680
what if we had uh literals that don't

837
00:38:50,099 --> 00:38:56,579
create output but create symbols we

838
00:38:52,680 --> 00:38:58,680
could say recognize some patterns uh the

839
00:38:56,579 --> 00:38:59,760
you know in the first sort of

840
00:38:58,680 --> 00:39:03,960
um

841
00:38:59,760 --> 00:39:07,200
a 10 20 100 symbols find the most

842
00:39:03,960 --> 00:39:10,980
efficient way to write those in a

843
00:39:07,200 --> 00:39:13,020
compact string and then transmit that to

844
00:39:10,980 --> 00:39:14,480
build the symbol table that we're then

845
00:39:13,020 --> 00:39:16,800
going to use

846
00:39:14,480 --> 00:39:19,020
to get

847
00:39:16,800 --> 00:39:22,099
um just sort of almost immediately go

848
00:39:19,020 --> 00:39:27,119
into symbol output

849
00:39:22,099 --> 00:39:29,160
uh what if we are encoding not just one

850
00:39:27,119 --> 00:39:31,020
stream but multiple streams like a tar

851
00:39:29,160 --> 00:39:32,720
file or something like that

852
00:39:31,020 --> 00:39:36,000
could be maybe

853
00:39:32,720 --> 00:39:39,300
sometimes share this symbol table with

854
00:39:36,000 --> 00:39:40,800
the next file or throw it away and the

855
00:39:39,300 --> 00:39:42,180
three we're reading something completely

856
00:39:40,800 --> 00:39:42,960
different

857
00:39:42,180 --> 00:39:46,920
um

858
00:39:42,960 --> 00:39:49,440
this last one was prompted from a friend

859
00:39:46,920 --> 00:39:50,540
of mine who asked me would it be how

860
00:39:49,440 --> 00:39:52,920
would it Go

861
00:39:50,540 --> 00:39:54,540
reading Chinese characters and I

862
00:39:52,920 --> 00:39:56,579
realized that actually at the moment

863
00:39:54,540 --> 00:39:58,500
I've been talking about bytes throughout

864
00:39:56,579 --> 00:40:02,700
my code but in fact my code is written

865
00:39:58,500 --> 00:40:07,200
around Python's interpretation of utf-8

866
00:40:02,700 --> 00:40:09,420
and that means that if the literals

867
00:40:07,200 --> 00:40:12,480
still contain UTF cat 8 characters which

868
00:40:09,420 --> 00:40:14,579
are perfectly readable in as an input

869
00:40:12,480 --> 00:40:15,540
stream they're still just bytes at the

870
00:40:14,579 --> 00:40:18,960
end of the day

871
00:40:15,540 --> 00:40:23,099
but the symbol table is now not built up

872
00:40:18,960 --> 00:40:26,579
of those bytes but of the characters

873
00:40:23,099 --> 00:40:29,579
that have more meaning in the text

874
00:40:26,579 --> 00:40:32,880
that too could be you know could give us

875
00:40:29,579 --> 00:40:36,020
more and Co more compression if we're

876
00:40:32,880 --> 00:40:39,119
reading something like wav files

877
00:40:36,020 --> 00:40:42,480
which are notoriously hard to compress

878
00:40:39,119 --> 00:40:47,339
because the uh if you look looking at

879
00:40:42,480 --> 00:40:50,640
say a 16-bit PCM encoded each sample is

880
00:40:47,339 --> 00:40:54,420
two bytes but there is no relationship

881
00:40:50,640 --> 00:40:57,540
between essentially between the the

882
00:40:54,420 --> 00:41:00,839
least the the lowest byte of sample one

883
00:40:57,540 --> 00:41:04,859
and the highest byte of sample two

884
00:41:00,839 --> 00:41:08,099
but as 16 bit words they are absolutely

885
00:41:04,859 --> 00:41:11,099
related and if we could encode them as

886
00:41:08,099 --> 00:41:13,380
symbols rather than simply treating it

887
00:41:11,099 --> 00:41:15,839
as a stream of bytes I think that would

888
00:41:13,380 --> 00:41:17,160
give us improvements as well so there's

889
00:41:15,839 --> 00:41:20,760
a whole bunch of ideas that I'm

890
00:41:17,160 --> 00:41:22,859
percolating away I'm still working on

891
00:41:20,760 --> 00:41:25,430
the code but I'm hoping to have some

892
00:41:22,859 --> 00:41:29,420
more results soon thank you

893
00:41:25,430 --> 00:41:31,740
[Applause]

894
00:41:29,420 --> 00:41:34,020
thank you

895
00:41:31,740 --> 00:41:36,119
any questions

896
00:41:34,020 --> 00:41:38,220
um

897
00:41:36,119 --> 00:41:42,240
unfortunately we're actually out of time

898
00:41:38,220 --> 00:41:44,460
right sorry and we're in room change so

899
00:41:42,240 --> 00:41:46,320
we we can't really go over time like we

900
00:41:44,460 --> 00:41:48,000
did do before lunch

901
00:41:46,320 --> 00:41:50,280
um so wondering how would you be okay

902
00:41:48,000 --> 00:41:52,560
with people come up and ask me any

903
00:41:50,280 --> 00:41:54,480
questions you like I'd love to have

904
00:41:52,560 --> 00:41:56,440
collaboration on this thank you thank

905
00:41:54,480 --> 00:41:59,579
you thank you once again

906
00:41:56,440 --> 00:41:59,579
[Applause]