Thesis (M.Tech. Information Technology)-Cape Technikon, Cape Town, 2000

Data stored on disks generally contain significant redundancy. A mechanism or algorithm that
recodes the data to lessen the data size could possibly double or triple the effective data that
could be stored on the media. One mechanism of doing this is by data compression.
Many compression algorithms currently exist, but each one has its own advantages as well as
disadvantages. The objective of this study', to formulate a new compression algorithm that
could be implemented in a real-time mode in any file system. The new compression algorithm
should also execute as fast as possible, so as not to cause a lag in the file systems performance.
This study focuses on binary data of any type, whereas previous articles such as (Huftnlan.
1952:1098), (Ziv & Lempel, 1977:337: 1978:530), (Storer & Szymanski. 1982:928) and
(Welch, 1984:8) have placed particular emphasis on text compression in their discussions of
compression algorithms for computer data.
The resulting compression algorithm that is formulated by this study is Lempel-Ziv-Toutlc
(LZT). LZT is basically an LZ77 (Ziv & Lempel, 1977:337) encoder with a buffer size equal in
size to that of the data block of the file system in question. LZT does not make this distinction,
it discards the sliding buffer principle and uses each data block of the entire input stream. as
one big buffer on which compression can be performed. LZT also handles the encoding of a
match slightly different to that of LZ77. An LZT match is encoded by two bit streams, the first
specifying the position of the match and the other specifying the length of the match. This
combination is commonly referred to as a <position, length> pair.
To encode the position portion of the <position, length> pair, we make use of a sliding scale
method. The sliding scale method works as follows. Let the position in the input buffer, of the
current character to be compressed be held by inpos, where inpos is initially set to 3. It is then
only possible for a match to occur at position 1 or 2. Hence the position of a match will never
be greater than 2, and therefore the position portion can be encoded using only 1 bit. As "inpos"
is incremented as each character is encoded, the match position range increases and therefore
more bits will be required to encode the match position.
The reason why a decimal 2 can be encoded 'sing only I bit can be explained as follows. When
decimal values are converted to binary values, we get 010 = 02, 110 = 12, 210, = 102etc. As a
position of 0 will never be used, it is possible to develop a coding scheme where a decimal
value of 1 can be represented by a binary value of 0, and a decimal value of 2 can be
represented by binary value of 1. Only I bit is therefore needed to encode match position I and
match position 2. In general. any decimal value n ca:) be represented by the binary equivalent
for (n - 1). The number of bits needed to encode (n - 1), indicates the number of bits needed to
encode the match position.
The length portion of the <position, length> pair is encoded using a variable length coding
(vlc) approach. The vlc method performs its encoding by using binary blocks. The first binary
block is 3 bits long, where binary values 000 through 110 represent decimal values I through 7.
Where binary value 0 represents decimal value I as previously explained when coding a match
position. This coding scheme is possible since no match length of 0 will ever be encoded. The
maximum binary value of a binary block is used to specify whether another binary block
follows the current binary block. in which case it is called the block to follow flag (bff). In this
case binary III specifies that there should be another binary block following this one. Next a 1
bit binary block is appended to the existing 3 bit binary block, resulting in a 7 bit binary block,
where binary value 111 0000 represents decimal value 8 and where the maximum binary value
of 111 1111 is meant to act as a bff. By continuing in this way the next binary block of bits are
appended. Each consecutive binary block is 1 bit bigger in size than the previous binary block.
The binary block size continues to grow until it reaches a size of 8 bits. At this point no further
increase to the binary block size is made and all subsequent binary blocks will be 8 bits in size.
A distinction has to be made between a literal or compressed character in the output stream.
This is needed in order to inform the decompressor of whether the next character that it reads
has to be decompressed or not. Making use of a match flag fulfils this requirement. The match
flags are encoded the same way as in LZ77 (Ziv & Lempel, 1977:337) where binary value 0
indicates that a literal character is encoded and binary value I indicates that a match pair is
encoded. Additionally any literal character that is encoded is encoded using static Huffman
(Section 2.3), where the frequency table is derived from the characters found in the current data
block.
The LZT algorithm is tested against various current compression algorithms using the full test
data set from the Calgary/Canterbury compression corpus. The Calgary/Canterbury
compression corpus is a set of text and binary files. specifically selected for use by the Internet
and academic community to test the efficiency of compression algorithms. The resulting
average compression ratio on the test data is 3.249 bits per byte, with an average 180 kilobytes
per second compression speed. The test machine is an Intel Pentium, running a 150 MHz
processor with 96 MB RAM.
In conclusion, it can be derived from this study that the LZT algorithm is more efficient for
implementation in a real-time environment than currently available compression algorithms
(Section 4.2). Since it has a higher compression ratio than that of currently available real-time
compression algorithms. It should also be noted that only 68 kilobytes of RAM is required by
the LZT program in order to execute successfully on a computer.