Random seeking on gzip streams

Posted on 2006-10-16 (月) at 12:57 am by Florian Ragwitz [SIGNED]

Posted in: Hacking

After asking the lazyweb on how to seek on gzip streams I got a very useful and comprehensive reply by Paul Sladen.

As my previous searches on this topic didn't turn up to much useful things I'd like to publish Pauls reply (with his approval) here to help other people having the same problem as well:

Seeking on compressed streams is a little like reading sectors from a harddisk; to read one byte requires reading a much large sectors.

Two things are needed, the sector size and the sector length. On a hard-disk the sector length is nominally 512 with the mapping between an offset and sector start often as simple as: $offset >> 9.

In compressed streams, the relationship between stream position and sector start is non-linear. Using a non-linear mapping is more involved as searching a lookup table is required.

Mapping between an uncompressed position instead cannot be calculated and a 'zmap' table listing all sector start position and corresponding on-disk points needs to be created. The size of a sector must be determined aswell.

Bzip2:

It is fairly easy to seek on compressed bzip2 streams. In a Bzip2 stream, each block is totally separate and self contained---have a look a the 'bzip2recover' program. 'bzip2recover' scans through a bzip2 stream locating each block/'segment' (normally 900kB of input, so ~200kB of compressed output), then copying each segment to separate bzip2 file.

Gzip:

Seeking on gzip streams is somewhat more involved; Gzip uses a dictionary design where back-references are made to uncompressed data within the last 32kB ("the window").

The only safe place to start reading a 'sector' in a gzip stream is somewhere where that the dictionary size is zero. An empty dictionary occurs at the start of a stream, or where the stream has been 'reset'.

Gzip resets:

A gzip 'sector' (the length required to be read, to safely decode a byte) could be the full length of the file-size, or could occur more frequently. A gzip sector start occurs when the stream is reset---resetting the stream more frequently leads to more sector starts, but a reduction in compression thanks to less use of the sliding dictionary. Trade-off.

The "gzip --rsyncable" is an example of stream resetting; in this case the stream is reset (and the dictionary closed) each time that the sum of the last 4096 octets of input data, modulo 65536(?) equals a magic number (zero normally). In exchange for the extra reset/start points, you get a size increase of 4-5%, but a better "hit rate" for random access.

Gzip reset point lookup tables:

More likely you'd want to reset the stream every eg. 4096kB of input, then save the corresponding position of the output stream for a give input position. This lookup table of start positions needs to be stored

somewhere. 'squashfs' does in a separate file; 'zsync' in the zmap.

(If you wanted to define a standard, these reset points could be stored in the 'extra' or 'comment' fields at the start of a gzip file; storing them with a suitable header would enable some degree of seeking for sufficiently intelligent readers---files could be post-processed to add this data).

Storing a LUT takes space; adding extra reset points and storing the LUT data for these extra points takes more space---but I found storing the LUT as a set of double-deltas cut it down; could be gzipped aswell.

Gzip partial resets:

A gzip stream doesn't just contain 'reset' points, but also 'partial reset' points, a partial resets the huffman tables, but doesn't reset the dictionary. It's safe to stop reading at a partial reset (which

effectively shortens the sector length).

Whilst not normally possible, partial reset points /are/ able to be used as a starting point /if/ you can populate the previous uncompressed 32kB that forms the dictionary window from another source---'zsync' does this when reconstructing a file, however 'zsync' only does construction in a linear fashion, from the start of a file.

Further Reading:

Magic words to search for:

zsync, squashfs, apt-sync, succinct

Succinct is my unfinished project. Succinct involves many parts, one of which covers what you're doing. It might be good to break separate off seeking of compressed streams into a separate project---many small pieces make big problems easier!

What I might do is store the LUT in the EXTRA field of the gzip header (limited to 64kB though). Though, within 64kB it would always be possible to construct a useful LUT; based on the final length of the input data, it would be possible to reduce the granularity of the LUT such that it only mentioned 50%, 30%... even 5% of reset points. The LUT itself could be gzipped for extra gain.

Hope that's useful,

-Paul

PS. In all the cases (except one) that I mentioned gzip, I probably meant 'zlib' or 'deflate'! :)

In a later mail Paul also added the following:

BTW, in my search around a bit later for something I turned up two more things that might be interesting:

dictzip, examples/zran.c in the zlib source

'dictzip' is a similar idea to what I suggested at the end of the previous email. A lookup table is built and stored in the EXTRA section at the start of the gzip file and the utilities ('dictzcat' et al) then use this look-up table for random-access on the data.

'dictzip' are still slightly wide of the mark, forcing a flush/chunk every 58969 bytes. That number is chosen so that even in the worst case, the compressed version will never be more than 216 in length. However I think their math is slightly wrong as even in the worse case, gzip only generates 5 bytes per 32kB for an uncompresed store. -> ~65514 by my reckoning. Squashfs uncompressable chunks by setting the top bit of the file offset to indicate a true store.

Their lookup table format appears to be flexible enough that I /might/ be able to create a table by scanning an existing stream rather than re-encoding. Depends about the 64kB limitation. Mmm.

Thanks Paul!




Navigation

Categories

Login