 Introduction to liblzma
 -----------------------

 Writing applications to work with liblzma

     The liblzma API is split into several subheaders to improve readability
     and maintenance. The subheaders must not be #included directly; simply
     use `#include <lzma.h>' instead.

     Those who have used zlib should find liblzma's API easy to use.
     To developers who haven't used zlib before, I recommend learning
     zlib first, because zlib has excellent documentation.

     While the API is similar to that of zlib, there are some major
     differences, which are summarized below.

     For basic stream encoding, zlib has three functions (deflateInit(),
     deflate(), and deflateEnd()). Similarly, there are three functions
     for stream decoding (inflateInit(), inflate(), and inflateEnd()).
     liblzma has only one coding function and one ending function, which
     are shared by all coders. Thus, to encode one may use, for example,
     lzma_stream_encoder_single(), lzma_code(), and lzma_end(). Similarly,
     for decoding one may use lzma_auto_decoder(), lzma_code(), and
     lzma_end().

     zlib has deflateReset() and inflateReset() to reset the stream
     structure without reallocating all the memory. In liblzma, all
     coder initialization functions are like zlib's reset functions:
     the first-time initializations are done with the same functions
     as the reinitializations (resetting).

     To make all this work, liblzma needs to know when lzma_stream
     doesn't already point to an allocated and initialized coder.
     This is achieved by initializing the lzma_stream structure with
     LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR
     (for example, when a new lzma_stream has been allocated with malloc()).
     This initialization should be done exactly once per lzma_stream
     structure to avoid leaking memory. Calling lzma_end() leaves the
     lzma_stream in a state comparable to the state achieved with
     LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR.

     An example probably clarifies a lot. With zlib, compression goes
     roughly like this:

         z_stream strm;
         deflateInit(&strm, level);
         deflate(&strm, Z_NO_FLUSH);
         deflate(&strm, Z_NO_FLUSH);
         ...
         deflate(&strm, Z_FINISH);
         deflateEnd(&strm) or deflateReset(&strm)

     With liblzma, it's slightly different:

         lzma_stream strm = LZMA_STREAM_INIT;
         lzma_stream_encoder_single(&strm, &options);
         lzma_code(&strm, LZMA_RUN);
         lzma_code(&strm, LZMA_RUN);
         ...
         lzma_code(&strm, LZMA_FINISH);
         lzma_end(&strm) or reinitialize for new coding work

     Reinitialization in the last step can be done with any function that
     can initialize lzma_stream; it doesn't need to be the same function
     that was used for the previous initialization. If it is the same
     function, liblzma will usually be able to reuse most of the
     existing memory allocations (depending on how much the initialization
     options change). If you reinitialize with a different function,
     liblzma will automatically free the memory of the previous coder.
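
     The lifecycle rules above can be modelled in a few lines of plain C.
     The sketch below is not liblzma code: toy_stream, TOY_STREAM_INIT,
     toy_init(), and toy_end() are hypothetical names invented for this
     illustration, and the 64-byte allocation merely stands in for real
     coder state. It only shows why one function can serve as both the
     first-time initializer and the reset, provided the structure starts
     out in a known "no coder" state.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; none of these names
 * are part of liblzma. */
enum toy_kind { TOY_NONE, TOY_ENCODER, TOY_DECODER };

typedef struct {
    enum toy_kind kind;  /* which coder currently owns `state` */
    void *state;         /* coder-private memory, NULL if none */
    unsigned allocs;     /* counts malloc() calls, for demonstration */
} toy_stream;

/* Plays the role that LZMA_STREAM_INIT plays in the text: it marks the
 * structure as "no coder allocated yet". */
#define TOY_STREAM_INIT { TOY_NONE, NULL, 0 }

/* Frees the coder and marks the stream as having no coder again, so
 * calling it twice is harmless. The demonstration counter is kept. */
static void toy_end(toy_stream *strm)
{
    free(strm->state);
    strm->state = NULL;
    strm->kind = TOY_NONE;
}

/* One function serves as both first-time init and reset. Returns 0 on
 * success and -1 if memory allocation fails. */
static int toy_init(toy_stream *strm, enum toy_kind kind)
{
    if (strm->kind != kind)
        toy_end(strm);           /* different coder: drop the old one */

    if (strm->state == NULL) {   /* nothing to reuse: allocate */
        strm->state = malloc(64);
        if (strm->state == NULL)
            return -1;
        strm->allocs++;
    }

    strm->kind = kind;
    return 0;
}
```

     Calling toy_init() twice with the same kind reuses the allocation,
     while switching kinds frees the old state first. Skipping
     TOY_STREAM_INIT would leave `state' uninitialized, which is the
     situation the initialization macros in the text are meant to prevent.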


 File formats

     liblzma supports multiple container formats for the compressed data.
     Different initialization functions initialize the lzma_stream to
     process different container formats. See the details from the public
     header files.

     The following functions are the most commonly used:

       - lzma_stream_encoder_single(): Encodes a Single-Block Stream; this
         is the recommended format for most purposes.

       - lzma_alone_encoder(): Useful if you need to encode into the
         legacy LZMA_Alone format.

       - lzma_auto_decoder(): Decoder that automatically detects the
         file format; recommended when you decode compressed files on
         disk, because this way compatibility with the legacy LZMA_Alone
         format is transparent.

       - lzma_stream_decoder(): Decoder for Single- and Multi-Block
         Streams; this is good if you want to accept only .lzma Streams.


 Filters

     liblzma supports multiple filters (algorithm implementations). The new
     .lzma format supports filter chains of up to seven filters. In a
     filter chain, the output of one filter is the input of the next filter
     in the chain. The legacy LZMA_Alone format supports only one filter,
     and that must always be LZMA.

         General-purpose compression:

             LZMA        The main algorithm of liblzma (surprise!)

         Branch/Call/Jump filters for executables:

             x86         This filter is known as BCJ in 7-Zip
             IA64        IA-64 (Itanium)
             PowerPC     Big endian PowerPC
             ARM
             ARM-Thumb
             SPARC

         Other filters:

             Copy        Dummy filter that simply copies all the data
                         from input to output.

             Subblock    Multi-purpose filter that can
                           - embed End of Payload Marker if the previous
                             filter in the chain doesn't support it; and
                           - apply Subfilters, which filter only part
                             of the same compressed Block in the Stream.

     Branch/Call/Jump filters never change the size of the data. They
     should usually be used as a pre-filter for some compression filter
     like LZMA.
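
     To make the size-preserving property concrete, here is a minimal
     sketch of the idea behind an x86 Branch/Call/Jump filter. It is a
     simplification written for this text, not liblzma's actual filter:
     toy_bcj_x86() is a hypothetical name, only the CALL rel32 opcode
     (0xE8) is handled, and every match is converted without the checks a
     real filter would use to avoid false positives. The filter rewrites
     the relative call target as an absolute one, so calls to the same
     function from different places become identical byte sequences that
     a compressor such as LZMA can match.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified illustration of a Branch/Call/Jump filter for x86.
 * It handles only the CALL rel32 opcode (0xE8) and converts every
 * match; this is an assumption made for brevity and is not the exact
 * algorithm of liblzma's x86 filter. */
static void toy_bcj_x86(uint8_t *buf, size_t size, int encode)
{
    size_t i = 0;

    while (size >= 5 && i <= size - 5) {
        if (buf[i] != 0xE8) {
            ++i;
            continue;
        }

        /* Read the little-endian 32-bit operand after the opcode. */
        uint32_t value = (uint32_t)buf[i + 1]
                | ((uint32_t)buf[i + 2] << 8)
                | ((uint32_t)buf[i + 3] << 16)
                | ((uint32_t)buf[i + 4] << 24);

        /* Relative targets are measured from the next instruction. */
        uint32_t next = (uint32_t)(i + 5);

        /* Encoding makes the target absolute, decoding restores the
         * relative form. Unsigned arithmetic wraps modulo 2^32. */
        value = encode ? value + next : value - next;

        buf[i + 1] = (uint8_t)value;
        buf[i + 2] = (uint8_t)(value >> 8);
        buf[i + 3] = (uint8_t)(value >> 16);
        buf[i + 4] = (uint8_t)(value >> 24);

        i += 5;  /* skip the operand so it is never re-scanned */
    }
}
```

     The buffer is modified in place and its length never changes. Because
     the operand bytes are skipped in both directions, the encoder and the
     decoder find the same opcodes, so decoding exactly undoes encoding.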


 Integrity checks

     The .lzma Stream format uses CRC32 as the integrity check for the
     various file format headers. It is possible to omit CRC32 from
     the Block Headers, but not from the Stream Header. This is why the
     CRC32 code cannot be disabled when building liblzma (in addition,
     the LZMA encoder uses CRC32 for hashing, so that's another reason).

     The integrity check of the actual data is calculated from the
     uncompressed data. This check can be CRC32, CRC64, or SHA256.
     It can also be omitted completely, although that usually is not
     a good thing to do. There are free IDs left, so support for new
     check algorithms can be added later.
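
     For readers unfamiliar with such checks, the following is a minimal
     bitwise CRC32 sketch. It assumes the common CRC32 variant (reflected
     polynomial 0xEDB88320, as used by zlib and Ethernet); this text does
     not state which parameters the .lzma format uses, so treat that as an
     assumption. toy_crc32() is a hypothetical helper, not a liblzma
     function, and real implementations normally use lookup tables for
     speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 with the common reflected polynomial 0xEDB88320.
 * Pass 0 as `crc` for the first chunk and the previous return value
 * for each following chunk, so data can be checked piece by piece. */
static uint32_t toy_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];

        /* Process one bit at a time: shift right and apply the
         * polynomial whenever the bit shifted out was set. */
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
    }

    return ~crc;
}
```

     Because the check is calculated from the uncompressed data, a decoder
     can feed each chunk of output to such a function as it is produced
     and compare the final value with the one stored in the file.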


 API and ABI stability

     The API and ABI of liblzma aren't stable yet, although no huge
     changes should happen. One potential place for change is the
     lzma_options_subblock structure.

     In the 4.42.0alpha phase, the shared library version number won't
     be updated even if ABI breaks. I don't want to track the ABI changes
     yet. Just rebuild everything when you upgrade liblzma until we get
     to the beta stage.


 Size of the library

     While liblzma isn't huge, it is quite far from the smallest possible
     LZMA implementation: the full liblzma binary (with support for all
     filters and other features) is well over 100 KiB, but the plain raw
     LZMA decoder is only 5-10 KiB.

     To decrease the size of the library, you can omit parts of the library
     by passing certain options to the `configure' script. Disabling
     everything but the decoders of the required filters will usually give
     you a small enough library, but if you need a decoder that is, for
     example, embedded in the operating system kernel, the code from
     liblzma probably isn't suitable as is.

     If you need a minimal implementation supporting .lzma Streams, you
     may need to do a partial rewrite. liblzma uses a stateful API like
     zlib. That increases the size of the library. Using a callback API or
     an even simpler buffer-to-buffer API would allow a smaller
     implementation.
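
     The difference between the two API styles can be sketched with a
     trivial coder that only copies data. Everything here is hypothetical
     and written for illustration: toy_copy_stream borrows its field names
     from zlib's z_stream and is not liblzma's lzma_stream, and
     toy_copy_code() and toy_copy_buffer() are invented names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical zlib-style stream for a trivial "copy" coder. The field
 * names mimic zlib's z_stream; this is not liblzma's lzma_stream. */
typedef struct {
    const uint8_t *next_in;
    size_t avail_in;
    uint8_t *next_out;
    size_t avail_out;
} toy_copy_stream;

/* Stateful style: copies as much as currently fits and updates the
 * stream, so the caller can refill or drain buffers and call again.
 * Returns 1 once all input has been consumed, 0 otherwise. */
static int toy_copy_code(toy_copy_stream *strm)
{
    size_t n = strm->avail_in < strm->avail_out
            ? strm->avail_in : strm->avail_out;

    if (n > 0)
        memcpy(strm->next_out, strm->next_in, n);

    strm->next_in += n;
    strm->avail_in -= n;
    strm->next_out += n;
    strm->avail_out -= n;

    return strm->avail_in == 0;
}

/* Buffer-to-buffer style: one call, no state kept between calls.
 * Returns 0 on success and -1 if the output buffer is too small. */
static int toy_copy_buffer(const uint8_t *in, size_t in_size,
        uint8_t *out, size_t out_size)
{
    if (in_size > out_size)
        return -1;

    if (in_size > 0)
        memcpy(out, in, in_size);

    return 0;
}
```

     The stateful version must be able to stop and resume at any point
     where a buffer runs out. A real coder has to save far more state at
     each such point than this copy loop does, which is where the extra
     code size comes from.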

     The LZMA SDK contains an LZMA decoder written in ANSI C that is
     smaller than the one in liblzma, so you may want to take a look at
     that code. However, it doesn't (at least not yet) support the new
     .lzma Stream format.


 Documentation

     There is no documentation yet other than the public headers and this
     text. Real docs will be written some day, I hope.
