 Advanced features of liblzma
 ----------------------------

 0. Introduction

     Most developers need only the basic features of liblzma. These
     features allow single-threaded encoding and decoding of .lzma files
     in streamed mode.

     In some cases developers want more. The .lzma file format is
     designed to allow multi-threaded encoding and decoding and limited
     random-access reading. These features are possible in non-streamed
     mode and, to a limited extent, also in streamed mode.

     To take advantage of these features, the application needs a custom
     .lzma file format handler. liblzma provides a set of tools to ease
     this task, but it's still quite a bit of work to get a good custom
     .lzma handler done.


 1. Where to begin

     Start by reading the .lzma file format specification. Understanding
     the basics of the .lzma file structure is required to implement a
     custom .lzma file handler and to understand the rest of this document.


 2. The basic components

 2.1. Stream Header and tail

     Stream Header begins the .lzma Stream and Stream tail ends it. Stream
     Header is defined in the file format specification, but Stream tail
     isn't (thus I write "tail" with a lower-case letter). Stream tail is
     simply the Stream Flags and the Footer Magic Bytes fields together.
     liblzma does it this way because the Block coders take care of the
     remaining fields of the Stream Footer.

     For now, the size of Stream Header is fixed to 11 bytes. The header
     <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
     should use instead of a hardcoded number. Similarly, Stream tail
     is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.

     It is possible that a future version of the .lzma format will have
     a variable-sized Stream Header and tail. At the time of writing this
     seems so unlikely that it was considered simplest to use constants
     instead of providing functions to get and store the sizes of the
     Stream Header and tail.


 2.x. Stream tail

     For now, the size of Stream tail is fixed to 3 bytes. The header
     <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
     should use instead of a hardcoded number.


 3. Keeping track of size information

     The lzma_info_* functions found in <lzma/info.h> should ease the
     task of keeping track of the sizes of the Blocks and also the Stream
     as a whole. Using these functions is strongly recommended, because
     there are surprisingly many situations where an error can occur,
     and these functions check for possible errors every time some new
     information becomes available.

     If you find lzma_info_* functions lacking something that you would
     find useful, please contact the author.


 3.1. Start offset of the Stream

     If you are storing the .lzma Stream inside another file format, or
     for some other reason are placing the .lzma Stream somewhere other
     than the beginning of the file, you should set the starting offset
     of the Stream using lzma_info_start_offset_set().

     The start offset of the Stream is used for two distinct purposes.
     First, knowing the start offset of the Stream allows
     lzma_info_alignment_get() to correctly calculate the alignment of
     every Block. This information is given to the Block encoder, which
     will calculate the size of Header Padding so that Compressed Data
     is aligned at an optimal offset.

     Another use for the start offset of the Stream is in random-access
     reading. If you set the start offset of the Stream, lzma_info_locate()
     will be able to calculate the offset relative to the beginning of the
     file containing the Stream (instead of the offset relative to the
     beginning of the Stream).
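
     As a rough illustration of the first purpose, the sketch below
     computes how many padding bytes would move a field to the next
     aligned file offset once the start offset of the Stream is taken
     into account. The function name and its parameters are hypothetical
     and not part of liblzma; it only shows the arithmetic and does not
     call lzma_info_alignment_get().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function.
 *
 * stream_start:  offset of the Stream inside the containing file
 * pos_in_stream: offset of the field relative to the start of the Stream
 * alignment:     wanted alignment in bytes; must be non-zero
 *
 * Returns the number of padding bytes needed so that the field starts
 * at a file offset that is a multiple of alignment.
 */
static uint32_t
padding_for_alignment(uint64_t stream_start, uint64_t pos_in_stream,
		uint32_t alignment)
{
	assert(alignment != 0);

	/* Work modulo alignment so that the sum cannot overflow. */
	const uint64_t misalign = (stream_start % alignment
			+ pos_in_stream % alignment) % alignment;

	return misalign == 0 ? 0 : (uint32_t)(alignment - misalign);
}
```

     For example, with a Stream that starts at file offset 3, a field at
     Stream offset 13 is already four-byte aligned in the file, while
     the same field in a Stream starting at offset 0 would need three
     bytes of padding.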


 3.2. Size of Stream Header

     While the size of the Stream Header is constant (11 bytes) in the
     current version of the .lzma file format, this may change in the
     future.


 3.3. Size of Header Metadata Block

     This information is needed when doing random-access reading, and
     to verify the value of this field stored in Footer Metadata Block.


 3.4. Total Size of the Data Blocks


 3.5. Uncompressed Size of Data Blocks


 3.6. Index


 x. Alignment

     There are a few slightly different types of alignment issues when
     working with .lzma files.

     The .lzma format doesn't strictly require any kind of alignment.
     However, if the encoder carefully optimizes the alignment in all
     situations, it can improve compression ratio, speed of the encoder
     and decoder, and slightly help if the files get damaged and need
     recovery.

     Alignment has the most significant effect on compression ratio FIXME


 x.1. Compression ratio

     Some filters take advantage of the alignment of the input data.
     To get the best compression ratio, make sure that you feed these
     filters correctly aligned data.

     Some filters (e.g. LZMA) don't necessarily mind too much if the
     input doesn't match the preferred alignment. With these filters
     the penalty in compression ratio depends on the specific type of
     data being compressed.

     Other filters (e.g. the PowerPC executable filter) won't work at all
     with data that is improperly aligned. While the data can still
     be de-filtered back to its original form, the benefit of the
     filtering (better compression ratio) is completely lost, because
     these filters expect certain patterns at properly aligned offsets.
     The compression ratio may even be worse with incorrectly aligned
     input than without the filter.


 x.1.1. Inter-filter alignment

     When there are multiple filters chained, checking the alignment can
     be useful not only with the input of the first filter and output of
     the last filter, but also between the filters.

     Inter-filter alignment is especially important with the Subblock
     filter.


 x.1.2. Further compression with external tools

     This is a relatively rare situation in practice, but still worth
     understanding.

     Let's say that there are several SPARC executables, which are each
     filtered to separate .lzma files using only the SPARC filter. If
     Uncompressed Size is written to the Block Header, the size of Block
     Header may vary between the .lzma files. If no Padding is used in
     the Block Header to correct the alignment, the starting offset of
     the Compressed Data field will be differently aligned in different
     .lzma files.

     All these .lzma files are archived into a single .tar archive. Due
     to the nature of the .tar format, every file is aligned inside the
     archive to an offset that is a multiple of 512 bytes.

     The .tar archive is compressed into a new .lzma file using the LZMA
     filter with options that prefer input alignment of four bytes. Now
     if the independent .lzma files don't have the same alignment of
     the Compressed Data fields, the LZMA filter will be unable to take
     advantage of the input alignment between the files in the .tar
     archive, which reduces compression ratio.

     Thus, even if you have only a single Block per file, it can be good
     for compression ratio to align the Compressed Data to an optimal
     offset.
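
     The arithmetic behind this can be sketched as follows. Because 512
     is a multiple of four, the position of a file inside the .tar
     archive never changes the four-byte alignment of its Compressed
     Data; only the offset of Compressed Data inside the .lzma file
     does. The helper name and the offsets used in the example are
     made up for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function.
 *
 * member_offset:         offset in the .tar archive where the contents
 *                        of one .lzma file begin (a multiple of 512)
 * data_offset_in_member: offset of the Compressed Data field inside
 *                        that .lzma file
 *
 * Returns the misalignment (0 to 3) of Compressed Data relative to a
 * four-byte boundary in the archive.
 */
static unsigned int
misalignment_in_tar(uint64_t member_offset, uint64_t data_offset_in_member)
{
	/* Reduce both terms first so that the sum cannot overflow. */
	return (unsigned int)((member_offset % 4
			+ data_offset_in_member % 4) % 4);
}
```

     Two files whose Compressed Data begins at offsets 13 and 14 end up
     with misalignments 1 and 2 wherever they are placed in the archive,
     so padding both to offset 16 would give them the same alignment.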


 x.2. Speed

     Most modern computers are faster when multi-byte data is located
     at aligned offsets in RAM. Proper alignment of the Compressed Data
     fields can slightly increase the speed of some filters.


 x.3. Recovery

     Aligning every Block Header to start at an offset with a big enough
     alignment may ease or at least speed up recovery of broken files.


 y. Typical use cases

 y.x. Parsing the Stream backwards

     You may need to parse the Stream backwards if you need to get
     information such as the sizes of the Stream, Index, or Extra.
     The basic procedure to do this follows.

     Locate the end of the Stream. If the Stream is stored as is in a
     standalone .lzma file, simply seek to the end of the file and start
     reading backwards using an appropriate buffer size. The file format
     specification allows an arbitrary amount of Footer Padding (zero or
     more NUL bytes), which you must skip before trying to decode the
     Stream tail.

     Once you have located the end of the Stream (a non-NUL byte), make
     sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
     Stream in a buffer. If there aren't enough bytes left in the file,
     the file is too small to contain a valid Stream. Decode the Stream
     tail using lzma_stream_tail_decoder(). Store the offset of the first
     byte of the Stream tail; you will need it later.
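
     Skipping the Footer Padding and locating the Stream tail could look
     roughly like the sketch below. It works on a buffer that is already
     in memory and uses no liblzma headers: STREAM_TAIL_SIZE is a local
     stand-in for LZMA_STREAM_TAIL_SIZE, and find_stream_tail() is a
     hypothetical name. Decoding the tail itself is left to
     lzma_stream_tail_decoder().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for LZMA_STREAM_TAIL_SIZE from <lzma/stream_flags.h>, used
 * only to keep this sketch free of liblzma headers.
 */
#define STREAM_TAIL_SIZE 3

/*
 * Hypothetical helper, not a liblzma function.
 *
 * buf holds the last size bytes of the file. On success, stores the
 * index of the first byte of the Stream tail to *tail_pos and returns 1.
 *
 * Returns 0 if fewer than STREAM_TAIL_SIZE bytes remain after skipping
 * the padding. If buf holds the whole file, the file is then too small
 * to contain a valid Stream; otherwise the caller needs to read further
 * back and try again.
 */
static int
find_stream_tail(const uint8_t *buf, size_t size, size_t *tail_pos)
{
	/* Skip Footer Padding, that is, trailing NUL bytes. */
	while (size > 0 && buf[size - 1] == 0x00)
		--size;

	if (size < STREAM_TAIL_SIZE)
		return 0;

	*tail_pos = size - STREAM_TAIL_SIZE;
	return 1;
}
```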

     You may now want to do some internal verifications, e.g. check that
     the Check type is supported by the liblzma build you are using.

     Decode the Backward Size field with lzma_vli_reverse_decode(). The
     field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
     Backward Size is not zero. Store the offset of the first byte of
     the Backward Size; you will need it later.

     Now you know the Total Size of the last Block of the Stream. It's the
     value of Backward Size plus the size of the Backward Size field. Note
     that you cannot use lzma_vli_size() to calculate the size since there
     might be padding; you need to use the real observed size of the
     Backward Size field.
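
     A minimal sketch of this calculation is shown below. It assumes
     that the caller has already decoded the field and knows how many
     bytes the field occupied in the file; the function name is
     hypothetical and lzma_vli_reverse_decode() is not called here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function.
 *
 * backward_size: decoded value of the Backward Size field
 * field_size:    number of bytes the field really occupied in the
 *                file, including any padding
 *
 * Returns 0 and stores the Total Size of the last Block to *total_size,
 * or returns -1 if the input cannot be valid.
 */
static int
last_block_total_size(uint64_t backward_size, size_t field_size,
		uint64_t *total_size)
{
	/* Backward Size must not be zero and the field cannot be empty. */
	if (backward_size == 0 || field_size == 0)
		return -1;

	/* Reject sums that would wrap around. */
	if (backward_size > UINT64_MAX - field_size)
		return -1;

	*total_size = backward_size + field_size;
	return 0;
}
```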

     At this point, the operation continues differently for Single-Block
     and Multi-Block Streams.


 y.x.1. Single-Block Stream

     There might be an Uncompressed Size field present in the Stream
     Footer. You cannot know it for sure unless you have already parsed
     the Block Header earlier. For security reasons, you probably want to
     try to decode the Uncompressed Size field, but you must not indicate
     any error if decoding fails. Later you can give the decoded
     Uncompressed Size to the Block decoder if Uncompressed Size isn't
     otherwise known; this prevents it from producing too much output in
     case of a (possibly intentionally) corrupt file.

     Calculate the start offset of the Stream:

         backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE

     backward_offset is the offset of the first byte of the Backward Size
     field. Remember to check for integer overflows, which can occur with
     invalid input files.
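
     With unsigned offsets the danger in this subtraction is wrapping
     below zero, so each step can be checked before it is done. The
     sketch below is one way to write it; STREAM_HEADER_SIZE is a local
     stand-in for LZMA_STREAM_HEADER_SIZE and the function name is
     hypothetical.

```c
#include <stdint.h>

/*
 * Stand-in for LZMA_STREAM_HEADER_SIZE from <lzma/stream_flags.h>, used
 * only to keep this sketch free of liblzma headers.
 */
#define STREAM_HEADER_SIZE 11

/*
 * Hypothetical helper, not a liblzma function. Computes
 *
 *     backward_offset - backward_size - STREAM_HEADER_SIZE
 *
 * Returns 0 and stores the result to *stream_start, or returns -1 if
 * the result would be negative, which means the file is invalid.
 */
static int
single_block_stream_start(uint64_t backward_offset, uint64_t backward_size,
		uint64_t *stream_start)
{
	/* The Block cannot begin before the start of the file. */
	if (backward_size > backward_offset)
		return -1;

	const uint64_t block_start = backward_offset - backward_size;

	/* There must be room for the Stream Header before the Block. */
	if (block_start < STREAM_HEADER_SIZE)
		return -1;

	*stream_start = block_start - STREAM_HEADER_SIZE;
	return 0;
}
```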

     Seek to the beginning of the Stream. Decode the Stream Header using
     lzma_stream_header_decoder(). Verify that the decoded Stream Flags
     match the values found in the Stream tail. You can use the
     lzma_stream_flags_is_equal() macro for this.

     Decode the Block Header. Verify that it isn't a Metadata Block, since
     Single-Block Streams cannot have Metadata. If Uncompressed Size is
     present in the Block Header, the value you tried to decode from the
     Stream Footer must be ignored, since Uncompressed Size wasn't actually
     present there. If Block Header doesn't have Uncompressed Size, and
     decoding the Uncompressed Size field from the Stream Footer failed,
     the file is corrupt.
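
     The rule above can be written as a small decision function. This
     is only a sketch with hypothetical names; it assumes the caller
     has already recorded whether the Block Header contained
     Uncompressed Size and whether the speculative decode from the
     Stream Footer succeeded.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a liblzma function.
 *
 * header_has_size:  Uncompressed Size is present in the Block Header
 * header_size:      its value, if present
 * footer_decode_ok: the speculative decode from the Stream Footer
 *                   succeeded
 * footer_size:      the speculatively decoded value, if any
 *
 * Returns 0 and stores the Uncompressed Size to use to *result, or
 * returns -1 if the file is corrupt.
 */
static int
pick_uncompressed_size(bool header_has_size, uint64_t header_size,
		bool footer_decode_ok, uint64_t footer_size, uint64_t *result)
{
	if (header_has_size) {
		/*
		 * The field wasn't really present in the Stream Footer,
		 * so whatever was decoded from there is ignored.
		 */
		*result = header_size;
		return 0;
	}

	/* The size is stored in neither place. */
	if (!footer_decode_ok)
		return -1;

	*result = footer_size;
	return 0;
}
```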

     If you were only looking for the Uncompressed Size of the Stream,
     you now have that information, and you can stop processing the
     Stream.

     To decode the Block, the same instructions apply as described in
     FIXME. However, because you have some extra known information decoded
     from the Stream Footer, you should give this information to the Block
     decoder so that it can verify it while decoding:
       - If Uncompressed Size is not present in the Block Header, set
         lzma_options_block.uncompressed_size to the value you decoded
         from the Stream Footer.
       - Always set lzma_options_block.total_size to backward_size +
         size_of_backward_size (you calculated this sum earlier already).


 y.x.2. Multi-Block Stream

     Calculate the start offset of the Footer Metadata Block:

         backward_offset - backward_size

     backward_offset is the offset of the first byte of the Backward Size
     field. Remember to check for integer overflows, which can occur with
     broken input files.

     Decode the Block Header. Verify that it is a Metadata Block. Set
     lzma_options_block.total_size to backward_size + size_of_backward_size
     (you calculated this sum earlier already). Then decode the Footer
     Metadata Block.

     Store the decoded Footer Metadata in the lzma_info structure using
     lzma_info_set_metadata(). Also set the offset of the Backward Size
     field using lzma_info_size_set(). Then you can get the start offset
     of the Stream using lzma_info_size_get(). Note that any of these
     steps may fail, so don't omit error checking.

     Seek to the beginning of the Stream. Decode the Stream Header using
     lzma_stream_header_decoder(). Verify that the decoded Stream Flags
     match the values found in the Stream tail. You can use the
     lzma_stream_flags_is_equal() macro for this.

     If you were only looking for the Uncompressed Size of the Stream,
     it's possible that you already have it now. If Uncompressed Size (or
     whatever information you were looking for) isn't available yet,
     continue by also decoding the Header Metadata Block. (If some
     information is missing, the Header Metadata Block has to be present.)

     Decoding the Data Blocks goes the same way as described in FIXME.


 y.x.3. Variations

     If you know the offset of the beginning of the Stream, you may want
     to parse the Stream Header before parsing the Stream tail.
