| |
| Advanced features of liblzma |
| ---------------------------- |
| |
| 0. Introduction |
| |
| Most developers need only the basic features of liblzma. These |
| features allow single-threaded encoding and decoding of .lzma files |
| in streamed mode. |
| |
| In some cases developers want more. The .lzma file format is |
| designed to allow multi-threaded encoding and decoding and limited |
| random-access reading. These features are possible in non-streamed |
| mode and, to a limited extent, also in streamed mode. |
| |
| To take advantage of these features, the application needs a custom |
| .lzma file format handler. liblzma provides a set of tools to ease |
| this task, but it's still quite a bit of work to get a good custom |
| .lzma handler done. |
| |
| |
| 1. Where to begin |
| |
| Start by reading the .lzma file format specification. Understanding |
| the basics of the .lzma file structure is required to implement a |
| custom .lzma file handler and to understand the rest of this document. |
| |
| |
| 2. The basic components |
| |
| 2.1. Stream Header and tail |
| |
| Stream Header begins the .lzma Stream and Stream tail ends it. Stream |
| Header is defined in the file format specification, but Stream tail |
| isn't (thus I write "tail" with a lower-case letter). Stream tail is |
| simply the Stream Flags and the Footer Magic Bytes fields together. |
| liblzma does it this way because the Block coders take care of the |
| remaining fields of the Stream Footer. |
| |
| For now, the size of Stream Header is fixed to 11 bytes. The header |
| <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you |
| should use instead of a hardcoded number. Similarly, Stream tail |
| is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE. |
| |
| It is possible that a future version of the .lzma format will have |
| a variable-sized Stream Header and tail. At the time of writing, |
| this seems so unlikely that it was considered simplest to use |
| constants instead of providing functions to get and store the sizes |
| of the Stream Header and tail. |
| |
| |
| 2.x. Stream tail |
| |
| For now, the size of Stream tail is fixed to 3 bytes. The header |
| <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you |
| should use instead of a hardcoded number. |
| |
| |
| 3. Keeping track of size information |
| |
| The lzma_info_* functions found in <lzma/info.h> should ease the |
| task of keeping track of sizes of the Blocks and also the Stream |
| as a whole. Using these functions is strongly recommended, because |
| there are surprisingly many situations where an error can occur, |
| and these functions check for possible errors every time some new |
| information becomes available. |
| |
| If you find lzma_info_* functions lacking something that you would |
| find useful, please contact the author. |
| |
| |
| 3.1. Start offset of the Stream |
| |
| If you are storing the .lzma Stream inside another file format, or |
| for some other reason are placing the .lzma Stream somewhere other |
| than at the beginning of the file, you should set the starting |
| offset of the Stream using lzma_info_start_offset_set(). |
| |
| The start offset of the Stream is used for two distinct purposes. |
| First, knowing the start offset of the Stream allows |
| lzma_info_alignment_get() to correctly calculate the alignment of |
| every Block. This information is given to the Block encoder, which |
| will calculate the size of Header Padding so that Compressed Data |
| is aligned at an optimal offset. |
| |
| The other use for the start offset of the Stream is in random-access |
| reading. If you set the start offset of the Stream, lzma_info_locate() |
| will be able to calculate the offset relative to the beginning of the |
| file containing the Stream (instead of offset relative to the |
| beginning of the Stream). |
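| |
| As a rough illustration of this second use, the sketch below turns a |
| Stream-relative offset into a file offset. It does not call liblzma: |
| file_offset() is a hypothetical helper, and it is only an assumption |
| that lzma_info_locate() performs an equivalent addition internally. |

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of liblzma): translate an offset that
 * is relative to the beginning of the Stream into an offset relative
 * to the beginning of the file that contains the Stream.
 * Returns false if the sum would not fit in uint64_t.
 */
static bool
file_offset(uint64_t stream_start, uint64_t offset_in_stream,
		uint64_t *result)
{
	if (offset_in_stream > UINT64_MAX - stream_start)
		return false;

	*result = stream_start + offset_in_stream;
	return true;
}
```

| For example, with a Stream embedded at file offset 4096, the |
| Stream-relative offset 11 maps to file offset 4107. |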
| |
| |
| 3.2. Size of Stream Header |
| |
| While the size of the Stream Header is constant (11 bytes) in the |
| current version of the .lzma file format, this may change in the |
| future. |
| |
| |
| 3.3. Size of Header Metadata Block |
| |
| This information is needed when doing random-access reading, and |
| to verify the value of this field stored in Footer Metadata Block. |
| |
| |
| 3.4. Total Size of the Data Blocks |
| |
| |
| 3.5. Uncompressed Size of Data Blocks |
| |
| |
| 3.6. Index |
| |
| |
| |
| |
| x. Alignment |
| |
| There are a few slightly different types of alignment issues when |
| working with .lzma files. |
| |
| The .lzma format doesn't strictly require any kind of alignment. |
| However, if the encoder carefully optimizes the alignment in all |
| situations, it can improve compression ratio, speed of the encoder |
| and decoder, and slightly help if the files get damaged and need |
| recovery. |
| |
| Alignment has the most significant effect on compression ratio FIXME |
| |
| |
| x.1. Compression ratio |
| |
| Some filters take advantage of the alignment of the input data. |
| To get the best compression ratio, make sure that you feed these |
| filters correctly aligned data. |
| |
| Some filters (e.g. LZMA) don't necessarily mind too much if the |
| input doesn't match the preferred alignment. With these filters |
| the penalty in compression ratio depends on the specific type of |
| data being compressed. |
| |
| Other filters (e.g. PowerPC executable filter) won't work at all |
| with data that is improperly aligned. While the data can still |
| be de-filtered back to its original form, the benefit of the |
| filtering (better compression ratio) is completely lost, because |
| these filters expect certain patterns at properly aligned offsets. |
| The compression ratio may even be worse with incorrectly aligned |
| input than without the filter. |
| |
| |
| x.1.1. Inter-filter alignment |
| |
| When there are multiple filters chained, checking the alignment can |
| be useful not only with the input of the first filter and output of |
| the last filter, but also between the filters. |
| |
| Inter-filter alignment is especially important with the Subblock |
| filter. |
| |
| |
| x.1.2. Further compression with external tools |
| |
| This is a relatively rare situation in practice, but still worth |
| understanding. |
| |
| Let's say that there are several SPARC executables, which are each |
| filtered to separate .lzma files using only the SPARC filter. If |
| Uncompressed Size is written to the Block Header, the size of Block |
| Header may vary between the .lzma files. If no Padding is used in |
| the Block Header to correct the alignment, the starting offset of |
| the Compressed Data field will be differently aligned in different |
| .lzma files. |
| |
| All these .lzma files are archived into a single .tar archive. Due |
| to the nature of the .tar format, every file is aligned inside the |
| archive to an offset that is a multiple of 512 bytes. |
| |
| The .tar archive is compressed into a new .lzma file using the LZMA |
| filter with options that prefer input alignment of four bytes. Now |
| if the independent .lzma files don't have the same alignment of |
| the Compressed Data fields, the LZMA filter will be unable to take |
| advantage of the input alignment between the files in the .tar |
| archive, which reduces compression ratio. |
| |
| Thus, even if you have only a single Block per file, it can be good |
| for compression ratio to align the Compressed Data to an optimal |
| offset. |
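| |
| The effect can be checked with a little arithmetic. The following is |
| only a sketch: padding_for_alignment() is a hypothetical helper, not |
| a liblzma function, and the Block Header sizes used in the note after |
| it are made-up values chosen for illustration. |

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of liblzma): number of padding bytes
 * that must be added so that data which would otherwise start at
 * `offset` starts at the next multiple of `alignment` instead.
 */
static uint32_t
padding_for_alignment(uint64_t offset, uint32_t alignment)
{
	assert(alignment != 0);

	const uint32_t remainder = (uint32_t)(offset % alignment);
	return remainder == 0 ? 0 : alignment - remainder;
}
```

| Every member of the .tar archive starts at a multiple of 512, so only |
| the offset inside the member affects four-byte alignment. With the |
| 11-byte Stream Header and assumed Block Header sizes of 5 and 6 |
| bytes, Compressed Data would start at offsets 16 and 17; the helper |
| gives 0 and 3 bytes of padding, respectively. |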
| |
| |
| x.2. Speed |
| |
| Most modern computers are faster when multi-byte data is located |
| at aligned offsets in RAM. Proper alignment of the Compressed Data |
| fields can slightly increase the speed of some filters. |
| |
| |
| x.3. Recovery |
| |
| Aligning every Block Header to start at an offset with big enough |
| alignment may ease or at least speed up recovery of broken files. |
| |
| |
| y. Typical usage cases |
| |
| y.x. Parsing the Stream backwards |
| |
| You may need to parse the Stream backwards if you need to get |
| information such as the sizes of the Stream, Index, or Extra. |
| The basic procedure to do this follows. |
| |
| Locate the end of the Stream. If the Stream is stored as is in a |
| standalone .lzma file, simply seek to the end of the file and start |
| reading backwards using an appropriate buffer size. The file format |
| specification allows an arbitrary amount of Footer Padding (zero or |
| more NUL bytes), which you skip before trying to decode the Stream |
| tail. |
| |
| Once you have located the end of the Stream (a non-NUL byte), make |
| sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the |
| Stream in a buffer. If there aren't enough bytes left in the file, |
| the file is too small to contain a valid Stream. Decode the Stream |
| tail using lzma_stream_tail_decoder(). Store the offset of the first |
| byte of the Stream tail; you will need it later. |
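| |
| The padding-skipping step can be sketched as follows. This is not |
| liblzma code: both functions and EXAMPLE_STREAM_TAIL_SIZE (a local |
| stand-in for LZMA_STREAM_TAIL_SIZE) are hypothetical, and the sketch |
| assumes the end of the file is already in a single buffer. |

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for LZMA_STREAM_TAIL_SIZE (3 bytes for now). */
#define EXAMPLE_STREAM_TAIL_SIZE 3

/*
 * Hypothetical helper: skip trailing NUL bytes (Footer Padding).
 * Returns the number of bytes in buf up to and including the last
 * non-NUL byte, or 0 if the buffer contains only NUL bytes.
 */
static size_t
strip_footer_padding(const uint8_t *buf, size_t size)
{
	while (size > 0 && buf[size - 1] == 0x00)
		--size;

	return size;
}

/*
 * Hypothetical helper: true if enough bytes remain before the
 * padding to hold a Stream tail.
 */
static bool
has_room_for_tail(size_t size_without_padding)
{
	return size_without_padding >= EXAMPLE_STREAM_TAIL_SIZE;
}
```

| A real handler reads the file in chunks from the end and repeats the |
| scan until it finds a non-NUL byte or reaches the start of the file. |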
| |
| You may now want to do some internal verification, e.g. check that |
| the Check type is supported by the liblzma build you are using. |
| |
| Decode the Backward Size field with lzma_vli_reverse_decode(). The |
| field is at most LZMA_VLI_BYTES_MAX bytes long. Check that |
| Backward Size is not zero. Store the offset of the first byte of |
| the Backward Size; you will need it later. |
| |
| Now you know the Total Size of the last Block of the Stream. It's the |
| value of Backward Size plus the size of the Backward Size field. Note |
| that you cannot use lzma_vli_size() to calculate the size since there |
| might be padding; you need to use the real observed size of the |
| Backward Size field. |
| |
| At this point, the operation continues differently for Single-Block |
| and Multi-Block Streams. |
| |
| |
| y.x.1. Single-Block Stream |
| |
| There might be an Uncompressed Size field present in the Stream |
| Footer. You cannot know it for sure unless you have already parsed |
| the Block Header earlier. For security reasons, you probably want to |
| try to decode the Uncompressed Size field, but you must not indicate |
| any error if decoding fails. Later you can give the decoded |
| Uncompressed Size to the Block decoder if Uncompressed Size isn't |
| otherwise known; this prevents it from producing too much output in |
| case of a (possibly intentionally) corrupt file. |
| |
| Calculate the start offset of the Stream: |
| |
| backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE |
| |
| backward_offset is the offset of the first byte of the Backward Size |
| field. Remember to check for integer overflows, which can occur with |
| invalid input files. |
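| |
| A minimal sketch of the calculation with the overflow checks spelled |
| out. stream_start_offset() is a hypothetical helper, not a liblzma |
| function; the header size is passed in so that the caller can supply |
| LZMA_STREAM_HEADER_SIZE. |

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of liblzma): compute
 *     backward_offset - backward_size - header_size
 * without letting the unsigned subtractions wrap around.
 * Returns false if the values cannot describe a valid Stream.
 */
static bool
stream_start_offset(uint64_t backward_offset, uint64_t backward_size,
		uint64_t header_size, uint64_t *start)
{
	/* The Block cannot be larger than what precedes the field. */
	if (backward_size > backward_offset)
		return false;

	const uint64_t block_start = backward_offset - backward_size;

	/* There must still be room for the Stream Header. */
	if (header_size > block_start)
		return false;

	*start = block_start - header_size;
	return true;
}
```

| For example, with the Backward Size field at offset 111, a Backward |
| Size of 100, and an 11-byte Stream Header, the Stream starts at |
| offset 0. A Backward Size larger than 100 would be rejected. |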
| |
| Seek to the beginning of the Stream. Decode the Stream Header using |
| lzma_stream_header_decoder(). Verify that the decoded Stream Flags |
| match the values found in the Stream tail. You can use the |
| lzma_stream_flags_is_equal() macro for this. |
| |
| Decode the Block Header. Verify that it isn't a Metadata Block, since |
| Single-Block Streams cannot have Metadata. If Uncompressed Size is |
| present in the Block Header, the value you tried to decode from the |
| Stream Footer must be ignored, since Uncompressed Size wasn't actually |
| present there. If the Block Header doesn't have Uncompressed Size, and |
| decoding the Uncompressed Size field from the Stream Footer failed, |
| the file is corrupt. |
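| |
| The decision described above can be summarized in a small sketch. |
| The enum and the function are hypothetical names used only for |
| illustration; they are not part of liblzma. |

```c
#include <stdbool.h>

/* Where the Uncompressed Size of a Single-Block Stream comes from. */
enum size_source {
	SIZE_FROM_BLOCK_HEADER,   /* ignore the value tried from the footer */
	SIZE_FROM_STREAM_FOOTER,  /* use the value decoded from the footer */
	SIZE_FILE_CORRUPT,        /* neither place provides the size */
};

/*
 * Hypothetical helper: header_has_size tells whether the Block Header
 * contains Uncompressed Size; footer_decode_ok tells whether the
 * earlier attempt to decode it from the Stream Footer succeeded.
 */
static enum size_source
choose_uncompressed_size(bool header_has_size, bool footer_decode_ok)
{
	if (header_has_size)
		return SIZE_FROM_BLOCK_HEADER;

	if (footer_decode_ok)
		return SIZE_FROM_STREAM_FOOTER;

	return SIZE_FILE_CORRUPT;
}
```

| Note that a successful footer decode is ignored whenever the Block |
| Header has the field, because the bytes that were decoded were not |
| an Uncompressed Size field in that case. |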
| |
| If you were only looking for the Uncompressed Size of the Stream, |
| you now have that information, and you can stop processing the |
| Stream. |
| |
| To decode the Block, the same instructions apply as described in |
| FIXME. However, because you have some extra known information decoded |
| from the Stream Footer, you should give this information to the Block |
| decoder so that it can verify it while decoding: |
| - If Uncompressed Size is not present in the Block Header, set |
| lzma_options_block.uncompressed_size to the value you decoded |
| from the Stream Footer. |
| - Always set lzma_options_block.total_size to backward_size + |
| size_of_backward_size (you calculated this sum earlier already). |
| |
| |
| y.x.2. Multi-Block Stream |
| |
| Calculate the start offset of the Footer Metadata Block: |
| |
| backward_offset - backward_size |
| |
| backward_offset is the offset of the first byte of the Backward Size |
| field. Remember to check for integer overflows, which can occur with |
| broken input files. |
| |
| Decode the Block Header. Verify that it is a Metadata Block. Set |
| lzma_options_block.total_size to backward_size + size_of_backward_size |
| (you calculated this sum earlier already). Then decode the Footer |
| Metadata Block. |
| |
| Store the decoded Footer Metadata in the lzma_info structure using |
| lzma_info_set_metadata(). Also set the offset of the Backward Size |
| field using lzma_info_size_set(). Then you can get the start offset |
| of the Stream using lzma_info_size_get(). Note that any of these steps |
| may fail, so don't omit error checking. |
| |
| Seek to the beginning of the Stream. Decode the Stream Header using |
| lzma_stream_header_decoder(). Verify that the decoded Stream Flags |
| match the values found in the Stream tail. You can use the |
| lzma_stream_flags_is_equal() macro for this. |
| |
| If you were only looking for the Uncompressed Size of the Stream, |
| it's possible that you already have it now. If Uncompressed Size (or |
| whatever information you were looking for) isn't available yet, |
| continue by decoding also the Header Metadata Block. (If some |
| information is missing, the Header Metadata Block has to be present.) |
| |
| Decoding the Data Blocks goes the same way as described in FIXME. |
| |
| |
| y.x.3. Variations |
| |
| If you know the offset of the beginning of the Stream, you may want |
| to parse the Stream Header before parsing the Stream tail. |
| |