| |
| The .lzma File Format |
| --------------------- |
| |
| 0. Preface |
| 0.1. Copyright Notices |
| 0.2. Changes |
| 1. Conventions |
| 1.1. Byte and Its Representation |
| 1.2. Multibyte Integers |
| 2. Overall Structure of .lzma File |
| 2.1. Stream |
| 2.1.1. Stream Header |
| 2.1.1.1. Header Magic Bytes |
| 2.1.1.2. Stream Flags |
| 2.1.1.3. CRC32 |
| 2.1.2. Stream Footer |
| 2.1.2.1. CRC32 |
| 2.1.2.2. Backward Size |
| 2.1.2.3. Stream Flags |
| 2.1.2.4. Footer Magic Bytes |
| 2.2. Stream Padding |
| 3. Block |
| 3.1. Block Header |
| 3.1.1. Block Header Size |
| 3.1.2. Block Flags |
| 3.1.3. Compressed Size |
| 3.1.4. Uncompressed Size |
| 3.1.5. List of Filter Flags |
| 3.1.6. Header Padding |
| 3.1.7. CRC32 |
| 3.2. Compressed Data |
| 3.3. Check |
| 4. Index |
| 4.1. Index Indicator |
| 4.2. Number of Records |
| 4.3. List of Records |
| 4.3.1. Total Size |
| 4.3.2. Uncompressed Size |
| 4.4. Index Padding |
| 4.5. CRC32 |
| 5. Filter Chains |
| 5.1. Alignment |
| 5.2. Security |
| 5.3. Filters |
| 5.3.1. LZMA2 |
| 5.3.2. Branch/Call/Jump Filters for Executables |
| 5.3.3. Delta |
| 5.3.3.1. Format of the Encoded Output |
| 5.4. Custom Filter IDs |
| 5.4.1. Reserved Custom Filter ID Ranges |
| 6. Cyclic Redundancy Checks |
| 7. References |
| |
| |
| 0. Preface |
| |
| This document describes the .lzma file format (filename suffix |
| `.lzma', MIME type `application/x-lzma'). It is intended that |
| this format replace the format used by the LZMA_Alone tool |
| included in LZMA SDK up to and including version 4.57. |
| |
| IMPORTANT: The version described in this document is a |
| draft, NOT a final, official version. Changes |
| are possible. |
| |
| |
| 0.1. Copyright Notices |
| |
| Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org> |
| Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi> |
| |
| Copying and distribution of this file, with or without |
| modification, are permitted in any medium without royalty |
| provided the copyright notice and this notice are preserved. |
| Modified versions must be marked as such. |
| |
| All source code examples given in this document are put into |
| the public domain by the authors of this document. |
| |
| Special thanks for helping with this document goes to |
| Igor Pavlov. Thanks for helping with this document goes to |
| Mark Adler, H. Peter Anvin, and Mikko Pouru. |
| |
| |
| 0.2. Changes |
| |
| Last modified: 2008-06-17 14:10+0300 |
| |
| (A changelog will be kept once the first official version |
| is made.) |
| |
| |
| 1. Conventions |
| |
| The keywords `must', `must not', `required', `should', |
| `should not', `recommended', `may', and `optional' in this |
| document are to be interpreted as described in [RFC-2119]. |
| These words are not capitalized in this document. |
| |
| Indicating a warning means displaying a message, returning |
| appropriate exit status, or something else to let the user |
| know that something worth warning occurred. The operation |
| should still finish if a warning is indicated. |
| |
| Indicating an error means displaying a message, returning |
| appropriate exit status, or something else to let the user |
| know that something prevented successfully finishing the |
| operation. The operation must be aborted once an error has |
| been indicated. |
| |
| |
| 1.1. Byte and Its Representation |
| |
| In this document, byte is always 8 bits. |
| |
| A `nul byte' has all bits unset. That is, the value of a nul |
| byte is 0x00. |
| |
| To represent byte blocks, this document uses notation that |
| is similar to the notation used in [RFC-1952]: |
| |
| +-------+ |
| | Foo | One byte. |
| +-------+ |
| |
| +---+---+ |
| | Foo | Two bytes; that is, some of the vertical bars |
| +---+---+ can be missing. |
| |
| +=======+ |
| | Foo | Zero or more bytes. |
| +=======+ |
| |
| In this document, a boxed byte or a byte sequence declared |
| using this notation is called `a field'. The example field |
| above would be called `the Foo field' or plain `Foo'. |
| |
| |
| 1.2. Multibyte Integers |
| |
| Multibyte integers of static length, such as CRC values, |
| are stored in little endian byte order (least significant |
| byte first). |
| |
| When smaller values are more likely than bigger values (for |
| example file sizes), multibyte integers are encoded in a |
| variable-length representation: |
| - Numbers in the range [0, 127] are copied as is, and take |
| one byte of space. |
| - Bigger numbers will occupy two or more bytes. All but the |
| last byte of the multibyte representation have the highest |
| (eighth) bit set. |
| |
| For now, the value of the variable-length integers is limited |
| to 63 bits, which limits the encoded size of the integer to |
| nine bytes. These limits may be increased in future if needed. |
| |
| The following C code illustrates encoding and decoding of |
| variable-length integers. The functions return the number of |
| bytes occupied by the integer (1-9), or zero on error. |
| |
| #include <sys/types.h> |
| #include <inttypes.h> |
| |
| size_t |
| encode(uint8_t buf[static 9], uint64_t num) |
| { |
| if (num >= UINT64_MAX / 2) |
| return 0; |
| |
| size_t i = 0; |
| |
| while (num >= 0x80) { |
| buf[i++] = (uint8_t)(num) | 0x80; |
| num >>= 7; |
| } |
| |
| buf[i++] = (uint8_t)(num); |
| |
| return i; |
| } |
| |
| size_t |
| decode(const uint8_t buf[], size_t size_max, uint64_t *num) |
| { |
| if (size_max == 0) |
| return 0; |
| |
| if (size_max > 9) |
| size_max = 9; |
| |
| *num = buf[0] & 0x7F; |
| size_t i = 0; |
| |
| while (buf[i++] & 0x80) { |
| if (i > size_max || buf[i] == 0x00) |
| return 0; |
| |
| *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7); |
| } |
| |
| return i; |
| } |
| |
| |
| 2. Overall Structure of .lzma File |
| |
| +========+================+========+================+ |
| | Stream | Stream Padding | Stream | Stream Padding | ... |
| +========+================+========+================+ |
| |
| A file contains usually only one Stream. However, it is |
| possible to concatenate multiple Streams together with no |
| additional processing. It is up to the implementation to |
| decide if the decoder will continue decoding from the next |
| Stream once the end of the first Stream has been reached. |
| |
| |
| 2.1. Stream |
| |
| +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ |
| | Stream Header | Block | Block | ... | Block | |
| +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ |
| |
| +=======+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ---> | Index | Stream Footer | |
| +=======+-+-+-+-+-+-+-+-+-+-+-+-+ |
| |
| All the above fields have a size that is a multiple of four. If |
| Stream is used as an internal part of another file format, it |
| is recommended to make the Stream start at an offset that is |
| a multiple of four bytes. |
| |
| Stream Header, Index, and Stream Footer are always present in |
| a Stream. The maximum size of the Index field is 16 GiB (2^34). |
| |
| There are zero or more Blocks. The maximum number of Blocks is |
| limited only by the maximum size of the Index field. |
| |
| Total size of a Stream must be less than 8 EiB (2^63 bytes). |
| The same limit applies to the total amount of uncompressed |
| data stored in a Stream. |
| |
| If an implementation supports handling .lzma files with |
| multiple concatenated Streams, it may apply the above limits |
| to the file as a whole instead of limiting per Stream basis. |
| |
| |
| 2.1.1. Stream Header |
| |
| +---+---+---+---+---+---+-------+------+--+--+--+--+ |
| | Header Magic Bytes | Stream Flags | CRC32 | |
| +---+---+---+---+---+---+-------+------+--+--+--+--+ |
| |
| |
| 2.1.1.1. Header Magic Bytes |
| |
| The first six (6) bytes of the Stream are so called Header |
| Magic Bytes. They can be used to identify the file type. |
| |
| Using a C array and ASCII: |
| const uint8_t HEADER_MAGIC[6] |
| = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 }; |
| |
| In plain hexadecimal: |
| FF 4C 5A 4D 41 00 |
| |
| Notes: |
| - The first byte (0xFF) was chosen so that the files cannot |
| be erroneously detected as being in LZMA_Alone format, in |
| which the first byte is in the range [0x00, 0xE0]. |
| - The sixth byte (0x00) was chosen to prevent applications |
| from misdetecting the file as a text file. |
| |
| If the Header Magic Bytes don't match, the decoder must |
| indicate an error. |
| |
| |
| 2.1.1.2. Stream Flags |
| |
| The first byte of Stream Flags is always a nul byte. In future |
| this byte may be used to indicate new Stream version or other |
| Stream properties. |
| |
| The second byte of Stream Flags is a bit field: |
| |
| Bit(s) Mask Description |
| 0-3 0x0F Type of Check (see Section 3.3): |
| ID Size Check name |
| 0x00 0 bytes None |
| 0x01 4 bytes CRC32 |
| 0x02 4 bytes (Reserved) |
| 0x03 4 bytes (Reserved) |
| 0x04 8 bytes CRC64 |
| 0x05 8 bytes (Reserved) |
| 0x06 8 bytes (Reserved) |
| 0x07 16 bytes (Reserved) |
| 0x08 16 bytes (Reserved) |
| 0x09 16 bytes (Reserved) |
| 0x0A 32 bytes SHA-256 |
| 0x0B 32 bytes (Reserved) |
| 0x0C 32 bytes (Reserved) |
| 0x0D 64 bytes (Reserved) |
| 0x0E 64 bytes (Reserved) |
| 0x0F 64 bytes (Reserved) |
| 4-7 0xF0 Reserved for future use; must be zero for now. |
| |
| Implementations must support at least the Check IDs 0x00 (None) |
| and 0x01 (CRC32). Supporting other Check IDs is optional. If |
| an unsupported Check is used, the decoder should indicate a |
| warning or error. |
| |
| If any reserved bit is set, the decoder must indicate an error. |
| It is possible that there is a new field present which the |
| decoder is not aware of, and can thus parse the Stream Header |
| incorrectly. |
| |
| |
| 2.1.1.3. CRC32 |
| |
| The CRC32 is calculated from the Stream Flags field. It is |
| stored as an unsigned 32-bit little endian integer. If the |
| calculated value does not match the stored one, the decoder |
| must indicate an error. |
| |
| The idea is that Stream Flags would always be two bytes, even |
| if new features are needed. This way old decoders will be able |
| to verify the CRC32 calculated from Stream Flags, and thus |
| distinguish between corrupt files (CRC32 doesn't match) and |
| files that the decoder doesn't support (CRC32 matches but |
| Stream Flags has reserved bits set). |
| |
| |
| 2.1.2. Stream Footer |
| |
| +-+-+-+-+---+---+---+---+-------+------+----------+---------+ |
| | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes | |
| +-+-+-+-+---+---+---+---+-------+------+----------+---------+ |
| |
| |
| 2.1.2.1. CRC32 |
| |
| The CRC32 is calculated from the Backward Size and Stream Flags |
| fields. It is stored as an unsigned 32-bit little endian |
| integer. If the calculated value does not match the stored one, |
| the decoder must indicate an error. |
| |
| The reason to have the CRC32 field before the Backward Size and |
| Stream Flags fields is to keep the four-byte fields aligned to |
| a multiple of four bytes. |
| |
| |
| 2.1.2.2. Backward Size |
| |
| Backward Size is stored as a 32-bit little endian integer, |
| which indicates the size of the Index field as multiple of |
| four bytes, minimum value being four bytes: |
| |
| real_backward_size = (stored_backward_size + 1) * 4; |
| |
| Using a fixed-size integer to store this value makes it |
| slightly simpler to parse the Stream Footer when the |
| application needs to parse the Stream backwards. |
| |
| |
| 2.1.2.3. Stream Flags |
| |
| This is a copy of the Stream Flags field from the Stream |
| Header. The information stored to Stream Flags is needed |
| when parsing the Stream backwards. The decoder must compare |
| the Stream Flags fields in both Stream Header and Stream |
| Footer, and indicate an error if they are not identical. |
| |
| |
| 2.1.2.4. Footer Magic Bytes |
| |
| As the last step of the decoding process, the decoder must |
| verify the existence of Footer Magic Bytes. If they don't |
| match, an error must be indicated. |
| |
| Using a C array and ASCII: |
| const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; |
| |
| In hexadecimal: |
| 59 5A |
| |
| The primary reason to have Footer Magic Bytes is to make |
| it easier to detect incomplete files quickly, without |
| uncompressing. If the file does not end with Footer Magic Bytes |
| (excluding Stream Padding described in Section 2.2), it cannot |
| be undamaged, unless someone has intentionally appended garbage |
| after the end of the Stream. |
| |
| |
| 2.2. Stream Padding |
| |
| Only the decoders that support decoding of concatenated Streams |
| must support Stream Padding. |
| |
| Stream Padding must contain only nul bytes. Any non-nul byte |
| should be considered as the beginning of a new Stream. To |
| preserve the four-byte alignment of consecutive Streams, the |
| size of Stream Padding must be a multiple of four bytes. Empty |
| Stream Padding is allowed. |
| |
| Note that non-empty Stream Padding is allowed at the end of the |
| file; there doesn't need to be a new Stream after non-empty |
| Stream Padding. This can be convenient in certain situations |
| [GNU-tar]. |
| |
| The possibility of Padding should be taken into account when |
| designing an application that parses the Stream backwards. |
| |
| |
| 3. Block |
| |
| +==============+=================+=======+ |
| | Block Header | Compressed Data | Check | |
| +==============+=================+=======+ |
| |
| |
| 3.1. Block Header |
| |
| +-------------------+-------------+=================+ |
| | Block Header Size | Block Flags | Compressed Size | |
| +-------------------+-------------+=================+ |
| |
| +===================+======================+ |
| ---> | Uncompressed Size | List of Filter Flags | |
| +===================+======================+ |
| |
| +================+--+--+--+--+ |
| ---> | Header Padding | CRC32 | |
| +================+--+--+--+--+ |
| |
| |
| 3.1.1. Block Header Size |
| |
| This field overlaps with the Index Indicator field (see |
| Section 4.1). |
| |
| This field contains the size of the Block Header field, |
| including the Block Header Size field itself. Valid values are |
| in the range [0x01, 0xFF], which indicate the size of the Block |
| Header as multiples of four bytes, minimum size being eight |
| bytes: |
| |
| real_header_size = (encoded_header_size + 1) * 4; |
| |
| If bigger Block Header is needed in future, a new field can be |
| added between the current Block Header and Compressed Data |
| fields. The presence of this new field would be indicated in |
| the Block Header. |
| |
| |
| 3.1.2. Block Flags |
| |
| The first byte of the Block Flags field is a bit field: |
| |
| Bit(s) Mask Description |
| 0-1 0x03 Number of filters (1-4) |
| 2-5 0x3C Reserved for future use; must be zero for now. |
| 6 0x40 The Compressed Size field is present. |
| 7 0x80 The Uncompressed Size field is present. |
| |
| If any reserved bit is set, the decoder must indicate an error. |
| It is possible that there is a new field present which the |
| decoder is not aware of, and can thus parse the Block Header |
| incorrectly. |
| |
| |
| 3.1.3. Compressed Size |
| |
| This field is present only if the appropriate bit is set in |
| the Block Flags field (see Section 3.1.2). |
| |
| This field contains the size of the Compressed Data field as |
| multiple of four bytes, minimum value being four bytes: |
| |
| real_compressed_size = (stored_compressed_size + 1) * 4; |
| |
| The size is stored using the encoding described in Section 1.2. |
| If the Compressed Size does not match the real size of the |
| Compressed Data field, the decoder must indicate an error. |
| |
| |
| 3.1.4. Uncompressed Size |
| |
| This field is present only if the appropriate bit is set in |
| the Block Flags field (see Section 3.1.2). |
| |
| The Uncompressed Size field contains the size of the Block |
| after uncompressing. Uncompressed Size is stored using the |
| encoding described in Section 1.2. If the Uncompressed Size |
| does not match the real uncompressed size, the decoder must |
| indicate an error. |
| |
| Storing the Compressed Size and Uncompressed Size fields serves |
| several purposes: |
| - The decoder knows how much memory it needs to allocate |
| for a temporary buffer in multithreaded mode. |
| - Simple error detection: wrong size indicates a broken file. |
| - Seeking forwards to a specific location in streamed mode. |
| |
| It should be noted that the only reliable way to determine |
| the real uncompressed size is to uncompress the Block, |
| because the Block Header and Index fields may contain |
| (intentionally or unintentionally) invalid information. |
| |
| |
| 3.1.5. List of Filter Flags |
| |
| +================+================+ +================+ |
| | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | |
| +================+================+ +================+ |
| |
| The number of Filter Flags fields is stored in the Block Flags |
| field (see Section 3.1.2). |
| |
| The format of each Filter Flags field is as follows: |
| |
| +===========+====================+===================+ |
| | Filter ID | Size of Properties | Filter Properties | |
| +===========+====================+===================+ |
| |
| Both Filter ID and Size of Properties are stored using the |
| encoding described in Section 1.2. Size of Properties indicates |
| the size of the Filter Properties field as bytes. The list of |
| officially defined Filter IDs and the formats of their Filter |
| Properties are described in Section 5.3. |
| |
| |
| 3.1.6. Header Padding |
| |
| This field contains as many nul byte as it is needed to make |
| the Block Header have the size specified in Block Header Size. |
| If any of the bytes are not nul bytes, the decoder must |
| indicate an error. It is possible that there is a new field |
| present which the decoder is not aware of, and can thus parse |
| the Block Header incorrectly. |
| |
| |
| 3.1.7. CRC32 |
| |
| The CRC32 is calculated over everything in the Block Header |
| field except the CRC32 field itself. It is stored as an |
| unsigned 32-bit little endian integer. If the calculated |
| value does not match the stored one, the decoder must indicate |
| an error. |
| |
| By verifying the CRC32 of the Block Header before parsing the |
| actual contents allows the decoder to distinguish between |
| corrupt and unsupported files. |
| |
| |
| 3.2. Compressed Data |
| |
| The format of Compressed Data depends on Block Flags and List |
| of Filter Flags. Excluding the descriptions of the simplest |
| filters in Section 5.3, the format of the filter-specific |
| encoded data is out of scope of this document. |
| |
| If the natural size of Compressed Data is not a multiple of |
| four bytes, it must be padded with 1-3 nul bytes to make it |
| a multiple of four bytes. |
| |
| |
| 3.3. Check |
| |
| The type and size of the Check field depends on which bits |
| are set in the Stream Flags field (see Section 2.1.1.2). |
| |
| The Check, when used, is calculated from the original |
| uncompressed data. If the calculated Check does not match the |
| stored one, the decoder must indicate an error. If the selected |
| type of Check is not supported by the decoder, it must indicate |
| a warning or error. |
| |
| |
| 4. Index |
| |
| +-----------------+=========================+ |
| | Index Indicator | Number of Index Records | |
| +-----------------+=========================+ |
| |
| +=================+=========+-+-+-+-+ |
| ---> | List of Records | Padding | CRC32 | |
| +=================+=========+-+-+-+-+ |
| |
| Index serves several purporses. Using it, one can |
| - verify that all Blocks in a Stream have been processed; |
| - find out the uncompressed size of a Stream; and |
| - quickly access the beginning of any Block (random access). |
| |
| |
| 4.1. Index Indicator |
| |
| This field overlaps with the Block Header Size field (see |
| Section 3.1.1). The value of Index Indicator is always 0x00. |
| |
| |
| 4.2. Number of Records |
| |
| This field indicates how many Records there are in the List |
| of Records field, and thus how many Blocks there are in the |
| Stream. The value is stored using the encoding described in |
| Section 1.2. If the decoder has decoded all the Blocks of the |
| Stream, and then notices that the Number of Records doesn't |
| match the real number of Blocks, the decoder must indicate an |
| error. |
| |
| |
| 4.3. List of Records |
| |
| List of Records consists of as many Records as indicated by the |
| Number of Records field: |
| |
| +========+========+ |
| | Record | Record | ... |
| +========+========+ |
| |
| Each Record contains two fields: |
| |
| +============+===================+ |
| | Total Size | Uncompressed Size | |
| +============+===================+ |
| |
| If the decoder has decoded all the Blocks of the Stream, it |
| must verify that the contents of the Records match the real |
| Total Size and Uncompressed Size of the respective Blocks. |
| |
| Implementation hint: It is possible to verify the Index with |
| constant memory usage by calculating for example SHA256 of both |
| the real size values and the List of Records, then comparing |
| the check values. Implementing this using non-cryptographic |
| check like CRC32 should be avoided unless small code size is |
| important. |
| |
| If the decoder supports random-access reading, it must verify |
| that Total Size and Uncompressed Size of every completely |
| decoded Block match the sizes stored in the Index. If only |
| partial Block is decoded, the decoder must verify that the |
| processed sizes don't exceed the sizes stored in the Index. |
| |
| |
| 4.3.1. Total Size |
| |
| This field indicates the encoded size of the respective Block |
| as multiples of four bytes, minimum value being four bytes: |
| |
| real_total_size = (stored_total_size + 1) * 4; |
| |
| The value is stored using the encoding described in Section |
| 1.2. |
| |
| |
| 4.3.2. Uncompressed Size |
| |
| This field indicates the Uncompressed Size of the respective |
| Block as bytes. The value is stored using the encoding |
| described in Section 1.2. |
| |
| |
| 4.4. Index Padding |
| |
| This field must contain 0-3 nul bytes to pad the Index to |
| a multiple of four bytes. |
| |
| |
| 4.5. CRC32 |
| |
| The CRC32 is calculated over everything in the Index field |
| except the CRC32 field itself. The CRC32 is stored as an |
| unsigned 32-bit little endian integer. If the calculated |
| value does not match the stored one, the decoder must indicate |
| an error. |
| |
| |
| 5. Filter Chains |
| |
| The Block Flags field defines how many filters are used. When |
| more than one filter is used, the filters are chained; that is, |
| the output of one filter is the input of another filter. The |
| following figure illustrates the direction of data flow. |
| |
| v Uncompressed Data ^ |
| | Filter 0 | |
| Encoder | Filter 1 | Decoder |
| | Filter n | |
| v Compressed Data ^ |
| |
| |
| 5.1. Alignment |
| |
| Alignment of uncompressed input data is usually the job of |
| the application producing the data. For example, to get the |
| best results, an archiver tool should make sure that all |
| PowerPC executable files in the archive stream start at |
| offsets that are multiples of four bytes. |
| |
| Some filters, for example LZMA, can be configured to take |
| advantage of specified alignment of input data. Note that |
| taking advantage of aligned input can be benefical also when |
| a filter is not the first filter in the chain. For example, |
| if you compress PowerPC executables, you may want to use the |
| PowerPC filter and chain that with the LZMA filter. Because not |
| only the input but also the output alignment of the PowerPC |
| filter is four bytes, it is now benefical to set LZMA settings |
| so that the LZMA encoder can take advantage of its |
| four-byte-aligned input data. |
| |
| The output of the last filter in the chain is stored to the |
| Compressed Data field, which is is guaranteed to be aligned |
| to a multiple of four bytes relative to the beginning of the |
| Stream. This can increase |
| - speed, if the filtered data is handled multiple bytes at |
| a time by the filter-specific encoder and decoder, |
| because accessing aligned data in computer memory is |
| usually faster; and |
| - compression ratio, if the output data is later compressed |
| with an external compression tool. |
| |
| |
| 5.2. Security |
| |
| If filters would be allowed to be chained freely, it would be |
| possible to create malicious files, that would be very slow to |
| decode. Such files could be used to create denial of service |
| attacks. |
| |
| Slow files could occur when multiple filters are chained: |
| |
| v Compressed input data |
| | Filter 1 decoder (last filter) |
| | Filter 0 decoder (non-last filter) |
| v Uncompressed output data |
| |
| The decoder of the last filter in the chain produces a lot of |
| output from little input. Another filter in the chain takes the |
| output of the last filter, and produces very little output |
| while consuming a lot of input. As a result, a lot of data is |
| moved inside the filter chain, but the filter chain as a whole |
| gets very little work done. |
| |
| To prevent this kind of slow files, there are restrictions on |
| how the filters can be chained. These restrictions must be |
| taken into account when designing new filters. |
| |
| The maximum number of filters in the chain has been limited to |
| four, thus there can be at maximum of three non-last filters. |
| Of these three non-last filters, only two are allowed to change |
| the size of the data. |
| |
| The non-last filters, that change the size of the data, must |
| have a limit how much the decoder can compress the data: the |
| decoder should produce at least n bytes of output when the |
| filter is given 2n bytes of input. This limit is not |
| absolute, but significant deviations must be avoided. |
| |
| The above limitations guarantee that if the last filter in the |
| chain produces 4n bytes of output, the chain as a whole will |
| produce at least n bytes of output. |
| |
| |
| 5.3. Filters |
| |
| 5.3.1. LZMA2 |
| |
| LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse |
| compression algorithm with high compression ratio and fast |
| decompression. LZMA is based on LZ77 and range coding |
| algorithms. |
| |
| LZMA2 uses LZMA internally, but adds support for uncompressed |
| chunks, eases stateful decoder implementations, and improves |
| support for multithreading. Thus, the plain LZMA will not be |
| supported in this file format. |
| |
| Filter ID: 0x21 |
| Size of Filter Properties: 1 byte |
| Changes size of data: Yes |
| Allow as a non-last filter: No |
| Allow as the last filter: Yes |
| |
| Preferred alignment: |
| Input data: Adjustable to 1/2/4/8/16 byte(s) |
| Output data: 1 byte |
| |
| At the time of writing, there is no other documentation about |
| how LZMA works than the source code in LZMA SDK. Once such |
| documentation gets written, it will probably be published as |
| a separate document, because including the documentation here |
| would lengthen this document considerably. |
| |
| The format of the one-byte Filter Properties field is as |
| follows: |
| |
| Bits Mask Description |
| 0-5 0x3F Dictionary Size |
| 6-7 0xC0 Reserved for future use; must be zero for now. |
| |
| Dictionary Size is encoded with one-bit mantissa and five-bit |
| exponent. The smallest dictionary size is 4 KiB and the biggest |
| is 4 GiB. |
| |
| Raw value Mantissa Exponent Dictionary size |
| 0 2 11 4 KiB |
| 1 3 11 6 KiB |
| 2 2 12 8 KiB |
| 3 3 12 12 KiB |
| 4 2 13 16 KiB |
| 5 3 13 24 KiB |
| 6 2 14 32 KiB |
| ... ... ... ... |
| 35 3 27 768 MiB |
| 36 2 28 1024 MiB |
| 37 3 29 1536 MiB |
| 38 2 30 2048 MiB |
| 39 3 30 3072 MiB |
| 40 2 31 4096 MiB |
| |
| Instead of having a table in the decoder, the dictionary size |
| can be decoded using the following C code: |
| |
| const uint8_t bits = get_dictionary_flags() & 0x3F; |
| if (bits > 40) |
| return DICTIONARY_TOO_BIG; // Bigger than 4 GiB |
| |
| uint32_t dictionary_size = 2 | (bits & 1); |
| dictionary_size <<= bits / 2 + 11; |
| |
| |
| 5.3.2. Branch/Call/Jump Filters for Executables |
| |
| These filters convert relative branch, call, and jump |
| instructions to their absolute counterparts in executable |
| files. This conversion increases redundancy and thus |
| compression ratio. |
| |
| Size of Filter Properties: 0 or 4 bytes |
| Changes size of data: No |
| Allow as a non-last filter: Yes |
| Allow as the last filter: No |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: No |
| End of Input: Yes |
| |
| Below is the list of filters in this category. The alignment |
| is the same for both input and output data. |
| |
| Filter ID Alignment Description |
| 0x04 1 byte x86 filter (BCJ) |
| 0x05 4 bytes PowerPC (big endian) filter |
| 0x06 16 bytes IA64 filter |
| 0x07 4 bytes ARM (little endian) filter |
| 0x08 2 bytes ARM Thumb (little endian) filter |
| 0x09 4 bytes SPARC filter |
| |
| If the size of Filter Properties is four bytes, the Filter |
| Properties field contains the start offset used for address |
| conversions. It is stored as an unsigned 32-bit little endian |
| integer. If the size of Filter Properties is zero, the start |
| offset is zero. |
| |
| Setting the start offset may be useful if an executable has |
| multiple sections, and there are many cross-section calls. |
| Taking advantage of this feature usually requires usage of |
| the Subblock filter. |
| |
| |
| 5.3.3. Delta |
| |
| The Delta filter may increase compression ratio when the value |
| of the next byte correlates with the value of an earlier byte |
| at specified distance. |
| |
| Filter ID: 0x03 |
| Size of Filter Properties: 1 byte |
| Changes size of data: No |
| Allow as a non-last filter: Yes |
| Allow as the last filter: No |
| |
| Preferred alignment: |
| Input data: 1 byte |
| Output data: Same as the original input data |
| |
| The Properties byte indicates the delta distance, which can be |
| 1-256 bytes backwards from the current byte: 0x00 indicates |
| distance of 1 byte and 0xFF distance of 256 bytes. |
| |
| |
| 5.3.3.1. Format of the Encoded Output |
| |
| The code below illustrates both encoding and decoding with |
| the Delta filter. |
| |
| // Distance is in the range [1, 256]. |
| const unsigned int distance = get_properties_byte() + 1; |
| uint8_t pos = 0; |
| uint8_t delta[256]; |
| |
| memset(delta, 0, sizeof(delta)); |
| |
| while (1) { |
| const int byte = read_byte(); |
| if (byte == EOF) |
| break; |
| |
| uint8_t tmp = delta[(uint8_t)(distance + pos)]; |
| if (is_encoder) { |
| tmp = (uint8_t)(byte) - tmp; |
| delta[pos] = (uint8_t)(byte); |
| } else { |
| tmp = (uint8_t)(byte) + tmp; |
| delta[pos] = tmp; |
| } |
| |
| write_byte(tmp); |
| --pos; |
| } |
| |
| |
| 5.4. Custom Filter IDs |
| |
| If a developer wants to use custom Filter IDs, he has two |
| choices. The first choice is to contact Lasse Collin and ask |
| him to allocate a range of IDs for the developer. |
| |
| The second choice is to generate a 40-bit random integer, |
| which the developer can use as his personal Developer ID. |
| To minimalize the risk of collisions, Developer ID has to be |
| a randomly generated integer, not manually selected "hex word". |
| The following command, which works on many free operating |
| systems, can be used to generate Developer ID: |
| |
| dd if=/dev/urandom bs=5 count=1 | hexdump |
| |
| The developer can then use his Developer ID to create unique |
| (well, hopefully unique) Filter IDs. |
| |
| Bits Mask Description |
| 0-15 0x0000_0000_0000_FFFF Filter ID |
| 16-55 0x00FF_FFFF_FFFF_0000 Developer ID |
| 56-62 0x7F00_0000_0000_0000 Static prefix: 0x7F |
| |
| The resulting 63-bit integer will use 9 bytes of space when |
| stored using the encoding described in Section 1.2. To get |
| a shorter ID, see the beginning of this Section how to |
| request a custom ID range. |
| |
| |
| 5.4.1. Reserved Custom Filter ID Ranges |
| |
| Range Description |
| 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility |
| 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility |
| |
| |
| 6. Cyclic Redundancy Checks |
| |
| There are several incompatible variations to calculate CRC32 |
| and CRC64. For simplicity and clarity, complete examples are |
| provided to calculate the checks as they are used in this file |
| format. Implementations may use different code as long as it |
| gives identical results. |
| |
| The program below reads data from standard input, calculates |
| the CRC32 and CRC64 values, and prints the calculated values |
| as big endian hexadecimal strings to standard output. |
| |
| #include <sys/types.h> |
| #include <inttypes.h> |
| #include <stdio.h> |
| |
| uint32_t crc32_table[256]; |
| uint64_t crc64_table[256]; |
| |
| void |
| init(void) |
| { |
| static const uint32_t poly32 = UINT32_C(0xEDB88320); |
| static const uint64_t poly64 |
| = UINT64_C(0xC96C5795D7870F42); |
| |
| for (size_t i = 0; i < 256; ++i) { |
| uint32_t crc32 = i; |
| uint64_t crc64 = i; |
| |
| for (size_t j = 0; j < 8; ++j) { |
| if (crc32 & 1) |
| crc32 = (crc32 >> 1) ^ poly32; |
| else |
| crc32 >>= 1; |
| |
| if (crc64 & 1) |
| crc64 = (crc64 >> 1) ^ poly64; |
| else |
| crc64 >>= 1; |
| } |
| |
| crc32_table[i] = crc32; |
| crc64_table[i] = crc64; |
| } |
| } |
| |
| uint32_t |
| crc32(const uint8_t *buf, size_t size, uint32_t crc) |
| { |
| crc = ~crc; |
| for (size_t i = 0; i < size; ++i) |
| crc = crc32_table[buf[i] ^ (crc & 0xFF)] |
| ^ (crc >> 8); |
| return ~crc; |
| } |
| |
| uint64_t |
| crc64(const uint8_t *buf, size_t size, uint64_t crc) |
| { |
| crc = ~crc; |
| for (size_t i = 0; i < size; ++i) |
| crc = crc64_table[buf[i] ^ (crc & 0xFF)] |
| ^ (crc >> 8); |
| return ~crc; |
| } |
| |
| int |
| main() |
| { |
| init(); |
| |
| uint32_t value32 = 0; |
| uint64_t value64 = 0; |
| uint64_t total_size = 0; |
| uint8_t buf[8192]; |
| |
| while (1) { |
| const size_t buf_size = fread(buf, 1, 8192, stdin); |
| if (buf_size == 0) |
| break; |
| |
| total_size += buf_size; |
| value32 = crc32(buf, buf_size, value32); |
| value64 = crc64(buf, buf_size, value64); |
| } |
| |
| printf("Bytes: %" PRIu64 "\n", total_size); |
| printf("CRC-32: 0x%08" PRIX32 "\n", value32); |
| printf("CRC-64: 0x%016" PRIX64 "\n", value64); |
| |
| return 0; |
| } |
| |
| |
| 7. References |
| |
| LZMA SDK - The original LZMA implementation |
| http://7-zip.org/sdk.html |
| |
| LZMA Utils - LZMA adapted to POSIX-like systems |
| http://tukaani.org/lzma/ |
| |
| [RFC-1952] |
| GZIP file format specification version 4.3 |
| http://www.ietf.org/rfc/rfc1952.txt |
| - Notation of byte boxes in section `2.1. Overall conventions' |
| |
| [RFC-2119] |
| Key words for use in RFCs to Indicate Requirement Levels |
| http://www.ietf.org/rfc/rfc2119.txt |
| |
| [GNU-tar] |
| GNU tar 1.16.1 manual |
| http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html |
| - Node 9.4.2 `Blocking Factor', paragraph that begins |
| `gzip will complain about trailing garbage' |
| - Note that this URL points to the latest version of the |
| manual, and may some day not contain the note which is in |
| 1.16.1. For the exact version of the manual, download GNU |
| tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz |
| |