| |
| The .lzma File Format |
| --------------------- |
| |
| 0. Preface |
| 0.1. Copyright Notices |
| 0.2. Changes |
| 1. Conventions |
| 1.1. Byte and Its Representation |
| 1.2. Multibyte Integers |
| 2. Stream |
| 2.1. Stream Types |
| 2.1.1. Single-Block Stream |
| 2.1.2. Multi-Block Stream |
| 2.2. Stream Header |
| 2.2.1. Header Magic Bytes |
| 2.2.2. Stream Flags |
| 2.2.3. CRC32 |
| 3. Block |
| 3.1. Block Header |
| 3.1.1. Block Flags |
| 3.1.2. Compressed Size |
| 3.1.3. Uncompressed Size |
| 3.1.4. List of Filter Flags |
| 3.1.4.1. Misc |
| 3.1.4.2. External ID |
| 3.1.4.3. External Size of Properties |
| 3.1.4.4. Filter Properties |
| 3.1.5. CRC32 |
| 3.1.6. Header Padding |
| 3.2. Compressed Data |
| 3.3. Block Footer |
| 3.3.1. Check |
| 3.3.2. Stream Footer |
| 3.3.2.1. Uncompressed Size |
| 3.3.2.2. Backward Size |
| 3.3.2.3. Stream Flags |
| 3.3.2.4. Footer Magic Bytes |
| 3.3.3. Footer Padding |
| 4. Filters |
| 4.1. Detecting when All Data Has Been Decoded |
| 4.1.1. With Uncompressed Size |
| 4.1.2. With End of Input |
| 4.1.3. With End of Payload Marker |
| 4.2. Alignment |
| 4.3. Filters |
| 4.3.1. Copy |
| 4.3.2. Subblock |
| 4.3.2.1. Format of the Encoded Output |
| 4.3.3. Delta |
| 4.3.3.1. Format of the Encoded Output |
| 4.3.4. LZMA |
| 4.3.4.1. LZMA Properties |
| 4.3.4.2. Dictionary Flags |
| 4.3.5. Branch/Call/Jump Filters for Executables |
| 5. Metadata |
| 5.1. Metadata Flags |
| 5.2. Size of Header Metadata Block |
| 5.3. Total Size |
| 5.4. Uncompressed Size |
| 5.5. Index |
| 5.5.1. Number of Data Blocks |
| 5.5.2. Total Sizes |
| 5.5.3. Uncompressed Sizes |
| 5.6. Extra |
| 5.6.1. 0x00: Dummy/Padding |
| 5.6.2. 0x01: OpenPGP Signature |
| 5.6.3. 0x02: Filter Information |
| 5.6.4. 0x03: Comment |
| 5.6.5. 0x04: List of Checks |
| 5.6.6. 0x05: Original Filename |
| 5.6.7. 0x07: Modification Time |
| 5.6.8. 0x09: High-Resolution Modification Time |
| 5.6.9. 0x0B: MIME Type |
| 5.6.10. 0x0D: Homepage URL |
| 6. Custom Filter and Extra Record IDs |
| 6.1. Reserved Custom Filter ID Ranges |
| 7. Cyclic Redundancy Checks |
| 8. References |
| 8.1. Normative References |
| 8.2. Informative References |
| |
| |
| 0. Preface |
| |
| This document describes the .lzma file format (filename suffix |
| `.lzma', MIME type `application/x-lzma'). It is intended that |
| this format replace the format used by the LZMA_Alone tool |
| included in LZMA SDK up to and including version 4.43. |
| |
| IMPORTANT: The version described in this document is a |
| draft, NOT a final, official version. Changes |
| are possible. |
| |
| |
| 0.1. Copyright Notices |
| |
| Copyright (C) 2006, 2007 Lasse Collin <lasse.collin@tukaani.org> |
| Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi> |
| |
| Copying and distribution of this file, with or without |
| modification, are permitted in any medium without royalty |
| provided the copyright notice and this notice are preserved. |
| Modified versions must be marked as such. |
| |
| All source code examples given in this document are put into |
| the public domain by the authors of this document. |
| |
| Thanks for helping with this document goes to Igor Pavlov, |
| Mark Adler and Mikko Pouru. |
| |
| |
| 0.2. Changes |
| |
| Last modified: 2007-12-02 22:40+0200 |
| |
| (A changelog will be kept once the first official version |
| is made.) |
| |
| |
| 1. Conventions |
| |
| The keywords `must', `must not', `required', `should', |
| `should not', `recommended', `may', and `optional' in this |
| document are to be interpreted as described in [RFC-2119]. |
| These words are not capitalized in this document. |
| |
| Indicating a warning means displaying a message, returning |
| appropriate exit status, or something else to let the user |
| know that something worth warning occurred. The operation |
| should still finish if a warning is indicated. |
| |
| Indicating an error means displaying a message, returning |
| appropriate exit status, or something else to let the user |
| know that something prevented successfully finishing the |
| operation. The operation must be aborted once an error has |
| been indicated. |
| |
| |
| 1.1. Byte and Its Representation |
| |
| In this document, byte is always 8 bits. |
| |
| A `nul byte' has all bits unset. That is, the value of a nul |
| byte is 0x00. |
| |
| To represent byte blocks, this document uses notation that |
| is similar to the notation used in [RFC-1952]: |
| |
| +-------+ |
| | Foo | One byte. |
| +-------+ |
| |
| +---+---+ |
| | Foo | Two bytes; that is, some of the vertical bars |
| +---+---+ can be missing. |
| |
| +=======+ |
| | Foo | Zero or more bytes. |
| +=======+ |
| |
| In this document, a boxed byte or a byte sequence declared |
| using this notation is called `a field'. The example field |
| above would be called called `the Foo field' or plain `Foo'. |
| |
| |
| 1.2. Multibyte Integers |
| |
| Multibyte integers of static length, such as CRC values, |
| are stored in little endian byte order (least significant |
| byte first). |
| |
| When smaller values are more likely than bigger values (e.g. |
| file sizes), multibyte integers are encoded in a simple |
| variable-length representation: |
| - Numbers in the range [0, 127] are copied as is, and take |
| one byte of space. |
| - Bigger numbers will occupy two or more bytes. The lowest |
| seven bits of every byte are used for data; the highest |
| (eighth) bit indicates either that |
| 0) the byte is in the middle of the byte sequence, or |
| 1) the byte is the first or the last byte. |
| |
| For now, the value of the variable-length integers is limited |
| to 63 bits, which limits the encoded size of the integer to |
| nine bytes. These limits may be increased in future if needed. |
| |
| Note that the encoding is not as optimal as it could be. For |
| example, it is possible to encode the number 42 using any |
| number of bytes between one and nine. This is convenient |
| for non-streamed encoders, that write Compressed Size or |
| Uncompressed Size fields to the Block Header (see Section 3.1) |
| after the Compressed Data field is written to the disk. |
| |
| In several situations, the decoder needs to compare that two |
| fields contain identical information. When comparing fields |
| using the encoding described in this Section, the decoder must |
| consider two fields identical if their decoded values are |
| identical; it does not matter if the encoded variable-length |
| representations differ. |
| |
| The following C code illustrates encoding and decoding 63-bit |
| variables; the highest bit of uint64_t must be unset. The |
| functions return the number of bytes occupied by the integer |
| (1-9), or zero on error. |
| |
| #include <sys/types.h> |
| #include <inttypes.h> |
| |
| size_t |
| encode(uint8_t buf[static 9], uint64_t num) |
| { |
| if (num >= (UINT64_C(1) << (9 * 7))) |
| return 0; |
| if (num <= 0x7F) { |
| buf[0] = num; |
| return 1; |
| } |
| buf[0] = (num & 0x7F) | 0x80; |
| num >>= 7; |
| size_t i = 1; |
| while (num >= 0x80) { |
| buf[i++] = num & 0x7F; |
| num >>= 7; |
| } |
| buf[i++] = num | 0x80; |
| return i; |
| } |
| |
| size_t |
| decode(const uint8_t buf[], size_t size_max, uint64_t *num) |
| { |
| if (size_max == 0) |
| return 0; |
| if (size_max > 9) |
| size_max = 9; |
| *num = buf[0] & 0x7F; |
| if (!(buf[0] & 0x80)) |
| return 1; |
| size_t i = 1; |
| do { |
| if (i == size_max) |
| return 0; |
| *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i); |
| } while (!(buf[i++] & 0x80)); |
| return i; |
| } |
| |
| size_t |
| decode_reverse(const uint8_t buf[], size_t size_max, |
| uint64_t *num) |
| { |
| if (size_max == 0) |
| return 0; |
| const size_t end = size_max > 9 ? size_max - 9 : 0; |
| size_t i = size_max - 1; |
| *num = buf[i] & 0x7F; |
| if (!(buf[i] & 0x80)) |
| return 1; |
| do { |
| if (i-- == end) |
| return 0; |
| *num <<= 7; |
| *num |= buf[i] & 0x7F; |
| } while (!(buf[i] & 0x80)); |
| return size_max - i; |
| } |
| |
| |
| 2. Stream |
| |
| +========+========+========+ |
| | Stream | Stream | Stream | ... |
| +========+========+========+ |
| |
| A file contains usually only one Stream. However, it is |
| possible to concatenate multiple Streams together with no |
| additional processing. It is up to the implementation to |
| decide if the decoder will continue decoding from the next |
| Stream once the end of the first Stream has been reached. |
| |
| |
| 2.1. Stream Types |
| |
| There are two types of Streams: Single-Block Streams and |
| Multi-Block Streams. Decoders conforming to this specification |
| must support at least Single-Block Streams. Supporting |
| Multi-Block Streams is optional. If the decoder supports only |
| Single-Block Streams, the documentation of the decoder should |
| mention this fact clearly. |
| |
| |
| 2.1.1. Single-Block Stream |
| |
| +===============+============+ |
| | Stream Header | Data Block | |
| +===============+============+ |
| |
| As the name says, a Single-Block Stream has exactly one Block. |
| The Block must be a Data Block; Metadata Blocks are not allowed |
| in Single-Block Streams. |
| |
| |
| 2.1.2. Multi-Block Stream |
| |
| +===============+=======================+ |
| | Stream Header | Header Metadata Block | |
| +===============+=======================+ |
| |
| +============+ +============+=======================+ |
| ---> | Data Block | ... | Data Block | Footer Metadata Block | |
| +============+ +============+=======================+ |
| |
| Notes: |
| - Stream Header is mandatory. |
| - Header Metadata Block is optional. |
| - Each Multi-Block Stream has at least one Data Block. The |
| maximum number of Data Blocks is not limited. |
| - Footer Metadata Block is mandatory. |
| |
| |
| 2.2. Stream Header |
| |
| +---+---+---+---+---+---+--------------+--+--+--+--+ |
| | Header Magic Bytes | Stream Flags | CRC32 | |
| +---+---+---+---+---+---+--------------+--+--+--+--+ |
| |
| |
| 2.2.1. Header Magic Bytes |
| |
| The first six (6) bytes of the Stream are so called Header |
| Magic Bytes. They can be used to identify the file type. |
| |
| Using a C array and ASCII: |
| const uint8_t HEADER_MAGIC[6] |
| = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 }; |
| |
| In plain hexadecimal: |
| FF 4C 5A 4D 41 00 |
| |
| Notes: |
| - The first byte (0xFF) was chosen so that the files cannot |
| be erroneously detected as being in LZMA_Alone format, in |
| which the first byte is in the the range [0x00, 0xE0]. |
| - The sixth byte (0x00) was chosen to prevent applications |
| from misdetecting the file as a text file. |
| |
| |
| 2.2.2. Stream Flags |
| |
| Bit(s) Mask Description |
| 0-2 0x07 Type of Check (see Section 3.3.1): |
| ID Size Check name |
| 0x00 0 bytes None |
| 0x01 4 bytes CRC32 |
| 0x02 4 bytes (Reserved) |
| 0x03 8 bytes CRC64 |
| 0x04 16 bytes (Reserved) |
| 0x05 32 bytes SHA-256 |
| 0x06 32 bytes (Reserved) |
| 0x07 64 bytes (Reserved) |
| 3 0x08 The CRC32 field is present in Block Headers. |
| 4 0x10 If unset, this is a Single-Block Stream; if set, |
| this is a Multi-Block Stream. |
| 5-7 0xE0 Reserved for future use; must be zero for now. |
| |
| Implementations must support at least the Check IDs 0x00 (None) |
| and 0x01 (CRC32). Supporting other Check IDs is optional. If an |
| unsupported Check is used, the decoder must indicate a warning |
| or error. |
| |
| If any reserved bit is set, the decoder must indicate an error. |
| It is possible that there is a new field present which the |
| decoder is not aware of, and can thus parse the Stream Header |
| incorrectly. |
| |
| |
| 2.2.3. CRC32 |
| |
| The CRC32 is calculated from the Stream Flags field. It is |
| stored as an unsigned 32-bit little endian integer. If the |
| calculated value does not match the stored one, the decoder |
| must indicate an error. |
| |
| Note that this field is always present; the bit in Stream Flags |
| controls only presence of CRC32 in Block Headers. |
| |
| |
| 3. Block |
| |
| +==============+=================+==============+ |
| | Block Header | Compressed Data | Block Footer | |
| +==============+=================+==============+ |
| |
| There are two types of Blocks: |
| - Data Blocks hold the actual compressed data. |
| - Metadata Blocks hold the Index, Extra, and a few other |
| non-data fields (see Section 5). |
| |
| The type of the Block is indicated by the corresponding bit |
| in the Block Flags field (see Section 3.1.1). |
| |
| |
| 3.1. Block Header |
| |
| +------+------+=================+===================+ |
| | Block Flags | Compressed Size | Uncompressed Size | |
| +------+------+=================+===================+ |
| |
| +======================+--+--+--+--+================+ |
| ---> | List of Filter Flags | CRC32 | Header Padding | |
| +======================+--+--+--+--+================+ |
| |
| |
| 3.1.1. Block Flags |
| |
| The first byte of the Block Flags field is a bit field: |
| |
| Bit(s) Mask Description |
| 0-2 0x07 Number of filters (0-7) |
| 3 0x08 Use End of Payload Marker (even if |
| Uncompressed Size is stored to Block Header). |
| 4 0x10 The Compressed Size field is present. |
| 5 0x20 The Uncompressed Size field is present. |
| 6 0x40 Reserved for future use; must be zero for now. |
| 7 0x80 This is a Metadata Block. |
| |
| The second byte of the Block Flags field is also a bit field: |
| |
| Bit(s) Mask Description |
| 0-4 0x1F Size of the Header Padding field (0-31 bytes) |
| 5-7 0xE0 Reserved for future use; must be zero for now. |
| |
| The decoder must indicate an error if End of Payload Marker |
| is not used and Uncompressed Size is not stored to the Block |
| Header. Because of this, the first byte of Block Flags can |
| never be a nul byte. This is useful when detecting beginning |
| of the Block after Footer Padding (see Section 3.3.3). |
| |
| If any reserved bit is set, the decoder must indicate an error. |
| It is possible that there is a new field present which the |
| decoder is not aware of, and can thus parse the Block Header |
| incorrectly. |
| |
| |
| 3.1.2. Compressed Size |
| |
| This field is present only if the appropriate bit is set in |
| the Block Flags field (see Section 3.1.1). |
| |
| This field contains the size of the Compressed Data field. |
| The size is stored using the encoding described in Section 1.2. |
| If the Compressed Size does not match the real size of the |
| Compressed Data field, the decoder must indicate an error. |
| |
| Having the Compressed Size field in the Block Header can be |
| useful for multithreaded decoding when seeking is not possible. |
| If the Blocks are small enough, the decoder can read multiple |
| Blocks into its internal buffer, and decode the Blocks in |
| parallel. |
| |
| Compressed Size can also be useful when seeking forwards to |
| a specific location in streamed mode: the decoder can quickly |
| skip over irrelevant Blocks, without decoding them. |
| |
| |
| 3.1.3. Uncompressed Size |
| |
| This field is present only if the appropriate bit is set in |
| the Block Flags field (see Section 3.1.1). |
| |
| The Uncompressed Size field contains the size of the Block |
| after uncompressing. |
| |
| Storing Uncompressed Size serves several purposes: |
| - The decoder will know when all of the data has been |
| decoded without an explicit End of Payload Marker. |
| - The decoder knows how much memory it needs to allocate |
| for a temporary buffer in multithreaded mode. |
| - Simple error detection: wrong size indicates a broken file. |
| - Sometimes it is useful to know the file size without |
| uncompressing the file. |
| |
| It should be noted that the only reliable way to find out what |
| the real uncompressed size is is to uncompress the Block, |
| because the Block Header and Metadata Block fields may contain |
| (intentionally or unintentionally) invalid information. |
| |
| Uncompressed Size is stored using the encoding described in |
| Section 1.2. If the Uncompressed Size does not match the |
| real uncompressed size, the decoder must indicate an error. |
| |
| |
| 3.1.4. List of Filter Flags |
| |
| +================+================+ +================+ |
| | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | |
| +================+================+ +================+ |
| |
| The number of Filter Flags fields is stored in the Block Flags |
| field (see Section 3.1.1). As a special case, if the number of |
| Filter Flags fields is zero, it is equivalent to having the |
| Copy filter as the only filter. |
| |
| The format of each Filter Flags field is as follows: |
| |
| +------+=============+=============================+ |
| | Misc | External ID | External Size of Properties | |
| +------+=============+=============================+ |
| |
| +===================+ |
| ---> | Filter Properties | |
| +===================+ |
| |
| The list of officially defined Filter IDs and the formats of |
| their Filter Properties are described in Section 4.3. |
| |
| |
| 3.1.4.1. Misc |
| |
| To save space, the most commonly used Filter IDs and the |
| Size of Filter Properties are encoded in a single byte. |
| Depending on the contents of the Misc field, Filter ID is |
| the value of the Misc or External ID field. |
| |
| Value Filter ID Size of Filter Properties |
| 0x00 - 0x1F Misc 0 bytes |
| 0x20 - 0x3F Misc 1 byte |
| 0x40 - 0x5F Misc 2 bytes |
| 0x60 - 0x7F Misc 3 bytes |
| 0x80 - 0x9F Misc 4 bytes |
| 0xA0 - 0xBF Misc 5 bytes |
| 0xC0 - 0xDF Misc 6 bytes |
| 0xE0 - 0xFE External ID 0-30 bytes |
| 0xFF External ID External Size of Properties |
| |
| The following code demonstrates parsing the Misc field and, |
| when needed, the External ID and External Size of Properties |
| fields. |
| |
| uint64_t id; |
| uint64_t properties_size; |
| uint8_t misc = read_byte(); |
| |
| if (misc >= 0xE0) { |
| id = read_variable_length_integer(); |
| |
| if (misc == 0xFF) |
| properties_size = read_variable_length_integer(); |
| else |
| properties_size = misc - 0xE0; |
| |
| } else { |
| id = misc; |
| properties_size = misc / 0x20; |
| } |
| |
| |
| 3.1.4.2. External ID |
| |
| This field is present only if the Misc field contains a value |
| that indicates usage of External ID. The External ID is stored |
| using the encoding described in Section 1.2. |
| |
| |
| 3.1.4.3. External Size of Properties |
| |
| This field is present only if the Misc field contains a value |
| that indicates usage of External Size of Properties. The size |
| of Filter Properties is stored using the encoding described in |
| Section 1.2. |
| |
| |
| 3.1.4.4. Filter Properties |
| |
| Size of this field depends on the Misc field (Section 3.1.4.1) |
| and, if present, External Size of Properties field (Section |
| 3.1.4.3). The format of this field is depends on the selected |
| filter; see Section 4.3 for details. |
| |
| |
| 3.1.5. CRC32 |
| |
| This field is present only if the appropriate bit is set in |
| the Stream Flags field (see Section 2.2.2). |
| |
| The CRC32 is calculated over everything in the Block Header |
| field except the Header Padding field and the CRC32 field |
| itself. It is stored as an unsigned 32-bit little endian |
| integer. If the calculated value does not match the stored |
| one, the decoder must indicate an error. |
| |
| |
| 3.1.6. Header Padding |
| |
| This field contains as many nul bytes as indicated by the value |
| stored in the Header Flags field. If the Header Padding field |
| contains any non-nul bytes, the decoder must indicate an error. |
| |
| The intent of the Header Padding field is to allow alignment |
| of Compressed Data. The usefulness of alignment is described |
| in Section 4.3. |
| |
| |
| 3.2. Compressed Data |
| |
| The format of Compressed Data depends on Block Flags and List |
| of Filter Flags. Excluding the descriptions of the simplest |
| filters in Section 4, the format of the filter-specific encoded |
| data is out of scope of this document. |
| |
| Note a special case: if End of Payload Marker (see Section |
| 3.1.1) is not used and Uncompressed Size is zero, the size |
| of the Compressed Data field is always zero. |
| |
| |
| 3.3. Block Footer |
| |
| +=======+===============+================+ |
| | Check | Stream Footer | Footer Padding | |
| +=======+===============+================+ |
| |
| |
| 3.3.1. Check |
| |
| The type and size of the Check field depends on which bits |
| are set in the Stream Flags field (see Section 2.2.2). |
| |
| The Check, when used, is calculated from the original |
| uncompressed data. If the calculated Check does not match the |
| stored one, the decoder must indicate an error. If the selected |
| type of Check is not supported by the decoder, it must indicate |
| a warning or error. |
| |
| |
| 3.3.2. Stream Footer |
| |
| +===================+===============+--------------+ |
| | Uncompressed Size | Backward Size | Stream Flags | |
| +===================+===============+--------------+ |
| |
| +----------+---------+ |
| ---> | Footer Magic Bytes | |
| +----------+---------+ |
| |
| Stream Footer is present only in |
| - Data Block of a Single-Block Stream; and |
| - Footer Metadata Block of a Multi-Block Stream. |
| |
| The Stream Footer field is placed inside Block Footer, because |
| no padding is allowed between Check and Stream Footer. |
| |
| |
| 3.3.2.1. Uncompressed Size |
| |
| This field is present only in the Data Block of a Single-Block |
| Stream if Uncompressed Size is not stored to the Block Header |
| (see Section 3.1.1). Without the Uncompressed Size field in |
| Stream Footer it would not be possible to quickly find out |
| the Uncompressed Size of the Stream in all cases. |
| |
| Uncompressed Size is stored using the encoding described in |
| Section 1.2. If the stored value does not match the real |
| uncompressed size of the Single-Block Stream, the decoder must |
| indicate an error. |
| |
| |
| 3.3.2.2. Backward Size |
| |
| This field contains the total size of the Block Header, |
| Compressed Data, Check, and Uncompressed Size fields. The |
| value is stored using the encoding described in Section 1.2. |
| If the Backward Size does not match the real total size of |
| the appropriate fields, the decoder must indicate an error. |
| |
| Implementations reading the Stream backwards should notice |
| that the value in this field can never be zero. |
| |
| |
| 3.3.2.3. Stream Flags |
| |
| This is a copy of the Stream Flags field from the Stream |
| Header. The information stored to Stream Flags is needed |
| when parsing the Stream backwards. |
| |
| |
| 3.3.2.4. Footer Magic Bytes |
| |
| As the last step of the decoding process, the decoder must |
| verify the existence of Footer Magic Bytes. If they are not |
| found, an error must be indicated. |
| |
| Using a C array and ASCII: |
| const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; |
| |
| In hexadecimal: |
| 59 5A |
| |
| The primary reason to have Footer Magic Bytes is to make |
| it easier to detect incomplete files quickly, without |
| uncompressing. If the file does not end with Footer Magic Bytes |
| (excluding Footer Padding described in Section 3.3.3), it |
| cannot be undamaged, unless someone has intentionally appended |
| garbage after the end of the Stream. (Appending garbage at the |
| end of the file does not prevent uncompressing the file, but |
| may give a warning or error depending on the decoder |
| implementation.) |
| |
| |
| 3.3.3. Footer Padding |
| |
| In certain situations it is convenient to be able to pad |
| Blocks or Streams to be multiples of, for example, 512 bytes. |
| Footer Padding makes this possible. Note that this is in no |
| way required to enforce alignment in the way described in |
| Section 4.3; the Header Padding field is enough for that. |
| |
| When Footer Padding is used, it must contain only nul bytes. |
| Any non-nul byte should be considered as the beginning of |
| a new Block or Stream. |
| |
| The possibility of Padding should be taken into account when |
| designing an application that wants to find out information |
| about a Stream by parsing Footer Metadata Block. |
| |
| Support for Padding was inspired by a related note in |
| [GNU-tar]. |
| |
| |
| 4. Filters |
| |
| The Block Flags field defines how many filters are used. When |
| more than one filter is used, the filters are chained; that is, |
| the output of one filter is the input of another filter. The |
| following figure illustrates the direction of data flow. |
| |
| v Uncompressed Data ^ |
| | Filter 0 | |
| Encoder | Filter 1 | Decoder |
| | ... | |
| | Filter n | |
| v Compressed Data ^ |
| |
| The filters are independent from each other, except that they |
| must cooperate a little to make it possible, in all cases, to |
| detect when all of the data has been decoded. In addition, the |
| filters should cooperate in the encoder to keep the alignment |
| optimal. |
| |
| |
| 4.1. Detecting when All Data Has Been Decoded |
| |
| There must be a way for the decoder to detect when all of the |
| Compressed Data has been decoded. This is simple when only |
| one filter is used, but a bit more complex when multiple |
| filters are chained. |
| |
| This file format supports three methods to detect when all of |
| the data has been decoded: |
| - Uncompressed size |
| - End of Input |
| - End of Payload Marker |
| |
| In both encoder and decoder, filters are initialized starting |
| from the first filter in the chain. For each filter, one of |
| these three methods is used. |
| |
| |
| 4.1.1. With Uncompressed Size |
| |
| This method is the only method supported by all filters. |
| It must be used when uncompressed size is known by the |
| filter-specific encoder or decoder. In practice this means |
| that Uncompressed Size has been stored to the Block Header. |
| |
| In case of the first filter in the chain, the uncompressed size |
| given to the filter-specific encoder or decoder equals the |
| Uncompressed Size stored in the Block Header. For the rest of |
| the filters in the chain, uncompressed size is the size of the |
| output data of the previous filter in the chain. |
| |
| Note that when Use End of Payload Marker bit is set in Block |
| Flags, Uncompressed Size is considered to be unknown even if |
| it was present in the Block Header. Thus, if End of Payload |
| Marker is used, uncompressed size of all of the filters in |
| the chain is unknown, and can never be used to detect when |
| all of the data has been decoded. |
| |
| Once the correct number of bytes has been written out, the |
| filter-specific decoder indicates to its caller that all of |
| the data has been decoded. If the filter-specific decoder |
| detects End of Input or End of Payload Marker before the |
| correct number of bytes is decoded, the decoder must indicate |
| an error. |
| |
| |
| 4.1.2. With End of Input |
| |
| Most filters will know that all of the data has been decoded |
| when the End of Input data has been reached. Once the filter |
| knows that it has received the input data in its entirety, |
| it finishes its job, and indicates to its caller that all of |
| the data has been decoded. The filter-specific decoder must |
| indicate an error if it detects End of Payload Marker. |
| |
| Note that this method can work only when the filter is not |
| the last filter in the chain, because only another filter |
| can indicate the End of Input data. In practice this means, |
| that a filter later in the chain must support embedding |
| End of Payload Marker. |
| |
| When a filter that cannot embed End of Payload Marker is the |
| last filter in the chain, Subblock filter is appended to the |
| chain as an implicit filter. In the simplest case, this occurs |
| when no filters are specified, and Uncompressed Size is unknown |
| or the End of Payload Marker bit is set in Block Flags. |
| |
| |
| 4.1.3. With End of Payload Marker |
| |
| End of Payload Marker is a filter-specific bit sequence that |
| indicates the end of data. It is supported by only a few |
| filters. It is used when uncompressed size is unknown, and |
| the filter |
| - doesn't support End of Input; or |
| - is the last filter in the chain. |
| |
| End of Payload Marker is embedded at the end of the encoded |
| data by the filter-specific encoder. When the filter-specific |
| decoder detects the embedded End of Payload Marker, the decoder |
| knows that all of the data has been decoded. Then it finishes |
| its job, and indicates to its caller that all of the data has |
| been decoded. If the filter-specific decoder detects End of |
| Input before End of Payload Marker, the decoder must indicate |
| an error. |
| |
| If the filter supports both End of Input and End of Payload |
| Marker, the former is used, unless the filter is the last |
| filter in the chain. |
| |
| |
| 4.2. Alignment |
| |
| Some filters give better compression ratio or are faster |
| when the input or output data is aligned. For optimal results, |
| the encoder should try to enforce proper alignment when |
| possible. Not enforcing alignment in the encoder is not |
| an error. Thus, the decoder must be able to handle files with |
| suboptimal alignment. |
| |
| Alignment of uncompressed input data is usually the job of |
| the application producing the data. For example, to get the |
| best results, an archiver tool should make sure that all |
| PowerPC executable files in the archive stream start at |
| offsets that are multiples of four bytes. |
| |
| Some filters, for example LZMA, can be configured to take |
| advantage of specified alignment of input data. Note that |
| taking advantage of aligned input can be benefical also when |
| a filter is not the first filter in the chain. For example, |
| if you compress PowerPC executables, you may want to use the |
| PowerPC filter and chain that with the LZMA filter. Because not |
| only the input but also the output alignment of the PowerPC |
| filter is four bytes, it is now benefical to set LZMA settings |
| so that the LZMA encoder can take advantage of its |
| four-byte-aligned input data. |
| |
| The output of the last filter in the chain is stored to the |
| Compressed Data field. Aligning Compressed Data appropriately |
| can increase |
| - speed, if the filtered data is handled multiple bytes at |
| a time by the filter-specific encoder and decoder, |
| because accessing aligned data in computer memory is |
| usually faster; and |
| - compression ratio, if the output data is later compressed |
| with an external compression tool. |
| |
| Compressed Data in a Stream can be aligned by using the Header |
| Padding field in the Block Header. |
| |
| |
| 4.3. Filters |
| |
| 4.3.1. Copy |
| |
| This is a dummy filter that simply copies all data from input |
| to output unmodified. |
| |
| Filter ID: 0x00 |
| Size of Filter Properties: 0 bytes |
| Changes size of data: No |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: No |
| End of Input: Yes |
| |
| Preferred alignment: |
| Input data: 1 byte |
| Output data: 1 byte |
| |
| |
| 4.3.2. Subblock |
| |
| The Subblock filter can be used to |
| - embed End of Payload Marker when the otherwise last |
| filter in the chain does not support embedding it; and |
| - apply additional filters in the middle of a Block. |
| |
| Filter ID: 0x01 |
| Size of Filter Properties: 0 bytes |
| Changes size of data: Yes, unpredictably |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: Yes |
| End of Input: Yes |
| |
| Preferred alignment: |
| Input data: 1 byte |
| Output data: Freely adjustable |
| |
| |
| 4.3.2.1. Format of the Encoded Output |
| |
| The encoded data from the Subblock filter consist of zero or |
| more Subblocks: |
| |
| +==========+==========+ |
| | Subblock | Subblock | ... |
| +==========+==========+ |
| |
| Each Subblock contains two fields: |
| |
| +----------------+===============+ |
| | Subblock Flags | Subblock Data | |
| +----------------+===============+ |
| |
| Subblock Flags is a bitfield: |
| |
| Bits Mask Description |
| 0-3 0x0F The interpretation of these bits depend on |
| the Subblock Type: |
| - 0x20 Bits 0-3 for Size |
| - 0x30 Bits 0-3 for Repeat Count |
| - Other These bits must be zero. |
| 4-7 0xF0 Subblock Type: |
| - 0x00: Padding |
| - 0x10: End of Payload Marker |
| - 0x20: Data |
| - 0x30: Repeating Data |
| - 0x40: Set Subfilter |
| - 0x50: Unset Subfilter |
| If some other value is detected, the decoder |
| must indicate an error. |
| |
| The format of the Subblock Data field depends on Subblock Type. |
| |
| Subblocks with the Subblock Type 0x00 (Padding) don't have a |
| Subblock Data field. These Subblocks can be useful for fixing |
| alignment. There can be at maximum of 31 consecutive Subblocks |
| with this Subblock Type; if there are more, the decoder must |
| indicate an error. |
| |
| Subblock with the Subblock Type 0x10 (End of Payload Marker) |
| doesn't have a Subblock Data field. The decoder must indicate |
| an error if this Subblock Type is detected when Subfilter is |
| enabled, or when the Subblock filter is not supposed to embed |
| the End of Payload Marker. |
| |
| Subblocks with the Subblock Type 0x20 (Data) contain the rest |
| of the Size, which is followed by Size + 1 bytes in the Data |
| field (that is, Data can never be empty): |
| |
| +------+------+------+======+ |
| | Bits 4-27 for Size | Data | |
| +------+------+------+======+ |
| |
| Subblocks with the Subblock Type 0x30 (Repeating Data) contain |
| the rest of the Repeat Count, the Size of the Data, and finally |
| the actual Data to be repeated: |
| |
| +---------+---------+--------+------+======+ |
| | Bits 4-27 for Repeat Count | Size | Data | |
| +---------+---------+--------+------+======+ |
| |
| The size of the Data field is Size + 1. It is repeated Repeat |
| Count + 1 times. That is, the minimum size of Data is one byte; |
| the maximum size of Data is 256 bytes. The minimum number of |
| repeats is one; the maximum number of repeats is 2^28. |
| |
| If Subfilter is not used, the Data field of Subblock Types 0x20 |
| and 0x30 is the output of the decoded Subblock filter. If |
| Subfilter is used, Data is the input of the Subfilter, and the |
| decoded output of the Subfilter is the decoded output of the |
| Subblock filter. |
| |
| Subblocks with the Subblock Type 0x40 (Set Subfilter) contain |
| a Filter Flags field in Subblock Data: |
| |
| +==============+ |
| | Filter Flags | |
| +==============+ |
| |
| It is an error to set the Subfilter to Filter ID 0x00 (Copy) |
| or 0x01 (Subblock). All the other Filter IDs are allowed. |
| The decoder must indicate an error if this Subblock Type is |
| detected when a Subfilter is already enabled. |
| |
| Subblocks with the Subblock Type 0x50 (Unset Subfilter) don't |
| have a Subblock Data field. There must be at least one Subblock |
| with Subblock Type 0x20 or 0x30 between Subblocks with Subblock |
| Type 0x40 and 0x50; if there isn't, the decoder must indicate |
| an error. |
| |
| Subblock Types 0x40 and 0x50 are always used as a pair: If the |
| Subblock filter has been enabled with Subblock Type 0x40, it |
| must always be disabled later with Subblock Type 0x50. |
| Disabling must be done even if the Subfilter used End of |
| Payload Marker; after the Subfilter has detected End of Payload |
| Marker, the next Subblock that is not Padding must unset the |
| Subfilter. |
| |
| When the Subblock filter is used as an implicit filter to embed |
| End of Payload marker, the Subblock Types 0x40 and 0x50 (Set or |
| Unset Subfilter) must not be used. The decoder must indicate an |
| error if it detects any of these Subblock Types in an implicit |
| Subblock filter. |
| |
| The following code illustrates the basic structure of a |
| Subblock decoder. |
| |
| uint32_t consecutive_padding = 0; |
| bool got_output_with_subfilter = false; |
| |
| while (true) { |
| uint32_t size; |
| uint32_t repeat; |
| uint8_t flags = read_byte(); |
| |
| if (flags != 0) |
| consecutive_padding = 0; |
| |
| switch (flags >> 4) { |
| case 0: |
| // Padding |
| if (flags & 0x0F) |
| return DATA_ERROR; |
| if (++consecutive_padding == 32) |
| return DATA_ERROR; |
| break; |
| |
| case 1: |
| // End of Payload Marker |
| if (flags & 0x0F) |
| return DATA_ERROR; |
| if (subfilter_enabled || !allow_eopm) |
| return DATA_ERROR; |
| break; |
| |
| case 2: |
| // Data |
| size = flags & 0x0F; |
| for (size_t i = 4; i < 28; i += 8) |
| size |= (uint32_t)(read_byte()) << i; |
| |
| // If any output is produced, this will |
| // set got_output_with_subfilter to true. |
| copy_data(size); |
| break; |
| |
| case 3: |
| // Repeating Data |
| repeat = flags & 0x0F; |
| for (size_t i = 4; i < 28; i += 8) |
| repeat |= (uint32_t)(read_byte()) << i; |
| size = read_byte(); |
| |
| // If any output is produced, this will |
| // set got_output_with_subfilter to true. |
| copy_repeating_data(size, repeat); |
| break; |
| |
| case 4: |
| // Set Subfilter |
| if (flags & 0x0F) |
| return DATA_ERROR; |
| if (subfilter_enabled) |
| return DATA_ERROR; |
| got_output_with_subfilter = false; |
| set_subfilter(); |
| break; |
| |
| case 5: |
| // Unset Subfilter |
| if (flags & 0x0F) |
| return DATA_ERROR; |
| if (!subfilter_enabled) |
| return DATA_ERROR; |
| if (!got_output_with_subfilter) |
| return DATA_ERROR; |
| unset_subfilter(); |
| break; |
| |
| default: |
| return DATA_ERROR; |
| } |
| } |
| |
| |
| 4.3.3. Delta |
| |
| The Delta filter may increase compression ratio when the value |
| of the next byte correlates with the value of an earlier byte |
| at specified distance. |
| |
| Filter ID: 0x20 |
| Size of Filter Properties: 1 byte |
| Changes size of data: No |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: No |
| End of Input: Yes |
| |
| Preferred alignment: |
| Input data: 1 byte |
| Output data: Same as the original input data |
| |
| The Properties byte indicates the delta distance, which can be |
| 1-256 bytes backwards from the current byte: 0x00 indicates |
| distance of 1 byte and 0xFF distance of 256 bytes. |
| |
| |
| 4.3.3.1. Format of the Encoded Output |
| |
| The code below illustrates both encoding and decoding with |
| the Delta filter. |
| |
| // Distance is in the range [1, 256]. |
| const unsigned int distance = get_properties_byte() + 1; |
| uint8_t pos = 0; |
| uint8_t delta[256]; |
| |
| memset(delta, 0, sizeof(delta)); |
| |
| while (1) { |
| const int byte = read_byte(); |
| if (byte == EOF) |
| break; |
| |
| uint8_t tmp = delta[(uint8_t)(distance + pos)]; |
| if (is_encoder) { |
| tmp = (uint8_t)(byte) - tmp; |
| delta[pos] = (uint8_t)(byte); |
| } else { |
| tmp = (uint8_t)(byte) + tmp; |
| delta[pos] = tmp; |
| } |
| |
| write_byte(tmp); |
| --pos; |
| } |
| |
| |
| 4.3.4. LZMA |
| |
| LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse |
| compression algorithm with high compression ratio and fast |
| decompression. LZMA based on LZ77 and range coding algorithms. |
| |
| Filter ID: 0x40 |
| Size of Filter Properties: 2 bytes |
| Changes size of data: Yes, unpredictably |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: Yes |
| End of Input: No |
| |
| Preferred alignment: |
| Input data: Adjustable to 1/2/4/8/16 byte(s) |
| Output data: 1 byte |
| |
| At the time of writing, there is no other documentation about |
| how LZMA works than the source code in LZMA SDK. Once such |
| documentation gets written, it will probably be published as |
| a separate document, because including the documentation here |
| would lengthen this document considerably. |
| |
| The format of the Filter Properties field is as follows: |
| |
| +-----------------+------------------+ |
| | LZMA Properties | Dictionary Flags | |
| +-----------------+------------------+ |
| |
| |
| 4.3.4.1. LZMA Properties |
| |
| The LZMA Properties bits contain three properties. An |
| abbreviation is given in parentheses, followed by the value |
| range of the property. The field consists of |
| |
| 1) the number of literal context bits (lc, [0, 8]); |
| 2) the number of literal position bits (lp, [0, 4]); and |
| 3) the number of position bits (pb, [0, 4]). |
| |
| They are encoded using the following formula: |
| |
| LZMA Properties = (pb * 5 + lp) * 9 + lc |
| |
| The following C code illustrates a straightforward way to |
| decode the properties: |
| |
| uint8_t lc, lp, pb; |
| uint8_t prop = get_lzma_properties() & 0xFF; |
| if (prop > (4 * 5 + 4) * 9 + 8) |
| return LZMA_PROPERTIES_ERROR; |
| |
| pb = prop / (9 * 5); |
| prop -= pb * 9 * 5; |
| lp = prop / 9; |
| lc = prop - lp * 9; |
| |
| |
| 4.3.4.2. Dictionary Flags |
| |
| Currently the lowest six bits of the Dictionary Flags field |
| are in use: |
| |
| Bits Mask Description |
| 0-5 0x3F Dictionary Size |
| 6-7 0xC0 Reserved for future use; must be zero for now. |
| |
| Dictionary Size is encoded with one-bit mantissa and five-bit |
| exponent. To avoid wasting space, one-byte dictionary has its |
| own special value. |
| |
| Raw value Mantissa Exponent Dictionary size |
| 0 1 0 1 byte |
| 1 2 0 2 bytes |
| 2 3 0 3 bytes |
| 3 2 1 4 bytes |
| 4 3 1 6 bytes |
| 5 2 2 8 bytes |
| 6 3 2 12 bytes |
| 7 2 3 16 bytes |
| 8 3 3 24 bytes |
| 9 2 4 32 bytes |
| ... ... ... ... |
| 61 2 30 2 GiB |
| 62 3 30 3 GiB |
| 63 2 31 4 GiB (*) |
| |
| (*) The real maximum size of the dictionary is one byte |
| less than 4 GiB, because the distance of 4 GiB is |
| reserved for End of Payload Marker. |
| |
| Instead of having a table in the decoder, the dictionary size |
| can be decoded using the following C code: |
| |
| uint64_t dictionary_size; |
| const uint8_t bits = get_dictionary_flags() & 0x3F; |
| if (bits == 0) { |
| dictionary_size = 1; |
| } else { |
| dictionary_size = 2 | ((bits + 1) & 1); |
| dictionary_size = dictionary_size << ((bits - 1) / 2); |
| } |
| |
| |
| 4.3.5. Branch/Call/Jump Filters for Executables |
| |
| These filters convert relative branch, call, and jump |
| instructions to their absolute counterparts in executable |
| files. This conversion increases redundancy and thus |
| compression ratio. |
| |
| Size of Filter Properties: 0 or 4 bytes |
| Changes size of data: No |
| |
| Detecting when all of the data has been decoded: |
| Uncompressed size: Yes |
| End of Payload Marker: No |
| End of Input: Yes |
| |
| Below is the list of filters in this category. The alignment |
| is the same for both input and output data. |
| |
| Filter ID Alignment Description |
| 0x04 1 byte x86 filter (BCJ) |
| 0x05 4 bytes PowerPC (big endian) filter |
| 0x06 16 bytes IA64 filter |
| 0x07 4 bytes ARM (little endian) filter |
| 0x08 2 bytes ARM Thumb (little endian) filter |
| 0x09 4 bytes SPARC filter |
| |
| If the size of Filter Properties is four bytes, the Filter |
| Properties field contains the start offset used for address |
| conversions. It is stored as an unsigned 32-bit little endian |
| integer. If the size of Filter Properties is zero, the start |
| offset is zero. |
| |
| Setting the start offset may be useful if an executable has |
| multiple sections, and there are many cross-section calls. |
| Taking advantage of this feature usually requires usage of |
| the Subblock filter. |
| |
| |
| 5. Metadata |
| |
| Metadata is stored in Metadata Blocks, which can be in the |
| beginning or at the end of a Multi-Block Stream. Because of |
| Blocks, it is possible to compress Metadata in the same way |
| as the actual data is compressed. This Section describes the |
| format of the data stored in Metadata Blocks. |
| |
| +----------------+===============================+ |
| | Metadata Flags | Size of Header Metadata Block | |
| +----------------+===============================+ |
| |
| +============+===================+=======+=======+ |
| ---> | Total Size | Uncompressed Size | Index | Extra | |
| +============+===================+=======+=======+ |
| |
| Stream must be parseable backwards. That is, there must be |
| a way to locate the beginning of the Stream by starting from |
| the end of the Stream. Thus, the Footer Metadata Block must |
| contain the Total Size field or the Index field. If the Stream |
| has Header Metadata Block, also the Size of Header Metadata |
| Block field must be present in Footer Metadata Block. |
| |
| It must be possible to quickly locate the Blocks in |
| non-streamed mode. Thus, the Index field must be present |
| at least in one Metadata Block. |
| |
| If the above conditions are not met, the decoder must indicate |
| an error. |
| |
| There should be no additional data after the last field. If |
| there is, the the decoder should indicate an error. |
| |
| |
| 5.1. Metadata Flags |
| |
| This field describes which fields are present in a Metadata |
| Block: |
| |
| Bit(s) Mask Desription |
| 0 0x01 Size of Header Metadata Block is present. |
| 1 0x02 Total Size is present. |
| 2 0x04 Uncompressed Size is present. |
| 3 0x08 Index is present. |
| 4-6 0x70 Reserve for future use; must be zero for now. |
| 7 0x80 Extra is present. |
| |
| If any reserved bit is set, the decoder must indicate an error. |
| It is possible that there is a new field present which the |
| decoder is not aware of, and can thus parse the Metadata |
| incorrectly. |
| |
| |
| 5.2. Size of Header Metadata Block |
| |
| This field is present only if the appropriate bit is set in |
| the Metadata Flags field (see Section 5.1). |
| |
| Size of Header Metadata Block is needed to make it possible to |
| parse the Stream backwards. The size is stored using the |
| encoding described in Section 1.2. The decoder must verify that |
| that the value stored in this field is non-zero. In Footer |
| Metadata Block, the decoder must also verify that the stored |
| size matches the real size of Header Metadata Block. In the |
| Header Meatadata Block, the value of this field is ignored as |
| long as it is not zero. |
| |
| |
| 5.3. Total Size |
| |
| This field is present only if the appropriate bit is set in the |
| Metadata Flags field (see Section 5.1). |
| |
| This field contains the total size of the Data Blocks in the |
| Stream. Total Size is stored using the encoding described in |
| Section 1.2. If the stored value does not match the real total |
| size of the Data Blocks, the decoder must indicate an error. |
| The value of this field must be non-zero. |
| |
| Total Size can be used to quickly locate the beginning or end |
| of the Stream. This can be useful for example when doing |
| random-access reading, and the Index field is not in the |
| Metadata Block currently being read. |
| |
| It is useless to have both Total Size and Index in the same |
| Metadata Block, because Total Size can be calculated from the |
| Index field. |
| |
| |
| 5.4. Uncompressed Size |
| |
| This field is present only if the appropriate bit is set in the |
| Metadata Flags field (see Section 5.1). |
| |
| This field contains the total uncompressed size of the Data |
| Blocks in the Stream. Uncompresssed Size is stored using the |
| encoding described in Section 1.2. If the stored value does not |
| match the real uncompressed size of the Data Blocks, the |
| decoder must indicate an error. |
| |
| It is useless to have both Uncompressed Size and Index in |
| the same Metadata Block, because Uncompressed Size can be |
| calculated from the Index field. |
| |
| |
| 5.5. Index |
| |
| +=======================+=============+====================+ |
| | Number of Data Blocks | Total Sizes | Uncompressed Sizes | |
| +=======================+=============+====================+ |
| |
| Index serves several purporses. Using it, one can |
| - verify that all Blocks in a Stream have been processed; |
| - find out the Uncompressed Size of a Stream; and |
| - quickly access the beginning of any Block (random access). |
| |
| |
| 5.5.1. Number of Data Blocks |
| |
| This field contains the number of Data Blocks in the Stream. |
| The value is stored using the encoding described in Section |
| 1.2. If the decoder has decoded all the Data Blocks of the |
| Stream, and then notices that the Number of Records doesn't |
| match the real number of Data Blocks, the decoder must |
| indicate an error. The value of this field must be non-zero. |
| |
| |
| 5.5.2. Total Sizes |
| |
| +============+============+ |
| | Total Size | Total Size | ... |
| +============+============+ |
| |
| This field lists the Total Sizes of every Data Block in the |
| Stream. There are as many Total Size fields as indicated by |
| the Number of Data Blocks field. |
| |
| Total Size is the size of Block Header, Compressed Data, and |
| Block Footer. It is stored using the encoding described in |
| Section 1.2. If the Total Sizes do not match the real sizes |
| of respective Blocks, the decoder should indicate an error. |
| All the Total Size fields must have a non-zero value. |
| |
| |
| 5.5.3. Uncompressed Sizes |
| |
| +===================+===================+ |
| | Uncompressed Size | Uncompressed Size | ... |
| +===================+===================+ |
| |
| This field lists the Uncompressed Sizes of every Data Block |
| in the Stream. There are as many Uncompressed Size fields as |
| indicated by the Number of Records field. |
| |
| Uncompressed Sizes are stored using the encoding described |
| in Section 1.2. If the Uncompressed Sizes do not match the |
| real sizes of respective Blocks, the decoder shoud indicate |
| an error. |
| |
| |
| 5.6. Extra |
| |
| This field is present only if the appropriate bit is set in the |
| Metadata Flags field (see Section 5.1). Note that the bit does |
| not indicate that there is any data in the Extra field; it only |
| indicates that Extra may be non-empty. |
| |
| The Extra field contains only information that is not required |
| to properly uncompress the Stream or to do random-access |
| reading. Supporting the Extra field is optional. In case the |
| decoder doesn't support the Extra field, it should silently |
| ignore it. |
| |
| Extra consists of zero or more Records: |
| |
| +========+========+ |
| | Record | Record | ... |
| +========+========+ |
| |
| Excluding Records with Record ID 0x00, each Record contains |
| three fields: |
| |
| +==========+==============+======+ |
| | Reord ID | Size of Data | Data | |
| +==========+==============+======+ |
| |
| The Record ID and Size of Data are stored using the encoding |
| described in Section 1.2. Data can be binary or UTF-8 |
| [RFC-3629] strings. Non-UTF-8 strings should be avoided. |
| Because the Size of Data is known, there is no need to |
| terminate strings with a nul byte, although doing so should |
| not be considered an error. |
| |
| The Record IDs are divided in two categories: |
| - Safe-to-Copy Records may be preserved as is when the |
| Stream is modified in ways that don't change the actual |
| uncompressed data. Examples of such operatings include |
| recompressing and adding, modifying, or deleting unrelated |
| Extra Records. |
| - Unsafe-to-Copy Records should be removed (and possibly |
| recreated) when any kind of changes are made to the Stream. |
| |
| When the actual uncompressed data is modified, all Records |
| should be removed (and possibly recreated), unless the |
| application knows that the Data stored to the Record(s) is |
| still valid. |
| |
| The following subsections describe the standard Record IDs and |
| the format of their Data fields. Safe-to-Copy Records have an |
| odd ID, while Unsafe-to-Copy Records have an even ID. |
| |
| |
| 5.6.1. 0x00: Dummy/Padding |
| |
| This Record is special, because it doesn't have the Size of |
| Data or Data fields. |
| |
| Dummy Records can be used, for example, to fill Metadata Block |
| when a few bytes of extra space has been reserved for it. There |
| can be any number of Dummy Records. |
| |
| |
| 5.6.2. 0x01: OpenPGP Signature |
| |
| OpenPGP signature is computed from uncompressed data. The |
| signature can be used to verify that the contents of a Stream |
| has been created by a trustworthy source. |
| |
| If the decoder supports decoding concatenated Streams, it |
| must indicate an error when verifying OpenPGP signatures if |
| there is more than one Stream. |
| |
| OpenPGP format is documented in [RFC-2440]. |
| |
| |
| 5.6.3. 0x02: Filter Information |
| |
| The Filter Information Record contains information about the |
| filters used in the Stream. This field can be used to quickly |
| - display which filters are used in each Block; |
| - check if all the required filters are supported by the |
| current decoder version; and |
| - check how much memory is required to decode each Block. |
| |
| The format of the Filter Information field is as follows: |
| |
| +=================+=================+ |
| | Block 0 Filters | Block 1 Filters | ... |
| +=================+=================+ |
| |
| There can be at maximum of as many Block Filters fields as |
| there are Data Blocks in the Stream. The format of the Block |
| Filters field is as follows: |
| |
| +------------------+======================+============+ |
| | Block Properties | List of Filter Flags | Subfilters | |
| +------------------+======================+============+ |
| |
| Block Properties is a bitfield: |
| |
| Bit(s) Mask Description |
| 0-2 0x07 Number of filters (0-7) |
| 3 0x08 End of Payload Marker is used. |
| 4 0x10 The Subfilters field is present. |
| 5-7 0xE0 Reserved for future use; must be zero for now. |
| |
| The contents of the List of Filter Flags field must match the |
| List of Filter Flags field in the respective Block Header. |
| |
| The Subfilters field may be present only if the List of Filter |
| Flags contains a Filter Flags field for a Subblock filter. The |
| format of the Subfilters field is as follows: |
| |
| +======================+=========================+ |
| | Number of Subfilters | List of Subfilter Flags | |
| +======================+=========================+ |
| |
| The value stored in the Number of Subfilters field is stored |
| using the encoding described in Section 1.2. The List of |
| Subfilter Flags field contains as many Filter Flags fields |
| as indicated by the Number of Subfilters field. These Filter |
| Flags fields list some or all the Subfilters used via the |
| Subblock filter. The order of the listed Subfilters is not |
| significant. |
| |
| Decoders supporting this Record should indicate a warning or |
| error if this Record contains Filter Flags that are not |
| actually used by the respective Blocks. |
| |
| |
| 5.6.4. 0x03: Comment |
| |
| Free-form comment is stored in UTF-8 [RFC-3629] encoding. |
| |
| The beginning of a new line should be indicated using the |
| ASCII Line Feed character (0x0A). When the Line Feed character |
| is not the native way to indicate new line in the underlying |
| operating system, the encoder and decoder should convert the |
| newline characters to and from Line Feeds. |
| |
| |
| 5.6.5. 0x04: List of Checks |
| |
| +=======+=======+ |
| | Check | Check | ... |
| +=======+=======+ |
| |
| There are as many Check fields as there are Blocks in the |
| Stream. The size of Check fields depend on Stream Flags |
| (see Section 2.2.2). |
| |
| Decoders supporting this Record should indicate a warning or |
| error if the Checks don't match the respective Blocks. |
| |
| |
| 5.6.6. 0x05: Original Filename |
| |
| Original filename is stored in UTF-8 [RFC-3629] encoding. |
| |
| The filename must not include any path, only the filename |
| itself. Special care must be taken to prevent directory |
| traversal vulnerabilities. |
| |
| When files are moved between different operating systems, it |
| is possible that filename valid in the source system is not |
| valid in the target system. It is implementation defined how |
| the decoder handles this kind of situations. |
| |
| |
| 5.6.7. 0x07: Modification Time |
| |
| Modification time is stored as POSIX time, as an unsigned |
| little endian integer. The number of bits depends on the |
| Size of Data field. Note that the usage of unsigned integer |
| limits the earliest representable time to 1970-01-01T00:00:00. |
| |
| |
| 5.6.8. 0x09: High-Resolution Modification Time |
| |
| This Record extends the `0x04: Modification time' Record with |
| a subsecond time information. There are two supported formats |
| of this field, which can be distinguished by looking at the |
| Size of Data field. |
| |
| Size Data |
| 3 [0; 9,999,999] times 100 nanoseconds |
| 4 [0; 999,999,999] nanoseconds |
| |
| The value is stored as an unsigned 24-bit or 32-bit little |
| endian integer. |
| |
| |
| 5.6.9. 0x0B: MIME Type |
| |
| MIME type of the uncompressed Stream. This can be used to |
| detect the content type. [IANA-MIME] |
| |
| |
| 5.6.10. 0x0D: Homepage URL |
| |
| This field can be used, for example, when distributing software |
| packages (sources or binaries). The field would indicate the |
| homepage of the program. |
| |
| For details on how to encode URLs, see [RFC-1738]. |
| |
| |
| 6. Custom Filter and Extra Record IDs |
| |
| If a developer wants to use custom Filter or Extra Record IDs, |
| he has two choices. The first choice is to contact Lasse Collin |
| and ask him to allocate a range of IDs for the developer. |
| |
| The second choice is to generate a 40-bit random integer, |
| which the developer can use as his personal Developer ID. |
| To minimalize the risk of collisions, Developer ID has to be |
| a randomly generated integer, not manually selected "hex word". |
| The following command, which works on many free operating |
| systems, can be used to generate Developer ID: |
| |
| dd if=/dev/urandom bs=5 count=1 | hexdump |
| |
| The developer can then use his Developer ID to create unique |
| (well, hopefully unique) Filter and Extra Record IDs. |
| |
| Bits Mask Description |
| 0-15 0x0000_0000_0000_FFFF Filter or Extra Record ID |
| 16-55 0x00FF_FFFF_FFFF_0000 Developer ID |
| 56-62 0x7F00_0000_0000_0000 Static prefix: 0x7F |
| |
| The resulting 63-bit integer will use 9 bytes of space when |
| stored using the encoding described in Section 1.2. To get |
| a shorter ID, see the beginning of this Section how to |
| request a custom ID range. |
| |
| Note that Filter and Metadata Record IDs are in their own |
| namespaces. That is, you can use the same ID value as Filter ID |
| and Metadata Record ID, and the meanings of the IDs do not need |
| to be related to each other. |
| |
| |
| 6.1. Reserved Custom Filter ID Ranges |
| |
| Range Description |
| 0x0000_0000 - 0x0000_00DF IDs fitting into the Misc field |
| 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility |
| 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility |
| |
| |
| 7. Cyclic Redundancy Checks |
| |
| There are several incompatible variations to calculate CRC32 |
| and CRC64. For simplicity and clarity, complete examples are |
| provided to calculate the checks as they are used in this file |
| format. Implementations may use different code as long as it |
| gives identical results. |
| |
| The program below reads data from standard input, calculates |
| the CRC32 and CRC64 values, and prints the calculated values |
| as big endian hexadecimal strings to standard output. |
| |
| #include <sys/types.h> |
| #include <inttypes.h> |
| #include <stdio.h> |
| |
| uint32_t crc32_table[256]; |
| uint64_t crc64_table[256]; |
| |
| void |
| init(void) |
| { |
| static const uint32_t poly32 = UINT32_C(0xEDB88320); |
| static const uint64_t poly64 |
| = UINT64_C(0xC96C5795D7870F42); |
| |
| for (size_t i = 0; i < 256; ++i) { |
| uint32_t crc32 = i; |
| uint64_t crc64 = i; |
| |
| for (size_t j = 0; j < 8; ++j) { |
| if (crc32 & 1) |
| crc32 = (crc32 >> 1) ^ poly32; |
| else |
| crc32 >>= 1; |
| |
| if (crc64 & 1) |
| crc64 = (crc64 >> 1) ^ poly64; |
| else |
| crc64 >>= 1; |
| } |
| |
| crc32_table[i] = crc32; |
| crc64_table[i] = crc64; |
| } |
| } |
| |
| uint32_t |
| crc32(const uint8_t *buf, size_t size, uint32_t crc) |
| { |
| crc = ~crc; |
| for (size_t i = 0; i < size; ++i) |
| crc = crc32_table[buf[i] ^ (crc & 0xFF)] |
| ^ (crc >> 8); |
| return ~crc; |
| } |
| |
| uint64_t |
| crc64(const uint8_t *buf, size_t size, uint64_t crc) |
| { |
| crc = ~crc; |
| for (size_t i = 0; i < size; ++i) |
| crc = crc64_table[buf[i] ^ (crc & 0xFF)] |
| ^ (crc >> 8); |
| return ~crc; |
| } |
| |
| int |
| main() |
| { |
| init(); |
| |
| uint32_t value32 = 0; |
| uint64_t value64 = 0; |
| uint64_t total_size = 0; |
| uint8_t buf[8192]; |
| |
| while (1) { |
| const size_t buf_size = fread(buf, 1, 8192, stdin); |
| if (buf_size == 0) |
| break; |
| |
| total_size += buf_size; |
| value32 = crc32(buf, buf_size, value32); |
| value64 = crc64(buf, buf_size, value64); |
| } |
| |
| printf("Bytes: %" PRIu64 "\n", total_size); |
| printf("CRC-32: 0x%08" PRIX32 "\n", value32); |
| printf("CRC-64: 0x%016" PRIX64 "\n", value64); |
| |
| return 0; |
| } |
| |
| |
| 8. References |
| |
| 8.1. Normative References |
| |
| [RFC-1738] |
| Uniform Resource Locators (URL) |
| http://www.ietf.org/rfc/rfc1738.txt |
| |
| [RFC-2119] |
| Key words for use in RFCs to Indicate Requirement Levels |
| http://www.ietf.org/rfc/rfc2119.txt |
| |
| [RFC-2440] |
| OpenPGP Message Format |
| http://www.ietf.org/rfc/rfc2440.txt |
| |
| [RFC-3629] |
| UTF-8, a transformation format of ISO 10646 |
| http://www.ietf.org/rfc/rfc3629.txt |
| |
| [IANA-MIME] |
| MIME Media Types |
| http://www.iana.org/assignments/media-types/ |
| |
| |
| 8.2. Informative References |
| |
| LZMA SDK - The original LZMA implementation |
| http://7-zip.org/sdk.html |
| |
| LZMA Utils - LZMA adapted to POSIX-like systems |
| http://tukaani.org/lzma/ |
| |
| [RFC-1952] |
| GZIP file format specification version 4.3 |
| http://www.ietf.org/rfc/rfc1952.txt |
| - Notation of byte boxes in section `2.1. Overall conventions' |
| |
| [GNU-tar] |
| GNU tar 1.16.1 manual |
| http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html |
| - Node 9.4.2 `Blocking Factor', paragraph that begins |
| `gzip will complain about trailing garbage' |
| - Note that this URL points to the latest version of the |
| manual, and may some day not contain the note which is in |
| 1.16.1. For the exact version of the manual, download GNU |
| tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz |
| |