| |
| Using liblzma securely |
| ---------------------- |
| |
| 0. Introduction |
| |
| This document discusses how to use liblzma securely. There are issues |
| that don't apply to zlib or libbzip2, so reading this document is |
| strongly recommended even for those who are very familiar with zlib |
| or libbzip2. |
| |
| While making liblzma itself as secure as possible is essential, it's |
| out of scope of this document. |
| |
| |
| 1. Memory usage |
| |
| The memory usage of liblzma varies a lot. |
| |
| |
| 1.1. Problem sources |
| |
| 1.1.1. Block coder |
| |
| The memory requirements of Block encoder depend on the used filters |
| and their settings. The memory requirements of the Block decoder |
| depend on the which filters and with which filter settings the Block |
| was encoded. Usually the memory requirements of a decoder are equal |
| or less than the requirements of the encoder with the same settings. |
| |
| While the typical memory requirements to decode a Block is from a few |
| hundred kilobytes to tens of megabytes, a maliciously constructed |
| files can require a lot more RAM to decode. With the current filters, |
| the maximum amount is about 7 GiB. If you use multi-threaded decoding, |
| every Block can require this amount of RAM, thus a four-threaded |
| decoder could suddenly try to allocate 28 GiB of RAM. |
| |
| If you don't limit the maximum memory usage in any way, and there are |
| no resource limits set on the operating system side, one malicious |
| input file can run the system out of memory, or at least make it swap |
| badly for a long time. This is exceptionally bad on servers e.g. |
| email server doing virus scanning on incoming messages. |
| |
| |
| 1.1.2. Metadata decoder |
| |
| Multi-Block .lzma files contain at least one Metadata Block. |
| Externally the Metadata Blocks are similar to Data Blocks, so all |
| the issues mentioned about memory usage of Data Blocks applies to |
| Metadata Blocks too. |
| |
| The uncompressed content of Metadata Blocks contain information about |
| the Stream as a whole, and optionally some Extra Records. The |
| information about the Stream is kept in liblzma's internal data |
| structures in RAM. Extra Records can contain arbitrary data. They are |
| not interpreted by liblzma, but liblzma will provide them to the |
| application in uninterpreted form if the application wishes so. |
| |
| Usually the Uncompressed Size of a Metadata Block is small. Even on |
| extreme cases, it shouldn't be much bigger than a few megabytes. Once |
| the Metadata has been parsed into native data structures in liblzma, |
| it usually takes a little more memory than in the encoded form. For |
| all normal files, this is no problem, since the resulting memory usage |
| won't be too much. |
| |
| The problem is that a maliciously constructed Metadata Block can |
| contain huge amount of "information", which liblzma will try to store |
| in its internal data structures. This may cause liblzma to allocate |
| all the available RAM unless some kind of resource usage limits are |
| applied. |
| |
| Note that the Extra Records in Metadata are always parsed but, but |
| memory is allocated for them only if the application has requested |
| liblzma to provide the Extra Records to the application. |
| |
| |
| 1.2. Solutions |
| |
| If you need to decode files from untrusted sources (most people do), |
| you must limit the memory usage to avoid denial of service (DoS) |
| conditions caused by malicious input files. |
| |
| The first step is to find out how much memory you are allowed consume |
| at maximum. This may be a hardcoded constant or derived from the |
| available RAM; whatever is appropriate in the application. |
| |
| The simplest solution is to use setrlimit() if the kernel supports |
| RLIMIT_AS, which limits the memory usage of the whole process. |
| For more portable and fine-grained limitting, you can use |
| memory limitter functions found from <lzma/memlimit.h>. |
| |
| |
| 1.2.1. Encoder |
| |
| lzma_memory_usage() will give you a rough estimate about the memory |
| usage of the given filter chain. To dramatically simplify the internal |
| implementation, this function doesn't take into account all the small |
| helper data structures needed in various places; only the structures |
| with significant memory usage are taken into account. Still, the |
| accuracy of this function should be well within a mebibyte. |
| |
| The Subblock filter is a special case. If a Subfilter has been |
| specified, it isn't taken into account when lzma_memory_usage() |
| calculates the memory usage. You need to calculate the memory usage |
| of the Subfilter separately. |
| |
| Keeping track of Blocks in a Multi-Block Stream takes a few dozen |
| bytes of RAM per Block (size of the lzma_index structure plus overhead |
| of malloc()). It isn't a good idea to put tens of thousands of Blocks |
| into a Stream unless you have a very good reason to do so (compressed |
| dictionary could be an example of such situation). |
| |
| Also keep the number and sizes of Extra Records sane. If you produce |
| the list of Extra Records automatically from some untrusted source, |
| you should not only validate the content of these Records, but also |
| their memory usage. |
| |
| |
| 1.2.2. Decoder |
| |
| A single-threaded decoder should simply use a memory limitter and |
| indicate an error if it runs out of memory. |
| |
| Memory-limitting with multi-threaded decoding is tricky. The simple |
| solution is to divide the maximum allowed memory usage with the |
| maximum allowed threads, and give each Block decoder their own |
| independent lzma_memory_limitter. The drawback is that if one Block |
| needs notably more RAM than any other Block, the decoder will run out |
| of memory when in reality there would be plenty of free RAM. |
| |
| An attractive alternative would be using shared lzma_memory_limitter. |
| Depending on the application and the expected type of input, this may |
| either be the best solution or a source of hard-to-repeat problems. |
| Consider the following requirements: |
| - You use at maximum of n threads. |
| - x(i) is the decoder memory requirements of the Block number i |
| in an expected input Stream. |
| - The memory limitter is set to higher value than the sum of n |
| highest values x(i). |
| |
| (If you are better at explaining the above conditions, please |
| contribute your improved version.) |
| |
| If the above conditions aren't met, it is possible that the decoding |
| will fail unpredictably. That is, on the same machine using the same |
| settings, the decoding may sometimes succeed and sometimes fail. This |
| is because sometimes threads may run so that the Blocks with highest |
| memory usage are tried to be decoded at the same time. |
| |
| Most .lzma files have all the Blocks encoded with identical settings, |
| or at least the memory usage won't vary dramatically. That's why most |
| multi-threaded decoders probably want to use the simple "separate |
| lzma_memory_limitter for each thread" solution, possibly fallbacking |
| to single-threaded mode in case the per-thread memory limits aren't |
| enough in multi-threaded mode. |
| |
| FIXME: Memory usage of Stream info. |
| |
| [ |
| |
| ] |
| |
| |
| 2. Huge uncompressed output |
| |
| 2.1. Data Blocks |
| |
| Decoding a tiny .lzma file can produce huge amount of uncompressed |
| output. There is an example file of 45 bytes, which decodes to 64 PiB |
| (that's 2^56 bytes). Uncompressing such a file to disk is likely to |
| fill even a bigger disk array. If the data is written to a pipe, it |
| may not fill the disk, but would still take very long time to finish. |
| |
| To avoid denial of service conditions caused by huge amount of |
| uncompressed output, applications using liblzma should use some method |
| to limit the amount of output produced. The exact method depends on |
| the application. |
| |
| All valid .lzma Streams make it possible to find out the uncompressed |
| size of the Stream without actually uncompressing the data. This |
| information is available in at least one of the Metadata Blocks. |
| Once the uncompressed size is parsed, the decoder can verify that |
| it doesn't exceed certain limits (e.g. available disk space). |
| |
| When the uncompressed size is known, the decoder can actively keep |
| track of the amount of output produced so far, and that it doesn't |
| exceed the known uncompressed size. If it does exceed, the file is |
| known to be corrupt and an error should be indicated without |
| continuing to decode the rest of the file. |
| |
| Unfortunately, finding the uncompressed size beforehand is often |
| possible only in non-streamed mode, because the needed information |
| could be in the Footer Metdata Block, which (obviously) is at the |
| end of the Stream. In purely streamed mode decoding, one may need to |
| use some rough arbitrary limits to prevent the problems described in |
| the beginning of this section. |
| |
| |
| 2.2. Metadata |
| |
| Metadata is stored in Metadata Blocks, which are very similar to |
| Data Blocks. Thus, the uncompressed size can be huge just like with |
| Data Blocks. The difference is, that the contents of Metadata Blocks |
| aren't given to the application as is, but parsed by liblzma. Still, |
| reading through a huge Metadata can take very long time, effectively |
| creating a denial of service like piping decoded a Data Block to |
| another process would do. |
| |
| At first it would seem that using a memory limitter would prevent |
| this issue as a side effect. But it does so only if the application |
| requests liblzma to allocate the Extra Records and provide them to |
| the application. If Extra Records aren't requested, they aren't |
| allocated either. Still, the Extra Records are being read through |
| to validate that the Metadata is in proper format. |
| |
| The solution is to limit the Uncompressed Size of a Metadata Block |
| to some relatively large value. This will make liblzma to give an |
| error when the given limit is reached. |
| |