| |
| Hacking liblzma |
| --------------- |
| |
| 0. Preface |
| |
| This document gives some overall information about the internals of |
| liblzma, which should make it easier to start reading and modifying |
| the code. |
| |
| |
| 1. Programming language |
| |
| liblzma was written in C99. If you use GCC, this means that you need |
| at least GCC 3.x.x. GCC 2 isn't and won't be supported. |
| |
| Some GCC-specific extensions are used *conditionally*. They aren't |
| required to build a full-featured library. Don't make the code rely |
| on any non-standard compiler extensions or even C99 features that |
| aren't portable between almost-C99 compatible compilers (for example |
| non-static inlines). |
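| |
| As a rough sketch of what conditional use of an extension can look |
| like (the toy_ names are hypothetical and are not taken from the |
| liblzma sources), a GCC builtin can be hidden behind a macro that |
| falls back to plain C99, and helpers can be kept "static inline": |

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical macro, not liblzma's own. The GCC-specific builtin is
 * used only when GCC 3 or later (or a compatible compiler) is detected.
 * Otherwise the macro expands to plain C99, so the code builds and
 * behaves the same either way.
 */
#if defined(__GNUC__) && __GNUC__ >= 3
#	define toy_likely(expr) __builtin_expect(!!(expr), 1)
#else
#	define toy_likely(expr) (expr)
#endif

/* "static inline" avoids relying on non-static inlines. */
static inline size_t
toy_count_nonzero(const unsigned char *buf, size_t size)
{
	assert(buf != NULL || size == 0);

	size_t count = 0;
	for (size_t i = 0; i < size; ++i)
		if (toy_likely(buf[i] != 0))
			++count;

	return count;
}
```

| The extension only gives the compiler a hint. Removing it changes |
| nothing about what the function computes. |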
| |
| The public API headers are in C89. This is to avoid frustrating those |
| who maintain programs that are strictly C89 or C++. |
| |
| An assumption about sizeof(size_t) is made. If this assumption is |
| wrong, some porting is probably needed: |
| |
| sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t) |
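| |
| One way to catch a violation of this assumption early is a |
| compile-time check. The sketch below is only an illustration: the |
| TOY_/toy_ names are hypothetical, and it uses the negative array size |
| trick because C99 has no built-in static assertion: |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: declares an array type whose size is -1 when
 * cond is false, which makes the compiler reject the translation unit.
 */
#define TOY_STATIC_ASSERT(cond, name) \
	typedef char toy_assert_ ## name[(cond) ? 1 : -1]

TOY_STATIC_ASSERT(sizeof(uint32_t) <= sizeof(size_t), size_t_min);
TOY_STATIC_ASSERT(sizeof(size_t) <= sizeof(uint64_t), size_t_max);

/* The same check at run time, for builds where assert() is enabled. */
static void
toy_check_size_t(void)
{
	assert(sizeof(uint32_t) <= sizeof(size_t));
	assert(sizeof(size_t) <= sizeof(uint64_t));
}
```

| If either condition is false, the build stops at the corresponding |
| TOY_STATIC_ASSERT line instead of failing in some subtle way later. |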
| |
| |
| 2. Internal vs. external API |
| |
| |
| |
| Input                             Output |
|   v           Application           ^ |
|   |       liblzma public API        | |
|   |          Stream coder           | |
|   |           Block coder           | |
|   |          Filter coder           | |
|   |               ...               | |
|   v          Filter coder           ^ |
| |
| |
| Application |
|   `-- liblzma public API |
|         `-- Stream coder |
|               |-- Stream info handler |
|               |-- Stream Header coder |
|               |-- Block Header coder |
|               |     `-- Filter Flags coder |
|               |-- Metadata coder |
|               |     `-- Block coder |
|               |           `-- Filter 0 |
|               |                 `-- Filter 1 |
|               |                       ... |
|               |-- Data Block coder |
|               |     `-- Filter 0 |
|               |           `-- Filter 1 |
|               |                 ... |
|               `-- Stream tail coder |
| |
| |
| |
| x. Designing new filters |
| |
| All filters must be designed so that the decoder cannot consume an |
| arbitrary amount of input without producing any decoded output. |
| Failing to follow this rule makes liblzma vulnerable to DoS attacks |
| when untrusted files are decoded (and files usually are untrusted). |
| |
| An example should clarify the reason behind this requirement: There |
| are two filters in the chain. The decoder of the first filter produces |
| a huge amount of output (many gigabytes or more) from a few bytes of |
| input, and that output gets passed to the decoder of the second |
| filter. If the data passed to the second filter is interpreted as |
| something that produces no output (e.g. padding), the filter chain as |
| a whole produces no output and consumes no input for a long period of |
| time. |
| |
| The above problem was present in the first versions of the Subblock |
| filter. A tiny .lzma file could have taken several years to decode |
| without producing any output at all. The problem was fixed by adding |
| a limit on the number of consecutive Padding bytes, and by requiring |
| that some decoded output be produced between Set Subfilter and |
| Unset Subfilter. |
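| |
| The idea of limiting consecutive Padding bytes can be sketched with a |
| toy decoder. Everything here is hypothetical (the toy_ names, the |
| padding byte value, and the limit of 4) and is not the Subblock |
| filter's real format. It shows only the counting logic: |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PADDING_BYTE 0x00	/* hypothetical padding value */
#define TOY_PADDING_MAX 4	/* hypothetical limit */

typedef enum { TOY_OK, TOY_DATA_ERROR } toy_ret;

typedef struct {
	/* Padding bytes seen since the last byte of real output */
	size_t padding_run;
} toy_decoder;

static toy_ret
toy_decode(toy_decoder *coder,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size)
{
	assert(*in_pos <= in_size);

	while (*in_pos < in_size) {
		const uint8_t byte = in[*in_pos];

		if (byte == TOY_PADDING_BYTE) {
			/* Input that produces no output is bounded. */
			if (++coder->padding_run > TOY_PADDING_MAX)
				return TOY_DATA_ERROR;

			++*in_pos;
			continue;
		}

		if (*out_pos == out_size)
			break;	/* Output buffer is full. */

		/* Real output resets the run of padding. */
		coder->padding_run = 0;
		out[(*out_pos)++] = byte;
		++*in_pos;
	}

	return TOY_OK;
}
```

| With such a limit, the amount of input that can be consumed between |
| two bytes of output is bounded, so a crafted file is rejected instead |
| of keeping the decoder busy. |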
| |
| |
| x. Implementing new filters |
| |
| If the filter supports embedding End of Payload Marker, make sure that |
| when your filter detects End of Payload Marker, it checks that |
| - the usage of End of Payload Marker is actually allowed (i.e. End |
| of Input isn't used); and |
| - there is no more input coming from the next filter in the chain. |
| |
| The second requirement is slightly tricky. It's possible that the next |
| filter hasn't returned LZMA_STREAM_END yet. It may even need a few |
| more bytes of input before it will do so. You need to give it as much |
| input as it needs, and verify that it doesn't produce any output. |
| |
| Don't call the next filter in the chain after it has returned |
| LZMA_STREAM_END (except in the encoder if action == LZMA_SYNC_FLUSH). |
| Doing so results in undefined behavior. |
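| |
| The two rules above (let the next filter finish without accepting |
| any output from it, and never call it again once it has finished) |
| can be sketched as follows. This is a simplified model under stated |
| assumptions, not liblzma's internal API: the toy_ types, the return |
| codes, and the stub filter are all hypothetical: |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OK, TOY_STREAM_END, TOY_DATA_ERROR } toy_ret;

typedef toy_ret (*toy_code_fn)(void *coder,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size);

typedef struct {
	void *coder;
	toy_code_fn code;
	int finished;	/* Set once code() has returned TOY_STREAM_END */
} toy_next_filter;

/*
 * Called after our filter has detected End of Payload Marker.
 * Returns TOY_OK if the next filter still needs more input.
 */
static toy_ret
toy_finish_next(toy_next_filter *next,
		const uint8_t *in, size_t *in_pos, size_t in_size)
{
	assert(next->code != NULL);

	/* Never call the next filter after it has finished. */
	if (next->finished)
		return TOY_STREAM_END;

	/* Offer one byte of output space to detect stray output. */
	uint8_t scratch[1];
	size_t scratch_pos = 0;
	const toy_ret ret = next->code(next->coder, in, in_pos, in_size,
			scratch, &scratch_pos, sizeof(scratch));

	/* Any data after our End of Payload Marker is an error. */
	if (scratch_pos != 0)
		return TOY_DATA_ERROR;

	if (ret == TOY_STREAM_END)
		next->finished = 1;

	return ret;
}

/* Stub next filter used only to exercise the sketch. */
typedef struct {
	size_t trailer_left;	/* Input bytes it still wants */
	int emit_garbage;	/* If set, it wrongly emits a byte */
} toy_stub;

static toy_ret
toy_stub_code(void *coder_ptr,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size)
{
	toy_stub *stub = coder_ptr;
	(void)in;

	if (stub->emit_garbage && *out_pos < out_size)
		out[(*out_pos)++] = 0xFF;

	while (stub->trailer_left > 0 && *in_pos < in_size) {
		++*in_pos;
		--stub->trailer_left;
	}

	return stub->trailer_left == 0 ? TOY_STREAM_END : TOY_OK;
}
```

| A caller would keep invoking toy_finish_next() with more input for as |
| long as it returns TOY_OK, and treat TOY_DATA_ERROR as a corrupt file. |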
| |
| Be pedantic. If the input data isn't exactly valid, reject it. |
| |
| At the moment, liblzma isn't modular. You will need to edit several |
| files in src/liblzma/common to include support for a new filter. grep |
| for LZMA_FILTER_LZMA to locate the files needing changes. |
| |