 Hacking liblzma
 ---------------

 0. Preface

     This document gives some overall information about the internals of
     liblzma, which should make it easier to start reading and modifying
     the code.


 1. Programming language

     liblzma is written in C99. If you use GCC, this means that you need
     at least GCC 3.x.x. GCC 2 isn't and won't be supported.

     Some GCC-specific extensions are used *conditionally*. They aren't
     required to build a full-featured library. Don't make the code rely
     on any non-standard compiler extensions or even C99 features that
     aren't portable between almost-C99 compatible compilers (for example
     non-static inlines).
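
     As a rough illustration (this is not code from liblzma), a GCC-specific
     attribute can be hidden behind a macro that expands to nothing on other
     compilers, and small helpers can be declared "static inline" so that
     the differing semantics of non-static inlines never come into play.
     EXAMPLE_ATTR_CONST and example_rotl32 are made-up names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro: uses a GCC extension when available and
 * expands to nothing elsewhere, so the build never depends on it. */
#if defined(__GNUC__) && __GNUC__ >= 3
#	define EXAMPLE_ATTR_CONST __attribute__((__const__))
#else
#	define EXAMPLE_ATTR_CONST
#endif

/* "static inline" avoids relying on how a compiler treats
 * non-static inline functions. Rotates x left by n bits. */
EXAMPLE_ATTR_CONST
static inline uint32_t
example_rotl32(uint32_t x, unsigned int n)
{
	assert(n < 32);
	return n == 0 ? x : (x << n) | (x >> (32 - n));
}
```

     The code compiles to the same behavior whether or not the attribute is
     available; the attribute is only an optimization hint.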

     The public API headers are in C89. This is to avoid frustrating those
     who maintain programs that are strictly in C89 or C++.

     The code makes the following assumption about sizeof(size_t). If it
     doesn't hold on your platform, some porting is probably needed:

         sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t)
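
     When porting, one way to catch a violation of this assumption early is
     a compile-time check. The sketch below uses the negative-array-size
     trick, which works in plain C99. STATIC_CHECK and the other names are
     hypothetical and are not claimed to exist in liblzma:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the array size becomes -1, which is a
 * compile error, whenever cond is false. */
#define STATIC_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

STATIC_CHECK(check_size_t_min, sizeof(uint32_t) <= sizeof(size_t));
STATIC_CHECK(check_size_t_max, sizeof(size_t) <= sizeof(uint64_t));

/* Runtime twin of the checks above, usable from a test program. */
static void
check_size_t_assumption(void)
{
	assert(sizeof(uint32_t) <= sizeof(size_t));
	assert(sizeof(size_t) <= sizeof(uint64_t));
}
```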


 2. Internal vs. external API


         Input                         Output
           v     Application             ^
           |     liblzma public API      |
           |     Stream coder            |
           |     Block coder             |
           |     Filter coder            |
           |     ...                     |
           v     Filter coder            ^


         Application
           `-- liblzma public API
                 `-- Stream coder
                       |-- Stream info handler
                       |-- Stream Header coder
                       |-- Block Header coder
                       |     `-- Filter Flags coder
                       |-- Metadata coder
                       |     `-- Block coder
                       |           `-- Filter 0
                       |                 `-- Filter 1
                       |                     ...
                       |-- Data Block coder
                       |     `-- Filter 0
                       |           `-- Filter 1
                       |               ...
                       `-- Stream tail coder


 x. Designing new filters

     All filters must be designed so that the decoder cannot consume an
     arbitrary amount of input without producing any decoded output.
     Failing to follow this rule makes liblzma vulnerable to DoS attacks
     when untrusted files are decoded (and usually they are untrusted).

     An example should clarify the reason behind this requirement: There
     are two filters in the chain. The decoder of the first filter produces
     a huge amount of output (many gigabytes or more) from a few bytes of
     input, and that output gets passed to the decoder of the second
     filter. If the data passed to the second filter is interpreted as
     something that produces no output (e.g. padding), the filter chain as
     a whole produces no output and consumes no input for a long period of
     time.

     The above problem was present in the first versions of the Subblock
     filter. A tiny .lzma file could have taken several years to decode
     without producing any output at all. The problem was fixed by limiting
     the number of consecutive Padding bytes and by requiring that some
     decoded output be produced between Set Subfilter and Unset Subfilter.
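
     The idea of bounding output-free input can be sketched with a toy
     decoder. The format here is invented for illustration (a 0x00 byte is
     padding, any other byte is copied to the output) and the limit of 31
     is an arbitrary assumption; neither is taken from the Subblock filter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this toy format only. */
#define EXAMPLE_PADDING_BYTE 0x00
#define EXAMPLE_PADDING_MAX 31

typedef enum { EXAMPLE_OK, EXAMPLE_DATA_ERROR } example_ret;

typedef struct {
	size_t padding_run; /* consecutive padding bytes seen so far */
} example_decoder;

static example_ret
example_decode(example_decoder *coder,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size)
{
	assert(*in_pos <= in_size && *out_pos <= out_size);

	while (*in_pos < in_size) {
		const uint8_t byte = in[*in_pos];

		if (byte == EXAMPLE_PADDING_BYTE) {
			/* Padding yields no output, so bound how much
			 * of it may appear in a row. */
			if (++coder->padding_run > EXAMPLE_PADDING_MAX)
				return EXAMPLE_DATA_ERROR;

			++*in_pos;
			continue;
		}

		if (*out_pos == out_size)
			break; /* Output full; caller provides more space. */

		out[(*out_pos)++] = byte;
		++*in_pos;
		coder->padding_run = 0; /* Output was produced. */
	}

	return EXAMPLE_OK;
}
```

     With this limit in place, the decoder can consume at most
     EXAMPLE_PADDING_MAX bytes of input between two output bytes, no matter
     how much data the previous decoder in the chain feeds it.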


 x. Implementing new filters

     If the filter supports embedding End of Payload Marker, make sure that
     when your filter detects End of Payload Marker, it checks that
       - the usage of End of Payload Marker is actually allowed (i.e. End
         of Input isn't used); and
       - there is no more input coming from the next filter in the chain.

     The second requirement is slightly tricky. It's possible that the next
     filter hasn't returned LZMA_STREAM_END yet. It may even need a few
     more bytes of input before it will do so. You need to give it as much
     input as it needs, and verify that it doesn't produce any output.

     Don't call the next filter in the chain after it has returned
     LZMA_STREAM_END (except in encoder if action == LZMA_SYNC_FLUSH).
     Doing so results in undefined behavior.
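
     The decoder-side rules above (feed the next filter until it finishes,
     reject any output it produces, never call it again afterwards) could
     be sketched roughly as follows. Everything named ex_* is hypothetical,
     including the callback signature and the toy next filter; this is not
     liblzma's internal API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { EX_OK, EX_STREAM_END, EX_DATA_ERROR } ex_ret;

typedef struct {
	/* Hypothetical coder callback of the next filter. */
	ex_ret (*code)(void *state,
			const uint8_t *in, size_t *in_pos, size_t in_size,
			uint8_t *out, size_t *out_pos, size_t out_size);
	void *state;
	bool finished; /* true once code() has returned EX_STREAM_END */
} ex_next;

/* Call after this filter has detected its End of Payload Marker.
 * Returns EX_OK if the next filter still needs more input. */
static ex_ret
ex_finish_next(ex_next *next,
		const uint8_t *in, size_t *in_pos, size_t in_size)
{
	assert(*in_pos <= in_size);

	/* The loop condition guarantees that the next filter is never
	 * called again after it has returned EX_STREAM_END. */
	while (!next->finished) {
		uint8_t probe[1];
		size_t probe_pos = 0;
		const size_t in_start = *in_pos;

		const ex_ret ret = next->code(next->state,
				in, in_pos, in_size,
				probe, &probe_pos, sizeof(probe));

		if (probe_pos != 0)
			return EX_DATA_ERROR; /* data after the marker */

		if (ret == EX_STREAM_END)
			next->finished = true;
		else if (ret != EX_OK)
			return ret;
		else if (*in_pos == in_start)
			return EX_OK; /* no progress: needs more input */
	}

	return EX_STREAM_END;
}

/* Toy next filter for demonstration: it finishes after tail_left
 * 0x00 bytes; any other byte is copied to the output. */
typedef struct { size_t tail_left; } ex_toy_state;

static ex_ret
ex_toy_code(void *state_ptr,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size)
{
	ex_toy_state *s = state_ptr;

	while (s->tail_left > 0) {
		if (*in_pos == in_size)
			return EX_OK;

		if (in[*in_pos] != 0x00) {
			if (*out_pos == out_size)
				return EX_OK;
			out[(*out_pos)++] = in[(*in_pos)++];
		} else {
			++*in_pos;
			--s->tail_left;
		}
	}

	return EX_STREAM_END;
}
```

     The one-byte probe buffer is enough here because the only question is
     whether the next filter tries to produce any output at all.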

     Be pedantic. If the input data isn't exactly valid, reject it.

     At the moment, liblzma isn't modular. You will need to edit several
     files in src/liblzma/common to include support for a new filter. grep
     for LZMA_FILTER_LZMA to locate the files needing changes.
