gitformat-chunk(5)
==================

NAME
----
gitformat-chunk - Chunk-based file formats

SYNOPSIS
--------

Used by linkgit:gitformat-commit-graph[5] and the "MIDX" format (see
the pack format documentation in linkgit:gitformat-pack[5]).

DESCRIPTION
-----------

Some file formats in Git use a common concept of "chunks" to describe
sections of the file. This allows structured access to a large file by
scanning a small "table of contents" that locates the remaining data.
This common format is used by the `commit-graph` and `multi-pack-index`
files. See the `multi-pack-index` format in linkgit:gitformat-pack[5] and
the `commit-graph` format in linkgit:gitformat-commit-graph[5] for
how they use the chunks to describe structured data.

A chunk-based file format begins with some header information custom to
that format. That header should include enough information to identify
the file type, format version, and number of chunks in the file. From this
information, a reader can determine the start of the chunk-based region.

The chunk-based region starts with a table of contents describing where
each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
where C is the number of chunks. Consider the following table:

  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
  |--------------------|------------------------|
  | ID[0]              | OFFSET[0]              |
  | ...                | ...                    |
  | ID[C]              | OFFSET[C]              |
  | 0x0000             | OFFSET[C+1]            |

Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
Each integer is stored in network byte order.

The chunk identifier `ID[i]` is a label for the data stored within this
file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
and `OFFSET[i]`. This requires that the chunk data appears contiguously
in the same order as the table of contents.
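
As a rough illustration of the row layout and the size calculation, the
following standalone C sketch decodes a table of contents from a byte
buffer. It does not use `chunk-format.h`: `struct toc_entry`,
`load_be32()`, `load_be64()`, and `parse_toc()` are hypothetical names
chosen for this example, and rejecting a chunk that would end before it
starts is an assumption about what a careful reader might check.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID followed by an 8-byte offset */

/* Hypothetical helpers: decode integers stored in network byte order. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

struct toc_entry {
	uint32_t id;
	uint64_t offset; /* start of the chunk within the file */
	uint64_t size;   /* next row's offset minus this row's offset */
};

/*
 * Decode `nr` chunk rows from `toc`, which must hold (nr + 1) rows:
 * the chunk rows plus the terminating row that supplies the end offset.
 * Returns 0 on success, or -1 if a chunk would end before it starts.
 */
static int parse_toc(const unsigned char *toc, size_t nr,
		     struct toc_entry *out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_SIZE;
		uint64_t start = load_be64(row + 4);
		uint64_t end = load_be64(row + TOC_ROW_SIZE + 4);

		if (end < start)
			return -1;
		out[i].id = load_be32(row);
		out[i].offset = start;
		out[i].size = end - start;
	}
	return 0;
}
```

For a table whose three rows carry the offsets 44, 54, and 74, this
yields two chunks of 10 and 20 bytes.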

The final entry in the table of contents must have a chunk ID of four
zero bytes. This confirms that the table of contents is ending and
provides the offset for the end of the chunk-based data.

Note: The chunk-based format expects that the file contains _at least_ a
trailing hash after `OFFSET[C+1]`.
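
The two end-of-table conditions can be sketched as a single check. This
is a standalone illustration, not code from `chunk-format.h`:
`toc_end_is_valid()` and the `load_be*()` helpers are hypothetical names,
and the caller is assumed to pass the hash length in use (20 bytes for
SHA-1, 32 bytes for SHA-256).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: decode integers stored in network byte order. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * `toc` points at the table of contents and `nr` is the chunk count, so
 * the terminating row is the one following the `nr` chunk rows.
 * Returns 1 if the table ends with a zero ID and the file still has
 * room for a trailing hash of `hash_len` bytes, and 0 otherwise.
 */
static int toc_end_is_valid(const unsigned char *toc, size_t nr,
			    uint64_t file_size, uint64_t hash_len)
{
	const unsigned char *last = toc + nr * 12;
	uint64_t end = load_be64(last + 4);

	if (load_be32(last) != 0)
		return 0; /* terminating ID must be four zero bytes */
	if (end > file_size)
		return 0; /* chunk data would run past the end of the file */
	/* Written as a subtraction so the comparison cannot overflow. */
	return file_size - end >= hash_len;
}
```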

Functions for working with chunk-based file formats are declared in
`chunk-format.h`. Using these functions provides extra checks that assist
developers when creating new file formats.

Writing chunk-based file formats
--------------------------------

To write a chunk-based file format, create a `struct chunkfile` by
calling `init_chunkfile()` with a `struct hashfile` pointer. The
caller is responsible for opening the `hashfile` and writing header
information so the file format is identifiable before the chunk-based
format begins.

Then, call `add_chunk()` for each chunk that is intended for writing. This
populates the `chunkfile` with information about the order and size of
each chunk to write. Provide a `chunk_write_fn` function pointer to
perform the write of the chunk data upon request.

Call `write_chunkfile()` to write the table of contents to the `hashfile`
followed by each of the chunks. This will verify that each chunk wrote
the expected amount of data so the table of contents is correct.
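
The offset arithmetic behind such a table can be sketched in standalone
C. This is not the implementation of `write_chunkfile()`: `write_toc()`
and the `store_be*()` helpers are hypothetical names, and the sketch
assumes the first chunk begins immediately after the table of contents,
as the description above implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: encode integers in network byte order. */
static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

/*
 * Serialize a table of contents for `nr` chunks into `out`, which must
 * have room for (nr + 1) rows of 12 bytes. `toc_start` is the file
 * position at which the table itself begins (that is, the header size).
 * Returns the offset just past the last chunk, where a trailing hash
 * would begin.
 */
static uint64_t write_toc(unsigned char *out, uint64_t toc_start,
			  const uint32_t *ids, const uint64_t *sizes,
			  size_t nr)
{
	/* Assumption: chunk data starts right after the table. */
	uint64_t cur = toc_start + (uint64_t)(nr + 1) * 12;
	size_t i;

	for (i = 0; i < nr; i++) {
		store_be32(out, ids[i]);
		store_be64(out + 4, cur);
		out += 12;
		cur += sizes[i];
	}
	/* Terminating row: zero ID plus the end of the chunk data. */
	store_be32(out, 0);
	store_be64(out + 4, cur);
	return cur;
}
```

With an 8-byte header and two chunks of 10 and 20 bytes, the table
occupies bytes 8 through 43, the chunks start at offsets 44 and 54, and
the terminating row records 74. If a chunk writer emitted a different
number of bytes than it declared, every later offset would be wrong,
which is why verifying the written sizes matters.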

Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
caller is responsible for finalizing the `hashfile` by writing the trailing
hash and closing the file.

Reading chunk-based file formats
--------------------------------

To read a chunk-based file format, the file must be opened as a
memory-mapped region. The chunk-format API expects that the entire file
is mapped as a contiguous memory region.

Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.

After reading the header information from the beginning of the file,
including the chunk count, call `read_table_of_contents()` to populate
the `struct chunkfile` with the list of chunks, their offsets, and their
sizes.

Extract the data for each chunk using `pair_chunk()` or `read_chunk()`:

* `pair_chunk()` sets a given pointer to the location inside the
  memory-mapped file corresponding to that chunk's offset. If the chunk
  does not exist, then the pointer is not modified.

* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
  with the appropriate initial pointer and size information. The function
  is not called if the chunk does not exist. Use this method to read chunks
  if you need to perform immediate parsing or if you need to execute logic
  based on the size of the chunk.
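
The difference between the two access styles can be sketched with a
standalone lookup over an already decoded table. Nothing here is taken
from `chunk-format.h`: `struct toc_view`, `sketch_pair()`,
`sketch_read()`, `record_size()`, and the `SKETCH_NOT_FOUND` return value
are hypothetical and only mirror the behavior described in the list
above.

```c
#include <stddef.h>
#include <stdint.h>

struct toc_entry {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

/* Hypothetical stand-in for the state a chunk reader keeps. */
struct toc_view {
	const unsigned char *file_base; /* start of the memory-mapped file */
	const struct toc_entry *entries;
	size_t nr;
};

typedef int (*sketch_read_fn)(const unsigned char *start, size_t size,
			      void *data);

#define SKETCH_NOT_FOUND (-1) /* value chosen for this sketch only */

static const struct toc_entry *find_entry(const struct toc_view *t,
					  uint32_t id)
{
	size_t i;

	for (i = 0; i < t->nr; i++)
		if (t->entries[i].id == id)
			return &t->entries[i];
	return NULL;
}

/* "Pair" style: hand back a pointer; leave `*p` alone if absent. */
static int sketch_pair(const struct toc_view *t, uint32_t id,
		       const unsigned char **p)
{
	const struct toc_entry *e = find_entry(t, id);

	if (!e)
		return SKETCH_NOT_FOUND;
	*p = t->file_base + e->offset;
	return 0;
}

/* "Read" style: invoke a callback with pointer and size, if present. */
static int sketch_read(const struct toc_view *t, uint32_t id,
		       sketch_read_fn fn, void *data)
{
	const struct toc_entry *e = find_entry(t, id);

	if (!e)
		return SKETCH_NOT_FOUND;
	return fn(t->file_base + e->offset, (size_t)e->size, data);
}

/* Example callback: record the chunk size so the caller can check it. */
static int record_size(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

The pointer style suits chunks that are consumed lazily, while the
callback style gives the caller a place to reject a chunk whose size
does not match what the format expects.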

After calling these methods, call `free_chunkfile()` to clear the
`struct chunkfile` data. This will not close the memory-mapped region.
Callers are expected to keep that region mapped for as long as the
pointers into it are needed.

Examples
--------

These file formats use the chunk-format API, and can be used as examples
for future formats:

* commit-graph: see `write_commit_graph_file()` and `parse_commit_graph()`
  in `commit-graph.c` for how the chunk-format API is used to write and
  parse the commit-graph file format documented in
  linkgit:gitformat-commit-graph[5].

* multi-pack-index: see `write_midx_internal()` and `load_multi_pack_index()`
  in `midx.c` for how the chunk-format API is used to write and
  parse the multi-pack-index file format documented in
  the multi-pack-index file format section of linkgit:gitformat-pack[5].

GIT
---
Part of the linkgit:git[1] suite