Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 1 | Multi-Pack-Index (MIDX) Design Notes |
| 2 | ==================================== |
| 3 | |
| 4 | The Git object directory contains a 'pack' directory containing |
| 5 | packfiles (with suffix ".pack") and pack-indexes (with suffix |
| 6 | ".idx"). The pack-indexes provide a way to lookup objects and |
| 7 | navigate to their offset within the pack, but these must come |
| 8 | in pairs with the packfiles. This pairing depends on the file |
| 9 | names, as the pack-index differs only in suffix with its pack- |
| 10 | file. While the pack-indexes provide fast lookup per packfile, |
| 11 | this performance degrades as the number of packfiles increases, |
| 12 | because abbreviations need to inspect every packfile and we are |
| 13 | more likely to have a miss on our most-recently-used packfile. |
| 14 | For some large repositories, repacking into a single packfile |
| 15 | is not feasible due to storage space or excessive repack times. |
| 16 | |
| 17 | The multi-pack-index (MIDX for short) stores a list of objects |
| 18 | and their offsets into multiple packfiles. It contains: |
| 19 | |
Teng Long | ad506e6 | 2021-11-18 15:11:14 +0800 | [diff] [blame] | 20 | * A list of packfile names. |
| 21 | * A sorted list of object IDs. |
| 22 | * A list of metadata for the ith object ID including: |
| 23 | ** A value j referring to the jth packfile. |
| 24 | ** An offset within the jth packfile for the object. |
| 25 | * If large offsets are required, we use another list of large |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 26 | offsets similar to version 2 pack-indexes. |
Taylor Blau | 95e8383 | 2022-01-25 17:41:03 -0500 | [diff] [blame] | 27 | - An optional list of objects in pseudo-pack order (used with MIDX bitmaps). |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 28 | |
| 29 | Thus, we can provide O(log N) lookup time for any number |
| 30 | of packfiles. |
| 31 | |
| 32 | Design Details |
| 33 | -------------- |
| 34 | |
| 35 | - The MIDX is stored in a file named 'multi-pack-index' in the |
| 36 | .git/objects/pack directory. This could be stored in the pack |
| 37 | directory of an alternate. It refers only to packfiles in that |
| 38 | same directory. |
| 39 | |
Eric Wong | 0d0d8d8 | 2021-09-24 11:11:36 +0000 | [diff] [blame] | 40 | - The core.multiPackIndex config setting must be on (which is the |
| 41 | default) to consume MIDX files. Setting it to `false` prevents |
| 42 | Git from reading a MIDX file, even if one exists. |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 43 | |
| 44 | - The file format includes parameters for the object ID hash |
| 45 | function, so a future change of hash algorithm does not require |
| 46 | a change in format. |
| 47 | |
| 48 | - The MIDX keeps only one record per object ID. If an object appears |
Taylor Blau | 9218c6a | 2021-03-30 11:04:11 -0400 | [diff] [blame] | 49 | in multiple packfiles, then the MIDX selects the copy in the |
| 50 | preferred packfile, otherwise selecting from the most-recently |
| 51 | modified packfile. |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 52 | |
| 53 | - If there exist packfiles in the pack directory not registered in |
| 54 | the MIDX, then those packfiles are loaded into the `packed_git` |
| 55 | list and `packed_git_mru` cache. |
| 56 | |
| 57 | - The pack-indexes (.idx files) remain in the pack directory so we |
| 58 | can delete the MIDX file, set core.midx to false, or downgrade |
| 59 | without any loss of information. |
| 60 | |
| 61 | - The MIDX file format uses a chunk-based approach (similar to the |
| 62 | commit-graph file) that allows optional data to be added. |
| 63 | |
| 64 | Future Work |
| 65 | ----------- |
| 66 | |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 67 | - The multi-pack-index allows many packfiles, especially in a context |
| 68 | where repacking is expensive (such as a very large repo), or |
| 69 | unexpected maintenance time is unacceptable (such as a high-demand |
| 70 | build machine). However, the multi-pack-index needs to be rewritten |
| 71 | in full every time. We can extend the format to be incremental, so |
| 72 | writes are fast. By storing a small "tip" multi-pack-index that |
| 73 | points to large "base" MIDX files, we can keep writes fast while |
| 74 | still reducing the number of binary searches required for object |
| 75 | lookups. |
| 76 | |
Taylor Blau | 917a54c | 2021-08-24 12:15:59 -0400 | [diff] [blame] | 77 | - If the multi-pack-index is extended to store a "stable object order" |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 78 | (a function Order(hash) = integer that is constant for a given hash, |
Taylor Blau | 917a54c | 2021-08-24 12:15:59 -0400 | [diff] [blame] | 79 | even as the multi-pack-index is updated) then MIDX bitmaps could be |
| 80 | updated independently of the MIDX. |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 81 | |
| 82 | - Packfiles can be marked as "special" using empty files that share |
| 83 | the initial name but replace ".pack" with ".keep" or ".promisor". |
| 84 | We can add an optional chunk of data to the multi-pack-index that |
| 85 | records flags of information about the packfiles. This allows new |
| 86 | states, such as 'repacked' or 'redeltified', that can help with |
| 87 | pack maintenance in a multi-pack environment. It may also be |
| 88 | helpful to organize packfiles by object type (commit, tree, blob, |
| 89 | etc.) and use this metadata to help that maintenance. |
| 90 | |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 91 | Related Links |
| 92 | ------------- |
| 93 | [0] https://bugs.chromium.org/p/git/issues/detail?id=6 |
| 94 | Chromium work item for: Multi-Pack Index (MIDX) |
| 95 | |
Jeff King | 3eae30e | 2019-11-27 07:54:04 -0500 | [diff] [blame] | 96 | [1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/ |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 97 | An earlier RFC for the multi-pack-index feature |
| 98 | |
Jeff King | 3eae30e | 2019-11-27 07:54:04 -0500 | [diff] [blame] | 99 | [2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ |
Derrick Stolee | ceab693 | 2018-07-12 15:39:18 -0400 | [diff] [blame] | 100 | Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) |