| GIT bitmap v1 format |
| ==================== |
| |
| == Pack and multi-pack bitmaps |
| |
| Bitmaps store reachability information about the set of objects in a packfile, |
| or a multi-pack index (MIDX). The former is defined obviously, and the latter is |
| defined as the union of objects in packs contained in the MIDX. |
| |
| A bitmap may belong to either one pack, or the repository's multi-pack index (if |
| it exists). A repository may have at most one bitmap. |
| |
| An object is uniquely described by its bit position within a bitmap: |
| |
| - If the bitmap belongs to a packfile, the __n__th bit corresponds to |
| the __n__th object in pack order. For a function `offset` which maps |
| objects to their byte offset within a pack, pack order is defined as |
| follows: |
| |
| o1 <= o2 <==> offset(o1) <= offset(o2) |
| |
| - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the |
| __n__th object in MIDX order. With an additional function `pack` which |
| maps objects to the pack they were selected from by the MIDX, MIDX order |
| is defined as follows: |
| |
| o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) |
| |
| The ordering between packs is done according to the MIDX's .rev file. |
| Notably, the preferred pack sorts ahead of all other packs. |
| |
| The on-disk representation (described below) of a bitmap is the same regardless |
| of whether or not that bitmap belongs to a packfile or a MIDX. The only |
| difference is the interpretation of the bits, which is described above. |
| |
| Certain bitmap extensions are supported (see: Appendix B). No extensions are |
| required for bitmaps corresponding to packfiles. For bitmaps that correspond to |
| MIDXs, both the bit-cache and rev-cache extensions are required. |
| |
| == On-disk format |
| |
| - A header appears at the beginning: |
| |
| 4-byte signature: {'B', 'I', 'T', 'M'} |
| |
| 2-byte version number (network byte order) |
| The current implementation only supports version 1 |
| of the bitmap index (the same one as JGit). |
| |
| 2-byte flags (network byte order) |
| |
| The following flags are supported: |
| |
| - BITMAP_OPT_FULL_DAG (0x1) REQUIRED |
| This flag must always be present. It implies that the |
| bitmap index has been generated for a packfile or |
| multi-pack index (MIDX) with full closure (i.e. where |
| every single object in the packfile/MIDX can find its |
| parent links inside the same packfile/MIDX). This is a |
| requirement for the bitmap index format, also present in |
| JGit, that greatly reduces the complexity of the |
| implementation. |
| |
| - BITMAP_OPT_HASH_CACHE (0x4) |
| If present, the end of the bitmap file contains |
| `N` 32-bit name-hash values, one per object in the |
| pack/MIDX. The format and meaning of the name-hash is |
| described below. |
| |
| 4-byte entry count (network byte order) |
| |
| The total count of entries (bitmapped commits) in this bitmap index. |
| |
| 20-byte checksum |
| |
| The SHA1 checksum of the pack/MIDX this bitmap index |
| belongs to. |
| |
| - 4 EWAH bitmaps that act as type indexes |
| |
| Type indexes are serialized after the hash cache in the shape |
| of four EWAH bitmaps stored consecutively (see Appendix A for |
| the serialization format of an EWAH bitmap). |
| |
| There is a bitmap for each Git object type, stored in the following |
| order: |
| |
| - Commits |
| - Trees |
| - Blobs |
| - Tags |
| |
| In each bitmap, the `n`th bit is set to true if the `n`th object |
| in the packfile or multi-pack index is of that type. |
| |
| The obvious consequence is that the OR of all 4 bitmaps will result |
| in a full set (all bits set), and the AND of all 4 bitmaps will |
| result in an empty bitmap (no bits set). |
| |
| - N entries with compressed bitmaps, one for each indexed commit |
| |
| Where `N` is the total amount of entries in this bitmap index. |
| Each entry contains the following: |
| |
| - 4-byte object position (network byte order) |
| The position **in the index for the packfile or |
| multi-pack index** where the bitmap for this commit is |
| found. |
| |
| - 1-byte XOR-offset |
| The xor offset used to compress this bitmap. For an entry |
| in position `x`, a XOR offset of `y` means that the actual |
| bitmap representing this commit is composed by XORing the |
| bitmap for this entry with the bitmap in entry `x-y` (i.e. |
| the bitmap `y` entries before this one). |
| |
| Note that this compression can be recursive. In order to |
| XOR this entry with a previous one, the previous entry needs |
| to be decompressed first, and so on. |
| |
| The hard-limit for this offset is 160 (an entry can only be |
| xor'ed against one of the 160 entries preceding it). This |
| number is always positive, and hence entries are always xor'ed |
| with **previous** bitmaps, not bitmaps that will come afterwards |
| in the index. |
| |
| - 1-byte flags for this bitmap |
| At the moment the only available flag is `0x1`, which hints |
| that this bitmap can be re-used when rebuilding bitmap indexes |
| for the repository. |
| |
| - The compressed bitmap itself, see Appendix A. |
| |
| == Appendix A: Serialization format for an EWAH bitmap |
| |
| Ewah bitmaps are serialized in the same protocol as the JAVAEWAH |
| library, making them backwards compatible with the JGit |
| implementation: |
| |
| - 4-byte number of bits of the resulting UNCOMPRESSED bitmap |
| |
| - 4-byte number of words of the COMPRESSED bitmap, when stored |
| |
| - N x 8-byte words, as specified by the previous field |
| |
| This is the actual content of the compressed bitmap. |
| |
| - 4-byte position of the current RLW for the compressed |
| bitmap |
| |
| All words are stored in network byte order for their corresponding |
| sizes. |
| |
| The compressed bitmap is stored in a form of run-length encoding, as |
| follows. It consists of a concatenation of an arbitrary number of |
| chunks. Each chunk consists of one or more 64-bit words |
| |
| H L_1 L_2 L_3 .... L_M |
| |
| H is called RLW (run length word). It consists of (from lower to higher |
| order bits): |
| |
| - 1 bit: the repeated bit B |
| |
| - 32 bits: repetition count K (unsigned) |
| |
| - 31 bits: literal word count M (unsigned) |
| |
| The bitstream represented by the above chunk is then: |
| |
| - K repetitions of B |
| |
| - The bits stored in `L_1` through `L_M`. Within a word, bits at |
| lower order come earlier in the stream than those at higher |
| order. |
| |
| The next word after `L_M` (if any) must again be a RLW, for the next |
| chunk. For efficient appending to the bitstream, the EWAH stores a |
| pointer to the last RLW in the stream. |
| |
| |
| == Appendix B: Optional Bitmap Sections |
| |
| These sections may or may not be present in the `.bitmap` file; their |
| presence is indicated by the header flags section described above. |
| |
| Name-hash cache |
| --------------- |
| |
| If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains |
| a cache of 32-bit values, one per object in the pack/MIDX. The value at |
| position `i` is the hash of the pathname at which the `i`th object |
| (counting in index or multi-pack index order) in the pack/MIDX can be found. |
| This can be fed into the delta heuristics to compare objects with similar |
| pathnames. |
| |
| The hash algorithm used is: |
| |
| hash = 0; |
| while ((c = *name++)) |
| if (!isspace(c)) |
| hash = (hash >> 2) + (c << 24); |
| |
| Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. |
| If implementations want to choose a different hashing scheme, they are |
| free to do so, but MUST allocate a new header flag (because comparing |
| hashes made under two different schemes would be pointless). |