| Git Commit Graph Design Notes |
| ============================= |
| |
| Git walks the commit graph for many reasons, including: |
| |
| 1. Listing and filtering commit history. |
| 2. Computing merge bases. |
| |
| These operations can become slow as the commit count grows. The merge |
| base calculation shows up in many user-facing commands, such as 'merge-base' |
| or 'status' and can take minutes to compute depending on history shape. |
| |
| There are two main costs here: |
| |
| 1. Decompressing and parsing commits. |
| 2. Walking the entire graph to satisfy topological order constraints. |
| |
| The commit graph file is a supplemental data structure that accelerates |
| commit graph walks. If a user downgrades or disables the 'core.commitGraph' |
| config setting, then the existing ODB is sufficient. The file is stored |
| as "commit-graph" either in the .git/objects/info directory or in the info |
| directory of an alternate. |
| |
| The commit graph file stores the commit graph structure along with some |
| extra metadata to speed up graph walks. By listing commit OIDs in lexi- |
| cographic order, we can identify an integer position for each commit and |
| refer to the parents of a commit using those integer positions. We use |
| binary search to find initial commits and then use the integer positions |
| for fast lookups during the walk. |
| |
| A consumer may load the following info for a commit from the graph: |
| |
| 1. The commit OID. |
| 2. The list of parents, along with their integer position. |
| 3. The commit date. |
| 4. The root tree OID. |
| 5. The generation number (see definition below). |
| |
| Values 1-4 satisfy the requirements of parse_commit_gently(). |
| |
| Define the "generation number" of a commit recursively as follows: |
| |
| * A commit with no parents (a root commit) has generation number one. |
| |
| * A commit with at least one parent has generation number one more than |
| the largest generation number among its parents. |
| |
| Equivalently, the generation number of a commit A is one more than the |
| length of a longest path from A to a root commit. The recursive definition |
| is easier to use for computation and observing the following property: |
| |
| If A and B are commits with generation numbers N and M, respectively, |
| and N <= M, then A cannot reach B. That is, we know without searching |
| that B is not an ancestor of A because it is further from a root commit |
| than A. |
| |
| Conversely, when checking if A is an ancestor of B, then we only need |
| to walk commits until all commits on the walk boundary have generation |
| number at most N. If we walk commits using a priority queue seeded by |
| generation numbers, then we always expand the boundary commit with highest |
| generation number and can easily detect the stopping condition. |
| |
| This property can be used to significantly reduce the time it takes to |
| walk commits and determine topological relationships. Without generation |
| numbers, the general heuristic is the following: |
| |
| If A and B are commits with commit time X and Y, respectively, and |
| X < Y, then A _probably_ cannot reach B. |
| |
| This heuristic is currently used whenever the computation is allowed to |
| violate topological relationships due to clock skew (such as "git log" |
| with default order), but is not used when the topological order is |
| required (such as merge base calculations, "git log --graph"). |
| |
| In practice, we expect some commits to be created recently and not stored |
| in the commit graph. We can treat these commits as having "infinite" |
| generation number and walk until reaching commits with known generation |
| number. |
| |
| We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not |
| in the commit-graph file. If a commit-graph file was written by a version |
| of Git that did not compute generation numbers, then those commits will |
| have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. |
| |
| Since the commit-graph file is closed under reachability, we can guarantee |
| the following weaker condition on all commits: |
| |
| If A and B are commits with generation numbers N amd M, respectively, |
| and N < M, then A cannot reach B. |
| |
| Note how the strict inequality differs from the inequality when we have |
| fully-computed generation numbers. Using strict inequality may result in |
| walking a few extra commits, but the simplicity in dealing with commits |
| with generation number *_INFINITY or *_ZERO is valuable. |
| |
| We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose |
| generation numbers are computed to be at least this value. We limit at |
| this value since it is the largest value that can be stored in the |
| commit-graph file using the 30 bits available to generation numbers. This |
| presents another case where a commit can have generation number equal to |
| that of a parent. |
| |
| Design Details |
| -------------- |
| |
| - The commit graph file is stored in a file named 'commit-graph' in the |
| .git/objects/info directory. This could be stored in the info directory |
| of an alternate. |
| |
| - The core.commitGraph config setting must be on to consume graph files. |
| |
| - The file format includes parameters for the object ID hash function, |
| so a future change of hash algorithm does not require a change in format. |
| |
| Future Work |
| ----------- |
| |
| - The commit graph feature currently does not honor commit grafts. This can |
| be remedied by duplicating or refactoring the current graft logic. |
| |
| - The 'commit-graph' subcommand does not have a "verify" mode that is |
| necessary for integration with fsck. |
| |
| - After computing and storing generation numbers, we must make graph |
| walks aware of generation numbers to gain the performance benefits they |
| enable. This will mostly be accomplished by swapping a commit-date-ordered |
| priority queue with one ordered by generation number. The following |
| operations are important candidates: |
| |
| - 'log --topo-order' |
| - 'tag --merged' |
| |
| - Currently, parse_commit_gently() requires filling in the root tree |
| object for a commit. This passes through lookup_tree() and consequently |
| lookup_object(). Also, it calls lookup_commit() when loading the parents. |
| These method calls check the ODB for object existence, even if the |
| consumer does not need the content. For example, we do not need the |
| tree contents when computing merge bases. Now that commit parsing is |
| removed from the computation time, these lookup operations are the |
| slowest operations keeping graph walks from being fast. Consider |
| loading these objects without verifying their existence in the ODB and |
| only loading them fully when consumers need them. Consider a method |
| such as "ensure_tree_loaded(commit)" that fully loads a tree before |
| using commit->tree. |
| |
| - The current design uses the 'commit-graph' subcommand to generate the graph. |
| When this feature stabilizes enough to recommend to most users, we should |
| add automatic graph writes to common operations that create many commits. |
| For example, one could compute a graph on 'clone', 'fetch', or 'repack' |
| commands. |
| |
| - A server could provide a commit graph file as part of the network protocol |
| to avoid extra calculations by clients. This feature is only of benefit if |
| the user is willing to trust the file, because verifying the file is correct |
| is as hard as computing it from scratch. |
| |
| Related Links |
| ------------- |
| [0] https://bugs.chromium.org/p/git/issues/detail?id=8 |
| Chromium work item for: Serialized Commit Graph |
| |
| [1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ |
| An abandoned patch that introduced generation numbers. |
| |
| [2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ |
| Discussion about generation numbers on commits and how they interact |
| with fsck. |
| |
| [3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ |
| More discussion about generation numbers and not storing them inside |
| commit objects. A valuable quote: |
| |
| "I think we should be moving more in the direction of keeping |
| repo-local caches for optimizations. Reachability bitmaps have been |
| a big performance win. I think we should be doing the same with our |
| properties of commits. Not just generation numbers, but making it |
| cheap to access the graph structure without zlib-inflating whole |
| commit objects (i.e., packv4 or something like the "metapacks" I |
| proposed a few years ago)." |
| |
| [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u |
| A patch to remove the ahead-behind calculation from 'status'. |