| Git Sparse-Index Design Document |
| ================================ |
| |
| The sparse-checkout feature allows users to focus a working directory on |
| a subset of the files at HEAD. The cone mode patterns, enabled by |
| `core.sparseCheckoutCone`, allow for very fast pattern matching to |
| discover which files at HEAD belong in the sparse-checkout cone. |
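The following is a minimal sketch of why cone-mode matching can be fast. It is a toy Python model, not Git's C implementation: the helper names `build_cone` and `in_cone` are hypothetical, and it assumes the cone-mode rule that files directly inside the root or any ancestor of a cone directory are included, along with everything below a cone directory. Under that assumption, a lookup needs only a few set probes proportional to the depth of the path.

```python
def build_cone(recursive_dirs):
    """Return (recursive, parents) directory sets for a toy cone definition."""
    recursive = {d.strip("/") for d in recursive_dirs}
    parents = {""}  # the repository root always counts as a parent directory
    for d in recursive:
        parts = d.split("/")
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))
    return recursive, parents


def in_cone(path, recursive, parents):
    """Return True if the file at `path` is populated in this toy model."""
    directory = path.rpartition("/")[0]  # "" for files at the root
    if directory in parents:
        return True  # files directly inside a parent directory are included
    while directory:
        if directory in recursive:
            return True  # everything below a recursive directory is included
        directory = directory.rpartition("/")[0]
    return False


# Hypothetical example: the cone contains only the directory src/lib.
recursive, parents = build_cone(["src/lib"])
candidates = ["README", "src/main.c", "src/lib/a/util.c", "docs/x.txt"]
populated = [p for p in candidates if in_cone(p, recursive, parents)]
```

In this example `docs/x.txt` is the only path left out, because `docs` is neither a cone directory nor an ancestor of one.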
| |
| Three important scale dimensions for a Git working directory are: |
| |
| * `HEAD`: How many files are present at `HEAD`? |
| |
| * Populated: How many files are within the sparse-checkout cone? |
| |
| * Modified: How many files has the user modified in the working directory? |
| |
| We will use big-O notation -- O(X) -- to denote how expensive certain |
| operations are in terms of these dimensions. |
| |
| These dimensions are ordered by their magnitude: users (typically) modify |
| fewer files than are populated, and we can only populate files at `HEAD`. |
| |
| Problems occur if there is an extreme imbalance in these dimensions. For |
| example, if `HEAD` contains millions of paths but the populated set has |
| only tens of thousands, then commands like `git status` and `git add` can |
| be dominated by work that scales as O(`HEAD`) instead of O(Populated). |
| Most of that cost is in parsing and rewriting the index, which is filled |
| mostly with files at `HEAD` that are marked with the `SKIP_WORKTREE` bit. |
| |
| The sparse-index intends to take these commands that read and modify the |
| index from O(`HEAD`) to O(Populated). To do this, we need to modify the |
| index format in a significant way: add "sparse directory" entries. |
| |
| With cone mode patterns, it is possible to detect when an entire |
| directory will have its contents outside of the sparse-checkout definition. |
| Instead of listing all of the files it contains as individual entries, a |
| sparse-index contains an entry with the directory name, referencing the |
| object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit. |
| If we need to discover the details for paths within that directory, we |
| can parse trees to find that list. |
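The collapsing step described above can be sketched with a toy Python model. Everything here is an assumption made for illustration: trees are plain dictionaries keyed by made-up object IDs, `overlaps_cone` and `build_sparse_index` are hypothetical helpers, and the entry ordering is simplified. It is not Git's index code.

```python
def overlaps_cone(directory, cone_dirs):
    """True if `directory` is a cone directory, inside one, or an ancestor of one."""
    return any(
        directory == c
        or directory.startswith(c + "/")
        or c.startswith(directory + "/")
        for c in cone_dirs
    )


def build_sparse_index(trees, root_oid, cone_dirs):
    """Walk the toy trees, collapsing directories that lie fully outside the cone."""
    entries = []

    def walk(tree_oid, prefix):
        for name, (kind, oid) in sorted(trees[tree_oid].items()):
            path = prefix + name
            if kind == "blob":
                entries.append({"path": path, "oid": oid, "skip_worktree": False})
            elif overlaps_cone(path, cone_dirs):
                walk(oid, path + "/")
            else:
                # One sparse-directory entry stands in for the whole subtree.
                entries.append({"path": path + "/", "oid": oid, "skip_worktree": True})

    walk(root_oid, "")
    return entries


# Hypothetical object store: tree ID -> {name: (kind, object ID)}.
TREES = {
    "t-root": {"README": ("blob", "b1"), "docs": ("tree", "t-docs"), "src": ("tree", "t-src")},
    "t-docs": {"guide.txt": ("blob", "b2"), "img": ("tree", "t-img")},
    "t-img": {"logo.png": ("blob", "b3")},
    "t-src": {"main.c": ("blob", "b4")},
}
sparse = build_sparse_index(TREES, "t-root", ["src"])
```

Here `HEAD` has four files, but the sparse list holds three entries: `README`, the sparse directory `docs/` (pointing at tree `t-docs`), and `src/main.c`. The saving grows with the size of the collapsed subtrees.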
| |
| At time of writing, sparse-directory entries violate expectations about the |
| index format and its in-memory data structure. There are many consumers in |
| the codebase that expect to iterate through all of the index entries and |
| see only files. In fact, these loops expect to see a reference to every |
| staged file. One way to handle this is to parse trees to replace a |
| sparse-directory entry with all of the files within that tree as the index |
| is loaded. However, parsing trees is slower than parsing the index format, |
| so this expansion costs more than loading a full index in the first place. |
| The plan is to make all of these integrations "sparse aware" so that the |
| expansion through tree parsing is unnecessary and they use fewer resources |
| than they would with a full index. |
| |
| The implementation plan below follows four phases that gradually integrate |
| Git with the sparse-index. The intention is to incrementally update Git |
| commands to interact safely with the sparse-index without significant |
| slowdowns. This may not always be possible, but the hope is that the |
| primary commands that users need in their daily work are dramatically |
| improved. |
| |
| Phase I: Format and initial speedups |
| ------------------------------------ |
| |
| During this phase, Git learns to enable the sparse-index and safely parse |
| one. Protections are put in place so that every consumer of the in-memory |
| data structure can operate with its current assumption of every file at |
| `HEAD`. |
| |
| At first, every index parse will call a helper method, |
| `ensure_full_index()`, which scans the index for sparse-directory entries |
| (pointing to trees) and replaces them with the full list of paths (pointing |
| to blobs) by parsing tree objects. This will be slower than reading a full |
| index in all cases. The only noticeable change in behavior will be that the |
| serialized index file contains sparse-directory entries. |
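As a rough illustration of that expansion, here is a toy Python model. The function `toy_ensure_full_index` is only named after the helper and is not its implementation. It assumes a sparse-directory entry is recognizable by a trailing slash in its path and that trees live in a made-up dictionary keyed by object ID.

```python
def toy_ensure_full_index(entries, trees):
    """Replace each sparse-directory entry with the files found by walking its tree."""
    full = []

    def expand(tree_oid, prefix):
        for name, (kind, oid) in sorted(trees[tree_oid].items()):
            if kind == "tree":
                expand(oid, prefix + name + "/")
            else:
                # Expanded files stay outside the cone, so they keep the bit.
                full.append({"path": prefix + name, "oid": oid, "skip_worktree": True})

    for entry in entries:
        if entry["path"].endswith("/"):
            expand(entry["oid"], entry["path"])
        else:
            full.append(entry)
    return full


# Hypothetical trees reachable from the sparse-directory entry below.
TREES = {
    "t-docs": {"guide.txt": ("blob", "b2"), "img": ("tree", "t-img")},
    "t-img": {"logo.png": ("blob", "b3")},
}
sparse = [
    {"path": "README", "oid": "b1", "skip_worktree": False},
    {"path": "docs/", "oid": "t-docs", "skip_worktree": True},
    {"path": "src/main.c", "oid": "b4", "skip_worktree": False},
]
full = toy_ensure_full_index(sparse, TREES)
```

The work done by `expand` is proportional to the number of files under the collapsed directories, which is the extra cost this phase accepts in exchange for safety.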
| |
| To start, we use a new required index extension, `sdir`, to allow |
| inserting sparse-directory entries into indexes with file format |
| versions 2, 3, and 4. This prevents Git versions that do not understand |
| the sparse-index from operating on one, while allowing tools that do not |
| understand the sparse-index to operate on repositories as long as they do |
| not interact with the index. A new format, index v5, will be introduced |
| that includes sparse-directory entries by default. It might also |
| introduce other features that have been considered for improving the |
| index. |
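The protective effect of a required extension can be sketched as follows. This assumes the convention from the index format documentation that an extension whose four-byte signature starts with an uppercase letter `A`..`Z` is optional and may be ignored, so a lowercase signature such as `sdir` is treated as required. The helper names and the `understood` set are hypothetical.

```python
def is_optional_extension(signature):
    """Assumed rule: a first byte in 'A'..'Z' marks an extension as ignorable."""
    return len(signature) == 4 and ord("A") <= signature[0] <= ord("Z")


def check_extensions(signatures, understood):
    """Return the extensions a reader may skip; fail on unknown required ones."""
    ignored = []
    for sig in signatures:
        if sig in understood:
            continue
        if is_optional_extension(sig):
            ignored.append(sig)
        else:
            raise ValueError(f"index uses unsupported required extension {sig!r}")
    return ignored


# A hypothetical older reader that only understands the cache-tree extension.
old_reader = {b"TREE"}
skipped = check_extensions([b"TREE", b"UNTR"], old_reader)  # UNTR is skipped

try:
    check_extensions([b"TREE", b"sdir"], old_reader)
    refused = False
except ValueError:
    refused = True  # the older reader refuses to operate on a sparse-index
```

A reader that lists `sdir` in its `understood` set proceeds normally, which models a Git version that has learned the sparse-index.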
| |
| Next, consumers of the index will be guarded against operating on a |
| sparse-index by inserting calls to `ensure_full_index()` or |
| `expand_index_to_path()`. Lookups of a specific path will be protected |
| from within the `index_file_exists()` and `index_name_pos()` API calls: |
| they will call `ensure_full_index()` if necessary. The |
| intention here is to preserve existing behavior when interacting with a |
| sparse-checkout. We don't want a change to happen by accident, without |
| tests. Many of these locations may not need any change before removing the |
| guards, but we should not do so without tests to ensure the expected |
| behavior happens. |
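A guarded lookup of this kind might look like the toy Python model below. The `ToyIndex` class is hypothetical: entries are bare path strings, a trailing slash marks a sparse directory, the `expand` callback stands in for `ensure_full_index()`, and a miss returns -1, which simplifies whatever the real API returns.

```python
class ToyIndex:
    def __init__(self, entries, expand):
        self.entries = list(entries)  # paths; a trailing "/" marks a sparse directory
        self.expand = expand          # callback standing in for ensure_full_index()
        self.expansions = 0           # counts how often the guard fired

    def _inside_sparse_directory(self, path):
        return any(e.endswith("/") and path.startswith(e) for e in self.entries)

    def name_pos(self, path):
        """Return the position of `path`, expanding only when the guard requires it."""
        if path not in self.entries and self._inside_sparse_directory(path):
            self.entries = self.expand(self.entries)
            self.expansions += 1
        return self.entries.index(path) if path in self.entries else -1


# Hypothetical contents of the one collapsed directory.
CONTENTS = {"docs/": ["docs/guide.txt", "docs/img/logo.png"]}


def expand_all(entries):
    out = []
    for e in entries:
        out.extend(CONTENTS.get(e, [e]))
    return out


index = ToyIndex(["README", "docs/", "src/main.c"], expand_all)
pos_in_cone = index.name_pos("src/main.c")
expansions_before = index.expansions
pos_outside = index.name_pos("docs/guide.txt")
```

Looking up a path inside the cone leaves the index sparse. Only the request for `docs/guide.txt` triggers the expansion, which is the behavior the guards are meant to preserve.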
| |
| It may be desirable to _change_ the behavior of some commands in the |
| presence of a sparse index or more generally in any sparse-checkout |
| scenario. In such cases, these should be carefully communicated and |
| tested. No such behavior changes are intended during this phase. |
| |
| A scan of the codebase shows that not every loop over the cache entries |
| needs an `ensure_full_index()` check. The basic reasons include: |
| |
| 1. The loop is scanning for entries with non-zero stage. These entries |
| are not collapsed into a sparse-directory entry. |
| |
| 2. The loop is scanning for submodules. These entries are not collapsed |
| into a sparse-directory entry. |
| |
| 3. The loop is part of the index API, especially around reading or |
| writing the format. |
| |
| 4. The loop checks that the cache entries are in the correct order, which |
| holds if and only if the sparse-directory entries are in the correct |
| location. |
| |
| 5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is |
| otherwise already aware of sparse directory entries. |
| |
| 6. The sparse-index is disabled at this point when using the split-index |
| feature, so no effort is made to protect the split-index API. |
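Reasons 1 and 5 can be illustrated with a toy Python model in which the same loops run over a full and a sparse entry list. The `Entry` tuple and both helper functions are hypothetical and greatly simplified compared to Git's cache entries.

```python
from collections import namedtuple

# Toy cache entry: path, merge stage, and the SKIP_WORKTREE bit.
Entry = namedtuple("Entry", "path stage skip_worktree")


def conflicted_paths(entries):
    """Reason 1: only entries with a non-zero stage matter to this loop."""
    return [e.path for e in entries if e.stage != 0]


def populated_paths(entries):
    """Reason 5: entries with SKIP_WORKTREE set are ignored by this loop."""
    return [e.path for e in entries if not e.skip_worktree]


full = [
    Entry("README", 0, False),
    Entry("docs/guide.txt", 0, True),
    Entry("docs/img/logo.png", 0, True),
    Entry("src/main.c", 2, False),
    Entry("src/main.c", 3, False),
]
# Same state with docs/ collapsed into a single sparse-directory entry.
sparse = [
    Entry("README", 0, False),
    Entry("docs/", 0, True),
    Entry("src/main.c", 2, False),
    Entry("src/main.c", 3, False),
]
```

Both helpers return the same result for `full` and `sparse`, so loops of this shape gain nothing from expanding the index first.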
| |
| Even after inserting these guards, we will keep expanding sparse-indexes |
| for most Git commands using the `command_requires_full_index` repository |
| setting. This setting will be on by default and disabled one builtin at a |
| time until we have sufficient confidence that all of the index operations |
| are properly guarded. |
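The opt-out mechanism can be pictured with the toy Python model below. Only the setting name `command_requires_full_index` comes from the text above. `ToyRepo`, `run_builtin`, the `INTEGRATED_BUILTINS` set, and the `expand` callback are hypothetical stand-ins.

```python
# Hypothetical set of builtins that have already been made sparse-aware.
INTEGRATED_BUILTINS = {"status", "add"}


class ToyRepo:
    def __init__(self):
        # On by default: every command gets a full index unless it opts out.
        self.command_requires_full_index = True


def run_builtin(name, on_disk_entries, expand):
    """Return the in-memory entry list that the named builtin would operate on."""
    repo = ToyRepo()
    if name in INTEGRATED_BUILTINS:
        repo.command_requires_full_index = False
    entries = list(on_disk_entries)
    if repo.command_requires_full_index:
        entries = expand(entries)
    return entries


def expand(entries):
    # Stand-in for ensure_full_index(): replace the one sparse directory.
    contents = {"docs/": ["docs/guide.txt", "docs/img/logo.png"]}
    out = []
    for e in entries:
        out.extend(contents.get(e, [e]))
    return out


on_disk = ["README", "docs/", "src/main.c"]
seen_by_status = run_builtin("status", on_disk, expand)
seen_by_other = run_builtin("unintegrated-builtin", on_disk, expand)
```

An integrated builtin sees the three sparse entries, while any other builtin sees the four expanded paths. Defaulting to expansion keeps unaudited commands on the safe path.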
| |
| To complete this phase, the commands `git status` and `git add` will be |
| integrated with the sparse-index so that they operate with O(Populated) |
| performance. They will be carefully tested for operations within and |
| outside the sparse-checkout definition. |
| |
| Phase II: Careful integrations |
| ------------------------------ |
| |
| This phase focuses on ensuring that all index extensions and APIs work |
| well with a sparse-index. This requires significant increases to our test |
| coverage, especially for operations that interact with the working |
| directory outside of the sparse-checkout definition. Some of the current |
| behaviors may not be desirable, as indicated by tests already marked as |
| expected failures in `t1092-sparse-checkout-compatibility.sh`. |
| |
| The index extensions that may require special integrations are: |
| |
| * FS Monitor |
| * Untracked cache |
| |
| While integrating with these features, we should look for patterns that |
| might lead to better APIs for interacting with the index. Coalescing |
| common usage patterns into an API call can reduce the number of places |
| where sparse-directories need to be handled carefully. |
| |
| Phase III: Important command speedups |
| ------------------------------------- |
| |
| At this point, the patterns for testing and implementing sparse-directory |
| logic should be relatively stable. This phase focuses on updating some of |
| the most common builtins that use the index to operate as O(Populated). |
| Here is a potential list of commands that could be valuable to integrate |
| at this point: |
| |
| * `git commit` |
| * `git checkout` |
| * `git merge` |
| * `git rebase` |
| |
| Hopefully, commands such as `git merge` and `git rebase` can benefit |
| instead from merge algorithms that do not use the index as a data |
| structure, such as the merge-ORT strategy. As these topics mature, we |
| may enable the ORT strategy by default for repositories using the |
| sparse-index feature. |
| |
| Along with `git status` and `git add`, these commands cover the majority |
| of users' interactions with the working directory. In addition, we can |
| integrate with these commands: |
| |
| * `git grep` |
| * `git rm` |
| |
| These have been proposed as commands whose behavior could change when in a |
| repo with a sparse-checkout definition. It would be good to include this |
| behavior automatically when using a sparse-index. The behavior switch |
| must be communicated clearly to the user. |
| |
| This phase is the first where parallel work might be possible without too |
| many conflicts between topics. |
| |
| Phase IV: The long tail |
| ----------------------- |
| |
| This last phase is less a "phase" and more "the new normal" after all of |
| the previous work. |
| |
| To start, the `command_requires_full_index` option could be removed in |
| favor of expanding only when hitting an API guard. |
| |
| There are many Git commands that could use special attention to operate as |
| O(Populated), while some might be so rare that it is acceptable to leave |
| them with additional overhead when a sparse-index is present. |
| |
| Here are some commands that might be useful to update: |
| |
| * `git sparse-checkout set` |
| * `git am` |
| * `git clean` |
| * `git stash` |