Table of contents:

  * Terminology
  * Purpose of sparse-checkouts
  * Usecases of primary concern
  * Oversimplified mental models ("Cliff Notes" for this document!)
  * Desired behavior
  * Behavior classes
  * Subcommand-dependent defaults
  * Sparse specification vs. sparsity patterns
  * Implementation Questions
  * Implementation Goals/Plans
  * Known bugs
  * Reference Emails

=== Terminology ===
18
19cone mode: one of two modes for specifying the desired subset of files
20 in a sparse-checkout. In cone-mode, the user specifies
21 directories (getting both everything under that directory as
22 well as everything in leading directories), while in non-cone
23 mode, the user specifies gitignore-style patterns. Controlled
24 by the --[no-]cone option to sparse-checkout init|set.
25
26SKIP_WORKTREE: When tracked files do not match the sparse specification and
27 are removed from the working tree, the file in the index is marked
28 with a SKIP_WORKTREE bit. Note that if a tracked file has the
29 SKIP_WORKTREE bit set but the file is later written by the user to
30 the working tree anyway, the SKIP_WORKTREE bit will be cleared at
31 the beginning of any subsequent Git operation.
32
33 Most sparse checkout users are unaware of this implementation
34 detail, and the term should generally be avoided in user-facing
35 descriptions and command flags. Unfortunately, prior to the
36 `sparse-checkout` subcommand this low-level detail was exposed,
37 and as of time of writing, is still exposed in various places.
38
39sparse-checkout: a subcommand in git used to reduce the files present in
40 the working tree to a subset of all tracked files. Also, the
41 name of the file in the $GIT_DIR/info directory used to track
42 the sparsity patterns corresponding to the user's desired
43 subset.
44
45sparse cone: see cone mode
46
47sparse directory: An entry in the index corresponding to a directory, which
48 appears in the index instead of all the files under that directory
49 that would normally appear. See also sparse-index. Something that
50 can cause confusion is that the "sparse directory" does NOT match
51 the sparse specification, i.e. the directory is NOT present in the
52 working tree. May be renamed in the future (e.g. to "skipped
53 directory").
54
55sparse index: A special mode for sparse-checkout that also makes the
56 index sparse by recording a directory entry in lieu of all the
57 files underneath that directory (thus making that a "skipped
58 directory" which unfortunately has also been called a "sparse
59 directory"), and does this for potentially multiple
60 directories. Controlled by the --[no-]sparse-index option to
61 init|set|reapply.
62
63sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
64 define the set of files of interest. A warning: It is easy to
65 over-use this term (or the shortened "patterns" term), for two
66 reasons: (1) users in cone mode specify directories rather than
67 patterns (their directories are transformed into patterns, but
68 users may think you are talking about non-cone mode if you use the
 word "patterns"), and (2) the sparse specification might
70 transiently differ in the working tree or index from the sparsity
71 patterns (see "Sparse specification vs. sparsity patterns").
72
73sparse specification: The set of paths in the user's area of focus. This
74 is typically just the tracked files that match the sparsity
75 patterns, but the sparse specification can temporarily differ and
76 include additional files. (See also "Sparse specification
77 vs. sparsity patterns")
78
79 * When working with history, the sparse specification is exactly
80 the set of files matching the sparsity patterns.
81 * When interacting with the working tree, the sparse specification
82 is the set of tracked files with a clear SKIP_WORKTREE bit or
83 tracked files present in the working copy.
84 * When modifying or showing results from the index, the sparse
85 specification is the set of files with a clear SKIP_WORKTREE bit
86 or that differ in the index from HEAD.
87 * If working with the index and the working copy, the sparse
88 specification is the union of the paths from above.
89
90vivifying: When a command restores a tracked file to the working tree (and
91 hopefully also clears the SKIP_WORKTREE bit in the index for that
92 file), this is referred to as "vivifying" the file.
93
94
=== Purpose of sparse-checkouts ===
96
97sparse-checkouts exist to allow users to work with a subset of their
98files.
99
100You can think of sparse-checkouts as subdividing "tracked" files into two
101categories -- a sparse subset, and all the rest. Implementationally, we
102mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
103out of the working tree. The SKIP_WORKTREE files are still tracked, just
104not present in the working tree.
105
106In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
107is missing from the working tree but pretend the file contents match HEAD".
108That was not only bogus (it actually meant the file missing from the
109working tree matched the index rather than HEAD), but it was also a
110low-level detail which only provided decent behavior for a few commands.
111There were a surprising number of ways in which that guiding principle gave
112command results that violated user expectations, and as such was a bad
113mental model. However, it persisted for many years and may still be found
114in some corners of the code base.
115
116Anyway, the idea of "working with a subset of files" is simple enough, but
117there are multiple different high-level usecases which affect how some Git
118subcommands should behave. Further, even if we only considered one of
119those usecases, sparse-checkouts can modify different subcommands in over a
120half dozen different ways. Let's start by considering the high level
121usecases:
122
123 A) Users are _only_ interested in the sparse portion of the repo
124
125 A*) Users are _only_ interested in the sparse portion of the repo
126 that they have downloaded so far
127
128 B) Users want a sparse working tree, but are working in a larger whole
129
130 C) sparse-checkout is a behind-the-scenes implementation detail allowing
131 Git to work with a specially crafted in-house virtual file system;
132 users are actually working with a "full" working tree that is
133 lazily populated, and sparse-checkout helps with the lazy population
134 piece.
135
136It may be worth explaining each of these in a bit more detail:
137
138
139 (Behavior A) Users are _only_ interested in the sparse portion of the repo
140
141These folks might know there are other things in the repository, but
142don't care. They are uninterested in other parts of the repository, and
143only want to know about changes within their area of interest. Showing
144them other files from history (e.g. from diff/log/grep/etc.) is a
145usability annoyance, potentially a huge one since other changes in
146history may dwarf the changes they are interested in.
147
148Some of these users also arrive at this usecase from wanting to use partial
149clones together with sparse checkouts (in a way where they have downloaded
150blobs within the sparse specification) and do disconnected development.
151Not only do these users generally not care about other parts of the
repository, but they consider it a blocker for Git commands to try to operate on
153those. If commands attempt to access paths in history outside the sparsity
154specification, then the partial clone will attempt to download additional
155blobs on demand, fail, and then fail the user's command. (This may be
156unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
157reconcile outside the sparse specification, but we should limit how often
158users are forced to connect to the network.)
159
160Also, even for users using partial clones that do not mind being
161always connected to the network, the need to download blobs as
162side-effects of various other commands (such as the printed diffstat
163after a merge or pull) can lead to worries about local repository size
164growing unnecessarily[10].
165
166 (Behavior A*) Users are _only_ interested in the sparse portion of the repo
167 that they have downloaded so far (a variant on the first usecase)
168
This variant is driven by folks who use partial clones together with
sparse checkouts and do disconnected development (so far sounding like a
subset of behavior A users), and who do so on very large repositories. The
172reason for yet another variant is that downloading even just the blobs
173through history within their sparse specification may be too much, so they
174only download some. They would still like operations to succeed without
175network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide
177partial results that depend on what happens to have been downloaded.
178
179This variant could be viewed as Behavior A with the sparse specification
180for history querying operations modified from "sparsity patterns" to
181"sparsity patterns limited to the blobs we have already downloaded".
182
183 (Behavior B) Users want a sparse working tree, but are working in a
184 larger whole
185
186Stolee described this usecase this way[11]:
187
188"I'm also focused on users that know that they are a part of a larger
189whole. They know they are operating on a large repository but focus on
190what they need to contribute their part. I expect multiple "roles" to
191use very different, almost disjoint parts of the codebase. Some other
192"architect" users operate across the entire tree or hop between different
193sections of the codebase as necessary. In this situation, I'm wary of
194scoping too many features to the sparse-checkout definition, especially
195"git log," as it can be too confusing to have their view of the codebase
196depend on your "point of view."
197
198People might also end up wanting behavior B due to complex inter-project
199dependencies. The initial attempts to use sparse-checkouts usually involve
200the directories you are directly interested in plus what those directories
201depend upon within your repository. But there's a monkey wrench here: if
202you have integration tests, they invert the hierarchy: to run integration
203tests, you need not only what you are interested in and its in-tree
204dependencies, you also need everything that depends upon what you are
205interested in or that depends upon one of your dependencies...AND you need
206all the in-tree dependencies of that expanded group. That can easily
207change your sparse-checkout into a nearly dense one.
208
209Naturally, that tends to kill the benefits of sparse-checkouts. There are
210a couple solutions to this conundrum: either avoid grabbing in-repo
211dependencies (maybe have built versions of your in-repo dependencies pulled
212from a CI cache somewhere), or say that users shouldn't run integration
213tests directly and instead do it on the CI server when they submit a code
214review. Or do both. Regardless of whether you stub out your in-repo
215dependencies or stub out the things that depend upon you, there is
216certainly a reason to want to query and be aware of those other stubbed-out
217parts of the repository, particularly when the dependencies are complex or
218change relatively frequently. Thus, for such uses, sparse-checkouts can be
219used to limit what you directly build and modify, but these users do not
220necessarily want their sparse checkout paths to limit their queries of
221versions in history.
222
223Some people may also be interested in behavior B over behavior A simply as
224a performance workaround: if they are using non-cone mode, then they have
225to deal with its inherent quadratic performance problems. In that mode,
226every operation that checks whether paths match the sparsity specification
227can be expensive. As such, these users may only be willing to pay for
228those expensive checks when interacting with the working copy, and may
229prefer getting "unrelated" results from their history queries over having
230slow commands.
231
232 (Behavior C) sparse-checkout is an implementational detail supporting a
233 special VFS.
234
235This usecase goes slightly against the traditional definition of
236sparse-checkout in that it actually tries to present a full or dense
237checkout to the user. However, this usecase utilizes the same underlying
238technical underpinnings in a new way which does provide some performance
239advantages to users. The basic idea is that a company can have an in-house
240Git-aware Virtual File System which pretends all files are present in the
241working tree, by intercepting all file system accesses and using those to
242fetch and write accessed files on demand via partial clones. The VFS uses
243sparse-checkout to prevent Git from writing or paying attention to many
244files, and manually updates the sparse checkout patterns itself based on
245user access and modification of files in the working tree. See commit
246ecc7c8841d ("repo_read_index: add config to expect files outside sparse
247patterns", 2022-02-25) and the link at [17] for a more detailed description
248of such a VFS.
249
250The biggest difference here is that users are completely unaware that the
251sparse-checkout machinery is even in use. The sparse patterns are not
252specified by the user but rather are under the complete control of the VFS
253(and the patterns are updated frequently and dynamically by it). The user
254will perceive the checkout as dense, and commands should thus behave as if
255all files are present.
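
A small sketch of the knob that commit added follows. The setting
`sparse.expectFilesOutsideOfPatterns` is the config from ecc7c8841d; the
hand-written file is only a stand-in for what a real VFS would
materialize, and all paths are hypothetical.

```shell
set -eu
# Throwaway repository; every path below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir src docs
echo "code" > src/main.c
echo "text" > docs/guide.txt
git add .
git commit -q -m "initial"
git sparse-checkout set --cone src

# Tell Git that files may legitimately exist outside the sparse patterns.
git config sparse.expectFilesOutsideOfPatterns true

# Stand-in for the VFS lazily populating a file that Git itself skipped.
mkdir -p docs
echo "text" > docs/guide.txt

# With the config set, Git leaves the SKIP_WORKTREE bit alone ("S").
state=$(git ls-files -t docs/guide.txt)
echo "$state"
```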
256
257
=== Usecases of primary concern ===
259
260Most of the rest of this document will focus on Behavior A and Behavior
261B. Some notes about the other two cases and why we are not focusing on
262them:
263
264 (Behavior A*)
265
266Supporting this usecase is estimated to be difficult and a lot of work.
267There are no plans to implement it currently, but it may be a potential
268future alternative. Knowing about the existence of additional alternatives
269may affect our choice of command line flags (e.g. if we need tri-state or
270quad-state flags rather than just binary flags), so it was still important
271to at least note.
272
273Further, I believe the descriptions below for Behavior A are probably still
274valid for this usecase, with the only exception being that it redefines the
275sparse specification to restrict it to already-downloaded blobs. The hard
276part is in making commands capable of respecting that modified definition.
277
278 (Behavior C)
279
280This usecase violates some of the early sparse-checkout documented
281assumptions (since files marked as SKIP_WORKTREE will be displayed to users
282as present in the working tree). That violation may mean various
283sparse-checkout related behaviors are not well suited to this usecase and
284we may need tweaks -- to both documentation and code -- to handle it.
285However, this usecase is also perhaps the simplest model to support in that
286everything behaves like a dense checkout with a few exceptions (e.g. branch
287checkouts and switches write fewer things, knowing the VFS will lazily
288write the rest on an as-needed basis).
289
Since there is no publicly available VFS-related code for folks to try,
the number of folks who can test such a usecase is limited.
292
293The primary reason to note the Behavior C usecase is that as we fix things
294to better support Behaviors A and B, there may be additional places where
295we need to make tweaks allowing folks in this usecase to get the original
296non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
297add config to expect files outside sparse patterns", 2022-02-25). The
secondary reason to note Behavior C is so that folks taking advantage of
299Behavior C do not assume they are part of the Behavior B camp and propose
300patches that break things for the real Behavior B folks.
301
302
=== Oversimplified mental models ===
304
305An oversimplification of the differences in the above behaviors is:
306
307 Behavior A: Restrict worktree and history operations to sparse specification
308 Behavior B: Restrict worktree operations to sparse specification; have any
309 history operations work across all files
310 Behavior C: Do not restrict either worktree or history operations to the
311 sparse specification...with the exception of branch checkouts or
312 switches which avoid writing files that will match the index so
313 they can later lazily be populated instead.
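
To make the A vs. B distinction concrete, here is a rough sketch. Plain
`git log` walks all history (the Behavior B view), while passing the cone
directories as a pathspec approximates what a Behavior A user would want to
see. This is a manual emulation, not a built-in mode; it ignores top-level
files that cone mode also includes, and all names are hypothetical.

```shell
set -eu
# Throwaway repository; every path below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir src docs
echo "code" > src/main.c
echo "text" > docs/guide.txt
git add .
git commit -q -m "initial"
echo "more text" >> docs/guide.txt
git commit -q -am "docs change"
echo "more code" >> src/main.c
git commit -q -am "src change"

git sparse-checkout set --cone src

# Behavior B view: history queries cover every path.
all_history=$(git log --format=%s)

# Rough Behavior A view: limit the query to the cone directories.
scoped_history=$(git log --format=%s -- $(git sparse-checkout list))

echo "all:    $all_history"
echo "scoped: $scoped_history"
```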
314
315
=== Desired behavior ===
317
318As noted previously, despite the simple idea of just working with a subset
319of files, there are a range of different behavioral changes that need to be
320made to different subcommands to work well with such a feature. See
321[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
322that mere composition of other commands that individually worked correctly
323in a sparse-checkout context did not imply that the higher level command
324would work correctly; it sometimes requires further tweaks. So,
325understanding these differences can be beneficial.
326
327* Commands behaving the same regardless of high-level use-case
328
329 * commands that only look at files within the sparsity specification
330
331 * diff (without --cached or REVISION arguments)
332 * grep (without --cached or REVISION arguments)
333 * diff-files
334
335 * commands that restore files to the working tree that match sparsity
336 patterns, and remove unmodified files that don't match those
337 patterns:
338
339 * switch
340 * checkout (the switch-like half)
341 * read-tree
342 * reset --hard
343
344 * commands that write conflicted files to the working tree, but otherwise
345 will omit writing files to the working tree that do not match the
346 sparsity patterns:
347
348 * merge
349 * rebase
350 * cherry-pick
351 * revert
352
353 * `am` and `apply --cached` should probably be in this section but
354 are buggy (see the "Known bugs" section below)
355
356 The behavior for these commands somewhat depends upon the merge
357 strategy being used:
358 * `ort` behaves as described above
 * `octopus` and `resolve` will always vivify any file changed in the merge
360 relative to the first parent, which is rather suboptimal.
361
362 It is also important to note that these commands WILL update the index
363 outside the sparse specification relative to when the operation began,
364 BUT these commands often make a commit just before or after such that
365 by the end of the operation there is no change to the index outside the
366 sparse specification. Of course, if the operation hits conflicts or
367 does not make a commit, then these operations clearly can modify the
368 index outside the sparse specification.
369
370 Finally, it is important to note that at least the first four of these
371 commands also try to remove differences between the sparse
372 specification and the sparsity patterns (much like the commands in the
373 previous section).
374
375 * commands that always ignore sparsity since commits must be full-tree
376
377 * archive
378 * bundle
379 * commit
380 * format-patch
381 * fast-export
382 * fast-import
383 * commit-tree
384
385 * commands that write any modified file to the working tree (conflicted
386 or not, and whether those paths match sparsity patterns or not):
387
388 * stash
389 * apply (without `--index` or `--cached`)
390
391* Commands that may slightly differ for behavior A vs. behavior B:
392
393 Commands in this category behave mostly the same between the two
394 behaviors, but may differ in verbosity and types of warning and error
395 messages.
396
397 * commands that make modifications to which files are tracked:
398 * add
399 * rm
400 * mv
401 * update-index
402
403 The fact that files can move between the 'tracked' and 'untracked'
404 categories means some commands will have to treat untracked files
405 differently. But if we have to treat untracked files differently,
406 then additional commands may also need changes:
407
408 * status
409 * clean
410
411 In particular, `status` may need to report any untracked files outside
412 the sparsity specification as an erroneous condition (especially to
413 avoid the user trying to `git add` them, forcing `git add` to display
414 an error).
415
416 It's not clear to me exactly how (or even if) `clean` would change,
417 but it's the other command that also affects untracked files.
418
419 `update-index` may be slightly special. Its --[no-]skip-worktree flag
420 may need to ignore the sparse specification by its nature. Also, its
421 current --[no-]ignore-skip-worktree-entries default is totally bogus.
422
423 * commands for manually tweaking paths in both the index and the working tree
424 * `restore`
425 * the restore-like half of `checkout`
426
427 These commands should be similar to add/rm/mv in that they should
428 only operate on the sparse specification by default, and require a
429 special flag to operate on all files.
430
431 Also, note that these commands currently have a number of issues (see
432 the "Known bugs" section below)
433
434* Commands that significantly differ for behavior A vs. behavior B:
435
436 * commands that query history
437 * diff (with --cached or REVISION arguments)
438 * grep (with --cached or REVISION arguments)
439 * show (when given commit arguments)
440 * blame (only matters when one or more -C flags are passed)
441 * and annotate
442 * log
443 * whatchanged
444 * ls-files
445 * diff-index
446 * diff-tree
447 * ls-tree
448
449 Note: for log and whatchanged, revision walking logic is unaffected
450 but displaying of patches is affected by scoping the command to the
451 sparse-checkout. (The fact that revision walking is unaffected is
452 why rev-list, shortlog, show-branch, and bisect are not in this
453 list.)
454
455 ls-files may be slightly special in that e.g. `git ls-files -t` is
456 often used to see what is sparse and what is not. Perhaps -t should
457 always work on the full tree?
458
459* Commands I don't know how to classify
460
461 * range-diff
462
463 Is this like `log` or `format-patch`?
464
465 * cherry
466
467 See range-diff
468
469* Commands unaffected by sparse-checkouts
470
471 * shortlog
472 * show-branch
473 * rev-list
474 * bisect
475
476 * branch
477 * describe
478 * fetch
479 * gc
480 * init
481 * maintenance
482 * notes
483 * pull (merge & rebase have the necessary changes)
484 * push
485 * submodule
486 * tag
487
488 * config
489 * filter-branch (works in separate checkout without sparse-checkout setup)
490 * pack-refs
491 * prune
492 * remote
493 * repack
494 * replace
495
496 * bugreport
497 * count-objects
498 * fsck
499 * gitweb
500 * help
501 * instaweb
502 * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
503 * rerere
504 * verify-commit
505 * verify-tag
506
507 * commit-graph
508 * hash-object
509 * index-pack
510 * mktag
511 * mktree
512 * multi-pack-index
513 * pack-objects
514 * prune-packed
515 * symbolic-ref
516 * unpack-objects
517 * update-ref
518 * write-tree (operates on index, possibly optimized to use sparse dir entries)
519
520 * for-each-ref
521 * get-tar-commit-id
522 * ls-remote
523 * merge-base (merges are computed full tree, so merge base should be too)
524 * name-rev
525 * pack-redundant
526 * rev-parse
527 * show-index
528 * show-ref
529 * unpack-file
530 * var
531 * verify-pack
532
533 * <Everything under 'Interacting with Others' in 'git help --all'>
534 * <Everything under 'Low-level...Syncing' in 'git help --all'>
535 * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
536 * <Everything under 'External commands' in 'git help --all'>
537
538* Commands that might be affected, but who cares?
539
540 * merge-file
541 * merge-index
542 * gitk?
543
544
=== Behavior classes ===
546
547From the above there are a few classes of behavior:
548
549 * "restrict"
550
551 Commands in this class only read or write files in the working tree
552 within the sparse specification.
553
554 When moving to a new commit (e.g. switch, reset --hard), these commands
555 may update index files outside the sparse specification as of the start
556 of the operation, but by the end of the operation those index files
557 will match HEAD again and thus those files will again be outside the
558 sparse specification.
559
560 When paths are explicitly specified, these paths are intersected with
561 the sparse specification and will only operate on such paths.
562 (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
563
564 Some of these commands may also attempt, at the end of their operation,
565 to cull transient differences between the sparse specification and the
566 sparsity patterns (see "Sparse specification vs. sparsity patterns" for
567 details, but this basically means either removing unmodified files not
568 matching the sparsity patterns and marking those files as
569 SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
570 marking those files as !SKIP_WORKTREE).
571
572 * "restrict modulo conflicts"
573
574 Commands in this class generally behave like the "restrict" class,
575 except that:
576 (1) they will ignore the sparse specification and write files with
577 conflicts to the working tree (thus temporarily expanding the
578 sparse specification to include such files.)
579 (2) they are grouped with commands which move to a new commit, since
580 they often create a commit and then move to it, even though we
581 know there are many exceptions to moving to the new commit. (For
582 example, the user may rebase a commit that becomes empty, or have
583 a cherry-pick which conflicts, or a user could run `merge
584 --no-commit`, and we also view `apply --index` kind of like `am
585 --no-commit`.) As such, these commands can make changes to index
586 files outside the sparse specification, though they'll mark such
587 files with SKIP_WORKTREE.
588
589 * "restrict also specially applied to untracked files"
590
591 Commands in this class generally behave like the "restrict" class,
592 except that they have to handle untracked files differently too, often
593 because these commands are dealing with files changing state between
594 'tracked' and 'untracked'. Often, this may mean printing an error
595 message if the command had nothing to do, but the arguments may have
596 referred to files whose tracked-ness state could have changed were it
597 not for the sparsity patterns excluding them.
598
599 * "no restrict"
600
601 Commands in this class ignore the sparse specification entirely.
602
603 * "restrict or no restrict dependent upon behavior A vs. behavior B"
604
605 Commands in this class behave like "no restrict" for folks in the
606 behavior B camp, and like "restrict" for folks in the behavior A camp.
607 However, when behaving like "restrict" a warning of some sort might be
608 provided that history queries have been limited by the sparse-checkout
609 specification.
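
The "restrict modulo conflicts" class can be seen in a small experiment.
The sketch below sets up two branches that conflict in a file outside the
sparse specification and then merges them; branch names, paths, and
contents are hypothetical.

```shell
set -eu
# Throwaway repository; every path below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir src docs
echo "code" > src/main.c
echo "text" > docs/guide.txt
git add .
git commit -q -m "initial"
base_branch=$(git symbolic-ref --short HEAD)

# Two branches edit the same file in incompatible ways.
git checkout -q -b side
echo "side version" > docs/guide.txt
git commit -q -am "side edits docs"
git checkout -q "$base_branch"
echo "base version" > docs/guide.txt
git commit -q -am "base edits docs"

# Narrow the working tree so that docs/ is outside the sparse specification.
git sparse-checkout set --cone src
present_before=$( [ -e docs/guide.txt ] && echo yes || echo no )

# The merge stops on the conflict and writes the conflicted file to the
# working tree even though it does not match the sparsity patterns.
git merge side >/dev/null 2>&1 || true
present_after=$( [ -e docs/guide.txt ] && echo yes || echo no )

echo "before merge: $present_before, after merge: $present_after"
```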
610
611
=== Subcommand-dependent defaults ===
613
614Note that we have different defaults depending on the command for the
desired behavior:
616
617 * Commands defaulting to "restrict":
618 * diff-files
619 * diff (without --cached or REVISION arguments)
620 * grep (without --cached or REVISION arguments)
621 * switch
622 * checkout (the switch-like half)
623 * reset (<commit>)
624
625 * restore
626 * checkout (the restore-like half)
627 * checkout-index
628 * reset (with pathspec)
629
630 This behavior makes sense; these interact with the working tree.
631
632 * Commands defaulting to "restrict modulo conflicts":
633 * merge
634 * rebase
635 * cherry-pick
636 * revert
637
638 * am
639 * apply --index (which is kind of like an `am --no-commit`)
640
641 * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
642 * reset (<tree-ish>, due to similarity to read-tree)
643
644 These also interact with the working tree, but require slightly
645 different behavior either so that (a) conflicts can be resolved or (b)
646 because they are kind of like a merge-without-commit operation.
647
648 (See also the "Known bugs" section below regarding `am` and `apply`)
649
650 * Commands defaulting to "no restrict":
651 * archive
652 * bundle
653 * commit
654 * format-patch
655 * fast-export
656 * fast-import
657 * commit-tree
658
659 * stash
660 * apply (without `--index`)
661
662 These have completely different defaults and perhaps deserve the most
663 detailed explanation:
664
665 In the case of commands in the first group (format-patch,
666 fast-export, bundle, archive, etc.), these are commands for
667 communicating history, which will be broken if they restrict to a
668 subset of the repository. As such, they operate on full paths and
669 have no `--restrict` option for overriding. Some of these commands may
670 take paths for manually restricting what is exported, but it needs to
671 be very explicit.
672
673 In the case of stash, it needs to vivify files to avoid losing the
674 user's changes.
675
676 In the case of apply without `--index`, that command needs to update
677 the working tree without the index (or the index without the working
678 tree if `--cached` is passed), and if we restrict those updates to the
679 sparse specification then we'll lose changes from the user.
680
681 * Commands defaulting to "restrict also specially applied to untracked files":
682 * add
683 * rm
684 * mv
685 * update-index
686 * status
687 * clean (?)
688
    Our original implementation for the first three of these commands was
    "no restrict", but it had some severe usability issues:
      * `git add <somefile>`, if honored for a file outside the sparse
        specification, can result in the file randomly disappearing later
        when some subsequent command is run (since various commands
        automatically clean up unmodified files outside the sparse
        specification).
      * `git rm '*.jpg'` could very negatively surprise users if it deletes
        files outside the range of the user's interest.
      * `git mv` has similar surprises when moving into or out of the cone,
        so it is best to restrict by default.

    So, we switched `add` and `rm` to default to "restrict", which made
    usability problems much less severe and less frequent, but we still got
    complaints because commands like:
        git add <file-outside-sparse-specification>
        git rm <file-outside-sparse-specification>
    would silently do nothing. We should instead print an error in those
    cases to get usability right.

    update-index needs to be updated to match, and status and maybe clean
    also need to be updated to specially handle untracked paths.

    There may be a difference here between behavior A and behavior B in
    terms of verbosity of errors or additional warnings.

  * Commands falling under "restrict or no restrict dependent upon behavior
    A vs. behavior B"

    * diff (with --cached or REVISION arguments)
    * grep (with --cached or REVISION arguments)
    * show (when given commit arguments)
    * blame (only matters when one or more -C flags passed)
      * and annotate
    * log
      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
    * ls-files
    * diff-index
    * diff-tree
    * ls-tree

    For now, we default to behavior B for these, which want a default of
    "no restrict".

    Note that two of these commands -- diff and grep -- also appeared in a
    different list with a default of "restrict", but only when limited to
    searching the working tree. The working tree vs. history distinction
    is fundamental in how behavior B operates, so this is expected. Note,
    though, that for diff and grep with --cached, when doing "restrict"
    behavior, the difference between sparse specification and sparsity
    patterns is important to handle.

    "restrict" may make more sense as the long-term default for these [12].
    Also, supporting "restrict" for these commands might be a fair amount
    of work to implement, meaning it might be implemented over multiple
    releases. If "restrict" were the default in only the commands that
    already supported it, behavior B users would have to keep adding
    flags to more and more commands, depending on Git version, to get the
    behavior they want. That gradual switchover would be painful, so we
    should avoid it at least until "restrict" is fully implemented.
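The defaults discussed in this section can be summarized with a small
model. The following Python sketch is illustrative only and is not Git
code: the names `Scope`, `default_scope` and `select_paths` are
hypothetical, the command sets list only commands named above, and the
sparse specification is simplified to a plain set of paths. It shows how
the default could depend on the command class and on behavior A vs. B,
and how `add`/`rm` could report an error for paths outside the sparse
specification instead of silently doing nothing.

```python
from enum import Enum


class Scope(Enum):
    SPARSE = "sparse"  # "restrict": limit to the sparse specification
    ALL = "all"        # "no restrict": operate on the full tree


# Commands that default to "restrict" (hypothetical grouping for this sketch).
RESTRICT_BY_DEFAULT = {"add", "rm", "mv"}
# Commands whose default depends on behavior A vs. behavior B.
BEHAVIOR_DEPENDENT = {"show", "blame", "log", "ls-files",
                      "diff-index", "diff-tree", "ls-tree"}
# Commands that can target either the working tree or history.
WORKTREE_OR_HISTORY = {"diff", "grep"}


def default_scope(command, behavior, queries_history=False):
    """Return the default Scope for `command` under behavior "A" or "B"."""
    if behavior not in ("A", "B"):
        raise ValueError(f"unknown behavior: {behavior!r}")
    if command in RESTRICT_BY_DEFAULT:
        return Scope.SPARSE
    if command in WORKTREE_OR_HISTORY and not queries_history:
        # Working-tree-only diff/grep is restricted under both behaviors.
        return Scope.SPARSE
    if command in BEHAVIOR_DEPENDENT or command in WORKTREE_OR_HISTORY:
        return Scope.SPARSE if behavior == "A" else Scope.ALL
    raise ValueError(f"command not modeled: {command!r}")


def select_paths(paths, sparse_spec, scope):
    """Return the paths to act on, raising instead of silently skipping."""
    if scope is Scope.ALL:
        return list(paths)
    outside = [p for p in paths if p not in sparse_spec]
    if outside:
        raise ValueError("paths outside the sparse specification: "
                         + ", ".join(outside))
    return list(paths)
```

For example, `default_scope("grep", "B")` models a working-tree grep,
while `default_scope("grep", "B", queries_history=True)` models
`git grep expression REVISION` or `git grep --cached`.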


=== Sparse specification vs. sparsity patterns ===

In a well-behaved situation, the sparse specification is given directly
by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
diverge for a few reasons:

  * needing to resolve conflicts (merging will vivify conflicted files)
  * running Git commands that implicitly vivify files (e.g. "git stash apply")
  * running Git commands that explicitly vivify files (e.g. "git checkout
    --ignore-skip-worktree-bits FILENAME")
  * other commands that write to these files (perhaps a user copies it
    from elsewhere)

For the last item, note that we do automatically clear the SKIP_WORKTREE
bit for files that are present in the working tree. This has been true
since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09).

However, such a situation is transient because:

  * Such transient differences can and will be automatically removed as
    a side-effect of commands which call unpack_trees() (checkout,
    merge, reset, etc.).
  * Users can also request such transient differences be corrected via
    running `git sparse-checkout reapply`. Various places recommend
    running that command.
  * Additional commands are also welcome to implicitly fix these
    differences; we may add more in the future.

While we avoid dropping unstaged changes or files which have conflicts,
we otherwise aggressively try to fix these transient differences. If
users want these differences to persist, they should run the `set` or
`add` subcommands of `git sparse-checkout` to reflect their intended
sparse specification.

However, when we need to do a query on history restricted to the
"relevant subset of files", such a transiently expanded sparse
specification is ignored. There are a couple of reasons for this:

  * The behavior wanted when doing something like
        git grep expression REVISION
    is roughly what users would expect from
        git checkout REVISION && git grep expression
    (modulo a "REVISION:" prefix), which has a couple of ramifications:

  * REVISION may have paths not in the current index, so there is no
    path we can consult for a SKIP_WORKTREE setting for those paths.

  * Since `checkout` is one of those commands that tries to remove
    transient differences in the sparse specification, it makes sense
    to use the corrected sparse specification
    (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
    consult SKIP_WORKTREE anyway.

So, a transiently expanded (or restricted) sparse specification applies to
the working tree, but not to history queries where we always use the
sparsity patterns. (See [16] for an early discussion of this.)

Similar to a transiently expanded sparse specification of the working tree
based on additional files being present in the working tree, we also need
to consider additional files being modified in the index. In particular,
if the user has staged changes to files (relative to HEAD) that do not
match the sparsity patterns, and the file is not present in the working
tree, we still want to consider the file part of the sparse specification
if we are specifically performing a query related to the index (e.g. git
diff --cached [REVISION], git diff-index [REVISION], git restore --staged
--source=REVISION -- PATHS, etc.). Note that a transiently expanded sparse
specification for the index usually only matters under behavior A, since
under behavior B index operations are lumped with history and tend to
operate full-tree.
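The distinction drawn in this section can be sketched as a small
function. This is a simplified model under stated assumptions, not how
Git implements it: `in_cone` and `sparse_specification` are hypothetical
names, only cone-mode patterns are modeled (as a set of directory names),
and the caller is assumed to already know which out-of-pattern files are
present in the working tree and which have staged changes relative to
HEAD.

```python
def in_cone(path, cone_dirs):
    """Cone-mode match: everything under a listed directory, plus files
    sitting directly in a leading (ancestor) directory, including the
    top level."""
    parent = path.rpartition("/")[0]
    if parent == "":
        return True  # top-level files are always inside the cone
    return any(
        path.startswith(d + "/") or d.startswith(parent + "/")
        for d in cone_dirs
    )


def sparse_specification(paths, cone_dirs, present=(), staged=(),
                         query="worktree"):
    """Return the set of paths treated as in scope for a kind of query.

    query="history":  sparsity patterns only.
    query="worktree": patterns plus files present in the working tree.
    query="index":    the worktree set plus files with staged changes.
    """
    if query not in ("worktree", "index", "history"):
        raise ValueError(f"unknown query kind: {query!r}")
    spec = {p for p in paths if in_cone(p, cone_dirs)}
    if query == "history":
        return spec
    spec |= set(present)
    if query == "index":
        spec |= set(staged)
    return spec
```

With `cone_dirs={"src/lib"}`, a file such as `docs/guide.txt` that was
vivified into the working tree is part of the result for "worktree" and
"index" queries but not for "history" queries.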


=== Implementation Questions ===

  * Do the options --scope={sparse,all} sound good to others? Are there better
    options?
    * Names in use, or appearing in patches, or previously suggested:
      * --sparse/--dense
      * --ignore-skip-worktree-bits
      * --ignore-skip-worktree-entries
      * --ignore-sparsity
      * --[no-]restrict-to-sparse-paths
      * --full-tree/--sparse-tree
      * --[no-]restrict
      * --scope={sparse,all}
      * --focus/--unfocus
      * --limit/--unlimited
    * Rationale making me lean slightly towards --scope={sparse,all}:
      * We want a name that works for many commands, so we need a name that
        does not conflict
      * We know that we have more than two possible usecases, so it is best
        to avoid a flag that appears to be binary.
      * --scope={sparse,all} isn't overly long and seems relatively
        explanatory
      * `--sparse`, as used in add/rm/mv, is totally backwards for
        grep/log/etc. Changing the meaning of `--sparse` for these
        commands would fix the backwardness, but possibly break existing
        scripts. Using a new name pairing would allow us to treat
        `--sparse` in these commands as a deprecated alias.
      * There is a different `--sparse`/`--dense` pair for commands using
        revision machinery, so using that naming might cause confusion
      * There is also a `--sparse` in both pack-objects and show-branch, which
        don't conflict but do suggest that `--sparse` is overloaded
      * The name --ignore-skip-worktree-bits is a double negative, is
        quite a mouthful, refers to an implementation detail that many
        users may not be familiar with, and we'd need a negation for it
        which would probably be even more ridiculously long. (But we
        can make --ignore-skip-worktree-bits a deprecated alias for
        --no-restrict.)

  * If a config option is added (sparse.scope?) what should the values and
    description be? "sparse" (behavior A), "worktree-sparse-history-dense"
    (behavior B), "dense" (behavior C)? There's a risk of confusion,
    because even for Behaviors A and B we want some commands to be
    full-tree and others to operate sparsely, so the wording may need to be
    more tied to the usecases and somehow explain that. Also, right now,
    the primary difference we are focusing on is just the history-querying
    commands (log/diff/grep). Previous config suggestion here: [13]

  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
    because in non-cone mode it does nothing and in cone-mode it shows the
    sparse directory entries which are technically outside the sparse
    specification)

  * Under Behavior A:
    * Does ls-files' `--no-expand` override the default `--scope=all`, or
      does it need an extra flag?
    * Does ls-files' `-t` option imply `--scope=all`?
    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?

  * sparse-checkout: once behavior A is fully implemented, should we take
    an interim measure to ease people into switching the default? Namely,
    if folks are not already in a sparse checkout, then require
    `sparse-checkout init/set` to take a
    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
    would set sparse.scope according to the setting given), and throw an
    error if the flag is not provided? That error would be a great place
    to warn folks that the default may change in the future, and get them
    used to specifying what they want so that the eventual default switch
    is seamless for them.


=== Implementation Goals/Plans ===

  * Get buy-in on this document in general.

  * Figure out answers to the 'Implementation Questions' section (above)

  * Fix bugs in the 'Known bugs' section (below)

  * Provide some kind of method for backfilling the blobs within the sparse
    specification in a partial clone

  [Below here is kind of spitballing since the first two haven't been resolved]

  * update-index: flip the default to --no-ignore-skip-worktree-entries,
    nuke this stupid "Oh, there's a bug? Let me add a flag to let users
    request that they not trigger this bug." flag

  * Flags & Config
    * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
    * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
      a deprecated alias for `--scope=all`
    * Create config option (sparse.scope?), tie it to the "Cliff notes"
      overview

  * Add --scope=sparse (and --scope=all) flag to each of the history querying
    commands. IMPORTANT: make sure diff machinery changes don't mess with
    format-patch, fast-export, etc.
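
The "deprecated alias" plan under "Flags & Config" above could be modeled
roughly as follows. This is only a sketch using Python's argparse to
stand in for Git's option parsing; `build_parser` and `resolve_scope` are
hypothetical names, the warning text is invented for illustration, and
the choice to reject `--sparse` combined with `--scope=sparse` is an
assumption rather than a decision recorded in this document.

```python
import argparse
import sys


def build_parser(prog):
    """Build a parser for an add/rm/mv-like command (illustrative only)."""
    parser = argparse.ArgumentParser(prog=prog, allow_abbrev=False)
    parser.add_argument("--scope", choices=["sparse", "all"], default=None)
    # Legacy spelling, kept only as a deprecated alias for --scope=all.
    parser.add_argument("--sparse", action="store_true",
                        help=argparse.SUPPRESS)
    parser.add_argument("paths", nargs="*")
    return parser


def resolve_scope(argv, prog="add", default="sparse"):
    """Return "sparse" or "all" for the given command-line arguments."""
    parser = build_parser(prog)
    args = parser.parse_args(argv)
    if args.sparse:
        if args.scope == "sparse":
            # Assumption: contradictory flags are treated as a usage error.
            parser.error("--sparse conflicts with --scope=sparse")
        print("warning: --sparse is deprecated; use --scope=all",
              file=sys.stderr)
        return "all"
    return args.scope if args.scope is not None else default
```

Keeping the old flag hidden from the help output while still accepting
it is one way to avoid breaking existing scripts during a transition.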

=== Known bugs ===

This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
been working on it.

0. Behavior A is not well supported in Git. (Behavior B didn't use to
   be either, but was the easier of the two to implement.)

1. am and apply:

   apply, without `--index` or `--cached`, relies on files being present
   in the working copy, and also writes to them unconditionally. As
   such, it should first check for the files' presence, and if found to
   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
   its work. Currently, it just throws an error.

   apply, with either `--cached` or `--index`, will not preserve the
   SKIP_WORKTREE bit. This is fine if the file has conflicts, but
   otherwise SKIP_WORKTREE bits should be preserved for --cached and
   probably also for --index.

   am, if there are no conflicts, will vivify files and fail to preserve
   the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
   specified, it will vivify files and then complain the patch doesn't
   apply. If there are conflicts and `-3` is specified, it will vivify
   files and then complain that those vivified files would be
   overwritten by merge.

2. reset --hard:

   reset --hard provides a confusing error message (it works correctly,
   but misleads the user into believing it didn't):

       $ touch addme
       $ git add addme
       $ git ls-files -t
       H addme
       H tracked
       S tracked-but-maybe-skipped
       $ git reset --hard                  # usually works great
       error: Path 'addme' not uptodate; will not remove from working tree.
       HEAD is now at bdbbb6f third
       $ git ls-files -t
       H tracked
       S tracked-but-maybe-skipped
       $ ls -1
       tracked

   `git reset --hard` DID remove addme from the index and the working tree,
   contrary to the error message, but in line with how reset --hard should
   behave.

3. read-tree

   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
   entries it reads into the index, resulting in all your files suddenly
   appearing to be "deleted".

4. Checkout, restore:

   These commands do not handle path & revision arguments appropriately:

       $ ls
       tracked
       $ git ls-files -t
       H tracked
       S tracked-but-maybe-skipped
       $ git status --porcelain
       $ git checkout -- '*skipped'
       error: pathspec '*skipped' did not match any file(s) known to git
       $ git ls-files -- '*skipped'
       tracked-but-maybe-skipped
       $ git checkout HEAD -- '*skipped'
       error: pathspec '*skipped' did not match any file(s) known to git
       $ git ls-tree HEAD | grep skipped
       100644 blob 276f5a64354b791b13840f02047738c77ad0584f    tracked-but-maybe-skipped
       $ git status --porcelain
       $ git checkout HEAD~1 -- '*skipped'
       $ git ls-files -t
       H tracked
       H tracked-but-maybe-skipped
       $ git status --porcelain
       M  tracked-but-maybe-skipped
       $ git checkout HEAD -- '*skipped'
       $ git status --porcelain
       $

   Note that checkout without a revision (or restore --staged) fails to
   find a file to restore from the index, even though ls-files shows
   such a file certainly exists.

   Similar issues occur with HEAD (--source=HEAD in restore's case),
   but it suddenly works when HEAD~1 is specified. And then after that it
   will work with HEAD specified, even though it didn't before.

   Directories are also an issue:

       $ git sparse-checkout set nomatches
       $ git status
       On branch main
       You are in a sparse checkout with 0% of tracked files present.

       nothing to commit, working tree clean
       $ git checkout .
       error: pathspec '.' did not match any file(s) known to git
       $ git checkout HEAD~1 .
       Updated 1 path from 58916d9
       $ git ls-files -t
       S tracked
       H tracked-but-maybe-skipped

5. checkout and restore --staged, continued:

   These commands do not correctly scope operations to the sparse
   specification, and make it worse by not setting important SKIP_WORKTREE
   bits:

       $ git restore --source OLDREV --staged outside-sparse-cone/
       $ git status --porcelain
       MD outside-sparse-cone/file1
       MD outside-sparse-cone/file2
       MD outside-sparse-cone/file3

   We can add a --scope=all mode to `git restore` to let it operate outside
   the sparse specification, but then it will be important to set the
   SKIP_WORKTREE bits appropriately.

6. Performance issues; see:
   https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/


=== Reference Emails ===

Emails that detail various bugs we've had in sparse-checkout:

[1] (Original descriptions of behavior A & behavior B)
    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
[3] (Present-despite-skipped entries)
    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
[4] (Clone --no-checkout interaction)
    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/
[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
[6] (SKIP_WORKTREE is advisory, not mandatory)
    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
[7] (`worktree add` should copy sparsity settings from current worktree)
    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
[8] (Avoid negative surprises in add, rm, and mv)
    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
[9] (Move from out-of-cone to in-cone)
    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
[10] (Unnecessarily downloading objects outside sparse specification)
    https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/

[11] (Stolee's comments on high-level usecases)
    https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

[12] Others commenting on eventually switching default to behavior A:
    https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
    https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
    https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/

[13] Previous config name suggestion and description
    https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/

[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
    https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/

[15] Lengthy email on grep behavior, covering what should be searched:
    https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/

[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
    search for the parenthetical comment starting "We do not check".
    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/