sparse-index: improve lstat caching of sparse paths

The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout. The initial implementation would lstat() every single
SKIP_WORKTREE path to see if it existed; if it ran across a sparse
directory that existed (when a sparse index was in use), then it would
expand the index and then check every SKIP_WORKTREE path.

Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist so it
could avoid lstat()ing any files under such directories. However, there
are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing,
even if a higher parent directory also does not exist. This means that
wasted lstat() calls would occur when the paths passed to path_found()
change immediate parent directories but within the same parent directory
that does not exist.

To create an example repository that demonstrates this problem, it helps
to have a directory outside of the sparse-checkout that contains many
deep paths. In particular, the first paths (in lexicographic order)
underneath the sparse directory should have deep directory structures,
maximizing the difference between the old caching algorithm that looks
to a single parent and the new caching algorithm that looks to the
top-most missing directory.

The performance test script p2000-sparse-operations.sh takes the sample
repository and copies its HEAD to several copies nested in directories
of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
will trigger the behavior and also lead to some interesting cases for
the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
not.

This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.

This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist; specifically this
means that that directory's parent _does_ exist. By doing a little extra
work on a path passed to path_found(), we can short-circuit all of the
paths passed to path_found() afterwards that match a prefix with that
non-existing directory. When in a repository where the first sparse file
is likely to have a much deeper path than the first non-existing
directory, this can realize significant gains.

The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.

It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.

Contributor Elijah Newren created a publicly-available test repository
that demonstrates the difference in these caching algorithms in the most
extreme way. To test, follow these steps:

  git clone --sparse https://github.com/newren/gvfs-like-git-bomb
  cd gvfs-like-git-bomb
  ./runme.sh                   # NOTE: check scripts before running!

At this point, assuming you do not have index.sparse=true set globally,
the index has one million paths with the SKIP_WORKTREE bit and they will
all be sent to path_found() in the sparse loop. You can measure this by
running 'git status' with GIT_TRACE2_PERF=1:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   200,000
   sparse_lstat_count (after):         2

And here are the performance numbers:

  Benchmark 1: old
    Time (mean ± σ):     397.5 ms ±   4.1 ms
    Range (min … max):   391.2 ms … 404.8 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     252.7 ms ±   3.1 ms
    Range (min … max):   249.4 ms … 259.5 ms    11 runs

  Summary
    'new' ran
      1.57 ± 0.02 times faster than 'old'

By modifying this example further, we can demonstrate a more realistic
example and include the sparse index expansion. Continue by creating
this directory, confusing both caching algorithms somewhat:

  mkdir -p bomb/d/e/f/a/a

Then re-run the 'git status' tests to see these statistics:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   724,010
   sparse_lstat_count (after):       106

  Benchmark 1: old
    Time (mean ± σ):     753.0 ms ±   3.5 ms
    Range (min … max):   749.7 ms … 760.9 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     201.4 ms ±   3.2 ms
    Range (min … max):   196.0 ms … 207.9 ms    14 runs

  Summary
    'new' ran
      3.74 ± 0.06 times faster than 'old'

Note that if this repository had a sparse index enabled, the additional
cost of expanding the sparse index affects the total time of these
commands by over four seconds, significantly diminishing the benefit of
the caching algorithm. Having existing paths outside of the
sparse-checkout is a known performance issue for the sparse index and is
a known trade-off for the performance benefits given when no such paths
exist.

Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the sparse index contains
~190,000 entries (including over two thousand sparse trees), I was able
to measure these lstat counts when one sparse directory actually exists
on disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before): 1,188,161
       full_lstat_count  (after):     4,404

This resulted in this absolute time change, on a warm disk:

      Time in full loop (before): 13.481 s
      Time in full loop  (after):  0.081 s

(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 file changed
tree: 102b74676e2de487f1ef74e504b7ea436983b9a0
  1. .github/
  2. block-sha1/
  3. builtin/
  4. ci/
  5. compat/
  6. contrib/
  7. Documentation/
  8. ewah/
  9. git-gui/
  10. gitk-git/
  11. gitweb/
  12. mergetools/
  13. negotiator/
  14. oss-fuzz/
  15. perl/
  16. po/
  17. refs/
  18. reftable/
  19. sha1/
  20. sha1dc/
  21. sha256/
  22. t/
  23. templates/
  24. trace2/
  25. xdiff/
  26. .cirrus.yml
  27. .clang-format
  28. .editorconfig
  29. .gitattributes
  30. .gitignore
  31. .gitlab-ci.yml
  32. .gitmodules
  33. .mailmap
  34. .tsan-suppressions
  35. abspath.c
  36. abspath.h
  37. aclocal.m4
  38. add-interactive.c
  39. add-interactive.h
  40. add-patch.c
  41. advice.c
  42. advice.h
  43. alias.c
  44. alias.h
  45. alloc.c
  46. alloc.h
  47. apply.c
  48. apply.h
  49. archive-tar.c
  50. archive-zip.c
  51. archive.c
  52. archive.h
  53. attr.c
  54. attr.h
  55. banned.h
  56. base85.c
  57. base85.h
  58. bisect.c
  59. bisect.h
  60. blame.c
  61. blame.h
  62. blob.c
  63. blob.h
  64. bloom.c
  65. bloom.h
  66. branch.c
  67. branch.h
  68. builtin.h
  69. bulk-checkin.c
  70. bulk-checkin.h
  71. bundle-uri.c
  72. bundle-uri.h
  73. bundle.c
  74. bundle.h
  75. cache-tree.c
  76. cache-tree.h
  77. cbtree.c
  78. cbtree.h
  79. chdir-notify.c
  80. chdir-notify.h
  81. check-builtins.sh
  82. checkout.c
  83. checkout.h
  84. chunk-format.c
  85. chunk-format.h
  86. CODE_OF_CONDUCT.md
  87. color.c
  88. color.h
  89. column.c
  90. column.h
  91. combine-diff.c
  92. command-list.txt
  93. commit-graph.c
  94. commit-graph.h
  95. commit-reach.c
  96. commit-reach.h
  97. commit-slab-decl.h
  98. commit-slab-impl.h
  99. commit-slab.h
  100. commit.c
  101. commit.h
  102. common-main.c
  103. config.c
  104. config.h
  105. config.mak.dev
  106. config.mak.in
  107. config.mak.uname
  108. configure.ac
  109. connect.c
  110. connect.h
  111. connected.c
  112. connected.h
  113. convert.c
  114. convert.h
  115. copy.c
  116. copy.h
  117. COPYING
  118. credential.c
  119. credential.h
  120. csum-file.c
  121. csum-file.h
  122. ctype.c
  123. daemon.c
  124. date.c
  125. date.h
  126. decorate.c
  127. decorate.h
  128. delta-islands.c
  129. delta-islands.h
  130. delta.h
  131. detect-compiler
  132. diagnose.c
  133. diagnose.h
  134. diff-delta.c
  135. diff-lib.c
  136. diff-merges.c
  137. diff-merges.h
  138. diff-no-index.c
  139. diff.c
  140. diff.h
  141. diffcore-break.c
  142. diffcore-delta.c
  143. diffcore-order.c
  144. diffcore-pickaxe.c
  145. diffcore-rename.c
  146. diffcore-rotate.c
  147. diffcore.h
  148. dir-iterator.c
  149. dir-iterator.h
  150. dir.c
  151. dir.h
  152. editor.c
  153. editor.h
  154. entry.c
  155. entry.h
  156. environment.c
  157. environment.h
  158. exec-cmd.c
  159. exec-cmd.h
  160. fetch-negotiator.c
  161. fetch-negotiator.h
  162. fetch-pack.c
  163. fetch-pack.h
  164. fmt-merge-msg.c
  165. fmt-merge-msg.h
  166. fsck.c
  167. fsck.h
  168. fsmonitor--daemon.h
  169. fsmonitor-ipc.c
  170. fsmonitor-ipc.h
  171. fsmonitor-ll.h
  172. fsmonitor-path-utils.h
  173. fsmonitor-settings.c
  174. fsmonitor-settings.h
  175. fsmonitor.c
  176. fsmonitor.h
  177. generate-cmdlist.sh
  178. generate-configlist.sh
  179. generate-hooklist.sh
  180. gettext.c
  181. gettext.h
  182. git-archimport.perl
  183. git-compat-util.h
  184. git-curl-compat.h
  185. git-cvsexportcommit.perl
  186. git-cvsimport.perl
  187. git-cvsserver.perl
  188. git-difftool--helper.sh
  189. git-filter-branch.sh
  190. git-instaweb.sh
  191. git-merge-octopus.sh
  192. git-merge-one-file.sh
  193. git-merge-resolve.sh
  194. git-mergetool--lib.sh
  195. git-mergetool.sh
  196. git-p4.py
  197. git-quiltimport.sh
  198. git-request-pull.sh
  199. git-send-email.perl
  200. git-sh-i18n.sh
  201. git-sh-setup.sh
  202. git-submodule.sh
  203. git-svn.perl
  204. GIT-VERSION-GEN
  205. git-web--browse.sh
  206. git-zlib.c
  207. git-zlib.h
  208. git.c
  209. git.rc
  210. gpg-interface.c
  211. gpg-interface.h
  212. graph.c
  213. graph.h
  214. grep.c
  215. grep.h
  216. hash-ll.h
  217. hash-lookup.c
  218. hash-lookup.h
  219. hash.h
  220. hashmap.c
  221. hashmap.h
  222. help.c
  223. help.h
  224. hex-ll.c
  225. hex-ll.h
  226. hex.c
  227. hex.h
  228. hook.c
  229. hook.h
  230. http-backend.c
  231. http-fetch.c
  232. http-push.c
  233. http-walker.c
  234. http.c
  235. http.h
  236. ident.c
  237. ident.h
  238. imap-send.c
  239. INSTALL
  240. iterator.h
  241. json-writer.c
  242. json-writer.h
  243. khash.h
  244. kwset.c
  245. kwset.h
  246. levenshtein.c
  247. levenshtein.h
  248. LGPL-2.1
  249. line-log.c
  250. line-log.h
  251. line-range.c
  252. line-range.h
  253. linear-assignment.c
  254. linear-assignment.h
  255. list-objects-filter-options.c
  256. list-objects-filter-options.h
  257. list-objects-filter.c
  258. list-objects-filter.h
  259. list-objects.c
  260. list-objects.h
  261. list.h
  262. lockfile.c
  263. lockfile.h
  264. log-tree.c
  265. log-tree.h
  266. loose.c
  267. loose.h
  268. ls-refs.c
  269. ls-refs.h
  270. mailinfo.c
  271. mailinfo.h
  272. mailmap.c
  273. mailmap.h
  274. Makefile
  275. match-trees.c
  276. match-trees.h
  277. mem-pool.c
  278. mem-pool.h
  279. merge-blobs.c
  280. merge-blobs.h
  281. merge-ll.c
  282. merge-ll.h
  283. merge-ort-wrappers.c
  284. merge-ort-wrappers.h
  285. merge-ort.c
  286. merge-ort.h
  287. merge-recursive.c
  288. merge-recursive.h
  289. merge.c
  290. merge.h
  291. mergesort.h
  292. midx-write.c
  293. midx.c
  294. midx.h
  295. name-hash.c
  296. name-hash.h
  297. notes-cache.c
  298. notes-cache.h
  299. notes-merge.c
  300. notes-merge.h
  301. notes-utils.c
  302. notes-utils.h
  303. notes.c
  304. notes.h
  305. object-file-convert.c
  306. object-file-convert.h
  307. object-file.c
  308. object-file.h
  309. object-name.c
  310. object-name.h
  311. object-store-ll.h
  312. object-store.h
  313. object.c
  314. object.h
  315. oid-array.c
  316. oid-array.h
  317. oidmap.c
  318. oidmap.h
  319. oidset.c
  320. oidset.h
  321. oidtree.c
  322. oidtree.h
  323. pack-bitmap-write.c
  324. pack-bitmap.c
  325. pack-bitmap.h
  326. pack-check.c
  327. pack-mtimes.c
  328. pack-mtimes.h
  329. pack-objects.c
  330. pack-objects.h
  331. pack-revindex.c
  332. pack-revindex.h
  333. pack-write.c
  334. pack.h
  335. packfile.c
  336. packfile.h
  337. pager.c
  338. pager.h
  339. parallel-checkout.c
  340. parallel-checkout.h
  341. parse-options-cb.c
  342. parse-options.c
  343. parse-options.h
  344. parse.c
  345. parse.h
  346. patch-delta.c
  347. patch-ids.c
  348. patch-ids.h
  349. path.c
  350. path.h
  351. pathspec.c
  352. pathspec.h
  353. pkt-line.c
  354. pkt-line.h
  355. preload-index.c
  356. preload-index.h
  357. pretty.c
  358. pretty.h
  359. prio-queue.c
  360. prio-queue.h
  361. progress.c
  362. progress.h
  363. promisor-remote.c
  364. promisor-remote.h
  365. prompt.c
  366. prompt.h
  367. protocol-caps.c
  368. protocol-caps.h
  369. protocol.c
  370. protocol.h
  371. prune-packed.c
  372. prune-packed.h
  373. quote.c
  374. quote.h
  375. range-diff.c
  376. range-diff.h
  377. reachable.c
  378. reachable.h
  379. read-cache-ll.h
  380. read-cache.c
  381. read-cache.h
  382. README.md
  383. rebase-interactive.c
  384. rebase-interactive.h
  385. rebase.c
  386. rebase.h
  387. ref-filter.c
  388. ref-filter.h
  389. reflog-walk.c
  390. reflog-walk.h
  391. reflog.c
  392. reflog.h
  393. refs.c
  394. refs.h
  395. refspec.c
  396. refspec.h
  397. remote-curl.c
  398. remote.c
  399. remote.h
  400. replace-object.c
  401. replace-object.h
  402. repo-settings.c
  403. repository.c
  404. repository.h
  405. rerere.c
  406. rerere.h
  407. reset.c
  408. reset.h
  409. resolve-undo.c
  410. resolve-undo.h
  411. revision.c
  412. revision.h
  413. run-command.c
  414. run-command.h
  415. sane-ctype.h
  416. scalar.c
  417. SECURITY.md
  418. send-pack.c
  419. send-pack.h
  420. sequencer.c
  421. sequencer.h
  422. serve.c
  423. serve.h
  424. server-info.c
  425. server-info.h
  426. setup.c
  427. setup.h
  428. sh-i18n--envsubst.c
  429. sha1dc_git.c
  430. sha1dc_git.h
  431. shallow.c
  432. shallow.h
  433. shared.mak
  434. shell.c
  435. shortlog.h
  436. sideband.c
  437. sideband.h
  438. sigchain.c
  439. sigchain.h
  440. simple-ipc.h
  441. sparse-index.c
  442. sparse-index.h
  443. split-index.c
  444. split-index.h
  445. stable-qsort.c
  446. statinfo.c
  447. statinfo.h
  448. strbuf.c
  449. strbuf.h
  450. streaming.c
  451. streaming.h
  452. string-list.c
  453. string-list.h
  454. strmap.c
  455. strmap.h
  456. strvec.c
  457. strvec.h
  458. sub-process.c
  459. sub-process.h
  460. submodule-config.c
  461. submodule-config.h
  462. submodule.c
  463. submodule.h
  464. symlinks.c
  465. symlinks.h
  466. tag.c
  467. tag.h
  468. tar.h
  469. tempfile.c
  470. tempfile.h
  471. thread-utils.c
  472. thread-utils.h
  473. tmp-objdir.c
  474. tmp-objdir.h
  475. trace.c
  476. trace.h
  477. trace2.c
  478. trace2.h
  479. trailer.c
  480. trailer.h
  481. transport-helper.c
  482. transport-internal.h
  483. transport.c
  484. transport.h
  485. tree-diff.c
  486. tree-walk.c
  487. tree-walk.h
  488. tree.c
  489. tree.h
  490. unicode-width.h
  491. unimplemented.sh
  492. unix-socket.c
  493. unix-socket.h
  494. unix-stream-server.c
  495. unix-stream-server.h
  496. unpack-trees.c
  497. unpack-trees.h
  498. upload-pack.c
  499. upload-pack.h
  500. url.c
  501. url.h
  502. urlmatch.c
  503. urlmatch.h
  504. usage.c
  505. userdiff.c
  506. userdiff.h
  507. utf8.c
  508. utf8.h
  509. varint.c
  510. varint.h
  511. version.c
  512. version.h
  513. versioncmp.c
  514. versioncmp.h
  515. walker.c
  516. walker.h
  517. wildmatch.c
  518. wildmatch.h
  519. worktree.c
  520. worktree.h
  521. wrap-for-bin.sh
  522. wrapper.c
  523. wrapper.h
  524. write-or-die.c
  525. write-or-die.h
  526. ws.c
  527. ws.h
  528. wt-status.c
  529. wt-status.h
  530. xdiff-interface.c
  531. xdiff-interface.h
README.md

Build status

Git - fast, scalable, distributed revision control system

Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.

Git is an Open Source project covered by the GNU General Public License version 2 (some parts of it are under different licenses, compatible with the GPLv2). It was originally written by Linus Torvalds with help of a group of hackers around the net.

Please read the file INSTALL for installation instructions.

Many Git online resources are accessible from https://git-scm.com/ including full documentation and Git related tools.

See Documentation/gittutorial.txt to get started, then see Documentation/giteveryday.txt for a useful minimum set of commands, and Documentation/git-<commandname>.txt for documentation of each command. If git has been correctly installed, then the tutorial can also be read with man gittutorial or git help tutorial, and the documentation of each command with man git-<commandname> or git help <commandname>.

CVS users may also want to read Documentation/gitcvs-migration.txt (man gitcvs-migration or git help cvs-migration if git is installed).

The user discussion and development of Git take place on the Git mailing list -- everyone is welcome to post bug reports, feature requests, comments and patches to git@vger.kernel.org (read Documentation/SubmittingPatches for instructions on patch submission and Documentation/CodingGuidelines).

Those wishing to help with error message, usage and informational message string translations (localization l10) should see po/README.md (a po file is a Portable Object file that holds the translations).

To subscribe to the list, send an email to git+subscribe@vger.kernel.org (see https://subspace.kernel.org/subscribing.html for details). The mailing list archives are available at https://lore.kernel.org/git/, https://marc.info/?l=git and other archival sites.

Issues which are security relevant should be disclosed privately to the Git Security mailing list git-security@googlegroups.com.

The maintainer frequently sends the “What's cooking” reports that list the current status of various development topics to the mailing list. The discussion following them give a good reference for project status, development direction and remaining tasks.

The name “git” was given by Linus Torvalds when he wrote the very first version. He described the tool as “the stupid content tracker” and the name as (depending on your mood):

  • random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of “get” may or may not be relevant.
  • stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
  • “global information tracker”: you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
  • “goddamn idiotic truckload of sh*t”: when it breaks