pack-bitmap-write: reimplement bitmap writing

The bitmap generation code works by iterating over the set of commits
for which we plan to write bitmaps, and then for each one performing a
traditional traversal over the reachable commits and trees, filling in
the bitmap. Between two traversals, we can often reuse the previous
bitmap result as long as the first commit is an ancestor of the second.
However, our worst case is that we may end up doing "n" complete
traversals to the root in order to create "n" bitmaps.
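
To make that worst case concrete, here is a rough Python model of the
one-walk-per-commit approach. It is only a sketch and not the C code
being replaced: the function name, the dict-of-parents graph, and the
use of plain sets as stand-ins for bitmaps are all assumptions made for
illustration.

```python
def bitmaps_by_repeated_walks(parents, selected):
    """Build one reachability set per selected commit, one walk per commit.

    parents:  dict mapping each commit to a list of its parents
    selected: commits to build "bitmaps" for, in the order we visit them
    Returns (results, visits), where visits counts commits we had to open.
    """
    results = {}
    visits = 0
    prev = None
    for tip in selected:
        reach = set()            # stand-in for a real bitmap
        stack = [tip]
        while stack:
            commit = stack.pop()
            if commit in reach:
                continue
            if commit == prev:
                # The previous commit is an ancestor: reuse its whole
                # result instead of walking below it again.
                reach |= results[prev]
                continue
            reach.add(commit)
            visits += 1
            stack.extend(parents[commit])
        results[tip] = reach
        prev = tip
    return results, visits


# Two branches sharing only the root: neither tip is an ancestor of the
# other, so the second walk cannot reuse anything from the first.
parents = {"r": [], "a1": ["r"], "a2": ["a1"], "b1": ["r"], "b2": ["b1"]}
results, visits = bitmaps_by_repeated_walks(parents, ["a2", "b2"])
```

In this toy graph the shared root is opened once per walk. With "n"
selected tips on parallel branches over a long shared history, that
repeated work is what turns into "n" full traversals.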

In a real-world case (the shared-storage repo consisting of all GitHub
forks of chromium/chromium), we perform very poorly: generating bitmaps
takes ~3 hours, whereas we can walk the whole object graph in ~3
minutes.

This commit completely rewrites the algorithm, with the goal of
accessing each object only once. It works roughly like this:

  - generate a list of commits in topo-order using a single traversal

  - invert the edges of the graph (so that parents point at their
    children)

  - make one pass in reverse topo-order, generating a bitmap for each
    commit and passing the result along to child nodes

We generate correct results because each node we visit has already had
all of its ancestors added to the bitmap. And we make only two linear
passes over the commits.
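
The three steps above can be sketched in a few lines of Python. This is
a toy model under stated assumptions, not the patch itself: the function
names, the dict-of-parents graph (every parent must also appear as a
key), the `own_objects` mapping, and the use of Python integers as
bitmaps are all hypothetical. It also keeps every bitmap alive, which
the real code has no reason to do.

```python
from collections import defaultdict


def topo_order(parents):
    """List commits so that every child comes before its parents."""
    n_children = {c: 0 for c in parents}
    for ps in parents.values():
        for p in ps:
            n_children[p] += 1
    ready = [c for c, n in n_children.items() if n == 0]
    order = []
    while ready:
        c = ready.pop()
        order.append(c)
        for p in parents[c]:
            n_children[p] -= 1
            if n_children[p] == 0:   # all children of p are emitted
                ready.append(p)
    return order


def invert_edges(parents):
    """Turn child -> parents edges into parent -> children edges."""
    children = defaultdict(list)
    for c, ps in parents.items():
        for p in ps:
            children[p].append(c)
    return children


def build_bitmaps(parents, own_objects):
    """Return (bitmaps, bit): one int bitmap per commit, plus bit positions."""
    bit = {}                         # object name -> bit position

    def mask(names):
        m = 0
        for name in names:
            m |= 1 << bit.setdefault(name, len(bit))
        return m

    order = topo_order(parents)
    children = invert_edges(parents)
    bitmaps = {c: 0 for c in parents}
    for c in reversed(order):        # parents first, then their children
        # By now every parent has already OR'd its result into bitmaps[c].
        bitmaps[c] |= mask([c, *own_objects.get(c, ())])
        for child in children[c]:
            bitmaps[child] |= bitmaps[c]
    return bitmaps, bit


def members(bitmap, bit):
    """Decode a bitmap back into the set of object names it covers."""
    return {name for name, i in bit.items() if (bitmap >> i) & 1}


# A diamond: D merges B and C, which both descend from A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
own = {"A": ["t1"], "B": ["t2"], "C": ["t3"], "D": ["t4"]}
bitmaps, bit = build_bitmaps(parents, own)
```

Each commit is touched once while ordering and once while propagating,
and the merge D ends up with everything from both sides without ever
walking back to A.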

We also visit each tree usually only once. When filling in a bitmap, we
don't bother to recurse into trees whose bit is already set in the
bitmap (since we know we've already done so when setting their bit).
That means that if commit A references tree T, none of its descendants
will need to open T again. I say "usually", though, because it is
possible for a given tree to be mentioned in unrelated parts of history
(e.g., cherry-picking to a parallel branch).
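
Here is a small Python sketch of that tree-skipping rule, including the
"usually" caveat. Everything in it is an assumption for illustration:
`fill_tree`, the `entries` dict (tree name to list of entry names, where
any name that is itself a key is treated as a subtree), and the `opened`
list that records which trees we actually had to read.

```python
def fill_tree(bitmap, tree, entries, bit_of, opened):
    """Set bits for `tree` and everything under it; return the new bitmap."""
    if (bitmap >> bit_of[tree]) & 1:
        # Bit already set: we recursed into this tree when we set it,
        # so everything below it is in the bitmap too.
        return bitmap
    bitmap |= 1 << bit_of[tree]
    opened.append(tree)              # we really had to read this tree
    for name in entries[tree]:
        if name in entries:          # a subtree
            bitmap = fill_tree(bitmap, name, entries, bit_of, opened)
        else:                        # a blob
            bitmap |= 1 << bit_of[name]
    return bitmap


entries = {
    "R1": ["T", "b1"],   # root tree of commit A
    "T": ["b2"],         # subtree shared by all three roots
    "R2": ["T", "b3"],   # root tree of a descendant of A
    "R3": ["T"],         # root tree on an unrelated branch
}
names = ["R1", "R2", "R3", "T", "b1", "b2", "b3"]
bit_of = {name: i for i, name in enumerate(names)}

opened = []
a = fill_tree(0, "R1", entries, bit_of, opened)  # opens R1 and T
b = fill_tree(a, "R2", entries, bit_of, opened)  # inherits a: opens only R2
c = fill_tree(0, "R3", entries, bit_of, opened)  # unrelated: opens T again
```

The descendant starts from its ancestor's bitmap and so never reopens T,
while the unrelated branch starts from an empty bitmap and has to.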

So we've accomplished our goal, and the resulting algorithm is pretty
simple to understand. But there are some downsides, at least with this
initial implementation:

  - we no longer reuse the results of any on-disk bitmaps when
    generating. So we'd expect to sometimes be slower than the original
    when bitmaps already exist. However, this is something we'll be able
    to add back in later.

  - we use much more memory. Instead of keeping one bitmap in memory at
    a time, we're passing them up through the graph. So our memory use
    should scale with the graph width (times the size of a bitmap).

So how does it perform?

For a clone of linux.git, generating bitmaps from scratch with the old
algorithm took 63s. Using this algorithm it takes 205s. Which is much
worse, but _might_ be acceptable if it behaved linearly as the size
grew. It also increases peak heap usage by ~1GB. That's not impossibly
large, but not encouraging.

On the complete fork-network of torvalds/linux, it increases the peak
RAM usage by 40GB. Yikes. (I forgot to record the time it took, but the
memory usage was too much to consider this reasonable anyway.)

On the complete fork-network of chromium/chromium, I ran out of memory
before succeeding. Some back-of-the-envelope calculations indicate it
would need 80+GB to complete.

So at this stage, we've managed to make things much worse. But because
of the way this new algorithm is structured, there are a lot of
opportunities for optimization on top. We'll start implementing those in
the follow-on patches.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 file changed
tree: 239b5ddf52e9c6c57e8a1fe4bd1e65ff57caeb41