pack-objects: reuse on-disk deltas for thin "have" objects

When we serve a fetch, we pass the "wants" and "haves" from
the fetch negotiation to pack-objects. That tells us not
only which objects we need to send, but we also use the
boundary commits as "preferred bases": their trees and blobs
are candidates for delta bases, both for reusing on-disk
deltas and for finding new ones.

However, this misses some opportunities. Modulo some special
cases like shallow or partial clones, we know that every
object reachable from the "haves" could be a preferred base.
We don't use all of them for two reasons:

  1. It's expensive to traverse the whole history and
     enumerate all of the objects the other side has.

  2. The delta search is expensive, so we want to keep the
     number of candidate bases sane. The boundary commits
     are the most likely to work.

When we have reachability bitmaps, though, reason 1 no
longer applies. We can efficiently compute the set of
reachable objects on the other side (and in fact already did
so as part of the bitmap set-difference to get the list of
interesting objects). And using this set conveniently
covers the shallow and partial cases, since we have to
disable the use of bitmaps for those anyway.

The second reason argues against using these bases in the
search for new deltas. But there's one case where we can use
this information for free: when we have an existing on-disk
delta that we're considering reusing, we can do so if we
know the other side has the base object. This in fact saves
time during the delta search, because it's one less delta we
have to compute.

And that's exactly what this patch does: when we're
considering whether to reuse an on-disk delta, if bitmaps
tell us the other side has the object (and we're making a
thin-pack), then we reuse it.

Here are the results on p5311 using linux.git, which
simulates a client fetching after `N` days since their last
fetch:

 Test                         origin              HEAD
 --------------------------------------------------------------------------
 5311.3: server   (1 days)    0.27(0.27+0.04)     0.12(0.09+0.03) -55.6%
 5311.4: size     (1 days)               0.9M              237.0K -73.7%
 5311.5: client   (1 days)    0.04(0.05+0.00)     0.10(0.10+0.00) +150.0%
 5311.7: server   (2 days)    0.34(0.42+0.04)     0.13(0.10+0.03) -61.8%
 5311.8: size     (2 days)               1.5M              347.7K -76.5%
 5311.9: client   (2 days)    0.07(0.08+0.00)     0.16(0.15+0.01) +128.6%
 5311.11: server   (4 days)   0.56(0.77+0.08)     0.13(0.10+0.02) -76.8%
 5311.12: size     (4 days)              2.8M              566.6K -79.8%
 5311.13: client   (4 days)   0.13(0.15+0.00)     0.34(0.31+0.02) +161.5%
 5311.15: server   (8 days)   0.97(1.39+0.11)     0.30(0.25+0.05) -69.1%
 5311.16: size     (8 days)              4.3M                1.0M -76.0%
 5311.17: client   (8 days)   0.20(0.22+0.01)     0.53(0.52+0.01) +165.0%
 5311.19: server  (16 days)   1.52(2.51+0.12)     0.30(0.26+0.03) -80.3%
 5311.20: size    (16 days)              8.0M                2.0M -74.5%
 5311.21: client  (16 days)   0.40(0.47+0.03)     1.01(0.98+0.04) +152.5%
 5311.23: server  (32 days)   2.40(4.44+0.20)     0.31(0.26+0.04) -87.1%
 5311.24: size    (32 days)             14.1M                4.1M -70.9%
 5311.25: client  (32 days)   0.70(0.90+0.03)     1.81(1.75+0.06) +158.6%
 5311.27: server  (64 days)   11.76(26.57+0.29)   0.55(0.50+0.08) -95.3%
 5311.28: size    (64 days)             89.4M               47.4M -47.0%
 5311.29: client  (64 days)   5.71(9.31+0.27)     15.20(15.20+0.32) +166.2%
 5311.31: server (128 days)   16.15(36.87+0.40)   0.91(0.82+0.14) -94.4%
 5311.32: size   (128 days)            134.8M              100.4M -25.5%
 5311.33: client (128 days)   9.42(16.86+0.49)    25.34(25.80+0.46) +169.0%

In all cases we save CPU time on the server (sometimes
significant) and the resulting pack is smaller. We do spend
more CPU time on the client side, because it has to
reconstruct more deltas. But that's the right tradeoff to
make, since clients tend to outnumber servers. It just means
the thin pack mechanism is doing its job.

From the user's perspective, the end-to-end time of the
operation will generally be faster. E.g., in the 128-day
case, we saved 15s on the server at a cost of 16s on the
client. Since the resulting pack is 34MB smaller, this is a
net win if the network speed is less than 270Mbit/s. And
that's actually the worst case. The 64-day case saves just
over 11s at a cost of just under 11s. So it's a slight win
at any network speed, and the 40MB saved is pure bonus. That
trend continues for the smaller fetches.

The implementation itself is mostly straightforward, with
the new logic going into check_object(). But there are two
tricky bits.

The first is that check_object() needs access to the
relevant information (the thin flag and bitmap result). We
can do this by pushing these into program-lifetime globals.

The second is that the rest of the code assumes that any
reused delta will point to another "struct object_entry" as
its base. But of course the case we are interested in here
is the one where don't have such an entry!

I looked at a number of options that didn't quite work:

 - we could use a flag to signal a reused delta, but it's
   not a single bit. We have to actually store the oid of
   the base, which is normally done by pointing to the
   existing object_entry. And we'd have to modify all the
   code which looks at deltas.

 - we could add the reused bases to the end of the existing
   object_entry array. While this does create some extra
   work as later stages consider the extra entries, it's
   actually not too bad (we're not sending them, so they
   don't cost much in the delta search, and at most we'd
   have 2*N of them).

   But there's a more subtle problem. Adding to the existing
   array means we might need to grow it with realloc, which
   could move the earlier entries around. While many of the
   references to other entries are done by integer index,
   some (including ones on the stack) use pointers, which
   would become invalidated.

   This isn't insurmountable, but it would require quite a
   bit of refactoring (and it's hard to know that you've got
   it all, since it may work _most_ of the time and then
   fail subtly based on memory allocation patterns).

 - we could allocate a new one-off entry for the base. In
   fact, this is what an earlier version of this patch did.
   However, since the refactoring brought in by ad635e82d6
   (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23),
   the delta_idx code requires that both entries be in the
   main packing list.

So taking all of those options into account, what I ended up
with is a separate list of "external bases" that are not
part of the main packing list. Each delta entry that points
to an external base has a single-bit flag to do so; we have a
little breathing room in the bitfield section of
object_entry.

This lets us limit the change primarily to the oe_delta()
and oe_set_delta_ext() functions. And as a bonus, most of
the rest of the code does not consider these dummy entries
at all, saving both runtime CPU and code complexity.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
3 files changed
tree: 17612452b57edd80c5822deed9e724252a2d6efb
  1. .github/
  2. block-sha1/
  3. builtin/
  4. ci/
  5. compat/
  6. contrib/
  7. Documentation/
  8. ewah/
  9. git-gui/
  10. gitk-git/
  11. gitweb/
  12. mergetools/
  13. negotiator/
  14. perl/
  15. po/
  16. ppc/
  17. refs/
  18. sha1dc/
  19. t/
  20. templates/
  21. vcs-svn/
  22. xdiff/
  23. .clang-format
  24. .gitattributes
  25. .gitignore
  26. .gitmodules
  27. .mailmap
  28. .travis.yml
  29. .tsan-suppressions
  30. abspath.c
  31. aclocal.m4
  32. advice.c
  33. advice.h
  34. alias.c
  35. alias.h
  36. alloc.c
  37. alloc.h
  38. apply.c
  39. apply.h
  40. archive-tar.c
  41. archive-zip.c
  42. archive.c
  43. archive.h
  44. argv-array.c
  45. argv-array.h
  46. attr.c
  47. attr.h
  48. banned.h
  49. base85.c
  50. bisect.c
  51. bisect.h
  52. blame.c
  53. blame.h
  54. blob.c
  55. blob.h
  56. branch.c
  57. branch.h
  58. builtin.h
  59. bulk-checkin.c
  60. bulk-checkin.h
  61. bundle.c
  62. bundle.h
  63. cache-tree.c
  64. cache-tree.h
  65. cache.h
  66. chdir-notify.c
  67. chdir-notify.h
  68. check-builtins.sh
  69. check-racy.c
  70. check_bindir
  71. checkout.c
  72. checkout.h
  73. color.c
  74. color.h
  75. column.c
  76. column.h
  77. combine-diff.c
  78. command-list.txt
  79. commit-graph.c
  80. commit-graph.h
  81. commit-slab-decl.h
  82. commit-slab-impl.h
  83. commit-slab.h
  84. commit.c
  85. commit.h
  86. common-main.c
  87. config.c
  88. config.h
  89. config.mak.dev
  90. config.mak.in
  91. config.mak.uname
  92. configure.ac
  93. connect.c
  94. connect.h
  95. connected.c
  96. connected.h
  97. convert.c
  98. convert.h
  99. copy.c
  100. COPYING
  101. credential-cache--daemon.c
  102. credential-cache.c
  103. credential-store.c
  104. credential.c
  105. credential.h
  106. csum-file.c
  107. csum-file.h
  108. ctype.c
  109. daemon.c
  110. date.c
  111. decorate.c
  112. decorate.h
  113. delta.h
  114. detect-compiler
  115. diff-delta.c
  116. diff-lib.c
  117. diff-no-index.c
  118. diff.c
  119. diff.h
  120. diffcore-break.c
  121. diffcore-delta.c
  122. diffcore-order.c
  123. diffcore-pickaxe.c
  124. diffcore-rename.c
  125. diffcore.h
  126. dir-iterator.c
  127. dir-iterator.h
  128. dir.c
  129. dir.h
  130. editor.c
  131. entry.c
  132. environment.c
  133. exec-cmd.c
  134. exec-cmd.h
  135. fast-import.c
  136. fetch-negotiator.c
  137. fetch-negotiator.h
  138. fetch-object.c
  139. fetch-object.h
  140. fetch-pack.c
  141. fetch-pack.h
  142. fmt-merge-msg.h
  143. fsck.c
  144. fsck.h
  145. fsmonitor.c
  146. fsmonitor.h
  147. generate-cmdlist.sh
  148. gettext.c
  149. gettext.h
  150. git-add--interactive.perl
  151. git-archimport.perl
  152. git-bisect.sh
  153. git-compat-util.h
  154. git-cvsexportcommit.perl
  155. git-cvsimport.perl
  156. git-cvsserver.perl
  157. git-difftool--helper.sh
  158. git-filter-branch.sh
  159. git-instaweb.sh
  160. git-merge-octopus.sh
  161. git-merge-one-file.sh
  162. git-merge-resolve.sh
  163. git-mergetool--lib.sh
  164. git-mergetool.sh
  165. git-p4.py
  166. git-parse-remote.sh
  167. git-quiltimport.sh
  168. git-rebase--am.sh
  169. git-rebase--interactive.sh
  170. git-rebase--merge.sh
  171. git-rebase--preserve-merges.sh
  172. git-rebase.sh
  173. git-remote-testgit.sh
  174. git-request-pull.sh
  175. git-send-email.perl
  176. git-sh-i18n.sh
  177. git-sh-setup.sh
  178. git-stash.sh
  179. git-submodule.sh
  180. git-svn.perl
  181. GIT-VERSION-GEN
  182. git-web--browse.sh
  183. git.c
  184. git.rc
  185. gpg-interface.c
  186. gpg-interface.h
  187. graph.c
  188. graph.h
  189. grep.c
  190. grep.h
  191. hash.h
  192. hashmap.c
  193. hashmap.h
  194. help.c
  195. help.h
  196. hex.c
  197. http-backend.c
  198. http-fetch.c
  199. http-push.c
  200. http-walker.c
  201. http.c
  202. http.h
  203. ident.c
  204. imap-send.c
  205. INSTALL
  206. iterator.h
  207. json-writer.c
  208. json-writer.h
  209. khash.h
  210. kwset.c
  211. kwset.h
  212. levenshtein.c
  213. levenshtein.h
  214. LGPL-2.1
  215. line-log.c
  216. line-log.h
  217. line-range.c
  218. line-range.h
  219. linear-assignment.c
  220. linear-assignment.h
  221. list-objects-filter-options.c
  222. list-objects-filter-options.h
  223. list-objects-filter.c
  224. list-objects-filter.h
  225. list-objects.c
  226. list-objects.h
  227. list.h
  228. ll-merge.c
  229. ll-merge.h
  230. lockfile.c
  231. lockfile.h
  232. log-tree.c
  233. log-tree.h
  234. ls-refs.c
  235. ls-refs.h
  236. mailinfo.c
  237. mailinfo.h
  238. mailmap.c
  239. mailmap.h
  240. Makefile
  241. match-trees.c
  242. mem-pool.c
  243. mem-pool.h
  244. merge-blobs.c
  245. merge-blobs.h
  246. merge-recursive.c
  247. merge-recursive.h
  248. merge.c
  249. mergesort.c
  250. mergesort.h
  251. name-hash.c
  252. notes-cache.c
  253. notes-cache.h
  254. notes-merge.c
  255. notes-merge.h
  256. notes-utils.c
  257. notes-utils.h
  258. notes.c
  259. notes.h
  260. object-store.h
  261. object.c
  262. object.h
  263. oidmap.c
  264. oidmap.h
  265. oidset.c
  266. oidset.h
  267. pack-bitmap-write.c
  268. pack-bitmap.c
  269. pack-bitmap.h
  270. pack-check.c
  271. pack-objects.c
  272. pack-objects.h
  273. pack-revindex.c
  274. pack-revindex.h
  275. pack-write.c
  276. pack.h
  277. packfile.c
  278. packfile.h
  279. pager.c
  280. parse-options-cb.c
  281. parse-options.c
  282. parse-options.h
  283. patch-delta.c
  284. patch-ids.c
  285. patch-ids.h
  286. path.c
  287. path.h
  288. pathspec.c
  289. pathspec.h
  290. pkt-line.c
  291. pkt-line.h
  292. preload-index.c
  293. pretty.c
  294. pretty.h
  295. prio-queue.c
  296. prio-queue.h
  297. progress.c
  298. progress.h
  299. prompt.c
  300. prompt.h
  301. protocol.c
  302. protocol.h
  303. quote.c
  304. quote.h
  305. range-diff.c
  306. range-diff.h
  307. reachable.c
  308. reachable.h
  309. read-cache.c
  310. README.md
  311. ref-filter.c
  312. ref-filter.h
  313. reflog-walk.c
  314. reflog-walk.h
  315. refs.c
  316. refs.h
  317. refspec.c
  318. refspec.h
  319. remote-curl.c
  320. remote-testsvn.c
  321. remote.c
  322. remote.h
  323. replace-object.c
  324. replace-object.h
  325. repository.c
  326. repository.h
  327. rerere.c
  328. rerere.h
  329. resolve-undo.c
  330. resolve-undo.h
  331. revision.c
  332. revision.h
  333. run-command.c
  334. run-command.h
  335. send-pack.c
  336. send-pack.h
  337. sequencer.c
  338. sequencer.h
  339. serve.c
  340. serve.h
  341. server-info.c
  342. setup.c
  343. sh-i18n--envsubst.c
  344. sha1-array.c
  345. sha1-array.h
  346. sha1-file.c
  347. sha1-lookup.c
  348. sha1-lookup.h
  349. sha1-name.c
  350. sha1dc_git.c
  351. sha1dc_git.h
  352. shallow.c
  353. shell.c
  354. shortlog.h
  355. sideband.c
  356. sideband.h
  357. sigchain.c
  358. sigchain.h
  359. split-index.c
  360. split-index.h
  361. strbuf.c
  362. strbuf.h
  363. streaming.c
  364. streaming.h
  365. string-list.c
  366. string-list.h
  367. sub-process.c
  368. sub-process.h
  369. submodule-config.c
  370. submodule-config.h
  371. submodule.c
  372. submodule.h
  373. symlinks.c
  374. tag.c
  375. tag.h
  376. tar.h
  377. tempfile.c
  378. tempfile.h
  379. thread-utils.c
  380. thread-utils.h
  381. tmp-objdir.c
  382. tmp-objdir.h
  383. trace.c
  384. trace.h
  385. trailer.c
  386. trailer.h
  387. transport-helper.c
  388. transport-internal.h
  389. transport.c
  390. transport.h
  391. tree-diff.c
  392. tree-walk.c
  393. tree-walk.h
  394. tree.c
  395. tree.h
  396. unicode-width.h
  397. unimplemented.sh
  398. unix-socket.c
  399. unix-socket.h
  400. unpack-trees.c
  401. unpack-trees.h
  402. upload-pack.c
  403. upload-pack.h
  404. url.c
  405. url.h
  406. urlmatch.c
  407. urlmatch.h
  408. usage.c
  409. userdiff.c
  410. userdiff.h
  411. utf8.c
  412. utf8.h
  413. varint.c
  414. varint.h
  415. version.c
  416. version.h
  417. versioncmp.c
  418. walker.c
  419. walker.h
  420. wildmatch.c
  421. wildmatch.h
  422. worktree.c
  423. worktree.h
  424. wrap-for-bin.sh
  425. wrapper.c
  426. write-or-die.c
  427. ws.c
  428. wt-status.c
  429. wt-status.h
  430. xdiff-interface.c
  431. xdiff-interface.h
  432. zlib.c
README.md

Git - fast, scalable, distributed revision control system

Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.

Git is an Open Source project covered by the GNU General Public License version 2 (some parts of it are under different licenses, compatible with the GPLv2). It was originally written by Linus Torvalds with help of a group of hackers around the net.

Please read the file INSTALL for installation instructions.

Many Git online resources are accessible from https://git-scm.com/ including full documentation and Git related tools.

See Documentation/gittutorial.txt to get started, then see Documentation/giteveryday.txt for a useful minimum set of commands, and Documentation/git-.txt for documentation of each command. If git has been correctly installed, then the tutorial can also be read with man gittutorial or git help tutorial, and the documentation of each command with man git-<commandname> or git help <commandname>.

CVS users may also want to read Documentation/gitcvs-migration.txt (man gitcvs-migration or git help cvs-migration if git is installed).

The user discussion and development of Git take place on the Git mailing list -- everyone is welcome to post bug reports, feature requests, comments and patches to git@vger.kernel.org (read Documentation/SubmittingPatches for instructions on patch submission). To subscribe to the list, send an email with just “subscribe git” in the body to majordomo@vger.kernel.org. The mailing list archives are available at https://public-inbox.org/git/, http://marc.info/?l=git and other archival sites.

Issues which are security relevant should be disclosed privately to the Git Security mailing list git-security@googlegroups.com.

The maintainer frequently sends the “What's cooking” reports that list the current status of various development topics to the mailing list. The discussion following them give a good reference for project status, development direction and remaining tasks.

The name “git” was given by Linus Torvalds when he wrote the very first version. He described the tool as “the stupid content tracker” and the name as (depending on your mood):

  • random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of “get” may or may not be relevant.
  • stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
  • “global information tracker”: you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
  • “goddamn idiotic truckload of sh*t”: when it breaks