When I run git gc or git repack over my Git repository, it outputs a "Total" line once it's done. What do these numbers mean?
A couple of examples from a fairly small repository:
$ git gc
...
Total 576 (delta 315), reused 576 (delta 315)
$ git repack -afd --depth=250 --window=250
...
Total 576 (delta 334), reused 242 (delta 0)
And one from a much larger repository:
$ git gc
...
Total 347629 (delta 289610), reused 342219 (delta 285060)
...
I can guess what that first "Total" number is: the number of Git objects (so commits, trees and files) in the repository. What do all the others actually mean?
I've already looked at the git-gc(1) and git-repack(1) man pages, and perused their "See also"s, too, and my attempts at Googling have only produced irrelevant results.
I did some work with dulwich, a pure-Python implementation of Git. What I am about to say reflects my experience with dulwich's implementation, not the canonical Git source, so there may be differences.
Git is remarkably simple - I mean, so simple it confounds! The name is really appropriate to its design which is very clever due to its stupidity.
When you commit anything, Git takes what's in the index (staging area) and stores it as objects named by their SHA digest: each file's content becomes a blob object, each directory becomes a tree object listing its blobs and subtrees, and all of that gets bound into a commit object, which has a SHA of its own. Git just fires these straight into the file system under .git/objects as it processes the commit. If it succeeds at writing all of them, it simply writes the SHA of the new commit object into the current branch's file under .git/refs/heads/.
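To make the "SHAed" part concrete, here is a minimal sketch of how an object name can be computed. It assumes a classic SHA-1 repository (not the newer SHA-256 object format), and the function name git_object_id is just an illustrative choice, not part of Git or dulwich.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the hex object name Git would give to this content.

    The name is the SHA-1 of a small header ("<type> <size>\\0")
    followed by the raw content. Assumes a SHA-1 repository.
    """
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always produce the same name, which is why identical
# files in different directories or commits are stored only once.
print(git_object_id("blob", b"hello world\n"))
```

You can compare the result with what `git hash-object` prints for a file with the same content.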
From time to time a commit may fail halfway through. If something fails to write into .git/objects, Git does no cleanup at that time. That's because you'll usually fix the problem and redo the commit - in which case Git will pick up exactly where it previously halted, i.e. halfway through the commit.
Here's where git gc comes in. It simply walks through all objects in .git/objects, marking off all those which are referred to in some way by HEAD or a branch. Anything remaining is orphaned and has nothing to do with anything "important", so it can be deleted. This is why, if you branch, do some work on that branch, but later abandon it and delete every reference to it from your repo, the periodic git gc will eventually purge your branch. This can surprise users of some older VCSes, e.g. CVS never forgot anything except when it crashed or corrupted itself (which was often).
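The marking-off described above is essentially a mark-and-sweep over the object graph. Below is a toy sketch of that idea, not the real git gc implementation: the object names, the dictionary layout and the function names are all made up for illustration.

```python
def reachable(objects, roots):
    """Return the set of object ids reachable from the given roots.

    `objects` maps an object id to the ids it refers to (a commit to its
    parents and tree, a tree to its blobs and subtrees, a blob to nothing).
    """
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def unreachable(objects, roots):
    """Everything not marked during the walk is a candidate for deletion."""
    return set(objects) - reachable(objects, roots)


# Hypothetical repository: commitX belonged to a branch that was deleted.
objects = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree1": ["blobA"],
    "tree2": ["blobA", "blobB"],
    "blobA": [],
    "blobB": [],
    "commitX": ["commit1", "treeX"],
    "treeX": ["blobC"],
    "blobC": [],
}
refs = {"refs/heads/master": "commit2"}

print(sorted(unreachable(objects, refs.values())))
```

Only the objects that were unique to the abandoned branch are reported; anything it shared with master (commit1, tree1, blobA) stays.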
git repack (really git-pack-objects) is a totally different operation from git gc (it is a separate command, though git gc may call git repack). As I mentioned earlier, Git just fires every object into its own file named by its SHA. It does compress each one with zlib before it goes to disk, but obviously this isn't space efficient over the long run. So what git-pack-objects does is examine a series of objects for places where data repeats across revisions. It doesn't care what kind of object it is - all are considered equal for packing. It then generates binary deltas where those make sense, and stores the entire lot as a .pack file in .git/objects/pack, removing any packed objects from the normal directory structure.
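To give a feel for what a binary delta is, here is a toy sketch that expresses one version of a file as "copy this range from the base" and "insert these new bytes" instructions. Real packfiles use a compact binary encoding and Git's own matching algorithm; using difflib here is purely a stand-in for illustration, and make_delta / apply_delta are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copy/insert instructions relative to `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that are new in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are simply skipped.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base plus the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


v1 = b"line one\nline two\nline three\n"
v2 = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(v1, v2)
print(delta)
print(apply_delta(v1, delta) == v2)
```

The delta only carries the bytes that differ, which is where the space saving over storing every revision whole comes from.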
Note that git-pack-objects generally makes a new .pack file rather than replacing existing .pack files if the most recent pack file is less than 1 MB in size. Thus, over time you'll see multiple .pack files appear in .git/objects/pack. Indeed, when you git fetch, you simply ask the remote repo to pack all of its unpacked items and to send the fetching repo whichever .pack files it doesn't already have. git repack simply calls git-pack-objects but tells it to merge .pack files as it sees fit. That implies decompressing anything which has changed, regenerating the binary deltas and recompressing.
So, to answer your question: the "Total" number is the total number of objects in the Git repo. The first delta number is how many of those objects are stored as binary deltas, i.e. how many objects Git has decided are similar enough to other objects to be stored as a delta against them. The reused number indicates how many objects were taken from an already compressed source (i.e. a packfile) without being recompressed to include more recent changes. This would occur when you have multiple packfiles, and a more recent object refers to an item in an old packfile as its base and then applies deltas to it to bring it up to date. This lets Git make use of previously compressed older revisions of data without having to recompress them to include more recent additions. Note that git may append to an existing pack file without rewriting the entire pack file.
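If you want to track these numbers over time, the line is easy to pick apart. Here is a small sketch that parses it into named fields; the field names (total, delta, reused, reused_delta) are my own labels for the four numbers, not anything Git defines.

```python
import re

# Matches e.g. "Total 576 (delta 315), reused 576 (delta 315)".
# Any trailing text after the last parenthesis is ignored.
TOTAL_RE = re.compile(
    r"Total (?P<total>\d+) \(delta (?P<delta>\d+)\), "
    r"reused (?P<reused>\d+) \(delta (?P<reused_delta>\d+)\)"
)


def parse_total_line(line: str) -> dict:
    """Split a "Total ..." summary line into its four counters."""
    match = TOTAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a Total line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


stats = parse_total_line("Total 576 (delta 334), reused 242 (delta 0)")
print(stats)
# Share of objects in the new pack that are stored as deltas.
print(stats["delta"] / stats["total"])
```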
Generally speaking, a high reused count indicates that some space could be reclaimed with a full repack that recomputes the deltas (such as the git repack -afd in your question). Note that this does not necessarily bring reused all the way down to zero: in your second example the reused delta count dropped to 0, but 242 objects were still reused. However, Git will generally take care of all of that silently for you. Also, doing full repacks may force some git fetches to restart from scratch because the packs differ - this depends on server settings (allowing custom per-client pack generation is expensive on server CPU, so some major Git sites disable it).
Hopefully this answers your question. Really, Git is so simple that in the beginning you're amazed it works at all; then, as you wrap your head around it, you become seriously impressed. Only truly genius programmers can write something so simple that works so well, because they can see simplicity where most programmers can only see complexity.
Niall