In the Git Book, the author claims that commits are snapshots and not differences. He then goes on to say that if a file has not changed between commits, Git stores a reference to the previous version of the file instead of taking a new snapshot, in order to save space.
However, I'm having a hard time believing this explanation. Or at least I feel like the author is not fully explaining something.
For example, imagine I have a Git repo that contains a single 4GB file called dictionary.txt. If I were to simply add a new line entry into this text file, the author is claiming that a new snapshot will be taken since Git takes snapshots and not diffs. Thus my repo would balloon in size each time I modify and commit this file since a new snapshot is taken every time.
I have a hard time believing this is actually the case, and I assume I'm misunderstanding the point that the author is trying to make. However, the fact that the author makes a point of calling out that "commits are snapshots, not diffs" makes it seem like an important concept that I would like to understand. It just confuses me why the author would comment on Git saving storage space when a file has not changed, while omitting some other key concept about saving storage when a file has indeed changed.
Any help?
First, this seems like an easy experiment. Let's start with a file that's roughly 500MB in size:
$ ls -l sample.txt
-rw-r--r--. 1 lkellogg lkellogg 580521093 Mar 20 20:58 sample.txt
If we add this to a fresh repository:
$ git init
$ git add sample.txt
$ git commit -m 'Initial commit'
We see that, with compression, our .git directory is just over 300MB in size:
$ du -sh .git
309M .git
If we add a single line to the file and commit the change:
$ echo this is a test >> sample.txt
$ git add sample.txt
$ git commit -m 'Add one line'
We see that our repository has roughly doubled in size:
$ du -sh .git
617M .git
If we repeat the above sequence, we find that our repository has again increased by roughly the size of the file:
$ du -sh .git
925M .git
In other words: git stores snapshots, not diffs. This is because:
git is optimized for performance over storage, since storage is cheap. If you store diffs, checking out a file can be a time- and CPU-intensive operation, since you need to replay all the changes between some base version and your target version.
git is designed primarily as a source code control system, and source files are typically plain text, smaller, and compress well. If you have large binary blobs in your repository, you may want to investigate things like git lfs.
There are optimizations as the repository grows larger, but fundamentally this is how git operates.
Conceptually, and in the simple case, Git stores complete snapshots. This sets it apart from other version control systems, which store only the differences between adjacent commits.
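The snapshot model can be checked directly. The following is a minimal sketch, assuming a throwaway repository created with mktemp and two hypothetical files, stable.txt and changing.txt (neither is from the question). It compares the blob that each commit's snapshot records for each path:

```shell
#!/bin/sh
set -e

# Work in a throwaway repository; the file names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit: two files.
echo "never changes" > stable.txt
echo "version 1" > changing.txt
git add stable.txt changing.txt
git commit -q -m 'Initial commit'

# Second commit: only changing.txt is modified.
echo "version 2" >> changing.txt
git add changing.txt
git commit -q -m 'Modify one file'

# Resolve the blob ID that each commit's snapshot records for each path.
stable_old=$(git rev-parse HEAD~1:stable.txt)
stable_new=$(git rev-parse HEAD:stable.txt)
changing_old=$(git rev-parse HEAD~1:changing.txt)
changing_new=$(git rev-parse HEAD:changing.txt)

echo "stable.txt:   $stable_old -> $stable_new"
echo "changing.txt: $changing_old -> $changing_new"
```

Both commits should point at the same blob for stable.txt, so the unchanged file is stored once, while changing.txt gets a new blob holding its full contents.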
In reality there are optimizations.
You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first? It turns out that it can.
The primary optimization is "packfiles". Pro Git has a chapter on them.
The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
The packfile is a single file containing the contents of all the objects that were removed from your filesystem. The index is a file that contains offsets into that packfile so you can quickly seek to a specific object. What is cool is that although the objects on disk before you ran the gc command were collectively about 15K in size, the new packfile is only 7K. You’ve cut your disk usage by half by packing your objects.
When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next.
Note that Git is not storing the deltas between commits; it can store whichever delta it finds most efficient. This is why I say Git conceptually stores the whole file, and storing deltas is an optimization.
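The effect of packing can be measured in a small experiment. This is a rough sketch, assuming a throwaway repository and a hypothetical text file of about 1.3MB generated with seq; the exact numbers will vary with the Git version and the file contents. It commits four nearly identical versions of the file, then compares the space used by loose objects with the size of the packfile after git gc:

```shell
#!/bin/sh
set -e

# Work in a throwaway repository; sample.txt here is generated, not real data.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config gc.auto 0   # make sure nothing is packed until we ask for it

# Roughly 1.3MB of plain text.
seq 1 200000 > sample.txt
git add sample.txt
git commit -q -m 'Initial commit'

# Three more commits, each appending a single line.
for i in 1 2 3; do
    echo "extra line $i" >> sample.txt
    git add sample.txt
    git commit -q -m "Add line $i"
done

# Disk space used by loose objects, in KiB.
loose_kib=$(git count-objects -v | awk '/^size:/ {print $2}')

# Pack the objects; similar blobs may now be stored as deltas.
git gc --quiet

# Disk space used by packfiles, in KiB.
packed_kib=$(git count-objects -v | awk '/^size-pack:/ {print $2}')

echo "loose objects before gc: ${loose_kib} KiB"
echo "packfile after gc:       ${packed_kib} KiB"
```

Before packing, each commit adds a full compressed copy of the file. After git gc, the packfile should be close to the size of a single compressed copy, because the other versions are stored as small deltas.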
This is also why one should decompress files before committing: not only will it make diff and merge work better, but Git will also do a better job of saving space.