Is there a difference between how Git stores text and binary files?

Question

According to https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7:

One thing everyone seems to agree on is Git is not great for big binary blobs. Keep in mind that a binary blob is different from a large text file; you can use Git on large text files without a problem, but Git can't do much with an impervious binary file except treat it as one big solid black box and commit it as-is.

Say you have a complex 3D model for the exciting new first person puzzle game you're making, and you save it in a binary format, resulting in a 1 gigabyte file. You git commit it once, adding a gigabyte to your repository's history. Later, you give the model a different hair style and commit your update; Git can't tell the hair apart from the head or the rest of the model, so you've just committed another gigabyte. Then you change the model's eye color and commit that small change: another gigabyte. That is three gigabytes for one model with a few minor changes made on a whim. Scale that across all the assets in a game, and you have a serious problem.

It was my understanding that there is no difference between text and binary files and Git stores all files of each commit in their entirety (creating a checksummed blob), with unchanged files simply pointing to an already existing blob. How all those blobs are stored and compressed is another question, that I do not know the details of, but I would have assumed that if the various 1GB files in the quote are more or less the same, a good compression algorithm would figure this out and may be able to store all of them in even less than 1GB total, if they are repetitive. This reasoning should apply to binary as well as to text files.

Contrary to this, the quote continues:

Contrast that to a text file like the .obj format. One commit stores everything, just as with the other model, but an .obj file is a series of lines of plain text describing the vertices of a model. If you modify the model and save it back out to .obj, Git can read the two files line by line, create a diff of the changes, and process a fairly small commit. The more refined the model becomes, the smaller the commits get, and it's a standard Git use case. It is a big file, but it uses a kind of overlay or sparse storage method to build a complete picture of the current state of your data.

Is my understanding correct? Is the quote incorrect?

Noufal Ibrahim · Accepted Answer

Git does store files in their entirety, so if you have two binary files that differ only by a small change, they will take twice the space. Observe.

% git init
Initialized empty Git repository in /tmp/x/.git/
{master #}% du -sh .git
100K    .git
{master #}% dd if=/dev/urandom of=./test count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.102277 s, 103 MB/s
{master #%}% ls -sh test
10M test
{master #%}% git add test
{master #}% git commit -m "Adds test"
[master (root-commit) 0c12c32] Adds test
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 test
{master}% du -sh .git
11M     .git

I've created a 10MB file, added it, and committed it. The repository is now about 11MB in size: the 10MB file plus Git's own overhead.

If I make a small change and then do this again,

{master}% e test # This is an invocation of my editor to change a few bytes.
nil
{master}% git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   test

no changes added to commit (use "git add" and/or "git commit -a")
{master *}% git add test
{master +}% git commit -m "Updates test a little"
[master 99ed99a] Updates test a little
 1 file changed, 0 insertions(+), 0 deletions(-)
{master}% du -sh .git
21M     .git

It now takes about 21MB: two copies of the 10MB file.

This, however, is the "loose object" format of the repository, where each blob is a separate file on disk.
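To make the loose-object behaviour concrete, here is a minimal Python sketch of what a loose blob holds. It assumes the default SHA-1 object format, and the helper name `loose_object` is made up for illustration. The point is that nothing in it looks at whether the content is text or binary, and that each object is compressed on its own, so two nearly identical blobs still cost about twice the space.

```python
import hashlib
import os
import zlib


def loose_object(content: bytes) -> tuple:
    """Return (object id, zlib-compressed bytes) for a blob.

    Hypothetical helper for illustration; assumes the SHA-1 object format.
    """
    # A blob is hashed and stored as "blob <size>\0" followed by the raw bytes.
    # Nothing here inspects whether the content is text or binary.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# Stand-in for the random test file (1 MiB here instead of 10 MiB).
original = os.urandom(1024 * 1024)
# Flip four bytes, similar to the small edit made in the editor.
modified = original[:100] + bytes(b ^ 0xFF for b in original[100:104]) + original[104:]

id_a, packed_a = loose_object(original)
id_b, packed_b = loose_object(modified)

print(id_a != id_b)                   # any change produces a different object id
print(len(packed_a) + len(packed_b))  # about 2 MiB: random data barely compresses
```

The ids this produces should match what `git hash-object` reports for the same bytes.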

You can pack all of these into a Git packfile (which also happens when you push, etc.) and see what happens.

{master}% git gc
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)
{master}% du -sh .git
11M     .git

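Now it stores the full blob only once in the packfile, plus a small delta for the other version. This is different from each commit storing just a diff: commits still refer to complete objects, and it is the objects themselves that are delta-compressed and packed into a single file. As the random binary file here shows, this does not depend on the content being text.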
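The idea behind the packed size can be sketched in a few lines of Python. This is not Git's actual delta encoding (which uses copy/insert instructions and copes with size changes); `make_delta` and `apply_delta` are hypothetical helpers that only handle equal-length inputs. It just shows that a byte-level delta between two versions of a binary file can be tiny.

```python
import os


def make_delta(base: bytes, target: bytes) -> list:
    """Record only the byte ranges where target differs from base.

    Simplification for illustration: both inputs must have the same length.
    """
    if len(base) != len(target):
        raise ValueError("this sketch only handles equal-length inputs")
    delta, i, n = [], 0, len(base)
    while i < n:
        if base[i] == target[i]:
            i += 1
            continue
        start = i
        # Extend the run while the bytes keep differing.
        while i < n and base[i] != target[i]:
            i += 1
        delta.append((start, target[start:i]))
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target by overwriting the recorded ranges in a copy of base."""
    out = bytearray(base)
    for offset, chunk in delta:
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)


base = os.urandom(1024 * 1024)  # random binary data, 1 MiB
target = base[:100] + bytes(b ^ 0xFF for b in base[100:104]) + base[104:]

delta = make_delta(base, target)
delta_bytes = sum(len(chunk) for _, chunk in delta)

print(apply_delta(base, delta) == target)  # the delta reproduces the second version
print(len(base) + delta_bytes)             # base plus a 4-byte delta, not 2 MiB
```

Storing the base once and the second version as a delta is why the repository above shrinks back to roughly the size of a single copy after `git gc`.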

Tags:

git

Bananach
