I accidentally dropped a DVD-rip into a website project, then carelessly git commit -a -m ...
, and, zap, the repo was bloated by 2.2 gigs. Next time I made some edits, deleted the video file, and committed everything, but the compressed file is still there in the repository, in history.
I know I can start branches from those commits and rebase one branch onto another. But what should I do to merge the 2 commits so that the big file doesn't show in the history and is cleaned in the garbage collection procedure?
If the large file was added in the most recent commit, you can just run: git rm --cached <filename> to remove the large file, then. git commit --amend -C HEAD to edit the commit.
To entirely remove unwanted files from a repository's history you can use either the git filter-repo tool or the BFG Repo-Cleaner open source tool. The git filter-repo tool and the BFG Repo-Cleaner rewrite your repository's history, which changes the SHAs for existing commits that you alter and any dependent commits.
If this is your last commit and you want to completely delete the file from your local and the remote repository, you can: remove the file git rm <file> commit with amend flag: git commit --amend.
Use the BFG Repo-Cleaner, a simpler, faster alternative to git-filter-branch
specifically designed for removing unwanted files from Git history.
Carefully follow the usage instructions, the core part is just this:
$ java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
Any files over 100MB in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc
to clean away the dead data:
$ git gc --prune=now --aggressive
The BFG is typically at least 10-50x faster than running git-filter-branch
, and generally easier to use.
Full disclosure: I'm the author of the BFG Repo-Cleaner.
What you want to do is highly disruptive if you have published history to other developers. See “Recovering From Upstream Rebase” in the git rebase
documentation for the necessary steps after repairing your history.
You have at least two options: git filter-branch
and an interactive rebase, both explained below.
git filter-branch
I had a similar problem with bulky binary test data from a Subversion import and wrote about removing data from a git repository.
Say your git history is:
$ git lola --name-status * f772d66 (HEAD, master) Login page | A login.html * cb14efd Remove DVD-rip | D oops.iso * ce36c98 Careless | A oops.iso | A other.html * 5af4522 Admin page | A admin.html * e738b63 Index A index.html
Note that git lola
is a non-standard but highly useful alias. With the --name-status
switch, we can see tree modifications associated with each commit.
In the “Careless” commit (whose SHA1 object name is ce36c98) the file oops.iso
is the DVD-rip added by accident and removed in the next commit, cb14efd. Using the technique described in the aforementioned blog post, the command to execute is:
git filter-branch --prune-empty -d /dev/shm/scratch \ --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \ --tag-name-filter cat -- --all
Options:
--prune-empty
removes commits that become empty (i.e., do not change the tree) as a result of the filter operation. In the typical case, this option produces a cleaner history.-d
names a temporary directory that does not yet exist to use for building the filtered history. If you are running on a modern Linux distribution, specifying a tree in /dev/shm
will result in faster execution.--index-filter
is the main event and runs against the index at each step in the history. You want to remove oops.iso
wherever it is found, but it isn’t present in all commits. The command git rm --cached -f --ignore-unmatch oops.iso
deletes the DVD-rip when it is present and does not fail otherwise.--tag-name-filter
describes how to rewrite tag names. A filter of cat
is the identity operation. Your repository, like the sample above, may not have any tags, but I included this option for full generality.--
specifies the end of options to git filter-branch
--all
following --
is shorthand for all refs. Your repository, like the sample above, may have only one ref (master), but I included this option for full generality.After some churning, the history is now:
$ git lola --name-status * 8e0a11c (HEAD, master) Login page | A login.html * e45ac59 Careless | A other.html | | * f772d66 (refs/original/refs/heads/master) Login page | | A login.html | * cb14efd Remove DVD-rip | | D oops.iso | * ce36c98 Careless |/ A oops.iso | A other.html | * 5af4522 Admin page | A admin.html * e738b63 Index A index.html
Notice that the new “Careless” commit adds only other.html
and that the “Remove DVD-rip” commit is no longer on the master branch. The branch labeled refs/original/refs/heads/master
contains your original commits in case you made a mistake. To remove it, follow the steps in “Checklist for Shrinking a Repository.”
$ git update-ref -d refs/original/refs/heads/master $ git reflog expire --expire=now --all $ git gc --prune=now
For a simpler alternative, clone the repository to discard the unwanted bits.
$ cd ~/src $ mv repo repo.old $ git clone file:///home/user/src/repo.old repo
Using a file:///...
clone URL copies objects rather than creating hardlinks only.
Now your history is:
$ git lola --name-status * 8e0a11c (HEAD, master) Login page | A login.html * e45ac59 Careless | A other.html * 5af4522 Admin page | A admin.html * e738b63 Index A index.html
The SHA1 object names for the first two commits (“Index” and “Admin page”) stayed the same because the filter operation did not modify those commits. “Careless” lost oops.iso
and “Login page” got a new parent, so their SHA1s did change.
With a history of:
$ git lola --name-status * f772d66 (HEAD, master) Login page | A login.html * cb14efd Remove DVD-rip | D oops.iso * ce36c98 Careless | A oops.iso | A other.html * 5af4522 Admin page | A admin.html * e738b63 Index A index.html
you want to remove oops.iso
from “Careless” as though you never added it, and then “Remove DVD-rip” is useless to you. Thus, our plan going into an interactive rebase is to keep “Admin page,” edit “Careless,” and discard “Remove DVD-rip.”
Running $ git rebase -i 5af4522
starts an editor with the following contents.
pick ce36c98 Careless pick cb14efd Remove DVD-rip pick f772d66 Login page # Rebase 5af4522..f772d66 onto 5af4522 # # Commands: # p, pick = use commit # r, reword = use commit, but edit the commit message # e, edit = use commit, but stop for amending # s, squash = use commit, but meld into previous commit # f, fixup = like "squash", but discard this commit's log message # x, exec = run command (the rest of the line) using shell # # If you remove a line here THAT COMMIT WILL BE LOST. # However, if you remove everything, the rebase will be aborted. #
Executing our plan, we modify it to
edit ce36c98 Careless pick f772d66 Login page # Rebase 5af4522..f772d66 onto 5af4522 # ...
That is, we delete the line with “Remove DVD-rip” and change the operation on “Careless” to be edit
rather than pick
.
Save-quitting the editor drops us at a command prompt with the following message.
Stopped at ce36c98... Careless You can amend the commit now, with git commit --amend Once you are satisfied with your changes, run git rebase --continue
As the message tells us, we are on the “Careless” commit we want to edit, so we run two commands.
$ git rm --cached oops.iso $ git commit --amend -C HEAD $ git rebase --continue
The first removes the offending file from the index. The second modifies or amends “Careless” to be the updated index and -C HEAD
instructs git to reuse the old commit message. Finally, git rebase --continue
goes ahead with the rest of the rebase operation.
This gives a history of:
$ git lola --name-status * 93174be (HEAD, master) Login page | A login.html * a570198 Careless | A other.html * 5af4522 Admin page | A admin.html * e738b63 Index A index.html
which is what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With