Maintaining history with --allow-unrelated-histories?

Question

I have two repos, foo and bar, with no common root. I would like to merge the two repos, maintaining both of their histories.

I NEED to maintain foo's commit history. I would like bar's history to be rebased on top of foo as discrete patches.

For example,

repo foo has file /baz with commits A, B, C.
repo bar has file /baz with commits D, E.

I would like the resulting repo foo to have /baz's commits sequenced as such: A, B, C, D, E.

When D lands it should always be taken as correct and complete.

It seems the preferred method of joining two repos is --allow-unrelated-histories and merging, but I need to know how to maintain the histories after the merge.

torek · Accepted Answer

The good news is that you can get what you want: History, in a Git repository, is just the commits.

The commits are numbered. The numbers are hash IDs, and these are unique to the content of each commit. If the content of two commits, in two different Git repositories, exactly matches, so do their hash IDs. If the content differs, so do their hash IDs.
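To make the "same content, same hash ID" claim concrete, here is a minimal sketch. The temporary directory, the `make_commit` helper, the `demo` identity, and the fixed timestamp are all assumptions made for the demonstration. Two throwaway repositories get a byte-identical first commit, and a third gets different file content.

```shell
set -eu
work=$(mktemp -d)

# Pin every piece of commit metadata so that only the file content can differ.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE='1577836800 +0000' GIT_COMMITTER_DATE='1577836800 +0000'

# Hypothetical helper: make_commit <dir> <file-content>; prints the commit hash.
make_commit() {
  git init -q "$work/$1"
  echo "$2" > "$work/$1/baz"
  git -C "$work/$1" add baz
  git -C "$work/$1" commit -q -m "add baz"
  git -C "$work/$1" rev-parse HEAD
}

same1=$(make_commit r1 hello)
same2=$(make_commit r2 hello)
other=$(make_commit r3 goodbye)

echo "identical content: $same1 and $same2"
echo "different content: $other"
```

The first two hashes should match even though the repositories never exchanged anything, while the third should differ.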

A Git repository's history is the set of commits in its repository. To retain the history you are calling "foo's commit history", you need to retain those exact commits, with their exact contents. Fortunately, that's what you say you want to do here too:

I would like bar's history to be rebased on top of foo as discrete patches.

No commit is ever a "discrete patch", so you must mean that you want to copy some or all parts of the commits from bar.

That is, we'll start with the commits out of repository foo. We'll copy them exactly as is so that they are the same commits, with the same hash IDs, into our new combined repository. Then, we will take each commit out of repository bar, make changes to it once it's not in Git as a commit, and put the result back as a new commit in our combined repository.

The new commits, being different from any previous commit, will have their own new and unique hash IDs. So only repository foo's history is retained.

When [the first commit from repo bar] lands it should always be taken as correct and complete.

Now we're starting to talk about what the contents of each commit will be, so now we look at the mechanics of making a commit.

The normal process

Normally, when we make a new commit, we do that by:

1. Starting with a clone of some existing repository, or starting with some existing repository. Either is fine since the commits get copied by cloning, so that a clone has the same hash IDs. (The branch names don't get copied, since the clone gets its own branch names, but branch names in Git don't matter, except in terms of finding commit hash IDs.)
2. Extracting one of these commits: git checkout name or (Git 2.23 or later) git switch name. This uses the name to find the hash ID of the commit. Git then copies the snapshot part of the commit out to two places:
   - One copy goes into Git's index. This is what Git will use to make the next commit snapshot.
   - The other copy goes into your working tree. The files inside a commit are not useful for anything except as an archived snapshot: they're compressed, de-duplicated, and generally only readable by Git itself. So they have to be un-archived and expanded into useful form. Git does not need these files: your work-tree copy of each file is for you, because you need these files in this form.
3. Now we work with the files, and maybe change some of them. If we do change some of them, and want to make a new snapshot, we have to use git add to copy the updated files back into Git's index, booting out the old copy and replacing it with the updated file. (Or, if the file is all-new, we don't boot anything out, we just add a new file.)
4. Then we run git commit: Git makes a new commit with all-new metadata that Git constructs from your user.name and user.email setting and other information it has. The snapshot of the new commit is from Git's index. The overall content of the commit is the snapshot plus the metadata.

Having written out a new commit into the all-commits-and-other-Git-objects database, Git then stashes the new commit's new and unique hash ID into the current branch name, so that Git can find the new commit, using the current branch name. The new commit is now frozen for all time: this hash ID is now used up and means this commit.
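The steps above can be walked through in a scratch directory. This is only a sketch: the `src` and `clone` directory names and the `demo` identity are made up for the illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A source repository with one commit on a branch named main.
git init -q src
git -C src symbolic-ref HEAD refs/heads/main
echo one > src/baz
git -C src add baz
git -C src commit -q -m "first"

git clone -q src clone          # step 1: the clone has the same commit, same hash ID
cd clone
git checkout -q main            # step 2: snapshot goes to the index and the working tree
before=$(git rev-parse main)

echo two > baz                  # step 3: edit the working-tree copy ...
git add baz                     #         ... and copy it back into the index
git commit -q -m "second"       # step 4: new commit = index snapshot + new metadata

after=$(git rev-parse main)     # the branch name now holds the new hash ID
parent=$(git rev-parse main~1)  # and the new commit links back to the old one
echo "main moved from $before to $after"
```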

What you want instead

You will start by cloning the foo repository as a whole, so that you get all three of its commits, A-B-C. Each of these three commits has a full snapshot of every file. This is a normal everyday clone operation, working the usual way: copy all the commits and none of the branches, then create one new branch name matching the source repository's branch name, holding the same commit hash ID.

Next, you'll probably want to git remote add the bar repository, so that you can git fetch all of its commits: in this case, D and E. These too are full snapshots of every file.
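As a rough sketch of these two steps, the script below builds stand-in foo and bar repositories and combines them. The `mkrepo` helper, the `combined` directory, and the `demo` identity are assumptions for the demo. I pass `-o foo` to git clone so that the remote-tracking name comes out as foo/main rather than origin/main.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E

git clone -q -o foo foo combined   # copies A-B-C, creates main and foo/main
cd combined
git remote add bar "$work/bar"
git fetch -q bar                   # copies D-E, creates bar/main

foo_tip=$(git rev-parse foo/main)
bar_tip=$(git rev-parse bar/main)
git --no-pager log --all --graph --format='%h %s%d'
```

The graph should show two disconnected chains, since nothing links D back to C yet.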

It is now up to you to decide how you want to take snapshot-and-metadata D out of your combined repository and make a new and different commit D' that has a snapshot and that links back to existing commit C. You can retain as much of D's metadata as you like, except for the parent hash ID. Commit D, in repository bar, is the initial commit, so it says that there is no parent. You need a commit D' that says there is one parent and it is ______ (insert hash ID of commit C here).

Having made D' from D, you now need to make E' from E. This is basically the same process.

You talk about wanting to retain one file, but each commit has a full snapshot of every file. If you want to retain every file from commit D, completely ignoring the snapshot in commit C, this is easy, because commit D has, as its snapshot, the exact correct set of files. You just re-use D's snapshot when you make your D'. If you only want to retain one file from D, it's still easy-ish, though it's just a tiny bit harder.
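To illustrate what "re-use D's snapshot" means at the lowest level, here is a sketch using the plumbing command git commit-tree. This is not the route I recommend further down, just a way to see the pieces. The `mkrepo` helper, directory names, and `demo` identity are invented for the demo, and only the log message and author date are copied from D here.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E
git clone -q -o foo foo combined
cd combined
git remote add bar "$work/bar"
git fetch -q bar

c=$(git rev-parse main)               # commit C, the new parent
d=$(git rev-parse bar/main~1)         # commit D, the root commit of bar
d_tree=$(git rev-parse "$d^{tree}")   # D's snapshot, as a tree object

# Build D': D's tree, D's log message and author date, but parent C.
d_new=$(
  git log -1 --format=%B "$d" |
  GIT_AUTHOR_DATE=$(git log -1 --format=%ad --date=raw "$d") \
    git commit-tree "$d_tree" -p "$c"
)
echo "D  = $d"
echo "D' = $d_new"
```

Nothing points at D' yet after this. A branch name would still have to be moved to it, for example with git update-ref or git reset.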

This repeats for commit E, and then, since there were just the two commits, you are now done. Your combined repository has in it:

A--B--C   <-- foo/main
       \
        D'-E'   <-- main (HEAD)

D--E   <-- bar/main

(assuming the two input repositories have branches named main used to find their final commits).

You'll need to say whether you want the full snapshot from D as D', or whether you want a single file, before we talk about ways to obtain the result.

If you want `D`'s snapshot in `D'` ...

If you want to keep the entire snapshot from commit D as the new commit D' snapshot, what we'll want is for the new commit D' to literally use that tree object (this is an internal detail of a commit that we wouldn't normally worry about, but it becomes a useful possibility here).

We also need to know what you want for D''s metadata: for its author, committer, and date strings, and for its log message. You can have Git copy those from D directly.

To do both of these, we will:

use git replace, at least temporarily, to make a graft; then
use git filter-branch or git filter-repo or similar to make the graft permanent.

The way grafts—made with git replace with the --graft option—work is to copy a commit except for its parent linkage:

--graft <commit> [<parent>...]
Create a graft commit. A new commit is created with the same content as <commit> except that its parents will be [<parent>...] instead of <commit>'s parents. A replacement ref is then created to replace <commit> with the newly created commit. Use --convert-graft-file to convert a $GIT_DIR/info/grafts file and use replace refs instead.

So, given:

A--B--C   <-- main (HEAD), foo/main

D--E   <-- bar/main

in your replacement-so-far repository, you can now run:

git replace --graft bar/main~1 main

Here the <commit> argument is bar/main~1. This is the commit that is to be copied. The <parent>... arguments are just main. Git will resolve bar/main~1 to a commit hash ID to find commit D, and will resolve main to find commit C. It then makes a new commit—D'—whose contents are from commit D with one change: the snapshot is the same, and most of the metadata are the same, but the parent list is commit C (i.e., the one found by main).

Git then makes a very weird name—it's not a branch name; it's not a tag name; it's not a remote-tracking name; it lives in the refs/replace/ namespace and has D's raw hash ID as the rest of its name—that locates this new commit:

A--B--C   <-- main (HEAD), foo/main
       \
        D'  <-- refs/replace/<hash>

D--E   <-- bar/main

If we now run git log bar/main, Git:

looks up commit E and displays it, then follows the parent link to D;
goes to look up commit D, but sees that there is a refs/replace/ for D, so immediately jumps over to D' instead, and displays that;
moves back from D' to C (there's no replacement for C) and displays C;
moves back to B and displays it; and
moves back to A and displays it.

This is how replacements work. There's one big drawback with replacement commits, and that is that git clone normally does not copy them. That might be OK! If this repository is the only place you ever need this behavior, you can just about stop here. This has some advantages because now commit E, in this repository, is literally the actual commit E from repository bar. Should repository bar add new commits, you can just bring them into your repository and use them.
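A small experiment shows both the graft at work and the clone drawback. It is a sketch in throwaway repositories: the `mkrepo` helper, the `combined` and `copy` directories, and the `demo` identity are assumptions. The `--no-replace-objects` option tells Git to ignore refs/replace/ for one command.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E
git clone -q -o foo foo combined
cd combined
git remote add bar "$work/bar"
git fetch -q bar

git replace --graft bar/main~1 main   # D' = copy of D whose parent is C
git replace -l                        # lists D's hash ID, the name under refs/replace/

# With the replacement honored, history runs E, D (really D'), C, B, A.
with_graft=$(git log --format=%s bar/main | tr '\n' ' ')
# With replacements ignored, we see the original two-commit history.
without_graft=$(git --no-replace-objects log --format=%s bar/main | tr '\n' ' ')
echo "with graft:    $with_graft"
echo "without graft: $without_graft"

# A plain clone does not carry the refs/replace/ name along.
git clone -q "$work/combined" "$work/copy"
copied=$(git -C "$work/copy" for-each-ref refs/replace | wc -l)
echo "replace refs in the clone: $copied"
```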

For now, though, let's move the name main to point to E:

git reset --hard bar/main

or (this should work but I have not tried it):

git merge --ff-only bar/main

The result is:

A--B--C   <-- foo/main
       \
        D'  <-- refs/replace/<hash>

D--E   <-- main (HEAD), bar/main

If this drawback about grafts not getting cloned isn't OK—and/or if you never intend to contact repository bar again—you can now "cement the replacement". To do that, you must have Git re-copy each commit in place. More precisely, we only need to recopy commits D and E, with the replacement being done during the re-copying, but it's easiest to re-copy all commits, with git filter-branch.

Using filter-branch or filter-repo

There is one big problem with filter-branch: it's being retired. It still exists in Git, and it still works (or should work), but the Git documentation now discourages its use. Instead, git filter-repo is recommended, although it's a separate tool that is not included with Git distributions yet. Both have the same fundamental principles of operation though.

Since a repository is nothing more than a collection of commits and other internal Git objects, plus a collection of names by which we find the commits and other objects, we can have a program:

walk some or all commits, either literally or virtually extracting them (snapshot and/or metadata) to a temporary area;
apply some filter(s) to the snapshot and/or metadata; and
construct a new commit from the filtered result.

If the new commit is absolutely, completely, 100% identical, bit-for-bit, to the original commit, it gets the same hash-ID number. If it's different, it gets a different number.

By walking the commits from oldest to newest,¹ keeping a map—old hash ID _____ = new hash ID _____—we can make arbitrary changes to the entire repository. Any commit that's not changed at all, including no changes to its parentage, retains its hash ID. Any commit that is changed—as in, different snapshot or different history (parent linkage)—gets a new number.

Once we've finished the operation over all the commits to be operated-on, we can then adjust some or all of the names, so that instead of finding the old commits, they find the new ones.

Because you'll have to pick one of filter-branch or filter-repo, this answer does not have a specific recipe for either one—but I'll note here that we don't actually intend to make any specific change to anything about any commit. All we want is for the filter operation to obey the graft. That is, when making a copy of commit D, filter-branch or filter-repo should look up the replacement D' instead of using the original D.

When the filter operation does this, here's the result:

To copy A, we grab all the bits from A and make no changes. The result gets written back. It's 100% bit-for-bit identical to A, so it is still A.
To copy B, we grab all the bits from B and make no changes except to replace the parent of B with the new copy of A. That's still A! So the copy of B is 100% bit-for-bit identical, and hence is B.
To copy C, we grab all its bits and ... well, this is just like B: the copy of C is C.
To copy D, we grab all the bits of ... no, wait, there's a replacement! We grab all the bits of D'. We replace D''s parent C with the copy, which is still C. So writing this back, we get D'. That means the copy of D is D'.
To copy E, we grab all the bits of E, but replace E's parent (D) with its copy (D'). This means that the copy of E is not bit-for-bit identical. Instead, it's E', a copy of E that leads back to D'.

Hence after the copy process (but before adjusting branch names), we have:

A--B--C   <-- foo/main
       \
        D'  <-- refs/replace/<hash>
         \
          E'

D--E   <-- bar/main, main (HEAD)

Now we go in and change some set of branch names. The only actual branch name here is main. We replace the hash ID in main with its copy, i.e., change from pointing to E, to pointing to E' instead:

A--B--C   <-- foo/main
       \
        D'  <-- refs/replace/<hash>
         \
          E'  <-- main (HEAD)

D--E   <-- bar/main

We can now delete the refs/replace/ name (which is how clones operate: they fail to copy the name, which is like deleting it) since we never plan to follow bar/main from E to D. If we also delete the bar/main name, that leaves us with original commits D-E un-findable, and a repository that looks like this:

A--B--C   <-- foo/main
       \
        D'-E'  <-- main (HEAD)

which is what we wanted.
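For anyone who wants to try the whole sequence, here is one possible sketch using filter-branch, which its documentation says honors refs/replace/ and makes such replacements permanent. Treat it as an experiment in throwaway repositories rather than a tested recipe. The `mkrepo` helper, directory names, and `demo` identity are assumptions, and `FILTER_BRANCH_SQUELCH_WARNING` only matters on Git versions that print the deprecation warning.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E
git clone -q -o foo foo combined
cd combined
git remote add bar "$work/bar"
git fetch -q bar

d=$(git rev-parse bar/main~1)     # remember D's hash ID before grafting
git replace --graft "$d" main     # D' with parent C
git reset -q --hard bar/main      # main now names E

# Re-copy the commits reachable from main with no filters at all,
# so the only effect is that the graft gets baked in (E becomes E').
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -- main >/dev/null

# Clean up: the replacement name, filter-branch's backup ref, and bar itself.
git replace -d "$d" >/dev/null
git update-ref -d refs/original/refs/heads/main
git remote remove bar

history=$(git log --format=%s main | tr '\n' ' ')
echo "main history: $history"
```

After this, main~2 should still be the original commit C, with its original hash ID.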

¹It's worth noting here that both filter commands still use Git's backwards method of finding commits. That is, while filter-branch and filter-repo need to copy commits "forwards", from A onward, they find the commits "backwards" first. We start at commit E, then move back to D—and jump to the grafted replacement—and move back to C, then B, then A. Having collected the list of commit hash IDs, Git now just reverses the order. (Technically it collects the list in a topological sort order, then reverses from there.)

If you need to retain all the commit hashes ...

You mention in your own answer that git merge --allow-unrelated-histories works for what you want. This is true because git log with a file name—you don't need --follow here, just the file name—defaults to using history simplification when tracking down how some file(s) became the way they are in the final commit.

Let's just draw the effect of merging with the --allow-unrelated-histories flag. We start with a combining repository as before:

A--B--C   <-- foo/main

D--E   <-- main (HEAD), bar/main

Note, however, that this time I have chosen commit E to be our main—perhaps via git reset --hard E. That's for the -s ours below. If we don't need -s ours, we can pick either tip commit, but we're going to have to make sure that the merge commit's copy of that one particular file is the copy from commit E.

We now just run git merge with the flag that says that it should merge anyway. We can add -s ours to make Git completely ignore all files in the snapshot in C, which is pretty convenient. Git will then add one new merge commit, which I will draw as M for merge, that links back to both commits E and C. The first parent of M will be E, and the second will be C, so I'm going to flip the lines over as well:²

D-----E   <-- bar/main
       \
        M   <-- main (HEAD)
       /
A--B--C   <-- foo/main

When using this kind of operation, note that the original commits are completely untouched. They therefore retain their original hash IDs and hence are the original history.
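Here is a sketch of that merge in throwaway repositories (the `mkrepo` helper, directory names, `demo` identity, and merge message are made up). It checks that M's first parent is E, its second parent is C, and its snapshot is E's.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E
git clone -q -o foo foo combined
cd combined
git remote add bar "$work/bar"
git fetch -q bar

git reset -q --hard bar/main      # make main name E, as in the drawing
# -s ours keeps our (E's) snapshot and ignores every file in C's snapshot.
git merge -q -s ours --allow-unrelated-histories -m "M: join foo and bar" foo/main

p1=$(git rev-parse main^1)        # first parent: E
p2=$(git rev-parse main^2)        # second parent: C
echo "M = $(git rev-parse main), parents $p1 and $p2"
```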

If this kind of merge is acceptable, it's usually the best way to deal with this. It's not at all the same as what you described, though: the history does not appear to start at E and work back to A, the way Git usually does with linear history. Instead, the history now starts at M, and immediately diverges into both E and C. When using git log with no options, you'll see both lines of history. When using git log -- filename, however, you'll see only the line of history that explains the outcome: the version of the file that appears in commit M. So if the copy of the named file in M matches that in E, but not that in C, git log will follow the line from M back to E. That leaves Git with only one remaining commit to visit: D.
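The difference between plain git log and git log -- filename can be seen with a sketch like the one below. The `mkrepo` helper, directory names, and `demo` identity are assumptions. The `--full-history` option is added only for contrast, since it turns off the pruning that history simplification does at the merge.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: mkrepo <dir> <label>... makes one commit to baz per label.
mkrepo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for label in "$@"; do
    echo "$label" > "$dir/baz"
    git -C "$dir" add baz
    git -C "$dir" commit -q -m "$label"
  done
}
mkrepo foo A B C
mkrepo bar D E
git clone -q -o foo foo combined
cd combined
git remote add bar "$work/bar"
git fetch -q bar
git reset -q --hard bar/main
git merge -q -s ours --allow-unrelated-histories -m M foo/main

total=$(git rev-list --count HEAD)   # M plus both lines of history
# baz in M matches baz in E, so simplification follows only the E side.
simplified=$(git log --format=%s -- baz | tr '\n' ' ')
# --full-history keeps walking the C side as well.
full=$(git log --format=%s --full-history -- baz | tr '\n' ' ')
echo "commits reachable from main: $total"
echo "git log -- baz:                $simplified"
echo "git log --full-history -- baz: $full"
```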

The last option you have here is the one we already described above, using git replace. We can use this to make Git "avert its eyes" from any one particular commit, using as a substitute a commit made by git replace. A graft commit re-uses the snapshot, but makes arbitrary changes to the parents.

²The first vs second parent at a merge is mostly useful with git log --first-parent, which—when hitting a merge—pretends that there is no second parent. In this case, for instance, it completely ignores commit C.

NO WAR WITH RUSSIA · Answer

Turns out that --allow-unrelated-histories maintains the history of both files, which can be accessed with the --follow option of git log:

--follow
  Continue listing the history of a file beyond renames (works only for a single file)

I'd still like to know how to do this without needing to use --follow, but that's probably good enough.

Tags: git, rebase
