
git log --numstat has weird data

Tags: git, git-log

I run this command:

git log HEAD --numstat --pretty="%ae" | cat

and I see this:

6       4       frontend/src/Frontend.hs
[email protected]
[email protected]
[email protected]

3       3       dep/rhyolite/github.json
29      14      frontend/src/Frontend.hs
[email protected]

3       1       backend/src/Backend/RequestHandler.hs
27      18      frontend/src/Frontend.hs
[email protected]

5       0       default.nix
7       0       dep/reflex-dom/default.nix
7       0       dep/reflex-dom/github.json
7       0       dep/reflex/default.nix
7       0       dep/reflex/github.json
[email protected]
[email protected]
[email protected]

What exactly does it mean to have more than one email associated with a commit? Does it mean the person moved from one network to another (for example, took their laptop home and then finished the commit)? I am trying to parse the commits and assign each one to a single author, but having multiple emails attached to a commit makes this harder.


1 Answer

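As you have already found, you are including merges in your git log output. By default, when looking at a merge, git log does not diff the merge against its parents, so no --numstat output comes out either. Each email line in your output is therefore its own commit: the stat lines for a commit follow its email, and a merge prints its email with no stat lines after it. A run of consecutive emails is a run of separate commits without stats, not one commit with several authors.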

Skipping the merges (with git log --no-merges) is ... okay, but if you do this, you lose information about any changes made during merging. That might be the best you can do, because the information you can get from a merge is tricky.
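To see this behavior in isolation, here is a minimal sketch that builds a throwaway repository containing one merge and counts the email lines with and without --no-merges. It assumes git is installed; the repository, the commit_file helper, the file names, and the demo@example.com address are all made up for the demonstration.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
git config commit.gpgsign false

# Hypothetical helper: create a one-line file and commit it.
commit_file() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "add $1"
}

commit_file base
main=$(git symbolic-ref --short HEAD)   # name of the default branch
git checkout -q -b side
commit_file b
git checkout -q "$main"
commit_file a
git merge -q --no-ff --no-edit side     # creates the merge commit

# Default: every commit, including the merge, prints an email line.
with_merges=$(git log --numstat --pretty="%ae" | grep -c '@')
# --no-merges: the merge commit is dropped from the output.
without_merges=$(git log --no-merges --numstat --pretty="%ae" | grep -c '@')
# The merge alone: its email line and nothing else (no stat lines).
merge_lines=$(git log -1 --numstat --pretty="%ae" | grep -c .)

echo "emails with merges: $with_merges, without merges: $without_merges"
echo "non-empty lines printed for the merge: $merge_lines"
```

In this toy history there are three ordinary commits and one merge, so the first count is 4, the second is 3, and the merge contributes a single line (its email).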

Consider a graph that looks like this:

...--o--A
         \
          M--...
         /
...--o--B

Git will compare M against ... what? A standard git diff compares two commits. Which two commits should you compare, to make sense of the snapshot in M? Is it A-vs-M, or B-vs-M?

There is no one right answer to this question. To avoid having to pick an answer, git log just doesn't bother diffing M at all, by default. But you can tell it that it should diff anyway, using any of these three options:

  • -c: produce a combined diff
  • --cc (two dashes, two lowercase Cs): produce a dense combined diff, which further omits hunks where the merge result just takes one parent's version unmodified
  • -m: produce multiple non-combined diffs by "virtually splitting" the merge

The two kinds of combined diffs are not exactly the same, but I've never seen a proper description of the intent of the difference between them. Both of them first throw out, from the diff, any file whose copy in M exactly matches its copy in either A or B. (If the merge has more than two parents, this phase discards any file that exactly matches the copy in any parent.)

Then, having reduced the number of files—sometimes to zero!—the combined diff goes on to diff both (or all) parents' versions against the (single) child version of that file in commit M. See the git diff documentation on combined merges for more.

The -m option is more useful, especially if combined with the --first-parent option. In this case, Git pretends—for diff purposes only—that M has been split into multiple separate copies:

...--o--A--M1--...

...--o--B--M2--...

Now that the merge has been split, each copy has just one parent, so Git can, in effect, just use git diff A M1 and show you that. Then it can go on and git diff B M2 and show you that. The actual snapshot used in each diff is just the (single) snapshot for real commit M.
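The "virtual split" can be observed on a small test repository. The sketch below is only an illustration under assumptions: git is installed, and the commit_file helper, the file names a.txt and b.txt (standing in for the changes made in A and B), and the email address are invented here.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
git config commit.gpgsign false

# Hypothetical helper: create a one-line file and commit it.
commit_file() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "add $1"
}

commit_file base
main=$(git symbolic-ref --short HEAD)
git checkout -q -b side
commit_file b                           # plays the role of B
git checkout -q "$main"
commit_file a                           # plays the role of A
git merge -q --no-ff --no-edit side     # M, with parents A (first) and B (second)

# -m: the merge is shown once per parent, each time with its own numstat.
split=$(git log -1 -m --numstat --pretty="%ae")
printf '%s\n' "$split"

# How many times was the merge's email printed, and which file leads the stats?
entries=$(printf '%s\n' "$split" | grep -c '@')
first_file=$(printf '%s\n' "$split" | awk -F'\t' 'NF == 3 { print $3; exit }')
```

The merge's email appears twice, once for the A-vs-M comparison (which reports b.txt as added) and once for B-vs-M (which reports a.txt as added). For the original parsing problem, this means plain -m makes one merge look like two entries.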

If you combine this with --first-parent, git log walks only the first parent of each merge, and produces only the first-parent diff for the merge. So instead of the actual graph, this git log shows you what things would be if the graph were just:

...--o--A
         \
          M--...

You never see commit B (nor its parents) at all, but you do get a diff describing commit M, as compared to commit A, and therefore --numstat works for this case.
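For the parsing task in the question, one possible approach is to feed this output to awk and sum the added and deleted lines per author email. This is a rough sketch, not a definitive implementation: the test repository, the commit_file helper, and the alice/bob addresses are hypothetical, and the awk script assumes the email is the only single-field line in the output.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
git config commit.gpgsign false

# Hypothetical helper: commit a one-line file as the given author email.
commit_file() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    GIT_AUTHOR_EMAIL="$2" git commit -q -m "add $1"
}

commit_file base alice@example.com
main=$(git symbolic-ref --short HEAD)
git checkout -q -b side
commit_file b bob@example.com           # B, on the side branch
git checkout -q "$main"
commit_file a alice@example.com         # A, on the main line
GIT_AUTHOR_EMAIL=alice@example.com git merge -q --no-ff --no-edit side   # M

# Walk first parents only; the merge is diffed against its first parent.
totals=$(git log --first-parent -m --numstat --pretty="%ae" |
awk -F'\t' '
    NF == 3 { added[author] += $1; deleted[author] += $2; next }  # numstat line
    NF == 1 { author = $0 }                                       # email line
    END { for (a in added) printf "%s +%d -%d\n", a, added[a], deleted[a] }
')
echo "$totals"
```

Every entry now has exactly one email followed by its stats, which makes the output easy to parse. The trade-off is attribution: because B is never visited, the line bob added on the side branch is counted in M's diff against A and so lands under the author of the merge (alice here), while bob does not appear at all.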

torek answered Oct 19 '25 22:10