Here is a benchmark:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_or(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment |= 1;
        batch += increment | 1;
    }
    batch
}
fn benchmark_xor(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment ^= 1;
        batch += increment | 1;
    }
    batch
}
fn benchmark(c: &mut Criterion) {
    let increment = 1;
    let repetitions = 1000;
    c.bench_function("Increment Or", |b| {
        b.iter(|| black_box(benchmark_or(repetitions, increment)))
    });
    c.bench_function("Increment Xor", |b| {
        b.iter(|| black_box(benchmark_xor(repetitions, increment)))
    });
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
The results are:
Increment Or            time:   [271.02 ns 271.14 ns 271.28 ns]
Increment Xor           time:   [79.656 ns 79.761 ns 79.885 ns] 
I get the same result if I replace the OR with an AND.
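For clarity, the AND variant is the same function with only the update line changed (a sketch of that substitution, not code quoted from my benchmark file):

fn benchmark_and(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment &= 1; // the only change: AND instead of OR/XOR
        batch += increment | 1;
    }
    batch
}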
It's quite confusing, as the loop in the OR bench compiles to
.LBB0_5:
        or      edi, 1
        add     eax, edi
        add     rcx, -1
        jne     .LBB0_5
And the XOR bench compiles to basically the same instructions, plus two additional ones:
.LBB1_6:
        xor     edx, 1
        or      edi, 1
        add     eax, edi
        mov     edi, edx
        add     rcx, -1
        jne     .LBB1_6
Full Assembly code
Why is the difference so large?
This part of the XOR function, which you quoted:
.LBB1_6:
        xor     rdx, 1
        or      rsi, 1
        add     rax, rsi
        mov     rsi, rdx
        add     rcx, -1
        jne     .LBB1_6
is only the "tail end" of an unrolled loop. The "meat" (the part that actually runs a lot) is this:
.LBB1_9:
        add     rax, rdx
        add     rdi, 4
        jne     .LBB1_9
rdx is set up to be 4 times increment - in a way that I would describe as "only a compiler could be this stupid", but it only happens outside the loop so it's not a complete disaster. The loop counter is advanced by 4 in every iteration (starting negative and counting up to zero, which is clever, redeeming the compiler somewhat).
This loop could be executed at 1 iteration per cycle, translating to 4 iterations of the source-loop per cycle.
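In Rust terms, the hot part of the XOR version boils down to something like the following sketch (my reading of the codegen, not the compiler's literal output; step stands in for rdx, counter for rdi, and the leftover repetitions % 4 iterations go through the scalar tail quoted above):

fn xor_hot_loop_sketch(repetitions: usize, increment: usize) -> usize {
    let mut batch = 0;
    // rdx: four source iterations' worth of the addend, hoisted out of the loop
    // (the "4 times increment" value described above; `| 1` keeps the low bit set)
    let step = 4 * (increment | 1);
    // rdi: starts negative and counts up by 4 until it hits zero
    let mut counter = -((repetitions & !3) as isize);
    while counter != 0 {
        batch += step; // add rax, rdx: one add covers 4 source iterations
        counter += 4;  // add rdi, 4
    }
    batch // plus whatever the scalar tail contributes
}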
The loop in the function that uses OR is also unrolled; this is the actual "meat" of that function:
.LBB0_8:
        or      rsi, 1
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        add     rdi, 8
        jne     .LBB0_8
It's unrolled by 8, which might have been nice, but chaining lea 4 times like that really takes "only a compiler could be this stupid" to the next level. The serial dependency through the leas costs at least 4 cycles per unrolled iteration, translating to at most 2 iterations of the source-loop per cycle.
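Written out the same way, the OR version's hot loop is roughly this (again only a sketch; step2 stands in for 2 * rsi, and the four dependent additions correspond to the four chained leas):

fn or_hot_loop_sketch(repetitions: usize, increment: usize) -> usize {
    let mut batch = 0;
    let step2 = 2 * (increment | 1); // the amount each lea adds to rax
    // rdi: starts negative and counts up by 8 until it hits zero
    let mut counter = -((repetitions & !7) as isize);
    while counter != 0 {
        // each addition needs the previous value of `batch`, so the four cannot overlap:
        batch += step2; // lea rax, [rax + 2*rsi]
        batch += step2; // lea rax, [rax + 2*rsi]
        batch += step2; // lea rax, [rax + 2*rsi]
        batch += step2; // lea rax, [rax + 2*rsi]
        counter += 8;   // add rdi, 8
    }
    batch
}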
That explains a 2x difference in performance (in favour of the XOR version), not quite your measured 3.4x difference, so further analysis could be done.
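A quick sanity check against the measurements, assuming the throughput limits above and repetitions = 1000: the XOR loop needs about 1000 / 4 = 250 cycles and the OR loop about 1000 / 2 = 500, a predicted 2x; the measured ratio is 271.14 ns / 79.761 ns ≈ 3.4x, so the lea chain alone does not account for everything.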