ARM64 add instruction preferred opcode?

Question

ARM64 has add (extended register), which must be used with the SP register, and add (shifted register), which must be used with the ZR register.

The instruction add X0, X1, X2 matches the syntax for both of these. Apparently, on my machine, GAS chooses to assemble this instruction using add (shifted register).

Is there a reason for this? Does the ARM64 developer documentation specify the expected behavior somewhere (I found preferred disassembly but not assembly)? Or is there a technical reason (one of the two instructions being faster)?

Nate Eldredge · Accepted Answer

I was starting to write this when Siguza's answer appeared. I agree with everything in it.

Yes, the ARM architecture spec requires ADD X0, X1, X2 to assemble to the "shifted register" version, encoding 0x8b020020. It can't be the "extended register" version because, as Siguza mentioned, the extension operator is mandatory when neither the output nor the first input operand is the stack pointer. If you did want the "extended register" encoding, you would have to manually specify an extension. You could use either UXTX or SXTX, neither of which actually performs any extension, but they give you different encodings.

ADD X0, X1, X2, UXTX  -> 0x8b226020
ADD X0, X1, X2, SXTX  -> 0x8b22e020
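
To see where those three words come from, here is a minimal Python sketch that packs the fields of the 64-bit ADD (shifted register) and ADD (extended register) encodings. The function and table names are my own (hypothetical), and it covers only the 64-bit, non-flag-setting forms; it is an illustration of the field layout, not a real assembler.

```python
# Sketch only: 64-bit ADD, no flag setting. Names are hypothetical.
SHIFTS = {"LSL": 0b00, "LSR": 0b01, "ASR": 0b10}
EXTENDS = {"UXTB": 0, "UXTH": 1, "UXTW": 2, "UXTX": 3,
           "SXTB": 4, "SXTH": 5, "SXTW": 6, "SXTX": 7}

def add_shifted(rd, rn, rm, shift="LSL", amount=0):
    """Encode ADD Xd, Xn, Xm{, <shift> #<amount>} (register 31 means XZR)."""
    assert 0 <= amount < 64
    # sf=1, opcode bits 01011, bit 21 = 0 selects the shifted-register class
    return (0x8B000000 | SHIFTS[shift] << 22 | rm << 16
            | amount << 10 | rn << 5 | rd)

def add_extended(rd, rn, rm, extend, amount=0):
    """Encode ADD Xd|SP, Xn|SP, Rm, <extend> {#<amount>} (Rd/Rn 31 means SP)."""
    assert 0 <= amount <= 4
    # Same prefix, but bit 21 = 1 selects the extended-register class
    return (0x8B200000 | rm << 16 | EXTENDS[extend] << 13
            | amount << 10 | rn << 5 | rd)

print(hex(add_shifted(0, 1, 2)))           # ADD X0, X1, X2
print(hex(add_extended(0, 1, 2, "UXTX")))  # ADD X0, X1, X2, UXTX
print(hex(add_extended(0, 1, 2, "SXTX")))  # ADD X0, X1, X2, SXTX
```

The only difference between the two classes at this level is bit 21 and how bits 15:10 (and 23:22) are interpreted, which is why the operand syntax has to decide between them.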

On any given chip, some of these encodings may well be faster than others, but that is outside the scope of the architecture spec. For instance, ADD X0, X1, X2, UXTX (extended register) could be faster than ADD X0, X1, X2; a programmer or compiler that cared about optimal code would need to know that and write or emit assembly accordingly.

For ARM-designed cores, Arm publishes a Software Optimization Guide with instruction timings. Taking Cortex-A76 as an example because I have it handy, the shifted register version is in fact faster, with a latency of 1 cycle and throughput of up to 3 per cycle (it can execute on any of the I0, I1 or M pipelines). This applies when the shift is a left shift and the shift count is at most 4. (A careful reading of the Architecture Reference Manual shows that ADD X0, X1, X2 is specifically equivalent to ADD X0, X1, X2, LSL #0.) The "extended register" ADD instruction has a latency of 2 cycles and throughput of only 1 per cycle; it only executes on the M pipeline. These are also the timings for the "shifted register" instruction with a different shift type or a shift count of more than 4.

So on that specific implementation, ADD X0, X1, X2 would be preferable performance-wise to any of ADD X0, X1, X2, UXTX or ADD X0, X1, X2, SXTX or ADD X0, X1, X2, LSR #0, even though they all have exactly the same architectural effects. But again, for some other chip it could be the other way around.

Generally, ARM's assembly language syntax is designed to ensure that:

For each legal line of assembly code, the manual should specify unambiguously what binary encoding should be emitted.
For every legal binary encoding, there exists at least one line of assembly code that will emit that encoding (there can be more).
For every legal binary encoding, the manual should specify unambiguously how it should be disassembled, and that disassembly will assemble back to the same encoding.

There are lots of redundant encodings in the instruction set: multiple encodings that have identical architectural effects. (I was once bored and found close to 40000 different encodings to zero the X0 register.) But whenever you want a specific one, say for performance, or for a shellcode payload that needs to contain / avoid certain bytes, there's a reliable way to get it.

So whenever you think ambiguity might exist, it's more likely that you just need to read the manual's fine print more carefully.

Contrast x86: for instance, the instruction ADD EBX, ECX can be encoded as either 01 cb (store form) or as 03 d9 (load form). Different x86 assemblers will emit different ones by default, and won't necessarily give you a way to specify the other one. (There are stories about old 8086 assemblers that would secretly "watermark" your code by the way they chose between redundant encodings.) Disassemblers will disassemble both as ADD EBX, ECX without any indication of which encoding it was, so disassembling and then re-assembling a binary file may not give back the original.
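
As a rough illustration of where the two x86 byte sequences come from, the Python sketch below builds the ModRM byte for a register-to-register ADD in both directions. The helper names are hypothetical, and it assumes the default 32-bit operand size (no prefixes).

```python
# Sketch only: register-direct ModRM (mod = 11) for 32-bit registers.
REG32 = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3,
         "ESP": 4, "EBP": 5, "ESI": 6, "EDI": 7}

def modrm_reg_reg(reg, rm):
    """ModRM byte with mod=11: bits 5:3 hold `reg`, bits 2:0 hold `rm`."""
    return 0xC0 | REG32[reg] << 3 | REG32[rm]

def add_r32_r32(dst, src):
    """Return both encodings of ADD dst, src as (store_form, load_form)."""
    # Opcode 01 /r is ADD r/m32, r32: destination goes in the rm field
    store_form = bytes([0x01, modrm_reg_reg(reg=src, rm=dst)])
    # Opcode 03 /r is ADD r32, r/m32: destination goes in the reg field
    load_form = bytes([0x03, modrm_reg_reg(reg=dst, rm=src)])
    return store_form, load_form

store_form, load_form = add_r32_r32("EBX", "ECX")
print(store_form.hex(" "), "|", load_form.hex(" "))
```

Both byte pairs describe the same operation, and the assembly text ADD EBX, ECX carries nothing that would select one over the other.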

Edit: The above is not quite true. One exception involves bitmask immediates for logic instructions. The complicated decoding of the immediate fields in the instruction involves masking off some of their bits in some cases, so it is possible to have multiple legal encodings that yield the same immediate operand value.

An example is 0x92000000 and 0x92200000 which disassemble identically as and x0, x0, #0x100000001 (using GNU binutils 2.44). Conversely, assembling and x0, x0, #0x100000001 with the GNU assembler gives 0x92000000, and there is no apparent way to have it emit 0x92200000 instead.

In the pseudocode for DecodeBitMasks in the Architecture Reference Manual, both encodings have immN = 0 and imms = 000000. This yields len = 5 and levels = 011111. Since r is computed as immr AND levels, either of the values 000000 or 100000 for immr yields an identical decode with r = 0.
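
For anyone who wants to check this without tracing the pseudocode by hand, here is a Python sketch of the immediate decode. It is my own transcription of the logic described above (only the part that produces the logical immediate), with hypothetical function names, so treat it as an approximation of DecodeBitMasks rather than the manual's text.

```python
def decode_bit_masks(imm_n, imms, immr, datasize=64):
    """Decode the immediate of a logical (immediate) instruction."""
    combined = (imm_n << 6) | (~imms & 0x3F)   # immN : NOT(imms)
    length = combined.bit_length() - 1         # HighestSetBit
    if length < 1:
        raise ValueError("reserved encoding")
    levels = (1 << length) - 1                 # ZeroExtend(Ones(len), 6)
    if (imms & levels) == levels:
        raise ValueError("reserved encoding")
    s = imms & levels
    r = immr & levels                          # high bits of immr are masked off
    esize = 1 << length
    if esize > datasize:
        raise ValueError("reserved encoding")
    welem = (1 << (s + 1)) - 1                 # s+1 low bits set
    # Rotate right by r within an esize-bit element
    elem = ((welem >> r) | (welem << (esize - r))) & ((1 << esize) - 1)
    # Replicate the element across the full register width
    value = 0
    for i in range(datasize // esize):
        value |= elem << (i * esize)
    return value

def logical_imm_fields(word):
    """Extract (N, imms, immr) from a logical (immediate) instruction word."""
    return (word >> 22) & 1, (word >> 10) & 0x3F, (word >> 16) & 0x3F

for word in (0x92000000, 0x92200000):
    print(hex(word), hex(decode_bit_masks(*logical_imm_fields(word))))
```

The two words differ only in the top bit of immr, which the `immr & levels` step discards when the element size is 32 bits.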

Siguza · Answer

The manual seems to treat the different variants of add as unambiguous, i.e. non-overlapping.

All quoted passages are from Version L.b of the manual.

It has this to say on add (extended register):

If “Rd” or “Rn” is ’11111’ (SP) and “option” is ’011’ then LSL is preferred, but may be omitted when “imm3” is ’000’. In all other cases <extend> is required and must be UXTX when “option” is ’011’.

This obviously applies to disassemblers, but nowhere does it say that assemblers are free to ignore it.

But what's more interesting is paragraph C1.2.3 Instruction Mnemonics, because it contains the very instruction from your question:

The A64 assembly language overloads instruction mnemonics and distinguishes between the different forms of an instruction based on the operand types. For example, the following ADD instructions all have different opcodes. However, the programmer must remember only one mnemonic, as the assembler automatically chooses the correct opcode based on the operands. The disassembler follows the same procedure in reverse.

Example C1-1 ADD instructions with different opcodes
ADD W0, W1, W2 // add 32-bit register
ADD X0, X1, X2 // add 64-bit register
ADD X0, X1, W2, SXTW // add 64-bit extended register
ADD X0, X1, #42 // add 64-bit immediate

This strongly implies that the operands to add allow for complete disambiguation.

This is further corroborated by the fact that the manual explicitly lays out rules for disambiguation in the case of overlapping operands, such as in C3.2.2 Load/store register (unscaled offset):

The load/store register (unscaled offset) instructions are required to disambiguate this instruction class from the load/store register instruction forms that support an addressing mode of base plus a scaled, unsigned 12-bit immediate offset, because that can represent some offset values in the same range.

The ambiguous immediate offsets are byte offsets that are both:

In the range 0-255, inclusive.

Naturally aligned to the access size.

Other byte offsets in the range -256 to 255 inclusive are unambiguous. An assembler program translating a load/store instruction, for example LDR, is required to encode an unambiguous offset using the unscaled 9-bit offset form, and to encode an ambiguous offset using the scaled 12-bit offset form. A programmer might force the generation of the unscaled 9-bit form by using one of the mnemonics in Table C3-21. Arm recommends that a disassembler outputs all unscaled 9-bit offset forms using one of these mnemonics, but unambiguous offsets can be output using a load/store single register mnemonic, for example, LDR.
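
The selection rule in that passage can be sketched in a few lines of Python. The function name and return strings are hypothetical, and the upper bound for the scaled form (4095 times the access size, from the unsigned 12-bit field) is my addition rather than part of the quoted text.

```python
def ldr_offset_form(offset, access_size):
    """Pick the encoding an assembler should use for LDR with a byte offset.

    access_size is the access size in bytes (1, 2, 4 or 8).
    """
    aligned = offset % access_size == 0
    fits_scaled = aligned and 0 <= offset <= 4095 * access_size
    fits_unscaled = -256 <= offset <= 255
    if fits_scaled:
        # Covers the "ambiguous" offsets (0-255 and naturally aligned):
        # the scaled 12-bit form wins unless an unscaled mnemonic is used.
        return "scaled"
    if fits_unscaled:
        # Negative or misaligned offsets only fit the unscaled 9-bit form.
        return "unscaled"
    raise ValueError("offset not encodable as an immediate")

print(ldr_offset_form(8, 8))    # aligned, in 0-255: ambiguous, so scaled
print(ldr_offset_form(1, 8))    # misaligned: unscaled
print(ldr_offset_form(-8, 8))   # negative: unscaled
```

The point is that the manual gives a deterministic answer for every offset, including the overlapping ones.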

The fact that no such paragraph exists for add implies that Arm considers them unambiguous.

So based on the above, my reading of the manual is that:

While the description of add (extended register) lists {, <extend> {#<amount>}} in curly braces (implying optionality), that only applies if either the first or second operand is sp; in all other cases, it is not optional.
Following that, the only legal opcode choice for add X0, X1, X2 is add (shifted register).

Tags:

assembly

machine-code

arm64

instruction-encoding

alexisrdt
