There are generally two types of SIMD instructions:
A. Ones that require aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
movaps  xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]
B. And ones that work with unaligned memory addresses and do not raise such an exception:
movups  xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]
But I'm just curious, why would I want to shoot myself in the foot and use aligned memory instructions from the first group at all?
Alignment helps the CPU fetch data from memory efficiently: fewer cache misses and flushes, fewer bus transactions, and so on. Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words" and "burst transactions", i.e. many words at one time) in order to yield efficient results.
The alignment of an access refers to the address being a multiple of the transfer size. For example, an aligned 32-bit access will have the bottom 4 bits of the address equal to 0x0, 0x4, 0x8 or 0xC, assuming the memory is byte addressed. An unaligned address is then an address that isn't a multiple of the transfer size.
Unaligned memory accesses occur when you try to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.
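The addr % N test above can be written down directly. The following is a minimal sketch; the helper names is_aligned and is_aligned_pow2 are made up for illustration and are not part of any library. The second helper uses the "bottom bits" view of alignment and assumes N is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when addr is a multiple of n (addr % n == 0). */
static bool is_aligned(uintptr_t addr, size_t n) {
    assert(n != 0);
    return addr % n == 0;
}

/* Hypothetical helper: same test via the low address bits.
   Only valid when n is a power of two (4, 16, 32, 64, ...). */
static bool is_aligned_pow2(uintptr_t addr, size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return (addr & (n - 1)) == 0;
}
```

With these, is_aligned(0x10004, 4) is true and is_aligned(0x10005, 4) is false, matching the example above. For a real pointer p, the address can be obtained with (uintptr_t)p.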
For unaligned access, only movups/vmovups can be used. The same penalties discussed in the aligned access case (see next) apply here too. In addition, accesses that cross a cache line or virtual page boundary always incur a penalty on all processors.
For aligned access on older microarchitectures, movups/vmovups consume more resources (up to twice as much) in the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.
Therefore, if you don't care about the older microarchitectures, both are technically equivalent. However, if you know or expect the data to be aligned, you should use the aligned instructions to ensure that the data is indeed aligned without having to add explicit checks in the code.
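To make the cache line and page boundary case concrete, here is a small sketch that tests whether an access of a given size straddles a boundary. The helper name crosses_boundary is hypothetical, and the 64-byte cache line and 4096-byte page sizes are assumptions chosen for illustration; real values depend on the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration only; query the target for real values. */
enum { CACHE_LINE_BYTES = 64, PAGE_BYTES = 4096 };

/* Hypothetical helper: true when the `size` bytes starting at `addr`
   fall into more than one `boundary`-sized block. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary) {
    assert(size != 0 && boundary != 0);
    uintptr_t first_block = addr / boundary;              /* block of first byte */
    uintptr_t last_block  = (addr + size - 1) / boundary; /* block of last byte  */
    return first_block != last_block;
}
```

Under these assumptions, a 16-byte load at 0x1038 covers bytes 0x1038 through 0x1047 and straddles the line boundary at 0x1040, while a 16-byte load at 0x1030 stays within one line. A load whose address is a multiple of its own size never straddles a 64-byte line for 16-, 32- or 64-byte operands, since 64 is a multiple of each.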
I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later" which can have an impact on performance.
Operations which fold a load and another operation such as multiplication into one instruction can only be done with load, not loadu intrinsics, unless you compile with AVX enabled to allow unaligned memory operands.
Consider the following code:
#include <x86intrin.h>
__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return vx*vy;
}
This gets converted to
movups  xmm0, XMMWORD PTR [rdi]
movups  xmm1, XMMWORD PTR [rsi]
mulps   xmm0, xmm1
however if the aligned load intrinsics (_mm_load_ps) are used, it's compiled to
movaps  xmm0, XMMWORD PTR [rdi]
mulps   xmm0, XMMWORD PTR [rsi]
which saves one instruction. But if the compiler can use VEX-encoded loads, it's only two instructions for the unaligned case as well:
vmovups xmm0, XMMWORD PTR [rsi]
vmulps  xmm0, xmm0, XMMWORD PTR [rdi]
Therefore, for aligned access there is no difference in performance between the movaps and movups instructions on Intel Nehalem and later, Silvermont and later, or AMD Bulldozer and later.
But there can be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled, in cases where the compiler's tradeoff is not movaps vs. movups but movups vs. folding a load into an ALU instruction. (This happens when the vector is used as an input to only one operation; otherwise the compiler will use a mov* load to get the value into a register for reuse.)
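As a rough illustration of the "used once" versus "reused" distinction, the sketch below contrasts the two situations. The function names mul_once and mul_then_add are made up for this example, and the comments describe what a compiler may do, not guaranteed output; the actual code generated depends on the compiler, its version and the flags used. Both functions assume x and y point to 16-byte-aligned data.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_mul_ps, _mm_add_ps */

/* vy is used as an input to exactly one operation. Without AVX, a compiler
   may fold this aligned load into mulps as a memory operand. */
__m128 mul_once(const float *x, const float *y) {
    __m128 vx = _mm_load_ps(x);
    __m128 vy = _mm_load_ps(y);
    return _mm_mul_ps(vx, vy);
}

/* vy is used twice (multiply, then add). Here a compiler will typically
   load it into a register with a mov* load so the value can be reused. */
__m128 mul_then_add(const float *x, const float *y) {
    __m128 vx = _mm_load_ps(x);
    __m128 vy = _mm_load_ps(y);
    return _mm_add_ps(_mm_mul_ps(vx, vy), vy);
}
```

Compiling both functions with and without AVX enabled (for example, -O2 versus -O2 -mavx on GCC or Clang) and comparing the assembly is one way to see whether the load is folded in a given setup.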