I use it like this.
__pld(pin[0], pin[1], pin[2], pin[3], pin[4]);
But I get this error.
undefined reference to `__pld'
What am I missing? Do I need to include a header file or something? I am using an ARM Cortex-A8; does it even support the pld instruction?
As shown in this answer, you can use inline assembler as per Clark.  __builtin_prefetch is also a good suggestion.  An important fact to know is how the pld instruction acts on the ARM: on some processors it does nothing, and on others it brings the data into the cache.  It is only effective for a read operation (or read/modify/write).  Also note that if it does work on your processor, it fetches an entire cache line, so the example of fetching the pin array doesn't need to specify every member.
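A minimal sketch of the two approaches mentioned above, assuming GCC or Clang. The helper names prefetch_read and sum_pins are hypothetical and not part of any library; the inline assembler branch is only compiled when targeting ARM, and it needs a -mcpu that supports pld.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): issue one read prefetch hint. */
static inline void prefetch_read(const void *addr)
{
#if defined(__arm__) && defined(__GNUC__)
    /* Inline assembler form of pld; the "r" constraint places addr in a register. */
    __asm__ volatile("pld [%0]" : : "r"(addr));
#else
    /* GCC/Clang builtin: rw = 0 (read), locality = 3 (high temporal locality). */
    __builtin_prefetch(addr, 0, 3);
#endif
}

/* Hypothetical example: sum five pins after hinting the line that holds them. */
static uint32_t sum_pins(const uint32_t pin[5])
{
    assert(pin != NULL);
    prefetch_read(pin);   /* a single hint covers the cache line containing pin[0] */

    uint32_t total = 0;
    for (size_t i = 0; i < 5; ++i)
        total += pin[i];
    return total;
}
```

Here the hint sits directly before the reads only to keep the example short; to hide any latency it would need to be issued well ahead of the first read.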
You will get more performance by ensuring that the data you pld is cache-line aligned.  Another issue, judging from the previous code, is that you will only gain performance with variables you read.  In some cases, you are just writing to the pin array, and there is no value in prefetching these items.  The ARM has a write buffer, so writes are batched together and will burst to an SDRAM chip automatically.
Grouping all read data together on a cache line will show the most performance improvement; the whole line can be prefetched with a single pld.  Also, when you unroll a loop, the compiler will be able to see these reads and will schedule them earlier if possible so that the data is already in the cache when it is needed; at least for some ARM CPUs.
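As a rough illustration of grouping read data on one line, the sketch below packs the values into a struct aligned to the cache line size. The 64-byte line size, the struct pin_inputs layout, and the count_above function are all assumptions made for the example; check the line size for your CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; adjust for the CPU actually in use. */
#define CACHE_LINE_SIZE 64

/* Hypothetical layout: everything the loop reads lives on a single line. */
struct pin_inputs {
    uint32_t pin[5];
    uint32_t threshold;
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* The aligned attribute pads the struct up to exactly one line. */
_Static_assert(sizeof(struct pin_inputs) == CACHE_LINE_SIZE,
               "pin_inputs should occupy exactly one cache line");

/* Count pins above the threshold; one prefetch hint covers all the inputs. */
static uint32_t count_above(const struct pin_inputs *in)
{
    assert(((uintptr_t)in % CACHE_LINE_SIZE) == 0);
    __builtin_prefetch(in, 0, 3);   /* rw = 0 (read), locality = 3 */

    uint32_t n = 0;
    for (size_t i = 0; i < 5; ++i)
        if (in->pin[i] > in->threshold)
            ++n;
    return n;
}
```

Data that is only written is deliberately left out of the struct, since prefetching it gains nothing.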
Also, you may consider,
 __attribute__((optimize("prefetch-loop-arrays")))
in the spirit of the accepted answer to the other question; probably the compiler will have already enabled this at -O3 if it is effective on the CPU you have specified.
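A sketch of how that attribute might be attached to a single function, assuming GCC (Clang does not implement the optimize attribute, so the macro below expands to nothing there). The PREFETCH_LOOPS macro and sum_array function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The optimize attribute is GCC-specific, so hide it from other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define PREFETCH_LOOPS __attribute__((optimize("prefetch-loop-arrays")))
#else
#define PREFETCH_LOOPS
#endif

/* Hypothetical example: ask GCC to consider prefetches for this loop only. */
PREFETCH_LOOPS
long sum_array(const int *data, size_t n)
{
    assert(data != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

Whether GCC actually emits pld for the loop still depends on the target and tuning; inspecting the generated assembly (for example with -S) is the way to confirm.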
Various compiler options can be specified with --param NAME=VALUE that allow you to give hints to the compiler on the memory sub-system.  This could be a very potent combination, if you get the parameters correct.
prefetch-latency
simultaneous-prefetches
l1-cache-line-size
l1-cache-size
l2-cache-size
min-insn-to-prefetch-ratio
prefetch-min-insn-to-mem-ratio
Make sure you specify a -mcpu to the compiler that supports the pld.  If all is right, the compiler should do this automatically for you.  However, sometimes you may need to do it manually.
For reference, here is the code in gcc-4.7.3's ARM backend that enables prefetch-loop-arrays.
  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we have deemed
     it beneficial (signified by setting num_prefetch_slots to 1 or more.)  */
  if (flag_prefetch_loop_arrays < 0
      && HAVE_prefetch
      && optimize >= 3
      && current_tune->num_prefetch_slots > 0)
    flag_prefetch_loop_arrays = 1;