How to use (read/write) CPU caches L1, L2, L3

I have a task that requires extremely high performance.

Of course I can optimize its algorithm, but I also want to optimize at the hardware level.

I can of course use CPU affinity to dedicate a whole core to the thread that processes my task.
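
For example, I imagine the affinity part looking roughly like the minimal sketch below (assuming Linux and glibc's pthread_setaffinity_np; the core number is arbitrary):

    /* Sketch: pin the current thread to one core so the hot data it touches
     * stays in that core's private L1/L2 caches.  Linux + glibc only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);   /* allow this thread to run only on core_id */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_core(2) != 0)  /* core 2 is an arbitrary choice */
            fprintf(stderr, "could not set CPU affinity\n");
        /* ... run the latency-critical work on this pinned thread ... */
        return 0;
    }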

Another kind of optimization would be to place the data my task requires in the CPU caches (L1, L2, L3), in order to avoid the "RAM access" latency as far as possible.
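
Is something like the following sketch the right direction? (This assumes GCC or Clang on x86-64; as I understand it, __builtin_prefetch is only a hint, so the hardware may ignore it and may still evict the data at any time.)

    /* Sketch: keep the hot array cache-line aligned and ask the CPU to start
     * loading upcoming elements early.  GCC/Clang on x86-64 assumed. */
    #include <stdalign.h>
    #include <stddef.h>

    #define CACHE_LINE 64                         /* typical x86-64 line size */

    struct task_data {
        alignas(CACHE_LINE) double values[4096];  /* hot data, line-aligned */
    };

    double sum_values(const struct task_data *td, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)                       /* stay inside the array */
                /* read hint, high temporal locality -- advisory only */
                __builtin_prefetch(&td->values[i + 16], 0, 3);
            s += td->values[i];
        }
        return s;
    }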

What API can I use for such a development?

(In other words, my question could be: "how do I force the CPU to place a given data structure in the cache?")

Thank you for your help

asked by Philippe MESMEUR
1 Answer

Excellent comment by Peter C about prefetching. As a former optimizer, the first thing we would do to improve code was to remove all the SW prefetching. Also, don't try to muck around with power states and such; they are so good nowadays that the effort isn't worth the gain in HPC. A possible exception is hyper-threading: the only time you'd want to go there would be for certain benchmarking where you need consistency as well as performance.

Take a look at the Intel optimization resources, such as the optimization guide. Also get yourself a good profiler; Intel's VTune is truly one of the best. For info from Intel, use Bing (or Google) to find stuff; Intel's site is, and always has been, a glossy mess. VTune has student and educator licensing.

Here are the steps I used to take when optimizing apps for performance. First, exhaust the higher-level software changes; then get down into tweaking for hardware performance. Why in that order? Two reasons: (1) code changes are generally architecture-independent and have a better chance of surviving a move to a different HW platform or generation, and (2) they are a heck of a lot simpler to do (though perhaps not as fun).

CODE CHANGES:

  1. Remove all SW prefetching.
  2. Replace any polling with periodic interrupts.
  3. Make sure any checking interrupts have appropriate intervals.
  4. Use Fortran. Really. There's a reason Fortran is still alive. Take a look at the Intel Fortran forums; they are all classical HPC. And Intel's Fortran compiler is one of the best.
  5. Use a good optimizing compiler, and play with the compiler settings and pragmas/annotations (e.g. #pragma loop count). Again, Intel's is one of the best. (I hate saying that, but it's true.)
  6. Use a good SW profiler to find optimization opportunities (where most of your time is being spent). Make sure the profiler can dig into the source code to identify the time spent in different functions. Optimize those functions first.
  7. Find opportunities for thread parallelization (multi-threading), properly scoped to the number of cores.
  8. Find opportunities for vectorization.
  9. Convert from AoS (Array of Structs) to SoA (Struct of Arrays). Note that if you have to do the conversion on the fly, it may not be worth the performance cost.
  10. Structure your loops so that they are more conducive to the compiler finding vectorization opportunities. See any good optimization book for how to do this. (A combined sketch of items 7-10 follows this list.)
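
As a rough illustration of items 7-10 together (the struct, field names, and sizes below are made up, and it assumes a compiler with OpenMP and auto-vectorization support, e.g. built with -O3 -fopenmp):

    #include <stddef.h>

    /* Item 9 -- AoS: fields of one element are interleaved in memory, which
     * defeats unit-stride vector loads. */
    struct particle_aos { float x, y, z, mass; };

    /* Item 9 -- SoA: each field is its own contiguous array. */
    struct particles_soa {
        float *x, *y, *z, *mass;
        size_t n;
    };

    /* Items 7, 8, 10 -- a simple, branch-free loop over contiguous data:
     * OpenMP splits the iterations across cores, and the unit-stride body
     * is easy for the compiler to vectorize. */
    void scale_masses(struct particles_soa *p, float factor)
    {
        #pragma omp parallel for simd
        for (size_t i = 0; i < p->n; i++)
            p->mass[i] *= factor;
    }

In the AoS layout, loading just one field also drags the neighboring fields into the cache, wasting cache-line bandwidth; the SoA layout keeps every fetched line full of useful data, which is what makes the loop both vector- and cache-friendly.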

HARDWARE HACKING/OPTIMIZATION (using a good HW-level performance analyzer):

  1. Identify cache and TLB misses, and restructure code (a loop-interchange sketch follows this list).
  2. Identify branch mispredicts and restructure code.
  3. Identify pipeline stalls and restructure code.
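
As an illustration of the first item, one of the most common cache-miss fixes is simply restructuring the traversal order so memory is read in the order it is laid out. A minimal sketch (the function names and sizes are made up):

    #include <stddef.h>

    #define N 1024

    /* Cache-hostile: the inner loop strides N doubles per step through a
     * row-major array, so almost every access touches a new cache line. */
    void column_sums_slow(const double a[N][N], double out[N])
    {
        for (size_t j = 0; j < N; j++) {
            out[j] = 0.0;
            for (size_t i = 0; i < N; i++)
                out[j] += a[i][j];
        }
    }

    /* Restructured: interchanging the loops makes the inner loop walk
     * consecutive addresses, so each fetched cache line is fully used. */
    void column_sums_fast(const double a[N][N], double out[N])
    {
        for (size_t j = 0; j < N; j++)
            out[j] = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                out[j] += a[i][j];
    }

A HW-level profiler will show the difference directly in its cache-miss counters; similar restructurings (blocking/tiling, hoisting unpredictable branches out of hot loops) are the usual answers to the other two items.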

One last suggestion, though I'm sure you already know this: go after the hottest spots first. Smaller opportunities are time-consuming, and their performance improvements have little impact on the overall application.

Best of luck. Optimization can be fun and rewarding (if you are slightly crazy).

answered by Taylor Kidd