In a multi-threaded application running on a recent Linux distributed shared memory system, is there a straightforward way to count, per thread, the number of requests to remote (non-local) NUMA memory nodes?
I am thinking of using PAPI to count interconnect traffic. Is this the way to go?
In my application, threads are bound to a particular core or processor for their entire lifetime. When the application starts, memory is allocated page-wise and spread round-robin across all available NUMA memory nodes.
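For context, the allocation policy I describe can also be imposed from the command line with numactl (assuming the numactl package is installed; `./my_app` is a placeholder for the actual binary):

```shell
# --interleave=all spreads the application's pages round-robin across
# all NUMA nodes, matching the allocation scheme described above.
numactl --interleave=all ./my_app

# Show the node/CPU topology and the current policy, to double-check
# which cores belong to which NUMA node:
numactl --hardware
numactl --show
```

(Per-thread core binding is done inside the application itself, e.g. via pthread_setaffinity_np.)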
Thank you for your answers.
If you have access to VTune, local and remote NUMA node accesses are counted by the hardware events OFFCORE_RESPONSE.ANY_DATA.OTHER_LOCAL_DRAM_0 (fast local NUMA node accesses) and OFFCORE_RESPONSE.ANY_DATA.REMOTE_DRAM_0 (slower remote NUMA node accesses).
How the counters appear in VTune:

(screenshot omitted)

How the counters look in two scenarios:

NUMA-unhappy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 1:

(screenshot omitted)

NUMA-happy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 0:

(screenshot omitted)
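If VTune is not available, a possible alternative is Linux perf, which exposes generic NUMA events (`node-loads`, `node-load-misses`) on many recent CPUs; a sketch of the invocation (`./my_app` is a placeholder, and event availability depends on the CPU and kernel version):

```shell
# node-loads      : loads serviced by DRAM on the local NUMA node
# node-load-misses: loads that had to go to a remote NUMA node
perf stat -e node-loads,node-load-misses ./my_app

# List the NUMA-related events your machine actually supports:
perf list | grep -i node
```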
I found the pcm-numa.x tool that comes with Intel PCM to be quite useful. It tells you the number of times each core has accessed the local or remote NUMA nodes.
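For reference, a typical invocation might look like the following (the sampling-interval argument follows the convention of the other PCM tools; the exact options may differ between PCM versions):

```shell
# Print per-core local/remote DRAM access counts, refreshed every
# second. Root privileges are usually needed to program the PMU/MSRs.
sudo ./pcm-numa.x 1
```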