I am currently trying to get into programming with multiple cores. I want to write/implement a parallel matrix multiplication with C++/Python/Java (I guess Java will be the simplest one).
But one question that I can't answer by myself is how RAM access works with multiple CPUs.
We have two matrices A and B. We want to compute C = A* B:

A parallel execution will only be faster, when n,m or p are big. So suppose n, m and p >= 10,000. For simplicity, just suppose n=m=p=10,000 = 10^4.
We know that we can compute each $c_{i,j}$ withouth looking at other entries of C. So we could compute every c_{i,j} in parallel:

But all c_{1,i} (i \in 1,...,p) need the first line of A. As A is an array with 10^8 doubles, it needs 800 MB. This is definitely bigger than a CPU cache. But one line (80kB) will fit into a CPU cache. So I guess it is a good idea to assign every line of C to exactly one CPU (as soon as a CPU is free). So this CPU will at least have A in its cache and benefit from that.
How is RAM access managed for the different cores (on a normal Intel notebook)?
I guess there has to be one "controller" that gives exclusive access to one CPU at a time. Does this controller have a special name?
By chance, two or more CPUs might need the same information. Can they get it at the same time? Is RAM access a bottleneck for the matrix multiplication problem?
Please let me also know when you know some good books that introduce you to multicore programming (in C++/Python/Java).
You should separate the questions of parallelizing matrix multiplication in a cache friendly way (There are many methods for that - search for "tiling". here's a nice explanation from Berkeley), from the question of how multiple cores share access to some resources such as a shared cache and memory. The first refers to how cache thrashing can be avoided and effective reuse of data can be achieved (on a given cache hierarchy), the latter refers to memory bandwidth utilization. It's true that the two are connected but they're mostly mutually exclusive since good caching would reduce your outbound bandwidth (which is of course desirable for both performance and power). Sometimes however it can't be done, in cases where the data is not reusable or the algorithm can't be modified to fit in the cache. In these cases the memory BW may become your bottleneck, and the different core will just have to share it as best as possible.
Most modern CPUs have multiple cores sharing a last-level cache (i'm not sure this is the case in some smartphone segments, but for notebooks/desktops/servers this usually applies). That cache, in turn, talks to the memory controller (which used to be sitting on a different chip called north-bridge, but since a few years back got integrated into most CPUs for faster access). Through the memory controller, the entire CPU can talk with the DRAM and tell it what to fetch. The MC is usually smart enough to combine accesses such that they require minimal time and effort to fetch (keep in mind that fetching a "page" from the DRAM is a long task, requiring often to first evict the current page buffered in the sense amplifiers).
Note that this structure means that the MC doesn't have to talk with the multiple cores separately, it just fetches the data to the last level cache. The cores also won't need to talk with the memory controller directly as the accesses are filtered through the last level cache (with the few exceptions such as uncacheable accesses that will go past it, and IO accesses that have another controller). All the cores will share that cache storage, in addition to their own private caches.
Now a note about sharing - if 2 (or more) cores need the same data simultaneously, you're lucky - either it's already in the cache (in which case both accesses will be served in turn by sending copies of the data to each core, and marking them as "shared"), or if the data doesn't exist both would wait until the MC can bring it (once), and then continue as with the hit case. However, once exception is if one or more of the cores need to write new data into that line or part of it. In that case the modifier would issue a read request for ownership (RFO), that would prevent sharing the line and invalidate all the copies in the other cores, otherwise you're running the risk of losing cache coherence or consistency (as one core may use stale data or perceive incorrect memory ordering). This is known as a race condition in parallel algorithms and is the reason for complicated locking/fencing mechanisms. Again - note that this is orthogonal to the actual RAM access, and may equally apply for last-level cache accesses.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With