I'm writing a multi-threaded OpenMPI application, using MPI_Isend and MPI_Irecv from several threads to exchange hundreds of messages per second between ranks over InfiniBand RDMA.
Transfers are on the order of 400-800 KB each, generating about 9 Gbps in and out for each rank, well within the capacity of FDR. Simple MPI benchmarks also show good performance.
The completion of the transfers is checked by polling all active transfers using MPI_Testsome in a dedicated thread.
The transfer rates I achieve depend on the message rate, but more importantly also on the polling frequency of MPI_Testsome. That is, if I poll, say, every 10ms, the requests finish later than if I poll every 1ms.
I'd expect that if I poll every 10ms instead of every 1ms, I'd at most be informed of finished requests 9ms later. I would not expect fewer calls to MPI_Testsome to delay the completion of the transfers themselves, and thus slow down the total transfer rates. I'd expect MPI_Testsome to be entirely passive.
Anyone here have a clue why this behaviour could occur?
The observed behaviour is due to the way operation progression is implemented in Open MPI. Posting a send or receive, regardless of whether it is done synchronously or asynchronously, results in a series of internal operations being queued. Progression is basically the processing of those queued operations. There are two modes that you can select at library build time: one with an asynchronous progression thread and one without, the latter being the default.
When the library is compiled with the async progression thread enabled, a background thread takes care of processing the queue. This allows background transfers to proceed in parallel with the user's code but increases the latency. Without async progression, operations are faster but progression can only happen when the user code calls into the MPI library, e.g. while in MPI_Wait or MPI_Test and family. The MPI_Test family of functions is implemented in such a way as to return as fast as possible. That means the library has to balance a trade-off between doing more work in the call, thus slowing it down, or returning quickly, which means fewer operations are progressed on each call.
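To make the effect of manual progression concrete, here is a toy model in plain C. It is not Open MPI code and calls no MPI functions; the names `toy_request`, `toy_test` and `toy_completion_time_us` are made up for illustration. It assumes a transfer is split into a few internal operations (think of a handshake followed by several fragments), that each one takes a fixed time on the wire, and that the next one is only started from inside a test call.

```c
/* Toy model of manual progression. Hypothetical names, not the Open MPI API. */
typedef struct {
    int  steps_left;     /* internal operations that have not finished yet */
    long busy_until_us;  /* finish time of the operation in flight, -1 if none */
} toy_request;

/* Stand-in for the work an MPI_Test-like call does on one request:
 * retire the operation in flight if it has finished, then start the next one.
 * Returns 1 once every internal operation has completed. */
static int toy_test(toy_request *req, long now_us, long step_us)
{
    if (req->busy_until_us >= 0 && now_us >= req->busy_until_us) {
        req->steps_left--;
        req->busy_until_us = -1;
    }
    if (req->steps_left > 0 && req->busy_until_us < 0) {
        /* The next operation only starts here, i.e. only when we are polled. */
        req->busy_until_us = now_us + step_us;
    }
    return req->steps_left == 0;
}

/* Polls every poll_us microseconds (must be > 0) starting at time 0 and
 * returns the time at which the request is first seen as complete. */
long toy_completion_time_us(int steps, long step_us, long poll_us)
{
    toy_request req = { steps, -1 };
    long now_us = 0;
    while (!toy_test(&req, now_us, step_us)) {
        now_us += poll_us;
    }
    return now_us;
}
```

With 4 internal operations of 200 us each, `toy_completion_time_us(4, 200, 100)` gives 800, which is the pure wire time. Polling every 1000 us gives 4000, and polling every 10000 us gives 40000. In this model the delay is not a one-off "informed 9 ms later" but is paid once per internal operation, which is roughly the kind of slowdown described in the question. How many such steps a real transfer needs depends on the library, the transport and the message size, so the numbers are illustrative only.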
Some of the Open MPI developers, notably Jeff Squyres, visit Stack Overflow every now and then. He could possibly provide more details.
This behaviour is hardly specific to Open MPI. Unless heavily hardware-assisted, MPI is usually implemented following the same methods.