Summary:
memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?
Full details:
As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.
I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3GB/sec. With the memcpy enabled, I am limited to about 550MB/sec (using my current compiler).
In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below.) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) and in Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.
Visual C++ 2010: 1900 MB/sec
NI CVI 2009: 550 MB/sec
While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!
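For a sense of the time budget involved, here is a rough sketch of the arithmetic. It assumes 1 GB = 1024 MB and that each 2MB buffer has to be emptied before the next one fills; the function name seconds_for_transfer is made up for illustration.

```c
/* Seconds needed to move size_mb megabytes at rate_mb_per_sec megabytes per second.
   Used both for "how often does a buffer arrive" and "how long does one copy take". */
double seconds_for_transfer(double size_mb, double rate_mb_per_sec)
{
    return size_mb / rate_mb_per_sec;
}
```

Under those assumptions, seconds_for_transfer(2.0, 3072.0) is roughly 651 microseconds between buffers at the 3GB/sec target, while copying one 2MB buffer takes roughly 1.05 ms at 1900 MB/sec and roughly 3.6 ms at 550 MB/sec.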
What, if anything, can I do to make memcpy faster in this situation?
Hardware details:
AMD Magny Cours - 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64
Test program:
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free, rand */
    #include <string.h>   /* memcpy */

    const size_t NUM_ELEMENTS = 2 * 1024 * 1024;
    const size_t ITERATIONS = 10000;

    int main (int argc, char *argv[])
    {
        LARGE_INTEGER start, stop, frequency;

        QueryPerformanceFrequency(&frequency);

        unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
        unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

        /* Fill the source so its pages are actually touched before timing. */
        for(size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
        {
            src[ctr] = (unsigned short) rand();
        }

        QueryPerformanceCounter(&start);

        for(size_t iter = 0; iter < ITERATIONS; iter++)
            memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

        QueryPerformanceCounter(&stop);

        __int64 duration = stop.QuadPart - start.QuadPart;

        double duration_d = (double)duration / (double) frequency.QuadPart;

        /* Total megabytes copied divided by elapsed seconds, so the result is in MB/sec. */
        double mb_sec = (ITERATIONS * (NUM_ELEMENTS/1024/1024) * sizeof(unsigned short)) / duration_d;

        /* ITERATIONS is a size_t, so cast it to match the %d format. */
        printf("Duration: %.5lfs for %d iterations, %.3lfMB/sec\n", duration_d, (int) ITERATIONS, mb_sec);

        free(src);
        free(dest);

        getchar();

        return 0;
    }

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?
With a cold cache, optimized memcpy with write-back cache works best because the cache doesn't have to write to memory and so avoids any delays on the bus. For a garbage-filled cache, write-through caches work slightly better, because the cache doesn't need to spend extra cycles evicting irrelevant data to memory.
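I also had some code that I really needed to speed up, and memcpy was slow for me because it performed checks I did not need. For example, some implementations check whether the destination and source memory blocks overlap and whether the copy should start from the back of the block rather than the front. (Strictly speaking, the C standard only requires that overlap handling of memmove; memcpy has undefined behavior for overlapping blocks, so whether it performs the check depends on the library.)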
memmove() is similar to memcpy() in that it also copies data from a source to a destination. Unlike memcpy(), it is guaranteed to work when the two blocks overlap.
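As a small illustration of when memmove() matters, here is a minimal sketch that shifts data within a single array, so the source and destination overlap. The helper name shift_right is hypothetical, and the caller is assumed to provide a buffer of at least len + shift bytes.

```c
#include <stddef.h>
#include <string.h>

/* Move the first len bytes of buf to start at buf + shift.
   The two regions overlap whenever shift < len, so memmove is used;
   memcpy would have undefined behavior here. */
void shift_right(char *buf, size_t len, size_t shift)
{
    memmove(buf + shift, buf, len);
}
```

For example, with an 8-byte buffer holding "abcdef", shift_right(buf, 6, 2) leaves the bytes "ababcdef" in the buffer.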
A simple loop is slightly faster for about 10-20 bytes or less (it's a single compare+branch; see OP_T_THRES), but for larger sizes, memcpy is faster and portable.
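A minimal sketch of that idea follows: copy byte by byte below a cutoff and fall back to memcpy above it. The name copy_small_or_large and the cutoff of 16 bytes are assumptions chosen for illustration, not values taken from any particular library, and an optimizing compiler may turn the loop into a memcpy call anyway, so measure before relying on it.

```c
#include <stddef.h>
#include <string.h>

#define SMALL_COPY_THRESHOLD 16  /* hypothetical cutoff; tune by measurement */

/* Copy n bytes from src to dest (regions must not overlap).
   Small copies use a plain byte loop; larger ones use the library memcpy. */
void *copy_small_or_large(void *dest, const void *src, size_t n)
{
    if (n <= SMALL_COPY_THRESHOLD) {
        unsigned char *d = (unsigned char *)dest;
        const unsigned char *s = (const unsigned char *)src;
        while (n--)
            *d++ = *s++;
        return dest;
    }
    return memcpy(dest, src, n);
}
```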
I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.
Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?
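To quantify the jump, the figures above can be turned into speedups relative to the single-thread result. This is only a sketch of the arithmetic; the function names speedup and efficiency are made up for illustration.

```c
/* Ratio of a measured rate to the single-thread baseline rate. */
double speedup(double mb_per_sec, double baseline_mb_per_sec)
{
    return mb_per_sec / baseline_mb_per_sec;
}

/* Speedup divided by thread count; 1.0 would be perfectly linear scaling. */
double efficiency(double mb_per_sec, double baseline_mb_per_sec, int threads)
{
    return speedup(mb_per_sec, baseline_mb_per_sec) / threads;
}
```

With the numbers above, 3 threads reach about 2.3 times the single-thread rate, while 4 threads reach about 5.5 times, which is more than linear scaling.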
I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.
    #include <windows.h>
    #include <string.h>   /* memcpy */

    #define NUM_CPY_THREADS 4

    HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};

    typedef struct
    {
        int ct;
        void * src, * dest;
        size_t size;
    } mt_cpy_t;

    mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

    /* Worker: wait for the start semaphore, copy this thread's slice, signal completion. */
    DWORD WINAPI thread_copy_proc(LPVOID param)
    {
        mt_cpy_t * p = (mt_cpy_t * ) param;

        while(1)
        {
            WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
            memcpy(p->dest, p->src, p->size);
            ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
        }

        return 0;
    }

    /* Create the worker threads once, up front; they idle until mt_memcpy is called. */
    int startCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            mtParamters[ctr].ct = ctr;
            hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
        }

        return 0;
    }

    void * mt_memcpy(void * dest, void * src, size_t bytes)
    {
        //set up parameters: each thread gets one contiguous slice of the block
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
        }

        //release semaphores to start computation
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

        //wait for all threads to finish
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

        return dest;
    }

    int stopCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            //abrupt stop; only call this when no copy is in progress
            TerminateThread(hCopyThreads[ctr], 0);
            CloseHandle(hCopyThreads[ctr]);
            CloseHandle(hCopyStartSemaphores[ctr]);
            CloseHandle(hCopyStopSemaphores[ctr]);
        }
        return 0;
    }
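The slice arithmetic used in mt_memcpy can be looked at on its own. The sketch below pulls it into two helpers (chunk_offset and chunk_size are hypothetical names, not part of the code above) to show that the slices are contiguous and add up to exactly the requested size, even when the size is not a multiple of the thread count. It assumes index * bytes does not overflow size_t.

```c
#include <stddef.h>

/* Byte offset where worker `index` (0-based) of `count` workers starts. */
size_t chunk_offset(size_t bytes, size_t index, size_t count)
{
    return index * bytes / count;
}

/* Number of bytes worker `index` copies: the distance to the next worker's offset. */
size_t chunk_size(size_t bytes, size_t index, size_t count)
{
    return chunk_offset(bytes, index + 1, count) - chunk_offset(bytes, index, count);
}
```

For 10 bytes split across 4 workers, the offsets are 0, 2, 5 and 7 and the sizes are 2, 3, 2 and 3, so no byte is skipped or copied twice.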