Is it possible to access hard disk/ flash disk directly from GPU (CUDA/openCL) and load/store content directly from the GPU's memory ?
I am trying to avoid copying stuff from disk to memory and then copying it over to GPU's memory.
I read about Nvidia GPUDirect but not sure if it does what I explained above. It talks about remote GPU memory and disks but the disks in my case are local to the GPU.
Basic idea is to load contents (something like dma) -> do some operations -> store contents back to disk (again in dma fashion).
I am trying to involve CPU and RAM as little as possible here.
Please feel free to offer any suggestions about the design.
For anyone else looking for this, 'lazy unpinning' did more or less what I want.
Go through the following to see if this can be helpful for you.
The most straightforward implementation using RDMA for GPUDirect would pin memory before each transfer and unpin it right after the transfer is complete. Unfortunately, this would perform poorly in general, as pinning and unpinning memory are expensive operations. The rest of the steps required to perform an RDMA transfer, however, can be performed quickly without entering the kernel (the DMA list can be cached and replayed using MMIO registers/command lists).
Hence, lazily unpinning memory is key to a high performance RDMA implementation. What it implies, is keeping the memory pinned even after the transfer has finished. This takes advantage of the fact that it is likely that the same memory region will be used for future DMA transfers thus lazy unpinning saves pin/unpin operations.
An example implementation of lazy unpinning would keep a set of pinned memory regions and only unpin some of them (for example the least recently used one) if the total size of the regions reached some threshold, or if pinning a new region failed because of BAR space exhaustion (see PCI BAR sizes).
Here is a link to an application guide and to nvidia docs.
Trying to use this feature, I wrote a small example on Windows x64 to implement this. In this example, kernel "directly" accesses disk space. Actually, as @RobertCrovella mentioned previously, the operating system is doing the job, with probably some CPU work; but no supplemental coding.
__global__ void kernel(int4* ptr)
{
    int4 val ; val.x = threadIdx.x ; val.y = blockDim.x ; val.z = blockIdx.x ; val.w = gridDim.x ;
    ptr[threadIdx.x + blockDim.x * blockIdx.x] = val ;
    ptr[160*1024*1024 + threadIdx.x + blockDim.x * blockIdx.x] = val ;
}
#include "Windows.h"
int main()
{
    // 4GB - larger than installed GPU memory
    size_t size = 256 * 1024 * 1024 * sizeof(int4) ;
    HANDLE hFile = ::CreateFile ("GPU.dump", (GENERIC_READ | GENERIC_WRITE), 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL) ;
    HANDLE hFileMapping = ::CreateFileMapping (hFile, 0, PAGE_READWRITE, (size >> 32), (int)size, 0) ;
    void* ptr = ::MapViewOfFile (hFileMapping, FILE_MAP_ALL_ACCESS, 0, 0, size) ;
    ::cudaSetDeviceFlags (cudaDeviceMapHost) ;
    cudaError_t er = ::cudaHostRegister (ptr, size, cudaHostRegisterMapped) ;
    if (cudaSuccess != er)
    {
        printf ("could not register\n") ;
        return 1 ;
    }
    void* d_ptr ;
    er = ::cudaHostGetDevicePointer (&d_ptr, ptr, 0) ;
    if (cudaSuccess != er)
    {
        printf ("could not get device pointer\n") ;
        return 1 ;
    }
    kernel<<<256,256>>> ((int4*)d_ptr) ;
    if (cudaSuccess != ::cudaDeviceSynchronize())
    {
        printf ("error in kernel\n") ;
        return 1 ;
    }
    if (cudaSuccess != ::cudaHostUnregister (ptr))
    {
        printf ("could not unregister\n") ;
        return 1 ;
    }
    ::UnmapViewOfFile (ptr) ;
    ::CloseHandle (hFileMapping) ;
    ::CloseHandle (hFile) ; 
    ::cudaDeviceReset() ;
    printf ("DONE\n");
    return 0 ;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With