I am working on parallelising [this file][1] on the GPU using a [PTX file with MATLAB's parallel.gpu.CUDAKernel][2]. My problem is with the [kron tensor product][3]. My code should compute kron(a,b) by multiplying each element of the first vector a=<32x1> by all the elements of the other vector b=<1x32>, so the output has size k<32x32>=a.*b. I wrote it in C++ and it worked. Since I only care about the sum of all the elements of the 2D array, I thought I could simplify it to a 1D array, because m=sum(sum(kron(a,b))) is the code I am working on:
for(i=0;i<32;i++)
 for(j=0;j<32;j++)
   k[i*32+j]=a[i]*b[j];
The idea is for element a[i] to be multiplied by each element of b. I thought I would use 32 blocks of 32 threads each, so the code should be:
__global__ void myKrom(int* c,int* a, int*b) {
  int i=blockDim.x*blockIdx.x+threadIdx.x;
  while(i<32) {
    c[i]=a[blockIdx.x]+b[blockDim.x*blockIdx.x+threadIdx.x];
  }
}
That should do the trick, since blockIdx.x plays the role of the outer loop, but it didn't. Could anybody tell me where I went wrong? I would also like to ask for a parallel way to do the sum.
You may actually mean something like this:
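__global__ void myKrom(int* c, int* a, int* b)
{
  // Global index: blockIdx.x selects the element of a, threadIdx.x the element of b.
  int i=blockDim.x*blockIdx.x+threadIdx.x;
  if(i<32*32){
    // Product, matching k[i*32+j]=a[i]*b[j] in the question's loop.
    c[i]=a[blockIdx.x]*b[threadIdx.x];
  }
}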
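when you launch the kernel with myKrom<<<32, 32>>>(c, a, b); In your version the while loop never changes i, so the threads with i<32 never leave the loop and every other thread writes nothing.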
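If it helps to check the index mapping without a GPU, here is a minimal host-side C++ sketch that emulates the launch with two nested loops. The helper names (kron_emulated, kron_sum_direct, kron_sum_factored) are made up for illustration and are not part of CUDA or MATLAB. It also shows that, since you only need m=sum(sum(kron(a,b))), the sum factors into sum(a)*sum(b), because the sum over i and j of a[i]*b[j] equals the sum of a times the sum of b.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: emulates a launch with a.size() blocks of b.size()
// threads on the host. The outer loop stands in for blockIdx.x and the
// inner loop for threadIdx.x.
std::vector<long long> kron_emulated(const std::vector<int>& a,
                                     const std::vector<int>& b) {
    const std::size_t nBlocks = a.size();   // plays the role of gridDim.x
    const std::size_t nThreads = b.size();  // plays the role of blockDim.x
    std::vector<long long> c(nBlocks * nThreads);
    for (std::size_t block = 0; block < nBlocks; ++block) {
        for (std::size_t thread = 0; thread < nThreads; ++thread) {
            // Same flat index as blockDim.x*blockIdx.x+threadIdx.x.
            const std::size_t i = nThreads * block + thread;
            assert(i < c.size());  // mirrors the kernel's bounds guard
            c[i] = static_cast<long long>(a[block]) * b[thread];
        }
    }
    return c;
}

// Sum of all elements, computed from the full product array.
long long kron_sum_direct(const std::vector<int>& a,
                          const std::vector<int>& b) {
    const std::vector<long long> c = kron_emulated(a, b);
    return std::accumulate(c.begin(), c.end(), 0LL);
}

// Same sum without building the product: sum(a) * sum(b).
long long kron_sum_factored(const std::vector<int>& a,
                            const std::vector<int>& b) {
    const long long sa = std::accumulate(a.begin(), a.end(), 0LL);
    const long long sb = std::accumulate(b.begin(), b.end(), 0LL);
    return sa * sb;
}
```

For your case you would pass two 32-element vectors. If the factored form is acceptable for your use, the GPU only has to reduce two 32-element vectors instead of a 1024-element product, which may make the separate kron kernel unnecessary.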