Wednesday, 18 September 2013

Wrong data loading from global to local memory in CUDA program which uses several kernels

I am writing a CUDA application. The program has 6 kernel functions that
the host calls inside a loop, so it is essentially a simulation that runs
all 6 kernels on every iteration. Inside the kernels I use shared memory,
which I fill with data from global memory. While debugging I ran into a
bizarre problem. When I run the whole simulation with all the kernels,
wrong data is copied from global memory to shared memory, and this already
happens in the 1st kernel. But when I comment out the 4th, 5th and 6th
kernels, recompile and debug, the correct data is copied to shared memory.
I believe this is not an index mapping problem, for two reasons: 1. the
index mapping is fairly simple, and 2. after commenting out those kernels,
the correct data is copied. I think there is something wrong in the third
kernel, but I cannot figure out what could happen between kernels that
would affect the copy into shared memory and the operations on it.
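For reference, here is a minimal sketch of the global-to-shared staging pattern described above, so there is something concrete to compare against. The names `stage_tile` and `count_stage_mismatches` are hypothetical and not taken from the original program, and the 16x16 tile is only assumed from the block size given in the post. Each thread loads one element, waits at `__syncthreads()`, then reads the element loaded by its neighbour. A missing barrier at that point is one common way for shared memory to appear to hold wrong data, though nothing here shows that this is the cause in the original program.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile size, matching a 16x16 thread block

// Hypothetical kernel: stage one TILE x TILE tile of `in` in shared memory,
// then write out the value loaded by the right-hand neighbour (wrapping
// around inside the tile) so that the barrier actually matters.
__global__ void stage_tile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();  // all loads must finish before any thread reads the tile

    out[y * width + x] = tile[threadIdx.y][(threadIdx.x + 1) % TILE];
}

// Runs the kernel on a width x width array (width must be a multiple of TILE).
// Returns the number of wrong output elements, or -1 if a CUDA call failed.
int count_stage_mismatches(int width)
{
    size_t n = (size_t)width * width;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE, 1);
        dim3 grid(width / TILE, width / TILE, 1);
        stage_tile<<<grid, block>>>(d_in, d_out, width);
        err = cudaGetLastError();             // launch-time errors
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // run-time errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // Recompute on the host what each thread should have read.
    int bad = 0;
    for (int y = 0; y < width; ++y) {
        for (int x = 0; x < width; ++x) {
            int nx = (x / TILE) * TILE + (x % TILE + 1) % TILE;
            if (h_out[y * width + x] != h_in[y * width + nx]) ++bad;
        }
    }
    return bad;
}
```

Calling `count_stage_mismatches(48)` from `main` exercises the same 3x3 grid of 16x16 blocks as Kernel #1. Dropping a check like this into the full program, before and after the later kernels are enabled, is one way to see whether the staged data or the input buffer itself is what changes.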
Some more information: my kernels use grid and block structures that
differ from one kernel to another. I do not know whether this could be the
reason:
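#define env_end 48
const int tot_ag = 512;
dim3 gridDim_1(env_end/16, env_end/16, 1);  // 3 x 3 blocks
dim3 blockDim_1(16, 16, 1);                 // 256 threads per block
dim3 gridDim_2(1, tot_ag/32, 1);            // 1 x 16 blocks
dim3 blockDim_2(8, 32, 1);                  // 256 threads per block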
Kernel #1 : <<<gridDim_1,blockDim_1>>>
Kernel #2 : <<<tot_ag/256,256 >>>
Kernel #3 : <<<gridDim_2,blockDim_2>>>
Kernel #4 : <<<gridDim_1,blockDim_1>>>
Kernel #5 : <<<gridDim_2,blockDim_2>>>
Kernel #6 : <<<gridDim_1,blockDim_1>>>
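Because the three launch shapes differ, it may help to confirm how many threads each shape actually starts and to check for errors after every launch. The sketch below is only an illustration under stated assumptions: `count_threads` and `launched_threads` are hypothetical names, and the kernel is a stand-in rather than one of the six kernels in the program. Kernel launches are asynchronous, so an error raised inside one kernel can otherwise surface at a later, unrelated CUDA call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: every launched thread increments one counter.
__global__ void count_threads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Launches count_threads with the given shape and returns how many threads
// ran, or -1 if the launch or any other CUDA call reported an error.
long launched_threads(dim3 grid, dim3 block)
{
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;

    if (cudaMalloc(&d_counter, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaError_t err = cudaMemset(d_counter, 0, sizeof(unsigned int));

    if (err == cudaSuccess) {
        count_threads<<<grid, block>>>(d_counter);
        err = cudaGetLastError();          // invalid launch configuration, etc.
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // errors raised while the kernel ran
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_counter);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return (long)h_counter;
}
```

With the values in the post, `gridDim_1` with `blockDim_1` is 9 blocks of 256 threads (2304 threads, one per cell of a 48 x 48 grid), `<<<tot_ag/256, 256>>>` is 2 blocks of 256 threads (512 threads), and `gridDim_2` with `blockDim_2` is 16 blocks of 256 threads (4096 threads, or 8 per agent if `tot_ag` counts agents). Placing the same `cudaGetLastError()` and `cudaDeviceSynchronize()` pair after each of the six real launches is one way to narrow down which kernel first reports a problem.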
I don't know whether I have provided enough information, as I am not sure
exactly what kind of information is needed to solve this. I cannot even
provide a small example that replicates the situation: I tried to write
one, but it always produced correct results, with the correct data copied
from global to shared memory, and the original program is too long to post
here. I was hoping that someone has faced this problem and solved it, or
could suggest a probable cause or a good procedure for finding it. Please
let me know if any further information is required. Any help would be very
much appreciated. Thank you.
