Friday, 9 August 2013

How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card (CUDA compute capability 3.0) with CUDA
driver 5.5 and runtime 5.0, programming with Visual Studio 2010. My GPU
algorithm runs many parallel breadth-first searches (BFS) of a constant
tree. The threads are independent except for reading from a constant array
and the tree. Each thread may perform some malloc/free operations,
following the BFS algorithm with queues (no recursion). There are N
threads; the number of tree leaf nodes is also N. I use 256 threads per
block and (N+256-1)/256 blocks per grid.
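
For reference, the setup described above could be sketched roughly as follows. This is only a minimal sketch, not my actual code: the names blocksFor, bfsKernel, countFailedAllocs and the queueCapacity parameter are made up for illustration, and the BFS body is a placeholder. It shows the (N+256-1)/256 launch, a bounds guard for the surplus threads in the last block, and a check of the pointer returned by the in-kernel malloc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n threads.
__host__ __device__ inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: one thread per leaf, each with its own heap-allocated queue.
__global__ void bfsKernel(int n, int queueCapacity, int *failedAllocs) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;  // the last block may contain surplus threads

    int *queue = static_cast<int *>(malloc(queueCapacity * sizeof(int)));
    if (queue == NULL) {   // in-kernel malloc returns NULL when the device heap is exhausted
        atomicAdd(failedAllocs, 1);
        return;
    }
    queue[0] = tid;        // placeholder for the real BFS work
    free(queue);
}

// Hypothetical host wrapper: launches the kernel and returns how many
// threads saw a failed allocation, or -1 if a CUDA call failed.
int countFailedAllocs(int n, int queueCapacity) {
    int *dFailed = NULL;
    int hFailed = -1;
    if (cudaMalloc(&dFailed, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dFailed, 0, sizeof(int));

    bfsKernel<<<blocksFor(n, 256), 256>>>(n, queueCapacity, dFailed);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dFailed);
        return -1;
    }
    cudaMemcpy(&hFailed, dFailed, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dFailed);
    return hFailed;
}
```

Calling countFailedAllocs(N, capacity) for increasing N would show whether allocations start failing near the point where the kernel breaks.
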
Now the problem is that the program works for fewer than N=100000 threads
but fails for more than that. It also works on the CPU, or on the GPU when
run one thread at a time. When N is large (e.g. >100000), the kernel
crashes and then the cudaMemcpy from device to host also fails. I tried
Nsight, but it is too slow.
Currently I set "cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456);".
I also tried larger values, up to 1 GB; cudaDeviceSetLimit succeeded, but
the problem remains.
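
A rough sketch of how the heap limit and the kernel status could be checked is given here. The CUDA_CHECK macro and the names kHeapBytes, configureHeap, dummyKernel and launchAndCheck are hypothetical, and dummyKernel stands in for the real kernel. One thing the sketch relies on: the malloc heap size has to be set before the first kernel that uses malloc/free is launched. Reading the limit back with cudaDeviceGetLimit confirms what the runtime actually granted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: prints the CUDA error and returns false.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t e_ = (call);                                    \
        if (e_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(e_));                        \
            return false;                                           \
        }                                                           \
    } while (0)

// 256 MB, the value used in the post.
const size_t kHeapBytes = 256u * 1024u * 1024u;

// Set the device malloc heap size and read back the value in effect.
// Call this before launching any kernel that uses malloc/free.
bool configureHeap(size_t requestedBytes, size_t *actualBytes) {
    CUDA_CHECK(cudaDeviceSetLimit(cudaLimitMallocHeapSize, requestedBytes));
    CUDA_CHECK(cudaDeviceGetLimit(actualBytes, cudaLimitMallocHeapSize));
    return true;
}

// Placeholder standing in for the real BFS kernel.
__global__ void dummyKernel() {}

bool launchAndCheck() {
    dummyKernel<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    return true;
}
```

Checking right after the launch separates a failing kernel from a failing cudaMemcpy, since a later call can report an error left behind by an earlier kernel.
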
Does anyone know of common reasons for this kind of problem, or have any
hints for further debugging? I tried adding some printf calls, but they
produce tons of output. Moreover, once a thread crashes, all remaining
printf output is discarded, so it is hard to identify the problem.
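
One way to cut down the printf volume might be to trace a single chosen thread and to record the first thread that detects a problem in a device-side flag. The sketch below assumes this; shouldTrace, tracedKernel, traceTid, firstBadTid and the workload check are all invented for illustration. It only helps for conditions the kernel detects itself before it faults, because after a real crash the flag cannot be copied back either.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical predicate: only the selected thread prints.
__host__ __device__ inline bool shouldTrace(int tid, int traceTid) {
    return tid == traceTid;
}

// Hypothetical kernel. firstBadTid must be initialised to -1 on the host.
__global__ void tracedKernel(int n, int traceTid, int *firstBadTid) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (shouldTrace(tid, traceTid)) {
        printf("thread %d: starting BFS\n", tid);
    }

    const int capacity = 64;   // assumed fixed queue capacity
    int needed = tid % 128;    // placeholder for the real queue demand
    if (needed > capacity) {
        // Keep only the first reporter; later threads leave the flag alone.
        atomicCAS(firstBadTid, -1, tid);
        return;
    }

    if (shouldTrace(tid, traceTid)) {
        printf("thread %d: finished BFS\n", tid);
    }
}
```

After the kernel, copying firstBadTid back to the host gives one thread index to rerun alone, instead of scanning the output of all N threads.
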
