0

I can't understand why the program below runs correctly for block=(N,1,1) but not for (1,1,N) (the result is "invalid value"), (1,N,1) (the result is 0,1,0,...,0) or (10,50,1) (the result is 0,1,0,...,0), with N=500.

import numpy as np
import pycuda.autoinit  # creates the CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule


N = 500
a_gpu = gpuarray.to_gpu(np.zeros(N).astype(np.int32))

mod = SourceModule("""
#include <cmath>

extern "C" {

__global__ void myfunc(int *a, int N)
{
    int idx = threadIdx.x;   // + blockIdx.x * blockDim.x;

    if (idx < N)
        a[idx] = idx;
}

}
""", no_extern_c=1)

# Call the kernel
func = mod.get_function("myfunc")

func(a_gpu, np.int32(N), block=(N, 1, 1), grid=(1, 1))

a = a_gpu.get()
print("a = ", a)

--------------EDIT----------------------------------------

OK, I forgot that if I use int idx=threadIdx.y, then I can use block(1,N,1).

But then, must I always use the block(N,1,1) arrangement?

I need to understand this. Thank you!

George
  • The third dimension of a block is limited to 64 threads, and the total number of threads in a block to 512 or 1024, depending on your GPU. It is all in the documentation, if you care to read it – talonmies Aug 29 '12 at 16:33
  • I edited my post (about 20 minutes ago) :). What I can't understand is whether I must always use the (N,1,1) arrangement. Thanks! – George Aug 29 '12 at 16:36
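The limits mentioned in the comment above can be checked before launching a kernel. The following is a minimal sketch in plain Python; `block_is_valid` is a hypothetical helper (not part of PyCUDA), and the default limits are an assumption that holds for compute capability 2.x and later (older GPUs allow only 512 threads per block).

```python
# Hypothetical helper, not a PyCUDA function.
# Assumed limits (compute capability 2.x and later):
#   at most 1024 threads per block, per-dimension maxima (1024, 1024, 64).
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)


def block_is_valid(block, max_threads=MAX_THREADS_PER_BLOCK, max_dim=MAX_BLOCK_DIM):
    """Return True if the (x, y, z) block shape fits within the given limits."""
    x, y, z = block
    # Every dimension must be at least 1 and no larger than its own limit.
    within_dims = all(1 <= size <= limit for size, limit in zip(block, max_dim))
    # The product of the dimensions must not exceed the per-block thread limit.
    return within_dims and x * y * z <= max_threads


N = 500
for shape in [(N, 1, 1), (1, N, 1), (1, 1, N), (10, 50, 1)]:
    print(shape, block_is_valid(shape))
```

Under these assumed limits only (1,1,N) is rejected, because 500 exceeds the z limit of 64; this matches the "invalid value" error in the question. On a real device the actual limits can be read from PyCUDA's device attributes (for example `MAX_BLOCK_DIM_Z` and `MAX_THREADS_PER_BLOCK`) instead of being hard-coded.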

2 Answers

1

The first dimension corresponds to threadIdx.x, the second to threadIdx.y and the third to threadIdx.z.

When you launch block(N,1,1), threadIdx.x goes from 0 to N-1, while threadIdx.y and threadIdx.z are always zero.

When you launch block(1,N,1), threadIdx.x is always zero and threadIdx.y goes from 0 to N-1.

So instead of having

idx = threadIdx.x;

change it to

idx = blockDim.x * threadIdx.y + threadIdx.x;

or, in the general case (needed only when using block(X, Y, Z) with Z > 1),

idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x;
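
To see that this formula gives every thread in a block a distinct index regardless of the block shape, the kernel can be emulated on the CPU. This is only a sketch in plain Python: `linear_index` and `run_block` are hypothetical names, and the loop stands in for the threads that the GPU would run in parallel.

```python
from itertools import product


def linear_index(thread_idx, block_dim):
    """Flatten (x, y, z) thread coordinates into one index, x varying fastest."""
    tx, ty, tz = thread_idx
    bx, by, _ = block_dim
    return (by * tz + ty) * bx + tx


def run_block(block_dim, n):
    """Emulate one block of the kernel: each thread writes a[idx] = idx."""
    a = [0] * n
    bx, by, bz = block_dim
    # Visit every (x, y, z) thread coordinate in the block exactly once.
    for tz, ty, tx in product(range(bz), range(by), range(bx)):
        idx = linear_index((tx, ty, tz), block_dim)
        if idx < n:
            a[idx] = idx
    return a


N = 500
for shape in [(N, 1, 1), (1, N, 1), (10, 50, 1), (5, 10, 10)]:
    # Every shape with N threads fills the array with 0, 1, ..., N-1.
    assert run_block(shape, N) == list(range(N))
```

Each of the shapes above contains exactly N threads, so the emulated kernel fills the whole array whichever way the threads are arranged. The block does not have to be (N,1,1), as long as the index is computed from all three thread coordinates.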
ashokk
  • Your last example doesn't work correctly. I found that idx=threadIdx.x+threadIdx.y*blockDim.x+threadIdx.z*blockDim.x*blockDim.y; works. And thanks, I now understand how it should work. Maybe you could help me with this? http://stackoverflow.com/questions/12159709/cuda-out-of-memory-threads-and-blocks-issue – George Aug 30 '12 at 10:59
  • @george, thanks for pointing it out. I had blockIdx instead of blockDim. Fixed the code now. – ashokk Aug 31 '12 at 00:21
-2

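The third value is limited to a small number (64, according to the CUDA documentation), so (1,1,N) with N=500 cannot be launched.

You should be able to use (1,N,1), provided the kernel indexes with threadIdx.y.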

widgg