:) While I was trying to manage my kernel resources I decided to look into PTX but there are a couple of things that I do not understand. Here is a very simple kernel I wrote:
__global__
void foo(float* out, float* in, uint32_t n)
{
uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
uint32_t one = 5;
out[idx] = in[idx]+one;
}
Then I compiled it using: nvcc --ptxas-options=-v -keep main.cu and I got this output on the console:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z3fooPfS_j' for 'sm_10'
ptxas info : Used 2 registers, 36 bytes smem
And the resulting ptx is the following:
.entry _Z3fooPfS_j (
.param .u64 __cudaparm__Z3fooPfS_j_out,
.param .u64 __cudaparm__Z3fooPfS_j_in,
.param .u32 __cudaparm__Z3fooPfS_j_n)
{
.reg .u16 %rh<4>;
.reg .u32 %r<5>;
.reg .u64 %rd<8>;
.reg .f32 %f<5>;
.loc 15 17 0
$LDWbegin__Z3fooPfS_j:
.loc 15 21 0
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
cvt.u64.u32 %rd1, %r3;
mul.wide.u32 %rd2, %r3, 4;
ld.param.u64 %rd3, [__cudaparm__Z3fooPfS_j_in];
add.u64 %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+0];
mov.f32 %f2, 0f40a00000; // 5
add.f32 %f3, %f1, %f2;
ld.param.u64 %rd5, [__cudaparm__Z3fooPfS_j_out];
add.u64 %rd6, %rd5, %rd2;
st.global.f32 [%rd6+0], %f3;
.loc 15 22 0
exit;
$LDWend__Z3fooPfS_j:
} // _Z3fooPfS_j
Now there are some things that I don't understand:
- According to the ptx assembly 4+5+8+5=22 registers are used. Then why it says
used 2 registersduring the compilation? - Looking at the assembly I realised that the data type of threadId, blockId etc is
u16. Is this defined in the CUDA specification? Or this may vary between different versions of the CUDA driver? - Can someone explain to me this line:
mul.wide.u16 %r1, %rh1, %rh2;?%r1isu32, whywideinstead ofu32is used? - How are the names of the registers chosen? In my vase I understand the
%rpart but I don't understand theh,(null),dpart. Is it chosen based on the data type length? ie:hfor 16bit, null for 32bit,dfor 64bit? - If I replace the last 2 lines of my kernel with this
out[idx] = in[idx];, then when I compile the program it says that 3 registers are used! How is it possible to use more registers now?
Please ignore the fact that my test kernel does not check if the array index is out of bounds.
Thank you very much.