Confusion with CUDA PTX code and register memory

Question

:) While I was trying to manage my kernel resources I decided to look into PTX but there are a couple of things that I do not understand. Here is a very simple kernel I wrote:

__global__
void foo(float* out, float* in, uint32_t n)
{
    uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t one = 5;
    out[idx] = in[idx]+one;
}

Then I compiled it using: nvcc --ptxas-options=-v -keep main.cu and I got this output on the console:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z3fooPfS_j' for 'sm_10'
ptxas info    : Used 2 registers, 36 bytes smem

And the resulting ptx is the following:

    .entry _Z3fooPfS_j (
            .param .u64 __cudaparm__Z3fooPfS_j_out,
            .param .u64 __cudaparm__Z3fooPfS_j_in,
            .param .u32 __cudaparm__Z3fooPfS_j_n)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<5>;
    .reg .u64 %rd<8>;
    .reg .f32 %f<5>;
    .loc    15  17  0
$LDWbegin__Z3fooPfS_j:
    .loc    15  21  0
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.u32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    cvt.u64.u32     %rd1, %r3;
    mul.wide.u32    %rd2, %r3, 4;
    ld.param.u64    %rd3, [__cudaparm__Z3fooPfS_j_in];
    add.u64     %rd4, %rd3, %rd2;
    ld.global.f32   %f1, [%rd4+0];
    mov.f32     %f2, 0f40a00000;        // 5
    add.f32     %f3, %f1, %f2;
    ld.param.u64    %rd5, [__cudaparm__Z3fooPfS_j_out];
    add.u64     %rd6, %rd5, %rd2;
    st.global.f32   [%rd6+0], %f3;
    .loc    15  22  0
    exit;
$LDWend__Z3fooPfS_j:
    } // _Z3fooPfS_j

Now there are some things that I don't understand:

According to the ptx assembly 4+5+8+5=22 registers are used. Then why it says used 2 registers during the compilation?
Looking at the assembly I realised that the data type of threadId, blockId etc is u16. Is this defined in the CUDA specification? Or this may vary between different versions of the CUDA driver?
Can someone explain to me this line: mul.wide.u16 %r1, %rh1, %rh2;? %r1 is u32, why wide instead of u32 is used?
How are the names of the registers chosen? In my vase I understand the %r part but I don't understand the h,(null),d part. Is it chosen based on the data type length? ie: h for 16bit, null for 32bit, d for 64bit?
If I replace the last 2 lines of my kernel with this out[idx] = in[idx];, then when I compile the program it says that 3 registers are used! How is it possible to use more registers now?

Please ignore the fact that my test kernel does not check if the array index is out of bounds.

Thank you very much.

(1) PTXAS is the compiler component that translates PTX to machine code. Thus the register count from -Xptxas -v pertains to the physical registers used in the machine code (you can inspect it with cuobjdump --dump-sass). PTX is an intermediate language that uses virtual registers. Since the PTX code is generated in SSA (single static assignment) form, every new result written is assigned a new virtual register number. (2) mul.wide is described in the PTX specification (which is part of CUDA documentation). In this case it multiplies two u16 operands, giving a u32 result (ie. the full product) — njuffa, Jun 07 '13 at 03:50

njuffa · Accepted Answer · 2013-06-07T13:08:57.450

PTX is an intermediate language that is designed to be portable across multiple GPU architectures. It gets compiled by the compiler component PTXAS into final machine code, also refered to as SASS, for a particular architecture. The nvcc option -Xptxas -v causes PTXAS to report various statistics about the generated machine code, including the number of physical registers used in the machine code. You can inspect the machine code by disassembling it with cuobjdump --dump-sass.

So the number of registers one sees used in PTX code has no significance, since these are virtual registers. The CUDA compiler generates PTX code in what is known as SSA form (static single assignment, see http://en.wikipedia.org/wiki/Static_single_assignment_form). This basically means that each new result written is assigned a new register.

The instruction mul.wide is described in the PTX specification, the current version of which (3.1) you can find here: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html . In your example code, the suffix .u16 means that it multiplies two unsigned 16-bit quantities and returns an unsigned 32-bit result, i.e. it computes the full, double-width product of the source operands.

Virtual registers in PTX are typed, but their names can be chosen freely, independent of type. The CUDA compiler appears to follow certain conventions that are (to my knowledge) not documented since they are internal implementation artifacts. Looking at a bunch of PTX code it is clear that the register names currently generated encode type information, this may be done for ease of debugging: p<num> is used for predicates, r<num> for 32-bit integers, rd<num> for 64-bit integers, f<num> for 32-bit floats, and fd<num> for 64-bit doubles. You can easily see this for yourself by looking at the .reg directives in the PTX code that create these virtual registers.

Confusion with CUDA PTX code and register memory

1 Answers1

Linked