Understanding CUDA kernel stack usage and register spilling

Question

I am trying to fully understand the information that ptxas -v reports about CUDA kernel stack usage and register spilling (for the sm_35 architecture). For one of my kernels it produces:

    3536 bytes stack frame, 3612 bytes spill stores, 6148 bytes spill loads
    ptxas info    : Used 255 registers, 392 bytes cmem[0]

I know that the stack frame is allocated in local memory, which is private to each thread and physically resides in the same device memory as global memory.

My questions are:

1. Is the memory needed for register spilling also allocated in local memory?
2. Is the total amount of memory needed for register spilling and stack usage equal to [number of threads] x [3536 bytes]? In other words, do the register spill loads/stores operate on the stack frame?
3. The number of spill stores/loads doesn't indicate the size of the transfers. Are these always 32-bit registers, so that spilling a 64-bit floating-point number would be counted as 2 spill stores?
4. Are spill stores/loads cached in the L2 cache?

May be partially answered here: http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v — njuffa, Sep 28 '13 at 16:22
I read the thread. It's partially answered. But this does not justify to close this question. See comment below. — ritter, Sep 30 '13 at 13:27

score 3 · Accepted Answer · edited Sep 29 '13 at 09:32

1. Registers are spilled to local memory. "Local" here means "thread-local", i.e. storage private to each thread.
2. The amount of local memory required for the entire launch is at least number_of_threads times local_memory_bytes_per_thread. Due to allocation granularity, it can often be more.
3. The compiler statistics for spill transfers are already normalized to bytes, because individual local memory accesses may have different widths. Inspecting the generated machine code (run cuobjdump --dump-sass on the binary) will show the width of each individual access. The relevant instructions have names like LLD, LST, LDL, STL.
4. I am reasonably sure that local memory accesses are cached in the L1 and L2 caches, but I cannot quote the relevant paragraphs from the documentation at this time.
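
The lower bound on local memory stated above (number_of_threads times local_memory_bytes_per_thread) can be worked through with a small host-side sketch. The function name and the launch configuration in the usage note are hypothetical, chosen only for illustration. Treating the 3536-byte stack frame as the per-thread local memory figure is also an assumption, since whether spill space is included in that number is not settled here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: lower bound on the local memory needed for one launch,
// computed as number_of_threads * local_memory_bytes_per_thread.
// The real allocation can be larger because of allocation granularity.
std::uint64_t min_local_memory_bytes(std::uint64_t blocks,
                                     std::uint64_t threads_per_block,
                                     std::uint64_t local_bytes_per_thread) {
    // sm_35 allows at most 1024 threads per block.
    assert(threads_per_block >= 1 && threads_per_block <= 1024);
    const std::uint64_t number_of_threads = blocks * threads_per_block;
    return number_of_threads * local_bytes_per_thread;
}
```

For an assumed launch of 64 blocks of 256 threads, min_local_memory_bytes(64, 256, 3536) gives 57,933,824 bytes (55.25 MiB) as the minimum.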

Please answer whether register spill space is included in the stack frame reported or goes on top of it (question 2). I will then be happy to accept your answer. — ritter, Sep 30 '13 at 13:26
