I am trying to fully understand what the output of ptxas -v tells me about kernel stack usage and register spilling (for the sm_35 architecture). For one of my kernels it produces:
3536 bytes stack frame, 3612 bytes spill stores, 6148 bytes spill loads
ptxas info : Used 255 registers, 392 bytes cmem[0]
I know that the stack frame is allocated in local memory, which is private to each thread and physically resides in the same device memory as global memory.
My questions are:
- Is the memory needed for register spillage also allocated in local memory?
- Is the total amount of memory needed for register spilling and stack usage equal to [number of threads] x [3536 bytes]? In other words, do the register spill loads/stores operate on the stack frame?
- The spill store/load figures don't say anything about the size of the individual transfers. Are these always 32-bit registers? If so, would spilling a 64-bit floating-point number be counted as 2 spill stores?
- Are spill stores/loads cached in the L2 cache?