I recently read through the PTX code that NVCC generated for a CUDA kernel. I noticed that many registers are used only to store a single intermediate value and are never referenced again. NVCC generally does not seem to care about register reuse and instead uses a new register almost every time a new value is produced.
This raises the question: is it worth manually going over the PTX code to minimize register use, or is that something the PTX VM handles at runtime anyway?
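
For reference, here is a minimal sketch of how the register count in the PTX could be compared with the count the compiled kernel actually uses. The kernel `scale_add` and the helper `registers_per_thread` are hypothetical names made up for illustration (this is not the kernel I inspected); the sketch assumes the CUDA runtime API and a usable device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with several single-use intermediate values,
// similar in spirit to the pattern described above.
__global__ void scale_add(const float* x, const float* y, float* out,
                          float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t0 = a * x[i];   // intermediate, used twice
        float t1 = t0 + y[i];  // intermediate, used once
        float t2 = t1 * t1;    // intermediate, used once
        out[i] = t2 + t0;
    }
}

// Hypothetical helper: returns the number of registers per thread that the
// compiled kernel uses, or -1 if the attributes cannot be queried
// (for example, when no CUDA device is available).
int registers_per_thread() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale_add);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}

int main() {
    printf("registers per thread: %d\n", registers_per_thread());
    return 0;
}
```

Compiling the same file with `nvcc -ptx` shows the registers declared in the PTX, while `nvcc -Xptxas -v` prints the per-thread register usage reported by `ptxas`, so the two numbers can be compared directly.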