
I recently read through the generated PTX code of a CUDA kernel. I realized that many registers are used just to store an intermediate value and are then never used again, and that NVCC generally seems not to care much about register re-use, instead opting to use a new register at pretty much any point where new data is created.
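The pattern is easy to reproduce with a toy kernel (the file name `axpy.cu` is made up for illustration). Compiling it with `nvcc -ptx axpy.cu` yields PTX in which each intermediate result lands in a fresh virtual register; the exact PTX varies with toolkit version and target architecture:

```cuda
// axpy.cu -- minimal kernel to illustrate the observation.
// The PTX nvcc emits for the body is SSA-like: the index
// computation, the loads, the multiply and the add each write
// a brand-new virtual register (%r*, %rd*, %f*) that is
// typically never written again.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```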

This raises the question: is it worth manually going over the PTX code and trying to minimize the register use, or is that something the PTX VM handles at runtime anyway?

  • Doesn't PTX have to get compiled for the actual hardware's instruction set anyway? That compiler is probably going to do lifetime analysis on every value, probably in terms of SSA (https://en.wikipedia.org/wiki/Static_single-assignment_form), so I'd assume that optimization step would end up with separate names for each new value a register takes even if you did your proposed transformation. (NVCC probably generated the PTX from an SSA representation.) This is guesswork based on background knowledge of compilers/optimizers, with no experience with CUDA or PTX, but I'd guess this is a non-problem. – Peter Cordes Jul 20 '22 at 18:32
  • It is not a problem as long as the lower-level (LL) compiler (AFAIK translating the PTX code to SASS) can do a proper LL register allocation. Reducing the number of high-level registers ahead of time is actually a problem for future architectures that could use more registers than what was fixed at the PTX level. Register allocation is a very basic part of any compiler, so I would be very surprised if the low-level compiler did not do it (AFAIK it is closed-source, so nobody can say whether this is the case except Nvidia developers, who are probably bound by an NDA). – Jérôme Richard Jul 20 '22 at 18:36
  • @PeterCordes yes it gets compiled further into something architecture-specific – Niels Slotboom Jul 20 '22 at 18:52
  • https://stackoverflow.com/q/16975727/681865 – talonmies Jul 20 '22 at 23:13
  • @JeromeRichard The SASS format is relatively well known. It contains the physical register allocation and even the control of register reuse cache to improve upon the register banks. – Sebastian Jul 25 '22 at 06:16

1 Answer


This raises the question: is it worth manually going over the PTX code and trying to minimize the register use

No. Nvcc generates static single assignment code deliberately.

or is that something the PTX VM handles at runtime anyway?

There is no such thing as a “PTX VM”. PTX is always compiled into shader assembler that runs on the hardware. Register allocation and usage optimisation is done statically by the assembler from the PTX code; this can happen either as part of an nvcc invocation or within the GPU driver itself at runtime.
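One way to see this division of labour yourself is to run the assembler directly. The following is a sketch, assuming the CUDA toolkit is installed and on the PATH; `axpy.cu`, `axpy.ptx` and the `sm_80` target are made-up placeholders:

```shell
# 1. Emit the SSA-style PTX that nvcc generates:
nvcc -arch=sm_80 -ptx axpy.cu -o axpy.ptx

# 2. Let ptxas do the actual register allocation; with -v it
#    reports a "Used N registers" line, which reflects physical
#    registers after allocation, not the many virtual %r/%f
#    names visible in the PTX:
ptxas -arch=sm_80 -v axpy.ptx -o axpy.cubin

# 3. Inspect the final SASS with its physical register assignment:
cuobjdump --dump-sass axpy.cubin
```

Comparing the virtual register count in the PTX with the `ptxas -v` report makes it clear that minimizing registers in the PTX by hand would be wasted effort.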

talonmies
  • An optimizing translator from one instruction-set to another would normally be called a compiler, not an assembler. But I guess if there's established terminology, one should use it. By comparison, LLVM's `llvm-as` is an assembler which reads LLVM-IR (an explicitly SSA assembly language for LLVM bitcode, target-neutral) and assembles it into a binary representation of the same target-independent instructions. – Peter Cordes Jul 20 '22 at 23:11
  • The tool is specifically referred to as an assembler by Nvidia, the executable is `ptxas` in the CUDA toolkit. The wording in the answer was carefully chosen – talonmies Jul 20 '22 at 23:15
  • Yup, I figured that must be the case. Thanks for the details on what CUDA calls things. – Peter Cordes Jul 20 '22 at 23:20