I'm slightly confused by various statements around energy / power of PCM audio. Specifically, I came across two seemingly contradicting statements in these two answers:
Answer 1 states: [...] you can formally sum the squares of the samples and call it energy [...].
$ E = \sum\limits_{n=0}^{N-1} s[n]^2 $
Answer 2 states: To measure the energy [...] calculate the RMS (Root-Mean-Square).
$ E = \sqrt{\frac{\sum\limits_{n=0}^{N-1} s[n]^2}{N}} $
Note that the first answer also addresses the question of whether this definition directly relates to any physical energy. Let's ignore that aspect and focus on which is the more sensible approximation (presumably under the assumption of a purely resistive load and a linear audio system).
My attempt to make sense of the two contradicting definitions
I'd assume that the second answer slightly mixes up average power and energy. The RMS seems to be related to an approximation of the average power over a certain time interval, because it uses a mean, not a sum. So to go from power to energy, one would have to multiply by time again (or here by the number of samples $N$).
Trying to find the source of the relationship between RMS and average power, I came across this section on audio power on Wikipedia. At least for a steady sinusoidal tone, the average electrical power can be approximated by:
$ P_\mathrm{avg} = \frac{{V_\mathrm{RMS}}^2}{R} $
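To convince myself of this relationship, here is a minimal numerical sketch in Python. The amplitude, load resistance, and sample count are values I made up for illustration, and the function names are my own. It compares the mean of the instantaneous power $v[n]^2 / R$ with $V_\mathrm{RMS}^2 / R$ for a sampled sine covering a whole number of cycles:

```python
import math

def average_power_from_samples(v, R):
    """Mean of the instantaneous power v[n]^2 / R over all samples."""
    return sum(x * x / R for x in v) / len(v)

def rms(v):
    """Root mean square of the samples."""
    return math.sqrt(sum(x * x for x in v) / len(v))

# Assumed example values (not from either answer):
A = 1.0       # peak amplitude in volts
R = 8.0       # purely resistive load in ohms
N = 1000      # number of samples
cycles = 10   # whole number of sine cycles inside the N samples

v = [A * math.sin(2 * math.pi * cycles * n / N) for n in range(N)]

p_direct = average_power_from_samples(v, R)   # mean of v[n]^2 / R
p_from_rms = rms(v) ** 2 / R                  # V_RMS^2 / R
p_expected = A * A / (2 * R)                  # textbook value for a sine

print(p_direct, p_from_rms, p_expected)
```

All three values should agree up to floating-point rounding, which at least suggests that the mean square (not the RMS itself) is the quantity proportional to average power.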
So when squaring the RMS from Answer 2 and accounting for the "time integration" (multiplying by $N$), we get back the same result as Answer 1:
$ E ~=~ N \cdot \mathrm{RMS}^2 ~=~ N \cdot \left( \sqrt{\frac{\sum\limits_{n=0}^{N-1} s[n]^2}{N}} \right)^2 ~=~ \sum\limits_{n=0}^{N-1} s[n]^2 $
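The identity can also be checked numerically. The following is only a small sketch with arbitrary sample values I picked for illustration, and the helper names `energy` and `rms` are mine:

```python
import math

def energy(s):
    """Sum of squared samples (the definition from Answer 1)."""
    return sum(x * x for x in s)

def rms(s):
    """Root mean square (the definition from Answer 2)."""
    return math.sqrt(energy(s) / len(s))

# Arbitrary example samples, assumed to be normalized PCM values
s = [0.5, -0.25, 0.75, -1.0, 0.0, 0.25]

E = energy(s)                       # sum of squares
E_from_rms = len(s) * rms(s) ** 2   # N * RMS^2

print(E, E_from_rms)
```

Both numbers match up to floating-point rounding, so the two definitions differ only by the square root and the factor $N$.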
Now this leads me to my actual question: If the approximation of average power is based on the square of the RMS, why do we take the root in the first place? Why isn't it more common to use the mean square directly, if that is more closely related to physical power/energy? I suspect there is a flaw in my reasoning somewhere.