To what degree can the brain move resources from the "what" to the "when" to achieve a precise level of timing for conversion between sensory input and motor output?
I think that you may be making a false assumption here. I don't think that the 'what' and the 'when' are in competition for resources. According to Arnal and Giraud (and many others), a valid prediction of when an event is going to occur and a valid prediction of which event is going to occur both lead to reduced primary sensory activity (and to a shortened reaction time). As such, you could say that the resources that go into making both types of prediction (the 'when' and the 'what') reduce the total resources used. In the opposite situation, where the prediction is invalid, there is not only more total neural activity in primary sensory areas, but stimulus representations are also less sharp (Kok et al., 2012).
Furthermore, the differences seen in frequency space that are related to 'what' and 'when' predictions, as well as the fact that 'when' predictions seem to have a necessary stimulus-driven component while 'what' predictions do not, also point to a possible dissociation in the resources used. In other words, it might not be the same neurons firing anyway. As an example, Wacongne et al. (2012) proposed a model of how the feedforward and feedback portions of predictive coding (of 'what' predictions) might work using different neurons within a single cortical column.
Finally, fewer resources do not necessarily translate to better sensory-motor conversion: selectively attending to a stimulus increases the amount of sensory brain activity, but decreases reaction times. It seems that it might boil down to how sharp the stimulus representation is (which is connected to valid/invalid 'what' predictions), followed by how strongly these neurons are firing (which is connected to selective attention). Squandering resources would mean having the wrong neurons firing, rather than firing more vs. less in a given sensory area. Once again, this would mean that forming any type of prediction leads to better use of total resources.
I'm still not sure if this answers your question, as the examples you give (e.g. drum beat) contain certainty on both the 'what' (sound of drum) and the 'when' (time of sound onset). If you want to dissociate the two, you could have a situation where you know what is going to come (sound of drum) but with a jittered stimulus onset asynchrony. Or, you could have precise timing with novel sounds (a sound of a different pitch is presented every half a second). In the first case your predictions would be mainly about the 'what', and in the second mainly about the 'when', but in both cases hearing any of these sounds would carry an element of surprise, and would therefore lead to more primary sensory activity and a longer RT, i.e. more resources would be used.
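To make the two dissociation designs concrete, here is a minimal sketch of how the trial schedules might be generated. Everything in it is my own illustrative assumption rather than something taken from the cited papers: the function names, the ±0.2 s jitter range, the 200-2000 Hz pitch range, and the use of a plain label for the drum sound. Only the half-second interval comes from the example above.

```python
import random

def jittered_onsets(n_trials, mean_soa=0.5, jitter=0.2, seed=0):
    """Onset times (s) with an unpredictable SOA drawn uniformly
    from [mean_soa - jitter, mean_soa + jitter] (assumed range)."""
    rng = random.Random(seed)
    onsets, t = [], 0.0
    for _ in range(n_trials):
        t += rng.uniform(mean_soa - jitter, mean_soa + jitter)
        onsets.append(t)
    return onsets

def fixed_onsets(n_trials, soa=0.5):
    """Onset times (s) with a perfectly regular SOA."""
    return [soa * (i + 1) for i in range(n_trials)]

def novel_pitches(n_trials, low_hz=200.0, high_hz=2000.0, seed=0):
    """A different, unpredictable pitch (Hz) on every trial (assumed range)."""
    rng = random.Random(seed)
    return [rng.uniform(low_hz, high_hz) for _ in range(n_trials)]

def what_known_when_uncertain(n_trials, seed=0):
    """Same drum sound every time, but jittered timing."""
    onsets = jittered_onsets(n_trials, seed=seed)
    return [(t, "drum") for t in onsets]

def when_known_what_uncertain(n_trials, seed=0):
    """Regular timing, but a novel pitch on every trial."""
    onsets = fixed_onsets(n_trials)
    pitches = novel_pitches(n_trials, seed=seed)
    return list(zip(onsets, pitches))

if __name__ == "__main__":
    for onset, sound in what_known_when_uncertain(5):
        print(f"{onset:.3f} s  {sound}")
    for onset, pitch in when_known_what_uncertain(5):
        print(f"{onset:.3f} s  {pitch:.1f} Hz")
```

In the first schedule only the identity of the sound is predictable, and in the second only its timing, which is the dissociation described above.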
References:
Wacongne C, Changeux JP, Dehaene S (2012) A Neuronal Model of Predictive Coding Accounting for the Mismatch Negativity. The Journal of Neuroscience 32:3665-3678.
Kok P, Jehee JFM, de Lange FP (2012) Less is more: Expectation sharpens representations in the primary visual cortex. Neuron 75:265-270.