5

Q-Learning (Watkins, 1989) uses a single function both to estimate the value of actions and to choose the next action. Double Q-Learning (Hasselt, 2010) extends this by using two functions, each updated on a different subset of the experience. The paper claims this reduces the overestimation bias of Q-Learning, and Hasselt further claims: "Therefore, this algorithm is not less data-efficient than Q-learning."

So my question is: is there a Q-Learning variant with $n$ functions? If so, do the two claims made for Double Q-Learning scale with $n$?

foreverska
  • 724
  • 1
  • 16

2 Answers

7

Adding more Q estimators trained on separate data would probably not improve performance, and may even degrade it.

At least, there is no theoretical justification for it. Double Q-learning addresses a specific problem, maximisation bias: when your estimators are noisy and you select the highest estimate (the greedy action), using the same estimator both to select that action and to evaluate it in the update step biases you towards overestimating its value. There is no equivalent bias that would be affected by moving from 2 to 3 estimators.
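To see the bias concretely, here is a small, self-contained numpy demo (my own illustration, not taken from either paper): every action has a true value of zero, and we compare evaluating the greedy action with the same noisy estimate versus with an independent one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise = 10, 10_000, 1.0
true_q = np.zeros(n_actions)  # every action is truly worth 0

single, double = [], []
for _ in range(n_trials):
    # two independent, equally noisy estimates of the same true values
    q_a = true_q + rng.normal(0.0, noise, n_actions)
    q_b = true_q + rng.normal(0.0, noise, n_actions)

    # single estimator: select the greedy action and evaluate it with the same estimate
    single.append(q_a.max())

    # double estimator: select with q_a, evaluate the selected action with q_b
    double.append(q_b[q_a.argmax()])

print(f"single estimator: {np.mean(single):+.3f}")  # clearly positive (overestimates)
print(f"double estimator: {np.mean(double):+.3f}")  # close to zero (unbiased)
```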

In addition, double Q-learning can make use of both estimators in each update: one to select the greedy action, the other to evaluate it. With a larger number of estimators you would need some scheme for rotating through them.
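For concreteness, here is a tabular sketch of that update, generalised to a list of $n$ Q-tables by picking the selector and evaluator at random. With $n = 2$ it reduces to standard double Q-learning; for $n > 2$ the rotation scheme is just one guess, not a published algorithm.

```python
import numpy as np

def n_q_update(Q, s, a, r, s_next, done,
               alpha=0.1, gamma=0.99, rng=np.random.default_rng()):
    """One hypothetical tabular update with a list of n Q-tables,
    e.g. Q = [np.zeros((n_states, n_actions)) for _ in range(n)].

    Table i selects the greedy next action and is the table being updated;
    table j evaluates that action. With n == 2 this is standard double
    Q-learning; the random role assignment for n > 2 is only a sketch.
    """
    i, j = rng.choice(len(Q), size=2, replace=False)
    if done:
        target = r
    else:
        a_star = int(np.argmax(Q[i][s_next]))        # selection with table i
        target = r + gamma * Q[j][s_next][a_star]    # evaluation with table j
    Q[i][s][a] += alpha * (target - Q[i][s][a])      # only table i is updated
```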

However, it is possible that some other factor would make a 3- or 4-estimator agent effective. I have not experimented with this, and I am not aware of anything published. So you could always try the experiment: I suggest picking an environment in which double Q-learning is already shown to perform well, and giving it a go. These kinds of "what if I changed this thing?" experiments usually come to nothing, but they can be fun.

I suspect what you will find is that learning is slower, but a little more robust against some kinds of error. However, in a double Q-learner, decreasing the learning rate and/or increasing the interval between updates to the frozen (target) copies of the estimators should have a very similar effect.

Neil Slater
  • 32,068
  • 3
  • 43
  • 64

7

Yes, there are variations of Q-learning which use $n$ Q-functions, usually called "ensemble Q-learning" or "ensemble Q-functions". You can have a look at the REDQ algorithm.

The main benefit of having multiple but uncorrelated Q-functions is that you can reduce the overestimation bias of the q-values, and therefore update them more often with the same data, achieving better sample efficiency or, equivalently, a higher update-to-data (UTD) ratio.
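As a rough numpy sketch of the core idea (REDQ builds on SAC, so the entropy term and the high UTD ratio are omitted here): the target takes a minimum over a small random subset of the ensemble, which keeps overestimation in check even when $n$ is large.

```python
import numpy as np

def redq_style_target(next_q, rewards, dones, gamma=0.99, subset_size=2,
                      rng=np.random.default_rng()):
    """next_q: array of shape (n_ensemble, batch) with each member's
    Q-estimate for the next state and the policy's next action.

    The target uses the elementwise minimum over a small random subset of
    the ensemble -- the core REDQ trick for controlling overestimation.
    """
    idx = rng.choice(next_q.shape[0], size=subset_size, replace=False)
    min_q = next_q[idx].min(axis=0)
    return rewards + gamma * (1.0 - dones) * min_q
```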

The drawback is that an ensemble is always costly to train and evaluate (so it does NOT scale well with $n$, at least naively): training is slower in wall-clock time even though you use less data to learn.

To mitigate this, DroQ uses dropout to cheaply create an implicit ensemble of Q-functions, which is also cheap to evaluate. Therefore, you get speed, sample efficiency and SOTA performance.
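A minimal PyTorch sketch of what such a critic could look like, assuming (from my recollection of the DroQ paper) that dropout is combined with layer normalisation in each hidden layer; the layer sizes and dropout rate here are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class DropoutQNetwork(nn.Module):
    """Hypothetical DroQ-style critic: dropout + layer normalisation in each
    hidden layer. Keeping dropout active when computing targets makes every
    forward pass behave like a cheap, slightly different ensemble member."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, p_drop: float = 0.01):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.Dropout(p_drop), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Dropout(p_drop), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Q(s, a) for a continuous-action critic, as in SAC-style methods
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```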

Update: ensembles of Q-functions are also popular in offline RL (where you want to learn a policy from a previously collected dataset of demonstrations); there, the ensemble lets you evaluate the uncertainty about the predicted q-values, for instance to select the next best experience tuple: http://arxiv.org/abs/2110.01548.
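For illustration only (not the exact recipe of the linked paper), a common way to turn ensemble disagreement into an uncertainty signal is the standard deviation across members, which can then serve as a pessimism penalty or a prioritisation score:

```python
import numpy as np

def ensemble_value_and_uncertainty(q_preds):
    """q_preds: (n_ensemble, batch) Q-values for the same state-action pairs.
    Disagreement across members serves as an uncertainty signal; subtracting
    it from the mean gives a pessimistic, lower-confidence-bound-style value.
    How this signal is actually used varies between offline-RL methods."""
    mean = q_preds.mean(axis=0)
    std = q_preds.std(axis=0)
    return mean - std, std
```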

Luca Anzalone
  • 2,888
  • 3
  • 14