5

Q-Learning (Watkins, 1989) uses a single function both to estimate the value of actions and to choose the next action. Double Q-Learning (Hasselt, 2010) extends this by using two functions, each updated on a different subset of the experience. The paper claims this reduces the overestimation bias of Q-Learning, and Hasselt further claims: "Therefore, this algorithm is not less data-efficient than Q-learning."

So my question is: is there a Q-Learning variant with $n$ functions? If so, do the two claims made for Double Q-Learning scale with $n$?

foreverska
  • 724
  • 1
  • 16

2 Answers

7

Adding more Q estimators trained on separate data would probably not improve performance, and may even degrade it.

At least, there is no theoretical justification for it. Double Q-learning addresses a specific problem, maximisation bias: when your estimators are noisy and you select the highest estimate (the greedy action), using the same estimator both to select that action and to evaluate it in the update step biases you towards overestimating its value. There is no equivalent bias that would be affected by moving from 2 to 3 estimators.
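To see the bias concretely, here is a small, self-contained numpy demo (my own illustration, not taken from either paper): every action has a true value of zero, and we compare evaluating the greedy action with the same noisy estimate versus with an independent one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise = 10, 10_000, 1.0
true_q = np.zeros(n_actions)  # every action is truly worth 0

single, double = [], []
for _ in range(n_trials):
    # two independent, equally noisy estimates of the same true values
    q_a = true_q + rng.normal(0.0, noise, n_actions)
    q_b = true_q + rng.normal(0.0, noise, n_actions)

    # single estimator: select the greedy action and evaluate it with the same estimate
    single.append(q_a.max())

    # double estimator: select with q_a, evaluate the selected action with q_b
    double.append(q_b[q_a.argmax()])

print(f"single estimator: {np.mean(single):+.3f}")  # clearly positive (overestimates)
print(f"double estimator: {np.mean(double):+.3f}")  # close to zero (unbiased)
```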

In addition, double Q-learning can make use of both estimators in each update: one to select the greedy action, the other to evaluate it. With a larger number of estimators you would need some scheme for rotating through them.
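For concreteness, here is a tabular sketch of that update, generalised to a list of $n$ Q-tables by picking the selector and evaluator at random. With $n = 2$ it reduces to standard double Q-learning; for $n > 2$ the rotation scheme is just one guess, not a published algorithm.

```python
import numpy as np

def n_q_update(Q, s, a, r, s_next, done,
               alpha=0.1, gamma=0.99, rng=np.random.default_rng()):
    """One hypothetical tabular update with a list of n Q-tables,
    e.g. Q = [np.zeros((n_states, n_actions)) for _ in range(n)].

    Table i selects the greedy next action and is the table being updated;
    table j evaluates that action. With n == 2 this is standard double
    Q-learning; the random role assignment for n > 2 is only a sketch.
    """
    i, j = rng.choice(len(Q), size=2, replace=False)
    if done:
        target = r
    else:
        a_star = int(np.argmax(Q[i][s_next]))        # selection with table i
        target = r + gamma * Q[j][s_next][a_star]    # evaluation with table j
    Q[i][s][a] += alpha * (target - Q[i][s][a])      # only table i is updated
```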

However, it is possible that some other factor would make a 3- or 4-estimator agent effective. I have not experimented with this, and I am not aware of anything published. So you could always try the experiment: I suggest picking an environment in which double Q-learning is already shown to perform well, and giving it a go. These kinds of "what if I changed this thing?" experiments usually come to nothing, but they can be fun.

I suspect what you will find is that learning is slower, but a little more robust against some kinds of error. However, in a double Q-learner, decreasing the learning rate and/or increasing the interval between updates to the frozen (target) copies of the estimators should have a very similar effect.

Neil Slater
  • 32,068
  • 3
  • 43
  • 64

7

Yes, there are variations of Q-learning which use $n$ Q-functions, usually called "ensemble Q-learning" or "ensemble Q-functions". You can have a look at the REDQ algorithm.

The main benefit of having multiple but uncorrelated Q-functions is that you can reduce the overestimation bias of the q-values, and therefore update them more often with the same data, achieving better sample efficiency or, equivalently, a higher update-to-data (UTD) ratio.
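As a rough numpy sketch of the core idea (REDQ builds on SAC, so the entropy term and the high UTD ratio are omitted here): the target takes a minimum over a small random subset of the ensemble, which keeps overestimation in check even when $n$ is large.

```python
import numpy as np

def redq_style_target(next_q, rewards, dones, gamma=0.99, subset_size=2,
                      rng=np.random.default_rng()):
    """next_q: array of shape (n_ensemble, batch) with each member's
    Q-estimate for the next state and the policy's next action.

    The target uses the elementwise minimum over a small random subset of
    the ensemble -- the core REDQ trick for controlling overestimation.
    """
    idx = rng.choice(next_q.shape[0], size=subset_size, replace=False)
    min_q = next_q[idx].min(axis=0)
    return rewards + gamma * (1.0 - dones) * min_q
```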

The drawback is that an ensemble is always costly to train and evaluate (so it does NOT scale well with $n$, at least naively): training is slower in wall-clock time even though you use less data to learn.

To mitigate this, DroQ uses dropout to cheaply create an implicit ensemble of Q-functions, which is also cheap to evaluate. Therefore, you get speed, sample efficiency and SOTA performance.
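A minimal PyTorch sketch of what such a critic could look like, assuming (from my recollection of the DroQ paper) that dropout is combined with layer normalisation in each hidden layer; the layer sizes and dropout rate here are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class DropoutQNetwork(nn.Module):
    """Hypothetical DroQ-style critic: dropout + layer normalisation in each
    hidden layer. Keeping dropout active when computing targets makes every
    forward pass behave like a cheap, slightly different ensemble member."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, p_drop: float = 0.01):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.Dropout(p_drop), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Dropout(p_drop), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Q(s, a) for a continuous-action critic, as in SAC-style methods
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```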

Update: ensembles of Q-functions are also popular in offline RL (where you want to learn a policy from a previously collected dataset of demonstrations); there, the ensemble lets you evaluate the uncertainty about the predicted q-values, for instance to select the next best experience tuple: http://arxiv.org/abs/2110.01548.
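For illustration only (not the exact recipe of the linked paper), a common way to turn ensemble disagreement into an uncertainty signal is the standard deviation across members, which can then serve as a pessimism penalty or a prioritisation score:

```python
import numpy as np

def ensemble_value_and_uncertainty(q_preds):
    """q_preds: (n_ensemble, batch) Q-values for the same state-action pairs.
    Disagreement across members serves as an uncertainty signal; subtracting
    it from the mean gives a pessimistic, lower-confidence-bound-style value.
    How this signal is actually used varies between offline-RL methods."""
    mean = q_preds.mean(axis=0)
    std = q_preds.std(axis=0)
    return mean - std, std
```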

Luca Anzalone
  • 2,888
  • 3
  • 14