DQN arXiv 10-year anniversary: What are the outstanding problems being actively researched in deep Q-learning since 2019?

Question

Background

As of today (December 19, 2023), the arXiv submission of the original deep Q-learning approach, which learned to play Atari games directly from pixels and surpassed a human expert on some of them, is ten years old. The original approach, sometimes referred to as vanilla DQN (2013), was the focus of numerous research investigations and improvements over the following 5 years (2014-2018), as evidenced by the following problems and examples:

Estimation: Double DQN (DDQN, 2015) decouples action selection from action evaluation to reduce overestimation of the Q-values. Dueling DQN (2015) splits the deep Q-network into separate state-value and advantage streams, which are recombined into Q-values, to achieve better policy evaluation in the presence of many similar-valued actions. Distributional DQN (2017) learns a distribution over returns, rather than only its expectation, to better capture multimodal returns and the effects of a nonstationary policy.
Efficiency: Prioritized experience replay (PER, 2015) samples important transitions more frequently to increase learning speed. Rainbow DQN (2017) learns from multi-step bootstrap targets, as in A3C (2016), to propagate newly observed rewards faster to earlier visited states, again for increased learning speed.
Generalization: Certain environmental additions such as sticky actions (2018) prevent the agent from memorizing trajectories and require some basic level of generalization. Techniques from standard deep learning such as batch normalization and regularization (2018) have been shown to improve the agent's ability to generalize.
Parallelism: General Reinforcement Learning Architecture DQN (Gorila, 2015) presented the first massively distributed architecture for DQN, which reduced wall-clock training time. Ape-X DQN (2018) is a distributed algorithm that incorporates other DQN improvements such as prioritized experience replay to again lessen training time.
Memory: Incorporating recurrent architectures such as LSTMs in deep recurrent Q-networks (DRQN, 2015) improves the agent's ability to learn and test on partially-observable environments. Recurrent replay distributed DQN (R2D2, 2018) incorporates an LSTM in a distributed architecture and presents training strategies to mitigate representational drift and recurrent state staleness caused by RNN architectures.
Exploration: Noisy networks (2017) inject learned parametric noise into the network weights, making the agent's policy stochastic so that it explores more efficiently. Random network distillation (RND, 2018) provides an exploration bonus based on the novelty of a state, measured as the prediction error of a predictor network that is continually trained on incoming states to match the outputs of a fixed, randomly initialized target network.
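
To make the Estimation item concrete, here is a minimal NumPy sketch contrasting the vanilla DQN target with the Double DQN target. The function names and the toy numbers are made up for illustration and are not taken from any of the cited papers' code.

```python
import numpy as np

def dqn_target(reward, done, gamma, q_target_next):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * q_target_next.max(axis=1)

def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_action = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * evaluated

# Toy batch: two transitions, three actions each (made-up values).
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])  # the second transition is terminal
q_online_next = np.array([[0.2, 0.9, 0.1], [0.5, 0.4, 0.3]])
q_target_next = np.array([[1.5, 0.7, 0.3], [0.6, 0.2, 0.1]])

vanilla = dqn_target(reward, done, 0.99, q_target_next)
double = double_dqn_target(reward, done, 0.99, q_online_next, q_target_next)
print(vanilla, double)
```

Because the maximum over actions is never smaller than the value of any single action, the Double DQN target computed this way can never exceed the vanilla target for the same batch, which is the mechanism behind the reduced overestimation.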

Question: What are the outstanding problems being actively researched in deep Q-learning throughout the past 5 years (2019-2023)?

To keep this question within the on-topic guidelines and reduce possible subjectivity, please cite relevant sources and briefly summarize the sources (e.g. with a cause-and-effect structure as in the above paragraphs). Also, only discuss sources that involve Q-learning.

the continuous action space, C51 (distributional DQN) for example — Alberto, Dec 20 '23 at 13:13

score 3 · Accepted Answer · answered Dec 20 '23 at 19:20

Regarding estimation and sample efficiency:

REDQ (2021): using a large ensemble of Q-functions can reduce the overestimation bias (as Double DQN does) so much that the Q-networks can be updated many times per environment interaction (i.e., a high update-to-data ratio), thereby achieving high sample efficiency.
DroQ (2022): improves over REDQ by also increasing the computational efficiency, thanks to the use of a small ensemble. This is made possible by "dropout Q-networks", which use dropout as an efficient way to introduce uncertainty into the Q-networks.
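
As a rough illustration of how an ensemble can curb overestimation, the sketch below builds a bootstrap target from the minimum over a random subset of ensemble members, which is the in-target minimization idea used by REDQ. It is a simplified sketch, not the authors' implementation: the entropy term of the underlying actor-critic is omitted, the ensemble and subset sizes are illustrative, and all names and numbers are hypothetical.

```python
import numpy as np

def redq_style_target(reward, done, gamma, next_q_ensemble, subset_size, rng):
    """Bootstrap target using the min over a random subset of target Q-networks.

    next_q_ensemble has shape (N, batch); row i holds member i's estimate of Q(s', a').
    """
    n_members = next_q_ensemble.shape[0]
    subset = rng.choice(n_members, size=subset_size, replace=False)
    min_q = next_q_ensemble[subset].min(axis=0)  # pessimistic estimate per transition
    return reward + gamma * (1.0 - done) * min_q

rng = np.random.default_rng(0)
next_q = rng.normal(loc=1.0, scale=0.5, size=(10, 4))  # 10 members, batch of 4 (made-up)
reward = np.zeros(4)
done = np.zeros(4)
target = redq_style_target(reward, done, 0.99, next_q, subset_size=2, rng=rng)
print(target)
```

In a high update-to-data setting, a target like this would be recomputed for many gradient steps per environment step, with a fresh random subset each time.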

The concept of ensemble Q-functions is also useful in offline RL, where the goal is to learn a policy from a fixed dataset of experience:

This paper (2021) says that the uncertainty of the ensemble can be used to penalize the policy for choosing out-of-distribution (OOD) actions, which it might otherwise prefer because of their erroneously large Q-values.
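
The thread does not say which penalty that paper uses, so the following is only an assumed, generic form of the idea: score each action by the ensemble mean minus a multiple of the ensemble standard deviation (a lower confidence bound). The function name, the `beta` parameter, and the numbers are hypothetical.

```python
import numpy as np

def pessimistic_q(q_ensemble, beta=1.0):
    """Lower-confidence-bound score: ensemble mean minus beta times ensemble std.

    q_ensemble has shape (N, num_actions) for a single state.
    """
    return q_ensemble.mean(axis=0) - beta * q_ensemble.std(axis=0)

# Action 0 is well covered by the data, so the members agree.
# Action 1 is out-of-distribution: one member wildly overestimates it.
q_ensemble = np.array([
    [1.0, 5.0],
    [1.1, 0.5],
    [0.9, 0.5],
])

greedy_on_mean = q_ensemble.mean(axis=0).argmax()
greedy_on_lcb = pessimistic_q(q_ensemble).argmax()
print(greedy_on_mean, greedy_on_lcb)
```

Acting greedily on the plain mean picks the OOD action here, while the uncertainty-penalized score falls back to the action on which the ensemble agrees.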

score 1 · Answer 2 · answered Mar 05 '24 at 22:17

The original deep Q-learning approach has been extended in many fundamental works in offline reinforcement learning (ORL), where a fixed dataset is used to train an agent without any further environment interaction.

Batch Constrained Q-learning (BCQ, 2019) aims to keep the state-action visitation of the learned policy similar to that of the dataset, thereby reducing the overestimation error caused by distribution shift between the dataset and the learned policy. For a given state, BCQ generates candidate actions to yield state-action pairs that have high similarity to those in the dataset. The most promising action is chosen through a learned Q-network. Furthermore, BCQ penalizes rare or unseen states through a modification to Clipped Double Q-learning. Implementing BCQ requires training 4 neural networks, and as a result, it seems to be used less frequently than other ORL algorithms.
Conservative Q-learning (CQL, 2020) regularizes the Q-values during training to learn a conservative Q-function and prevent the overestimation problem. The expected value of a policy under the learned Q-function lower-bounds the true value of the policy. Lower-bounding only in expectation, rather than point-wise, avoids the excessive underestimation of approaches that learn a point-wise lower-bounded Q-function. The implementation of CQL is remarkably simple, which makes it a popular choice as an ORL baseline algorithm.
Implicit Q-learning (IQL, 2021) learns a Q-function while completely avoiding queries of unseen actions. This is accomplished by treating the state-value function as a random variable, with randomness determined by the action. The value of the best available action in a state is estimated by taking a state-conditional upper expectile of this random variable. The policy is then extracted from the resulting Q-function with advantage-weighted behavioral cloning, again to avoid querying unseen actions. Implementing IQL is simple, and its performance on standard ORL benchmark datasets has also made it a popular ORL baseline algorithm.
Calibrated Q-learning (Cal-QL, 2023) learns a Q-function initialization that, in expectation, lower-bounds the true value function of the learned policy and upper-bounds the true value function of another reference policy (e.g. the behavior policy induced by the dataset). The motivation for Cal-QL is to prevent the performance decline that conservative ORL methods such as CQL suffer immediately after online finetuning begins. Upper-bounding the value of a reference policy calibrates the Q-function to learn values on a realistic scale and has been shown empirically to prevent the initial performance decline.
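
Two of the ideas above reduce to short loss terms. The sketch below shows a CQL-style regularizer for discrete actions (log-sum-exp of the Q-values minus the Q-value of the dataset action) and an IQL-style expectile loss. It is a minimal NumPy sketch under simplifying assumptions, not the reference implementations; the function names, `tau` default, and numbers are illustrative.

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """CQL-style regularizer for discrete actions, averaged over the batch.

    Pushes down Q-values over all actions (via log-sum-exp) and pushes up
    the Q-values of the actions actually present in the dataset.
    """
    m = q_values.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    logsumexp = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True))).squeeze(1)
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return (logsumexp - q_data).mean()

def expectile_loss(diff, tau=0.7):
    """IQL-style asymmetric squared loss on diff = Q(s, a) - V(s).

    With tau > 0.5, positive errors are weighted more, so V(s) is pulled
    toward an upper expectile of Q(s, a) over dataset actions.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff ** 2).mean()

q_values = np.array([[1.0, 2.0, 0.5]])  # one state, three actions (made-up)
data_actions = np.array([1])            # the action stored in the dataset
penalty = cql_penalty(q_values, data_actions)
loss = expectile_loss(np.array([1.0, -1.0]), tau=0.7)
print(penalty, loss)
```

In practice the CQL term is added to the usual Bellman error with a weighting coefficient, and the expectile loss is used to fit the state-value network; neither requires evaluating actions outside the dataset when used this way with dataset actions.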
