Background
As of today (12-19-2023), the arXiv submission of the original deep Q-learning approach to achieve superhuman performance on Atari games is a decade old. That approach, sometimes referred to as vanilla DQN (2013), was the focus of numerous research investigations and improvements over the ensuing 5 years (2014-2018), as the following problems and examples show:
Estimation: Double DQN (DDQN, 2015) decouples action selection from action evaluation to reduce overestimation of the Q-values. Dueling DQN (2015) splits the deep Q-network into separate state-value and advantage streams, which are recombined into Q-values, to achieve better policy evaluation in the presence of many similar-valued actions. Distributional DQN (2017) learns a distribution over returns, rather than only its expectation, to better model multimodal returns and nonstationary policies.
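To make the Estimation items concrete, here is a minimal NumPy sketch of the vanilla and Double DQN bootstrap targets and of the dueling aggregation. The function names, array shapes, and toy numbers are my own assumptions for illustration, not code from the cited papers.

```python
import numpy as np

def dqn_target(reward, done, q_target_next, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the next action.
    return reward + gamma * (1.0 - done) * q_target_next.max(axis=1)

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * evaluated

def dueling_q(value, advantage):
    # Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').
    return value + advantage - advantage.mean(axis=1, keepdims=True)

# Toy batch of two transitions with two actions each (made-up numbers).
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_online_next = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target_next = np.array([[3.0, 0.5], [0.2, 0.4]])

vanilla = dqn_target(reward, done, q_target_next)
double = double_dqn_target(reward, done, q_online_next, q_target_next)
```

In the first toy transition the target network's largest estimate (3.0) belongs to an action the online network does not prefer, so the Double DQN target is lower than the vanilla one.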
Efficiency: Prioritized experience replay (PER, 2015) samples important transitions more frequently to increase learning speed. Rainbow DQN (2017) learns from multi-step bootstrap targets, as in A3C (2016), to propagate newly observed rewards faster to earlier visited states, again increasing learning speed.
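As a rough illustration of the two Efficiency ideas, the sketch below computes proportional prioritized-replay sampling probabilities with their importance-sampling weights, and an n-step bootstrap target. The helper names and the default values of `alpha`, `beta`, and `eps` are assumptions chosen for the example.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    # Proportional prioritization: priority grows with |TD error|.
    # alpha = 0 recovers uniform sampling.
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    # Importance-sampling weights correct the bias from non-uniform
    # sampling; they are normalized by the maximum weight for stability.
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    # r_0 + gamma * r_1 + ... + gamma^n * bootstrap_value,
    # accumulated backwards from the bootstrap value.
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

probs = per_probabilities(np.array([0.1, 1.0, 5.0]))
weights = importance_weights(probs)
target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

Transitions with larger TD errors are sampled more often but receive smaller importance weights, and the 3-step target lets the bootstrap value reach a state three steps earlier in a single update.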
Generalization: Certain environmental additions such as sticky actions (2018) prevent the agent from memorizing trajectories and require some basic level of generalization. Techniques from standard deep learning such as batch normalization and regularization (2018) have been shown to improve the agent's ability to generalize.
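Sticky actions are simple enough to sketch in a few lines: with some probability the environment ignores the agent's chosen action and repeats the one it executed previously. The class below is a hypothetical stand-alone filter, not the interface of any particular emulator, and the default repeat probability of 0.25 is an assumption based on the commonly recommended setting.

```python
import random

class StickyActions:
    """Repeats the previously executed action with probability repeat_prob."""

    def __init__(self, repeat_prob=0.25, seed=None):
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.prev = None  # last action actually executed

    def reset(self):
        self.prev = None

    def __call__(self, action):
        # With probability repeat_prob, the agent's choice is ignored
        # and the previously executed action is used instead.
        if self.prev is not None and self.rng.random() < self.repeat_prob:
            action = self.prev
        self.prev = action
        return action

sticky = StickyActions(repeat_prob=0.25, seed=0)
executed = [sticky(a) for a in [0, 1, 2, 3, 0, 1, 2, 3]]
```

Because the executed action sequence is stochastic, replaying a memorized open-loop action sequence no longer reproduces the same trajectory.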
Parallelism: General Reinforcement Learning Architecture DQN (Gorila, 2015) presented the first massively distributed architecture for DQN, which reduced wall-clock training time. Ape-X DQN (2018) is a distributed algorithm that incorporates other DQN improvements such as prioritized experience replay to further reduce training time.
Memory: Incorporating recurrent architectures such as LSTMs in deep recurrent Q-networks (DRQN, 2015) improves the agent's ability to learn and act in partially observable environments. Recurrent replay distributed DQN (R2D2, 2018) incorporates an LSTM in a distributed architecture and presents training strategies to mitigate the representational drift and recurrent state staleness that arise when recurrent networks are trained from replayed experience.
Exploration: Noisy networks (2017) inject stochasticity into the agent's policy to explore more efficiently. Random network distillation (RND, 2018) provides an exploration bonus for novel states, measured as the prediction error of a predictor network that is continually trained on incoming states to match the outputs of a fixed, randomly initialized target network.
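The RND bonus can be illustrated with a deliberately simplified sketch in which both the target and the predictor are single linear maps (an assumption for brevity; the actual method uses neural networks). The class name, dimensions, and learning rate are hypothetical.

```python
import numpy as np

class RNDBonus:
    def __init__(self, obs_dim, feat_dim=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed, randomly initialized target: never trained.
        self.w_target = rng.normal(size=(obs_dim, feat_dim))
        # Predictor: trained to match the target's outputs.
        self.w_pred = np.zeros((obs_dim, feat_dim))
        self.lr = lr

    def bonus(self, obs):
        # Intrinsic reward = mean squared prediction error on this state.
        err = obs @ self.w_pred - obs @ self.w_target
        return float(np.mean(err ** 2))

    def update(self, obs):
        # One gradient step on the mean squared error w.r.t. w_pred.
        err = obs @ self.w_pred - obs @ self.w_target
        grad = np.outer(obs, 2.0 * err / err.size)
        self.w_pred -= self.lr * grad

rnd = RNDBonus(obs_dim=4)
seen = np.full(4, 0.5)
novel = np.array([0.5, -0.5, 0.5, -0.5])

bonus_before = rnd.bonus(seen)
for _ in range(200):
    rnd.update(seen)  # the agent keeps visiting the same state
bonus_after = rnd.bonus(seen)
bonus_novel = rnd.bonus(novel)
```

After repeated visits the predictor matches the target on the familiar state, so its bonus shrinks, while a state the predictor has not been trained on keeps a comparatively large bonus.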
Question: What outstanding problems in deep Q-learning have been actively researched over the past 5 years (2019-2023)?
To keep this question within the on-topic guidelines and reduce possible subjectivity, please cite relevant sources and briefly summarize them (e.g. with a cause-and-effect structure as in the above paragraphs). Also, please only discuss sources that involve Q-learning.