Maximizing Ensemble Diversity in Deep Reinforcement Learning
Co-authors for this project are Mariano Phielipp from Intel Labs and Ladislau Boloni from the University of Central Florida.
Key Messages:
The 2022 International Conference on Learning Representations (ICLR) runs April 25th through 29th. Dedicated to advances in representation learning, also known as deep learning, ICLR is the leading gathering of professionals presenting cutting-edge research on all aspects of deep learning and its diverse applications. Our work, Maximizing Ensemble Diversity in Deep Reinforcement Learning, will be presented at this year’s conference in Poster Session 5.
Ensemble Reinforcement Learning
Ensemble reinforcement learning is a “method of combining learning models to produce a single learner to perform inference on the data.” The approach is gaining popularity because it addresses long-standing training challenges such as sample efficiency, exploration, and high estimation bias, making it a go-to method for trial-and-error learning. However, training an ensemble of neural networks on the same data can cause the network collapse problem, in which all the networks start producing identical outputs, losing all the leverage of an ensemble.
Intel Labs, in collaboration with the University of Central Florida, proposed Maximizing Ensemble Diversity in Reinforcement Learning (MED-RL), a set of regularization techniques inspired by economic theory that maximize diversity between the neural networks by promoting inequality between their parameters. This regularization allows ensemble reinforcement learning algorithms to reach their full potential.
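Setting the paper's exact regularizers aside, the core idea can be sketched in a few lines: measure how unequal the corresponding parameters of the ensemble members are, and subtract that measure from the training loss so gradient descent drives the networks apart. The Gini-style helper and the weight `beta` below are illustrative assumptions, not MED-RL's exact formulation.

```python
import torch
import torch.nn as nn

def gini_across_members(stacked: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean element-wise Gini coefficient across K ensemble copies.

    stacked: (K, P) tensor holding the K networks' copies of P parameters.
    Returns ~0 when all copies are identical; larger means more inequality.
    """
    x = stacked.abs() + eps                             # Gini assumes non-negative values
    pairwise = (x.unsqueeze(0) - x.unsqueeze(1)).abs()  # (K, K, P) pairwise gaps
    per_param = pairwise.mean(dim=(0, 1)) / (2 * x.mean(dim=0))
    return per_param.mean()

def diversity_bonus(ensemble: list[nn.Module]) -> torch.Tensor:
    """Average parameter inequality across the members of an ensemble."""
    bonuses = []
    for params in zip(*(net.parameters() for net in ensemble)):
        stacked = torch.stack([p.flatten() for p in params])  # (K, P)
        bonuses.append(gini_across_members(stacked))
    return torch.stack(bonuses).mean()

# In a training step, subtract the bonus so that minimizing the loss
# simultaneously increases inequality between the networks:
#   loss = td_loss - beta * diversity_bonus(ensemble)  # beta: illustrative weight
```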
Does Network Collapse Affect Performance?
Our work started with the conjecture that high similarity between the neural networks correlates with poor performance. To verify this hypothesis, we trained a MaxminDQN agent with two networks for 3000 episodes. The training graph and the similarity heatmaps are shown below in Figure 1. Notably, at episode 500 (heatmap A) and episode 2000 (heatmap C), the representation similarity between the neural networks is low while the average return is relatively high. In contrast, at episode 1000 (heatmap B) and episode 3000 (heatmap D), the representation similarity is highest, but the average return is lowest.
Figure 1: Training graph and similarity heatmaps of a MaxminDQN agent with two neural networks. The letters on the plot mark the times when similarities were calculated. Heatmaps A and C show relatively low similarity and correspond to relatively higher average return, while heatmaps B and D show extremely high similarity across all layers (see the diagonal values from bottom left to top right).
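For readers who want to reproduce such heatmaps: one standard measure of layer-by-layer representation similarity is linear Centered Kernel Alignment (CKA). Whether the figure uses CKA or another measure, the sketch below gives the general recipe; the activation matrices are assumed to be collected from a common batch of states.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (samples, features).

    Returns a value in [0, 1]; 1 means the two layers represent the batch
    identically up to a linear transform.
    """
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# A Figure-1-style heatmap compares every layer of one network against
# every layer of the other on the same batch of inputs:
#   sim[i, j] = linear_cka(activations_net1[i], activations_net2[j])
```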
Different Weight Initialization for Ensemble Diversity
The most popular approach to the network collapse problem is initializing each network with different weights. To test this approach, we performed a toy experiment in which we trained two different architectures with different learning rates and batch sizes. We found that neural networks initialized with different weights still learn almost identical functions. Figure 2a shows the learned functions, while Figure 2b shows their similarity heatmap before and after training. In Figure 2b, you can see that the outputs of the trained networks were 98% similar. Therefore, this method is not suitable for promoting diversity between the neural networks in an ensemble.
Figure 2. Left: fitting a sine function with two different neural network architectures; the upper function was approximated with 64 neurons in each hidden layer and the lower function with 32. Right: similarity heatmaps between the layers of the two networks before and after training; the diagonal (bottom left to top right) measures the representation similarity of corresponding layers.
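This toy experiment is easy to reproduce. The widths, learning rates, and step counts below are illustrative stand-ins for the exact settings behind the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.linspace(-3.14, 3.14, 256).unsqueeze(1)
y = torch.sin(x)

def mlp(width: int) -> nn.Sequential:
    """Two-hidden-layer MLP; each call draws a fresh random initialization."""
    return nn.Sequential(nn.Linear(1, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, 1))

nets = [mlp(64), mlp(32)]                        # different architectures
for net, lr in zip(nets, (1e-3, 3e-4)):          # and different learning rates
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(2000):
        opt.zero_grad()
        F.mse_loss(net(x), y).backward()
        opt.step()

# Despite different widths, initializations, and learning rates, the two
# fitted functions end up nearly indistinguishable:
with torch.no_grad():
    print(F.mse_loss(nets[0](x), nets[1](x)).item())  # tiny gap between outputs
```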
For our approach, we integrated MED-RL into TD3 [1], SAC [2], and REDQ [3] for continuous control tasks, and into MaxminDQN [4] and EnsembleDQN [5] for discrete control tasks, with the goal of maximizing the diversity between the neural networks of each ensemble. We evaluated on six MuJoCo environments and six Atari games. Our results show that the MED-RL-augmented algorithms significantly outperform their unregularized counterparts, in some cases achieving more than 300% performance gains while being up to 75% more sample-efficient. A sample of the results is shown below in Table 1. As demonstrated, the proposed set of regularization techniques successfully maximizes the diversity between the networks of the ensemble. Furthermore, the sample-efficiency benefits of MED-RL suggest that it can be a useful tool in robotics, where data gathering is an expensive process.
Table 1. Maximum average return for MED-RL SAC over 5 trials of 1 million time steps. The maximum value for each task is bolded; ± denotes one standard deviation over trials.
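To make the integration concrete, the sketch below shows roughly where a MED-RL-style term slots into an ensemble critic update. Here `critics`, `batch`, `target`, and `beta` are illustrative placeholders, and `diversity_bonus` is the hypothetical helper sketched earlier, not the paper's exact code.

```python
import torch.nn.functional as F

def regularized_critic_loss(critics, batch, target, beta=0.1):
    """Sum of each member's TD loss minus a shared diversity bonus.

    Minimizing this loss fits every critic to the TD target while
    simultaneously pushing their parameters apart (MED-RL-style).
    """
    td_loss = sum(F.mse_loss(critic(batch.state, batch.action), target)
                  for critic in critics)
    return td_loss - beta * diversity_bonus(critics)
```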
References:
[1] Fujimoto, S., van Hoof, H., and Meger, D. "Addressing Function Approximation Error in Actor-Critic Methods." ICML, 2018.
[2] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." ICML, 2018.
[3] Chen, X., Wang, C., Zhou, Z., and Ross, K. "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model." ICLR, 2021.
[4] Lan, Q., Pan, Y., Fyshe, A., and White, M. "Maxmin Q-learning: Controlling the Estimation Bias of Q-learning." ICLR, 2020.
[5] Anschel, O., Baram, N., and Shimkin, N. "Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning." ICML, 2017.