Deep Reinforcement Learning for minimizing portfolio variance

By: Mathi Danmark, 19JUL2022

In this post I will show how to use Deep Reinforcement Learning (DRL) to aid in the task of minimizing portfolio variance. I will look at a portfolio of Additional Tier 1 (AT1) notes and try to minimize its variance through positions in the iTraxx Crossover index and Bund Futures. The solution is a first step towards creating a dynamic hedging strategy for AT1 notes.

The conclusion I reach is that the DRL model is able to create a dynamic strategy that outperforms a benchmark strategy on the training and validation sets. On the test set, the strategy matches the benchmark.

Deep Reinforcement Learning

DRL is a field within machine learning (ML) that uses the complex non-parametric optimization of ML to train an agent to act in order to reach a target. In simpler cases, the agent could be tasked with finding his way out of a maze or locating the shortest route between two points in the presence of obstacles. In the case presented here, the agent will be trained to take positions that minimize the portfolio variance.

DRL is especially exciting for investment analysis, because you can construct the analysis to include various real-world frictions, like trading costs and position-sizing restrictions. Further, the function that guides the agent's behavior can be constructed so that the agent pursues one of many possible goals, for example: maximizing absolute return, maximizing the Sharpe ratio, minimizing drawdowns, or minimizing portfolio variance.

The Bloomberg European CoCo Tier1 Unhedged EUR Index will act as a proxy for the AT1 portfolio. Each day, the agent is allowed to take any position (long or short) in the iTraxx Crossover index and Bund Futures, with the target of reaching the lowest possible portfolio variance (see Figure 1). The agent incurs trading costs of 25 bps when altering the position sizing, thereby reducing the incentive for erratic position changes.
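
To make the frictions concrete, here is a minimal sketch of how a 25 bps proportional cost on weight changes could enter the daily portfolio return. The function and the exact cost and reward definitions are my illustrative assumptions, not necessarily the exact implementation used in the model.

```python
import numpy as np

TC = 0.0025  # 25 bps trading cost on changes in position size

def portfolio_return(r_at1, r_xover, r_bund, w, w_prev):
    """One-day portfolio return: long 100% AT1 plus overlay weights
    w = (w_xover, w_bund). Changing the overlay incurs a proportional cost."""
    gross = r_at1 + w[0] * r_xover + w[1] * r_bund
    cost = TC * np.sum(np.abs(np.asarray(w) - np.asarray(w_prev)))
    return gross - cost

# Example: keeping yesterday's weights is free, halving the Crossover short is not
print(portfolio_return(-0.002, -0.004, 0.001, (-0.8, -0.1), (-0.8, -0.1)))
print(portfolio_return(-0.002, -0.004, 0.001, (-0.4, -0.1), (-0.8, -0.1)))
```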

Figure 1


Figure 2 illustrates how the analysis is set up. The agent is placed at some starting point in the past (t=0) and is only allowed to use prior data from the environment to guide his decision, the so-called state of the world S_t. The agent then takes an action A_t (choosing the portfolio weights), which affects the environment. The agent receives a reward R_{t+1} for his action, and a new state S_{t+1} becomes available for him to base his next decision on. The agent uses the reward to guide the quality of his decisions. This process is repeated iteratively until t=T and is known as a Markov Decision Process.
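
Stripped to its essentials, the loop in Figure 2 looks like the toy example below. The dummy data, the fixed action and the squared-return reward are illustrative stand-ins, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=(500, 3))   # dummy daily (AT1, xover, bund) returns
WINDOW = 30

def step(t, w):
    """Environment step: apply weights w on day t, emit reward R_{t+1} and state S_{t+1}."""
    r_at1, r_xover, r_bund = returns[t]
    pnl = r_at1 + w[0] * r_xover + w[1] * r_bund
    reward = -pnl ** 2                          # proxy reward: penalize return variation
    return returns[t - WINDOW + 1 : t + 1], reward

state = returns[:WINDOW]                        # S_0: the previous 30 days of returns
for t in range(WINDOW, len(returns) - 1):
    action = np.array([-0.8, -0.1])             # A_t: weights chosen by the agent (fixed here)
    state, reward = step(t + 1, action)         # the environment returns R_{t+1} and S_{t+1}
```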

Figure 2 (from: Sutton and Barto, Reinforcement Learning)


The core of this iterative process lies in how the agent chooses his actions, and how he improves those choices. This is where the ML optimization comes in. The ML model takes the state of the world as input and outputs the portfolio weights. The ML procedure uses backpropagation to calculate the gradient of the loss function with respect to all the weights in the model layers, instructing the model how to marginally alter the weights to lower the loss function, causing the agent to optimize his behavior. After successful training, the model can take a historical window of market data and use it to predict the optimal weights for the next day.
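
A minimal, differentiable version of such a training step could look like this. The small network and the variance-of-returns loss are simplified stand-ins for the actual architecture and objective, and the data is dummy.

```python
import torch

# One training step: map 30-day windows of (AT1, xover, bund) returns to two
# overlay weights, and use the variance of the resulting portfolio returns
# over a batch of days as the loss.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(30 * 3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

states = torch.randn(128, 30, 3) * 0.01        # dummy batch of 30-day state windows
next_ret = torch.randn(128, 3) * 0.01          # dummy next-day returns per window

weights = model(states)                        # predicted (xover, bund) weights
port = next_ret[:, 0] + (weights * next_ret[:, 1:]).sum(dim=1)
loss = port.var()                              # portfolio variance as the loss

opt.zero_grad()
loss.backward()                                # backpropagation: gradients w.r.t. all layer weights
opt.step()                                     # marginally alter the weights to lower the loss
```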

I have chosen a 4-layer neural network to map the state of the world S_t to the portfolio weights (see Figure 3 for the network architecture). FC denotes a fully connected layer. In the Flatten layer, I add the old weights to the concatenation, to improve the optimization. The square brackets show the layer dimensions. The state of the world S_t holds the AT1, Crossover and Bund returns for the previous 30 days.

I have used the simplest model architecture that gave a satisfactory result. More complex models also converged, but their investment performance was not superior.
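
Since the exact layer dimensions in Figure 3 are hard to reproduce in text, here is a plausible PyTorch sketch of a network of this type. The hidden sizes and the exact placement of the old-weight concatenation are my assumptions, not a faithful copy of the trained model.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Illustrative 4-layer network: FC layers on the 30x3 return window,
    with the previous portfolio weights concatenated after flattening."""
    def __init__(self, window=30, n_assets=3, n_actions=2, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(n_assets, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.flatten = nn.Flatten()
        self.fc3 = nn.Linear(window * hidden + n_actions, hidden)
        self.fc4 = nn.Linear(hidden, n_actions)

    def forward(self, state, old_weights):
        x = torch.relu(self.fc1(state))          # [batch, 30, hidden]
        x = torch.relu(self.fc2(x))
        x = self.flatten(x)                      # [batch, 30 * hidden]
        x = torch.cat([x, old_weights], dim=1)   # append the old weights
        x = torch.relu(self.fc3(x))
        return self.fc4(x)                       # (xover, bund) weights

net = PolicyNet()
w = net(torch.randn(8, 30, 3), torch.zeros(8, 2))  # -> shape [8, 2]
```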

Figure 3


The number of trainable parameters in this kind of deep neural network can be relatively high, allowing it to map extremely complex functions. This gives rise to the problem of over-fitting: the model appears to work well on the training data, but it does not generalize well to data it has not directly used in the optimization process. I am interested in creating ML models that generalize well, because then the model has found patterns that are general to solving the problem at hand, instead of simply creating a good mapping between the inputs and outputs it was trained on. For this reason, one typically chooses the ML model configuration that minimizes the validation loss.

Data

The dataset comprises three input time series: the Bloomberg European CoCo Tier1 EUR Unhedged total return index (I31415EU Index), the iTraxx Crossover Generic index (ITRXEXE), and Bund Futures (RX1). I have used daily observations from 2015 until 29JUN2022.

I have used 80% of the data for training (05JAN2015 to 30DEC2020), 10% for validation (31DEC2020 to 29SEP2021), and 10% for testing (30SEP2021 to 29JUN2022). I have trained the model on the training data, but chosen the model configuration that minimizes the validation loss.

Source: Bloomberg Finance L.P.
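
In pandas, this split is a matter of date slicing. The snippet below uses dummy random returns in place of the Bloomberg series, but the date boundaries match the periods above.

```python
import numpy as np
import pandas as pd

# Illustrative 80/10/10 date-based split (dummy data stands in for the real series).
dates = pd.bdate_range("2015-01-05", "2022-06-29")
df = pd.DataFrame(np.random.default_rng(0).normal(0, 0.01, (len(dates), 3)),
                  index=dates, columns=["AT1", "xover", "bund"])

train = df.loc[:"2020-12-30"]                  # 05JAN2015 to 30DEC2020
val = df.loc["2020-12-31":"2021-09-29"]        # validation period
test = df.loc["2021-09-30":]                   # 30SEP2021 to 29JUN2022
```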

Model training

I have run 100 iterations, where the agent is allowed to pass through the training data once each time (the iterative process shown in Figure 2). Figure 4 shows the training and validation loss.

Figure 4 shows that the validation loss remains higher than the training loss, indicating that the model does not work as well outside the training data. Nonetheless, the result is reassuring, as the validation loss stays in close proximity to the training loss and keeps decreasing, meaning that the behavior learned on the training data also works, to some extent, on data not used for parameter optimization.
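
The pass-and-select logic can be sketched as follows, reusing the toy model and variance loss from the earlier snippet. Implementing the validation-loss selection as checkpointing is my assumption; the real training code will differ in the details.

```python
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(90, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X_tr, R_tr = torch.randn(400, 30, 3) * 0.01, torch.randn(400, 3) * 0.01  # dummy train set
X_va, R_va = torch.randn(50, 30, 3) * 0.01, torch.randn(50, 3) * 0.01    # dummy validation set

def variance_loss(X, R):
    w = model(X)
    port = R[:, 0] + (w * R[:, 1:]).sum(dim=1)
    return port.var()

best_val, best_state = float("inf"), None
for it in range(100):                          # one pass over the training data per iteration
    opt.zero_grad()
    variance_loss(X_tr, R_tr).backward()
    opt.step()
    with torch.no_grad():
        v = variance_loss(X_va, R_va).item()   # validation loss, no gradient updates
    if v < best_val:
        best_val, best_state = v, copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)              # keep the configuration with the lowest validation loss
```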

Figure 4


Model result

Once the model has been trained, I can use it to predict the weights.
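
Prediction is a simple rolling loop: slide the 30-day state window over the history and collect the next day's overlay weights. The untrained stand-in model and dummy data below are for illustration only.

```python
import torch

torch.manual_seed(0)
returns = torch.randn(500, 3) * 0.01           # dummy daily (AT1, xover, bund) returns
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(90, 2))

weights = []
with torch.no_grad():                          # pure inference, no gradients
    for t in range(30, len(returns)):
        state = returns[t - 30 : t].unsqueeze(0)     # S_t: the previous 30 days
        weights.append(model(state).squeeze(0))      # (xover, bund) weights for day t
weights = torch.stack(weights)                 # one weight pair per trading day
```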

Figure 5 shows the model predictions for three different periods: row 1 is the full dataset (including the training data), row 2 is the validation data, and row 3 is the test data. The graphs on the left show the cumulative returns of the constituents and the portfolio. The graphs on the right show the portfolio weights (AT1 always weighs 100%). For example, if xover has a weight of -60%, the model takes a short position in the iTraxx Crossover equal to 60% of the AT1 position.

We can see that the model creates a dynamic strategy: the xover and bund weights are not constant, but change with different states of the market.

Comments on some out-of-sample examples in Figure 5:

· In AUG to OCT 2021 (second row), the model lowers the Crossover short and goes slightly long Bund Futures. This happens as rates start to rise, which indicates that the model expects the AT1 price to start decreasing, and to decrease more than the Crossover – which fits the generally observed historical market dynamics.

· In MAR to MAY 2022 (third row), we can see that the model lowers its short Crossover position from -90% in March to -25% in mid-May. This coincides with a period of some spread divergence: in March, AT1 had underperformed the Crossover index. The model seems to expect some convergence, which subsequently happened, as AT1 outperformed the Crossover index through April. This move is the primary reason that the DRL model outperforms all the constituents in the test data.

Overall, the model chooses to be short the Crossover, with weights in the -140% to -25% region. The model swings between being long and short Bund Futures, with weights ranging from -25% to +25%.

In the next section I will evaluate the model's investment performance.

Figure 5


Model Performance

In order to gauge the performance of the model, we need to compare it to another strategy. A simple alternative strategy is to look at the training data, find the fixed-weight solution that minimizes the portfolio variance, and use those weights going forward. Again, I am only interested in the weights for Crossover and Bund Futures, as I assume that the portfolio is long 100% AT1.
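
A minimal sketch of this benchmark search, using dummy returns in place of the actual training data; the number of draws is arbitrary.

```python
import numpy as np

# Draw random fixed weights in [-200%, 200%], compute the training-period
# portfolio standard deviation for each, and keep the minimum.
rng = np.random.default_rng(0)
r = rng.normal(0, 0.01, size=(1500, 3))            # dummy (AT1, xover, bund) training returns

draws = rng.uniform(-2.0, 2.0, size=(10_000, 2))   # random (xover, bund) weight pairs
stds = np.array([np.std(r[:, 0] + w[0] * r[:, 1] + w[1] * r[:, 2]) for w in draws])

best = draws[stds.argmin()]                        # the fixed-weight optimum
print(f"optimal fixed weights: xover {best[0]:.1%}, bund {best[1]:.1%}")
```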

Figure 6 illustrates the approach, using only the training data. The 'random' grey dots are the portfolio returns and standard deviations where the weights are drawn at random in the interval -200% to 200%. The red dot (train_random_opt) shows the optimal fixed-weight solution among the random weights. The optimal fixed-weight combination is xover -87.0% and bund futures -9.8% (the weights do not need to sum to 100% – the benchmark strategy can also take any position sizes). This simple approach is our benchmark performance.

The black dot (DRL model) shows the DRL model's in-sample performance, meaning that the model has used this data for parameter optimization. As we can see, the model is able to create an investment strategy that is better at minimizing portfolio variance than any fixed-weight alternative. This initial result is promising, but not entirely surprising, given the complexity of the DRL model.

But because I am interested in the out-of-sample abilities of the model, I have not chosen the model configuration that minimizes the training loss, but the one that minimizes the validation loss.

Figure 6


Figure 7 is created from the validation data. Again, the grey dots are portfolio returns and standard deviations generated by random portfolio weights. The red dot is the portfolio return and standard deviation when using the optimal fixed weights from the training data (again, xover: -87.0% and bund futures: -9.8%). The black dot shows the DRL model's result when fed the validation data.

First, we see that the old fixed-weight optimum (red dot) is no longer optimal at minimizing variance on this data. Second, the DRL model is still better than train_random_opt at minimizing variance. Finally, the DRL model still generates a result that is not attainable with fixed weights alone (it sits outside the grey cloud).

This result is reassuring, as the DRL model is able to outperform our simple fixed-weight benchmark strategy on out-of-sample data. But the final test remains: how will the model perform on data used neither for optimization nor for model configuration, i.e. the test data?

Figure 7


Figure 8 shows the model performance on the test data. The DRL model shows marginally better performance than the benchmark (standard deviation of 6.39% vs. 6.44%). We also see that the DRL model no longer outperforms what is possible with a fixed-weight portfolio – although it is still close.

So in this final, true out-of-sample test, the DRL model still provides a reasonable strategy, even though it only shows marginally better performance than our benchmark strategy.

Figure 8


Model dynamics

One of the primary objections to the use of ML in finance is the opacity of the inner workings of the models. In classification tasks, like image recognition, you are perhaps less worried about how the trained weights specifically interpret the world, as long as the model performance is satisfactory.

In finance, you work with a fundamentally different reality, where the model input can take a form that is radically different from anything the model has seen before. Therefore, you need to have an understanding of how the model reacts to its input.

This topic alone merits an extended analysis, but this post is already longer than intended, so I hope to cover it in more detail later. For now, I will present a basic view of how the DRL model sees the world. Figure 9 is based on the training data, showing scatterplots of the observed credit spreads against the model weights at that instant. Row 1 shows the Crossover weights, and row 2 shows the Bund Future weights.

It is obvious that the DRL model has not created an easily interpretable investment dynamic, but the weights are not totally random either. It is especially apparent that the model increases the Crossover short position when the Crossover and AT1 spreads widen materially (the two upper-left graphs). The model also increases the short position in Bund Futures when AT1 and Crossover spreads widen – albeit the relationship is less clear, showing signs of non-linear effects (the two bottom-left graphs). Finally, it is unclear how the Bund Future input is used to set the weights – at least from this simple view (the two rightmost graphs).

It is also apparent that the model has established a strong relationship between the Bund and Crossover weights (this is also apparent from Figure 5): they move in tandem, but at different absolute levels and scales.

To further uncover the inner dynamics of the model, one could run scenario analyses, where the model is fed stylized scenarios to see how it reacts. This would give greater confidence that the model is reliable, even for more extreme data than previously observed.
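
As a sketch of what such a scenario analysis could look like, the snippet below feeds a few hand-built stress windows to a stand-in network and inspects the weights it chooses. The scenario names, magnitudes and the model object are all illustrative.

```python
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(90, 2))  # stand-in for the trained net

def window(at1, xover, bund, days=30):
    """A 30-day state window with constant daily returns for (AT1, xover, bund)."""
    return torch.tensor([[at1, xover, bund]] * days).unsqueeze(0)

scenarios = {
    "calm":            window(0.0002, 0.0001, 0.0001),
    "credit sell-off": window(-0.01, -0.008, 0.002),
    "rates sell-off":  window(-0.002, 0.000, -0.006),
}
with torch.no_grad():
    for name, s in scenarios.items():
        print(name, model(s).squeeze(0).tolist())  # chosen (xover, bund) weights
```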

Figure 9


Concluding remarks

The results obtained here are reassuring: the DRL approach is flexible, allowing one to look at a wide range of investment problems, and the DRL model was able to provide an interesting investment strategy – even though the test performance only marginally outperformed our benchmark strategy.

This was my first small project using Deep Reinforcement Learning on an investment problem. My plan is to continue working with ML and DRL and to use the methods to try to solve other investment problems. I plan to keep publishing these small projects – mostly to force myself to work somewhat thoroughly and stringently through the problem at hand.

Please reach out to me if you have an interest in the finer details of the model, or want to discuss ML in finance in general.

Further work

There are several obvious elements to look at in order to try to improve the model performance:

- Hyperparameter tuning: the model contains a range of hyperparameters whose tuning could be explored in more detail.

- Adding more explanatory variables: it would be interesting to see how the model would perform with additional explanatory variables, such as stock returns, economic data, or technical indicators.

- Network architecture: I have only used fully connected layers in the architecture. Apart from other fully connected structures, it would be interesting to use a convolutional neural network (CNN) instead. A CNN would intuitively be better at incorporating local interdependencies in the data.

Resources

My code is inspired by Selim Amrouni, Aymeric Moulin, and Phillipe Mizrahi's excellent work, referenced in item 1 below. Items 2 to 4 aided my understanding of Deep Reinforcement Learning.

1.    GitHub: selimamrouni/Deep-Portfolio-Management-Reinforcement-Learning – a project realized in the context of the IEOR 8100 RL class at Columbia University.

2.    "Reinforcement Learning For Automated Trading using Python" (analyticsvidhya.com).

3.    https://blog.dominodatalab.com/deep-reinforcement-learning

4.    Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction".
