Build an Algorithmic Trading Strategy with the Fama-French 5-Factor Model in Python
Don’t worry about what the markets are going to do, worry about what you are going to do in response to the markets. - Michael Carr
In 1993, Eugene Fama and Kenneth French, professors at the University of Chicago, developed a 3-factor model to describe stock returns; in 2015 they extended it to five factors. In 2013, Fama shared the Nobel Memorial Prize in Economic Sciences. This linear 5-factor model quantifies the relationship between an asset's return and the riskiness of those returns. Each factor risk carries a premium, and the expected asset return corresponds to a weighted average of these risk premia. The five factors are (1) market risk, (2) the outperformance of small versus big companies (size), (3) the outperformance of high book-to-market versus low book-to-market companies (value), (4) profitability, and (5) investment.
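In regression form, following the standard Fama-French notation, the model describes an asset's excess return as a linear combination of the five factor returns:

```latex
R_{it} - R_{Ft} = a_i + b_i \,(R_{Mt} - R_{Ft}) + s_i \, SMB_t + h_i \, HML_t + r_i \, RMW_t + c_i \, CMA_t + e_{it}
```

Here $R_{it}$ is the asset return, $R_{Ft}$ the risk-free rate, $R_{Mt}$ the market return, and the slopes $b_i, s_i, h_i, r_i, c_i$ are the asset's exposures to the market, size, value, profitability, and investment factors.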
Importantly, the 5-factor model can be used to assess active management of a portfolio versus selectively picking assets and timing the market. If performance can be explained by known return drivers as described by the model, the strategy can be replicated as a low-cost, algorithmic trading strategy.
The Fama-French factors are obtained by sorting stocks into two size groups and then into three groups on each of the remaining firm-specific characteristics. The factors involve three sets of value-weighted portfolios formed as 2 x 3 sorts on size and book-to-market, size and operating profitability, and size and investment. The risk factor values are then computed as averages of the returns of these portfolios.
The Fama-French 5 factors are based on the 6 value-weight portfolios formed on size and book-to-market, the 6 value-weight portfolios formed on size and operating profitability, and the 6 value-weight portfolios formed on size and investment.
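To make the sort-and-average construction concrete, here is a minimal sketch of how SMB and HML would emerge from the six size/book-to-market portfolios. The portfolio returns below are made-up numbers for illustration, not real Fama-French data:

```python
import pandas as pd

# Hypothetical monthly returns (%) for the six size/book-to-market portfolios:
# S = small, B = big; L = low (growth), M = neutral, H = high (value) book-to-market
pf = pd.Series({'SL': 1.2, 'SM': 1.5, 'SH': 1.9,
                'BL': 0.8, 'BM': 1.0, 'BH': 1.3})

# SMB: average of the three small portfolios minus average of the three big ones
smb = pf[['SL', 'SM', 'SH']].mean() - pf[['BL', 'BM', 'BH']].mean()

# HML: average of the two value portfolios minus average of the two growth ones
hml = pf[['SH', 'BH']].mean() - pf[['SL', 'BL']].mean()

print(f'SMB = {smb:.2f}, HML = {hml:.2f}')  # SMB = 0.50, HML = 0.60
```

The profitability (RMW) and investment (CMA) factors are built the same way from their respective 2 x 3 sorts.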
Fama and French regularly publish updated risk factor and research portfolio data that can be downloaded from their website for free.
Let's jump into the code. Shout out to my classmates at the Columbia Data Science Bootcamp and to Machine Learning for Algorithmic Trading, from which the code snippets below are generously borrowed.
First, import pandas, numpy, statsmodels, pandas_datareader, linearmodels, matplotlib, and seaborn.
import pandas as pd
import numpy as np
from statsmodels.api import OLS, add_constant
import pandas_datareader.data as web
from linearmodels.asset_pricing import LinearFactorModel
import matplotlib.pyplot as plt
import seaborn as sns
Then, obtain monthly returns for the period 2010 – 2017 as follows:
ff_factor = 'F-F_Research_Data_5_Factors_2x3'
ff_factor_data = web.DataReader(ff_factor, 'famafrench', start='2010', end='2017-12')[0]
ff_factor_data.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 96 entries, 2010-01 to 2017-12
Freq: M
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mkt-RF  96 non-null     float64
 1   SMB     96 non-null     float64
 2   HML     96 non-null     float64
 3   RMW     96 non-null     float64
 4   CMA     96 non-null     float64
 5   RF      96 non-null     float64
dtypes: float64(6)
memory usage: 5.2 KB
Stats from ff_factor_data
ff_factor_data.describe()
Portfolios
Fama and French make available numerous portfolios with which we can illustrate the estimation of the factor exposures, as well as the value of the risk premia available in the market for a given time period. We will use a monthly panel of the 17 industry portfolios.
Subtract the risk-free rate from the returns because the factor model works with excess returns:
ff_portfolio = '17_Industry_Portfolios'
ff_portfolio_data = web.DataReader(ff_portfolio, 'famafrench', start='2010', end='2017-12')[0]
ff_portfolio_data = ff_portfolio_data.sub(ff_factor_data.RF, axis=0)
ff_portfolio_data.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 96 entries, 2010-01 to 2017-12
Freq: M
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Food    96 non-null     float64
 1   Mines   96 non-null     float64
 2   Oil     96 non-null     float64
 3   Clths   96 non-null     float64
 4   Durbl   96 non-null     float64
 5   Chems   96 non-null     float64
 6   Cnsum   96 non-null     float64
 7   Cnstr   96 non-null     float64
 8   Steel   96 non-null     float64
 9   FabPr   96 non-null     float64
 10  Machn   96 non-null     float64
 11  Cars    96 non-null     float64
 12  Trans   96 non-null     float64
 13  Utils   96 non-null     float64
 14  Rtail   96 non-null     float64
 15  Finan   96 non-null     float64
 16  Other   96 non-null     float64
dtypes: float64(17)
memory usage: 13.5 KB
Now, obtain US equity price data from the Quandl Wiki prices dataset.
with pd.HDFStore('../data/assets.h5') as store:
prices = store['/quandl/wiki/prices'].adj_close.unstack().loc['2010':'2017']
equities = store['/us_equities/stocks'].drop_duplicates()
Create a pandas dataframe for equity sectors and prices
sectors = equities.filter(prices.columns, axis=0).sector.to_dict()
prices = prices.filter(sectors.keys()).dropna(how='all', axis=1)
Calculate monthly returns using pct_change and drop missing values.
returns = prices.resample('M').last().pct_change().mul(100).to_period('M')
returns = returns.dropna(how='all').dropna(axis=1)
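To see what the resampling chain does, here is a minimal sketch on synthetic daily prices (made-up tickers and values, not the Quandl data):

```python
import pandas as pd
import numpy as np

# Synthetic daily prices for two hypothetical tickers over three months
idx = pd.date_range('2010-01-01', '2010-03-31', freq='B')
rng = np.random.default_rng(42)
prices = pd.DataFrame(100 * np.cumprod(1 + rng.normal(0, 0.01, (len(idx), 2)), axis=0),
                      index=idx, columns=['AAA', 'BBB'])

# Take the last price of each month, compute percent changes, scale to %,
# and switch to a monthly PeriodIndex to match the Fama-French data
returns = prices.resample('M').last().pct_change().mul(100).to_period('M')
returns = returns.dropna(how='all')  # the first month has no prior price
print(returns)
```

The monthly PeriodIndex is what lets the returns align cleanly with the factor data in the next step.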
Align the data using .loc and reassign to ff_factor_data and ff_portfolio_data.
ff_factor_data = ff_factor_data.loc[returns.index]
ff_portfolio_data = ff_portfolio_data.loc[returns.index]
Compute excess returns by subtracting the risk-free rate from the equity returns, assign the result to the excess_returns dataframe, and clip outliers at the 1st and 99th percentiles.
excess_returns = returns.sub(ff_factor_data.RF, axis=0)
excess_returns.info()
excess_returns = excess_returns.clip(lower=np.percentile(excess_returns, 1),
upper=np.percentile(excess_returns, 99))
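The clip call winsorizes extreme observations so a handful of outliers don't dominate the regressions. A minimal illustration on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic excess returns with a planted extreme outlier
data = pd.DataFrame(rng.normal(0, 2, (1000, 4)))
data.iloc[0, 0] = 50.0

# Clip everything below the 1st and above the 99th percentile
# (np.percentile flattens the dataframe, so the cutoffs are global)
clipped = data.clip(lower=np.percentile(data, 1),
                    upper=np.percentile(data, 99))

# The outlier is pulled in to the 99th-percentile value
print(data.iloc[0, 0], '->', clipped.iloc[0, 0])
```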
Step 1: Factor Exposures
Implement the first stage to obtain the factor loading estimates for the 17 portfolios as follows:
betas = []
for industry in ff_portfolio_data:
step1 = OLS(endog=ff_portfolio_data.loc[ff_factor_data.index, industry],
exog=add_constant(ff_factor_data)).fit()
betas.append(step1.params.drop('const'))
betas = pd.DataFrame(betas,
columns=ff_factor_data.columns,
index=ff_portfolio_data.columns)
betas.info()
<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, Food to Other
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mkt-RF 17 non-null float64
1 SMB 17 non-null float64
2 HML 17 non-null float64
3 RMW 17 non-null float64
4 CMA 17 non-null float64
dtypes: float64(5)
memory usage: 1.4+ KB
Step 2: Risk Premia
Run one regression per period (95 in total, since pct_change drops the first month): each regresses the cross-section of portfolio returns on the factor loadings.
lambdas = []
for period in ff_portfolio_data.index:
step2 = OLS(endog=ff_portfolio_data.loc[period, betas.index],
exog=betas).fit()
lambdas.append(step2.params)
lambdas = pd.DataFrame(lambdas,
index=ff_portfolio_data.index,
columns=betas.columns.tolist())
lambdas.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 95 entries, 2010-02 to 2017-12
Freq: M
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mkt-RF 95 non-null float64
1 SMB 95 non-null float64
2 HML 95 non-null float64
3 RMW 95 non-null float64
4 CMA 95 non-null float64
dtypes: float64(5)
memory usage: 10.2 KB
lambdas.mean().sort_values().plot.barh(figsize=(12, 4))
sns.despine()
plt.tight_layout();
t = lambdas.mean().div(lambdas.std())
t
Mkt-RF    0.342748
SMB      -0.011482
HML      -0.262373
RMW      -0.046212
CMA      -0.175708
dtype: float64
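Note that the ratio above divides the mean by the standard deviation of the period estimates; the conventional Fama-MacBeth t-statistic additionally scales by the square root of the number of periods (since the standard error of the mean is std divided by sqrt(T)). A sketch using hypothetical period estimates as stand-ins for the lambdas dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical period-by-period premia estimates (stand-ins for `lambdas`)
rng = np.random.default_rng(3)
lambdas = pd.DataFrame(rng.normal(0.5, 1.0, (95, 2)), columns=['Mkt-RF', 'SMB'])

mean_ratio = lambdas.mean().div(lambdas.std())   # the ratio computed above
fm_tstat = mean_ratio * np.sqrt(len(lambdas))    # Fama-MacBeth t-statistic
print(fm_tstat.round(2))
```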
Results
window = 24 # months
ax1 = plt.subplot2grid((1, 3), (0, 0))
ax2 = plt.subplot2grid((1, 3), (0, 1), colspan=2)
lambdas.mean().sort_values().plot.barh(ax=ax1)
lambdas.rolling(window).mean().dropna().plot(lw=1,
figsize=(14, 5),
sharey=True,
ax=ax2)
sns.despine()
plt.tight_layout()
window = 24 # months
lambdas.rolling(window).mean().dropna().plot(lw=2,
figsize=(14, 7),
subplots=True,
sharey=True)
sns.despine()
plt.tight_layout()
Fama-MacBeth with the linearmodels library
The linearmodels library extends statsmodels with various models for panel data and also implements the two-stage Fama-MacBeth procedure:
mod = LinearFactorModel(portfolios=ff_portfolio_data,
factors=ff_factor_data)
res = mod.fit()
print(res)
LinearFactorModel Estimation Summary
================================================================================
No. Test Portfolios: 17 R-squared: 0.6889
No. Factors: 5 J-statistic: 17.081
No. Observations: 95 P-value 0.1466
Date: Wed, Jun 17 2020 Distribution: chi2(12)
Time: 14:12:24
Cov. Estimator: robust
Risk Premia Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Mkt-RF 1.2294 0.4076 3.0161 0.0026 0.4305 2.0282
SMB -0.0452 0.8661 -0.0522 0.9584 -1.7427 1.6524
HML -1.0782 0.6886 -1.5658 0.1174 -2.4278 0.2714
RMW -0.1397 0.8304 -0.1682 0.8664 -1.7672 1.4879
CMA -0.6245 0.5075 -1.2305 0.2185 -1.6193 0.3702
==============================================================================
Covariance estimator:
HeteroskedasticCovariance
See full_summary for complete results