Pre-processing Finance Data

As data scientists, we know that acquiring, processing, and cleaning data takes up the majority of our time. It's easy to get caught up in the excitement of modelling, but it's important to remember the crucial step of data preprocessing. By manipulating and cleaning data, we can improve model performance and ensure accurate analysis.

In this article, we'll focus on financial time series and cover recipes for

  • converting prices to returns,
  • adjusting returns for inflation,
  • changing the frequency of time series data,
  • imputing missing data, and
  • aggregating trade data.

With these tools, we can take our data analysis to the next level and make a real impact in our industry.


Converting Prices to Returns

In order to model time series, many techniques require that the time series be stationary. Although we will delve into this topic in more detail in the future, we will briefly touch on it now. Stationarity assumes that the statistics of a process, such as the mean and variance of a series, remain constant over time. By using this assumption, we can create models that aim to predict future values of the process.

Unfortunately, asset prices are usually non-stationary. Their statistics not only change over time, but we can also observe trends or seasonality in the data. In order to make the time series stationary, we transform the prices into returns. This also has the added benefit of normalization, allowing us to easily compare different return series. Comparing raw stock prices would be more difficult, as one stock may start at ₹100 and another at ₹1,000.
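To see this in practice, here is a minimal sketch (using simulated, hypothetical data) comparing a random-walk price series with its returns: the rolling mean of the prices drifts over time, while the rolling mean of the returns stays close to zero.

```python
import numpy as np
import pandas as pd

# Simulate a random-walk "price" series (hypothetical data, fixed seed for reproducibility)
rng = np.random.default_rng(42)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 1000))))

# Simple returns derived from the same series
returns = prices.pct_change().dropna()

# Rolling means: they drift for prices but stay near zero for returns
price_roll = prices.rolling(250).mean().dropna()
ret_roll = returns.rolling(250).mean().dropna()

print(price_roll.std(), ret_roll.std())
```

The rolling mean of the price series varies by whole currency units as the series trends, while the rolling mean of the returns barely moves, which is exactly why models that assume stationarity are fitted on returns rather than prices.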

There are two types of returns:

  • Simple returns: they aggregate over assets; the simple return of a portfolio is the weighted sum of the returns of its individual assets. They are defined as R_t = (P_t - P_{t-1}) / P_{t-1} = P_t / P_{t-1} - 1.
  • Log returns: they aggregate over time; for example, the log return for a month is the sum of the log returns of the days within that month. They are defined as r_t = log(P_t / P_{t-1}) = log(P_t) - log(P_{t-1}).

Here, P_t is the price of the asset at time t. The formulas above do not account for dividends, which affect returns and would require a small modification.
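The additive property of log returns is easy to verify numerically. A quick sketch with hypothetical prices:

```python
import numpy as np

# Toy prices over four days (hypothetical values)
prices = np.array([100.0, 102.0, 101.0, 105.0])

# Daily log returns
log_ret = np.diff(np.log(prices))

# Their sum equals the log return over the whole period
total = np.log(prices[-1] / prices[0])
print(np.isclose(log_ret.sum(), total))  # True

# Simple returns do NOT add up the same way
simple_ret = prices[1:] / prices[:-1] - 1
print(np.isclose(simple_ret.sum(), prices[-1] / prices[0] - 1))  # False
```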

Let's try this in Python:

import numpy as np
import pandas as pd
import yfinance as yf

# Download TATASTEEL data
tatasteel = yf.download("TATASTEEL.NS", start="2020-01-01", end="2023-10-01")

# Calculate simple daily returns
tatasteel['Daily_Returns'] = tatasteel['Close'].pct_change()

# Calculate log returns (use the same price column in both numerator and denominator)
tatasteel["Log_Returns"] = np.log(tatasteel["Close"] / tatasteel["Close"].shift(1))

tatasteel.head()        
Stock DataFrame

Now let's visualise it.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20,8))
plt.subplot(2,1,1)
sns.lineplot(tatasteel['Daily_Returns'], label='Daily Returns', color='r')
plt.subplot(2,1,2)
sns.lineplot(tatasteel['Log_Returns'], label='Log Returns', color='b')
plt.show()        
Simple Returns and Log Returns

Adjusting The Returns for Inflation

When conducting various types of analyses, particularly those spanning a significant period, it is important to take inflation into account. Inflation refers to the overall increase in the price level of an economy over time, or put simply, the decline in the purchasing power of money. Therefore, it may be necessary to separate the effects of inflation from the rise in stock prices resulting from a company's growth or development.

We will focus on adjusting the returns and calculating the real returns, rather than directly adjusting the prices of stocks. This can be done using the following formula:

R_t^r = (1 + R_t) / (1 + π_t) - 1

where R_t^r is the real return, R_t is the simple return at time t, and π_t stands for the inflation rate.
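A quick sanity check of the formula with hypothetical numbers: a 2% nominal monthly return during a month with 0.5% inflation corresponds to a real return of roughly 1.49%, not simply 1.5%.

```python
# Worked example of the real-return formula (hypothetical numbers):
# a 2% nominal monthly return with 0.5% monthly inflation
nominal = 0.02
inflation = 0.005

real = (1 + nominal) / (1 + inflation) - 1
print(round(real, 6))  # 0.014925
```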

For this example, we use Tata Steel's stock prices from 2020 to 2023 (downloaded as in the previous recipe).

For Indian Inflation Rate - https://www.rateinflation.com/inflation-rate/india-historical-inflation-rate/

We can rely on the above site, and here is the code to extract the data.

import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the webpage
url = "https://www.rateinflation.com/inflation-rate/india-historical-inflation-rate/"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the table containing the monthly inflation rate data
    table = soup.find('table')
    
    if table:
        # Create a CSV file to store the data
        with open('india_inflation_data.csv', 'w', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            
            # Extract table headers
            headers = [th.text.strip() for th in table.find_all('th')]
            csv_writer.writerow(headers)
            
            # Extract and write table rows
            rows = table.find_all('tr')[1:]  # Skip the header row
            for row in rows:
                data = [td.text.strip() for td in row.find_all('td')]
                csv_writer.writerow(data)
        
        print("Data has been successfully downloaded and saved to 'india_inflation_data.csv'.")
    else:
        print("Table not found on the webpage.")
else:
    print("Failed to fetch the webpage. Status code:", response.status_code)         

Let's do some more preprocessing before actually using this data.

from datetime import datetime

# Load the saved CSV file
indian_inf = pd.read_csv('india_inflation_data.csv')

# Capitalise column names before converting to dates
indian_inf.columns = [col.capitalize() for col in indian_inf.columns]

# Drop the 'Annual' column as it's of no use here
indian_inf.drop(columns=['Annual'], inplace=True)

# Change from pivot-table (wide) form to long form
melted_indian_inf = pd.melt(indian_inf, id_vars=['Year'],
                            value_vars=indian_inf.columns.to_list()[1:], var_name='Month', value_name='Inflation Rate')

# Merge Month and Year
melted_indian_inf['Month-Year'] = melted_indian_inf['Month'] + '-' + melted_indian_inf['Year'].astype(str)

# Convert it to a date
melted_indian_inf['Month-Year'] = melted_indian_inf['Month-Year'].apply(lambda x: datetime.strptime(x, '%b-%Y').date())

# Drop redundant columns
melted_indian_inf.drop(columns=['Year', 'Month'], inplace=True)

melted_indian_inf = melted_indian_inf.set_index('Month-Year')
melted_indian_inf.index = pd.to_datetime(melted_indian_inf.index)

# Resample to month-end so the index aligns with the resampled stock data
melted_indian_inf = melted_indian_inf.resample("M").last()

melted_indian_inf = melted_indian_inf.dropna()

# Strip the '%' sign and convert to a decimal fraction
melted_indian_inf['Inflation Rate'] = melted_indian_inf['Inflation Rate'].apply(lambda x: float(x.strip('%')) / 100)

Now it's ready to merge with our original data.

tatasteel_monthly = tatasteel.resample("M").last()
tatasteel_monthly = pd.merge(tatasteel_monthly, melted_indian_inf, how='inner', left_index=True, right_index=True)

tatasteel_monthly['Monthly_Returns'] = tatasteel_monthly['Close'].pct_change()        

Let's calculate the real returns:

tatasteel_monthly["Real_Returns"] = ((tatasteel_monthly["Monthly_Returns"] + 1) / (tatasteel_monthly["Inflation Rate"] + 1) - 1) 

tatasteel_monthly.head()

plt.figure(figsize=(20,8))
sns.lineplot(tatasteel_monthly['Monthly_Returns'], label='Monthly Returns')
sns.lineplot(tatasteel_monthly['Real_Returns'], label='Monthly Real Returns')
plt.show()        
Monthly Returns Vs Monthly Real Returns

Imputing Missing Data

Dealing with missing data is essential in financial time series; gaps appear because of market holidays, trading suspensions, or data-vendor errors. We can impute missing values using techniques such as forward filling (ffill()), backward filling (bfill()), or interpolation (interpolate()). For price series, I prefer interpolation.
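As a short sketch (using a hypothetical price series with gaps), the three approaches look like this in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series with gaps (NaNs stand in for missing quotes)
idx = pd.date_range("2023-01-02", periods=6, freq="D")
prices = pd.Series([100.0, np.nan, 102.0, np.nan, np.nan, 105.0], index=idx)

# Three common imputation strategies
ffilled = prices.ffill()        # carry the last known price forward
bfilled = prices.bfill()        # pull the next known price backward
interp = prices.interpolate()   # linear interpolation between known points

print(interp.tolist())  # [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]
```

Forward filling is often preferred when a quote must never "look into the future" (e.g. in backtesting), while interpolation gives a smoother series for exploratory analysis.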

Conclusion

In this article, we have discussed several data preprocessing methods using Python libraries such as yfinance and pandas. We have covered techniques such as converting prices to returns, adjusting returns for inflation, changing the frequency of time series data, and filling in missing data. These techniques are crucial in ensuring that your stock data is accurate and ready for analysis.

It is important to note that data preprocessing is a critical step in the stock data analysis process. The accuracy of your analysis depends largely on the quality of your data. By mastering these data preprocessing techniques, you will be better equipped to make informed investment decisions and obtain valuable insights from your stock data analysis.
