Pre-processing Finance Data
As data scientists, we know that acquiring, processing, and cleaning data takes up the majority of our time. It's easy to get caught up in the excitement of modelling, but it's important to remember the crucial step of data preprocessing. By manipulating and cleaning data, we can improve model performance and ensure accurate analysis.
In this article, we'll focus on financial time series and cover recipes for converting prices to returns, adjusting returns for inflation, changing the frequency of time series data, and imputing missing values.
With these tools, we can take our data analysis to the next level and make a real impact in our industry.
Converting Prices to Returns
In order to model time series, many techniques require that the time series be stationary. Although we will delve into this topic in more detail in the future, we will briefly touch on it now. Stationarity assumes that the statistics of a process, such as the mean and variance of a series, remain constant over time. By using this assumption, we can create models that aim to predict future values of the process.
Unfortunately, asset prices are usually non-stationary. Their statistics not only change over time, but we can also observe trends or seasonality in the data. In order to make the time series stationary, we transform the prices into returns. This also has the added benefit of normalization, allowing us to easily compare different return series. Comparing raw stock prices would be more difficult, as one stock may start at ₹100 and another at ₹1,000.
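To make this concrete, here is a minimal sketch on synthetic data (not downloaded prices) showing that a simulated price series drifts over time while its returns keep a roughly constant mean:

```python
import numpy as np
import pandas as pd

# Simulate a random-walk price series and derive returns from it
rng = np.random.default_rng(0)
log_steps = rng.normal(loc=0.0005, scale=0.01, size=2000)  # synthetic daily log-returns
prices = pd.Series(100 * np.exp(np.cumsum(log_steps)))     # non-stationary price level
rets = prices.pct_change().dropna()                        # (approximately) stationary returns

# Compare first-half vs second-half means: the price level drifts, the returns do not
price_shift = abs(prices.iloc[:1000].mean() - prices.iloc[1000:].mean())
ret_shift = abs(rets.iloc[:1000].mean() - rets.iloc[1000:].mean())
print(price_shift, ret_shift)  # the price means differ far more than the return means
```

This is exactly why the transformation to returns matters: most models that assume constant statistics behave much better on the second series than on the first.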
There are two types of returns:

Simple returns: R_t = P_t / P_(t-1) - 1

Log returns: r_t = ln(P_t / P_(t-1))

where P_t is the price of an asset at time t. In the preceding formulas, we do not consider dividends, which obviously impact the returns and would require a small modification of the formulas.
Let's try this in Python:
import numpy as np
import pandas as pd
import yfinance as yf
# Download TATASTEEL data
tatasteel = yf.download("TATASTEEL.NS", start="2020-01-01", end="2023-10-01")
# Calculate daily returns
tatasteel['Daily_Returns'] = tatasteel['Close'].pct_change()
# Calculate log returns (use the same price series in both numerator and denominator)
tatasteel["Log_Returns"] = np.log(tatasteel["Close"] / tatasteel["Close"].shift(1))
tatasteel.head()
Now let's visualise it.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20,8))
plt.subplot(2,1,1)
sns.lineplot(tatasteel['Daily_Returns'], label='Daily Returns', color='r')
plt.subplot(2,1,2)
sns.lineplot(tatasteel['Log_Returns'], label='Log Returns', color='b')
plt.show()
Adjusting The Returns for Inflation
When conducting various types of analyses, particularly those spanning a significant period, it is important to take inflation into account. Inflation refers to the overall increase in the price level of an economy over time, or put simply, the decline in the purchasing power of money. Therefore, it may be necessary to separate the effects of inflation from the rise in stock prices resulting from a company's growth or development.
We will focus on adjusting the returns and calculating the real returns, rather than directly adjusting the prices of stocks. This can be done using the following formula:

R_rt = (1 + R_t) / (1 + π_t) - 1

where R_rt is the real return, R_t is the simple return at time t, and π_t stands for the inflation rate.
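As a quick sanity check of the formula with hypothetical numbers: a 5% nominal return during a period with 2% inflation works out to roughly a 2.94% real return, not 3%:

```python
# Hypothetical illustration of the real-return formula (numbers are made up)
nominal_return = 0.05   # 5% simple return
inflation_rate = 0.02   # 2% inflation over the same period

real_return = (1 + nominal_return) / (1 + inflation_rate) - 1
print(round(real_return, 4))  # 0.0294
```

Note that simply subtracting the inflation rate (0.05 - 0.02 = 0.03) is only an approximation; the ratio form above is the exact adjustment.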
For this example, we use Tata Steel's stock prices from 2020 to 2023 (downloaded as in the previous recipe).
For the Indian inflation rate, we can rely on this site: https://www.rateinflation.com/inflation-rate/india-historical-inflation-rate/

Here is the code to extract the data:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the webpage
url = "https://www.rateinflation.com/inflation-rate/india-historical-inflation-rate/"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the table containing the monthly inflation rate data
    table = soup.find('table')
    if table:
        # Create a CSV file to store the data
        with open('india_inflation_data.csv', 'w', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            # Extract table headers
            headers = [th.text.strip() for th in table.find_all('th')]
            csv_writer.writerow(headers)
            # Extract and write table rows
            rows = table.find_all('tr')[1:]  # Skip the header row
            for row in rows:
                data = [td.text.strip() for td in row.find_all('td')]
                csv_writer.writerow(data)
        print("Data has been successfully downloaded and saved to 'india_inflation_data.csv'.")
    else:
        print("Table not found on the webpage.")
else:
    print("Failed to fetch the webpage. Status code:", response.status_code)
Let's do some more preprocessing before actually using this data.
from datetime import datetime

# Load the saved CSV file
indian_inf = pd.read_csv('india_inflation_data.csv')

# Capitalise the column names before converting to dates (e.g. 'JAN' -> 'Jan')
indian_inf.columns = [col.capitalize() for col in indian_inf.columns]

# Drop the 'Annual' column, as we only need the monthly figures
indian_inf.drop(columns=['Annual'], inplace=True)

# Reshape from wide form (one column per month) to long form
melted_indian_inf = pd.melt(indian_inf, id_vars=['Year'],
                            value_vars=indian_inf.columns.to_list()[1:],
                            var_name='Month', value_name='Inflation Rate')

# Merge Month and Year into a single column
melted_indian_inf['Month-Year'] = melted_indian_inf['Month'] + '-' + melted_indian_inf['Year'].astype(str)

# Convert it to a date
melted_indian_inf['Month-Year'] = melted_indian_inf['Month-Year'].apply(
    lambda x: datetime.strptime(x, '%b-%Y').date())

# Drop the now-redundant columns and set the date as the index
melted_indian_inf.drop(columns=['Year', 'Month'], inplace=True)
melted_indian_inf = melted_indian_inf.set_index('Month-Year')
melted_indian_inf.index = pd.to_datetime(melted_indian_inf.index)

# Resample to month-end dates (note: the result must be assigned back)
melted_indian_inf = melted_indian_inf.resample("M").last()

# Drop missing months and convert '7.59%'-style strings to decimal fractions
melted_indian_inf = melted_indian_inf.dropna()
melted_indian_inf['Inflation Rate'] = melted_indian_inf['Inflation Rate'].apply(
    lambda x: float(x.strip('%')) / 100)
Now it's ready to merge with our original data.
tatasteel_monthly = tatasteel.resample("M").last()
tatasteel_monthly = pd.merge(tatasteel_monthly, melted_indian_inf, how='inner', left_index=True, right_index=True)
tatasteel_monthly['Monthly_Returns'] = tatasteel_monthly['Close'].pct_change()
Let's calculate the real returns:
tatasteel_monthly["Real_Returns"] = ((tatasteel_monthly["Monthly_Returns"] + 1) / (tatasteel_monthly["Inflation Rate"] + 1) - 1)
tatasteel_monthly.head()
plt.figure(figsize=(20,8))
sns.lineplot(tatasteel_monthly['Monthly_Returns'], label='Monthly Returns')
sns.lineplot(tatasteel_monthly['Real_Returns'], label='Monthly Real Returns')
plt.show()
Imputing Missing Data
Dealing with missing data is essential. We can impute missing values using various techniques, such as forward filling (ffill()), backward filling (bfill()), or interpolation (interpolate()). For price series, I prefer interpolation, as it avoids the artificial flat segments that forward filling creates.
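As a minimal sketch on a synthetic series (the values here are illustrative, not real market data), the three approaches behave quite differently around a gap:

```python
import numpy as np
import pandas as pd

# A small illustrative price series with two missing days
prices = pd.Series([100.0, np.nan, np.nan, 106.0, 108.0],
                   index=pd.date_range("2023-01-02", periods=5, freq="D"))

ffilled = prices.ffill()       # repeat the last observed value: 100, 100, 100, 106, 108
bfilled = prices.bfill()       # take the next observed value:   100, 106, 106, 106, 108
interp = prices.interpolate()  # linear interpolation fills 102 and 104

print(interp.round(2).tolist())  # [100.0, 102.0, 104.0, 106.0, 108.0]
```

Forward filling is the safest choice when you must not use future information (e.g. in a backtest), whereas interpolation gives a smoother series but implicitly peeks at the next observed value.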
Conclusion
In this article, we have discussed several data preprocessing methods for financial time series, using data downloaded with the yfinance library. We have covered techniques such as converting prices to returns, adjusting returns for inflation, modifying the frequency of time series data, and filling in missing data. These techniques are crucial in ensuring that your stock data is accurate and ready for analysis.
It is important to note that data preprocessing is a critical step in the stock data analysis process. The accuracy of your analysis depends largely on the quality of your data. By mastering these data preprocessing techniques, you will be better equipped to make informed investment decisions and obtain valuable insights from your stock data analysis.