Extracting information from your excel sheet using Python

Extracting information from your excel sheet using Python

One of the fascinating things about programming is that with a few lines of code you can get your computer to carry out a task that would have been otherwise mundane and annoying for you to do on your own. One of those mundane tasks is extracting information from a large excel sheet. The Python programming language is very robust, and one of the areas where it shines is helping us automate tedious and repetitive Excel tasks.

In this blog post, we will be embarking on a step by step process to extract some valuable information from an excel sheet. The excel sheet we will be using is the sheet that contains all the fruit sales of a supermarket for a month. Each row contains individual records of fruits purchased by customers. There are three columns in the excel sheet. Column A gives the cost per pound of the purchased fruit, Column B gives the Pounds sold, and Column C gives us the total cost of the purchase. The excel sheet has 23758 rows and four columns. You can download the excel sheet here.

Our goal is to find out and document the total pounds sold total sales and the entire purchase instances for each fruit in that month. You can imagine the frustration of having to go through 23758 rows to extract information about each fruit, well we are in luck as Python would help us to complete this task in no time. The steps below will give an in-depth and practical explanation of how you can use Python to complete this task.

Before we get on to this task, I want to assume that you have a basic knowledge of writing code in Python and that you have the Python installed on your computer.

Install the openpyxl Module

The python module we will be working with is the OpenPyXL Module. The OpenPyXL Module is a library that allows you to use Python to read and write excel files or files with the .xlsx/xlsm/xltx/xltm extension. If you don’t have it installed on your IDE, you can install it using

pip install openpyxl

To test if you successfully installed it, import it using

Import openpyxl

So far no error is returned you have installed the OpenPyXL Module and are ready to work on some excel documents.

Read in and open the Excel document with OpenPyXL

The next port of call is to read in the excel sheet into our Python environment. Make sure the Excel you will be working with is in your current working directory(CWD). You can access your CWD using:

import os 
os.getcwd()

#this changes our CWD, if the excel sheet is not in CWD

What if the excel sheet is not in your CWD? You can copy the file and paste it into your CWD, so you access it from there. Once we are sure we have our Excel document in our CWD, we can now read it in.

After reading in the Excel document, we can now access it to obtain various information about the excel sheet.

import pandas as pd 
file = ‘produceSales.xlsx’
data = pd.ExcelFile(file)

print(data.sheet_names) 

#this returns the all the sheets in the excel file ['sheet1']

Next, we parse the sheet we will be working with into a data frame, this will enable us to know if our excel sheet was correctly read in.

df = data.parse(‘Sheet1’)

Read in the spreadsheet data

The next step is to read in data from the spreadsheet [Sheet1].

ps = openpyxl.load_workbook(‘produceSales.xlsx’)

sheet = ps[‘Sheet1’]

Next, we use a For loop to iterate over all the rows in the sheet. The code block can be accessed here

Create an empty dictionary that contains all the information on each fruit. We then use the set.default() method to fill the first set of elements into the dictionary. set.default() first argument checks if the key exists in the dictionary, if it doesn’t it replaces it with the second argument. That way, we can start populating our dictionary with the second argument of the set.default function.

Finally, we populate the dictionary. For each new produce seen in a new row, we increase the metric by its corresponding value in the new row.

Write the Results to a File

After populating the TotalInfo dictionary. We can write this populated dictionary to any file of our choice be it a .csv, .txt, .py et al. We will be using the pprint.pformat module to pretty print our dictionary’s values and we use python’s write mode to write the dictionary’s values to the file. The code snippet below gives an illustration:

resultFile = open(‘Total_info.txt’, ‘w’)

resultFile.write(pprint.pformat(TotalInfo))

resultFile.close()

print(‘Done.’)

The Total_info.txt file will be found in your CWD.You can always change the file format by changing the .txt extension to whatever file format you want.

The code snippet below shows how you can change to a .csv file format.

Open (‘Total_info.csv’, ‘w’)

Conclusion

In this blog post, we demonstrated how we could use Python to extract information from an excel sheet. Knowing how to obtain information from an excel sheet is always a welcome addition to your toolbox as it saves you a lot of time from carrying out repetitive tasks. Feel free to modify the code in the article to suit your needs; you can access the notebook that contains the end to end code that was used in this blog post here.

Happy Pythoning.


To view or add a comment, sign in

Others also viewed

Explore content categories