Python project: This is Not Another Iron Mining Analysis
Background
For this week's project, we will step into the role of a data analyst hired by a mining company called Metals R' Us. Our first task in this new role will be analyzing data from their flotation plant. The results of this analysis will help determine the purity of the iron concentrate.
The Dataset
The dataset we will analyze in this project was taken from Kaggle and contains real data collected between March 2017 and September 2017. It has 24 columns and 737,453 rows.
Some columns were sampled every 20 seconds, while others were sampled hourly. The date stamp includes the day, month, year, and hour, but does not show the minutes. If you are interested in reviewing this dataset, you can find it at the following link
Please note that this report is for educational purposes only and is part of a project for Data Career Jumpstart. For this project, we will use #Excel and the browser-based notebook #DeepNote, which will allow us to use #Python for this analysis.
Data Cleaning
Before starting our analysis, we need to import the libraries Pandas (for data manipulation) and Seaborn and Matplotlib (for data visualization). Since time will be one of the key measures in our visualizations, we also need to redefine the date column, which is currently being read as a string. Using the Pandas function to_datetime(), we will convert the date column to a datetime type.
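A minimal sketch of this step is shown below. The two-row DataFrame stands in for the flotation-plant CSV, and the column names ("date", "% Iron Concentrate") are assumed to match the Kaggle dataset's headers:

```python
import pandas as pd

# Small synthetic sample standing in for the flotation-plant data;
# the column names mirror the Kaggle dataset's headers.
df = pd.DataFrame({
    "date": ["2017-07-16 01:00:00", "2017-07-16 02:00:00"],
    "% Iron Concentrate": [65.2, 65.4],
})

# "date" is currently a string column; convert it to datetime64
# so we can filter and plot by time later.
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # datetime64[ns]
```

When loading the real file, the same to_datetime() call applies to the full "date" column after pd.read_csv().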
The Data Analysis
The first step in this analysis will be to provide summary statistics for each column. We will use the Pandas method df.describe(), which reports the count, mean, standard deviation, minimum, quartiles (including the median), and maximum for each numeric column.
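As a quick sketch, here is describe() applied to a tiny synthetic stand-in for the plant data (the column name is illustrative):

```python
import pandas as pd

# Synthetic stand-in for the plant data
df = pd.DataFrame({"% Iron Concentrate": [64.5, 65.0, 66.5, 65.5]})

# describe() summarizes each numeric column: count, mean, std,
# min, 25%/50%/75% quartiles, and max
summary = df.describe()
print(summary)
```

The 50% row of the output is the median, so one call covers all the statistics mentioned above.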
Given the nature of this business, we have a series of relevant variables to monitor.
Our engineering team asked us to investigate the readings from July 16, 2017, as it seems something unusual happened. To do this, we will use a date filter with a boolean mask to create a new DataFrame, df_july.
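The boolean-mask filter can be sketched as follows, using a small synthetic DataFrame in place of the full dataset:

```python
import pandas as pd

# Synthetic stand-in; real data would span March-September 2017
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-15 23:00",
                            "2017-07-16 03:00",
                            "2017-07-17 01:00"]),
    "% Iron Concentrate": [65.1, 65.8, 64.9],
})

# Boolean mask: True only for rows that fall on July 16, 2017
mask = (df["date"] >= "2017-07-16") & (df["date"] < "2017-07-17")
df_july = df[mask]
print(len(df_july))  # 1 row survives the filter
```

Because the "date" column is already a datetime type, pandas compares it directly against the date strings in the mask.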
To focus on specific columns in our existing DataFrame (df_july), we can create a list called important_cols that contains the column names we want to highlight. Then, in one step, we will create a new DataFrame, df_july_important, that contains only the columns from df_july that match the names in important_cols. This will filter df_july to keep only the information we're interested in.
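A sketch of this column selection is below; the entries in important_cols are illustrative and should be replaced with the dataset's actual headers:

```python
import pandas as pd

# Synthetic stand-in for df_july with a few example columns
df_july = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-16 01:00"]),
    "% Iron Concentrate": [65.8],
    "% Silica Concentrate": [2.1],
})

# List the columns we want to keep (names are illustrative)
important_cols = ["date", "% Iron Concentrate"]

# Indexing with a list of names returns only those columns
df_july_important = df_july[important_cols]
print(list(df_july_important.columns))
```

This one-step selection keeps df_july intact while giving us a narrower DataFrame to work with.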
Correlations
Now, the head of engineering has reached out to ask whether these variables are correlated. Using the data visualization library Seaborn, we created scatterplots of the four most relevant variables.
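One such scatterplot can be sketched like this; the data is synthetic and the column names are assumed from the Kaggle dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df_july; column names follow the Kaggle dataset
df_july = pd.DataFrame({
    "Flotation Column 05 Level": [480, 500, 510, 495],
    "% Iron Concentrate": [65.0, 65.3, 65.1, 65.4],
})

# One variable against % Iron Concentrate; repeat for each pair of interest
ax = sns.scatterplot(data=df_july,
                     x="Flotation Column 05 Level",
                     y="% Iron Concentrate")
```

Repeating the call with different x columns produces one scatterplot per variable pair.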
Since it is not easy to identify any correlations in the scatterplots, a correlation matrix will provide a clearer perspective on any insights.
The matrix confirmed that all the correlation values are low. The highest correlation is between % Iron Concentrate and Flotation Column 05 Level, with a value of 0.09.
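The correlation matrix itself is one line with pandas, and Seaborn's heatmap makes it readable at a glance. This sketch uses synthetic data, so its correlation values will not match the 0.09 reported above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the variables of interest
df = pd.DataFrame({
    "% Iron Concentrate": [65.0, 65.3, 65.1, 65.4],
    "Flotation Column 05 Level": [480, 500, 510, 495],
})

corr = df.corr()  # pairwise Pearson correlations between numeric columns

# annot=True prints each correlation value inside its cell
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Values near 0 (like the 0.09 found here) indicate little to no linear relationship between the variables.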
% Iron Concentrate
The last step in this study is analyzing how % Iron Concentrate changed throughout July 16. Once again using the Seaborn library, we will visualize this information.
The highest and lowest percentages of iron concentrate both occur near the end of the day, with the highest around 66.5% and the lowest around 64.5%. Given that roughly two-percentage-point range, it is safe to say the concentration levels remained stable throughout the day, with no significant changes.
Conclusion
Throughout the development of this project, it was easy to understand the importance of data for every industry. Mining is no exception, where the right analysis of information can save a company millions of dollars and ensure optimal resource allocation. Although there were no major findings when analyzing the top five most relevant variables for the company, this project demonstrated how using Python together with different libraries can help make informed business decisions.
Thank you for taking the time to read my new article. I would love to hear your thoughts and comments! It means a lot to me. This project was part of Avery Smith's Data Career Jumpstart Bootcamp, where I am diving deeper into the data world and always learning. Please follow me at Andres Cordero and stay connected on this data journey!