Python Processing Plant
Who likes running experiments, testing hypotheses, or even solving problems? If you are like me, the answer is a big fat YES!! So I guess it makes sense that in previous jobs I was a technician and an engineer for a manufacturing company. I loved getting my hands dirty (metaphorically and literally) in the lab and on the assembly line. Oftentimes I was responsible for designing, running, and analyzing experiments to see if our processes were as efficient as they could be.
For this data project, I have recently been "hired" as a data analyst for a manufacturing company. Let's call them Metals R' Us. I have been given data from their froth flotation processing plant. My goal was to use Python to analyze key components of the data and look for any irregularities. The "plant manager" wanted an investigation done to see if there were any problems that needed to be addressed. First, let me give you an idea of what the froth flotation process encompasses.
Froth Flotation Process Background
The froth flotation process is widely used in mineral processing to isolate specific minerals or compounds. It filters out impurities like dirt, sand, and silica while keeping only the target mineral, which in our case is iron. Basically, the company digs a big hole and collects big clumps of dirt. Those clumps contain iron, the main item the company is trying to extract so they can eventually sell it. The clumps are put through a froth flotation process to come up with cleaner iron (see figure below).
The company then turns these clumps into a pulp or slurry (a mixture of water and ore), mixes it with Starch and Amina (which strip the dirt away from the iron), and then shoots air or nitrogen bubbles through the liquid mixture to make the iron rise to the top while the impurities remain at the bottom. The iron reaches the surface in a froth-like state, where it is then transferred into a separate area holding only the pure iron particles. And then you have the final pure iron product.
This process is also used in wastewater treatment plants, where water is separated from solids or oils. For more information on this frothy process, check out this website or watch this video on YouTube for a clearer explanation.
DATA
The data can be found here. This is real data taken and used to predict quality in the froth flotation process from March 2017 to September 2017. The column readings are odd in some ways: some columns are sampled every 20 seconds while others are sampled every hour. So the necessary steps were taken to clean the dataset here.
Three libraries were required for upload and support in Python: Pandas, Seaborn, and Matplotlib. Pandas was used for data manipulation, while Seaborn and Matplotlib were used for data visualization.
To get an idea of what the dataset contained, df.head() was used to show the first five rows of the dataset with all of the columns, and df.shape was used to give the exact number of rows and columns in the set. There were 737,453 rows and 24 columns in this dataset.
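As a minimal sketch of those two checks, here is a tiny made-up frame standing in for the real 737,453-row dataset (the values are illustrative only):

```python
import pandas as pd

# Tiny stand-in for the plant dataset (values are made up for illustration).
df = pd.DataFrame({
    "date": ["2017-03-10 01:00:00", "2017-03-10 01:00:00"],
    "% Iron Feed": [55.2, 55.2],
    "% Iron Concentrate": [66.91, 66.91],
})

print(df.head())   # first five rows (this toy frame only has two)
print(df.shape)    # (number of rows, number of columns)
```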
The dates used in this dataset were in text or string form, so they had to be converted to a datetime column in order to be aggregated. I first checked which form was being used, then converted the strings into date format with df['date'] = pd.to_datetime(df['date']).
Another oddity with the dataset was that commas were used as the decimal separator in the numerical data. Rather than editing the cells by hand, I told pandas to interpret the commas as decimal points when loading the file, so the numbers would all be formatted the same: df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=",")
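Putting those two loading steps together, here is a hedged sketch that reads a small in-memory sample instead of the real CSV (the rows and values are made up for illustration):

```python
import io

import pandas as pd

# Two sample rows mimicking the raw file: commas as decimal separators,
# dates stored as plain strings (illustrative values, not the real data).
raw = (
    "date,% Iron Concentrate\n"
    '2017-03-10 01:00:00,"66,91"\n'
    '2017-03-10 01:00:00,"67,02"\n'
)

# decimal="," tells pandas to read "66,91" as the float 66.91.
df = pd.read_csv(io.StringIO(raw), decimal=",")
print(df["date"].dtype)   # object -> the dates are still plain strings

# Convert the string column into a proper datetime column.
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)   # datetime64[ns]
```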
The data dictionary shown below describes the columns used in the dataset. In these columns, flow is how fast something is moving, and level is the height that the frothing reaches from all the bubbles.
The second-to-last column is "% Iron Concentrate", which is one of the main items to focus on: it tells us how pure the iron is at the end of the flotation process. To access a specific column, df["% Iron Concentrate"] returns the information relating to that column.
Whereas, to access certain rows and their information, df.iloc[176:183, :] indexes specific rows to focus on by position.
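A small sketch of both selection styles on a toy frame (values illustrative):

```python
import pandas as pd

# Toy stand-in frame (the real dataset has 737,453 rows and 24 columns).
df = pd.DataFrame({
    "% Iron Concentrate": [66.91, 67.02, 66.85, 66.70],
    "% Silica Concentrate": [1.31, 1.29, 1.35, 1.40],
})

# Select a single column by its name -> a Series.
iron = df["% Iron Concentrate"]
print(iron)

# Select rows by position with .iloc -> here rows 1 and 2, all columns.
subset = df.iloc[1:3, :]
print(subset)
```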
Analysis
Now onto the analysis portion. The boss asked for summary statistics for each of the columns: the average and median, as well as the min and max.
Summary Stats
By using df.describe(), we can see the requested information in an easy-to-read format, with summary statistics for each column.
And if we want Python to clearly tell us the range of dates we are working with (the start, midpoint, and end date), we can use the max(), min(), and median() functions. See below for details.
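Sketched on a toy frame, with three illustrative dates standing in for the March-to-September range:

```python
import pandas as pd

# Toy frame with a datetime column (dates and values are illustrative).
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-10", "2017-06-16", "2017-09-09"]),
    "% Iron Concentrate": [66.91, 65.20, 67.02],
})

# Summary statistics per column: count, mean, std, min, quartiles, max.
print(df.describe())

# Start, midpoint, and end of the date range.
print(df["date"].min())      # earliest reading
print(df["date"].median())   # midpoint of the dates
print(df["date"].max())      # latest reading
```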
June 16 Irregularity?
The boss was going through some reports and wanted me to investigate anything unusual that might have happened on June 16, 2017 (the halfway point of the dataset's date range).
First, I needed to filter the rows with a boolean mask and create a new dataframe df_june:
Now this says: create a new dataframe called df_june that is just the old dataframe, but only where the date is greater than June 15, 2017 at midnight and less than June 17, 2017. The & operator requires both of those conditions to be met. This reduces the number of rows, but we still have all the columns.
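A rough sketch of that filter on a toy frame; the boundary rows show what the strict comparisons keep and drop (all dates and values are illustrative):

```python
import pandas as pd

# Toy frame with readings around June 16 (illustrative values).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-06-15 00:00",   # exactly midnight -> dropped by the strict >
        "2017-06-16 03:00",
        "2017-06-16 21:00",
        "2017-06-17 00:00",   # exactly midnight -> dropped by the strict <
    ]),
    "% Iron Concentrate": [66.0, 64.5, 67.1, 66.3],
})

# Boolean mask: both conditions must hold, joined with &.
mask = (df["date"] > "2017-06-15") & (df["date"] < "2017-06-17")
df_june = df[mask]
print(df_june)
```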
I then created a variable holding a list of all the important columns we want to focus on: % Iron Concentrate (the most important), along with % Silica Concentrate, Ore Pulp pH, Flotation Column 05 Level, and date. I called this variable important_cols. After setting that list, I simply created a new dataframe called df_june_important by selecting the important_cols columns from the older dataframe (df_june).
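A sketch of that column selection, assuming the df_june frame from the previous step (values illustrative; the extra % Iron Feed column is only there to show what gets dropped):

```python
import pandas as pd

# Stand-in for df_june with a few of the 24 columns (illustrative values).
df_june = pd.DataFrame({
    "date": pd.to_datetime(["2017-06-16 03:00", "2017-06-16 21:00"]),
    "% Iron Concentrate": [64.5, 67.1],
    "% Silica Concentrate": [2.1, 1.2],
    "Ore Pulp pH": [9.8, 10.1],
    "Flotation Column 05 Level": [450.0, 780.0],
    "% Iron Feed": [55.2, 56.0],   # not in the important list -> dropped
})

# List the columns to keep, then select them all at once.
important_cols = [
    "date", "% Iron Concentrate", "% Silica Concentrate",
    "Ore Pulp pH", "Flotation Column 05 Level",
]
df_june_important = df_june[important_cols]
print(df_june_important.columns.tolist())
```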
HOW DO THEY RELATE?
Now, we have isolated our variables and have graphs to look at, but what does it all mean? It's hard to tell from those small graphs, and the boss wanted to know how these variables relate to one another. Looking at all of these variables and their pairwise relationships would require six different scatter plots. The data visualization library Seaborn lets us do this easily with just one line of code: the pairplot.
That just produced even more confusing visuals, so I decided to pair the graphs with linear regression fits.
Now this was a lot more helpful. It shows a strong negative correlation between % Iron Concentrate and % Silica Concentrate: on this particular day, a higher % Iron Concentrate coincided with a lower % Silica Concentrate. The other relationships show little structure and don't tell us much about how those variables are related.
More Detail
The boss was still a bit confused and wanted to see how the % Iron Concentrate changed throughout that day. So I made a line plot with Seaborn, where the x-axis shows the hours of the day.
This graph clearly shows a major increase in % Iron Concentrate between 6:00 am and 12:00 pm, which lines up with the sharp decrease in % Silica Concentrate seen in the graph before.
The boss found this graph to be very useful and wanted to see the other variables (important columns) across the same time frame. I thought about putting them all on the same graph, but their units of measure are very different from one another: the percentages for iron and silica will always be between 0 and 100, while the pH sits at much lower values and the flotation level at much higher ones.
So instead, I made a few separate graphs to show the boss the fluctuation over the course of the day, using a loop in Python. I skipped the first two columns (date and % Iron Concentrate) because we didn't need a date-to-date comparison or a repeat of the graph we just saw.
CONCLUSION
As was mentioned earlier, it is clear that when the % Iron Concentrate went up, the % Silica Concentrate went down at the same time. There is nothing significant about the fluctuation of the pH. But the sudden jump in flotation level at the end of the day on June 16 is noteworthy. This could be explained in part by the plant shutting down and stopping its processes, leading to an overfill of material. This data can be useful for future comparisons after handing it over to the boss.
Python and its libraries can produce charts fairly easily to present the information you are trying to analyze, and that can help keep businesses running smoothly.
Thank you for reading all of this. If you have any questions feel free to comment below or connect with me Brock Johnson here on LinkedIn.
I am looking for new opportunities and roles in the data world, so if you hear of any or are in the market please reach out, thanks!