Use Machine Learning to Predict Fish Species
Machine learning is a subset of artificial intelligence (AI) in which a machine learns the important patterns, characteristics, and features of a dataset and uses them for classification, regression, or clustering of unknown samples. In other words, the machine draws inferences from the data we have and applies that knowledge to other, similar data. Here is an example that uses machine learning to predict fish species.
Step 1. Set Up a Python Data Science Environment
See: How to Set Up a Python Data Science Environment Using Anaconda PowerShell Prompt
Step 2. Install Scikit-learn
Install scikit-learn using the following command:
pip install scikit-learn
Step 3. Gather Data from Various Sources, Such as Files
In this example, we gather the fish data from a CSV file.
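To confirm that the installation worked, a quick check like the one below can be run in the same environment. It is only a sanity check; the version printed depends on what pip installed.

```python
# Sanity check: import scikit-learn and report the installed version.
import sklearn

version = sklearn.__version__
print("scikit-learn version:", version)
```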
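Before preprocessing, it usually helps to look at what the file contains. The sketch below reads a small inline CSV instead of the real file; the column names and values are assumptions made for illustration and may not match the actual fish dataset.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file's columns and values may differ.
csv_text = """Species,Weight,Length,Height,Width
Roach,120.0,20.0,6.1,3.5
Roach,,19.0,5.6,3.1
Whitefish,270.0,26.0,8.4,4.2
"""

df_sample = pd.read_csv(io.StringIO(csv_text))

# Basic checks before any modelling: size, missing values, class counts.
n_rows, n_cols = df_sample.shape
missing_per_column = df_sample.isna().sum()
species_counts = df_sample["Species"].value_counts()

print(n_rows, n_cols)
print(missing_per_column)
print(species_counts)
```

Missing values found this way would be handled in the data cleaning part of the next step.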
Step 4. Data Preprocessing
Data preprocessing is the process of handling raw data and making it fit for analysis. In most cases raw data cannot be consumed directly; we must process it first so that it is suitable and easy to work with. Data preprocessing consists of two major steps: data cleaning and data transformation. Data transformation techniques include rescaling, normalization, standardization, and encoding.
Most models work with numbers, so we need an encoder to convert categorical data, which may be in text form, into numbers. Here we use an encoder to convert the values of the Species column (Roach, Whitefish, etc.) to numbers.
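As a rough illustration of two of the transformations named above, the sketch below applies standardization and min-max rescaling to a tiny made-up feature matrix. The numbers are hypothetical, and the fish example in this article does not itself scale its features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical measurements: weight in grams, length in cm.
features = np.array([[100.0, 20.0],
                     [200.0, 25.0],
                     [300.0, 30.0]])

# Standardization: each column gets mean 0 and unit variance.
standardized = StandardScaler().fit_transform(features)

# Min-max rescaling: each column is mapped onto the range [0, 1].
rescaled = MinMaxScaler().fit_transform(features)

print(standardized)
print(rescaled)
```

In a real workflow the scaler would normally be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training.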
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
#load the dataset
df = pd.read_csv('CPRG 109_Fish.csv')
# Use the numeric columns as features (the text 'Species' column is excluded)
X = df.select_dtypes(exclude=['object']).to_numpy()
# Encode the target
enc = OrdinalEncoder()
enc.fit(df[["Species"]])
df[["Species"]] = enc.transform(df[["Species"]])
y = df["Species"].to_numpy().reshape(-1, 1)
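To make the encoding step easier to follow, here is a small round trip on made-up labels. The species list is hypothetical; the point is only to show how OrdinalEncoder assigns codes and how inverse_transform recovers the names.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical species labels, shaped as one column (n_samples, 1).
species = np.array([["Roach"], ["Whitefish"], ["Roach"], ["Bream"]])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(species)

# Categories are stored in sorted order; a code is the index into this list.
categories = list(encoder.categories_[0])

# inverse_transform maps the numeric codes back to the original names.
decoded = encoder.inverse_transform(codes)

print(categories)
print(codes.ravel())
print(decoded.ravel())
```

OrdinalEncoder works on 2-D input, which is why the labels are kept as a column. scikit-learn also provides LabelEncoder, which is intended for 1-D target labels and could be used here instead.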
Step 5. Train, Validation, and Test Datasets
The training dataset is used to fit the model; it is the data the model learns from. In this example, we randomly choose 75% of the samples as training data.
The test dataset stands in for unseen data: its true target values are withheld from the model during training. After model creation and validation are complete, we run the model on the test samples. In this example, we hold out the remaining 25% as test data.
The train_test_split function takes 4 arguments here (the features, the target, train_size, and shuffle) and splits the input dataset accordingly. The data is now ready for use with machine learning models.
from sklearn.model_selection import train_test_split
train_feature, test_feature, train_target, test_target = train_test_split(X, y, train_size=0.75, shuffle=True)
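The split above is random, so it changes from run to run. A minimal sketch of a reproducible, class-balanced variant follows, using made-up data; random_state and stratify are optional train_test_split parameters that the original call does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 2 balanced classes.
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo,
    train_size=0.75,
    shuffle=True,
    random_state=42,   # fixed seed, so the same split is produced on every run
    stratify=y_demo,   # keep the class ratio similar in both subsets
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying can matter for a dataset like this one if some species have only a few samples, since a purely random split might leave a rare species out of the training or test set.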
Step 6. Train the Models and Use Visualization Techniques to Check Model Performance
In this example, we use logistic regression and a decision tree and compare their accuracy.
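# Train logistic regression on the training data
model = LogisticRegression(solver='liblinear', random_state=0)
# ravel() flattens the (n, 1) target into the 1-D shape that fit() expects
model.fit(train_feature, train_target.ravel())
# Test the model on the test dataset
predictresultloR = model.predict(test_feature)
# Reshape the predictions to a column, the shape inverse_transform expects
predictresultloR = predictresultloR[:, np.newaxis]
# Transform back to get the predicted fish species
predictedlabelloR = enc.inverse_transform(predictresultloR)
# Transform back to get the true fish species
truth = enc.inverse_transform(test_target)
# Boolean array: True where the prediction matches the true species
resultloR = truth == predictedlabelloR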
# Perform decision tree classification on the training data, then test the model on the test dataset
model = DecisionTreeClassifier(criterion="gini")
model.fit(train_feature, train_target.ravel())
# Test the model on the test dataset
predictresultDTC = model.predict(test_feature)
# Reshape the predictions to a column, the shape inverse_transform expects
predictresultDTC = predictresultDTC[:, np.newaxis]
# Transform back to get the predicted fish species
predictedlabelDTC = enc.inverse_transform(predictresultDTC)
# Transform back to get the true fish species
truth = enc.inverse_transform(test_target)
# Boolean array: True where the prediction matches the true species
resultDTC = truth == predictedlabelDTC
# Plot the fitted decision tree; plt.subplots returns a (figure, axes) pair
fig, ax = plt.subplots(figsize=(50, 50))
tree.plot_tree(model, filled=True, ax=ax)
plt.show()
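The arrays resultloR and resultDTC hold True wherever a prediction matches the true species, so accuracy is the fraction of True values in each. A minimal sketch of the comparison follows. Two assumptions: the Iris dataset bundled with scikit-learn stands in for the fish CSV, and logistic regression uses its default solver with a larger max_iter instead of liblinear.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): Iris replaces the fish CSV, which is not bundled here.
X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.75, shuffle=True, random_state=0
)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(criterion="gini", random_state=0),
}

accuracies = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # Accuracy = fraction of test samples whose predicted class matches the truth.
    accuracies[name] = accuracy_score(y_test, predictions)

for name, score in accuracies.items():
    print(f"{name}: {score:.2f}")
```

With the variables from this article, resultloR.mean() and resultDTC.mean() give the same kind of score, since the mean of a boolean array is the fraction of True entries.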