A machine learning micro project

To many people, machine learning seems to involve complex systems built from a combination of programming languages and mathematics. Well, if you intend to deeply understand the logic behind AI algorithms, it really is complex (at least for me, since I only understand the basics of the math). However, if we want to develop a simple machine learning project, it will not require much more than a few lines of code.

It is important to set expectations by clarifying that this is an introductory-level project, far removed from the real world, where data is usually much more complex and chaotic. With that aside, we can use Python and the scikit-learn library to create a machine learning model capable of classifying three different types of Iris flower. For this, we will use the dataset published by Ronald Fisher in 1936, which records measurements of the structure of a large sample of these flowers, classifying each one as setosa, versicolor, or virginica. To train a machine learning model capable of classifying each species of Iris flower, we will use the K-Nearest Neighbors (KNN) technique.

For this micro project, let's assume that the reader already has Python installed on their computer, as well as a working development environment. We will use three Python libraries: NumPy, Matplotlib, and scikit-learn (I'm also using Jupyter Notebook, but that's up to you, since there are alternatives for it). After installing the libraries, let's import the few functions that our script requires.

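The code screenshots from the original article did not survive here, so what follows is a minimal sketch of the imports the text describes (the exact names in the original images may differ):

```python
import numpy as np                # numerical arrays
import matplotlib.pyplot as plt  # plotting

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```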

Now we can load the Iris dataset and split the data into X for the features and y for the labels (the labels are in the target column, and the features are the other columns). Furthermore, we will split the dataset into 80% for training and 20% for testing by setting the test_size parameter of the train_test_split function, which accepts values between 0.0 and 1.0 indicating the fraction of the dataset reserved for testing. In the end, we finish with four variables, and, since our data is now in memory, we can start calling it a dataframe.

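A sketch of this step, assuming the dataset is loaded as a pandas dataframe via load_iris (the random_state value below is my own choice, not from the original):

```python
# Load the Iris dataset as a dataframe: 150 rows, four measurement
# columns plus a 'target' column holding the species label (0, 1, 2)
iris = load_iris(as_frame=True)
df = iris.frame

# Features (X) are every column except 'target'; labels (y) are 'target'
X = df.drop(columns=["target"])
y = df["target"]

# test_size=0.2 reserves 20% of the rows for testing, leaving 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```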

We can visualize the values of our dataframe.

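For example, something like the following prints the first rows:

```python
# Inspect the first rows of the dataframe
print(df.head())
```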

Now we can train our model.

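A minimal sketch of the training step; the article does not state which value of k was used, so the classifier below keeps scikit-learn's default of n_neighbors=5:

```python
# Fit a K-Nearest Neighbors classifier on the training split
knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)

# Mean accuracy on the held-out test split
print(knn.score(X_test, y_test))
```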

That's it! With only a few lines of code, we already have a trained machine learning model! Using the score function, we can check that the model's mean accuracy was 96.6%, which is pretty good performance.

To understand a little more about the dataframe we are handling, let's plot some charts. First, we can check how the flower species separate based on the size of their sepals.

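A sketch of one way to draw this chart with Matplotlib, using the column names produced by load_iris(as_frame=True):

```python
# Scatter plot of sepal length vs. sepal width, colored by species
scatter = plt.scatter(
    df["sepal length (cm)"], df["sepal width (cm)"], c=df["target"]
)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend(scatter.legend_elements()[0], iris.target_names, title="species")
plt.show()
```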

Now we can check the separation based on the size of their petals.

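And the same sketch for the petal measurements:

```python
# Scatter plot of petal length vs. petal width, colored by species
scatter = plt.scatter(
    df["petal length (cm)"], df["petal width (cm)"], c=df["target"]
)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend(scatter.legend_elements()[0], iris.target_names, title="species")
plt.show()
```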

So, this article was designed to achieve two main goals. The first is to demonstrate that some machine learning techniques are simple and can be executed with just a few lines of code. The second is to demonstrate that a simple machine learning model can be enough to build highly efficient solutions. This model has 96% average accuracy and seems to be a perfectly efficient solution for identifying the different types of Iris flowers. Obviously, there are other questions to verify in the real world, such as the rarity of a dataset as well organized as this one and the possibility that the model carries a strong bias. However, the point is: sometimes a simple machine learning technique is enough to build an efficient solution. So why not start with something simple?

To write this article, I used my own lecture at UFSC's PET Math as a reference. The code is available in my GitHub repo.
