Kaggle Competition | Multi-class Classification on Image and Data
Image from 'https://stmed.net/sites/default/files/cat-%26-dog-wallpapers-25420-3404601.jpg'

This article is a brief summary and overview of the models used during the Kaggle competition PetFinder.my Adoption Prediction (step 1). The objective is to predict the "adoptability" of pets, specifically how quickly a pet is adopted.

For the deep learning models I used Python on Google Colab, a free cloud service based on Jupyter notebooks that offers free GPU access (the coolest platform).

Quick Overview of Data

The data is provided by PetFinder.my, a platform dedicated to pet adoption in Malaysia. It includes 11 data sources covering text, tabular and image data.


All the data sources are used and features are extracted from them to create the best model.

Data Description

We have 14,993 dogs and cats in the training set and 3,948 in the test set, with important information about each pet such as age, breed, color and health status.

The variable Description was analyzed using the Google Sentiment Analysis API; the output includes a sentiment score and other variables useful for the analysis.

Similarly, all images were analyzed using the Google Vision API, extracting label and metadata information.

Let's see the distribution of the target AdoptionSpeed:

  • 0 - Pet was adopted on the same day as it was listed (3%)
  • 1 - Pet was adopted between 1 and 7 days (1st week) after being listed (20.6%)
  • 2 - Pet was adopted between 8 and 30 days (1st month) after being listed (27%)
  • 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed (22%)
  • 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days) (28%)


All the variables are analyzed in depth to identify possible interactions and their impact on the dependent variable AdoptionSpeed.

The full analysis is available on my Kaggle.


Feature Engineering/Extraction

From the file train_metadata.zip we extract image features from the JSON files: dominant colors, detected labels and crop hints.

From the file train_sentiment.zip we extract the document sentiment score and magnitude.
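
A minimal sketch of this extraction, assuming the competition's file layout (one Vision API JSON per image, named <PetID>-1.json, and one sentiment JSON per pet); the exact fields kept here are illustrative:

```python
import json
from pathlib import Path

def metadata_features(pet_id, folder="train_metadata"):
    # Vision API output for the pet's first image
    path = Path(folder) / f"{pet_id}-1.json"
    if not path.exists():
        return {}
    data = json.loads(path.read_text(encoding="utf-8"))
    feats = {}
    colors = data["imagePropertiesAnnotation"]["dominantColors"]["colors"]
    feats["dominant_color_score"] = colors[0]["score"]
    feats["dominant_pixel_frac"] = colors[0]["pixelFraction"]
    if "labelAnnotations" in data:
        feats["top_label_score"] = data["labelAnnotations"][0]["score"]
    hints = data.get("cropHintsAnnotation", {}).get("cropHints", [])
    if hints:
        feats["crop_confidence"] = hints[0].get("confidence", 0.0)
    return feats

def sentiment_features(pet_id, folder="train_sentiment"):
    # Natural Language API output: document-level sentiment per pet
    path = Path(folder) / f"{pet_id}.json"
    if not path.exists():
        return {}
    doc = json.loads(path.read_text(encoding="utf-8"))["documentSentiment"]
    return {"sentiment_score": doc["score"],
            "sentiment_magnitude": doc["magnitude"]}
```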

We then create additional variables such as the number of pets per rescuer, the character lengths of the name and description, a pure-breed flag, etc.
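
A minimal pandas sketch of these derived variables, assuming the tabular data sits in a DataFrame named train (307 is the "Mixed Breed" ID in the competition's breed labels):

```python
# Number of pets listed by the same rescuer
train["rescuer_count"] = train.groupby("RescuerID")["PetID"].transform("count")
# Character lengths of the free-text fields
train["name_length"] = train["Name"].fillna("").str.len()
train["desc_length"] = train["Description"].fillna("").str.len()
# Pure-breed flag: a single breed that is not "Mixed Breed" (ID 307)
train["pure_breed"] = ((train["Breed2"] == 0)
                       & (train["Breed1"] != 307)).astype(int)
```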

Image Feature Extraction

To extract information from the images I used the packages OpenCV (Open Source Computer Vision Library) and PIL (Python Imaging Library), available in Python. Both libraries let you open, manipulate and extract features from images.

I created image quality features such as blur estimates and pixel statistics, as well as image shape descriptors using Hu moments.
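
A sketch of these image features with OpenCV (the exact statistics kept are illustrative):

```python
import cv2

def image_features(path):
    # Load the image in grayscale for blur and moment computation
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Blur estimate: variance of the Laplacian (low variance = blurry)
    blur = cv2.Laplacian(img, cv2.CV_64F).var()
    # Hu moments: 7 shape descriptors invariant to translation,
    # scale and rotation
    hu = cv2.HuMoments(cv2.moments(img)).flatten()
    feats = {"blur": blur,
             "pixel_mean": float(img.mean()),
             "pixel_std": float(img.std())}
    feats.update({f"hu_{i}": float(v) for i, v in enumerate(hu)})
    return feats
```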

NLP Feature Extraction

To improve the predictions we extract new features from the variables "Description", "Sentiment Analysis" (created previously) and "Metadata Description" using Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF).


Including the 5 principal components of SVD and NMF slightly improves the model.
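
A sketch of the decomposition step with scikit-learn (the TF-IDF weighting and vocabulary size are assumptions on my part):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

# Vectorize the description text (TF-IDF weighting assumed here)
texts = train["Description"].fillna("")
X_text = TfidfVectorizer(max_features=10000,
                         stop_words="english").fit_transform(texts)

# Keep the first 5 components of each decomposition as dense features
svd_feats = TruncatedSVD(n_components=5, random_state=42).fit_transform(X_text)
nmf_feats = NMF(n_components=5, random_state=42).fit_transform(X_text)
```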

Train/Test/Validation in Machine Learning

As a first step we divide our data into a training set (75% of the data) and a testing set (25% of the data), checking that the two sets have the same class distribution (using the stratify option).
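
With scikit-learn this is a single call (X and y denote the engineered feature matrix and the AdoptionSpeed target):

```python
from sklearn.model_selection import train_test_split

# Stratifying on y keeps the AdoptionSpeed class distribution
# identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```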


To tune the hyper-parameters and avoid over-fitting we use K-fold Cross Validation (CV): the training set is further split into K subsets, called folds. We then iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the remaining fold. We repeat the entire K-fold CV process for each candidate model setting, then compare all the hyper-parameters and select the best ones!
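
scikit-learn's GridSearchCV wraps this whole loop; a minimal sketch (the grid shown is purely illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# 5-fold CV over a small illustrative grid of hyper-parameters
grid = {"n_estimators": [200, 500], "max_depth": [8, 16]}
search = GridSearchCV(RandomForestRegressor(random_state=42), grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
```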

Oversampling

Oversampling is a well-known way to potentially improve models trained on imbalanced data. Indeed, the first class ("Pet was adopted on the same day as it was listed") contains only 3% of the data. We simply apply random over-sampling to the training set (75% of the data).
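
One straightforward way to do this is with imbalanced-learn's RandomOverSampler (the choice of library is an assumption; any random over-sampling would do):

```python
from imblearn.over_sampling import RandomOverSampler

# Random over-sampling duplicates minority-class rows until all five
# AdoptionSpeed classes are balanced; applied to the training set only
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
```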


Multi-class Classification

Our target is ordinal, so we use regression models and then optimize the class boundaries. I used the models implemented in scikit-learn (Random Forest, K-Nearest Neighbors, SGD and Gradient Tree Boosting) as well as the LightGBM and XGBoost packages. We then choose the best model based on Cohen's Kappa, a metric that measures the inter-rater agreement for classifying items.
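
A sketch of this regression-plus-boundaries approach, continuing from the over-sampled training set above (the boundary values are hypothetical starting points; in practice they are tuned to maximize kappa):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import cohen_kappa_score

# Fit a regressor on the ordinal target, then cut its continuous
# predictions into the 5 classes with a set of boundaries
model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train_res, y_train_res)
raw_preds = model.predict(X_test)

boundaries = [0.5, 1.5, 2.5, 3.5]  # hypothetical, to be optimized
pred_class = np.digitize(raw_preds, boundaries)
print(cohen_kappa_score(y_test, pred_class, weights="quadratic"))
```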


Feature Importance

Using the LightGBM model, we can see the importance of each feature in the construction of the boosted decision trees. The result confirms, as before, that Age, the number of pets per rescuer, the image features and the NLP features are the strongest predictors of AdoptionSpeed. We keep only the most important features to prevent over-fitting.
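
With the fitted LightGBM model from the previous sketch, the importances can be plotted directly:

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# Gain-based importance: total loss reduction contributed by each
# feature across all trees; 'model' is the regressor fitted above
lgb.plot_importance(model, max_num_features=20, importance_type="gain")
plt.show()
```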


Evaluation

With a maximum Cohen's Kappa score of 0.37, I finished in the middle of the leaderboard. However, my notebook is in the top 1% of up-voted notebooks thanks to its in-depth descriptive analysis and feature engineering.

Step 2: Image Multi-class Classification


For this part I don't expect much from the results: predicting a pet's adoptability from images alone is a really complex task.

Part 1: Format Image Data as Input to a Keras Model

First of all, we need to structure our training and validation datasets. We will be using the ImageDataGenerator and flow_from_directory() functionality of Keras. flow_from_directory() automatically infers the labels from the directory structure of the folders, so we need to create a directory structure where the images of each class sit within their own sub-directory. The validation folder contains 20% of the images.
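
A minimal sketch of the generators (the directory paths and the 299x299 target size, Inception V3's default input, are assumptions):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Images are expected under data/train/<class>/ and data/valid/<class>/,
# one sub-directory per AdoptionSpeed class
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train", target_size=(299, 299), batch_size=32,
    class_mode="categorical")
valid_gen = datagen.flow_from_directory(
    "data/valid", target_size=(299, 299), batch_size=32,
    class_mode="categorical")
```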


Part 2: Transfer Learning

Deep learning supports an immensely useful technique called transfer learning: you can take a deep learning model pre-trained on a large-scale dataset such as ImageNet and re-purpose it for an entirely different problem. For this task we will use the Inception V3 model, with weights pre-trained on ImageNet. See the Keras Applications documentation for more information.
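
A sketch of loading the pre-trained backbone and attaching a small fully-connected head for the 5 AdoptionSpeed classes (the size of the head is an assumption):

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Load Inception V3 with ImageNet weights, without its classifier head
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3))
# New head: pooling + one hidden layer + softmax over the 5 classes
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
out = Dense(5, activation="softmax")(x)
model = Model(base.input, out)
```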

Part 3: Train Model

We first train our small fully-connected model on top of the frozen base and load its weights. Then we freeze the layers of the Inception V3 model up to the last convolutional block and fine-tune the rest. The next step is to compile the model using the RMSprop optimiser (note that you could also use the SGD or Adam optimiser). We finally plot our training curves using the history output. For more information, see the Keras blog.
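
Continuing from the sketches above, the two training phases could look like this (the layer cut-off, epoch counts and learning rates are assumptions; the cut-off index follows the Keras fine-tuning example for Inception V3):

```python
from tensorflow.keras.optimizers import RMSprop

# Phase 1: train only the new head, with the Inception base frozen
base.trainable = False
model.compile(optimizer=RMSprop(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=valid_gen, epochs=5)

# Phase 2: unfreeze the top of the network (everything after layer 249)
# and fine-tune it together with the head at a lower learning rate
base.trainable = True
for layer in base.layers[:249]:
    layer.trainable = False
model.compile(optimizer=RMSprop(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(train_gen, validation_data=valid_gen, epochs=10)
```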

As expected, the accuracy is really low, so this model was not included in the Kaggle submission.

Conclusion

Participating in this Kaggle competition gave me a much wider range of learning than anywhere else online: competitors come up with great ideas and share them throughout the competition. My first piece of advice: be careful, you can easily become a Kaggle addict and spend more than 200 hours per month on it.

My second piece of advice for a Kaggle competition is to join a team. Indeed, after a couple of weeks I no longer had time to implement all the models and features that I wanted to.

All the Python code (descriptive analysis, data preparation, multi-class models and image multi-class classification) is available on my GitHub.

Voilà! Hope the article was useful to you. Feel free to comment and suggest improvements.

Further Reading

Courses from Stanford University: NLP and Convolutional Neural Networks.

Book: Practical Convolutional Neural Networks by M. Sewak, Md. R. Karim and P. Pujari.
