In AIML, Data is Paramount... Why?

Claudio Passera

Published Oct 29, 2022

We all know that data is one of the most important elements of an AIML project. But why? I share my experience here, and I look forward to your input and comments!

In my first post I went through how we set up our projects. Once the basics are defined, people usually have in mind how everything should work. In the end, we expect the algorithm to mimic human behavior, don't we? The point is that, as humans, we observe a very large amount of data in parallel, continuously make correlations, learn from multiple sources, and leverage the experience that more senior people pass on to us. We then define what we observe (classify), predict how the system will evolve, get feedback on our outcomes, and learn from that feedback. We would like to transfer all of this into software.

The good part is that techniques exist today to do this, provided the challenge is well defined and specific, so that we can apply what is called narrow AI.

Once we have defined the problem to solve, the next step is to decide which elements of the system to observe in order to make a decision. If I ask you what the weather is like right now, specifically "is it sunny or rainy?" (two classes), then to make the classification you might observe whether the sun is out or whether rain is falling. These we call meaningful features.

There are also other features, e.g. “how the stock market closed a week ago”. This is a feature, but probably not a relevant one for classifying today's weather.
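One rough way to check whether a feature is meaningful is to measure how strongly it moves together with the label. The sketch below is only an illustration: the six samples are invented, the feature names are hypothetical, and the Pearson correlation is just one of several possible relevance measures.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented toy samples, for illustration only:
# (rain_is_falling, stock_market_change_last_week, label)
# label: 1 = rainy, 0 = sunny
samples = [
    (1, 0.5, 1),
    (1, -0.3, 1),
    (0, 0.4, 0),
    (0, -0.2, 0),
    (1, 0.1, 1),
    (0, 0.1, 0),
]

rain = [s[0] for s in samples]
stock = [s[1] for s in samples]
label = [s[2] for s in samples]

# A score close to 1 suggests a meaningful feature, close to 0 a non-relevant one.
rain_score = abs(pearson(rain, label))
stock_score = abs(pearson(stock, label))
print(f"rain feature:  {rain_score:.2f}")
print(f"stock feature: {stock_score:.2f}")
```

In this toy data the rain feature follows the label closely while the stock market feature does not. On real data a score like this is only a first hint, and the domain experts should still confirm which features make sense.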

The first point about data is to have it and to have access to it. And you need it in large quantity, I would say thousands of samples.

The second decision is whether to go for Supervised Learning or Unsupervised Learning.

Supervised means you have predefined categories, or buckets, and you want to classify each new sample into a specific bucket. In the weather case there were two (sunny, rainy); in our microwave case, the categories are Propagation, Congestion, and Fault.
Unsupervised means that you do not have predefined categories. You are more interested in identifying correlations and patterns in your samples, seeing how they naturally group into categories, and then trying to make sense of those categories. For example, you observe a certain number of features, and based on those features the samples are grouped into clusters. You can then assume that the samples in a cluster have something in common. If you identify the commonalities, it is fair to assume that when your AIML assigns a new sample to a cluster, the sample will show that cluster's behaviour.
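
The clustering idea can be sketched with a very small k-means, one common clustering technique among many. Everything here is an assumption made for illustration: the two-feature samples are invented, the number of clusters is chosen by hand, and the first k points are used as starting centroids only to keep the sketch deterministic.

```python
from math import dist

def assign(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans(points, k, iterations=10):
    """Group points into k clusters and return the cluster centroids."""
    # Deterministic start for this sketch: the first k points are the initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# Invented, unlabelled samples with two features each.
points = [(1.0, 1.2), (8.0, 8.1), (0.8, 1.0), (1.1, 0.9), (7.9, 8.3), (8.2, 7.8)]
centroids = kmeans(points, k=2)

# A new sample is assigned to the nearest cluster and is assumed to behave like it.
new_sample = (7.5, 8.0)
cluster_of_new_sample = assign(new_sample, centroids)
print("new sample falls in cluster", cluster_of_new_sample)
```

The algorithm only returns cluster numbers. What each cluster means still has to be worked out by looking at the samples inside it.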

You can think, for example, of a recommendation system for shopping. People who buy "a", "b", "c" then frequently buy "d". You did not label the bucket (cluster), but if a new sample is classified in that cluster or close to it, it will probably show similar behavior. I intentionally write in bold the relevant keywords that have math and statistics behind them.

If you want to build a model with Supervised Learning (defined buckets), then each of your training samples needs to carry a Label, that is, the bucket it belongs to, defined in advance. This labelling is necessary because we train the algorithm by giving it examples to learn from. This is an important job for the domain experts, the people who know the system and who can label a sample by analyzing it. The domain experts are the people who train the system; we want the algorithm to learn from them.
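
To make the role of the labels concrete, here is a minimal sketch of learning from labelled examples. It uses a nearest-centroid classifier, chosen only because it is short, not because it is the method used in our projects. The features (cloud cover and humidity, both between 0 and 1) and the four training samples are invented for the weather example.

```python
from collections import defaultdict
from math import dist

def train(labelled_samples):
    """Compute one mean feature vector (centroid) per label."""
    grouped = defaultdict(list)
    for features, label in labelled_samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: dist(features, model[label]))

# Invented training set: (cloud_cover, humidity), each labelled in advance
# by a "domain expert".
training = [
    ((0.1, 0.3), "sunny"),
    ((0.2, 0.2), "sunny"),
    ((0.9, 0.8), "rainy"),
    ((0.8, 0.9), "rainy"),
]

model = train(training)
prediction = predict(model, (0.85, 0.7))
print("predicted bucket:", prediction)
```

Without the label on every training sample, `train` would have nothing to group by, which is why the labelling work of the domain experts comes first.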

If you go for Unsupervised Learning you don't need labels. However, you will probably spend a large amount of time with the domain experts making sense of the categories, that is, the clusters that the algorithm identified.

The bottom line is the importance of the domain experts in making sense of the data: labelling it in Supervised Learning, or analyzing the clusters in Unsupervised Learning.
