Data Management In Machine Learning
Image Source: https://www.wordstream.com/wp-content/uploads/2021/07/machine-learning.png

Machine Learning (ML) is a subset of artificial intelligence (AI) that involves developing algorithms and models that enable machines to learn from training data and make decisions based on it. Training a model requires a lot of data, so finding the right source and type of data is very important. In machine learning, data types refer to the different kinds or formats of data used for training. Understanding these data types is essential for building accurate and effective machine learning models. Let us walk through some of the major data types used in the machine learning process.

Numerical Data: Measurable quantities such as age, weight, distance and height.

Categorical Data: Any data that can be grouped by category, for example species, vehicles or regions.

Time-based Data: Any data that can be grouped on the basis of time, for example daily, weekly, monthly or yearly.

Text Data: Data made up of words and sentences that carry meaning.
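
As a rough illustration, the sketch below shows how these four data types might appear together in one record and how each is typically handled. It uses only the Python standard library, and the field names and values are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical records: each one mixes the four data types described above.
records = [
    {"age": 34, "region": "North", "visit": date(2023, 1, 15), "note": "Asked about pricing"},
    {"age": 27, "region": "South", "visit": date(2023, 1, 28), "note": "Reported a login issue"},
    {"age": 45, "region": "North", "visit": date(2023, 2, 3), "note": "Renewed the subscription"},
]

# Numerical data: arithmetic such as an average is meaningful.
mean_age = sum(r["age"] for r in records) / len(records)

# Categorical data: map each category to an integer code.
categories = sorted({r["region"] for r in records})
region_codes = {name: code for code, name in enumerate(categories)}

# Time-based data: group the records by month.
visits_per_month = Counter(r["visit"].strftime("%Y-%m") for r in records)

# Text data: split each note into lowercase word tokens.
tokens = [r["note"].lower().split() for r in records]

print(mean_age, region_codes, dict(visits_per_month), tokens[0])
```

Real projects usually rely on a data library for these steps, but the idea is the same: each data type calls for a different kind of preparation before training.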

As said, any data set can be built from the data types mentioned above. But if you look carefully, you can see that the data items are often related to one another, and sometimes the data can be searched with SQL queries. Examples are the student details of a school or the employee details of an organization. Such data can be numerical as well. Data of this kind is called Structured Data. All other data, for example audio, video and images, is called Unstructured Data.
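
To make the idea of searchable structured data concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The students table, its columns and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structured data: every row follows the same fixed set of columns.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO students (name, grade, score) VALUES (?, ?, ?)",
    [("Asha", 10, 88.5), ("Ben", 10, 72.0), ("Chitra", 11, 91.0)],
)

# Because the data is structured, it can be searched with a SQL query.
rows = conn.execute(
    "SELECT name, score FROM students WHERE grade = ? ORDER BY score DESC",
    (10,),
).fetchall()
conn.close()

print(rows)
```

An image or an audio clip has no such columns to filter on, which is why it falls under unstructured data.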

Image source: https://drek4537l1klr.cloudfront.net/serrano/v-4/Figures/image011.png

Data can sometimes be accompanied by details such as its attributes, characteristics or category. For example, if you look at the photos uploaded in your Facebook posts, the page source in the browser shows a description of what each photo is about. Data that has such attributes or characteristics associated with it is called Labelled Data. Data whose attributes or characteristics are missing is called Unlabelled Data.

Supervised Machine Learning algorithms need these labels to learn how to make predictions. The process of adding labels to data based on its context is called Data Labelling.
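
The difference between unlabelled and labelled data can be sketched in a few lines of Python. The review texts, the keyword list and the label_review function are all hypothetical. In practice labels usually come from human annotators or dedicated tools, so the keyword rule here is only a stand-in for that step.

```python
# Hypothetical unlabelled data: raw text with no category attached.
unlabelled = [
    "The battery lasts all day",
    "Screen cracked after a week",
    "Great camera and fast delivery",
]

# Assumed keyword list, used only to imitate an annotator's decision.
NEGATIVE_WORDS = {"cracked", "broken", "slow"}

def label_review(text):
    """Return 'negative' if any assumed negative keyword appears, else 'positive'."""
    words = set(text.lower().split())
    return "negative" if words & NEGATIVE_WORDS else "positive"

# Labelled data: each item is now paired with its label.
labelled = [(text, label_review(text)) for text in unlabelled]

for text, label in labelled:
    print(label, "->", text)
```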

To start a research project or any other project in ML, the first and most important step is Data Selection. This must be done prior to Data Collection. It is a very important stage because selecting the wrong data set will affect every later step of the project. A data set that is fine for one project may not be the right one for another. The factors to consider at this point are:

  • Data Type
  • Source of Data
  • How we are going to extract data from the source
  • Accessibility
  • How well the available data is aligned with the objective of the project

Once the source is identified, large quantities of data need to be loaded from it to train the model. A large portion of the loaded data is often junk information that is not helpful for the project. As the quantity of this unproductive data increases, model training becomes slower and less effective, and the model can sometimes learn inaccurate things. This is where Feature Selection comes in. It keeps the features (attributes) that are most suitable for model training and discards the rest. The following image shows the different methods used for feature selection.

Image: https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-1.png
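
As one small example of feature selection, the sketch below applies a simple variance filter: a feature whose value never changes cannot help the model tell samples apart, so it is dropped. This is only one of many possible approaches, and the feature names, values and the select_by_variance helper are assumptions made for the illustration.

```python
from statistics import pvariance

# Hypothetical data set: each key is a feature, each list holds one value per sample.
features = {
    "age": [25, 32, 47, 51, 38],
    "height_cm": [160, 172, 168, 181, 175],
    "country_code": [1, 1, 1, 1, 1],  # constant, so it carries no information
}

def select_by_variance(columns, threshold=0.0):
    """Keep the features whose population variance is above the threshold."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]

selected = select_by_variance(features)
print(selected)
```

Libraries such as scikit-learn offer ready-made versions of this and of more advanced methods. The plain Python version is used here only to keep the idea visible.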

Data Sampling is another important concept. It lets us infer information about a population from the statistics of a subset of that population. There are many sampling methods, which I will cover in a separate article. They are essentially the same sampling techniques that are used in statistics.
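
Until then, here is a minimal sketch of two common techniques, simple random sampling and stratified sampling, written with the Python standard library. The population, the region field and both helper functions are hypothetical, and the fixed seed is there only to make the result repeatable.

```python
import random
from collections import defaultdict

def simple_random_sample(population, k, seed=0):
    """Pick k items so that every item has the same chance of being chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(population, key, fraction, seed=0):
    """Sample the same fraction from each group so group proportions are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in population:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one item per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 records from the North region, 20 from the South.
population = [{"id": i, "region": "North" if i < 80 else "South"} for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, key=lambda p: p["region"], fraction=0.1)

north = sum(1 for p in strat if p["region"] == "North")
south = sum(1 for p in strat if p["region"] == "South")
print(len(srs), north, south)
```

A simple random sample may over- or under-represent the smaller region by chance, while the stratified sample keeps the 80/20 split of the population.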

More articles by Tismon Varghese
