Predictive Analysis: Wonders of Machine Learning With Apache Spark MLlib
Have you heard of the freaky AI robot that learns new words in real time and told its human creators that it would keep them in a “people zoo”? Ref - http://www.johnnyetc.com/robot-says-he-will-put-humans-in-a-people-zoo/
Yes, we are talking about the wonders of artificial intelligence and machine learning here. The predictive analytics market is projected to be worth USD 9.20 billion by 2020. Love it or hate it, predictive analytics is the next big thing in the market, and one we all ought to learn.
Predictive analytics provides a predictive score or probability that informs organizational processes in fields such as marketing, credit risk assessment, fraud detection, manufacturing, healthcare, financial services, insurance, telecommunications, retail, travel, pharmaceuticals, capacity planning, and others.
It uses a variety of statistical techniques that analyze current and historical facts to make predictions about future or otherwise unknown events. Among these techniques, machine learning is gaining momentum in today’s market.
Three Categories of Techniques for Machine Learning
Three common categories of machine learning techniques are Classification, Clustering and Collaborative Filtering.
- Classification: Gmail uses a machine learning technique called classification to designate if an email is spam or not, based on the data of an email: the sender, recipients, subject, and message body. Classification takes a set of data with known labels and learns how to label new records based on that information.
- Clustering: Google News uses a technique called clustering to group news articles into different categories, based on title and content. Clustering algorithms discover groupings that occur in collections of data.
- Collaborative Filtering: Amazon uses a machine learning technique called collaborative filtering (commonly referred to as recommendation), to determine which products users will like based on their history and similarity to other users.
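To make the classification idea concrete (labeling new records from records whose labels are known), here is a plain-Python sketch of a 1-nearest-neighbour classifier. This is not Spark MLlib code, and the spam-related feature values below are invented purely for illustration:

```python
# Toy classification: label a new record by finding the closest
# labeled example. Features here are invented spam signals, e.g.
# (number of links, number of ALL-CAPS words).

def classify(labeled, point):
    """Return the label of the closest labeled example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled, key=lambda ex: dist(ex[0], point))[1]

# Known (features, label) pairs.
training = [
    ((8.0, 6.0), "spam"),
    ((9.0, 7.0), "spam"),
    ((1.0, 0.0), "ham"),
    ((0.0, 1.0), "ham"),
]

print(classify(training, (7.5, 5.0)))  # nearest neighbours are spam -> "spam"
print(classify(training, (0.5, 0.5)))  # nearest neighbours are ham -> "ham"
```

Real systems like Gmail's filter use far richer features and models, but the core loop is the same: learn from labeled data, then label new records.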
Overview of ML Algorithms
In general, machine learning may be broken down into two classes of algorithms: supervised and unsupervised.
Supervised algorithms use labeled data in which both the input and output are provided to the algorithm. Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels.
Classification & Regression
Both classification and regression are forms of supervised machine learning, where we learn patterns from historical data and predict outcomes for new data by matching against those patterns. Decision trees are among the most common and easily understood supervised tools for classification and regression.
Classification Trees
Classification trees, as the name implies, are used to separate the dataset into classes of the response variable. Usually the response variable has two classes, Yes or No (1 or 0), for example whether an email is spam or not.
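A trained classification tree is just a sequence of feature tests ending in a class label. Here is a hand-built (not learned, and not Spark MLlib) sketch of such a tree; the features and split thresholds are invented:

```python
# A tiny hand-built classification tree for a spam/not-spam decision.
# Each "if" is an internal node splitting on a feature; each return
# is a leaf holding a class label.

def is_spam(email):
    # Root split: many links is treated as a strong spam signal here.
    if email["links"] > 5:
        return "spam"
    # Next split: a known sender is treated as trustworthy.
    if email["known_sender"]:
        return "not spam"
    # Final split on a subject-line feature.
    return "spam" if email["all_caps_subject"] else "not spam"

print(is_spam({"links": 9, "known_sender": False, "all_caps_subject": False}))
# -> "spam"
```

A tree-learning algorithm (such as the one in Spark MLlib) chooses these splits automatically from labeled data instead of by hand.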
Regression Trees
Regression trees are needed when the response variable is numeric, for example the predicted price of a consumer good. Regression trees are therefore applicable to prediction-type problems, as opposed to classification.
In either case, the predictors or independent variables may be categorical or numeric. It is the target variable that determines the type of decision tree needed.
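To make the contrast concrete, here is a hand-built regression tree sketch in plain Python (again not Spark MLlib). The branching structure is the same as a classification tree, but each leaf holds a numeric value; the features, split points, and prices are invented:

```python
# A tiny hand-built regression tree predicting a numeric target
# (a consumer-good price) instead of a class label. Note the
# categorical predictor (brand_premium) and numeric predictor
# (weight_kg); it is the numeric target that makes this regression.

def predict_price(item):
    if item["brand_premium"]:
        return 120.0 if item["weight_kg"] > 1.0 else 80.0
    return 40.0 if item["weight_kg"] > 1.0 else 25.0

print(predict_price({"brand_premium": True, "weight_kg": 1.5}))   # -> 120.0
print(predict_price({"brand_premium": False, "weight_kg": 0.3}))  # -> 25.0
```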
K-Means
K-means, from the unsupervised side of machine learning, is one of the most commonly used clustering algorithms; it groups data points into a predefined number of clusters. The implementation in spark.mllib has the following parameters:
- k is the number of desired clusters.
- maxIterations is the maximum number of iterations to run.
- initializationMode specifies either random initialization or initialization via k-means||.
- runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
- initializationSteps determines the number of steps in the k-means|| algorithm.
- epsilon determines the distance threshold within which we consider k-means to have converged.
- initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
Collaborative Filtering (ALS)
Collaborative filtering is commonly used for recommender systems, which are becoming increasingly popular in the big data world. The reason is that while other big data products, such as data used for benchmarking or for predictions, require a decision maker, a recommendation and filtering system is automated and does not need an analyst. Collaborative filtering recommends products based on statistically driven purchase patterns of similarly profiled customers, built on supervised or unsupervised techniques. Collaborative filtering built around neighborhood methods and latent factor models investigates the affinity between consumers’ profiles and product inter-dependencies to discover new user-item associations.
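The neighborhood flavor of collaborative filtering can be sketched directly: find the user most similar to the target user, then suggest items that user liked. This plain-Python sketch is not Spark's ALS (which instead factorizes the rating matrix into latent factors); the users, items, and ratings are invented:

```python
import math

def similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(ratings, user):
    # Find the most similar other user ...
    others = [u for u in ratings if u != user]
    best = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    # ... and suggest their items the target user hasn't rated yet.
    return sorted(set(ratings[best]) - set(ratings[user]))

ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob":   {"book_a": 5, "book_b": 5, "book_c": 4},
    "carol": {"book_d": 5},
}
print(recommend(ratings, "alice"))  # bob is most similar -> ["book_c"]
```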
Soon I will upload a video with hands-on examples of each of the techniques discussed.
References
- https://www.mapr.com/blog/apache-spark-machine-learning-tutorial - Apache Spark Machine Learning Tutorial | MapR
- https://spark.apache.org/docs/1.6.2/mllib-guide.html - MLlib | Spark 1.6.2 Documentation
- http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees - 2 Main Differences Between Classification and Regression Trees
- http://www.cxotoday.com/story/why-recommendation-system-is-big-datas-shining-star/ - Why Recommendation System Is Big Data’s Shining Star?