Predictive Analysis: Wonders of Machine Learning With Apache Spark MLlib
Have you heard of the freaky AI robot that learns new words in real time and told its human creators that it would keep them in a “people zoo”? Ref - http://www.johnnyetc.com/robot-says-he-will-put-humans-in-a-people-zoo/
Yes, we are talking about the wonders of artificial intelligence and machine learning here. The predictive analytics market is projected to be worth USD 9.20 billion by 2020. Love it or hate it, predictive analytics is the next big thing in the market, and one we all ought to learn.
Predictive analytics provides a predictive score or probability that informs organizational processes in fields such as marketing, credit risk assessment, fraud detection, manufacturing, healthcare, financial services, insurance, telecommunications, retail, travel, pharmaceuticals, capacity planning, and others.
It uses a variety of statistical techniques that analyze current and historical facts to make predictions about future or otherwise unknown events. Among these techniques, machine learning is gaining momentum in today’s market.
Three Categories of Techniques for Machine Learning
Three common categories of machine learning techniques are Classification, Clustering and Collaborative Filtering.
- Classification: Gmail uses a machine learning technique called classification to designate if an email is spam or not, based on the data of an email: the sender, recipients, subject, and message body. Classification takes a set of data with known labels and learns how to label new records based on that information.
- Clustering: Google News uses a technique called clustering to group news articles into different categories, based on title and content. Clustering algorithms discover groupings that occur in collections of data.
- Collaborative Filtering: Amazon uses a machine learning technique called collaborative filtering (commonly referred to as recommendation), to determine which products users will like based on their history and similarity to other users.
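To make the classification idea concrete (labeling new records from records whose labels are known), here is a plain-Python sketch of a 1-nearest-neighbour classifier. This is not Spark MLlib code, and the spam-related feature values below are invented purely for illustration:

```python
# Toy classification: label a new record by finding the closest
# labeled example. Features here are invented spam signals, e.g.
# (number of links, number of ALL-CAPS words).

def classify(labeled, point):
    """Return the label of the closest labeled example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled, key=lambda ex: dist(ex[0], point))[1]

# Known (features, label) pairs.
training = [
    ((8.0, 6.0), "spam"),
    ((9.0, 7.0), "spam"),
    ((1.0, 0.0), "ham"),
    ((0.0, 1.0), "ham"),
]

print(classify(training, (7.5, 5.0)))  # nearest neighbours are spam -> "spam"
print(classify(training, (0.5, 0.5)))  # nearest neighbours are ham -> "ham"
```

Real systems like Gmail's filter use far richer features and models, but the core loop is the same: learn from labeled data, then label new records.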
Overview of ML Algorithms
In general, machine learning may be broken down into two classes of algorithms: supervised and unsupervised.
Supervised algorithms use labeled data in which both the input and output are provided to the algorithm. Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels.
Classification & Regression
Both classification and regression are forms of supervised machine learning, where we learn patterns from historical data and predict outcomes for new data by matching against those patterns. Decision trees are among the most common and easily understood supervised tools for classification and regression.
Classification Trees
Classification trees, as the name implies, are used to separate the dataset into classes of the response variable. Usually the response variable has two classes, Yes or No (1 or 0), for example whether an email is spam or not.
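A trained classification tree is just a sequence of feature tests ending in a class label. Here is a hand-built (not learned, and not Spark MLlib) sketch of such a tree; the features and split thresholds are invented:

```python
# A tiny hand-built classification tree for a spam/not-spam decision.
# Each "if" is an internal node splitting on a feature; each return
# is a leaf holding a class label.

def is_spam(email):
    # Root split: many links is treated as a strong spam signal here.
    if email["links"] > 5:
        return "spam"
    # Next split: a known sender is treated as trustworthy.
    if email["known_sender"]:
        return "not spam"
    # Final split on a subject-line feature.
    return "spam" if email["all_caps_subject"] else "not spam"

print(is_spam({"links": 9, "known_sender": False, "all_caps_subject": False}))
# -> "spam"
```

A tree-learning algorithm (such as the one in Spark MLlib) chooses these splits automatically from labeled data instead of by hand.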
Regression Trees
Regression trees are needed when the response variable is numeric, for example the predicted price of a consumer good. Regression trees are therefore applicable to prediction-type problems, as opposed to classification.
In either case, the predictors or independent variables may be categorical or numeric. It is the target variable that determines the type of decision tree needed.
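To make the contrast concrete, here is a hand-built regression tree sketch in plain Python (again not Spark MLlib). The branching structure is the same as a classification tree, but each leaf holds a numeric value; the features, split points, and prices are invented:

```python
# A tiny hand-built regression tree predicting a numeric target
# (a consumer-good price) instead of a class label. Note the
# categorical predictor (brand_premium) and numeric predictor
# (weight_kg); it is the numeric target that makes this regression.

def predict_price(item):
    if item["brand_premium"]:
        return 120.0 if item["weight_kg"] > 1.0 else 80.0
    return 40.0 if item["weight_kg"] > 1.0 else 25.0

print(predict_price({"brand_premium": True, "weight_kg": 1.5}))   # -> 120.0
print(predict_price({"brand_premium": False, "weight_kg": 0.3}))  # -> 25.0
```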
K-Means
K-means, from the unsupervised side of machine learning, is one of the most commonly used clustering algorithms; it groups data points into a predefined number of clusters. The implementation in spark.mllib has the following parameters:
- k is the number of desired clusters.
- maxIterations is the maximum number of iterations to run.
- initializationMode specifies either random initialization or initialization via k-means||.
- runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
- initializationSteps determines the number of steps in the k-means|| algorithm.
- epsilon determines the distance threshold within which we consider k-means to have converged.
- initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
Collaborative Filtering (ALS)
Collaborative filtering is commonly used for recommender systems, which are becoming increasingly popular in the big data world. The reason is that while other big data products, such as data used for benchmarking or for predictions, require a decision maker, a recommendation and filtering system is automated and does not need an analyst. Collaborative filtering recommends products based on statistically driven purchase patterns of similarly profiled customers, built on supervised or unsupervised techniques. Collaborative filtering built around neighborhood methods and latent factor models investigates the affinity between consumers’ profiles and product inter-dependencies to discover new user-item associations.
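The neighborhood flavor of collaborative filtering can be sketched directly: find the user most similar to the target user, then suggest items that user liked. This plain-Python sketch is not Spark's ALS (which instead factorizes the rating matrix into latent factors); the users, items, and ratings are invented:

```python
import math

def similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(ratings, user):
    # Find the most similar other user ...
    others = [u for u in ratings if u != user]
    best = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    # ... and suggest their items the target user hasn't rated yet.
    return sorted(set(ratings[best]) - set(ratings[user]))

ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob":   {"book_a": 5, "book_b": 5, "book_c": 4},
    "carol": {"book_d": 5},
}
print(recommend(ratings, "alice"))  # bob is most similar -> ["book_c"]
```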
Soon I will upload a video with hands-on examples of each of the techniques discussed.
References
- https://www.mapr.com/blog/apache-spark-machine-learning-tutorial - Apache Spark Machine Learning Tutorial | MapR
- https://spark.apache.org/docs/1.6.2/mllib-guide.html - MLlib | Spark 1.6.2 Documentation
- http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees - 2 Main Differences Between Classification and Regression Trees
- http://www.cxotoday.com/story/why-recommendation-system-is-big-datas-shining-star/ - Why Recommendation System Is Big Data’s Shining Star?