Machine Learning using Apache Flink and Apache Spark - Part 1

Naveen Sinha

Published Jun 13, 2016

A small statement before I start the article: I have been exploring Apache Spark and Apache Flink for the last six months. While there is no doubt that Apache Spark is the more popular of the two and has more production deployments, I find that Apache Flink looks much more promising in terms of capability (and the underlying architecture it is built on).

I am betting high on Apache Flink; I am sure it is going to replace Apache Spark in the near future.

Now to the main topic.

For a long time, machine learning was mainly a subject of research, used only by a few big companies for specialized use cases. This is no longer true.

Machine learning is finding its place in far more common use cases, in almost every domain. It is now very much possible (I would say, not very difficult) to apply machine learning to ordinary business use cases and “add intelligence to the processes”: business processes can learn from historical data and apply that information and context to current business transactions. In other words, business processes move from Reactive to Predictive!

The big promise of Machine Learning and Big data – Moving from “Reactive Enterprise” to “Predictive Enterprise”.

I am planning to write a short blog post on this subject with my own take on the topic.

Many tools on the market have a Machine Learning module built in. In this article, we are going to focus on Apache Spark and Apache Flink. To keep the discussion focused, I am going to show a real example, with working code, of how to use Supervised Learning, specifically Multiple Regression. As part of the implementation we will get to understand the following:

Feature Scaling
Polynomial Features
Fit a Model to the training data
Run predictions for the test data
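The four steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Spark MLlib or FlinkML API: the function names (`fit_scaler`, `scale`, `polynomial_features`, `fit`, `predict`) and the synthetic data are assumptions made up for this sketch, and the model is fitted with ordinary least squares through the normal equations, which may differ from the optimizer either library uses.

```python
from itertools import combinations_with_replacement
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn the per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return means, stds


def scale(rows, scaler):
    """Feature scaling: standardize each column with the learned statistics."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def polynomial_features(rows, degree=2):
    """Add every product of features up to `degree`, plus a leading 1 (intercept)."""
    expanded = []
    for row in rows:
        terms = [1.0]
        for d in range(1, degree + 1):
            for combo in combinations_with_replacement(row, d):
                product = 1.0
                for value in combo:
                    product *= value
                terms.append(product)
        expanded.append(terms)
    return expanded


def fit(design, targets):
    """Least squares: solve (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(design[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in design) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(design, targets))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(design, weights):
    """Prediction is the dot product of each expanded row with the weights."""
    return [sum(w * x for w, x in zip(weights, row)) for row in design]


def target(x1, x2):
    # Synthetic relationship (an assumption for this demo only)
    return 1 + 2 * x1 + 3 * x2 + x1 * x2


train_x = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5], [6, 2], [2, 6], [1, 5]]
train_y = [target(*row) for row in train_x]
test_x = [[3, 3], [4, 6]]

scaler = fit_scaler(train_x)  # statistics come from the training data only
weights = fit(polynomial_features(scale(train_x, scaler)), train_y)
predictions = predict(polynomial_features(scale(test_x, scaler)), weights)
print([round(p, 3) for p in predictions])
```

Because the synthetic target is itself a degree-2 polynomial of the two features, the predictions for the test rows should come out close to the true values 25 and 51. Note that the test data is scaled with the mean and standard deviation learned from the training data, not its own.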

I will also talk about different types of features – Categorical Features, Quantitative Features. I will touch upon Stepwise Multiple Regression and Logistic Regression.

Possibly, in a separate article or blog post, I will give the above process a statistical angle, covering Multiple Regression, Logistic Regression, and data preparation and analysis topics such as:

How to choose predictors or features and concepts like Multicollinearity, Variance Inflation Factor
Data cleansing – Outliers, Influential points, Residual Plots
Data transformations for Regression model – I will talk about Heteroscedasticity

I will also cover some basic concepts like Hypothesis testing, Normality, t-statistics, confidence intervals, etc.

Machine Learning modules of Spark and Flink provide support for most of the common ML Algorithms.

Apache Flink, in the current version, provides support for the following:

Apache Spark, in the current version, provides the following:

Currently, I see that Apache Flink is lagging in its support for different algorithms and capabilities in the Machine Learning space. But it's catching up fast, and per the product roadmap there are plans to revamp the ML module with all the important capabilities and algorithms.

Check the second part of this article, where I add a little more theory and then a code walk-through.

A better-formatted version of this article is also available on my WordPress site. Click the link below to open it:

MachineLearning with Apache Flink and Apache Spark

