Machine Learning using Apache Flink and Apache Spark - Part 1
A small statement before I start the article: I have been exploring Apache Spark and Apache Flink for the last 6 months. While there is no doubt that Apache Spark is the more popular of the two and has more production deployments, I find that Apache Flink looks much more promising in terms of capability (and the underlying architecture it is built on).
I am betting high on Apache Flink; I am sure it is going to replace Apache Spark in the near future.
Now to the main topic.
Machine learning has been a subject of research for a long time, and until recently it was used only by a few big companies for specialized use cases. This is not true anymore.
Machine learning is finding its place in far more common use cases, in almost every domain. It is now very much possible (I would say, not very difficult) to apply machine learning to normal business use cases and “add intelligence to the processes”: enable the business processes to learn from historical data, and apply that information and context to current business transactions. In other words, convert the business processes from Reactive to Predictive!
The big promise of Machine Learning and Big Data: moving from the “Reactive Enterprise” to the “Predictive Enterprise”.
I am planning to write a small blog on this subject with my version / my ideas on this topic.
There are many tools available in the market that have a Machine Learning module built in. In this article, we are going to focus on Apache Spark and Apache Flink. To keep the discussion focused, I am going to show a real example, with working code, of Supervised Learning using Multiple Regression. As part of the implementation we will get to understand the following:
- Feature Scaling
- Polynomial Features
- Fit a Model to the training data
- Run predictions for the test data
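Before getting to the framework code, the four steps above can be sketched in plain Python. This is only a minimal illustration under my own assumptions, not Spark or Flink code: the data is made up and noise-free, the function names (polynomial_features, fit_scaler, scale, fit, predict) are hypothetical helpers, scaling is done by standardization, and the model is fitted with the normal equations rather than whatever optimizer either framework uses.

```python
from statistics import mean, pstdev


def polynomial_features(row, degree=2):
    """Expand [x1, x2] into [x1, x1^2, x2, x2^2] (powers only, no cross terms)."""
    return [x ** d for x in row for d in range(1, degree + 1)]


def fit_scaler(rows):
    """Learn the per-column mean and standard deviation from the training rows."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(rows, means, stds):
    """Feature scaling by standardization: (value - mean) / std."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def solve(a, b):
    """Solve the linear system a * w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def fit(rows, targets):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    x = [[1.0] + row for row in rows]  # the leading 1.0 is the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, rows):
    """Apply the fitted weights (intercept first) to already-scaled rows."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], row)) for row in rows]


def target(x1, x2):
    """Made-up relationship used to generate the illustrative data."""
    return 3 + 2 * x1 + 0.5 * x1 ** 2 - x2


train_raw = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 2], [7, 7], [8, 4]]
test_raw = [[2.5, 3.0], [9.0, 1.0]]
train_y = [target(*row) for row in train_raw]

# Steps 1 and 2: polynomial features, then scaling learned on training data only.
train_poly = [polynomial_features(row) for row in train_raw]
means, stds = fit_scaler(train_poly)

# Step 3: fit the model to the training data.
weights = fit(scale(train_poly, means, stds), train_y)

# Step 4: run predictions, reusing the training means and stds on the test data.
test_poly = [polynomial_features(row) for row in test_raw]
predictions = predict(weights, scale(test_poly, means, stds))
print([round(p, 3) for p in predictions])
```

Because the made-up data has no noise and the true relationship lies within the polynomial features, the predictions should match target(x1, x2) for the test rows up to floating-point rounding. The point to notice is that the scaler is fitted on the training data only and then reused unchanged on the test data.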
I will also talk about the different types of features: Categorical Features and Quantitative Features. I will touch upon Stepwise Multiple Regression and Logistic Regression.
Possibly, in a separate article / blog, I will give the above process a Statistical angle, where I am planning to talk about Multiple Regression, Logistic Regression, and data preparation and analysis topics such as:
- How to choose predictors or features and concepts like Multicollinearity, Variance Inflation Factor
- Data cleansing – Outliers, Influential points, Residual Plots
- Data transformations for Regression model – I will talk about Heteroscedasticity
I will also cover some basic concepts like Hypothesis testing, Normality, t-statistics, confidence intervals, etc.
Machine Learning modules of Spark and Flink provide support for most of the common ML Algorithms.
Apache Flink, in the current version, provides support for the following:
Apache Spark, in the current version, provides the following:
Currently, I see that Apache Flink is lagging in its support for different algorithms and capabilities in the Machine Learning space. But it's catching up fast, and I see there are plans, per the product roadmap, to revamp the ML module with all the important capabilities and algorithms.
Check the second part of this article, where I add a little more theory and then a code walk-through.
A better formatted version of the same article is also available at my WordPress site.