Data Science Trends in 2016
The data science field keeps growing at a fast pace, with improvements to existing platforms and software as well as newly developed toolkits. With so many choices on the market, I'd like to give my two cents on data science trends.
Big Data
The Hadoop ecosystem will remain the leading big data platform. Hortonworks, Cloudera and MapR all ship distributions that include Hadoop tools such as MapReduce, Pig, Hive and Spark. Among those, MapReduce is the fundamental backbone of Pig and Hive, which traditionally compile their scripts into MapReduce jobs. Hive offers a SQL-like language (HiveQL) that is friendly to SQL users, and Pig offers a higher-level data flow language (Pig Latin). Spark runs on its own execution engine and brings in the great ideas of in-memory computation and lazy execution, along with a solid machine learning library, MLlib. Spark will play a leading role in big data.
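To make the idea of lazy execution concrete, here is a minimal sketch in plain Python. It is an analogy built on generators, not Spark's API: the function name lazy_pipeline and the log list are made up for illustration. In Spark, transformations such as map and filter only describe the work, and an action such as collect triggers it; the generators below behave in a similar way.

```python
def lazy_pipeline(numbers, log):
    """Describe a filter step and a map step; nothing runs until consumed."""
    def is_even(n):
        log.append(f"filter {n}")  # record when the filter really executes
        return n % 2 == 0

    def square(n):
        log.append(f"map {n}")  # record when the map really executes
        return n * n

    # Generator expressions only build the plan; no element is processed yet.
    evens = (n for n in numbers if is_even(n))
    return (square(n) for n in evens)


log = []
pipeline = lazy_pipeline(range(1, 7), log)
steps_before_action = len(log)  # nothing has been computed at this point
result = list(pipeline)         # the "action": consuming the pipeline runs it
print(steps_before_action, result, len(log))
```

The log stays empty until list() consumes the pipeline, which is roughly why a lazy engine can look at the whole chain of steps before doing any work.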
Open Source Programming Language
R and Python compete closely as open source programming languages. I do not see R replacing Python or vice versa. Both offer the data frame concept (built into R, provided by the pandas library in Python) and machine learning packages.
With regard to big data, Python is a must if you do not code in Java. You can easily write user-defined functions (UDFs) in Python for Pig, Hive and Spark, and write MapReduce jobs entirely in Python. R is catching up in those areas.
R is strong in statistics and matrix operations, as it was originally built for those needs.
For text analytics and web scraping, I prefer Python, especially because its dictionaries make text content manipulation easy.
For image processing, Python wins the game thanks to its various image processing packages, such as OpenCV, the Python Imaging Library (PIL) and scikit-image.
For deep learning, Python generally works well with Caffe, TensorFlow and H2O. You can also dive into H2O using R.
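As a rough illustration of writing MapReduce purely in Python, the sketch below runs a word count through map, shuffle/sort and reduce phases in a single process. The names mapper, reducer and run_local are hypothetical; on a real cluster, Hadoop Streaming would typically run the mapper and reducer as separate scripts that read standard input and write tab-separated key/value lines.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts per word; input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce locally and return a dict."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))  # stands in for the shuffle
    return dict(reducer(shuffled))


counts = run_local(["big data tools", "Big data platforms"])
print(counts)
```

The sort between the two phases matters: groupby only merges adjacent keys, which mirrors how the framework delivers all values for one key to the reducer together.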
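Here is a small sketch of what dictionary-based text manipulation can look like, using only the standard library. The function names and the sample sentence are made up for illustration, and the tokenizer is deliberately simple (lowercase letters and apostrophes only).

```python
import re


def word_frequencies(text):
    """Return a dict mapping each lowercase word to its count in text."""
    counts = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        counts[word] = counts.get(word, 0) + 1  # default to 0 for unseen words
    return counts


def top_words(counts, n=3):
    """Return the n most frequent (word, count) pairs, ties in alphabetical order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


sample = "Spark is fast. Spark is lazy. Hive is SQL-like."
frequencies = word_frequencies(sample)
print(top_words(frequencies))
```

The same dictionary can then be filtered, merged with counts from other documents, or turned into features for a model.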
Machine Learning Toolkit
I am quite passionate about Dato as a machine learning platform. SFrame can handle data larger than your RAM on a single PC. The GraphLab machine learning library makes running machine learning feel like stacking LEGO blocks, much as scikit-learn does. As a commercial tool, Dato also provides built-in functions for tasks such as recommendation systems and fraud detection.
Visualization
Tableau, Spotfire and SAS VA are available for building interactive visualization dashboards. Tableau and Spotfire make better use of user-defined functions and integrate with many data platforms.
Data visualization in R and Python is quite similar, and Plotly makes interactive visualization easy in both.
Take Home Message
There is no super-tool that can solve all data science problems. As a data scientist, the key capability is to identify the right tool for a business case and adapt yourself to use it properly in a timely fashion.
Thank you, Yang. Good share and helpful for my aspiration to be a data scientist.
Wonderful insight. Thank you, Yang. I found your comments about Spark and Visualizations very helpful.
Well done - very helpful, thank you
This is very helpful for a newcomer to the big data field.
Good article, Yang Cong. Also thanks for the list of tools with their utilities. This is a good summary.