Data Science Trending in 2016

The data science field keeps growing at a fast pace, with improvements to existing platforms and software as well as newly developed toolkits. With so many choices in the market, I'd like to give my two cents on the data science trends.

Big Data

The Hadoop ecosystem will still lead the big data platforms. Hortonworks, Cloudera and MapR all offer distributions with Hadoop tools such as MapReduce, Pig, Hive and Spark. Among those, MapReduce is the fundamental execution backbone that Pig and Hive traditionally compile down to. Hive offers a SQL-like query language and Pig offers the Pig Latin scripting language, so both are friendlier than hand-written MapReduce. Spark runs on its own execution engine rather than on MapReduce, and it brings in the great ideas of in-memory computation and lazy execution, plus a solid machine learning library, MLlib. Spark will play a leading role in big data.
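
To make the lazy execution idea concrete, here is a minimal sketch in plain Python. It is only an analogy built on generators, not the Spark API; the names lazy_map, lazy_filter and evaluated are hypothetical and chosen for illustration. The point is that describing the steps costs nothing, and work only happens when a result is requested.

```python
from itertools import islice

# Hypothetical bookkeeping list: records which inputs were actually processed.
evaluated = []

def lazy_map(fn, items):
    """Describe a transformation; nothing runs until values are requested."""
    for item in items:
        evaluated.append(item)
        yield fn(item)

def lazy_filter(predicate, items):
    """Describe a filter step; also deferred until values are requested."""
    for item in items:
        if predicate(item):
            yield item

# Build the pipeline. This only wires the steps together.
numbers = range(1, 1_000_001)
squares = lazy_map(lambda x: x * x, numbers)
even_squares = lazy_filter(lambda x: x % 2 == 0, squares)

# No element has been processed yet.
work_before_action = len(evaluated)

# Asking for results triggers just enough work to produce them.
first_three = list(islice(even_squares, 3))

print(work_before_action, first_three, len(evaluated))
```

Only six of the one million inputs are touched to produce the first three even squares. Spark applies a similar principle at cluster scale, which is one reason it can avoid unnecessary computation.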

Open Source Programming Language 

R and Python are both very competitive open source programming languages. I do not see R replacing Python or vice versa. Both have the data frame concept (in Python through pandas) and machine learning packages.

When it comes to working with big data, Python is a must if you do not code in Java. You can easily write user-defined functions (UDFs) in Python for Pig, Hive and Spark, and write MapReduce jobs entirely in Python. R is catching up in those areas.
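
As a rough illustration of writing MapReduce in Python, the sketch below runs a word count locally in the map, shuffle and reduce style. It is a simplified, single-process imitation, assuming the usual pattern where a mapper emits key/value pairs and a reducer receives them grouped by key. The function names mapper, reducer and run_local are hypothetical, and a real job would run the two steps as separate scripts, for example through Hadoop Streaming.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(sorted_pairs):
    """Reduce step: sum the counts of each word.

    Expects pairs sorted by key, which is what the shuffle phase provides.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def run_local(lines):
    """Hypothetical local driver: sorting stands in for the shuffle phase."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))

sample = ["Spark and Hive", "Pig and Hive and Spark"]
counts = run_local(sample)
print(counts)
```

Keeping the mapper and reducer as independent functions mirrors how the two steps are deployed separately on a cluster, so the same logic can be tested on a laptop first.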

R is strong in statistics and matrix operations, as it was originally built for those needs.

For text analytics and web scraping, I'd prefer Python, especially when you use dictionaries for text content manipulation.
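
Here is a small sketch of what dictionary-based text manipulation can look like: an inverted index that maps each word to the documents containing it. The helper names tokenize and build_index, the sample documents and the simple regular expression are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(docs):
    """Map each word to the set of document positions where it appears."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

# Made-up sample documents.
docs = [
    "Python is great for scraping the web.",
    "R is strong in statistics.",
    "Scraping with Python needs clean text.",
]

index = build_index(docs)
python_docs = sorted(index["python"])
print(python_docs)
```

Because lookups in a dictionary take roughly constant time, questions such as "which pages mention this term?" stay fast even as the scraped collection grows.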

For image processing, Python wins the game thanks to its various image processing packages such as OpenCV, the Python Imaging Library and scikit-image.

For deep learning, Python generally works fine with Caffe, TensorFlow and H2O. You can also dive into H2O using R.

Machine Learning Toolkit

I am quite passionate about Dato as a machine learning platform. SFrame can handle data larger than your RAM on a single PC. The GraphLab machine learning library makes running machine learning feel like stacking LEGO blocks, similar to scikit-learn. As a commercial tool, Dato also provides built-in toolkits for tasks such as recommendation systems and fraud detection.

Visualization

Tableau, Spotfire and SAS VA are out there for making interactive visualization dashboards. Tableau and Spotfire make better use of user-defined functions and can connect to many data platforms.

Data visualization in R and in Python is quite similar, and Plotly makes interactive visualization easy in both.

Take Home Message

There is no super-tool that can solve all data science problems. As a data scientist, the key capability is to identify the right tool for a business case and adapt yourself to use it properly in a timely fashion.

Thank you, Yang. Good share and helpful for my aspiration to be a data scientist.

Wonderful insight. Thank you, Yang. I found your comments about Spark and Visualizations very helpful.

Well done - very helpful, thank you

This is very helpful for a newcomer to the big data field.

Good article, Yang Cong. Also thanks for the list of tools with their utilities. This is a good summary.
