Data Science Landscape
Business Applications vs. Sciences
Any data science project should be driven by business problems: data science serves an organization by answering its business questions and informing strategy in the decision-making process. Business problems can be classified as forecasting, classification or prediction, segmentation, association, and summarization. Related applications include survival analysis of any type, customer retention, scoring, rare event identification or fraud detection, customer targeting by segment, recommendation systems, process optimization, topic identification, relationship mining, sentiment analysis, etc.
Data science algorithms, or machine learning, fall into two fundamental categories: supervised and unsupervised. The most common supervised learning methods include regression (logistic regression, multiple linear regression, ridge regression, etc.), decision trees, neural networks, support vector machines, etc. Unsupervised learning includes clustering, principal component analysis, etc. These methods can be combined to solve more complicated problems. Ensemble methods are one example, such as random forest, which is built on many trees, or boosting, which combines a sequence of weak models such as simple regressions or shallow trees. Semi-supervised learning is another, combining supervised and unsupervised methods by learning from both labeled and unlabeled data. More recently, deep learning has become popular; it folds feature learning and prediction into a single training process that is optimized end to end.
The chart below maps business problems to types of learning methods, but it is not a mapping from a specific business application to a specific scientific method. The right method should be chosen according to the specific business problem and the final performance metric.
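To make the distinction between supervised and unsupervised learning concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a method taken from the chart: the monthly spend figures, the segment labels, and the function names are all hypothetical. The supervised part learns from examples that already carry a label, while the unsupervised part (a small one-dimensional k-means) has to discover the groups on its own.

```python
from statistics import mean

# Hypothetical data: monthly spend per customer, with known segment labels.
spend = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
segment = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(points, labels):
    """Supervised: learn one centroid (average value) per known label."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in groups.items()}


def predict(centroids, x):
    """Assign x to the label whose centroid is nearest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def kmeans_1d(points, k=2, iterations=10):
    """Unsupervised: find k cluster centers without using any labels."""
    # Simple deterministic start: evenly spaced picks from the sorted points.
    centers = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


centroids = fit_centroids(spend, segment)   # uses the labels
new_customer_segment = predict(centroids, 2.0)
discovered_centers = kmeans_1d(spend, k=2)  # ignores the labels
```

In this toy data both approaches end up describing the same two groups, but only the supervised model can name them, because only it was given the labels. In practice a library implementation would normally replace these hand-written functions.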
Predictive Analytics
Predictive analytics is a sub-area of data science that focuses on prediction. It usually progresses from lower-level to higher-level analysis. Consider a patient visiting a doctor's office. First, the doctor tries to understand what happened, and the patient describes the symptoms of the illness. Then the doctor investigates why it happened, and may also say what will happen next or which symptoms are likely to follow. Finally, the doctor gives the patient a prescription. This is the same sequence of steps used in predictive analytics.
In business, we start with historical data and establish what happened: for example, which transactions are fraudulent, what patterns those transactions share, and why and when they occurred (causal analysis plus time-aware analysis). The result is that we can tell the story behind those transactions. However, the analysis does not stop there. The greatest value comes from preventing fraudulent transactions in the future and deciding what actions to take to stop them. That requires developing predictive or forecasting models and embedding them into the transaction process in real time. Whenever the model assigns a transaction a fraud score above a certain threshold (defined by business criteria), an alert is generated and the transaction is stopped immediately.
The whole process starts with raw data and moves on to identify valuable information, gain deeper knowledge, draw business insight, and finally optimize strategic decisions. It goes from hindsight, which only knows what has already happened, to insight, which provides clues about what causes problems, to foresight, which knows what will happen and what to do about it. The chart below shows the big picture of predictive analytics.
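The real-time scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Transaction` fields, the hand-written `fraud_score` rule (a stand-in for a trained model), and the threshold value of 0.7 are all hypothetical and would come from the actual model and business criteria in practice.

```python
from dataclasses import dataclass

# Assumption: in a real system this value is set by business criteria.
FRAUD_THRESHOLD = 0.7


@dataclass
class Transaction:
    transaction_id: str
    amount: float
    country_mismatch: bool  # hypothetical feature: card country differs from purchase country


def fraud_score(tx):
    """Hypothetical stand-in for a trained model; returns a score in [0, 1]."""
    score = min(tx.amount / 10_000, 1.0) * 0.6
    if tx.country_mismatch:
        score += 0.4
    return score


def process(tx, threshold=FRAUD_THRESHOLD):
    """Score a transaction and block it with an alert if the score exceeds the threshold."""
    score = fraud_score(tx)
    if score > threshold:
        return {"id": tx.transaction_id, "action": "block", "alert": True, "score": score}
    return {"id": tx.transaction_id, "action": "approve", "alert": False, "score": score}


suspicious = process(Transaction("t1", 9_000.0, True))
routine = process(Transaction("t2", 120.0, False))
```

Keeping the threshold as a separate parameter reflects the point made above: the model produces the score, while the business decides how high a score must be before a transaction is stopped.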
Analytics Tools
This section summarizes a report from the KDnuggets newsletter on the top analytics tools used in 2014 and 2015 and how they compare. R was #1 in 2015, followed by RapidMiner, SQL, and Python.
Big Data Tools – Among big data tools in 2015, Hadoop was #1, followed by Spark, Hive, and SQL.
Programming Languages – Python was #1, followed by Java, C++, etc.
Analytics by Industry – The top industries and application areas using analytics in 2014 were CRM, banking, health care or HR, and fraud detection.