Machine Learning 101
Personally, I think everything in the world has a beginning, a middle and an end, whether it is a person, a pencil or something conceptual like respect. When I first meet a person, I cannot instinctively respect them. I need to learn about their past and connect with them in the present in order to develop the minimum level of respect needed to want a future with them. The connection made in the present is the most interesting part of a timeline. In the case of Machine Learning (ML), it began with Artificial Intelligence (AI), and it will develop offspring like Deep Learning (DL) that will become something else entirely in the future.
AI is a term coined by computer scientist John McCarthy in 1956. It is the science of using machines to perform human-like tasks. Most AI technology assesses an environment using a set of computer-readable instructions called an algorithm and takes actions to achieve its goal. There are two categories of AI: General and Narrow. Artificial General Intelligence (AGI) technology demonstrates an array of human-like characteristics. I usually imagine the housekeeper robot, Rosie from The Jetsons, and I think it is safe to say most people imagine robots when considering AGI. Then, there is Artificial Narrow Intelligence (ANI), where the technology performs only one human-like task really well, or even better than people. In this case, I usually imagine a Tesla car with its self-driving feature.
Right now, ANI is quite popular due to the trending interest in automating human tasks. Amazon opened stores that allow customers to walk in, take products from the shelves and walk out with them. Amazon calls this process Just Walk Out Technology, which uses camera sensors to identify customers and products. Then, to learn to track who took which product and/or who returned one, they simulated human movements. Finally, the cashier-less checkout lets customers walk right out and receive an email receipt within minutes of exiting the store.
In fact, some elements of this technology can be categorized as ML, which is a subfield of AI. The original goal of ML was for it to be a tool in developing and improving AI, but over time, ML became a much bigger entity all on its own. The term was coined and defined by computer scientist Arthur Lee Samuel in 1959 as the “field of study that gives computers the ability to learn without being explicitly programmed.” The goal of ML is to develop and train algorithms to learn from given data inputs, called a training data set. The algorithms should be able to improve themselves and, over time, make increasingly accurate decisions. In ML, the program code is generally simple; the complexity resides in the algorithm it implements.
Another computer scientist, Tom Michael Mitchell, stated in 1998: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” It is definitely a daunting statement, so let us break it down with an example. Suppose on Twitter you consistently give hearts to cat-related tweets, and based on that, Twitter learns you particularly like cats. E is the Twitter algorithm watching you give hearts to cat-related tweets and skip over non-cat-related tweets. T is the task of predicting which tweets you will give hearts to. P is the number of cat-related tweets you gave hearts to compared to the number of non-cat-related tweets you gave hearts to. Twitter learns of your cat preference and will add more cat-related tweets to your timeline.
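To make E, T and P a little more concrete, here is a toy sketch in Python. The interaction log is invented for illustration and is in no way how Twitter’s actual systems work.

```python
# A toy sketch of Mitchell's E, T and P using the Twitter example.
# The interaction log below is invented for illustration.
heart_log = [                       # experience E: observed interactions
    {"about_cats": True,  "hearted": True},
    {"about_cats": True,  "hearted": True},
    {"about_cats": False, "hearted": False},
    {"about_cats": True,  "hearted": True},
    {"about_cats": False, "hearted": True},
]

# Task T: predict which tweets you will give hearts to.
# Performance P: hearts on cat-related tweets vs. hearts on the rest.
cat_hearts = sum(t["hearted"] for t in heart_log if t["about_cats"])
other_hearts = sum(t["hearted"] for t in heart_log if not t["about_cats"])
print(f"P: {cat_hearts} cat-tweet hearts vs. {other_hearts} other hearts")
# As experience E grows, performance P (measured this way) should improve.
```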
Similar to AI, within ML there are two main types of learning methods: Supervised and Unsupervised. Earlier, I mentioned the training data set, which is really a collection of example decisions made by humans. In the Twitter example, the training data set would be all the times you gave hearts to cat-related tweets and also the times you did not. In Supervised Learning, the algorithm learns the pattern and predicts an output. Since it is given the “right answers,” the algorithm learns to generate the “right” output. Meanwhile, in Unsupervised Learning, the algorithm is given a data set without any decisions made by humans. The data set has no labels, and it is up to the algorithm to find a pattern and predict an output. In this case, no “right answer” is given.
Before we go any further, I should introduce and explain the main algorithms available in ML. In Supervised Learning, the two most popular types of algorithms are regression and classification. A third is forecasting, which takes a historical data set and predicts trends. In Unsupervised Learning, there are clustering and dimension reduction. The chart from The SAS Data Science Blog (linked in the Resources) provides wonderful guidance on which algorithm is the right one to use depending on the program’s needs and wants.
In regression, continuous output values are predicted based on input values. When a training data set is passed into a regression algorithm, a hypothesis is fitted. In a linear regression, the hypothesis is the trend line, represented as h(x) = θ0 + θ1x. Suppose a training data set contains the sizes of coffee shops in square feet and their coffee prices per cup. The training data set would look something like the following.
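Here is a minimal sketch of that training data and the fitted trend line in Python with NumPy. The coffee-shop numbers are invented for illustration, chosen so the fitted line agrees with the $4.00-at-500-sq.-ft. prediction discussed below.

```python
# A minimal linear-regression sketch. The coffee-shop numbers are
# invented for illustration.
import numpy as np

# Training set: shop size in square feet (input x), price per cup (output y).
sizes = np.array([200, 300, 400, 600, 700])
prices = np.array([2.80, 3.20, 3.60, 4.40, 4.80])

# Fit the trend line h(x) = theta0 + theta1 * x by least squares.
theta1, theta0 = np.polyfit(sizes, prices, 1)
print(f"h(x) = {theta0:.2f} + {theta1:.4f}x")

# Predict the price of a cup at a 500 sq. ft. shop.
print(f"Predicted price at 500 sq. ft.: ${theta0 + theta1 * 500:.2f}")  # ~$4.00
```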
The size of a coffee shop is the input value x, and the price of a cup of coffee is the output value the algorithm predicts; θ0 (the intercept) and θ1 (the slope) are the parameters the algorithm learns from the training data. When the training data set is plotted on a graph and a trend line is drawn, we can easily see a pattern: as the size of a coffee shop increases, its price also increases. If we want to find out how much a cup of coffee would cost at a 500 sq. ft. coffee shop, the trend line tells us it is likely to be about $4.00.
In classification, a discrete output value is predicted based on an input value. A popular classification algorithm is logistic regression, where the predicted value is always between 0 and 1. The hypothesis is represented as h(x) = P(y = 1 | x; θ), which means the probability that the output y would be 1 given the input x.
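The hypothesis itself is short enough to write out. Here is a sketch of it in Python; the tweet feature and the parameter values θ are made up for illustration, not learned from real data.

```python
# A minimal logistic-regression hypothesis sketch. The input feature
# and the parameters theta below are invented for illustration.
import numpy as np

def h(x, theta):
    """Probability that y = 1 given input x, i.e. P(y = 1 | x; theta)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

x = np.array([1.0, 3.0])          # [bias term, cat pictures in a tweet]
theta = np.array([-2.0, 1.5])     # "learned" parameters (made up here)

p = h(x, theta)
print(f"P(heart the tweet) = {p:.2f}")   # ~0.92
predicted_class = 1 if p >= 0.5 else 0   # discrete output: heart (1) or not (0)
```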
In clustering, the algorithm groups data examples by how close they sit to each other, and a new example can be assigned to the group it lands nearest. Let us take the coffee shop training data set and say it is for coffee shops in the United States, and suppose I created another coffee shop data set for coffee shops in Canada. Given a new data example, I need to predict whether it describes a coffee shop in the United States or Canada. The two data sets look like the following:
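Here is a minimal sketch of those two data sets and the prediction in Python with NumPy. The shop sizes and prices are invented for illustration, and the neighbor-counting vote shown is the k-nearest-neighbors idea that the example describes.

```python
# A nearest-neighbor sketch of the US-vs-Canada example. All shop
# data (size in sq. ft., price per cup) are invented for illustration.
import numpy as np

us_shops = np.array([[300, 3.00], [400, 3.40], [650, 4.60], [800, 5.20]])
ca_shops = np.array([[350, 4.00], [420, 4.30], [500, 4.50], [550, 4.80]])

points = np.vstack([us_shops, ca_shops])
labels = ["US"] * len(us_shops) + ["Canada"] * len(ca_shops)

new_shop = np.array([480, 4.40])            # the new, unlabeled example

# Find the 5 closest training examples and take a majority vote.
distances = np.linalg.norm(points - new_shop, axis=1)
nearest = np.argsort(distances)[:5]
votes = [labels[i] for i in nearest]
print(votes)                                 # 4 'Canada' votes, 1 'US' vote
print(max(set(votes), key=votes.count))      # -> 'Canada'
```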
The algorithm takes the examples closest to the new one to determine which group it belongs to. In this case, the new example is surrounded by four Canadian data examples compared to one United States data example. Therefore, this new example is predicted to belong to the Canadian data set.
In dimension reduction, the given data set is reduced in order to cut runtime, improve data integrity and, most importantly, speed up the learning. Data sets can contain redundant attributes, such as the same measurements recorded in inches and in centimeters, which differ only by rounding error. Instead of removing one of the attributes, we can combine them and create a single new attribute. Doing so reduces the data from two dimensions to one.
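Here is a minimal sketch of that two-dimensions-to-one reduction using principal component analysis (PCA) in Python with NumPy, one common way to combine redundant attributes; the measurements are invented for illustration.

```python
# A minimal dimension-reduction sketch (PCA via NumPy). The lengths
# are invented: the same measurements in centimeters and in inches,
# so the two columns are almost perfectly redundant.
import numpy as np

cm = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
inches = np.round(cm / 2.54, 1)            # small rounding differences
data = np.column_stack([cm, inches])       # two-dimensional data set

# Center the data and find the direction of greatest variance.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
component = vt[0]                          # first principal component

# Project onto that single direction: two dimensions become one.
reduced = centered @ component
print(reduced)                             # one combined value per example
```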
Lastly, deep learning can be either Supervised or Unsupervised Learning. It was inspired by the interconnecting neurons in the human brain. It uses Artificial Neural Network (ANN) algorithms built from layers stacked on top of each other; each layer receives the data from the layer below it, and this structure allows far more data to be absorbed. Earlier, we discussed the significant amount of ML that goes into running Amazon Go stores. To train its systems to track shoppers’ interactions with products on the shelves, Amazon developed synthetic activity data using simulators. They created virtual customers that differed in clothing, hair, height, weight, etc. These virtual customers simulated different poses and, most importantly, arm movements. These simulations were tracked and collected as training data sets for Amazon’s deep learning technology.
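To illustrate the layering, here is a toy feed-forward pass in Python with NumPy. The layer sizes and random weights are invented; a real network would be trained on data rather than just run forward, and systems like Amazon’s are vastly larger.

```python
# A toy feed-forward network sketch: each layer receives the output
# of the layer below it. Sizes and weights are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Three stacked layers: 4 inputs -> 8 hidden -> 8 hidden -> 1 output.
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),
          (rng.normal(size=(8, 8)), np.zeros(8)),
          (rng.normal(size=(8, 1)), np.zeros(1))]

x = rng.normal(size=(1, 4))        # one example with 4 input features
for w, b in layers:
    x = relu(x @ w + b)            # each layer feeds the layer above it
print(x)                           # the top layer's output
```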
There are many programming languages to use for ML. According to computer scientist Andrew Ng, it is best to develop a prototype in Octave and then migrate to Java or C++. He has seen the ease of ML implementation for both beginners and experts alike when programming in Octave first; it is especially useful for getting a large-scale algorithm built and working quickly. Other popular choices for ML are Python (often with the NumPy library), MATLAB and R. Additionally, TensorFlow is a wonderful free, open-source library for ML. It is especially useful for deep learning because it allows users to create neural networks.
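As an illustration, here is a minimal TensorFlow sketch of a small neural network. The layer sizes and the random training data are placeholders for illustration, not a recipe for a real model.

```python
# A minimal TensorFlow neural-network sketch. Layer sizes and the
# random training data are placeholders, invented for illustration.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                        # 4 input features
    tf.keras.layers.Dense(8, activation="relu"),       # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),    # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train briefly on toy data: 100 examples, 4 features, binary labels.
x = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=(100, 1))
model.fit(x, y, epochs=5, verbose=0)
print(model.predict(x[:3]))        # predicted probabilities for 3 examples
```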
The advancements in ML have proved that it is no longer a pet project for scientists. There is public interest in and need for it. It has helped with credit card fraud detection, identified previously unrecognized artists of fine art paintings, provided criminal risk assessments and has even been used to try to predict the next financial crisis. In order for ML to continue achieving advancements, we need to ensure there is suitable and accessible data with limited biases.
Resources
1. https://en.wikipedia.org/wiki/Machine_learning
2. https://en.wikipedia.org/wiki/Artificial_intelligence
3. https://en.wikipedia.org/wiki/Deep_learning
4. https://www.coursera.org/learn/machine-learning/
5. https://www.digitalocean.com/community/tutorials/an-introduction-to-machine-learning
6. https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
7. https://www.youtube.com/watch?v=z-EtmaFJieY
8. https://towardsdatascience.com/how-the-amazon-go-store-works-a-deep-dive-3fde9d9939e9
Author
Jennifer Tang