Solving a classification problem using machine learning
Your company has launched a new internet campaign and you’d like to know whether it has improved the perception of your brand or had the opposite effect. This will help you decide whether to continue and build on your current marketing strategy or go back and start from scratch. In this post, I will discuss how machine learning helps you answer this question and what you’ll need to implement it within your business.
To measure the effectiveness of the campaign, you have decided to gather all the social media mentions of your brand and count the number of positive versus negative posts. You received thousands of responses and you are now faced with a problem. How are you going to read through all these texts and determine which ones are good and which ones are bad?
To solve this issue, you engage a data scientist to build an application that will read through text and mark it as either positive or negative. The data scientist goes away, comes back and delivers an application that does just that. So, you give him the database of all the social media posts that you gathered, and the application returns the same database plus an additional field marking each record as either positive or negative automagically, saving you hundreds of hours of reading time. Now, you can simply count the positive and negative records and get on with your business.
So what did the data scientist do and how did he do it? He basically built an artificial intelligence application that is able to do sentiment analysis. This is one of the first problems they introduce to you in machine learning courses. Traditionally, sentiment analysis is done by identifying “positive” words (e.g. brilliant, awesome, cool, wonderful) and looking for those words in the text to determine if it’s positive. Conversely, you can identify negative words (e.g. suck, cliched, awful, bad) and do the same. Studies have shown that this method only reaches about 60% accuracy, which is just marginally better than tossing a coin. Using machine learning, data scientists are able to improve the accuracy of sentiment analysis applications to as much as ~95%.
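Here is a minimal sketch of that traditional word-list approach, just to make it concrete. The two word sets are the examples from the paragraph above; the function name `lexicon_sentiment` and the tie-breaking label "unknown" are my own assumptions, not part of any particular library.

```python
import re

# Word lists taken from the examples in the text; a real lexicon would be far larger.
POSITIVE_WORDS = {"brilliant", "awesome", "cool", "wonderful"}
NEGATIVE_WORDS = {"suck", "cliched", "awful", "bad"}


def lexicon_sentiment(text):
    """Label a text by counting how many listed words it contains (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "unknown"  # no listed words found, or a tie


print(lexicon_sentiment("The new ad is awesome and the music is cool"))  # positive
print(lexicon_sentiment("Awful campaign, really bad idea"))              # negative
print(lexicon_sentiment("Not exactly wonderful"))                        # positive
```

The last call shows one weakness of the approach: the word "wonderful" is found, so the text is marked positive even though the writer meant the opposite.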
To use machine learning, the first thing you’ll need is a large amount of training data. In our example, your company did not provide any, so the data scientist had to look elsewhere for data to train his application. Can you think of a place with thousands of examples of text that is already labelled as either good or bad?
One example is Amazon customer reviews. All the feedback on Amazon is tagged with one to five stars. You can take all the one-star reviews and label them as “negative”, and take all the five-star reviews and label them as “positive”. Getting the Amazon data is a totally separate topic, so I’ll skip that for this post. On a side note, this is also one of the reasons why you get so many challenges asking for photos or videos on Facebook. They are a way to gather labelled data (e.g. photos of you 10 years apart, videos of you doing something specific, tagged photos).
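The labelling step described above can be sketched in a few lines. The sample reviews below are made up for illustration, and `label_reviews` is a hypothetical helper; how you actually obtain the reviews is outside the scope of this post.

```python
# Made-up (star_rating, review_text) pairs standing in for real review data.
reviews = [
    (5, "Love it, works great"),
    (1, "Worst purchase ever"),
    (3, "It is okay"),
    (5, "Superb quality"),
    (1, "Bad and stupid design"),
]


def label_reviews(rows):
    """Keep only one-star and five-star reviews and attach a 0/1 label."""
    labelled = []
    for stars, text in rows:
        if stars == 5:
            labelled.append((text, 1))  # five stars -> positive
        elif stars == 1:
            labelled.append((text, 0))  # one star -> negative
        # Reviews with two to four stars are skipped: their sentiment is less clear-cut.
    return labelled


training_data = label_reviews(reviews)
for text, label in training_data:
    print(label, text)
```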
The next thing you do is get the top X unique words used across all of the reviews. Each word becomes a parameter or field. Each text in the Amazon data becomes one record, and a value of 1 is assigned to the corresponding word field/parameter if that word appears in the text and 0 if it does not. The label is another parameter in the record: it is set to 1 (positive) if the Amazon review has five stars and to 0 (negative) if it is a one-star review. This is called the bag-of-words approach, where you basically convert a group of words into a feature vector (a series of 1s and 0s corresponding to the words that exist in the text) with a known label (1 or 0 depending on whether it’s a five-star or one-star review).
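To make the bag-of-words idea more tangible, here is a small sketch using only the Python standard library. The helper names (`tokenize`, `build_vocabulary`, `to_feature_vector`) and the three sample texts are my own; I also assume that "top X words" means the words appearing in the most reviews, with ties broken alphabetically.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts, top_x):
    """Return the top_x words that appear in the most texts (ties broken alphabetically)."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))  # count each word once per text
    return sorted(counts, key=lambda word: (-counts[word], word))[:top_x]


def to_feature_vector(text, vocabulary):
    """1 if the vocabulary word appears in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


texts = ["great phone love it", "bad phone", "love the great screen"]
vocabulary = build_vocabulary(texts, top_x=4)
print(vocabulary)                                   # ['great', 'love', 'phone', 'bad']
print(to_feature_vector("bad phone", vocabulary))   # [0, 0, 1, 1]
```

Each review ends up as a row of 1s and 0s of the same length, which is what lets a learning algorithm treat text as numbers.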
With the feature vector and the known label, this now becomes a mathematical problem that the computer can process. So, the data scientist trains a machine learning model that assigns a weight to each word, reflecting how strongly that word relates to a positive or negative label. The end result is similar to the traditional approach, in that you have words that correlate with either a positive (e.g. still, superb, love, great) or negative (e.g. bad, stupid, worst, “!”, “?”) review. However, in this case, you’ll notice that the words you get are mostly different, and some of them are not intuitive from a human perspective, such as “still”, the exclamation point and the question mark. Machine learning helps you find relations between input parameters and target variables that are not obvious to us, because there are so many variables that can affect an outcome and humans may not be able to process or recognise all of them.
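The post does not say which algorithm was used, so as an illustration only, here is a sketch that learns one weight per word with logistic regression trained by plain gradient descent. The tiny vocabulary, the six training rows and the function names are all assumptions made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(vector, weights, bias):
    """Estimated probability that the text is positive."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, vector)))


def train_logistic_regression(vectors, labels, learning_rate=0.5, epochs=200):
    """Learn one weight per word (plus a bias) by batch gradient descent."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            error = predict(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(vectors)
    return weights, bias


# Made-up bag-of-words rows: 1 means the word appears in the review.
vocabulary = ["great", "love", "bad", "worst"]
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
           [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = five-star review, 0 = one-star review

weights, bias = train_logistic_regression(vectors, labels)
for word, weight in zip(vocabulary, weights):
    print(f"{word}: {weight:+.2f}")
```

Words seen mostly in five-star reviews end up with positive weights and words seen mostly in one-star reviews end up with negative ones. With real data and thousands of words, this is how unexpected signals such as punctuation can surface.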
To recap, you, as the subject matter expert of your company, provided the data that you wanted processed to get the target label (positive or negative). The data scientist got his training data from Amazon reviews, although it is also possible for you and your company to provide the training data instead. Then, the data scientist used machine learning algorithms to build you an application that reads your dataset and applies a positive or negative label. He used the Amazon reviews to train the model and improve its accuracy and performance before deploying the solution to your business. You used the application to read through your database, which now has an additional column saying whether each record is positive or negative. You then summarised this added information to make your business decision on whether to continue and build on your current marketing strategy or not.
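The final step of that recap, adding a sentiment column and summarising it, might look something like the sketch below. The posts are made up, and `placeholder_classifier` is only a stand-in for the trained model so that the example runs on its own.

```python
from collections import Counter


def add_sentiment_column(records, classify):
    """Return copies of the records with an extra 'sentiment' field."""
    return [{**record, "sentiment": classify(record["text"])} for record in records]


def placeholder_classifier(text):
    # Stand-in for the trained model; a real application would call the model here.
    return "negative" if "bad" in text.lower() else "positive"


# Made-up social media posts.
posts = [
    {"id": 1, "text": "Loving the new campaign"},
    {"id": 2, "text": "This ad is bad"},
    {"id": 3, "text": "Great brand, great message"},
]

labelled_posts = add_sentiment_column(posts, placeholder_classifier)
summary = Counter(post["sentiment"] for post in labelled_posts)
print(summary["positive"], "positive versus", summary["negative"], "negative")
```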
In terms of roles and responsibilities, you own and provide the input data to be processed. You define the target outcome. The data scientist will help you build the model that relates the input to the target outcome. It’s a collaboration between you and the data scientist to define and gather the training dataset with the known target labels. Once the model has been applied to your input dataset, it is back to you to extract insights from the results and ultimately add business value to your company.
Thank you for reading this post. As usual, please like and share if you enjoyed it. I’ll discuss an example using regression in an enterprise setting in my next post.
Great article Rodney Ponferrada. I particularly like how you clarified the role of the client and the data scientist in your recap.
Excellent post. Very interesting. Thanks for sharing your insights on Machine Learning.