Sentiment Analysis of COVID19 Data

D V Sai Srikar

Executive Summary

The purpose of this project is to compare TextBlob and AWS Comprehend. TextBlob is an NLP library for Python, while AWS Comprehend is a service offered by AWS that can be accessed from Python via the Boto3 library. The specific comparison made here is between the outputs of the sentiment analysis each performs.

The setup for the project has multiple parts: getting the data from Twitter, cleaning it, and then performing sentiment analysis on it using TextBlob and AWS Comprehend.

TextBlob gives its output as a value between -1 and 1, which can then be bucketed into however many sentiment ranges are required. For convenience it is usually sorted into 'negative', 'positive' and 'neutral', for values less than 0, greater than 0 and equal to 0 respectively.

The merits of TextBlob are that it is a Python library that requires no authentication, it is very easy to set up, and the documentation is very helpful. The demerits are the low level of information in the output; the sentiment and subjectivity values are sometimes not enough to properly understand why a statement was given a particular score.

AWS Comprehend has a much more sophisticated output: it returns a JSON response from which we can take the relevant information. This response contains the detected sentiment along with confidence scores for each sentiment class, so we do not have to define the ranges ourselves. Comprehend gives four outputs by default: 'negative', 'positive', 'neutral' and 'mixed'. This adds another layer wherein we can identify statements that have both positive and negative connotations.

The merits of Comprehend are the high level of information returned, which is very useful for understanding why a statement was classified the way it was. The demerit is that a valid AWS account is needed to use the service, which requires additional setup.

Finally, I would like to thank Col. Krishna I, Principal Security Architect, AWS, and Jagadamba Krovvidi, Associate Vice President, Infosys, for helping and guiding me throughout this project.

Table of Contents:

1. Introduction:

2. How to execute the application on your local machine:

3. Overview of the modules present

4. Brief overview of the Sentiment analysis done:

5. Project Pipeline:

  1. vpc.tf
  2. Tweepy
  3. Databricks
  4. TextBlob
  5. AWS Comprehend

6. Summary

7. References




Introduction:

Over the last year we have adjusted to living with the pandemic. Masks, sanitization, distancing and so on have become the norm. This adjustment has made a lot of people realize the importance of having a social presence, as they now stay at home all day. Indian Twitter usage grew by 74% over a period of three months last year.

So what does this actually mean?

It means that, now more than ever, there is a lot of data available to understand the general trend of emotions and to try to identify which problems are affecting people the most.

This article showcases sentiment analysis done using Databricks and AWS Comprehend. Data from Twitter consists mostly of opinions, and as such the result is a bell curve: neutral is the largest category, with the curve tapering off towards positive and negative.

To ensure that this can be easily replicated and verified by others, the entire infrastructure, as well as the application, is set up using Terraform. The application takes tweets matching a given search term, "#covid19", and returns the data in JSON format. This JSON is then converted into a data frame with separate columns using a Databricks notebook. Sentiment analysis is performed on the tweet text using two providers, TextBlob and AWS Comprehend, and the results from both are compared and visualized using Matplotlib.

Once completed, the flow is: tweets are collected on an EC2 instance with Tweepy, streamed into a Kinesis data stream, processed in a Databricks notebook, scored with TextBlob and AWS Comprehend, and visualized with Matplotlib.

How to execute the application on your local machine:

To set up and run this project on a local machine, do the following:

  1. Code for this project is found at https://github.com/SrikarDVS/Covid_twitter_sentiment_analysis.git
  2. Instructions for the prerequisite installation and execution are given in the readme file.


Overview of the modules present

  1. Terraform: AWS cloud infrastructure is deployed using Terraform (Infrastructure as Code). An EC2 instance is deployed on the public subnet of a VPC, and the data collected on this EC2 instance goes into a data stream, also deployed by Terraform.
  2. Tweepy: Tweepy is a Python library used to interact with the Twitter API. To collect data on the EC2 instance, a Python script fetches data from Twitter using Tweepy, in the form of JSON.
  3. AWS services used: Multiple AWS services are used in this project, the major ones being AWS EC2, VPC, Kinesis Data Streams and Comprehend.
  4. VPC: The VPC is a Virtual Private Cloud hosted in AWS; it allows you to allocate subnets and deploy instances on those subnets.
  5. EC2 instance: A t2.micro instance is used, since not much processing is done on the instance itself; it is only used to ingest the data from Twitter into a Kinesis data stream.
  6. Kinesis Data Stream: This is the data stream used to store all the data produced by the Python script running on the EC2 instance.
  7. Databricks (PySpark and pandas): In a Databricks notebook, data from the Kinesis stream is read into a PySpark DataFrame; this can also be converted into a pandas DataFrame for further processing.
  8. Sentiment analysis: On the cleaned and processed data, sentiment analysis is done using two providers, TextBlob and AWS Comprehend. The results are put back into the DataFrame.
  9. Matplotlib: This is used to showcase the difference between the two sentiment analysis providers and also to show time-series data.


Brief overview of the Sentiment analysis done:

Sentiment analysis is the contextual mining of text to identify and extract subjective information from source material. It lets you understand the social sentiment around a brand, product or service while monitoring online conversations.

Sentiment analysis is important in identifying trends in data qualitatively. Questions like "are people happy about this change?" or "What is the social perception of product X?" can be answered through sentiment analysis. Twitter especially is very useful for gathering data for sentiment analysis.

In this project AWS Comprehend and TextBlob are used for sentiment analysis. TextBlob is a Python library and as such can be installed and called locally, while AWS Comprehend is called using the Boto3 library.

The TextBlob API takes an entire string as an input and gives output in the form of polarity and subjectivity. Polarity refers to the sentiment of the statement and is a float between -1 and 1. Subjectivity is also a float and ranges between 0 and 1, 0 being objective and 1 being subjective.

Since the output is a continuous range, some testing needs to be done to decide which ranges count as positive, negative and neutral. Ideally subjectivity is also taken into account, as it helps define the context of a given statement and prevents false positives.
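
As a quick illustration (the example sentence is made up), both values are available through TextBlob's .sentiment property:

from textblob import TextBlob

# Illustrative example; the sentence is made up.
blob = TextBlob("The new vaccination drive is going surprisingly well")
print(blob.sentiment.polarity)      # float in [-1, 1]
print(blob.sentiment.subjectivity)  # float in [0, 1], 0 = objective, 1 = subjective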

AWS Comprehend adds a layer of sophistication by also recognizing the language of the tweet and by having a 'Mixed' output category.

AWS Comprehend is built on AWS's own models and generally produces good estimates.


Project Pipeline:

Terraform:

Terraform is an open-source infrastructure-as-code tool that provides a consistent CLI workflow to manage hundreds of cloud services, codifying cloud APIs into declarative configuration files. In this application Terraform is responsible for the deployment of all AWS services; it lets us launch all the required resources at once, rather than navigating to each dashboard in the console and launching them individually. The resources deployed using the Terraform script are:

  1. VPC
  2. Security Group for this VPC
  3. Public/Private Subnets, their Route Tables and associations
  4. EC2 t2.micro instance
  5. Kinesis Data stream

Along with being responsible for launching all these services, Terraform has a very useful feature called provisioners, which lets you execute code on a remote machine. This is what is used here to automate the Tweepy Python script.


Overview of Terraform Script for Provisioning of Resources

Terraform operates through configuration files that declare the resources to be created; applying these configuration files brings the resources up. On cloning the GitHub repo linked above, the following files are found:

  1. vpc.tf
  2. terraform.tfvars
  3. variables.tf
  4. Twitter_analytics.py
  5. Tweets.ipynb



The main script file for the deployment of resources is vpc.tf; it declares all the resources to be used as well as all the provisioners.

The terraform.tfvars file stores all the global variables used in the vpc.tf script.

The variables.tf file gives the descriptions and default values of all the global variables declared in the tfvars file.

Twitter_analytics.py is the Python script which gets the tweets and stores them in the Kinesis data stream.

Tweets.ipynb is the IPython notebook containing the code required to process and perform sentiment analysis on the retrieved tweets.

How vpc.tf is used to deploy:

As mentioned earlier there are three main Terraform files: one for the variables, one for their descriptions, and the deployment script itself. The file used to deploy resources is called vpc.tf and it is responsible for calling the providers to start the required functionality and services.

In this project the vpc.tf file is mostly used for calling and deploying AWS resources, the major ones being a VPC, an EC2 instance and a Kinesis DataStream.

A security group is created alongside the VPC; the default security group can be used for this or a custom one can be made. Once the VPC has been created, the subnets, route tables, route associations and so on are also made. To the VPC's public subnet we add the EC2 instance along with an internet gateway (IGW), and to this EC2 instance we attach a key pair.

This key pair is what is used to SSH into the EC2 instance to copy the Python script and run it.


In vpc.tf an aws_instance block creates the EC2 instance; the AMI it references is the ID of the Amazon Machine Image (the OS image) that the instance boots from.

Once the EC2 instance has been created, the Python script must start running on it automatically. To do this, 'provisioners' are used; provisioners are a feature provided by Terraform.

Provisioners can be used to model specific actions on the local machine or on a remote machine in order to prepare servers or other infrastructure objects for service.


Provisioners are used here to copy the script from the local machine onto the EC2 instance and then run it. The two provisioners used are "file", which copies the local script onto the remote instance, and "remote-exec", which runs commands on the EC2 instance. Both provisioners take a connection block, where the location of the SSH key has to be given. remote-exec takes a special argument called "inline", which lets you list commands to be executed, in the order given, on the CLI of the remote machine. This is used to copy and run the script automatically, and it is very useful because it gets the data stream running with just one command.

The data stream is the final resource initiated with Terraform. The Python script running on the EC2 instance streams its output to the data stream created, and this stream can then be accessed from anywhere to read the Twitter data.




To run the tf file once all the resources have been written, two commands are used: terraform init and terraform apply. The former initializes the working directory and downloads the required providers; the latter deploys and starts all the resources and records them in a tfstate file, which helps track which resources are active.

Tweepy:

The Python script used in this project was taken from https://github.com/joking-clock/twitter-capture-python/blob/master/twitter_kinesis_data.py.


Tweepy is the Python library responsible for fetching tweets with the hashtag we want.


A listener class is used to watch the stream for tweets containing specific hashtags.


A helper function takes the listener output and puts it into the Kinesis data stream; a simplified sketch of the overall script is shown below.
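
The following is a simplified sketch of what such a script can look like; it follows the same pattern as the linked repo but is not the exact code. The stream name, region, record layout and credential variables are placeholder assumptions, and it uses the older (pre-4.0) tweepy StreamListener API.

import json
import boto3
import tweepy

# Placeholders: fill these in with your own Twitter and AWS details.
consumer_key, consumer_secret = "...", "..."
access_token, access_token_secret = "...", "..."
STREAM_NAME = 'covid-tweets'  # hypothetical Kinesis stream name
kinesis = boto3.client('kinesis', region_name='us-east-1')

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Push each tweet into Kinesis as a one-element JSON list, matching
        # the [{'id': ..., 'ts': ..., 'tweet': ...}] shape parsed later in the notebook.
        record = [{'id': status.id_str,
                   'ts': str(status.created_at),
                   'tweet': status.text}]
        kinesis.put_record(StreamName=STREAM_NAME,
                           Data=json.dumps(record),
                           PartitionKey=status.id_str)

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=['#covid19'])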

A Twitter developer account is required to obtain the consumer and access keys needed to run this script. Details on how to get access can be found on the GitHub page for this project: https://github.com/SrikarDVS/Covid_twitter_sentiment_analysis.

Databricks:

Databricks is an online notebook provider; its notebooks support multi-language scripts, and the community edition gives free access to one cluster for running tasks. The community edition is the version used in this project. Using Databricks, the Twitter stream data from the Kinesis stream is converted into a Spark DataFrame.

df = kinesisDF \
  .writeStream \
  .format("memory") \
  .outputMode("append") \
  .queryName("tweets") \
  .start()

Spark has a module specific to fetching data from Kinesis streams; further details can be found at https://spark.apache.org/docs/latest/streaming-kinesis-integration.html.

The stream data needs a schema before it can be put into a data frame; once the schema is present, the data from the Kinesis stream can be pushed into the data frame, as sketched below.
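
On Databricks, the kinesisDF used above can be created with the structured streaming Kinesis source; it returns records with a fixed schema whose 'data' column holds the raw payload. A minimal sketch, assuming the Databricks Kinesis connector and placeholder values for the stream name, region and credentials:

# Hedged sketch: read the Kinesis stream into a streaming DataFrame.
kinesisDF = (spark.readStream
             .format("kinesis")
             .option("streamName", "covid-tweets")   # placeholder stream name
             .option("region", "us-east-1")          # placeholder region
             .option("initialPosition", "trim_horizon")
             .option("awsAccessKey", awsAccessKeyId)
             .option("awsSecretKey", awsSecretKey)
             .load())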

Once the Tweepy script has run for a fair bit of time, the data can be taken from the stream using a spark.sql command.

tweets = spark.sql("select cast(data as string) from tweets")
        

The next step after getting the data into a Spark DataFrame is to sort and clean it. Spark has a useful feature called user-defined functions (UDFs) that lets you create a function and apply it over a given column easily. The function used here splits the single 'data' column into three columns: the tweet ID, the timestamp and the tweet string.

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import json

def parse_tweet(text):
    # Each record is a JSON list with one object holding id, timestamp and tweet text.
    data = json.loads(text)
    id = data[0]['id']
    ts = data[0]['ts']
    tweet = data[0]['tweet']
    return (id, ts, tweet)

# Define one UDF per output column
getID = udf(lambda x: parse_tweet(x)[0], StringType())
getTs = udf(lambda x: parse_tweet(x)[1], StringType())
getTweet = udf(lambda x: parse_tweet(x)[2], StringType())

# Apply the UDFs using withColumn, then drop the raw 'data' column
tweets = (tweets.withColumn('id', getID(F.col("data")))
                .withColumn('ts', getTs(F.col("data")))
                .withColumn('tweet', getTweet(F.col("data"))))
tweets = tweets.drop('data')

# Strip unicode escape sequences (emojis) and the leading "b'RT @user:" prefix
tweets = tweets.withColumn('tweet', F.regexp_replace('tweet', '\\\\[0-9a-zA-Z]{1,5}', ''))
tweets = tweets.withColumn('tweet', F.regexp_replace('tweet', "b'RT\\s@[a-zA-Z_]+:", ''))
tweets.display()

Along with parsing the tweet into columns, the tweet string itself has to be cleaned. Most tweets contain emojis that turn into Unicode escape sequences when imported; these tend to confuse the sentiment analysis and can easily be removed with a regex, as done above.

TextBlob:

TextBlob is a Python library for NLP and can be used to perform sentiment analysis on given strings.

from textblob import TextBlob

def get_sentiment(text):
    tweet = TextBlob(text)
    if tweet.sentiment.polarity < 0:
        sentiment = "NEGATIVE"
    elif tweet.sentiment.polarity == 0:
        sentiment = "NEUTRAL"
    else:
        sentiment = "POSITIVE"
    return sentiment

getSentiment = udf(lambda x: get_sentiment(x), StringType())

The polarity score ranges from -1 to 1, so this range has to be mapped onto the output format required, in this case 'negative', 'positive' and 'neutral', since the comparison is against AWS Comprehend, which gives its output in a similar form.
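
The UDF can then be applied to the cleaned tweet column; the column name 'tb_sentiment' below is only an illustrative choice:

from pyspark.sql import functions as F

# Add a TextBlob sentiment column to the tweets DataFrame.
tweets = tweets.withColumn('tb_sentiment', getSentiment(F.col('tweet')))
tweets.display()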

TextBlob offers two analyzers for sentiment analysis, PatternAnalyzer and NaiveBayesAnalyzer. The former is based on the Pattern library and the latter comes from NLTK (the Natural Language Toolkit) and was trained on movie reviews.

The default is PatternAnalyzer and it is the one used in this project, but switching to NaiveBayesAnalyzer is very simple and can be done by passing a different analyzer, as the TextBlob docs describe.
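
For reference, switching analyzers looks roughly like this; note that NaiveBayesAnalyzer needs the NLTK corpora downloaded first and returns a classification with class probabilities rather than a polarity score:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Requires the TextBlob/NLTK corpora (python -m textblob.download_corpora).
blob = TextBlob("The lockdown has been really hard on small businesses",
                analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)  # Sentiment(classification='pos' or 'neg', p_pos=..., p_neg=...)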

AWS Comprehend:

To call AWS services from Python scripts the Boto3 library is used; authorization is done by passing your AWS keys as parameters.

import boto3
comprehend = boto3.client('comprehend', region_name = regionname,aws_access_key_id = awsAccessKeyId,aws_secret_access_key = awsSecretKey)
        

AWS Comprehend has an extra layer of sophistication over TextBlob in that it can also detect the language of the tweet, which lets you classify tweets by language as well as by sentiment.

Comprehend gives output in the form of 'positive', 'negative', 'neutral' and 'mixed'. The 'mixed' category is another difference between Comprehend and TextBlob.
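
For reference, the JSON returned by detect_sentiment has roughly the following shape (the values here are illustrative, not real output):

# Illustrative detect_sentiment response (values made up):
{
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.12,
        "Negative": 0.08,
        "Neutral": 0.78,
        "Mixed": 0.02
    }
}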

def aws_sentiment(x):
    # Detect the dominant language, then use its code for the sentiment call.
    # (detect_sentiment only supports a limited set of language codes.)
    lang = comprehend.detect_dominant_language(Text=x)
    lang_code = lang['Languages'][0]['LanguageCode']
    out = comprehend.detect_sentiment(Text=x, LanguageCode=lang_code)
    return out.get('Sentiment')
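
Since every Comprehend call is a network request and boto3 clients tend not to serialize cleanly into Spark workers, one simple option on a small dataset is to convert the tweets to pandas and apply the function row by row; the 'aws_sentiment' column name is illustrative:

# Convert the (small) tweets DataFrame to pandas and score each tweet with Comprehend.
tweets_pd = tweets.toPandas()
tweets_pd['aws_sentiment'] = tweets_pd['tweet'].apply(aws_sentiment)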

Outputs:

The best way to showcase data is visually, so the comparison between TextBlob and Comprehend is visualized using Matplotlib.
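
One way to produce such a comparison, using the illustrative 'tb_sentiment' and 'aws_sentiment' columns and the tweets_pd pandas frame from the earlier sketches, is a pair of bar charts of the class counts:

import matplotlib.pyplot as plt

# Count how many tweets fall into each sentiment class for both providers.
tb_counts = tweets_pd['tb_sentiment'].value_counts()
aws_counts = tweets_pd['aws_sentiment'].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
tb_counts.plot(kind='bar', ax=axes[0], title='TextBlob')
aws_counts.plot(kind='bar', ax=axes[1], title='AWS Comprehend')
plt.tight_layout()
plt.show()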


The Comprehend output is as expected: a large number of neutral statements, followed by negative, with mixed and positive being the least common.



TextBlob does not give a mixed output, but the trend is similar: largely neutral, although in the case of TextBlob there appear to be more positive statements than negative.

This may be due to the thresholds chosen, since the ranges for TextBlob are user defined, or due to how it classifies a statement as subjective or objective.



The time-series graph, binned by minute, shows that the tweet frequency is very high and that there is a considerable amount of discussion around the topic.
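
A sketch of how the per-minute counts can be produced, assuming the 'ts' column parses as a timestamp:

import pandas as pd
import matplotlib.pyplot as plt

# Count tweets per minute; assumes the 'ts' column can be parsed as a timestamp.
tweets_pd['ts'] = pd.to_datetime(tweets_pd['ts'])
per_minute = tweets_pd.set_index('ts').resample('1min').size()

per_minute.plot(title='Tweets per minute')
plt.ylabel('Number of tweets')
plt.show()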


Summary:

As seen from the graphs, TextBlob classifies into positive, neutral and negative, whereas AWS Comprehend also has a 'mixed' tag. This makes the output from AWS a bit more comprehensive, and it results in better classification and more data points if this data is later used to train a model.

The code for this entire project can be found at:

https://github.com/SrikarDVS/Covid_twitter_sentiment_analysis.git

References:

  1. https://towardsdatascience.com/playing-with-data-gracefully-1-a-near-real-time-data-pipeline-using-spark-structured-streaming-409dc1b4aa3a
  2. https://github.com/joking-clock/twitter-capture-python
