Another Twitter sentiment analysis with Python - Part 5 (Tfidf vectorizer, model comparison, lexical approach)
This is the 5th part of my ongoing Twitter sentiment analysis project. You can find the previous posts via the links below.
- Part 1: Data cleaning
- Part 2: EDA, Data visualisation
- Part 3: Zipf’s Law, Data visualisation
- Part 4: Feature extraction (count vectorizer), N-gram, confusion matrix
In the last part, I used the count vectorizer to extract features and convert textual data into numeric form. In this part, I will use another feature extraction technique called the Tfidf vectorizer.
Tfidf Vectorizer
TFIDF is another way to convert textual data to numeric form, and is short for Term Frequency-Inverse Document Frequency. The vector value it yields is the product of these two terms: TF and IDF.
Let's first look at Term Frequency. We have already looked at term frequency with the count vectorizer, but this time we need one more step to calculate the relative frequency. Let's say we have two documents in our corpus, as below.
- I love dogs
- I hate dogs and knitting
Relative term frequency is calculated for each term within each document as below.
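In standard notation, where $f_{t,d}$ is the raw count of term $t$ in document $d$:

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$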
For example, if we calculate relative term frequency for ‘I’ in both document 1 and document 2, it will be as below.
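Document 1 has three terms and document 2 has five, and 'I' appears once in each, so:

$$\mathrm{tf}(\text{'I'}, d_1) = \frac{1}{3}, \qquad \mathrm{tf}(\text{'I'}, d_2) = \frac{1}{5}$$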
Next, we need to get the Inverse Document Frequency, which measures how important a word is for differentiating documents. It is calculated as below.
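With $N$ the total number of documents in the corpus and $n_t$ the number of documents containing the term $t$:

$$\mathrm{idf}(t) = \log\frac{N}{n_t}$$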
If we calculate the inverse document frequency of 'I', we get the following.
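'I' appears in both of our two documents, so:

$$\mathrm{idf}(\text{'I'}) = \log\frac{2}{2} = 0$$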
Once we have the values for TF and IDF, we can calculate TFIDF as below.
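$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$$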
Following our example, the TFIDF for the term 'I' in both documents will be as below.
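$$\mathrm{tfidf}(\text{'I'}, d_1) = \frac{1}{3} \times 0 = 0, \qquad \mathrm{tfidf}(\text{'I'}, d_2) = \frac{1}{5} \times 0 = 0$$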
As you can see, the term 'I' appeared equally in both documents, and its TFIDF score is 0, which means the term is not really informative for differentiating documents. The rest is the same as with the count vectorizer: the TFIDF vectorizer calculates these scores for the terms in each document and converts the textual data into numeric form. (Note that scikit-learn's TfidfVectorizer actually uses a smoothed variant of IDF and normalises the resulting vectors, so its scores differ slightly from the textbook calculation above, but the intuition is the same.)
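As a quick sanity check, here is a minimal sketch that runs our two-document example through scikit-learn's TfidfVectorizer (the variable names are just for illustration; the token_pattern override is needed because the default pattern drops single-character tokens like 'I'):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love dogs", "I hate dogs and knitting"]
# keep single-character tokens such as 'I' (the default pattern requires 2+ characters)
demo_tvec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
demo_matrix = demo_tvec.fit_transform(docs)
print(demo_tvec.get_feature_names())  # the learned vocabulary, alphabetically ordered
print(demo_matrix.toarray())          # one row of TFIDF scores per document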
Now I will instantiate the Tfidf vectorizer, fit the Tfidf-transformed data to logistic regression, and check the validation accuracy for a range of numbers of features. Since I also have the count vectorizer results from the previous post, I will plot them together on the same graph to compare.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from time import time
import numpy as np

def accuracy_summary(pipeline, x_train, y_train, x_test, y_test):
    # null accuracy: what we would get by always predicting the majority class
    if len(x_test[y_test == 0]) / (len(x_test) * 1.) > 0.5:
        null_accuracy = len(x_test[y_test == 0]) / (len(x_test) * 1.)
    else:
        null_accuracy = 1. - (len(x_test[y_test == 0]) / (len(x_test) * 1.))
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print("null accuracy: {0:.2f}%".format(null_accuracy * 100))
    print("accuracy score: {0:.2f}%".format(accuracy * 100))
    if accuracy > null_accuracy:
        print("model is {0:.2f}% more accurate than null accuracy".format((accuracy - null_accuracy) * 100))
    elif accuracy == null_accuracy:
        print("model has the same accuracy as the null accuracy")
    else:
        print("model is {0:.2f}% less accurate than null accuracy".format((null_accuracy - accuracy) * 100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    print("-" * 80)
    return accuracy, train_test_time

cvec = CountVectorizer()
lr = LogisticRegression()
n_features = np.arange(10000, 100001, 10000)

def nfeature_accuracy_checker(vectorizer=cvec, n_features=n_features, stop_words=None,
                              ngram_range=(1, 1), classifier=lr):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print("Validation result for {} features".format(n))
        nfeature_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train,
                                                      x_validation, y_validation)
        result.append((n, nfeature_accuracy, tt_time))
    return result
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import matplotlib.pyplot as plt

tvec = TfidfVectorizer()
feature_result_ugt = nfeature_accuracy_checker(vectorizer=tvec)
feature_result_bgt = nfeature_accuracy_checker(vectorizer=tvec, ngram_range=(1, 2))
feature_result_tgt = nfeature_accuracy_checker(vectorizer=tvec, ngram_range=(1, 3))

nfeatures_plot_tgt = pd.DataFrame(feature_result_tgt, columns=['nfeatures', 'validation_accuracy', 'train_test_time'])
nfeatures_plot_bgt = pd.DataFrame(feature_result_bgt, columns=['nfeatures', 'validation_accuracy', 'train_test_time'])
nfeatures_plot_ugt = pd.DataFrame(feature_result_ugt, columns=['nfeatures', 'validation_accuracy', 'train_test_time'])

# nfeatures_plot_ug, nfeatures_plot_bg and nfeatures_plot_tg hold the count vectorizer
# results from the previous post, built the same way with nfeature_accuracy_checker
plt.figure(figsize=(8, 6))
plt.plot(nfeatures_plot_tgt.nfeatures, nfeatures_plot_tgt.validation_accuracy, label='trigram tfidf vectorizer', color='royalblue')
plt.plot(nfeatures_plot_tg.nfeatures, nfeatures_plot_tg.validation_accuracy, label='trigram count vectorizer', linestyle=':', color='royalblue')
plt.plot(nfeatures_plot_bgt.nfeatures, nfeatures_plot_bgt.validation_accuracy, label='bigram tfidf vectorizer', color='orangered')
plt.plot(nfeatures_plot_bg.nfeatures, nfeatures_plot_bg.validation_accuracy, label='bigram count vectorizer', linestyle=':', color='orangered')
plt.plot(nfeatures_plot_ugt.nfeatures, nfeatures_plot_ugt.validation_accuracy, label='unigram tfidf vectorizer', color='gold')
plt.plot(nfeatures_plot_ug.nfeatures, nfeatures_plot_ug.validation_accuracy, label='unigram count vectorizer', linestyle=':', color='gold')
plt.title("N-gram(1~3) test result : Accuracy")
plt.xlabel("Number of features")
plt.ylabel("Validation set accuracy")
plt.legend()
From the above chart, we can see that including bigrams and trigrams boosts model performance with both the count vectorizer and the TFIDF vectorizer, and that in every case from unigram to trigram, TFIDF yields better results than the count vectorizer.
Algorithms Comparison
The best result I could get with logistic regression was by using the TFIDF vectorizer with 100,000 features, including up to trigrams. With this setting, I will first fit various models and compare their validation results, then build an ensemble (voting) classifier with the top 5 models.
I haven't included some of the computationally expensive models, such as KNN and random forest, considering the size of the data and the scalability of the models. Fine-tuning of the models will come after I have tried some other ways of vectorising the textual data.
I will not go into detail about how each model works, since that is not the purpose of this post. You can find many useful resources online, but if I get many questions or requests about a particular algorithm, I will try to write a separate post dedicated to it.
(Please note that inside the below “classifier_comparator” function, I’m calling another custom function “accuracy_summary”, which reports validation accuracy compared to null accuracy, and also the time it took to train and evaluate.)
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.feature_selection import SelectFromModel

names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection",
         "Multinomial NB", "Bernoulli NB", "Ridge Classifier", "AdaBoost", "Perceptron",
         "Passive-Aggressive", "Nearest Centroid"]

classifiers = [
    LogisticRegression(),
    LinearSVC(),
    # the L1-penalised SVC zeroes out uninformative features before the final L2 SVC is fit
    Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
]

zipped_clf = list(zip(names, classifiers))
tvec = TfidfVectorizer()

def classifier_comparator(vectorizer=tvec, n_features=10000, stop_words=None,
                          ngram_range=(1, 1), classifier=zipped_clf):
    result = []
    vectorizer.set_params(stop_words=stop_words, max_features=n_features, ngram_range=ngram_range)
    for n, c in classifier:
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print("Validation result for {}".format(n))
        print(c)
        clf_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train,
                                                 x_validation, y_validation)
        result.append((n, clf_accuracy, tt_time))
    return result

trigram_result = classifier_comparator(n_features=100000, ngram_range=(1, 3))
And the results of the comparison are as below.
It looks like logistic regression is my best performing classifier.
And the result for the ensemble classifier, which takes votes from the top 5 models from the above result (logistic regression, linear SVC, multinomial NB, ridge classifier, passive-aggressive classifier), is as below. Note that I did not include the "linear SVC with L1-based feature selection" model in the voting classifier: it is the same model as Linear SVC, except that it first filters out features by L1 regularisation, and comparing the results, linear SVC without the feature selection performed better.
from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression()
clf2 = LinearSVC()
clf3 = MultinomialNB()
clf4 = RidgeClassifier()
clf5 = PassiveAggressiveClassifier()
# 'hard' voting: each classifier casts one vote and the majority class wins
eclf = VotingClassifier(estimators=[('lr', clf1), ('svc', clf2), ('mnb', clf3),
                                    ('rcs', clf4), ('pac', clf5)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, clf4, clf5, eclf],
                      ['Logistic Regression', 'Linear SVC', 'Multinomial NB',
                       'Ridge Classifier', 'Passive Aggressive Classifier', 'Ensemble']):
    checker_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(max_features=100000, ngram_range=(1, 3))),
        ('classifier', clf)
    ])
    print("Validation result for {}".format(label))
    print(clf)
    clf_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train,
                                             x_validation, y_validation)
The validation set accuracy of the voting classifier turned out to be 82.47%, which is worse than logistic regression alone at 82.92%.
Lexical Approach
What I have demonstrated above are machine learning approaches to the text classification problem, which try to solve it by training classifiers on a labeled dataset. Another well-known approach to the sentiment analysis task is the lexical approach. "In the lexical approach the definition of sentiment is based on the analysis of individual words and/or phrases; emotional dictionaries are often used: emotional lexical items from the dictionary are searched in the text, their sentiment weights are calculated, and some aggregated weight function is applied." http://www.dialog-21.ru/media/1226/blinovpd.pdf
In part 3 of this series, I calculated the harmonic mean of "positive rate CDF" and "positive frequency percent CDF", and this gave me a good representation of the positive and negative terms in the corpus. If it successfully captures which terms are important to each class, then it can also be used for prediction in a lexical manner.
So I decided to build a simple predictor that makes use of the harmonic mean values I calculated. Below I go through the term frequency calculation and the steps to get 'pos_normcdf_hmean', but this time I calculated the term frequencies only from the training set. (Since I learned that I don't need to transform the sparse matrix into a dense matrix for the term frequency calculation, I computed the frequencies directly from the sparse matrix.)
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import hmean
from scipy.stats import norm
import numpy as np
import pandas as pd

cvec = CountVectorizer(max_features=10000)
cvec.fit(x_train)
neg_train = x_train[y_train == 0]
pos_train = x_train[y_train == 1]
neg_doc_matrix = cvec.transform(neg_train)
pos_doc_matrix = cvec.transform(pos_train)
# per-term frequencies summed directly on the sparse document-term matrices
neg_tf = np.sum(neg_doc_matrix, axis=0)
pos_tf = np.sum(pos_doc_matrix, axis=0)

def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

neg = np.squeeze(np.asarray(neg_tf))
pos = np.squeeze(np.asarray(pos_tf))
term_freq_df2 = pd.DataFrame([neg, pos], columns=cvec.get_feature_names()).transpose()
term_freq_df2.columns = ['negative', 'positive']
term_freq_df2['total'] = term_freq_df2['negative'] + term_freq_df2['positive']
term_freq_df2['pos_rate'] = term_freq_df2['positive'] * 1. / term_freq_df2['total']
term_freq_df2['pos_freq_pct'] = term_freq_df2['positive'] * 1. / term_freq_df2['positive'].sum()
term_freq_df2['pos_rate_normcdf'] = normcdf(term_freq_df2['pos_rate'])
term_freq_df2['pos_freq_pct_normcdf'] = normcdf(term_freq_df2['pos_freq_pct'])
term_freq_df2['pos_normcdf_hmean'] = hmean([term_freq_df2['pos_rate_normcdf'],
                                            term_freq_df2['pos_freq_pct_normcdf']])
term_freq_df2.sort_values(by='pos_normcdf_hmean', ascending=False).iloc[:10]
If you want a more detailed explanation of the formula I applied to come up with the final values of "pos_normcdf_hmean", you can find it in part 3 of this series.
The positivity score I decided on is fairly simple and straightforward: for each word in a document, look it up in the 10,000-term vocabulary I built and get the corresponding 'pos_normcdf_hmean' value, then take the average of these values over the document. If none of the words can be found among the 10,000 terms, yield a random probability between 0 and 1. The single value I get for a document is then treated as the probability of the document belonging to the positive class.
Normally, a lexical approach will take many other aspects into account to refine the prediction result, but I will try a very simple model.
pos_hmean = term_freq_df2.pos_normcdf_hmean
y_val_predicted_proba = []
for t in x_validation:
    # average the hmean scores of the known words; fall back to a random guess
    hmean_scores = [pos_hmean[w] for w in t.split() if w in pos_hmean.index]
    if len(hmean_scores) > 0:
        prob_score = np.mean(hmean_scores)
    else:
        prob_score = np.random.random()
    y_val_predicted_proba.append(prob_score)

pred = [1 if t > 0.56 else 0 for t in y_val_predicted_proba]

from sklearn.metrics import accuracy_score
accuracy_score(y_validation, pred)
With the average value of "pos_hmean", I decided the threshold to be 0.56: if the average value of "pos_hmean" is bigger than 0.56, the classifier predicts the positive class; if it's equal to or smaller than 0.56, it predicts the negative class. The accuracy of the above model is 75.96%. That is not as good as logistic regression with the count vectorizer or the TFIDF vectorizer, but it is 25.56% more accurate than the null accuracy, and even compared to TextBlob sentiment analysis, my simple custom lexicon model is 15.31% more accurate. This is an impressive result for such a simple calculation, especially considering that 'pos_normcdf_hmean' was calculated only from the training set.
In the next post, I will try to implement Doc2Vec to see if the performance gets better.
Thank you for reading, and you can find the Jupyter Notebook at the link below.
https://github.com/tthustla/twitter_sentiment_analysis_part5/blob/master/Capstone_part4-Copy3.ipynb
And Medium blog post: