Regression Analysis: Simplify Complex Data Relationships - Data Analytics Project

Intro to the project:

As I work through my Google Advanced Data Analytics certification, I've made it to my next project, which pushes me to explore the depths of regression modeling, applying my growing skills to the TikTok dataset.

Background on the TikTok project scenario:

TikTok users have the ability to submit reports that identify videos and comments that contain user claims. These reports identify content that needs to be reviewed by moderators. The process generates a large number of user reports that are challenging to consider in a timely manner.

TikTok is working on the development of a predictive model that can determine whether a video contains a claim or offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

Project Assignment:

My assignment for this project is to create and evaluate a logistic regression model aimed at deciphering the relationship between video characteristics and verified user status. Using Python, I dug into predictive modeling, expanding my analytical repertoire while aiming to provide actionable insights for stakeholders.

What is logistic regression?

  • Logistic regression is a data analysis technique that uses mathematics to find the relationship between data factors. It then uses this relationship to predict the value of one factor based on the others. The prediction usually has a finite number of outcomes, like yes or no (a small illustration follows below).
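As a quick illustration (not part of the original project code), here is the core mechanic in Python: a linear combination of the inputs is passed through the sigmoid function, which squashes it into a probability between 0 and 1. The coefficients below are made up purely for demonstration:

import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Toy example with a single feature and hypothetical (not fitted) coefficients
intercept, slope = -1.5, 0.8
x = np.array([0.0, 1.0, 2.0, 3.0])
print(sigmoid(intercept + slope * x))  # probabilities rise toward 1 as x increases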

Skills I utilized in this project:

  • Conducting statistical analysis
  • Conducting regression modeling
  • Creating predictive models
  • Python coding

Analyzing the Data

The first step I took was importing packages and loading my dataset:

# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample

# Import packages for data modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")        

Exploring data and correlations with EDA

The first steps in this process involve getting a feel for the dataset and pulling some basic statistics:

# Display first few rows
data.head()
# Get number of rows and columns
data.shape
# Get data types of columns
data.dtypes
# Get basic information
data.info()
# Generate basic descriptive stats
data.describe()        

After getting a feel for the data, I did some basic data cleaning:

# Check for missing values
data.isna().sum()
# Drop rows with missing values
data = data.dropna(axis=0)
# Check for duplicates
data.duplicated().sum()        

After data cleaning, I made some boxplots to detect outliers for video statistics:

# Create a boxplot to visualize distribution of `video_like_count`
plt.figure(figsize=(6,2))
plt.title('Boxplot to detect outliers for video_like_count', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=data['video_like_count'])
plt.show()

# Create a boxplot to visualize distribution of `video_comment_count`
plt.figure(figsize=(6,2))
plt.title('Boxplot to detect outliers for video_comment_count', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=data['video_comment_count'])
plt.show()        

I discovered outliers in the video like and comment counts, and handled them by capping values above the upper IQR limit (75th percentile + 1.5 × IQR) for my analysis:

percentile25 = data["video_like_count"].quantile(0.25)
percentile75 = data["video_like_count"].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
data.loc[data["video_like_count"] > upper_limit, "video_like_count"] = upper_limit

percentile25 = data["video_comment_count"].quantile(0.25)
percentile75 = data["video_comment_count"].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
data.loc[data["video_comment_count"] > upper_limit, "video_comment_count"] = upper_limit        

I then started to analyze the verified status counts:

data["verified_status"].value_counts(normalize=True)        

I found that about 94% of the dataset represents videos posted by unverified accounts and 6% represents videos posted by verified accounts. Verified status is the outcome variable in my analysis, so I need to make sure its classes are balanced. To do this, I upsampled the minority class:

# Identify data points from majority and minority classes
data_majority = data[data["verified_status"] == "not verified"]
data_minority = data[data["verified_status"] == "verified"]

# Upsample the minority class (which is "verified")
data_minority_upsampled = resample(data_minority,
                                   replace=True,                  # sample with replacement
                                   n_samples=len(data_majority),  # match the majority class size
                                   random_state=0)                # reproducible results

# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled]).reset_index(drop=True)

# Display new class counts
data_upsampled["verified_status"].value_counts()        

The verified status in the new dataset is now balanced.

I then wanted to get the average video transcription text length, a variable I had previously overlooked, for videos posted by each verified status:

data_upsampled[["verified_status", "video_transcription_text"]].groupby(by="verified_status")[["video_transcription_text"]].agg(func=lambda array: np.mean([len(text) for text in array]))        

I found that the average transcription text length differed between the two groups: about 89.4 characters for videos posted by unverified accounts and 84.6 characters for those posted by verified accounts. Because of this difference, I extracted the length of each transcription text and added it as a new column in the dataframe:

data_upsampled["text_length"] = data_upsampled["video_transcription_text"].apply(func=lambda text: len(text))        

I then created a histogram to better visualize this data:

sns.histplot(data=data_upsampled, stat="count", multiple="stack", x="text_length",
             kde=False, palette="pastel", hue="verified_status",
             element="bars", legend=True)
plt.title("Distribution of video_transcription_text length for videos posted by verified accounts and videos posted by unverified accounts")
plt.xlabel("video_transcription_text length (number of characters)")
plt.ylabel("Count")
plt.show()

I then started to examine correlations between variables. To do this, I created a correlation matrix and a heatmap of the data:

# Correlation matrix of the numeric variables
data_upsampled.corr(numeric_only=True)

# Heatmap of correlations; the object columns listed here (claim_status, author_ban_status)
# are dropped by numeric_only=True
plt.figure(figsize=(8, 6))
sns.heatmap(
    data_upsampled[["video_duration_sec", "claim_status", "author_ban_status", "video_view_count", 
                    "video_like_count", "video_share_count", "video_download_count", "video_comment_count", "text_length"]]
    .corr(numeric_only=True), 
    annot=True, 
    cmap="crest")
plt.title("Heatmap of the dataset")
plt.show()

The most strongly correlated variables are video view count and video like count, with a correlation coefficient of 0.86. One assumption of logistic regression is that there is no severe multicollinearity among the features, so I excluded video like count from the model, since it is strongly correlated with several other variables. (An optional multicollinearity check is sketched below.)
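As an optional check that goes beyond the original notebook, variance inflation factors (VIFs) can quantify multicollinearity among the remaining numeric features; values well above 5-10 are generally treated as a red flag. A minimal sketch, assuming statsmodels is available:

# Hedged sketch; not part of the original project code
from statsmodels.stats.outliers_influence import variance_inflation_factor

numeric_features = data_upsampled[["video_duration_sec", "video_view_count",
                                   "video_share_count", "video_download_count",
                                   "video_comment_count", "text_length"]]

vif = pd.DataFrame({
    "feature": numeric_features.columns,
    "VIF": [variance_inflation_factor(numeric_features.values, i)
            for i in range(numeric_features.shape[1])]
})
print(vif)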

After performing EDA and examining correlations, I am now ready to construct my model.

My first step in model construction was to select my variables. I am using verified status as my outcome variable and the different video statistic variables (minus like count) as my features:

y = data_upsampled["verified_status"]
X = data_upsampled[["video_duration_sec", "claim_status", "author_ban_status", "video_view_count", "video_share_count", "video_download_count", "video_comment_count"]]        

I am now splitting my data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)        
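One optional refinement that the original notebook does not use: passing stratify=y to train_test_split keeps the 50/50 class balance from the upsampled data identical in both the training and testing sets. A hedged variant would be:

# Alternative split with stratification; stratify=y is my addition, not the original code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)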

Now I am ready to encode my variables. My first step is to check my data types:

X_train.dtypes        

Because the variables claim_status and author_ban_status are object (string) data types, I needed to encode them so they can be represented numerically. I did this with one-hot encoding, which creates a binary indicator column for each category (dropping the first category to avoid redundancy):

X_train_to_encode = X_train[["claim_status", "author_ban_status"]]

# Fit the one-hot encoder on the training features (drop='first' avoids a redundant column per feature)
X_encoder = OneHotEncoder(drop='first', sparse_output=False)
X_train_encoded = X_encoder.fit_transform(X_train_to_encode)
X_encoder.get_feature_names_out()

# Place encoded training features (currently an array) into a dataframe
X_train_encoded_df = pd.DataFrame(data=X_train_encoded, columns=X_encoder.get_feature_names_out())

# Display the training features without the original `claim_status` and `author_ban_status` columns
X_train.drop(columns=["claim_status", "author_ban_status"]).head()

# Concatenate the remaining numeric features with the encoded categorical features
X_train_final = pd.concat([X_train.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_train_encoded_df], axis=1)

I repeated this process with my outcome variable, since it is also an object data type and needed to be encoded (a sketch of this step is included below).
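The code for that step isn't shown above, but a minimal sketch of what it likely looked like follows. It assumes the same OneHotEncoder(drop='first') approach used for the features, which is consistent with how y_encoder is applied to the testing labels later on:

# Hedged sketch of the outcome encoding (mirrors the feature encoding above).
# With drop='first' and alphabetical categories, "not verified" maps to 0 and "verified" to 1.
y_encoder = OneHotEncoder(drop='first', sparse_output=False)
y_train_final = y_encoder.fit_transform(y_train.values.reshape(-1, 1)).ravel()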

I then constructed a logistic regression model and fit it to the training set:

log_clf = LogisticRegression(random_state=0, max_iter=800).fit(X_train_final, y_train_final)        
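To connect the fitted model to the interpretation in the conclusion, the coefficients can be lined up with the feature names. This is a sketch of how that inspection might look; the exact values depend on the fitted model:

# Pair each feature with its fitted coefficient (on the log-odds scale).
# The video_duration_sec entry is the ~0.009 value discussed in the conclusion.
coefficients = pd.DataFrame({"feature": X_train_final.columns,
                             "coefficient": log_clf.coef_[0]})
print(coefficients)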

Once again, I had to encode categorical features in the testing set:

X_test_to_encode = X_test[["claim_status", "author_ban_status"]]
# Transform (not fit) the testing features with the encoder fit on the training set
X_test_encoded = X_encoder.transform(X_test_to_encode)
X_test_encoded_df = pd.DataFrame(data=X_test_encoded, columns=X_encoder.get_feature_names_out())
# Display the testing features without the original categorical columns
X_test.drop(columns=["claim_status", "author_ban_status"]).head()
# Concatenate the remaining numeric features with the encoded categorical features
X_test_final = pd.concat([X_test.drop(columns=["claim_status", "author_ban_status"]).reset_index(drop=True), X_test_encoded_df], axis=1)

I then used the logistic regression model to get predictions on the encoded testing set:

# Use the model to get predictions on the encoded testing set
y_pred = log_clf.predict(X_test_final)
y_pred

# Display the true labels of the testing set
y_test

# Encode the true labels so they match the format of the predictions
y_test_final = y_encoder.transform(y_test.values.reshape(-1, 1)).ravel()

Finally, I created a confusion matrix to visualize the results of the logistic regression model:

log_cm = confusion_matrix(y_test_final, y_pred, labels=log_clf.classes_)
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=log_clf.classes_)
log_disp.plot()
plt.show()         

  • The upper-left quadrant displays the number of true negatives: videos posted by unverified accounts that the model correctly classified as such.
  • The upper-right quadrant displays the number of false positives: videos posted by unverified accounts that the model misclassified as posted by verified accounts.
  • The lower-left quadrant displays the number of false negatives: videos posted by verified accounts that the model misclassified as posted by unverified accounts.
  • The lower-right quadrant displays the number of true positives: videos posted by verified accounts that the model correctly classified as such (these counts can also be pulled directly from the matrix, as sketched below).
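For reference, those four counts, along with per-class precision and recall, can be computed straight from the confusion matrix. The numbers themselves depend on the fitted model, so this is just a sketch of the mechanics:

# Unpack the 2x2 confusion matrix: rows are true labels, columns are predictions
tn, fp, fn, tp = log_cm.ravel()

# Precision and recall for the positive ("verified") class
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")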

I then created a classification report for the model:

target_labels = ["verified", "not verified"]
print(classification_report(y_test_final, y_pred, target_names=target_labels))        

  • The classification report shows that the logistic regression model achieved a precision of 61%, a recall of 84%, and an accuracy of 65%.

Conclusion:

  • Based on the logistic regression model, each additional second of video length is associated with a 0.009 increase in the log odds of the user having verified status. That means for every one-second increase in video length, we expect the odds that the user is verified to increase by about 0.9% (see the quick check after this list).
  • The logistic regression model had limited predictive power, with a precision of 61%. However, its recall was very good, at 84%.
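As a quick check of that conversion (using the 0.009 coefficient reported above; the exact figure depends on the fitted model):

import numpy as np

# Convert the video_duration_sec coefficient from log-odds to an odds ratio
odds_ratio = np.exp(0.009)
print(odds_ratio)              # ~1.009
print((odds_ratio - 1) * 100)  # ~0.9% increase in the odds per additional second of video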

