Basic Statistics for Exploratory Data Analysis (EDA)
Even though neural networks are very effective for large unstructured data like images, text and speech, we still have to manually analyze data that is smaller, in a structured format, or both, like the data in relational databases, Excel sheets or tables in general. In this article, I go over the concepts I learned from reading the book "Hands-On Exploratory Data Analysis with Python".
Before we look at EDA let's go over the common types of data we find in structured formats like databases.
2. Categorical data - Whenever data falls into one of the buckets of a given set, it is called categorical data (for example, a person's blood type can be A, B, AB or O). If there are only two categories (two buckets) it is called a "binary categorical variable", whereas if there are more it is called a "polytomous categorical variable".
While we are at it let us also look at the various measurement scales in statistics.
Now that we've covered data types and measurement scales, let's dive into the statistics concepts we'll need for EDA.
Distribution Functions
Continuous Function - A continuous function is any function whose value does not change abruptly; such abrupt or unexpected changes are referred to as discontinuities. For example, consider the following cubic function:
y = x ** 3 + x ** 2 - 5*x + 3
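As a quick sketch of what continuity means numerically, we can evaluate this cubic in plain Python and check that a tiny step in x produces only a tiny step in y (the helper name `f` is my own):

```python
# The cubic from above: y = x**3 + x**2 - 5*x + 3
def f(x):
    return x ** 3 + x ** 2 - 5 * x + 3

print(f(1))   # 0 -- x = 1 is a root: 1 + 1 - 5 + 3 = 0
print(f(0))   # 3

# No discontinuity: a tiny change in x changes y only slightly.
print(abs(f(1.0001) - f(1)) < 0.01)   # True
```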
Probability Density Function (PDF) - Describes the relative likelihood of a continuous random variable taking a given value x. Since the probability of any single exact value is zero for a continuous variable, probabilities are obtained by integrating the PDF over an interval.
Probability Mass Function (PMF) - The discrete counterpart of the PDF: for a discrete random variable it gives the probability P(X = x) that the variable takes exactly the value x.
The probability distribution or probability function of a discrete random variable is a list of probabilities linked to each of its attainable values. Common continuous probability distributions include the normal, exponential, uniform and gamma distributions; the Poisson and binomial distributions are discrete. Let us look at the equations for some of them.
Uniform distribution:
f(x) = 1 / (b - a) if a <= x <= b else 0
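A minimal sketch of this PDF in plain Python (the function name `uniform_pdf` is my own), with a Riemann sum to confirm the total probability over [a, b] is 1:

```python
# Uniform PDF on [a, b]: constant 1/(b - a) inside the interval, 0 outside.
def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

a, b = 2.0, 6.0
print(uniform_pdf(3.0, a, b))   # 0.25
print(uniform_pdf(7.0, a, b))   # 0.0

# Riemann sum over [a, b] should be 1 (total probability).
n = 100000
width = (b - a) / n
total = sum(uniform_pdf(a + (i + 0.5) * width, a, b) * width for i in range(n))
print(round(total, 6))   # 1.0
```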
Normal distribution:
f(x) = (1 / (sigma * sqrt(2 * pi))) * e ** (-((x - mu) ** 2) / (2 * sigma ** 2))
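Translating that equation directly into Python (the name `normal_pdf` is my own), we can sanity-check two known properties: the peak density at the mean is 1/(sigma*sqrt(2*pi)), and the curve is symmetric about the mean:

```python
import math

# Normal PDF with mean mu and standard deviation sigma.
def normal_pdf(x, mu, sigma):
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# At x = mu the density peaks at 1 / (sigma * sqrt(2*pi)).
print(abs(normal_pdf(0, 0, 1) - 1 / math.sqrt(2 * math.pi)) < 1e-12)  # True
# Symmetric about the mean:
print(normal_pdf(1, 0, 1) == normal_pdf(-1, 0, 1))  # True
```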
Exponential distribution:
f(x) = lambda * e ** (- (lambda * x)) if x>=0 else 0
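The same equation in Python (note `lambda` is a reserved word, so I use `lam` for the rate; the helper name is my own):

```python
import math

# Exponential PDF with rate lam; zero for negative x.
def exp_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

print(exp_pdf(0, 2.0))    # 2.0 -- the density peaks at lambda when x = 0
print(exp_pdf(-1, 2.0))   # 0.0 -- no mass for negative x
```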
Binomial distribution: A discrete distribution for the number of successes in n independent trials, where each trial has only two possible outcomes (e.g. success or failure).
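Its PMF is the standard C(n, k) * p^k * (1 - p)^(n - k); a quick sketch using the standard library's `math.comb` (the helper name `binom_pmf` is my own):

```python
import math

# Binomial PMF: probability of exactly k successes in n independent
# trials, each succeeding with probability p.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 4 fair coin flips: C(4,2)/16 = 0.375
print(binom_pmf(2, 4, 0.5))
# The PMF sums to 1 over all possible k:
print(sum(binom_pmf(k, 4, 0.5) for k in range(5)))  # 1.0
```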
Cumulative Distribution Function (CDF): The probability that the variable takes a value less than or equal to x
F(x) = P[X <= x]
For a scalar continuous random variable, the CDF gives the area under the PDF from minus infinity to x. For multivariate random variables, the joint CDF specifies the distribution.
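For the standard normal distribution, the CDF has a closed form in terms of the error function, which is available in the standard library (the helper name `normal_cdf` is my own):

```python
import math

# Standard normal CDF: P[X <= x] = 0.5 * (1 + erf(x / sqrt(2)))
def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

print(normal_cdf(0))               # 0.5 -- half the mass lies below the mean
print(round(normal_cdf(1.96), 3))  # 0.975 -- the familiar 95% two-sided cutoff
```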
Descriptive Statistics - There are two types of descriptive statistics
Measure of central tendency
Measure of dispersion
Standard Deviation - This shows how much the data is spread out from the mean. It is the square root of the average squared difference between each value in the dataset and its mean.
Variance - The average squared difference from the mean; equivalently, the variance is the square of the standard deviation.
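Both measures are in Python's standard `statistics` module, so we can verify the relationship directly (the sample data here is made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5

# Population variance: mean of squared deviations from the mean.
var = statistics.pvariance(data)   # (9+1+1+1+0+0+4+16)/8 = 4.0
# Population standard deviation: square root of the variance.
sd = statistics.pstdev(data)       # 2.0

print(var, sd)
print(sd ** 2 == var)   # True: variance is the square of the std dev
```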
Skewness - The measure of asymmetry in a dataset about its mean.
Kurtosis - It is a statistical measure that illustrates how heavily the tails of a distribution differ from those of a normal distribution. This technique can identify whether a given distribution contains extreme values.
Types of kurtosis -
Calculating percentiles
Percentiles measure the percentage of values that lie below a certain value.
formula to calculate percentile of X = ((Number of observations less than X) / (Total Number of observations)) * 100
Quartile - The 25th percentile is referred to as Q1, the 50th percentile (the median) as Q2, the 75th percentile as Q3 and finally, Q4 is the 100th percentile.
We can visualise quartiles as box plots.
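The percentile-rank formula above can be sketched in a few lines of plain Python, and the standard `statistics.quantiles` helper gives the quartile cut points (the scores list and the helper name `percentile_rank` are my own illustrative choices):

```python
import statistics

# Percentile rank of X per the formula above:
# (number of observations less than X / total observations) * 100
def percentile_rank(data, x):
    below = sum(1 for v in data if v < x)
    return below / len(data) * 100

scores = [35, 40, 50, 55, 60, 65, 70, 80, 90, 95]
print(percentile_rank(scores, 70))   # 60.0 -- 6 of the 10 scores are below 70

# Quartile cut points Q1, Q2, Q3 (n=4 splits the data into four parts):
print(statistics.quantiles(scores, n=4))
```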
Correlation
Any dataset that we want to analyze will have different fields (that is, columns) of multiple observations (that is, variables) representing different facts. The columns of a dataset are most probably related to one another because they are collected from the same event, and one field may or may not affect the value of another. To examine the type of relationships these columns have, and to analyze the causes and effects between them, we have to find the dependencies that exist among variables. The strength of such a relationship between two fields of a dataset is called correlation, which is represented by a numerical value between -1 and 1.
Correlation tells us how variables change together, both in the same or opposite directions and in the magnitude (that is, strength) of the relationship. To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho). This is obtained by dividing the covariance by the product of the standard deviations of the variables:
rho(x, y) = covariance(x, y) / (standard deviation of x * standard deviation of y)
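That definition translates directly into a small Python function (the name `pearson` is my own); perfectly linear data should give a coefficient of +1 or -1:

```python
import math

# Pearson correlation: covariance divided by the product of the
# standard deviations of the two variables.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(round(pearson(xs, [2, 4, 6, 8, 10]), 6))   # 1.0  -- perfect positive
print(round(pearson(xs, [10, 8, 6, 4, 2]), 6))   # -1.0 -- perfect negative
```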
Types of correlation analysis:
Simpson's paradox - It is the difference that appears in a trend of analysis when a dataset is analyzed in two different situations: first, when data is separated into groups and, second, when data is aggregated.
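The classic numbers from the well-known kidney-stone treatment study make the paradox concrete: treatment A has the higher success rate within each group, yet the lower rate overall. A small sketch (the data layout is my own):

```python
# (successes, total) per treatment, within each group.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Separated into groups: A wins both comparisons.
for name, g in groups.items():
    a, b = rate(*g["A"]), rate(*g["B"])
    print(f"{name}: A={a:.0%} B={b:.0%} -> A wins: {a > b}")

# Aggregated: the trend reverses and B wins.
tot = {t: [sum(g[t][i] for g in groups.values()) for i in (0, 1)] for t in "AB"}
a_all, b_all = rate(*tot["A"]), rate(*tot["B"])
print(f"overall: A={a_all:.0%} B={b_all:.0%} -> A wins: {a_all > b_all}")
```

The reversal happens because the harder cases (large stones) were disproportionately assigned to treatment A, dragging its aggregate rate down.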
Hypothesis Testing
A Type 1 error is a false positive, and a Type 2 error is a false negative.
P-value - This is also referred to as the probability value or asymptotic significance. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. Generally, if the P-value is lower than a predetermined threshold, we reject the null hypothesis.
Level of significance: This is one of the most important concepts that you should be familiar with before doing hypothesis testing. The level of significance is the threshold at which we decide to reject the null hypothesis. We must note that 100% certainty is not possible when accepting or rejecting a hypothesis, so we select a level of significance based on our subject and domain. Generally, it is 0.05 or 5%, which means we accept a 5% risk of rejecting a null hypothesis that is actually true (a Type 1 error); equivalently, we require 95% confidence before rejecting it.
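Putting the pieces together, here is a minimal sketch of a two-sided one-sample z-test, assuming the population standard deviation is known (the sample data and the helper name `z_test_p_value` are my own; the CDF uses `math.erf` as shown earlier):

```python
import math

# Two-sided one-sample z-test: is the sample mean consistent with mu0?
# Assumes the population standard deviation sigma is known.
def z_test_p_value(sample, mu0, sigma):
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)

# Null hypothesis: the true mean is 5.0 (sigma assumed to be 0.2).
sample = [5.1, 5.3, 4.9, 5.4, 5.2, 5.0, 5.3, 5.1, 5.2]
p = z_test_p_value(sample, mu0=5.0, sigma=0.2)
alpha = 0.05
print(f"p = {p:.4f}, reject H0 at 5% level: {p < alpha}")
```

Here the sample mean is about 5.17, giving a p-value near 0.012, so at the 5% level we would reject the null hypothesis.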