Machine Learning: Working Example of Analysis of Variance(ANOVA) in R

Some Basic Knowledge

Variance - Variance measures how the data spreads about its mean (or expected value). For a set of values it is given by:

σ² = Σ(x − x̄)² / n

Where,

σ² = Variance

x = Values given in a set of data

x̄ = Mean of the data

n = Total number of values.
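
As a quick sketch in R (with a small invented data set), the formula above can be computed by hand; note that R's built-in var() divides by n − 1 instead of n, because it estimates the sample variance:

# A small invented data set
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)

# Population variance: average squared deviation from the mean
sum((x - mean(x))^2) / n    # 2.916667

# R's built-in var() computes the sample variance (divides by n - 1)
var(x)                      # 3.5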

Normal Distribution - The normal distribution is defined by the Normal equation :

Normal equation - The value of the random variable Y is:

Y = (1 / (σ√(2π))) · e^(−(X − μ)² / (2σ²))

where X is a normal random variable, μ is the mean, σ is the standard deviation, π is approximately 3.14159, and e is approximately 2.71828. The random variable X in the normal equation is called the normal random variable. The normal equation is the probability density function for the normal distribution.
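
As a minimal check in R, the density formula above can be evaluated by hand and compared to the built-in dnorm() (the values of μ, σ and x below are arbitrary):

mu    <- 0
sigma <- 1
x     <- 1.5

# Density "by hand", straight from the normal equation
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

# R's built-in normal density gives the same value
dnorm(x, mean = mu, sd = sigma)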

t - Test - A t-test is a statistical examination of two population means. A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. For example, a t-test could be used to compare the average floor routine score of the U.S. women's Olympic gymnastic team to the average floor routine score of China's women's team.
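
In R, a two-sample t-test is a single call to t.test(); the floor-routine scores below are invented purely for illustration:

# Hypothetical floor-routine scores for the two teams
us_scores    <- c(14.2, 14.8, 15.1, 14.5, 14.9)
china_scores <- c(14.6, 15.0, 15.3, 14.8, 15.2)

# Welch's t-test is R's default; it does not assume equal variances
t.test(us_scores, china_scores)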

F - Distribution - The F-distribution is the distribution of ratios of two independent estimators of the population variances. The main use of the F-distribution is to test whether two independent samples have been drawn from normal populations with the same variance, or whether two independent estimates of the population variance are homogeneous, since it is often desirable to compare two variances rather than two averages. For instance, college administrators would prefer two college professors grading exams to have the same variation in their grading. For this, the F-test can be used.

In order to perform an F-test of two variances, it is important that the following are true:

  • The populations from which the two samples are drawn are normally distributed.
  • The two populations are independent of each other.

If the two populations have equal variances, then s₁² and s₂² are close in value and F is close to 1. But if the two population variances are very different, s₁² and s₂² tend to be very different, too.

Choosing s₁² as the larger sample variance causes the ratio to be greater than 1. If s₁² and s₂² are far apart, then F is a large number. Therefore, if F is close to 1, the evidence favours the null hypothesis (the two population variances are equal). But if F is much larger than 1, then the evidence is against the null hypothesis, and we can infer that possibly the population variances differ to a large extent.

F = s₁² / s₂²
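
In R, this F-test of two variances is var.test(); the two sets of exam grades below are invented:

# Exam grades from two graders (invented numbers)
grader_a <- c(78, 85, 92, 70, 88, 81, 76)
grader_b <- c(80, 82, 84, 79, 83, 81, 85)

# H0: the two population variances are equal (their ratio is 1)
var.test(grader_a, grader_b)
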
P - Value - A p-value helps you determine the significance of your results. The calculated value of the F-test with its associated p-value is used to decide whether to reject the null hypothesis.

  • A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
  • p-values very close to the cutoff (0.05) are considered to be marginal (could go either way).

Degrees of Freedom - The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it, is called the number of degrees of freedom. In other words, the number of degrees of freedom can be defined as the minimum number of independent coordinates that can specify the position of the system completely.

Estimates of statistical parameters can be based upon different amounts of information or data. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. In general, the degrees of freedom of an estimate of a parameter equal the number of independent scores that go into the estimate minus the number of parameters used as intermediate steps in the estimation. For example, the sample variance has N − 1 degrees of freedom: it is computed from N scores, minus the one parameter (the sample mean) estimated as an intermediate step.
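
The N − 1 degrees of freedom of the sample variance can be verified directly in R (invented numbers):

x <- c(10, 12, 9, 11, 14)
n <- length(x)

# Sample variance by hand: divide by n - 1, not n, because the
# sample mean was already estimated from the same data
sum((x - mean(x))^2) / (n - 1)   # 3.7

# var() agrees, since it also uses n - 1 degrees of freedom
var(x)                           # 3.7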

Introduction

Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups). In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance. It is conceptually similar to multiple two-sample t-tests, but is more conservative (it results in fewer type I errors) and is therefore suited to a wide range of practical problems.

These days, researchers are using ANOVA in many ways. The use of ANOVA depends on the research design. Commonly, researchers use ANOVA in three ways: one-way ANOVA, two-way ANOVA and n-way ANOVA.

One-Way : When we compare more than two groups based on one factor (independent variable), this is called one-way ANOVA. For example, it is used if a manufacturing company wants to compare the productivity of three or more employees based on working hours.
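
A minimal one-way ANOVA sketch in R, using an invented version of the employee-productivity example:

# Hypothetical daily output for three employees
productivity <- data.frame(
  employee = factor(rep(c("A", "B", "C"), each = 4)),
  output   = c(23, 25, 22, 24, 30, 28, 31, 29, 26, 27, 25, 28)
)

fit <- aov(output ~ employee, data = productivity)
summary(fit)   # the F value and Pr(>F) test whether the three means are equal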

Two-Way : When a company wants to compare employee productivity based on two factors (2 independent variables), it is said to be two-way (factorial) ANOVA. For example, based on working hours and working conditions, if a company wants to compare employee productivity, it can do that through two-way ANOVA. Two-way ANOVA can be used to see the effect of one of the factors after controlling for the other, or it can be used to see the interaction between the two factors. This is a great way to control for extraneous variables, as you are able to add them to the design of the study.
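
A sketch of the two-way case in R, with invented hours and conditions factors; the * in the model formula requests both main effects and their interaction:

# Invented factorial data: 3 replicates per cell
dat <- expand.grid(
  hours     = factor(c("short", "long")),
  condition = factor(c("good", "poor")),
  rep       = 1:3
)
set.seed(1)
dat$output <- rnorm(nrow(dat), mean = 25, sd = 2)

fit2 <- aov(output ~ hours * condition, data = dat)
summary(fit2)   # rows for hours, condition and hours:condition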

Factorial ANOVA can be balanced or unbalanced. That is to say, you can have the same number of subjects in each group (balanced) or not (unbalanced). This can come about, depending on the study, as just a reflection of the population, or through an unwanted event such as participants not returning to the study. Not having equal-sized groups can make it appear that there is an effect when this may not be the case. There are several procedures a researcher can use in order to solve this problem:

  • Discard cases (undesirable)
  • Conduct a special kind of ANOVA which can deal with the unbalanced design

There are three types of ANOVA that can handle an unbalanced design. These are the Classical Experimental design (Type 2 analysis), the Hierarchical approach (Type 1 analysis), and the Full Regression approach (Type 3 analysis). Which approach to use depends on whether the unbalanced data occurred on purpose; a sketch of all three in R follows the list below.

  • If the data was not intended to be unbalanced but you can argue some type of hierarchy between the factors, use the Hierarchical approach (Type 1).
  • If the data was not intended to be unbalanced and you cannot find any hierarchy, use the classical experimental approach (Type 2).
  • If the data is unbalanced because this is a reflection of the population and it was intended, use the Full Regression approach (Type 3).
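
In R, Type 1 sums of squares come from the base anova() function, while Types 2 and 3 are commonly obtained from the Anova() function in the car package (assumed installed here). The data set below is invented and deliberately unbalanced:

library(car)   # install.packages("car") if needed

# Invented unbalanced data: unequal numbers per factor combination
dat <- data.frame(
  a = factor(c("x", "x", "x", "y", "y", "x", "y", "x")),
  b = factor(c("p", "q", "p", "q", "p", "q", "q", "p")),
  y = c(5.1, 6.3, 5.8, 7.2, 6.9, 6.1, 7.5, 5.5)
)

# For meaningful Type 3 tests, use sum-to-zero contrasts
options(contrasts = c("contr.sum", "contr.poly"))

fit <- aov(y ~ a * b, data = dat)

anova(fit)            # Type 1 (Hierarchical / sequential)
Anova(fit, type = 2)  # Type 2 (Classical Experimental)
Anova(fit, type = 3)  # Type 3 (Full Regression)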

N-Way : When the factor comparison is taken, then it said to be n-way ANOVA. For example, in productivity measurement if a company takes all the factors for productivity measurement, then it is said to be n-way ANOVA.

Procedure: In an ANOVA, a researcher first sets up the null and alternative hypotheses. The null hypothesis assumes that there is no significant difference between the groups. The alternative hypothesis assumes that there is a significant difference between the groups. After cleaning the data, the researcher must test the above assumptions and see whether the data meets them. They must then do the necessary calculations and compute the F-ratio, and compare the resulting p-value against the established alpha. Accordingly, we reject the null hypothesis or we fail to reject it. Rejecting the null hypothesis, we conclude that the means of the groups are not all equal.
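
The whole procedure in miniature, as an R sketch with invented data (the alpha of 0.05 is the conventional choice, not a requirement):

# H0: all group means are equal; H1: at least one mean differs
scores <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 5)),
  value = c(12, 14, 11, 13, 12, 16, 18, 17, 15, 19, 13, 12, 14, 13, 15)
)

fit   <- aov(value ~ group, data = scores)
tab   <- summary(fit)[[1]]          # the ANOVA table
p_val <- tab[["Pr(>F)"]][1]         # p-value for the group factor

alpha <- 0.05
if (p_val <= alpha) {
  print("Reject H0: the group means are not all equal")
} else {
  print("Fail to reject H0")
}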

Questions the ANOVA Answers

One-way ANOVA :

  • Are there differences in GPA by grade level (freshmen vs. sophomores vs. juniors)?
  • Are there differences in the profit of the Fortune 500 companies by highest educational degree attained by the CEO (B.A./B.S vs. master’s vs. doctorate)?

Two-way ANOVA :

  • Are there differences in GPA by grade level (freshmen vs. sophomores vs. juniors) and school district (district one vs. district two)?
  • Are there differences in the profit of the Fortune 500 companies by highest educational degree attained by the CEO (B.A./B.S vs. master’s vs. doctorate) and number of employees (<100 vs. 100 - 500 vs. > 500)?

Data Level and Assumptions

Data level and assumptions play a very important role in ANOVA. In ANOVA, the dependent variable can be continuous or on the interval scale. Factor variables in ANOVA should be categorical. ANOVA is a parametric test and has some assumptions, which should be met to get the desired results. ANOVA assumes that the data are normally distributed. ANOVA also makes the assumption of homogeneity, which means that the variances between the groups should be equal. ANOVA further assumes that the cases are independent of each other, i.e. there should not be any pattern between the cases. As usual, when planning any study, extraneous and confounding variables need to be considered. ANOVA is a way to control these types of undesirable variables.

Testing of the Assumptions

1. The populations from which the samples are drawn should be normally distributed.

2. Independence of cases: the sample cases should be independent of each other.

3. Homogeneity: Homogeneity means that the variance between the groups should be approximately equal.

These assumptions can be tested using statistical software. The assumption of homogeneity of variance can be tested using tests such as Levene's test or the Brown-Forsythe test. Normality of the distribution of the population can be tested using plots, the values of skewness and kurtosis, or tests such as Shapiro-Wilk or Kolmogorov-Smirnov. The assumption of independence can be determined from the design of the study.
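
A sketch of these checks in R on an invented one-way design (Levene's test assumes the car package; Bartlett's test is in base R):

set.seed(42)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 6)),
  y     = rnorm(18, mean = 10, sd = 2)
)
fit <- aov(y ~ group, data = dat)

# Normality: Shapiro-Wilk test on the model residuals
shapiro.test(residuals(fit))

# Homogeneity of variance: Levene's test (car) or Bartlett's test (base R)
car::leveneTest(y ~ group, data = dat)
bartlett.test(y ~ group, data = dat)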

It is important to note that ANOVA is not robust to violations of the assumption of independence. That is to say, even if you violate the assumptions of homogeneity or normality, you can conduct statistical procedures that will still enable you to run the ANOVA, but you cannot with violations of independence. In general, with violations of homogeneity the study can probably carry on if you have equal-sized groups. With violations of normality, continuing with the ANOVA should be fine if you have a large sample size and equal-sized groups.

Effect Size: When conducting an ANOVA it is always important to calculate the effect size. The effect size can tell you the degree to which the null hypothesis is false. Effect sizes can be considered small, medium or large. A medium effect size is one that is noticeable to the layperson's eye. 

Four of the commonly used measures of effect size in ANOVA are: eta squared (η²), partial eta squared (ηp²), omega squared (ω²), and the intraclass correlation (ρI). For more information on this, you can go through the following link:

http://www.uccs.edu/lbecker/glm_effectsize.html
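
Eta squared is the ratio of the effect sum of squares to the total sum of squares, so it can be read straight off the ANOVA table; here is a sketch reusing the invented scores data from the procedure example above:

fit <- aov(value ~ group, data = scores)
tab <- summary(fit)[[1]]

ss_effect <- tab[["Sum Sq"]][1]     # sum of squares for the group factor
ss_total  <- sum(tab[["Sum Sq"]])   # effect + residual sums of squares
eta_sq    <- ss_effect / ss_total
eta_sq                              # proportion of variance explained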

If from running an ANOVA you determine that you do not have statistically significantly different groups, but you have a large effect size, you might want to rerun this ANOVA with a larger sample size. A large effect size without statistical significance could be an indication that significance can be reached with a larger sample.

Related Statistical Tests: These days, researchers have extended ANOVA into MANOVA and ANCOVA. MANOVA stands for multivariate analysis of variance. MANOVA is used when there are two or more dependent variables. ANCOVA stands for analysis of covariance. ANCOVA is used when the researcher includes one or more covariate variables alongside the independent variables.

For a Working Example in R: ANOVA

