Using Small and Medium Data in an Age of Big Data

Big data today
Today we are in an age of big data. Whether this refers to the storage and access of data in large relational databases, or, more often, to the partitioning of data across relatively cheap commodity servers in data centers, it has changed our approach to data analysis. In the latter context, data can be accessed and processed rapidly in batch form through Hadoop and MapReduce, or in more complex sequential and iterative processes through Spark, which fills some of the gaps where batch processing with MapReduce is not appropriate or not feasible.
Data analysis techniques before big data
One assumption might be that there is no role for the tools and techniques developed for data analysis in the days when computers had neither the storage capacity to hold big data nor the processing power to derive output from it in any reasonable time. But let's not forget that understanding big data depends on the same fundamental rules of probability and statistics, mostly derived in an age before computers were invented, when the results of data analysis had to be worked out through a painful process of manual calculation.
Changing capabilities for data analysis
When the slide rule, the computer, and then the pocket calculator emerged as tools to assist in data analysis, the capability to deal with large datasets was far behind what it is today. Now we have smartphones with computing capabilities beyond those of advanced computers of a few decades ago, and the pocket calculator survives as an app on a phone.
Do we need to consider small or medium data?
The origins of probability theory and statistics are part of the history of science and mathematics: is there anywhere in today's big data world where we still need techniques suited to small- or medium-sized datasets? I argue yes, and there are many examples. In conservation biology, analysts studying rare populations cannot wait for a population to expand to millions of individuals. In clinical trials, some combination of unusual medical conditions, caution about deploying new drugs, and ethical issues around the selection of subjects may prevent the gathering of large amounts of data. Even in business, where big data has been shown to be relevant in many contexts, commercial confidentiality or the small-scale operation of start-ups may mean that data cannot be gathered on a large scale. Another business consideration is the cost of studying data at scale, although the expected decline in the cost of cloud-based data analysis services may make this less important over time.
Data analysis with smaller datasets
So what can be done with smaller datasets? The simplest possible experimental design is an analysis that compares two groups. One, the control group, is subject to conditions as close as possible to those that would hold if the experiment did not occur. The second, the experimental group, is subject to a change in conditions whose effect we wish to test. The experiment tests whether the change in conditions in the experimental group produces results that differ from random effects. The problem with small datasets is whether there are enough individuals in each group for such a test to detect a statistically significant difference.
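To make this concrete, here is a minimal sketch of such a two-group comparison in R, on simulated data (all values are invented for illustration, not drawn from any real study):

    # Simulate a small two-group experiment (illustrative values only)
    set.seed(42)
    control      <- rnorm(8, mean = 50, sd = 5)   # conditions left unchanged
    experimental <- rnorm(8, mean = 59, sd = 5)   # conditions deliberately altered

    # Welch two-sample t-test: do the group means differ beyond random effects?
    t.test(control, experimental)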
How power analysis can help
Fortunately, statistical power analysis can help here. The power of a statistical test can be summarized as the probability of correctly rejecting random effects when there is a real difference between the control and experimental groups. It depends on: the size of each group in the test (easiest to calculate when the groups are the same size); the difference between the mean values of the variable being studied; the standard deviation (a measure of variation) of that variable; and the significance level you regard as appropriate for showing a difference between the groups (0.05 or lower is often assumed). Power analysis can tell you whether you have enough experimental subjects for a useful test, or how many experimental subjects you should look for.
What values are needed to carry out power tests?
Statistical power is often expected to be in the range 0.8 to 0.9 (a high probability, because you want your statistical test to work). Once you have defined the power you want, you need to make some other assumptions before you can use it to find or confirm the number of subjects needed in each group. Experimental practice in different disciplines probably makes the significance level easy to choose. But the standard deviation and the difference between means will need to be based on estimates; data from previous studies may support these, as sketched below.
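For instance, a minimal sketch in R of deriving those estimates from a small hypothetical pilot study (the numbers are invented for illustration):

    # Hypothetical pilot data from a previous study (invented values)
    pilot_control      <- c(20.1, 18.7, 22.3, 19.5, 21.0)
    pilot_experimental <- c(27.4, 30.2, 28.8, 26.9, 29.5)

    # Estimated difference between means (the delta input)
    delta_est <- mean(pilot_experimental) - mean(pilot_control)

    # Pooled standard deviation across the two pilot groups (the sd input)
    n1 <- length(pilot_control)
    n2 <- length(pilot_experimental)
    sd_est <- sqrt(((n1 - 1) * var(pilot_control) +
                    (n2 - 1) * var(pilot_experimental)) / (n1 + n2 - 2))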
Alternative approaches
Because it depends on so many assumptions, statistical power analysis may be more applicable after an experimental study has been carried out. The number of subjects in each group is then known, the significance level has been decided, and the standard deviation and difference between means can be calculated (although only from the subjects in the study). The power value returned will suggest whether the study is valid (having high statistical power), or whether another experiment under different conditions, perhaps with more subjects, should be considered.
Power analysis in R
R provides built-in functions to carry out power analysis. Where you are using a t-test to test significance, you can use the power.t.test function. In the example below, power has been set at 0.9, the significance level (sig.level) at 0.05, the standard deviation (sd) is assumed to be 5.0, and the difference between means (delta) is assumed to be 9.0. This example is not based on a real dataset. The number of subjects in each group (n) has not been set, so it is calculated by the power.t.test function. Because the value of n returned is not an integer, 8 is in fact the smallest number of subjects per group that would satisfy these requirements.
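Using the values just described, the call is:

    # Leave n unspecified so power.t.test solves for the group size
    power.t.test(delta = 9.0,       # assumed difference between means
                 sd = 5.0,          # assumed standard deviation
                 sig.level = 0.05,  # chosen significance level
                 power = 0.9)       # desired statistical power
    # n comes back as a non-integer between 7 and 8;
    # rounding up gives 8 subjects per group.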
If n is known, the same function can be used to find the power of a statistical experiment after it has been completed.
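For example, with the same assumed values and 8 subjects per group, leaving power unspecified makes power.t.test solve for it:

    # With n known, power.t.test returns the achieved power
    power.t.test(n = 8, delta = 9.0, sd = 5.0, sig.level = 0.05)
    # The power returned is a little above 0.9, since 8 exceeds
    # the non-integer minimum calculated earlier.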
Conclusions
Here I have referred to the simplest sort of experimental design, with two groups; many more complicated experimental designs exist. When power analysis shows no feasible approach using this sort of comparative test, studies using Bayesian inference based on observation over time might provide alternatives. And does a continuing role for small- and medium-sized datasets mean that a lot of the noise about big data is irrelevant? Certainly not. Big data is very important, but "small data" has not gone away.

This and other blog posts on data, business, science and other topics can be found at my blog pages http://paulmarrow.com/blog/
