Students in seminars and other courses often criticize a study for relying on a small sample. What does this mean? How can we tell that a sample is *too* small? And should we always try to maximize the sample size? With a power analysis you can estimate the number of observations needed to have enough statistical power to detect an effect.

In order to run a power analysis, we have to provide several pieces of information. First, we have to make an assumption about the effect between the examined variables. On the left side you can see a scatter plot with simulated data. Go on and adjust the effect of X on Y. As with the correlation between X and Y, we have to assume how strongly both variables are related: is there a weak or a strong effect between X and Y? Second, holding other aspects such as the effect size constant, we have to specify how much statistical power we want in order to detect the effect of X on Y.

This app illustrates both aspects in more detail. It shows you how a power analysis helps researchers to determine the “right” sample size, and how you can get an idea of whether or not a study has enough statistical power to detect an effect.

Imagine you are examining the mean difference between two (experimental) groups A and B, as the density plot on the left side illustrates. The effect size is determined by the *mean difference* between A and B and the *standard deviation* of both groups. The overlap between the distributions of A and B becomes smaller as the mean difference between A and B increases. However, this also depends on the standard deviation of both distributions: as the standard deviation increases, the overlap between A and B also increases. Give it a try and adjust the values for the mean difference and the standard deviation.

So, how can we measure the effect size if it depends on how we measure the variables? Cohen's δ is a very popular measure for effect sizes, and it helps to get rid of the underlying scale. To estimate Cohen's δ we essentially take the estimated difference in the means of both groups and divide it by their pooled standard deviation. Cohen also provided some guidance to assess whether the effect of X on Y is small, medium, or large:

Effect size | Cohen (1988) |
---|---|
Small effect | δ = 0.2 |
Medium effect | δ = 0.5 |
Large effect | δ = 0.8 |
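To make the calculation concrete, here is a minimal sketch in Python (purely illustrative - the app itself does not expose code): Cohen's δ as the mean difference divided by the pooled standard deviation of both groups.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled standard deviation from the unbiased sample variances of both groups
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                        / (n1 + n2 - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Tiny made-up example: both groups have SD = 2, means differ by 1
group_a = [2, 4, 6]
group_b = [1, 3, 5]
print(cohens_d(group_a, group_b))  # → 0.5, a "medium" effect by Cohen's guidance
```

Because δ is expressed in units of the pooled standard deviation, the result stays the same no matter which scale the raw variables were measured on.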

Thus, measuring the effect size is useful to compare the results of different studies, since the effect size no longer depends on the measured outcome. Moreover, how we measure effect sizes also depends on the applied method. We can estimate the effect size in terms of a correlation, as the last pane showed, but here we will apply a power analysis for two experimental groups. In order to understand what a power analysis does, it is sufficient to know that we can transfer results from one approach to another. For instance, we can convert a correlation coefficient to δ. Depending on the effect we hope to detect, we may have lower or higher statistical power to detect an effect if it really exists.
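As an illustration of such a conversion, a common textbook formula maps a correlation coefficient r to Cohen's δ via δ = 2r / √(1 − r²). A small Python sketch (again illustrative, not part of the app):

```python
from math import sqrt

def r_to_d(r):
    """Convert a correlation coefficient r into Cohen's d: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r ** 2)

# A correlation of 0.6 corresponds to a large standardized mean difference
print(round(r_to_d(0.6), 2))  # → 1.5
```

Note that the mapping is not linear: as r approaches 1, the corresponding δ grows without bound.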

The power of a statistical test is defined as 1 minus the Type II error probability. If you are not familiar with sampling, this definition may seem complicated, so an interpretation will probably help. Let's say we have a statistical power of 0.8, which means that we have a probability of 80% of correctly rejecting the null hypothesis in case there really is an effect between X and Y. In applied research you will often encounter a statistical power of 0.8; many researchers use this threshold as a convention to say that a test has sufficient statistical power.

Keeping all other aspects constant, we can learn from a power analysis that larger samples come with higher statistical power, because a larger sample size increases the precision with which we estimate X and Y. On the left side you can see the results of two power analyses, and the plot clearly shows for both cases that the power increases with larger sample sizes.
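You can reproduce this pattern outside the app, for example with Python's statsmodels package (an assumption on my side - the app itself may use different software): the power of an independent-samples t-test grows monotonically with the per-group sample size.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Power of a two-sided independent t-test (alpha = 0.05) for a medium effect (d = 0.5)
for n in (20, 50, 100, 200):
    power = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n per group = {n:>3}: power = {power:.2f}")
```

Running the loop shows power climbing from well below the conventional 0.8 threshold at n = 20 toward near certainty at n = 200.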

The left plot shows two power analyses: in the first analysis we assume a *small* effect size and in the second a *medium* effect size. Feel free to adjust the plot to see what happens if you increase (or decrease) the effect size. In accordance with the idea that the distributions of two variables have a smaller or wider overlap (depending on the effect size), you may realize that a small sample is sufficient to detect a large effect. We need approximately 200 cases in the case of a small effect, while the number of observations goes down if we expect a large effect. Thus, the number of observations needed for sufficient statistical power declines with increasing effect size.

Below you can see the output of a power analysis that calculates the number of observations needed to have sufficient statistical power to find an effect; and on the left side I made a visualization that illustrates how many observations per group (control and treatment group) we would need to have sufficient power. Feel free to adjust the effect size and the power and see how the needed sample size increases (or decreases) based on the estimation of the power analysis!

This app highlights the main idea of a power analysis, but in order to estimate the number of observations needed to detect an effect, we additionally have to specify the significance level (e.g. α = 0.05) and - in the case of a t-test with a treatment and a control group - decide which kind of hypothesis test we apply. For instance, we can assume that the expected mean difference is “greater” or “less” than zero, or “two.sided” if both directions are possible. If you provide an estimate for the effect size, the statistical power, the significance level, and a value for the hypothesis test, you can run a power analysis to determine the number of observations.
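Putting all four ingredients together, the required sample size can be solved for directly. As a sketch, here is how this looks with statsmodels in Python (the app may rely on different software under the hood):

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the observations per group, given the effect size, the desired power,
# the significance level, and the kind of hypothesis test
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8,
                                   alpha=0.05, alternative='two-sided')
print(ceil(n_per_group))  # → 64 observations per group for a medium effect
```

Rerunning this with a small effect (d = 0.2) demands several hundred observations per group, which mirrors what the plot on the left shows.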

Ultimately, there is only one question left from the start: is it wise to increase the sample size at any cost? Look at the plot on the left side. This time you can see three different power analyses for small effect sizes, and you can increase the number of observations to check how many observations are needed to get sufficient power.

None of the three groups has sufficient power (>0.8) in the case of a small sample size. However, look what happens if you increase the sample size: you may detect a significant effect even in the case of a very small effect size! Thus, keep in mind that a large sample size increases the statistical power, which implies that even a very small difference can become significant, even though such a small effect may not make any difference from a substantive point of view.

Next time you encounter a study with a very large sample size, please ask yourself whether a significant effect is also substantial (e.g. in terms of explained variance or substantial mean differences), and not just whether the reported differences are significant. Obviously, a large sample size is not everything.