When nursing education researchers conduct null hypothesis statistical tests (NHSTs) using a prespecified p-value criterion (such as .05) for statistical significance, their goal is to draw a conclusion about the relationship between the variables under study based on how well the observed (collected) data are compatible with the expected (hypothesized) data. When evaluating the statistical estimates produced from null hypothesis tests, there are only two choices researchers can make: retain or reject the null hypothesis. Although researchers strive to reach the correct conclusion, two incorrect conclusions are possible. Researchers may reject a null hypothesis that should have been retained, a Type I error, which can be thought of as a false-positive result. Researchers may also retain a null hypothesis that should have been rejected, a Type II error, which can be thought of as a false-negative result. For decades, researchers have been aware that too many studies in the nursing literature are subject to Type II errors, and although Gaskin and Happell (2014) indicate that nursing researchers have made progress in addressing the rate of Type II errors, the problem remains (Polit & Beck, 2017). This Methodology Corner installment continues the discussion on the appropriate use of p-values and effect sizes (Spurlock, 2017) by highlighting the role of power analyses in enhancing statistical conclusion validity.
Type II Errors
Let us imagine a simple scenario where a nurse educator–researcher wants to know whether nursing students taking a course online tend to perform worse or better on a psychopharmacological examination than their peers taking the course in person. Because only a small population is accessible to the researcher, after the examinations were administered, the researcher compared the scores from the 15 in-class students to the 15 online students and found students in the online section tended to perform better (M = 84.11, SD = 5.14) than their in-class counterparts (M = 80.33, SD = 5.08). Although the researcher finds the difference to be meaningful from an educational perspective, follow-up inferential statistical testing indicated that the mean examination scores were not statistically significantly different from each other, t(27.99) = 2.023, p = .053. Consequently, the researcher is uncertain as to what should be concluded, from a statistical or educational perspective, from this very small study.
Nursing researchers have been encouraged for decades (e.g., Polit & Sherman, 1990) to address the validity of study findings that use inferential, null hypothesis tests, such as t tests, in part by accumulating evidence from studies capable of correctly identifying relationships between variables when those relationships are truly present. However, research by Gaskin and Happell (2013, 2014) indicates that only a fraction of nursing studies supply evidence of having undertaken steps, namely power analysis, to help mitigate the risk of a Type II error. Failing to address threats to statistical conclusion validity can lead to faulty statistical inferences, such as in our nurse educator–researcher scenario, where the inconclusive results stem from an underpowered t test. To help see this, consider that the Cohen's d effect size in our scenario is 0.74, a substantial effect size by any account. For reasons explained below, however, the nurse educator's study was not equipped to detect statistically significant between-groups differences in the examination scores, even if they truly existed. Anytime researchers seek to improve the validity of their inferences using NHSTs that produce p-values, an a priori power analysis can help researchers manage the risks of committing a Type II error.
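To make the arithmetic behind the scenario concrete, the following minimal Python sketch recomputes the effect size and test statistic from the summary statistics reported above. It assumes that Cohen's d was calculated with the root mean square of the two standard deviations (appropriate for equal group sizes) and that the reported t test was Welch's unequal-variances test, which the fractional degrees of freedom suggest. The function names are illustrative only, and small discrepancies from the reported values are expected because the inputs are rounded means and standard deviations.

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d for two equal-sized groups (root mean square of the two SDs)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors of each mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics from the scenario: online section first, in-class section second.
d = cohens_d(84.11, 5.14, 80.33, 5.08)
t, df = welch_t(84.11, 5.14, 15, 80.33, 5.08, 15)
print(f"Cohen's d = {d:.2f}, t({df:.2f}) = {t:.2f}")
```

The sketch stops short of a p-value because the Python standard library does not include the t distribution; a statistics package would be needed for that step.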
Statistical power refers to the probability of rejecting a null hypothesis when it is indeed false; put another way, power is the ability to detect relationships between variables when those relationships truly exist. Statistical power is indexed on a scale from .00 to 1.00, with zero indicating there is no chance of rejecting a false null hypothesis and 1.00 indicating a false null hypothesis will be rejected 100% of the time it is studied (Polit & Beck, 2017). It is commonly recommended that nursing studies achieve a minimum power of .80, indicating that a false null hypothesis would be rejected in 80% of study replications (Gaskin & Happell, 2013; Polit & Beck, 2017). However, many suggest that a power level of .80 hardly seems sufficient given that it means the null hypothesis would be incorrectly retained in 20% of study replications. Indeed, in light of the replicability problems observed across various disciplines, power levels as high as .90 and .95 have also been put forth for researchers to consider (e.g., Funder et al., 2014).
Power analyses help researchers evaluate the influence a study's design conditions likely have on the validity of the statistical conclusions the researchers will make once the data are analyzed. Although multiple conditions are evaluated in a power analysis, researchers primarily look to sample size to achieve their goals for statistical power (Polit & Beck, 2017). In general, as sample size increases, statistical power also increases. In this way, researchers can use a power analysis to identify sample sizes that reduce the risk of a Type II error. The problem of small sample sizes and Type II error rates was observed in our nurse educator–researcher example, in which comparing 15 participants in each class section was insufficient to identify the robust effect size of d = 0.74 as statistically significant.
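As a rough illustration of how sample size drives power in the scenario, the sketch below approximates the power of a two-tailed, two-group comparison using the normal distribution. This is an assumption made for simplicity: dedicated software typically bases the calculation on the noncentral t distribution, so the approximation tends to run slightly high in small samples. The function name approx_power is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-tailed, two-group mean comparison."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    shift = d * sqrt(n_per_group / 2)   # expected test statistic if d is the true effect
    # Probability of landing in either rejection region when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (15, 30):
    print(f"n per group = {n}: approximate power = {approx_power(0.74, n):.2f}")
```

Under these assumptions, 15 participants per group yield power of only about one half for d = 0.74, whereas 30 per group bring it to roughly .80.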
Not only can power analysis help researchers identify an adequate sample size, it can also identify the point beyond which increasing the sample size adds only trivial amounts of power. This is important to note because larger samples often come with increased costs, not only to the researcher, such as with study budget concerns, but also to study participants, such as when there is some level of risk or burden to which study participants are exposed (Gaskin & Happell, 2013). For example, if participating in a study involves a participant contributing 1 hour of his or her time, researchers should seek to recruit only the number of participants whose data are needed so as to cause the least inconvenience to the fewest participants possible. Thus, exceedingly large sample sizes are not always better. Researchers should use power analysis to help select a target sample size that balances the methodological needs of the study against the costs to the researcher and study participants.
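The diminishing return on additional participants can be seen by tabulating power across a range of sample sizes. The sketch below again relies on a normal approximation (an assumption; exact values from the noncentral t distribution will differ slightly) and uses the scenario's effect size of d = 0.74. The function name and the grid of sample sizes are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-tailed, two-group mean comparison."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

previous = None
for n in range(10, 101, 10):
    power = approx_power(0.74, n)
    # Gain in power relative to the previous row (10 fewer participants per group).
    gain = "" if previous is None else f"(+{power - previous:.3f})"
    print(f"n per group = {n:3d}  power = {power:.3f} {gain}")
    previous = power
```

In this approximation, moving from 20 to 30 participants per group adds far more power than moving from 90 to 100, at which point power is already above .99 and each additional participant contributes almost nothing.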
Although sample size holds a preeminent role in a power analysis, additional factors must be addressed to conduct a power analysis. In particular, a power analysis would have little meaning without recognizing the role of the desired or hypothesized effect size (Gaskin & Happell, 2013). In general, as effect sizes increase, statistical power also increases (Polit & Beck, 2017), and researchers may need to consider techniques to increase an effect size where possible. For example, measurement instruments with poor reliability measure their constructs imprecisely, with a low signal-to-noise ratio. Using these imprecise measures makes it difficult for researchers to accurately estimate the size of a relationship between measured variables (Hutcheon, Chiolero, & Hanley, 2010).
Researchers must also evaluate how the choice of statistical technique influences statistical power. In general, the more parameters used in a statistical model, the larger the sample size needed to adequately power the analysis (Tabachnick & Fidell, 2013). A common example of this in nursing education research is when researchers use a large number of academic and nonacademic variables to predict students' grade point averages in a multiple regression analysis. Finally, although the risk of committing a Type I error (also known as the alpha level) is rarely set to values other than .05, it is still influential when conducting a power analysis. Although setting alpha to a lower value, such as under a Bonferroni correction when multiple NHSTs are used, may reduce the risk of a false positive (Type I error), it also reduces the statistical power due to the use of increasingly conservative statistical significance thresholds. The inverse is also true: if alpha is set to higher levels, such as .25 in exploratory regression, power increases (Hosmer & Lemeshow, 2000).
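The trade-off between alpha and power can be sketched numerically. The example below holds the scenario's effect size (d = 0.74) and a hypothetical 30 participants per group constant while varying alpha across three illustrative values: a Bonferroni-adjusted level for three tests (.05/3, a number of tests chosen only for illustration), the conventional .05, and the exploratory .25. As a simplifying assumption, power is approximated with the normal distribution rather than the noncentral t distribution.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha):
    """Normal-approximation power for a two-tailed, two-group mean comparison."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Illustrative alpha levels; the three-test Bonferroni case is hypothetical.
alpha_levels = [
    ("Bonferroni, 3 tests", 0.05 / 3),
    ("conventional", 0.05),
    ("exploratory", 0.25),
]
for label, alpha in alpha_levels:
    power = approx_power(0.74, 30, alpha)
    print(f"{label:20s} alpha = {alpha:.4f}  power = {power:.2f}")
```

With everything else fixed, the stricter threshold lowers approximate power to below .70, whereas the lenient threshold raises it above .95.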
Returning to our example scenario, imagine the researcher knew that the mean difference observed in the first comparison was meaningful from an educational perspective but, after further investigation, now suspects the study was not powered to identify a true mean difference consistent with a Cohen's d of 0.74. Indeed, a power analysis conducted in G*Power 3.1 (Faul, Erdfelder, Buchner, & Lang, 2009), a widely available and free power analysis software program, indicates that the researcher would need at least double the number of participants to minimally power the t test to .80 with a Cohen's d of 0.74 and alpha = .05. Consequently, after running a t test with a sample size in line with the recommendations of the power analysis, the nurse educator found evidence that the mean examination scores of the two groups are statistically significantly different from each other, t(57.789) = 2.455, p = .017.
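For readers who want to see where such a sample size recommendation comes from, the sketch below uses the common closed-form normal approximation for a two-tailed, two-group comparison, n per group = 2[(z for alpha + z for power) / d]^2. This is a stand-in, not the procedure G*Power applies; programs that work from the noncentral t distribution usually return a figure one or two participants larger. The function name required_n_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-tailed, two-group mean comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for target in (0.80, 0.90, 0.95):
    n = required_n_per_group(0.74, power=target)
    print(f"power = {target:.2f}: about {n} participants per group")
```

For d = 0.74 and alpha = .05, the approximation suggests about 29 participants per group for power of .80, close to double the original 15, and roughly 39 and 48 per group for power of .90 and .95, respectively.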
Power analyses have at least two salient limitations that are important for researchers to keep in mind. First, given that power is based on the notion of study replications, power analyses do not tell us whether any individual study is appropriately powered or not (Gaskin & Happell, 2013). Rather, power analysis only helps researchers to manage the risk of a Type II error in any one study. For example, varying study conditions so that a power of .95 is achieved does not tell us with certainty whether the study is among the 95% of well-powered studies or the 5% with a Type II error. Second, given that it is difficult for researchers to anticipate and include all the factors that could be influential to the statistical power of a study, there will be some disparity between the parameters used in conducting a power analysis and the actual conditions in the study under which the data were collected. For example, it is hard to predict and incorporate challenging data conditions such as outliers, unbalanced groups, low subject recruitment rates, and nonnormal study data into an a priori power analysis. Accordingly, researchers should strive to approximate the conditions of their study as best as they can but should apply the results of a power analysis to their study with some reservation.
On the basis of the information presented in this article, we make four recommendations for researchers. First, we reiterate recommendations that researchers undertake a power analysis before a study is conducted to guide decision making about target sample sizes, all to increase the validity of study findings (e.g., Gaskin & Happell, 2013). Second, although achieving higher levels of power may strain resources, we encourage researchers to consider statistical power goals higher than .80, and preferably in the range of .90 to .95. Third, because increasingly large sample sizes are not always better, researchers should use power analyses in a manner that helps to select a target sample size that balances the methodological needs of the study against the costs of increasingly large samples. Finally, although power analyses may be an effective tool for planning, given that no power analysis can perfectly correspond with the study conditions that can be achieved (e.g., sample size), power analysis must be conservatively applied, taking an influential—but not absolute—role in study design decisions and the interpretation of study findings.
Please send feedback, comments, and suggestions for future Methodology Corner topics to Darrell Spurlock, Jr., PhD, RN, NEA-BC, ANEF, at
- Faul, F., Erdfelder, E., Buchner, A. & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160. doi:10.3758/BRM.41.4.1149
- Funder, D.C., Levine, J.M., Mackie, D.M., Morf, C.C., Sansone, C., Vazire, S. & West, S.G. (2014). Improving the dependability of research in personality and social psychology: Recommendations for research and educational practice. Personality and Social Psychology Review, 18, 3–12. doi:10.1177/1088868313507536
- Gaskin, C.J. & Happell, B. (2013). Power of mental health nursing research: A statistical analysis of studies in the International Journal of Mental Health Nursing. International Journal of Mental Health Nursing, 22, 69–75. doi:10.1111/j.1447-0349.2012.00845.x
- Gaskin, C.J. & Happell, B. (2014). Power, effects, confidence, and significance: An investigation of statistical practices in nursing research. International Journal of Nursing Studies, 51, 795–806. doi:10.1016/j.ijnurstu.2013.09.014
- Hosmer, D.W. & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York, NY: John Wiley & Sons. doi:10.1002/0471722146
- Hutcheon, J.A., Chiolero, A. & Hanley, J.A. (2010). Random measurement error and regression dilution bias. BMJ, 340, 1–9. doi:10.1136/bmj.c2289
- Polit, D.F. & Beck, C.T. (2017). Nursing research: Generating and assessing evidence for nursing practice (10th ed.). Philadelphia, PA: Lippincott Williams & Wilkins.
- Polit, D.F. & Sherman, R.E. (1990). Statistical power in nursing research. Nursing Research, 39, 365–369. doi:10.1097/00006199-199011000-00010
- Spurlock, D. (2017). The purpose and power of reporting effect sizes in nursing education research. Journal of Nursing Education, 56, 645–647. doi:10.3928/01484834-20171020-02
- Tabachnick, B.G. & Fidell, L.S. (2013). Using multivariate statistics (6th ed.). Boston, MA: Pearson.