As the movement to reconsider the importance of and meaning attributed to statistical significance spreads into additional scientific fields where “p < .05” has dominated empirical inference for decades, researchers in these fields must now attend to the work of articulating the practical significance of their statistical findings, regardless of whether p values fall above or below predefined thresholds. Just as reforming how we understand and appreciate p values will take significant time and effort by journals, faculty, and those leading research and evidence-based practice (EBP) efforts in practice, so too will identifying and communicating the practical significance of findings from our data analyses. These efforts are essential to setting our field on a course where our investigations are no longer judged in a binary fashion based on a statistical parameter that has been widely misunderstood and misused for decades.
In a Methodology Corner article from over 2 years ago (Spurlock, 2017), I wrote about newly released guidelines from the American Statistical Association (2016) which focused on the proper interpretation of p values. Under the American Statistical Association's nontechnical explanation, when conducting a null hypothesis statistical test such as a t test, ANOVA, or correlation analysis, the p value represents the probability that a statistical summary of the data, such as the association between variables or the difference between group means, would be equal to or more extreme than its observed value if all the assumptions underlying the statistical model were met. That is, p values represent the compatibility between observed data—that which is being analyzed by the researcher—and data under a true null hypothesis, which always specifies the absence of a difference or association in the data.
This definition contrasts with decades of practice by nonstatisticians who often treat p values as indicators of the importance of the statistical test results rather than as probability estimates that require further elaboration by the researcher for proper interpretation. This point was emphasized recently in a letter to the editor signed by 25 statisticians and quantitative methodologists who work in schools of nursing and published simultaneously in two nursing journals (Hayat et al., 2019). Indeed, as Hayat et al. noted, part of the historical challenge in interpreting p values is in the choice of language to describe p values as significant or not when p values do not, in fact, convey information about the importance of a given test statistic. One need only open a recent issue of all but the most quantitatively focused journals from the social or clinical sciences to see how the misinterpretation of p values often unfolds. Consider this fictitious excerpt from a conference abstract detailing a study of flipped classrooms in nursing education:
When compared to randomly assigned students in the control group who received traditional lecture (n = 118), students in the experimental flipped classroom group (n = 126) scored statistically significantly higher on their comprehensive final examinations (t = 3.52, p < .001). The mean comprehensive final examination scores for the traditional lecture control group was 81.5 (SD = 6.9). Mean examination scores for the experimental flipped classroom group were higher, at 83.9 (SD = 3.2). Although more research is needed, our results should reassure faculty considering flipping their classrooms that learning outcomes can be significantly improved by adopting this increasingly popular approach to teaching and learning in classroom settings.
In the example above, we can note several things. First, the fictitious researchers did a fairly good job in their statistical reporting by including the sample size for each group, means and standard deviations for each group, and both the t statistic and p value. The reporting would have been improved by also including the actual difference in examination score points between the groups, the 95% confidence interval for the difference in mean examination scores (M difference = 2.4, 95% CI [1.03, 3.77]), and the appropriate effect size estimate, which in this case is Cohen's d, calculated as (MGroup1 – MGroup2)/SDpooled, yielding d = .39 for this fictitious example.
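As a sketch of how these supplementary quantities can be recovered from the summary statistics alone, the following stdlib-only function recomputes the mean difference, an approximate 95% confidence interval, the t statistic, and Cohen's d. The normal critical value 1.96 is used as an approximation for the t critical value (reasonable at df = 242), so the recomputed CI and d may differ slightly from the rounded figures reported in the excerpt.

```python
import math

def two_group_summary(m1, sd1, n1, m2, sd2, n2):
    """Mean difference, approximate 95% CI, t statistic, and Cohen's d
    computed from reported group summary statistics (pooled-SD t test)."""
    diff = m2 - m1
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    se = sd_pooled * math.sqrt(1 / n1 + 1 / n2)
    t = diff / se
    d = diff / sd_pooled  # Cohen's d: mean difference in pooled-SD units
    # Normal approximation (1.96) to the t critical value; df is large here
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, ci, t, d

# Fictitious flipped classroom example: lecture group, then flipped group
diff, ci, t, d = two_group_summary(81.5, 6.9, 118, 83.9, 3.2, 126)
# t reproduces the reported 3.52; the CI bounds and d shift slightly
# depending on rounding and the exact critical value used
print(f"M difference = {diff:.1f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], t = {t:.2f}, d = {d:.2f}")
```

Recomputing this way is a useful habit for research consumers: when a report omits the effect size or confidence interval, both can usually be reconstructed from the group means, standard deviations, and sample sizes.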
The next thing we note is that in the first sentence, the term statistically significant was used, whereas in the last sentence, the term statistically is missing before the word significant. This omission may seem unimportant at first, but by separating statistical from significance, the researchers have moved from a phrase used to qualitatively describe the p value associated with their test statistic to suggesting that significant improvements in learning outcomes were found. Although a mean improvement of 2.4 points on a final examination is not to be dismissed, the lower bound of the CI is 1.03 points, and the effect size is small to modest at d = .39. Cohen (1988) suggested that while even slightly smaller values of d are not trivial, it is only at d = .5 that effects become noticeable to the “naked eye” of keen observers. Given this additional information, the effectiveness of the flipped classroom intervention in the fictitious study above could not reasonably be described as producing “significant improvements” in learning outcomes. Rather, the language above is a subtle example of what Hubbard, Haig, and Parsa (2019) were describing when they wrote that, “In many fields of inquiry, like the social sciences, management sciences, and areas of biomedicine, the notions of statistical inference and scientific inference are viewed as all but synonymous” (p. 92). That is, although statistical estimates and their corresponding small p values may be described as statistically significant, researchers should exercise caution not to conflate statistical significance with practical significance, discussed in the next section.
Finally, the push to move beyond p values as solely determinative of significance has been bolstered by a recent supplemental issue of The American Statistician journal containing 44 papers on this topic, including an informative lead-in editorial by Wasserstein, Schirm, and Lazar (2019). All the papers from this special issue are permanently available online as open-access papers; readers, and especially those faculty who teach research methods or EBP at all levels of nursing education, are encouraged to use this valuable resource in implementing the necessary changes to courses where students are learning about hypothesis testing and the proper interpretation and meaning of p values.
The term practical significance is used in this article to refer to evaluations of significance or importance made based on the data analysis for a study, but with a focus separate from the obtained p value. Well over a decade into an era of EBP that, in its earliest days, some suggested might be just a passing fad, it is heartening to regularly hear faculty and students alike note that just because a study's statistical analysis produced statistically significant p values, that does not mean the findings are practically significant. Yet, the inclusion of a discussion of the practical significance of study results in published research papers remains exceedingly rare. I would suggest this is because identifying and describing practical significance is difficult in general—and especially when compared to the decades-long practice of determining importance almost exclusively by the p value produced during the course of various statistical analyses. Polit (2017) noted that despite the efforts of many over the years, attempts to define and measure clinical significance, which I characterize in this article as a specific type of practical significance, have borne little fruit. If researchers find it too difficult to explain their findings in practical terms, it seems neither wise nor reasonable to expect consumers of research—be they students, practicing nurses, nursing faculty, or national leaders in the field—to make this determination on their own.
Pogrow (2019) described how in many social and clinical science fields, practical significance is defined exclusively in terms of effect size, where larger effect sizes indicate stronger practical significance. Although effect sizes are a key consideration in making defensible inferences from quantitative findings, they were not designed to function as the sole determinants of practical significance. This is because effect sizes are expressed in technical terms unfamiliar to consumers of research, and further, most effect sizes are expressed in units with no clear real-world application, such as is the case with the correlation coefficient. Research consumers might easily memorize the rules of thumb for classifying correlation coefficients as small, medium, or large, but most would be challenged to identify precisely when the association between variables reaches practical significance. This is an inherently difficult task because the context and details of each study are different and, as such, rules of thumb for interpreting effect sizes do not help in their real-world application.
Polit (2017) discussed clinical (practical) significance in terms of two levels: the group level and the individual level. Researchers in nursing education are more likely familiar with group-level determinants of practical significance, where effect sizes and confidence intervals are prime examples. Another metric called the number needed to treat (NNT) comes from epidemiology and medicine where, based on the impact clinical interventions have on patient outcomes, the number of patients who need to be treated with an intervention before one patient benefits from the treatment can be determined. Smaller NNT values are better; an NNT of 10 means that 10 patients would need to receive the intervention for one patient to benefit. Likewise, an NNT of 2 indicates that one additional patient benefits for every two patients treated with the intervention. An exploration of the website https://www.thennt.com/, which provides a quick reference to NNTs and their evidentiary sources for many clinical treatments, is well worth the time for readers. There, NNTs derived from high-quality systematic reviews and meta-analyses are efficiently summarized, including, for example, no demonstrated mortality benefit and an NNT of 217 for avoiding a nonfatal heart attack when statins are taken for the prevention of cardiovascular disease in otherwise healthy adults (extrapolated from Chou, Dana, Blazina, Daeges, & Jeanne, 2016). The lack of mortality benefit can be compared to the parallel concept of number needed to harm (NNH), which is 21 for muscle damage and 204 for developing diabetes mellitus from statin use. So, in adults without existing cardiovascular disease, statins provide no mortality benefit and are more likely to cause muscle damage and diabetes mellitus than to prevent a nonfatal heart attack.
The intuitive appeal of a metric such as the NNT/NNH for evaluating practical significance is clear: It takes complicated findings from an exhaustive literature synthesis and turns them into terms a health care provider could discuss with their patients when considering various treatment options. Because the NNT is calculated based on the differences in the outcomes between treated and untreated patients, its applicability to other fields is equally clear. In educational research, an intervention group is often compared with a control group on measures such as retention rate, licensure examination pass rate, or completion rate. Using these data alone, easily calculated NNTs for educational interventions could help nurse educators quickly understand the possible benefits and possible harms to students for various interventions under consideration.
Polit (2017) identified several methods to examine clinical (practical) significance at the level of the individual. Focusing on practical significance at the individual level is likely to be uncharted territory for researchers in nursing education given our near-exclusive historical reliance on null hypothesis statistical testing, which, by definition, requires our unit of analysis be the study sample instead of the individual study subject. Nevertheless, as Polit pointed out in clear terms, clinicians generally decide on a course of action for individual patients, not for groups. Similarly, nursing faculty make decisions about individual students much more frequently—sometimes daily—than they do about groups of students. Yet, the evidence to which faculty look for guidance when considering instructional approaches, developing learning activities, or designing assessments is likely to present results in terms of group descriptive and inferential statistics. What is left out when the focus is solely at the group level of analysis, especially in intervention research, is that important individual impacts and changes can be overlooked when data are analyzed and discussed only in the aggregate.
Polit (2017) identified an article by Jaeschke, Singer, and Guyatt (1989) as the most influential one written to define and describe clinical significance for health research. In highlighting the need for researchers to help translate scores from measurement scales into clinically actionable terms, Jaeschke et al. described the concept of a minimal clinically important difference, which they defined as “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient's management” (p. 408). Jaeschke et al. went on to note that clinicians develop an intuitive sense of which changes in various clinical parameters are meaningful and which are not, based on their experience with the measures being used and the patients they have treated. Nurses understand, for example, that if 100 patients were given the same dose of the same diuretic drug, urinary outputs among those 100 patients in response to the drug would be highly variable. Nurses have learned this by first understanding that numerous factors affect drug absorption and metabolism and second, by their experience observing the variable response that individual patients have to medications. I would suggest that nursing education researchers, most of whom are or have been faculty members, similarly understand that not all students learn equally well in response to the same instructional strategy, require the same amount of time to master a given skill, or perform equally well on the same assessment. Thus, the evidence base to which nursing faculty look for guidance, with its near-exclusive focus on aggregate data summaries and statistics, is structured in ways that make identifying any practical significance that exists among the findings difficult at best, especially for research consumers.
One strategy for enhancing the ability of research consumers to understand the practical significance of a study's findings is for nursing education researchers to specify benchmarks, metrics, or goals with potential practical significance in the study design phase. Because few examples exist in the nursing education literature, most nursing education researchers will need to develop and propose criteria for judging the practical significance of their study findings, support and defend their choices to readers, and then ensure that practical significance is adequately considered alongside statistical significance where appropriate. And to be clear, it is not necessary that every p value produced in an analysis have a parallel practical significance metric, as some statistical analyses are not amenable to such interpretations. However, nearly all types of experimental and nonexperimental intervention research, and most observational research using regression-based procedures, are amenable to evaluation of the practical significance of the statistical findings.
To demonstrate, we can return to our fictitious flipped classroom study, described earlier in this article, which produced an average 2.4 point advantage for the flipped classroom students compared with the traditional lecture students. Although the score difference between groups was statistically significant, owing mainly to the relatively large sample size, the effect size (d) for the score differences was modest but nontrivial. What if the fictitious researchers had elected to examine and report, in addition to final examination scores, the proportion of students in both groups that fell into each letter grade category? This could expose the fact that, although students were randomly assigned to groups, seven students in the experimental flipped classroom group earned an F in the course, whereas only three students in the traditional lecture group did. Suddenly, the calculus for whether the flipped classroom was effective changes from emphasizing the small but statistically significant improvement in scores in the flipped classroom group to a more mixed picture: mean examination scores went up, indicating that at least some students benefited from the flipped classroom format, but the number of students failing the course was more than double that of the traditional lecture group. By failing to examine the distribution of students into each letter grade category for both the control and experimental groups, the fictitious researchers overlooked an important pattern in the data that, had they seen it before writing up their conference abstract, would have changed (we hope!) the overly strong endorsement they gave to the flipped classroom format. Were the fictitious researchers wrong to use comprehensive final examination scores as an outcome variable? Of course not.
But faculty rarely evaluate their courses using a single metric such as final examination scores, so from a practical perspective there was sufficient room to more closely align the study's variables of interest with those routinely used by faculty teaching in real-world settings.
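To make the mixed picture in the fictitious example concrete, the failure counts can be re-expressed as rates and a number needed to harm, mirroring the NNH concept discussed earlier. This is a hypothetical recasting of the invented data, not an analysis of any real study.

```python
def number_needed_to_harm(harm_rate_intervention, harm_rate_control):
    """NNH = 1 / absolute increase in the adverse-outcome rate."""
    risk_increase = harm_rate_intervention - harm_rate_control
    if risk_increase <= 0:
        raise ValueError("No excess harm observed for the intervention")
    return 1 / risk_increase

# Fictitious counts from the flipped classroom example
fail_flipped = 7 / 126   # about 5.6% of flipped classroom students earned an F
fail_lecture = 3 / 118   # about 2.5% of traditional lecture students earned an F
nnh = number_needed_to_harm(fail_flipped, fail_lecture)
print(round(nnh))  # 33: roughly one additional course failure per 33 students
```

Framed this way, the same data that supported a "significant improvement" claim also suggest a tangible cost per student exposed to the intervention, which is exactly the kind of trade-off a practical significance discussion should surface.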
In a recent reflection on the insights gained over the past 100 years of formal educational research, the creator of meta-analysis, Glass (2016) lamented that although his analytic creation has gained widespread adoption in medicine, epidemiology, and some corners of social science, its impact on educational research and practice has been more limited due to the inherent difficulty in conducting experimental educational research. Glass noted that meta-analysis “has not lived up to its promises to produce incontrovertible facts that would lead education policy. What it has done is demonstrate that average impacts of interventions are relatively small and the variability of impacts is great” (p. 71). To be sure, conducting research in educational settings that can stand up to the rigorous methodological and statistical requirements of meta-analysis is exceedingly difficult. Hubbard et al. (2019) provided consolation, however, in noting that “while occasionally important, overall the part played by formal statistical inference in scientific inference is relatively minor” (p. 92). Although it will take years to fully reorder our collective understanding of statistical significance, we can begin that work in earnest now by strengthening our focus on methodological and measurement rigor and by insisting that practical significance be considered a coequal partner with statistical significance when drawing inferences from the studies we conduct.
Please send feedback, comments, and suggestions for future Methodology Corner topics to Darrell Spurlock, Jr., PhD, RN, NEA-BC, ANEF, at
- Chou, R., Dana, T., Blazina, I., Daeges, M., & Jeanne, T. L. (2016). Statins for prevention of cardiovascular disease in adults: Evidence report and systematic review for the US Preventive Services Task Force. JAMA, 316, 2008–2024. https://doi.org/10.1001/jama.2015.15629
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
- Glass, G. V. (2016). One hundred years of research: Prudent aspirations. Educational Researcher, 45, 69–72. https://doi.org/10.3102/0013189X16639026
- Hayat, M. J., Staggs, V. S., Schwartz, T. A., Higgins, M., Azuero, A., Budhathoki, C., & Ye, S. (2019). Moving nursing beyond p < 0.05. Research in Nursing & Health. Advance online publication. https://doi.org/10.1002/nur.21954
- Hubbard, R., Haig, B. D., & Parsa, R. A. (2019). The limited role of formal statistical inference in scientific inference. The American Statistician, 73(Suppl. 1), 91–98. https://doi.org/10.1080/00031305.2018.1464947
- Jaeschke, R., Singer, J., & Guyatt, G. H. (1989). Measurement of health status: Ascertaining the minimal clinically important difference. Controlled Clinical Trials, 10, 407–415. https://doi.org/10.1016/0197-2456(89)90005-6
- Pogrow, S. (2019). How effect size (practical significance) misleads clinical practice: The case for switching to practical benefit to assess applied research findings. The American Statistician, 73(Suppl. 1), 223–234. https://doi.org/10.1080/00031305.2018.1549101
- Polit, D. F. (2017). Clinical significance in nursing research: A discussion and descriptive analysis. International Journal of Nursing Studies, 73, 17–23. https://doi.org/10.1016/j.ijnurstu.2017.05.002
- Spurlock, D. (2017). Beyond p < .05: Toward a Nightingalean perspective on statistical significance for nursing education researchers. Journal of Nursing Education, 56, 453–455. https://doi.org/10.3928/01484834-20170712-02
- Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(Suppl. 1), 1–19. https://doi.org/10.1080/00031305.2019.1583913