Love is the most important thing in the world, but baseball is pretty good, too.
Yogi Berra
If you were a baseball manager in charge of picking players for your team, how would you do it? Until recently, managers and scouts looked at players' physical attributes, batting averages, how they performed against various teams, and even “the look on their face.”1 Managers then combined these attributes using their experience, “gut feelings,” and “instincts” to decide which players they wanted.
When mental health professionals use similar approaches to make diagnoses or assess violence risk, they are relying on what scholars now call their “unstructured clinical judgment.” In psychiatry, for example, clinicians might draw on their experience with seeing patients, a patient's past history, current signs and symptoms of illness, and their personal understanding of the available research to judge the likelihood that some unwanted outcome (such as suicide or violence toward others) will happen in some future period.
The court system has told mental health clinicians it's their job to decide which patients are dangerous and to take steps to mitigate the danger.2 Yet, decades of research have shown that clinicians' unstructured judgments are, at best, moderately accurate in predicting who really will become violent3–7 and that as they feel more confident in their predictions, clinicians become less accurate.4
Improving on Unstructured Judgment
Can clinicians do better at judging risk? The answer is yes, and as it turns out, they can do so the same way that many modern baseball managers improve their chances of finding good players. Smart managers increasingly use sabermetrics, which is defined as “conventional statistics in carefully-chosen combinations, to calculate measures thought to more accurately gauge a player's value or relative worth.”1
In the mental health field, the analogous approach involves using an actuarial risk assessment instrument (ARAI) to gauge dangerousness to others. Rather than relying on clinical judgment, clinicians who use ARAIs evaluate risk in the same way that insurers write policies and set premiums. Insurance actuaries develop specific questions and formulae through empirical investigations of what kinds of factors correlate strongly with risk. Rate setters then feed this information into a risk-scoring algorithm to decide whether to write a policy and what to charge.8
Because ARAIs use explicit formulae to judge risk, they are more consistent than unguided human judges are and, therefore, outperform them. Actuarial tools also can be improved as investigators learn more about factors that affect risk, although the ones used most commonly in mental health risk assessment are not often revised and tend to ignore factors that affect day-to-day fluctuations in risk.7 Despite criticism that risk assessments derived from groups of people do not apply to individual people, ARAIs can help make decisions about people provided that such decisions are not misinterpreted as “predictions” about those people.9,10 ARAIs are most valid when applied to members of the specific population for which they were developed,11 because any conclusion about probabilities of future events requires awareness of the base rate of those events in the relevant population.12
Actuarial methods aren't the only way to outperform unstructured judgment. Suppose a baseball manager already has a roster with lots of hard-throwing, right-handed pitchers and realizes that the team needs a left-handed pitcher with a great curveball. The manager still should implement player-selection methods with proven reliability, but he'd also want some flexibility to place the statistical data in context, given the team's specific situation.
If the manager did this, he'd be using what the field of mental health risk assessment calls “structured professional judgment” (SPJ). When using SPJ tools, clinicians consider a list of factors with demonstrated links to violence risk. SPJ instruments vary in methods used to score and combine risk factors, and in how they lead to a global estimate of violence risk.13 Yet ultimately, clinicians who use SPJ make the final determination of risk, without referring to any explicit formula. An advantage of SPJ tools is that they often include dynamic factors (eg, current clinical status and foreseeable management problems) that can guide treatment planning4 and efforts to reduce risk.14
As is true for actuarial judgment, research suggests that SPJ outperforms unstructured judgment in gauging risk of violence.14 For both SPJ and actuarial tools, the distributions of results for nonviolent and violent people overlap substantially, which means that their users need to decide what level of apparent risk justifies a particular mitigating action (eg, implementing involuntary hospitalization).14 Ongoing investigation into whether ARAIs or SPJ measures provide greater predictive validity of violence risk suggests that they perform similarly,5,13,15 in part because the available tools share many common factors with the largest predictive value.13 Comparing the performance of SPJ and ARAIs is difficult because, if used successfully, SPJ tools should lower the rate of violence, which may affect their apparent accuracy.16
Dozens of ARAIs and SPJ tools are available,4 making it important for clinicians to match an instrument appropriately to the assessment task at hand. To achieve this, clinicians should consider the following:
- Does the person being assessed come from a population similar to the one for which the instrument was designed?
- Does the setting for which the tool was designed match the person's situation (eg, in prison)?
- What is the instrument's operationalized outcome (eg, postdischarge violence, sex offender recidivism)?
- What is the desired follow-up period?8,15
SPJ tools are often best suited to making clinical and management decisions, but they should not be the sole determinants of sentencing or discharge decisions.6 A useful role for risk assessment tools is to screen out people at lower risk of offending6 who might best be managed via diversion rather than incarceration.8
Balls, Strikes, and Accuracy
In baseball games, umpires call balls and strikes. Yet we know from motion-tracking technology (which is nearly perfect17) that human umpires make mistakes. Using the terms in Table 1, we can describe several indices of umpire accuracy.
A Matrix for Categorizing Umpire Calls
One thing we might like to know is how often the umpire calls “strike” when a pitch really is a strike. The statistic that quantifies this is called sensitivity or the true positive rate, and one computes it from a sample by dividing the number of strikes called “strike” by the total number of actual strikes, or true positives divided by the sum of true positives plus false negatives (TP/[TP+FN]) (Table 1).
Similarly, we might wonder how often the umpire calls “ball” when a pitch is not a strike. This quantity is the umpire's specificity, and it is calculated by dividing the number of balls called “ball” by the total number of actual balls (ie, true negatives divided by the sum of true negatives plus false positives [TN/(TN+FP)]) (Table 1).
Two other useful statistics are the positive predictive value (PPV) and the negative predictive value (NPV). An umpire's PPV tells you how often the video confirms a pitch as a strike if the umpire calls it a strike, and it is calculated as true positives divided by the sum of true positives plus false positives (TP/[TP+FP]) (Table 1). NPV says how often the video confirms a pitch was a ball if the umpire calls it a ball and is calculated as TN/(TN+FN) (Table 1).
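The four indices above can be computed directly from the cells of Table 1. The following is a minimal sketch, not a validated tool: the class and function names are illustrative, and the pitch counts are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    """Counts from a 2x2 table of umpire calls versus video-confirmed results."""
    tp: int  # actual strikes called "strike" (true positives)
    fp: int  # actual balls called "strike" (false positives)
    fn: int  # actual strikes called "ball" (false negatives)
    tn: int  # actual balls called "ball" (true negatives)


def accuracy_indices(c: CallCounts) -> dict:
    """Return sensitivity, specificity, PPV, and NPV as defined in the text."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # TP/(TP+FN)
        "specificity": c.tn / (c.tn + c.fp),  # TN/(TN+FP)
        "ppv": c.tp / (c.tp + c.fp),          # TP/(TP+FP)
        "npv": c.tn / (c.tn + c.fn),          # TN/(TN+FN)
    }


# Hypothetical counts for one umpire over 200 pitches (not real data).
counts = CallCounts(tp=80, fp=10, fn=20, tn=90)
indices = accuracy_indices(counts)

for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, the umpire catches 80 of 100 actual strikes (sensitivity 0.80) and 90 of 100 actual balls (specificity 0.90), whereas 80 of 90 “strike” calls are confirmed by video (PPV of about 0.89) and 90 of 110 “ball” calls are confirmed (NPV of about 0.82).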
Receiver Operating Characteristic Curves
Unlike home-plate umpires who have to call either “ball” or “strike,” judgments about risk of violence often include levels of confidence about the outcome. Yet the sensitivity-specificity framework described above assumes that an assessment can lead only to binary, yes-or-no judgments about risk.
Because of this limitation, investigators since the 1990s have used receiver operating characteristic (ROC) curves to describe the accuracy of risk assessments.3 As Figure 1 illustrates, a ROC curve plots sensitivity as a function of the false positive rate (which equals 1-specificity) to allow for the fact that “judgments about the occurrence of a future event usually fall along an implicit continuum from low to high probability.”14
Figure 1 represents the typical accuracy for a risk assessment instrument that, for illustrative purposes, we have conceived as leading to one of five risk ratings: high, medium-high, medium, medium-low, and low. Cumulatively, each liberalizing step in the decision threshold captures a higher fraction of the actually violent people, so that sensitivity increases. But this happens at a cost, which is that an increasing fraction of nonviolent people are misidentified (ie, the false positive rate rises along with the true positive rate).
A receiver operating characteristic curve for describing the performance of a risk assessment technique.
The area under the ROC curve (AUC) is a summary index of accuracy that enjoys wide use in studies of risk assessment. In this context, AUC also has a practical meaning: it equals the probability that a randomly selected, actually violent person will receive a higher risk rating than a randomly selected, nonviolent person.
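To make the ROC construction and the probabilistic meaning of AUC concrete, here is a small sketch for an instrument with the five risk ratings described above. The counts of violent and nonviolent people at each rating are hypothetical, and the function names are illustrative. The sketch assumes the usual convention that a tie (both people receiving the same rating) counts as half a "win"; under that convention the pairwise probability matches the trapezoidal area under the curve.

```python
# Ratings ordered from most to least severe.
RATINGS = ["high", "medium-high", "medium", "medium-low", "low"]

# Hypothetical counts of people at each rating (not real data).
violent = [30, 25, 20, 15, 10]     # 100 people who later were violent
nonviolent = [5, 10, 20, 25, 40]   # 100 people who were not


def roc_points(pos, neg):
    """Return (false positive rate, sensitivity) pairs, one per threshold.

    Each step lowers the threshold by one rating, so both rates can only rise.
    """
    total_pos, total_neg = sum(pos), sum(neg)
    points = [(0.0, 0.0)]
    cum_pos = cum_neg = 0
    for p, n in zip(pos, neg):
        cum_pos += p
        cum_neg += n
        points.append((cum_neg / total_neg, cum_pos / total_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(pos, neg):
    """Probability that a random violent person outranks a random nonviolent one.

    A lower list index means a higher risk rating; ties count as one half.
    """
    wins = 0.0
    for i, p in enumerate(pos):
        for j, n in enumerate(neg):
            if i < j:
                wins += p * n
            elif i == j:
                wins += 0.5 * p * n
    return wins / (sum(pos) * sum(neg))


points = roc_points(violent, nonviolent)
print(f"AUC (trapezoid): {auc_trapezoid(points):.5f}")
print(f"AUC (pairwise):  {auc_pairwise(violent, nonviolent):.5f}")
```

For these made-up counts, both calculations give an AUC of 0.76625, which illustrates why the area can be read as a probability of correct ranking rather than as a rate of correct predictions.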
Notice that AUC is an index of discrimination accuracy, not a measure of how often a risk assessment tool predicts the correct outcome. This is true because the AUC—like sensitivity and specificity—is independent of the base rate of the phenomenon in question. To know how often a violence prediction is correct, you'd need to specify which risk level (ie, decision threshold) is being used and the base rate of violence in the population under study. Other limitations of the AUC statistic are that (1) it does not provide details about the trade-off between sensitivity and specificity; (2) it does not tell users how to balance false positive and false negative errors; and (3) it does not speak to whether a clinically useful distinction can be made, given a particular discrimination capacity.7,14,18
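The dependence of predictive accuracy on base rate can be illustrated with a short calculation. In this sketch the sensitivity, specificity, and base rates are hypothetical values chosen for illustration, and the function name is not drawn from any published tool. The decision threshold is held fixed, so sensitivity and specificity stay constant while only the base rate changes.

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Return (PPV, NPV) for a fixed threshold applied at a given base rate.

    Works with population fractions rather than counts:
    tp and fn split the violent fraction; tn and fp split the nonviolent one.
    """
    tp = sensitivity * base_rate
    fn = (1 - sensitivity) * base_rate
    tn = specificity * (1 - base_rate)
    fp = (1 - specificity) * (1 - base_rate)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical tool: same threshold, three populations with different base rates.
SENSITIVITY = 0.75
SPECIFICITY = 0.75

results = {}
for base_rate in (0.05, 0.20, 0.50):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, base_rate)
    results[base_rate] = (ppv, npv)
    print(f"base rate {base_rate:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumptions, a "high risk" call is right about 75% of the time when half the population is violent but only about 14% of the time when the base rate is 5%, even though the tool's discrimination has not changed.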
Two other indices describe how risk assessment tools function in the context of civil or criminal confinements based on risk judgments.15,19 Number needed to detain (NND) is the number of people who would need to be detained to prevent one episode of violence. Number safely discharged (NSD) is the number of people who would be released for each episode of outside-the-hospital violence.19 NND and NSD assume use of a particular decision threshold and base rate, but they allow users to frame a particular decision strategy in terms of its consequences for public safety and civil liberty.6
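A brief sketch shows how NND and NSD follow from a 2x2 table once a threshold and base rate are fixed. It assumes the formulas commonly given in the risk assessment literature, NND = 1/PPV and NSD = 1/(1 − NPV) − 1; readers should confirm these against the definitions in the cited sources. The counts are hypothetical and the function names are illustrative.

```python
def number_needed_to_detain(tp, fp):
    """NND, assumed here to equal 1/PPV.

    Since PPV = TP/(TP+FP), this is (TP+FP)/TP: the number of people rated
    high risk per person among them who would actually have been violent.
    """
    return (tp + fp) / tp


def number_safely_discharged(tn, fn):
    """NSD, assumed here to equal 1/(1 - NPV) - 1.

    Since NPV = TN/(TN+FN), this simplifies to TN/FN: the number of
    nonviolent people rated low risk per violent person rated low risk.
    """
    return tn / fn


# Hypothetical cohort of 200 people (not real data):
# 100 rated high risk (25 later violent), 100 rated low risk (5 later violent).
tp, fp, fn, tn = 25, 75, 5, 95

nnd = number_needed_to_detain(tp, fp)
nsd = number_safely_discharged(tn, fn)
print(f"NND = {nnd:.1f}, NSD = {nsd:.1f}")
```

In this made-up cohort, four people rated high risk would be detained for each one who would have been violent (NND of 4), and 19 people rated low risk would be released without incident for each one who goes on to be violent (NSD of 19).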
Communication of Risk
If reading this article is one of your first experiences in learning about risk assessment, you probably found the previous sections' statistical discussion to be a bit dense and even confusing. Not surprisingly, then, judges and juries find these ideas confusing too. Recognizing this, forensic clinicians have devoted substantial attention during the past decade to discussing how best to communicate to legal decision-makers what the results of risk assessments mean.
Broadly, information about risk can be communicated by describing specific risk factors, by reporting the likelihood of the outcome in question, or by discussing interventions to reduce risks.20 We summarize these approaches and their features in Table 2. Notice that expressions of likelihood can come in three forms: categorical, probabilistic, and frequency. To return for a moment to our baseball theme, the equivalent expressions run as follows: (1) categorical: “Cincinnati has a low chance of winning the pennant,” (2) probabilistic: “Cincinnati has a 5% chance of winning the pennant,” and (3) frequency: “Five times out of 100, a team like Cincinnati's wins the pennant.”
Three Types of Expert Statements About Risk
In offering categorical statements, clinicians may end up advocating for a certain conclusion rather than interpreting the data.21 Yet, judges at civil commitment hearings prefer categorical messages and may not interpret numerical estimates correctly.20 Similarly, jurors often have difficulty understanding and processing risk prediction information rationally.9 For example, in one study of sexually violent predator (SVP) commitment hearings, more than 80% of jurors thought a 15% chance of recidivism implied the respondent was “likely” to reoffend, and more than one-half of jurors thought that even a 1% chance implied that reoffending was “likely.”22 Jurors may ignore numbers for several reasons:
- For some, the magnitude of the harm of another potential sexual offense outweighs considerations of likelihood.
- Many jurors do not understand probability well enough to evaluate the evidence presented, or may not consider a 10% chance of recidivism to differ importantly from a 30% chance.
- Motivated reasoning may lead some jurors to use the evidence they heard to justify whatever conclusion they thought was correct.9,21,22
In deciding whether an SVP respondent is likely to reoffend, jurors also find clinical testimony and nonstatistical evidence more persuasive than information derived from scientifically based risk assessments.23,24 These kinds of findings have led to several recommendations for experts who present information on risk.
- Experts can provide detail about what concrete information contributed to the scores generated by a risk measure.24
- Experts can try to address jurors' previous beliefs about sexual offenders and explain the basic statistical concepts underlying actuarial assessment.21
- Because jurors have heightened skepticism of recidivism testimony from both sides when the defense presents expert testimony, experts need to explain why they use ARAIs and why clinical judgment may be misleading.23
Challenges to Risk Testimony
Opposing attorneys often challenge mental health experts' testimony on risk, sometimes questioning whether such testimony is even admissible as evidence. Although each jurisdiction has its own rules governing admissibility of evidence, states tend to use one of two main approaches. In New York, California, Pennsylvania, and other states that apply the Frye standard,25 evidence is admissible if its basis is “generally accepted” in the expert's field. In federal courts and states (eg, Ohio, Virginia, and Michigan) that apply the Daubert standard,26 judges are supposed to weigh factors that bear on whether proffered testimony conforms to appropriate standards of scientific validity and methodological rigor.27
Although challenges to admissibility of risk testimony are unusual, structured risk assessment tools appear to be challenged more often than unstructured judgment,11,27 despite the lower validity of the unstructured approach. Nonetheless, clinicians who plan to provide expert testimony on risk should be cognizant of these potential issues:
- Unstructured judgment may be most vulnerable to admissibility challenges when it conflicts with testimony based on a structured approach.11 In unstructured assessments, experts may focus on risk factors with little actual relationship to risk, or they may use valid risk factors but give undue weight to some. If experts assign high significance to particular risk factors, they should be ready to explain why they did so and what research supported this.
- Revising the results of structured assessment tools with one's own clinical opinion may decrease the tool's predictive validity.11 Although developers of many risk assessment instruments recommend consideration of obviously relevant factors that do not appear in the instruments themselves, experts should be prepared to explain what information risk assessment tools yield on their own and how any additional findings complement that information.
- Using multiple instruments that measure the same risk construct probably will not increase validity,11 in part because of correlations in various tools' variables. A sounder approach is to select a particular tool based on its relevance to the evaluee's circumstances and appropriateness to the judgment task.
- Experts should avoid applying tools to people from nonstudied populations. For example, the Psychopathy Checklist-Revised (PCL-R) has some value as a risk measure, but it may be less useful in evaluating people who differ substantially from the European-American adult men with whom the PCL-R was primarily developed.28
Experts should also know that replication studies of risk assessment instruments often yield results that are poorer than those published by the instruments' original developers.29 Additionally, experts should be aware that “practitioners simply cannot assume manual-based probabilistic estimates of recidivism risk to be accurate” and that “even when replication studies match the sample and design characteristics of normative investigations closely and use manual-based protocols exactly, group-based recidivism rates still do not hold.”30
A final caution derives from the “allegiance effect.” Just as baseball umpires tend to make calls that favor the home team,31 mental health professionals form opinions biased toward the side that hires them. In a simulation study that yielded a striking demonstration of the allegiance effect, Murrie et al.32 found that despite using common, manualized actuarial tools, experts assigned lower risk scores if hired by the defense and higher risk scores if hired by the prosecution. Experts should recognize their susceptibility to the allegiance effect and do their best to provide balanced, objective opinions.
References
1. Beneventano P, Berger PD, Weinberg BD. Predicting run production and run prevention in baseball: the impact of sabermetrics. Int J Bus Humanit Technol. 2012;2(4):67–75.
2. Monahan J. Tarasoff at thirty: how developments in science and policy shape the common law. U Cin L Rev. 2006;75(2):497–522.
3. Mossman D. Assessing predictions of violence: being accurate about accuracy. J Consult Clin Psychol. 1994;62(4):783–792. doi:10.1037/0022-006X.62.4.783
4. Singh J. Violence risk assessment: what behavioral healthcare professionals should know. Rev Fac Med. 2015;63(3):355–356. doi:10.15446/revfacmed.v63n3.50292
5. Singh J, Grann M, Fazel S. A comparative study of violence risk assessment tools: a systematic review and metaregression analysis of 68 studies involving 25,980 participants. Clin Psychol Rev. 2011;31:499–513. doi:10.1016/j.cpr.2010.11.009
6. Fazel S, Singh J, Doll H, Grann M. Use of risk assessment instruments to predict violence and antisocial behaviour in 73 samples involving 24,827 people: systematic review and meta-analysis. BMJ. 2012;345:e4692. doi:10.1136/bmj.e4692
7. Norko M, Baranoski M. The prediction of violence; detection of dangerousness. Brief Treat Crisis Intervention. 2008;8(1):73–91. doi:10.1093/brief-treatment/mhm025
8. Desmarais S, Johnson K, Singh J. Performance of recidivism risk assessment instruments in U.S. correctional settings. Psychol Serv. 2016;13(3):206–222. doi:10.1037/ser0000075
9. Scurich N, Monahan J, John RS. Innumeracy and unpacking: bridging the nomothetic/idiographic divide in violence risk assessment. Law Hum Behav. 2012;36(6):548–554. doi:10.1037/h0093994
10. Mossman D. From group data to useful probabilities: the relevance of actuarial risk assessment in individual instances. J Am Acad Psychiatry Law. 2015;43:93–102.
11. Krauss D, Scurich N. Risk assessment in the law: legal admissibility, scientific validity, and some disparities between research and practice. Behav Sci Law. 2013;31:215–229. doi:10.1002/bsl.2065
12. Singh JP, Fazel S, Gueorguieva R, Buchanan A. Rates of violence in patients classified as high risk by structured risk assessment instruments. Br J Psychiatry. 2014;204(3):180–187. doi:10.1192/bjp.bp.113.131938
13. Skeem J, Monahan J. Current directions in violence risk assessment. Curr Direct Psychol Science. 2011;20(1):38–42. doi:10.1177/0963721410397271
14. Mossman D. Evaluating risk assessments using receiver operating characteristic analysis: rationale, advantages, insights, and limitations. Behav Sci Law. 2013;31:23–39. doi:10.1002/bsl.2050
15. Singh J, Fazel S. Forensic risk assessment: a metareview. Crim Just Behav. 2010;37(9):965–988. doi:10.1177/0093854810374274
16. Ho H, Thomson L, Darjee R. Violence risk assessment: the use of the PCL-SV, HCR-20, and VRAG to predict violence in mentally disordered offenders discharged from a medium secure unit in Scotland. J Forensic Psychiatry Psychol. 2009;20(4):523–541. doi:10.1080/14789940802638358
17. Lindbergh B. Rise of the machines? http://grantland.com/features/ben-lindbergh-possibility-machines-replacing-umpires/. Accessed August 1, 2017.
18. Bradley A. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997;30(7):1145–1159. doi:10.1016/S0031-3203(96)00142-2
19. Singh J. Predictive validity performance indicators in violence risk assessment: a methodological primer. Behav Sci Law. 2013;31:8–22. doi:10.1002/bsl.2052
20. Evans S, Salekin K. Involuntary civil commitment: communicating with the court regarding “danger to other.” Law Hum Behav. 2014;38(4):325–336. doi:10.1037/lhb0000068
21. Varela JG, Boccaccini MT, Cuervo VA, Murrie DC, Clark JW. Same score, different message: perceptions of offender risk depend on Static-99R risk communication format. Law Hum Behav. 2014;38(5):418–427. doi:10.1037/lhb0000073
22. Knighton J, Murrie D, Boccaccini M, Turner D. How likely is “likely to reoffend” in sex offender civil commitment trials? Law Hum Behav. 2014;38(3):293–304. doi:10.1037/lhb0000079
23. Boccaccini M, Murrie D, Turner D. Jurors' views on the value and objectivity of mental health experts testifying in sexually violent predator trials. Behav Sci Law. 2014;32:483–495. doi:10.1002/bsl.2129
24. Turner DB, Boccaccini MT, Murrie DC, Harris PB. Jurors report that risk measure scores matter in sexually violent predator trials, but that other factors matter more. Behav Sci Law. 2015;33(1):56–73. doi:10.1002/bsl.2154
25. Frye v United States, 292 F 1013 (DC Cir 1923).
26. Daubert v Merrell Dow Pharms., Inc., 509 US 579 (1993).
27. Monahan J. Violence risk assessment: scientific validity and evidentiary admissibility. Wash & Lee L Rev. 2000;57:901–918.
28. Walsh T, Walsh Z. The evidentiary introduction of Psychopathy Checklist-Revised assessed psychopathy in U.S. courts: extent and appropriateness. Law Hum Behav. 2006;30(4):493–507. doi:10.1007/s10979-006-9042-z
29. Singh JP, Grann M, Fazel S. Authorship bias in violence risk assessment? A systematic review and meta-analysis. PLoS One. 2013;8(9):e72484. doi:10.1371/journal.pone.0072484
30. Singh J. Five opportunities for innovation in violence risk assessment research. J Threat Assess Management. 2014;1(3):179–184. doi:10.1037/tam0000018
31. Moskowitz TJ, Wertheim LJ. Scorecasting: The Hidden Influences Behind How Sports Are Played and Games Are Won. New York, NY: Three Rivers Press; 2011.
32. Murrie D, Boccaccini M, Guarnera L, Rufino K. Are forensic experts biased by the side that retained them? Psychol Sci. 2013;24(10):1889–1897. doi:10.1177/0956797613481812
A Matrix for Categorizing Umpire Calls
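| | Video-Confirmed Result (ie, “the Truth”): Strike | Video-Confirmed Result (ie, “the Truth”): Ball |
|---|---|---|
| Umpire calls “strike” | True positive (TP) | False positive (FP) |
| Umpire calls “ball” | False negative (FN) | True negative (TN) |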
Three Types of Expert Statements About Risk
| | Expert's Purpose or Goal | | |
|---|---|---|---|
| Description of risk | Describe general and case-specific risk and protective factors | | |
| Likelihood of outcome | Explain likelihood of a violent act | Yes: may be a category, probability, or frequency | |
| | Focus on reducing risk | Sometimes; similar to likelihood model | Yes: treatment options, other interventions |