Simulation has expanded rapidly in nursing education and is well integrated in prelicensure and graduate nursing programs. With simulation, students have a safe environment to practice assessment, patient care, communication, teamwork, decision making, and clinical skills. Simulation can be used for formative assessment, which is meant to improve learning and performance and to provide feedback to individual students and the team. This type of assessment is integral to effective teaching, with facilitators observing performance, communication, and other behaviors and providing specific information for further learning and development (Sando et al., 2013).
In contrast to formative assessment, summative simulation-based assessment is intended to determine students' competence in practice. Summative assessments need to be carefully designed to ensure they are valid, measuring the knowledge, skills, and attitudes they are intended to measure, and reliable, reproducing the same results by different evaluators (interrater reliability) and by the same evaluator at another time (intrarater reliability) (O'Leary, 2015). Because summative assessments can have high-stakes consequences (i.e., students may need to pass the assessment to pass the course or even to graduate from the nursing program), ensuring the validity and reliability of summative assessments is critical prior to their use with nursing students.
The purpose of this article is to provide guidelines for developing summative simulation-based assessments for use at the end of a nursing course or program. These guidelines are based on the literature and the authors' experiences with a study that examined the feasibility of using simulation for high-stakes testing in nursing (Rizzolo, Kardong-Edgren, Oermann, & Jeffries, 2015). In many nursing programs, students take standardized tests throughout their program to measure their learning and prior to graduation to evaluate their readiness for the National Council Licensure Examination for Registered Nurses (NCLEX-RN®). However, such standardized tests do not provide a measure of students' clinical performance or their readiness for practice. Simulation-based assessments would provide a means of determining students' clinical competence at the end of a course or the nursing program.
Summative Simulation-Based Assessment
High-stakes simulations, in the form of objective structured clinical examinations (OSCEs), have been used in medicine since 1999, when the U.S. Medical Licensing Examination instituted computer-based simulations as part of the Step 3 examination (Boulet, 2008). In an OSCE, students rotate through 8 to 12 stations, each lasting 3 to 10 minutes. Stations are focused on determining the competency of students in various clinical skills such as taking a health history, conducting an examination of a patient, and performing procedures. Some of these assessments involve standardized patients, who are trained to portray a patient with a specific condition. Standardized patients and OSCEs are an accepted method for summative assessment of students' knowledge and skills (Ha, 2016; McWilliam & Botwinski, 2010; Meskell et al., 2015; Selim, Ramadan, El-Gueneidy, & Gaafer, 2012).
In some nursing programs, faculty have developed simulations, using both low- and high-fidelity manikins, for summative assessment in nursing courses. Wolf et al. (2011) described how they integrated assessments in their nursing program. In the capstone course, during the final semester of the program, the simulation builds on the thinking and clinical skills mastered in previous semesters and focuses on prioritization. This is a high-stakes assessment in which students need to demonstrate competency to pass the course.
There are fewer descriptions in the nursing literature of using simulation for summative assessment of end-of-program competencies. Bensfield, Olech, and Horsley (2012) used simulation-based assessment of senior baccalaureate nursing students to determine whether they achieved program competencies and to guide remediation. The objectives for the evaluation were derived from the program outcomes and four quality and safety competencies (patient-centered care, teamwork and collaboration, evidence-based practice, and safety). Nursing faculty developed an evaluation tool that included six behaviors for evaluators to observe during the simulation. If students were not competent in any of the first five behaviors on the tool, they were required to participate in remediation. Students were evaluated by three faculty members who came to consensus about whether students met the objectives and passed. Nearly 25% of the students did not pass the summative assessment and had to participate in remediation. The process, however, revealed gaps in the curriculum as well as the need for students to be comfortable with simulation for instructional purposes before it is used for high-stakes assessment (Bensfield et al., 2012).
Key Steps and Considerations
Although many nurse educators are skilled in developing and implementing simulation for formative assessment and in giving feedback to students to improve their learning and performance, few are prepared to use simulation for summative assessment. Summative simulation-based assessment is intended to determine students' competence in performance and is not intended to give feedback to students. When the goal of the assessment is to determine students' competence, especially when there are high-stakes consequences, critical steps must be followed in designing and validating the simulation and preparing evaluators (Table).
Table. Guidelines for Summative Simulation-Based Assessments
Defining the Objectives and Knowledge and Skills to Be Assessed
Similar to any assessment, the first step is specifying the objectives and the knowledge and specific skills to be assessed. This is critical because the objectives guide decisions about the simulation, scenario, and tools for evaluation. Some objectives, such as those at the end of a nursing program, are likely to be assessed using simulation with high-fidelity patient simulators, because complex scenarios can be developed that provide an opportunity to evaluate assessment, patient care, communication, teamwork, clinical reasoning, and other higher level outcomes. Other objectives, such as procedures and tasks, might be evaluated with task trainers. Considering the time required for faculty to develop and pilot test a high-stakes simulation-based assessment, the complexity of the scenario should match the aims of the evaluation.
Designing Appropriate Simulations
For end-of-program assessment, it is likely that multiple scenarios will be needed for students to demonstrate their knowledge and competence considering the broad scope of most program objectives. Multiple scenarios allow nurse educators to include more and different types of content to assess students' knowledge and skills and to determine whether students perform similarly across scenarios. Using multiple scenarios also addresses the variability reported in scores on individual scenarios. Murray, Boulet, Kras, McAllister, and Cox (2005) and Murray et al. (2007) reported that scores vary more across scenarios than they do between raters and learners.
In developing scenarios for summative assessment, faculty need to ensure that they reflect the objectives to be evaluated, require use of the intended knowledge and skills, and are appropriate for the students' level. At the end of a nursing program, the students' performance should reflect the expected competencies of new graduates. Simulations for summative assessment can be short, approximately 3 to 5 minutes, such as when assessing procedures and technical skills, or long, approximately 30 minutes, when assessing behaviors and nontechnical skills such as communication and teamwork (Mudumbai, Gaba, Boulet, Howard, & Davies, 2012). Longer scenarios allow the faculty to create a clinical situation that is complex enough to evaluate students' higher level skills, such as managing a change in a patient's condition and setting priorities. As students develop their knowledge and skills, it is expected that their performance in the simulation would differ from that of less experienced students. Experts should review the scenarios to verify that they are appropriate for the objectives for which they were developed, that students will have to use their knowledge and skills in the simulation, and that the scenarios represent end-of-program performance.
In a study on the feasibility of using simulation for end-of-program assessment, sample program outcomes from diverse prelicensure programs across the United States were reviewed to identify objectives well-suited to assessment via a simulation (Rizzolo et al., 2015). Four areas of program objectives emerged: students' (a) competence in assessment and intervention, (b) clinical judgment, (c) ability to provide quality and safe care, and (d) skills in teamwork and collaboration. Simulation experts developed scenarios on complex clinical situations in each of those areas. At the end of the scenarios, students reported to the facilitator, who asked a series of questions about the simulated patient if the students neglected to include that information in their handoff. The handoff and facilitator questions provided data on students' clinical judgment and underlying thought processes.
The simulations were piloted in 10 schools of nursing in different regions of the United States (five baccalaureate, four associate degree, and one diploma program) and were video recorded using the Laerdal AVS system (Rizzolo et al., 2015). In this pilot test, facilitators and students evaluated the scenarios to verify they reflected the objectives and the knowledge and skills expected of students at the end of their nursing program, were realistic, and were consistent with current nursing practices. Based on this pilot testing, the scenarios were revised, and the way the recordings were made was changed to ensure better video and audio quality.
Selecting or Developing Assessment Tools
For summative assessment, it is critical that the tools used for rating performance in the simulation are appropriate, valid, and reliable. Tools can provide for two types of ratings: analytic and holistic (or global) (Boulet, 2008; Oermann & Gaberson, 2014). With analytic scoring, the evaluator observes individual components of the performance and scores each separately, summing these individual scores for a total score. Analytic ratings are usually completed using skills checklists (Boulet, 2008). Similar to clinical evaluation tools, some of the items on the checklist can be designated as critical behaviors that have to be met to pass the assessment; items on the checklist also can be weighted based on their importance in performing the skill. Although checklists are useful for skills assessments in nursing, Kardong-Edgren and Mulcock (2016) noted that there rarely is a predetermined rationale, other than tradition, for setting the passing score on a checklist. However, for summative assessment, this score is critical to avoid setting the score too low or high.
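To make the mechanics of analytic scoring concrete, the sketch below scores a weighted checklist with designated critical behaviors, as described above. The item names, weights, and passing score are hypothetical and are not drawn from any published tool.

```python
# Hypothetical sketch of analytic (checklist) scoring with weighted items
# and designated critical behaviors. All items and weights are invented.

def score_checklist(observed, items, pass_mark):
    """Sum weighted scores for observed behaviors; fail outright if any
    critical behavior was missed, regardless of the total score."""
    total = sum(item["weight"] for item in items if item["name"] in observed)
    missed_critical = any(
        item["critical"] and item["name"] not in observed for item in items
    )
    passed = (not missed_critical) and total >= pass_mark
    return total, passed

items = [
    {"name": "verifies patient identity", "weight": 2, "critical": True},
    {"name": "performs hand hygiene",     "weight": 1, "critical": True},
    {"name": "explains procedure",        "weight": 1, "critical": False},
    {"name": "documents findings",        "weight": 1, "critical": False},
]

observed = {"verifies patient identity", "performs hand hygiene",
            "documents findings"}
total, passed = score_checklist(observed, items, pass_mark=4)
```

Designating critical behaviors in this way means a student cannot pass by accumulating points on lower-priority items while omitting an essential safety step, which mirrors how critical behaviors function on clinical evaluation tools.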
With holistic scoring, the performance is assessed globally. For this type of summative assessment, when the aim is to evaluate complex and multidimensional behaviors rather than procedures and skills, rating scales are most appropriate (Boulet, 2008; Boulet & Murray, 2010). Because varied types of rating scales are used for clinical evaluation in nursing, nurse educators are familiar with these tools. In a study by the current authors, evaluators assessed students' performance using the Creighton Competency Evaluation Instrument (CCEI) (Hayden, Keegan, Kardong-Edgren, & Smiley, 2014; Hayden, Smiley, Alexander, Kardong-Edgren, & Jeffries, 2014) combined with a list of actions to be taken in the scenario. The CCEI, a modification of the Creighton Simulation Evaluation Instrument (Todd, Manz, Hawkins, Parsons, & Hercinger, 2008), includes 23 behaviors in four categories: assessment, communication, clinical judgment, and patient safety. Evaluators rate competence in each of these behaviors. For some end-of-program assessments, it is likely that multiple tools will be used, such as a rating scale that provides for a global rating of performance combined with checklists or lists of key actions that should be performed in the scenario, as well as other tools depending on the outcomes assessed.
Ensuring the Validity of the Assessment
The validity of a simulation-based assessment relates to the degree to which it measures what it is intended to measure (Boulet & Murray, 2010; O'Leary, 2015; Scalese & Issenberg, 2008). With validity, the emphasis is on the consequences of the assessment: Can accurate interpretations be made about the students' competence based on their performance in the simulation? To support the content validity of the assessment, the scenarios should be realistic, reflect the actual practices of a nurse in that situation, and be evidence based. In the current study, 10 experts were asked to confirm that the scenarios reflected the objectives, required use of the intended knowledge and skills, and were appropriate for assessment at the end of a nursing program. Experts also verified that the list of actions to be taken in the scenario was evidence based and not unique to a particular clinical setting.
As part of ensuring the validity of a simulation, studies can be conducted to correlate the knowledge and skills assessed in the simulation to other evaluations, such as clinical evaluations and test scores. Another strategy is to pilot the simulation-based assessment with varying levels of students in the school of nursing. One would expect individuals with more expertise to perform better in the simulation than those with less expertise (Girzadas, Clay, Caris, Rzechula, & Harwood, 2007; Murray et al., 2007).
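A correlational check of this kind can be sketched as follows. The paired simulation and examination scores are invented for illustration; in practice a statistical package would be used and the sample would be far larger.

```python
# Minimal sketch of one validity check: correlating students' simulation
# scores with another measure (e.g., course examination scores).
# All score values are invented for illustration.

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

sim_scores  = [78, 85, 62, 90, 70, 88]   # simulation-based assessment
exam_scores = [74, 80, 65, 92, 68, 85]   # course examination
r = pearson_r(sim_scores, exam_scores)
```

A strong positive correlation between the two measures would support the interpretation that the simulation is capturing the same underlying competence, whereas a weak correlation would prompt a closer look at what the simulation is actually measuring.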
Ensuring the Reliability of Ratings
For high-stakes assessment, the scores must be reliable. This means that the same scores will be reproduced by different evaluators and that if they rescored a simulation at a later time, their scores would be the same. Consistency across evaluators is easier to achieve using checklists and lists of key actions than a rating scale because they are more specific and focused. With rating scales, whether used in a simulation-based assessment or for clinical evaluation, more potential chances for error exist. Behaviors are broader (e.g., communication and leadership), allowing for varied interpretations and judgments as to the quality of the performance.
To improve reliability with these assessments, several important steps should be taken by schools of nursing. First, evaluators need to be trained. Second, more than one evaluator should be used. In an observation, evaluators may focus on different aspects of the performance (Oermann & Gaberson, 2014). Judging the quality of the performance and deciding whether students are competent adds another level of decision open to individual judgment. Using multiple raters not only takes into account these potential differences in interpretations, it also provides an opportunity to combine ratings. Third, the evaluators should not know the students whose performances are being rated. It is possible that evaluators who taught the students in the laboratory or clinical practice may be biased in their evaluations. Stroud, Herold, Tomlinson, and Cavalcanti (2011) examined the OSCE scores for 158 medical residents on the basis of whether examiners were familiar with the residents from previous clinical experiences. Knowing the students and having a positive impression of their performance were associated with a significant increase in ratings on the OSCE.
Finally, the ratings should be performed independently to avoid evaluators influencing each other's scores. Video recordings of the simulations allow multiple evaluators to rate performance independently. In one study, medical students' communication skills and ability to obtain informed consent were video recorded in an OSCE and evaluated independently by two raters (Kiehl et al., 2014). The scores indicated strong interrater agreement, and the researchers concluded that video recordings of performance were effective for use in high-stakes assessments.
No school of nursing should attempt to conduct summative simulation-based assessments without training evaluators. The majority of errors in rating performance are due to the evaluators, not to the tool used for the evaluation (Pangaro & Holmboe, 2008). Multiple errors can occur when rating performance in both clinical practice and simulation-based assessments. One set of errors occurs when evaluators restrict the range of their ratings. They may use only the mid-portion of the rating scale (central tendency error), tend to rate students toward the high end of the scale (leniency error), or tend to rate students toward the low end (severity error). In rating performance, there also may be a halo effect in which the evaluators let an overall impression of the students influence the ratings; when conducting summative assessments for high stakes, evaluators should not know the students being rated. A logical error can occur when similar ratings are given for items on the scale that are logically related to one another; for example, if there are multiple behaviors on clinical judgment, these may receive the same rating even though the evaluators did not observe performance of each one (Oermann & Gaberson, 2014).
The quality of the ratings in a simulation-based assessment depends on the evaluators and the degree to which they accurately observe and evaluate the performance or skill (Feldman, Lazzara, Vanderbilt, & DiazGranados, 2012; Pangaro & Holmboe, 2008). Training should be conducted during multiple sessions in which evaluators have an opportunity to rate varied scenarios using the same tools they will use in the summative assessment, discuss their ratings and rationale with each other, and come to agreement regarding observed behaviors and competence. Although the actual assessment of students' performance is performed independently by raters, the training should be completed by the evaluators as a team.
In the current authors' study of end-of-program assessment, a rater training program consistent with others reported in the literature was implemented (Eppich et al., 2015; Feldman et al., 2012). The goals of rater training were to ensure consistency across evaluators and by evaluators themselves in their observations of performance and ratings of those behaviors and to reach agreement about whether those observations represented competent performance of nursing students at the end of their program. The training sessions were facilitated by experts in simulation, debriefing, and assessment. The evaluators were experienced nursing faculty from schools of nursing across the United States who had expertise in simulation and had completed a course in evaluation or assessment.
Ten evaluators were trained in multiple sessions that were both face-to-face and Web based. In the first training session, which was face-to-face, evaluators selected one of the four areas of program objectives: assessment and intervention, clinical judgment, quality and safety, or teamwork and collaboration. For training purposes, video recordings were used from the pilot study, which included examples of good, mediocre, and poor performance in each of those areas. Evaluators worked as a two- or three-person team to rate the performance of students in the videos in the area they selected. The goal of this first training session was for the two or three raters to come to consensus about the performance required by students to be scored as competent. Evaluators first scored the video recordings alone, and then they discussed their observations of behaviors as a team, with the goal of coming to agreement on behaviors that reflected the competence of the students in the scenarios. The first training session also included education about common rater errors and how to avoid such errors.
However, this training session revealed some of the issues encountered in observing and rating students' clinical performance: the assessment of performance is not always clear-cut, and nurse educators may have different standards for determining students' competence on the basis of that performance (Oermann & Gaberson, 2014). Many of the teams were not able to come to agreement on their assessment of the students' performance in the scenario. Although the session consisted of 16 hours over 3 days, it clearly was not enough, as evidenced later in the study.
Following this face-to-face training session, each evaluator was e-mailed two of the video recordings to rescore each month for 4 months. This training provided practice for the evaluators in viewing and independently rating the same videos from the first training session, similar to the process used in the study by Eppich et al. (2015). Intrarater agreement on these repeated scorings of the videos ranged from 70% to 97%.
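Intrarater percent agreement of the kind reported above can be computed as the proportion of items an evaluator scored identically on two viewings of the same recording. The 0/1 competency ratings below are illustrative, not study data.

```python
# Sketch of intrarater percent agreement: the share of checklist items an
# evaluator scored the same way on two viewings of one video recording.
# The ratings below are invented for illustration.

def percent_agreement(first_scoring, second_scoring):
    """Percentage of items rated identically on both occasions."""
    matches = sum(a == b for a, b in zip(first_scoring, second_scoring))
    return 100 * matches / len(first_scoring)

# One evaluator's 0/1 competency ratings on 10 items, one month apart
first  = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
second = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
pct = percent_agreement(first, second)  # 8 of 10 items match
```

Percent agreement is easy to compute and interpret, although, unlike chance-corrected statistics, it does not account for agreement that would occur by chance alone.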
Next, a face-to-face session was held to score a new set of video recordings. Evaluators met first with their team to review and come to agreement on the criteria for scoring the items on the CCEI. Intraclass correlation coefficients (ICCs) were calculated as measures of interrater reliability (Eliasziw, Young, Woodbury, & Fryday-Field, 1994; Hallgren, 2012). The ICCs ranged from 0.06 to 0.33 across the four areas (assessment and intervention, clinical judgment, quality and safety, and teamwork and collaboration), with an overall ICC of 0.44, indicating fair interrater reliability. Cicchetti (1994) provided commonly cited cutoffs for qualitative ratings of agreement based on ICC values: interrater reliability is poor for ICC estimates less than 0.40, fair for values between 0.40 and 0.59, good for values between 0.60 and 0.74, and excellent for values between 0.75 and 1.0. The ratings also revealed inconsistent use of the CCEI. As a result, the authors of the scenarios were asked to identify criteria for rating the performance and determining whether students were competent. The evaluators then used those criteria for scoring the video recordings.
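As a minimal sketch, a one-way random-effects ICC(1,1) and the Cicchetti (1994) qualitative categories can be computed as follows. The study does not specify which ICC model was used, so the model choice here is an assumption, and the subjects-by-raters score table is invented.

```python
# Sketch of a one-way random-effects ICC(1,1) for a table of scores in
# which rows are video recordings (subjects) and columns are raters.
# The model choice and all score values are assumptions for illustration.

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) computed from mean squares."""
    n = len(scores)        # number of subjects (recordings)
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # between subjects
    msw = sum((x - m) ** 2
              for row, m in zip(scores, row_means)
              for x in row) / (n * (k - 1))                       # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

def cicchetti_label(icc):
    """Cicchetti's (1994) qualitative interpretation of an ICC estimate."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"

# Four recordings, each rated by three raters on a 1-to-5 global scale
scores = [
    [3, 3, 3],
    [5, 5, 4],
    [2, 2, 2],
    [4, 4, 5],
]
icc = icc_oneway(scores)
label = cicchetti_label(icc)
```

With ratings this consistent across the three raters, the ICC falls in Cicchetti's excellent range; the fair (0.44) and good (0.65) values reported in the study would map to their respective categories in the same way.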
In the final phase of the study, the videos were e-mailed every 2 to 3 weeks for evaluators to score independently, and the evaluators were directed not to discuss their ratings. Evaluators then met face-to-face to rescore these same videos. With these multiple training sessions, combined with practice in scoring videos in between the face-to-face sessions, the overall ICC was 0.65, indicating good interrater reliability. In a high-stakes assessment, the evaluators need to accurately observe and rate performance (Feldman et al., 2012; Pangaro & Holmboe, 2008). Ratings should be consistent across evaluators. The study demonstrated the need for extensive training of evaluators to achieve accurate and consistent ratings of performance in a simulation, which is critical for use of simulation-based assessments in high-stakes decisions.