Much has been written about the need for more authentic testing, especially in high stakes examinations (Baxter & Glaser, 1998; Bennett et al., 1999; Huff & Sireci, 2001; O’Leary, 2002; Parshall & Balizet, 2001). It also seems clear that it may be difficult for one test to fully assess the range of desirable behaviors in a complex domain such as that covered by a professional licensure examination (Martinez, 1999). Yet it is incumbent on organizations that license or certify professionals to strive toward a comprehensive and valid assessment of the competencies of practitioners. With advances in computer technology, it is becoming easier to move beyond multiple-choice items and to investigate the use of alternate item formats for high stakes testing. Alternate format items may allow examinees to demonstrate competencies in a different manner while also enabling more authentic assessment of additional knowledge, skills, and abilities. The ability to demonstrate competency more accurately is especially significant in nursing education, which requires its graduates to demonstrate critical thinking and problem solving skills.
Many test developers suggest that multiple-choice items can be used to evaluate critical thinking, as long as the items are focused on measuring higher order thinking ability. McDonald (2002) proposed that such assessment consists of the ability to use item information in a unique situation—moving away from recall or comprehension-level questions that require only rote skills. Using such a unique situation is considered to be one way a recall question can be revised into an application or analysis type of question. Although there is agreement that item development at the application and analysis level is essential for the measurement of critical thinking, there appears to be increased evidence of the use of higher order thinking when examinees answer constructed response items requiring that the answer be written or typed. Items not limited to a single response as the key may encourage the examinee to move from recall to application-analysis and therefore demonstrate cognitive processes that can be identified as critical thinking (McDonald, 2002; Reynaud & Murray, 2003).
The multiple-choice format of testing was designed to measure aspects of achievement prior to the evolution of cognitive psychology, which sought a more viable method of assessing learning and ability that emphasized internal mental processes. Modern development in the field was due to a post-World War II focus on research into performance and attention, as well as developments in the field of computer science. The advent of technology and availability of personal computers has encouraged the growth of computer-based testing and the divergence of standard multiple-choice items into selected-response and constructed-response test item formats. The term selected-response denotes an item where a correct answer is selected from several choices, whereas a constructed-response, as stated above, requires the examinee to provide the answer in a type-in method. Table 1 demonstrates the alternate format item types and, when possible, the paired multiple-choice item. For some alternate item types (clinical scenario or chart), it was not possible to have a closely matched item pair.
Table 1: Comparison of Selected Response (Traditional) Item Types and Constructed Response (Alternate) Item Formats
As suggested by the literature, there are advantages and disadvantages to each item format, whether selected response or constructed response (discrete and extended formats). For example, Martinez (1999) stated:
There is empirical evidence and a theoretical rationale to support the idea that constructed response items are, in general, better suited than multiple-choice items, to provide detailed information pertinent to cognitive diagnosis. (p. 218)
In addition, selected-response multiple-choice items, while efficient, do not take full advantage of the computer’s capability to provide additional ways in which to assess an examinee’s abilities (Parshall, Davey, & Pashley, 2000).
In the design of the examination used for this research, constructed-response items, as well as additional alternate formats, were used, each paired with a selected-response multiple-choice item of similar content.
In 1994, the National Council Licensure Examination (NCLEX®) moved from a paper-and-pencil format of multiple-choice questions to computer-adaptive technology using those same item formats. At that time, it was postulated that computers have the potential to assess new skills and abilities that have been difficult or extremely expensive to measure using traditional testing formats (McHenry & Schmitt, 1994). Innovations in computer-based testing include additional item types with features such as sound, graphics, animation, and video integrated into the item stem, the response options, or both. Item development has progressed from multiple-choice questions, in which one answer is selected from several response alternatives, to items that allow examinees to drag and drop objects into rank order, click on graphics, choose multiple responses, and provide fill-in-the-blank short-answer responses.
It was important to know whether the different nursing content areas and competencies being assessed affected the format or the participants’ perception of the formats. By using a variety of content areas for different item types and including several items on the same content, the researchers expected to be able to disentangle the effect of item placement and presentation from opinions about and characteristics of the items (McCrudden, Schraw, Hartley, & Kiewra, 2003). In working with calculation items, examinees were provided with an on-screen calculator, as they are in the NCLEX examinations. For those items requiring review of laboratory results, standard normal laboratory values were provided. Although examinees are required to know normal ranges for common laboratory values when taking the NCLEX, reviewing laboratory values with institutional norms provides more realism to chart and exhibit items.
Given the increased cost of producing alternate items and the additional time it may take for examinees to respond to the items, it is important that there be a valid rationale for the use of alternate item types. There are several reasons for the use of alternate items that are supported throughout the literature, including an increased complexity of items, which allows for the ability to assess additional competencies and to permit examinees to demonstrate their competencies differently.
The purpose of this study was to determine whether items that use alternate formats were rated favorably by stakeholders (i.e., new graduates), in comparison with paired multiple-choice items. Prior to administering these items, the questions themselves had been reviewed by experts serving on item development panels for the NCLEX examination. It is important to ascertain the opinions of stakeholders and experts regarding the authenticity of alternate items because this information provides support for the validity of this alternate method of assessing examinees (Bachman, 2002).
A favorable determination by both the experts reviewing the item format and content, as well as by the actual test takers, indicated that both groups thought the alternate item types were more realistic, more challenging, and more likely to allow a demonstration of competence than were the paired multiple-choice items.
Thirty-seven RNs and 7 licensed practical nurses (LPNs) participated in this portion of the study. Of the 37 RNs, 31 (84%) held a baccalaureate degree and 6 (16%) held an associate degree in nursing. Regarding their area of nursing specialty, 40.5% of the participants reported working in a medical-surgical or subspecialty area, and 21.6% reported working in a critical care setting. In addition, 71% of the participants worked in a hospital setting. By way of comparison, the results of the 2002 RN Practice Analysis (Smith & Crawford, 2003b) indicate that in the general population, 27% of new graduate RNs hold the baccalaureate degree and 71% work in hospital settings. Because of the small number of practical nurse participants, the LPNs are reported as generalists. Of the seven LPN participants, four reported working in long-term care, one in a community-based setting, one in a hospital, and one in an “other,” nonspecified setting. The results of the 2003 LPN Practice Analysis (Smith & Crawford, 2003a), although difficult to compare with this small number of participants, do indicate that the majority of participants worked in long-term care settings.
A nonexperimental design was used. The participants were administered a research test that contained multiple-choice items paired with items using an alternate format. The participants were then asked to give their opinions about the new item types administered in the research test. Summary statistics on the participants’ responses to the survey were produced using Microsoft Access® and Excel®.
For this research, a letter requesting participation in the study was sent to 470 nurses within a 30-mile area of downtown Chicago who had successfully passed the NCLEX examination and were within their first year of practice. New graduates were chosen as experts because they had recently completed the NCLEX examination for licensure and could compare the alternate item formats with standard multiple-choice questions. All participants were volunteers. Participant information was maintained separately from the collected data, and data were reported in aggregate only. All participants were provided with a written explanation of the study and were given the opportunity to ask questions. Informed consent was indicated by completion of a telephone interview; onsite, participants signed a confidentiality agreement and were given a tutorial on how to answer the alternate format test questions. Actual onsite time to complete the tutorial, test, and posttest survey was 2 hours. Participants were paid an honorarium for their time and travel expenses.
In developing the test, items were varied, not only across formats but within formats, to include diverse content as well as a range of difficulty for the multiple-choice items. The difficulty level of the alternate items could not be known because those items had not been calibrated; instead, their difficulty was estimated using expert knowledge and previous research. The multiple-choice paired items matched the content and difficulty as closely as possible. All items were validated against nursing textbooks used by multiple nursing programs and were evaluated by item review panels, which certified the correct answer and confirmed that the items were appropriate for entry-level practitioners. The items (multiple-choice and alternate) were varied in their positions throughout the tests. This variation in positioning was used to reduce cueing between items and the effects of content on opinions of items.
After completing the test, participants were asked to complete a 13-item questionnaire soliciting their opinions of the alternate items compared with standard multiple-choice items. The instrument consisted of two parts. The first part was designed to gather demographic information about the participants, including type of licensure (RN versus LPN), basic nursing education program, current employment status, employment specialty area, and type of employing facility. Respondents were then asked to characterize their assessment of the six types of questions regarding their clarity and realism, the challenge they imposed, and the degree to which each item type enabled them to demonstrate their competence. In addition, respondents were asked to rate the clarity of the images and graphics used in the hot spot items. On this section of the survey, respondents rated their agreement with statements on a 4-point Likert scale ranging from strongly agree to strongly disagree. Finally, respondents were asked to rank the difficulty of the format of the alternate types of questions compared with standard 4-response multiple-choice questions. Summary statistics on the participants’ responses were produced using Excel.
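The agreement figures reported for each survey statement are simple row tallies of agree versus disagree responses, expressed as counts and percentages. A minimal sketch of that tabulation follows; the response codes and data here are invented for illustration, not taken from the study:

```python
# Tally agree/disagree responses for one survey statement and report
# counts with percentages, in the style of Table 2.
# The Likert codes (SA/A/D/SD) and the response list are hypothetical.
responses = ["SA", "A", "A", "D", "SA", "A", "SD", "A"]

agree = sum(r in ("SA", "A") for r in responses)       # strongly agree + agree
disagree = sum(r in ("SD", "D") for r in responses)    # strongly disagree + disagree
total = agree + disagree

print(f"Agree: {agree} ({round(100 * agree / total)}%)")
print(f"Disagree: {disagree} ({round(100 * disagree / total)}%)")
```

With the hypothetical data above, this prints `Agree: 6 (75%)` and `Disagree: 2 (25%)`; the published table reports the same count-plus-percentage pairs for each item type and statement.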
Results and Discussion
All of the participants completed the test and the survey instrument. The participants’ opinions of the alternate items compared with the standard multiple-choice items were positive for every item type. The participants agreed that the alternate items allowed them to demonstrate their competence in a more realistic and challenging way than multiple-choice items did. For example, for the multiple response items, 33 of the 36 participants (92%) agreed that the items were more realistic than the standard multiple-choice items, 33 of 37 (89%) agreed that the alternate items were more challenging, and 34 of 37 (92%) agreed that the alternate items allowed them to demonstrate their competence more than multiple-choice items did. For the chart or exhibit items, 86% of the participants thought the items were more realistic, 95% thought the items were more challenging, and 86% thought the items allowed them to demonstrate their competence better than multiple-choice items did (Table 2).
Table 2: Examinee Opinion of Alternate Items
As also seen in Table 2, there were similar findings for the LPNs. Based on both groups’ opinions, it can be concluded that, in general, these alternate item formats are more realistic, are more challenging, and allow entry-level nurses to demonstrate their nursing competence better than paired multiple-choice items.
The RN and LPN participants tended to rate the format of the alternate items as the same as or more difficult than multiple-choice items. The most difficult format was the chart or exhibit format: 27 of the 37 RN participants (73%) rated this format as more difficult, and 5 (14%) rated it at the same level of difficulty as multiple-choice items. For the hot spot items, however, 13 of 37 participants (35%) rated the format as less difficult and 10 (27%) rated it at the same level of difficulty as a multiple-choice item. There were similar findings for the LPNs. The difficulty level of the items themselves may have influenced the participants’ responses about format. In addition, the tutorial in this research study was not interactive and thus did not allow participants to practice their computer skills on the chart or exhibit format. All of these factors could have contributed to the participants’ opinions about the difficulty of the clinical scenario or chart format.
When asked to comment on the items in general, the participants wrote more positive (N = 32) than negative (N = 12) comments, almost a 3 to 1 ratio. The positive comments were favorable regarding the alternate items; the negative comments concerned the computer interface, the quality of the graphics, and the difficulty of the items. Some of the participants’ actual comments, by item type, are noted below.
- These are things (chart/exhibit format) you would need to do when taking care of a patient—explaining the lab values as well as carrying out orders.
- [For ordered response,] this question allowed the reader to put themselves in the actual situation and kind of go through the actual steps that need to be taken.
- It is so much harder to work out a problem (fill-in-the-blank calculation) when there are no listed answers that can let you know if you are on the right track.
- These were more of a challenge since there was more than one right answer [multiple response]…. I think they were more realistic and related to nursing care better than standard multiple-choice.
- These questions [multiple response] were more in-depth and caused me to think more critically than when I took the NCLEX.
- I liked these [hot spot items] because it allowed you to have a visual aid about what the question is asking you for. It allows you to be skilled on different landmarks and position on the body.
As noted previously, there were fewer negative comments, and most of the negative comments related to the graphics and sounds used in the item, the computer skills needed, and the length of time it took to answer the items. One participant stated, “This (chart/exhibit item) took a lot of time—having to keep switching back and forth between screens. Very frustrating.” For the multiple-response item, it was noted:
- Although more realistic, it is harder to answer than the standard questions.
- I felt more challenged by the prioritization (ordered response) question but also more frustrated because I know if I made one error in the order I would miss the whole question.
- The chest tube (hot spot) question was a little fuzzy in the picture.
As highlighted by these comments, there are advantages to each item type, as well as some disadvantages to the alternate item formats, some of which are alluded to in the negative comments. One of the disadvantages is the increased time needed to respond to some of these alternate item formats. In calculating the average response time in seconds for all the item types, there was a slight increase in response time for multiple response and ordered response items compared with equivalent multiple-choice items. The chart or exhibit items, audio items, and hot spot items have no multiple-choice items that can be used for equivalent time comparisons (Table 3). For the calculation (fill-in-the-blank) items, the response times appear to be the same as for multiple-choice items (Gorham & Wendt, 2004). However, as candidates become familiar with the alternate format items, the issue of response times should diminish.
Table 3: Summary of Average Response Time (Seconds) by Item Type
This increased response time can have a negative effect on the overall assessment of an examinee because there may be fewer opportunities to sample the examinee’s behavior in the same amount of testing time. Alternately, the testing time could be extended to accommodate the alternate items.
In terms of overall assessment Martinez (1999) stated:
Sometimes the best policy decisions will not be a matter of either/or, but of what mixture of item formats will yield the best possible combined effect…assessments can capitalize on their respective positive features and minimize their liabilities. (p. 216)
Perhaps the best strategy for assessing entry-level nursing competence is to take advantage of each item type. Efforts should be made to maximize the benefits of each item and thereby reduce the risk of relying on just one type of item in a high stakes examination. An important question for research, then, is the best blend of item types: how many multiple-choice items versus case situations versus constructed-response items should be included in an examination?
Conclusions and Recommendations
The results of this study indicated that, in general, entry-level nurses believed alternate items were more realistic and more challenging than multiple-choice items. Nurses believed that alternate items allowed them to demonstrate their competence better than multiple-choice items. As noted by one participant, “These items are more like what happens in the hospital—more like real life.” Thus the alternate items appeared to be more authentic at face value.
From an education standpoint, the use of alternate item formats in preparing nursing students for eventual practice encourages these novices to approach client care situations using critical thinking abilities. The authenticity of these types of items allows students to move beyond rote memorization, challenging them to apply learning in a new way. Although actual experience will ultimately reinforce formal education and cement the nurse’s fund of knowledge, testing that encourages critical thinking while the nurse is still a student increases the likelihood that the nurse will internalize such strategies for use in future practice.
Care will need to be taken to ensure that the computer interface and software used for alternate items are user friendly. It is also important to consider the length of response time for the various item types and to weigh the benefits of each item type, alternate and multiple-choice alike. There are advantages to multiple-choice items; however, testing programs may want to include alternate items in their examinations to enhance authenticity. Use of alternate items may facilitate opportunities to assess additional competencies, such as higher levels of cognitive processing or critical thinking. Additional subject content areas may also become assessable with alternate items.
The potential for alternate format items to test for higher level cognitive competencies has resulted in the addition of such items to the licensing examinations for RNs and LPNs. Still to be addressed is the optimal composition of multiple-choice and alternate items in the assessment examination. Ultimately, the answer may be a policy decision based on money, time, and values of the organization and, most importantly, how to assess entry-level nurse competence.
- Bachman, L. (2002). Alternate interpretations of alternative assessments: Some validity issues in educational performance. Educational Measurement: Issues and Practice, 21(3), 5–18. doi:10.1111/j.1745-3992.2002.tb00095.x
- Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues and Practice, 17(3), 37–45. doi:10.1111/j.1745-3992.1998.tb00627.x
- Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., et al. (1999). Psychometric and cognitive functioning of an under-determined computer-based response type for quantitative reasoning. Journal of Educational Measurement, 36, 233–252. doi:10.1111/j.1745-3984.1999.tb00556.x
- Gorham, J., & Wendt, A. (2004, April). Investigation of the item characteristics of innovative items. Paper presented at the American Educational Research Association Annual Meeting, San Diego, CA.
- Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20(3), 16–25. doi:10.1111/j.1745-3992.2001.tb00066.x
- Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34, 207–218. doi:10.1207/s15326985ep3404_2
- McCrudden, M., Schraw, G., Hartley, K., & Kiewra, K. (2003, April). The influence of presentation, organization and example context on text learning. Paper presented at the American Educational Research Association Annual Meeting, Chicago, IL.
- McDonald, M. E. (2002). Systematic assessment of learning outcomes: Developing multiple-choice exams. Sudbury, MA: Jones & Bartlett.
- McHenry, J. J., & Schmitt, N. (1994). Multimedia testing. In J. Harris, M. Rumsey, & C. Walker (Eds.), Personnel selection and classification (pp. 193–232). Hillsdale, NJ: Erlbaum.
- O’Leary, M. (2002). Stability of country rankings across item formats in the Third International Mathematics and Science Study. Educational Measurement: Issues and Practice, 21(4), 27–38. doi:10.1111/j.1745-3992.2002.tb00104.x
- Parshall, C. G., & Balizet, S. (2001). Audio CBTs: An initial framework for the use of sound in computerized tests. Educational Measurement: Issues and Practice, 20(2), 5–15. doi:10.1111/j.1745-3992.2001.tb00058.x
- Parshall, C. G., Davey, T., & Pashley, P. J. (2000). Innovative item types for computerized testing. In W. van der Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129–148). Dordrecht, The Netherlands: Kluwer Academic Publishers.
- Reynaud, R., & Murray, H. (2003, April). The effect of higher order questions on critical thinking skills. Paper presented at the American Educational Research Association Annual Meeting, Chicago, IL.
- Smith, J., & Crawford, L. (2003a). Report of findings from the 2003 PN Practice Analysis: Linking the NCLEX-PN examination to practice (Vol. 17). Chicago: National Council of State Boards of Nursing.
- Smith, J., & Crawford, L. (2003b). Report of findings from the 2002 RN Practice Analysis: Linking the NCLEX-RN examination to practice. Chicago: National Council of State Boards of Nursing.
Comparison of Selected Response (Traditional) Item Types and Constructed Response (Alternate) Item Formats

| Selected Response (Traditional) Item Type | Constructed Response (Alternate) Item Type |
| --- | --- |
| Multiple-choice item: While assessing the patient’s abdomen, which of the following sequences of the examination should the nurse recognize as appropriate? | Fill-in-the-blank/ordered response: While assessing the patient’s abdomen, in what sequence should the examination be conducted (identify the steps by inserting the number of the first step, second step, and so on)? |
| 1. Inspection, auscultation, percussion, and palpation | _____ Test for rebound tenderness |
| 2. Auscultation, palpation, inspection, and percussion | _____ Percussion |
| 3. Inspection, palpation, auscultation, and percussion | _____ Auscultation |
| 4. Palpation, inspection, auscultation, and percussion | _____ Palpation |
| Multiple-choice item: When caring for a client who has a wound infected with methicillin-resistant Staphylococcus aureus (MRSA), which of the following infection control procedures should the nurse implement? | Multiple-response item: When caring for a client who has a wound infected with MRSA, which of the following infection control procedures should the nurse implement? (Check all that apply.) |
| 1. Place the client in a private room. | _____ 1. Wear a protective gown when entering the client’s room. |
| 2. Ask the client to wear a surgical mask. | _____ 2. Put on a particulate respirator mask when administering medications to the patient. |
| 3. Use sterile gloves to remove the wound dressing. | _____ 3. Wear gloves when delivering the client’s meal tray. |
| 4. Wear a protective gown when providing wound care. | _____ 4. Ask the client’s visitors to wear a surgical mask when in the client’s room. |
| | _____ 5. Wear sterile gloves when removing the client’s dressing. |
| | _____ 6. Put on a face shield before irrigating the client’s wound. |
| Multiple-choice item: The nurse is performing a cardiac assessment upon admission. Which of the following describes the best anatomic location to auscultate the mitral valve at its loudest? | Hot spot: The nurse is performing a cardiac assessment upon admission. Click on the area where the nurse should auscultate to hear the mitral valve at its loudest. |
| 1. Second intercostal space at the right sternal border | |
| 2. Third intercostal space at the left mid-clavicular line | |
| 3. Fourth intercostal space at the left sternal border | |
| 4. Fifth intercostal space at the left mid-clavicular line | |
| Multiple-choice item: The nurse is caring for a client whose intake and output must be calculated. The nurse observes that the client has consumed 8 ounces of apple juice, one hamburger on a bun, one-half cup of green beans, 8 ounces of tea, and one cup of ice cream. Which of the following should the nurse record as the client’s intake? | Fill-in-the-blank/calculation: The nurse is monitoring the dietary intake and output of a client. The nurse observes that the client has consumed 8 ounces of apple juice, one hamburger on a bun, one-half cup of green beans, 8 ounces of tea, and one cup of ice cream. How many milliliters should the nurse record for the client’s intake? |
| 1. 360 milliliters | _______________ milliliters |
| 2. 560 milliliters | |
| 3. 720 milliliters | |
| 4. 760 milliliters | |
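The fill-in-the-blank calculation item above can be worked through directly: only fluids, and foods that liquefy at room temperature such as ice cream, count toward recorded intake, and the standard nursing conversions (1 fluid ounce = 30 mL, 1 cup = 240 mL) are assumed here since the article does not restate them. A minimal sketch of the arithmetic:

```python
# Intake calculation for the sample item. Solids (hamburger, green beans)
# are not counted; ice cream counts because it liquefies at room temperature.
OZ_TO_ML = 30    # standard nursing conversion: 1 fluid ounce = 30 mL
CUP_TO_ML = 240  # 1 cup = 8 fluid ounces = 240 mL

intake_ml = (
    8 * OZ_TO_ML    # 8 oz apple juice
    + 8 * OZ_TO_ML  # 8 oz tea
    + CUP_TO_ML     # 1 cup ice cream
)

print(intake_ml)  # 720
```

The result, 720 mL, corresponds to option 3 of the paired multiple-choice item.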
Examinee Opinion of Alternate Items

| Item Type / Opinion | RNs Agree, n (%) | RNs Disagree, n (%) | LPNs Agree, n (%) | LPNs Disagree, n (%) |
| --- | --- | --- | --- | --- |
| Multiple response | | | | |
| More realistic | 33 (92) | 3 (8) | 5 (83) | 1 (17) |
| Challenged more | 33 (89) | 4 (11) | 5 (83) | 1 (17) |
| Demonstrate competence | 34 (92) | 3 (8) | 4 (67) | 2 (33) |
| Fill-in-the-blank calculation | | | | |
| More realistic | 7 (88) | 1 (13) | 4 (67) | 2 (33) |
| Challenged more | 8 (100) | 0 (0) | 4 (67) | 2 (33) |
| Demonstrate competence | 7 (88) | 1 (13) | 4 (67) | 2 (33) |
| | | | | |
| More realistic | 35 (95) | 2 (5) | 6 (100) | 0 (0) |
| Challenged more | 31 (84) | 6 (16) | 6 (100) | 0 (0) |
| Demonstrate competence | 32 (86) | 5 (14) | 6 (100) | 0 (0) |
| | | | | |
| More realistic | 33 (89) | 4 (11) | 4 (67) | 2 (33) |
| Challenged more | 32 (86) | 5 (14) | 5 (83) | 1 (17) |
| Demonstrate competence | 33 (92) | 3 (8) | 4 (67) | 2 (33) |
| | | | | |
| More realistic | 34 (92) | 3 (8) | 0 (0) | 0 (0) |
| Challenged more | 32 (86) | 5 (14) | 0 (0) | 0 (0) |
| Demonstrate competence | 31 (84) | 6 (16) | 0 (0) | 0 (0) |
| Chart or exhibit | | | | |
| More realistic | 32 (86) | 5 (14) | 4 (67) | 2 (33) |
| Challenged more | 35 (95) | 2 (5) | 6 (100) | 0 (0) |
| Demonstrate competence | 32 (86) | 5 (14) | 5 (83) | 1 (17) |
Summary of Average Response Time (Seconds) by Item Type

| Item Type | Research Study | July 2003 NCLEX® Pretest |
| --- | --- | --- |