Like all educators, nursing faculty are faced with the task of test administration to assess students' knowledge. We often experience a certain degree of uneasiness when we assign grades based on a student's test performance because we are forcing "the qualitative variations into a scholastic linear scale of some kind" (Thurstone, 1928, p. 532). Through measurement, some quality or attribute is assigned or restated as a quantitative value. Because of the differing nature of qualitative and quantitative variables, limitations and errors are inherent in the process of measurement.
Psychometric models provide a frame that may assist educators in the process of quantifying a student's performance on test items. The different models vary according to their underlying assumptions. The psychometric model most familiar to all is Classical Test, or True Score, Theory. This theory is a predictive, summary theory: it addresses how one may use a test to predict something. Classical Test Theory asserts that an individual's true score is unobservable and that the observed score equals the true score plus or minus an error. There always will be an error term because it is impossible to measure anything with perfect accuracy. The error term is defined as the difference between the observed score and the true score. Two assumptions in Classical Test Theory are that there is no correlation between true score and error and that there is no correlation between the error terms.
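These relationships can be illustrated with a small simulation. The score distributions below are hypothetical, chosen only to show that when the errors are generated independently of the true scores, as the model assumes, the two are approximately uncorrelated in a sample:

```python
import random

random.seed(1)

# Hypothetical simulation of Classical Test Theory: observed = true + error.
true_scores = [random.gauss(75, 10) for _ in range(1000)]
errors = [random.gauss(0, 5) for _ in range(1000)]  # mean-zero measurement error
observed = [t + e for t, e in zip(true_scores, errors)]

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

# Under the model's assumptions, true score and error are uncorrelated,
# so the sample correlation should be near zero.
print(round(correlation(true_scores, errors), 3))
```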
Using Classical Test Theory, we perform item analysis following test administration. The two item values commonly sought are item difficulty and item discrimination. Item difficulty is the proportion of respondents who answered the item correctly, and the discrimination index is the correlation between responses on a test item and individuals' total scores. The two values give us some useful information in the evaluation of individual test items.
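As a sketch, these two item statistics might be computed as follows; the response matrix is hypothetical, with rows for students and columns for items:

```python
# Hypothetical response matrix: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n_students = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]  # each student's total score

# Item difficulty: proportion of respondents answering the item correctly.
difficulty = [sum(row[j] for row in responses) / n_students for j in range(n_items)]

def point_biserial(item_col, totals):
    """Discrimination index: correlation between item responses and total scores."""
    n = len(item_col)
    mi, mt = sum(item_col) / n, sum(totals) / n
    cov = sum((x - mi) * (t - mt) for x, t in zip(item_col, totals)) / n
    si = (sum((x - mi) ** 2 for x in item_col) / n) ** 0.5
    st = (sum((t - mt) ** 2 for t in totals) / n) ** 0.5
    return cov / (si * st)

discrimination = [point_biserial([row[j] for row in responses], totals)
                  for j in range(n_items)]
print([round(p, 2) for p in difficulty])
```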
Many educators will use the values obtained after one test administration to construct a test for the next time the content is taught. For example, we may wish to have the average item difficulty equal 75%. During test construction we select items whose difficulties, calculated from previous test administrations, average 75%. However, those item difficulties are based on a different sample of respondents. When the second test administration produces unexpected results for item difficulties, we are observing the major limitation of Classical Test Theory: in this psychometric model, the values obtained for item difficulty and discrimination are sample dependent. Consequently, the sample of respondents should remain the same when these values are used for repeated test administrations, especially when one is concerned with the task of developing equivalent test forms.
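This selection step can be sketched against a small, hypothetical item bank of previously calibrated difficulties; the caveat above applies, since these values came from one particular sample of respondents:

```python
from itertools import combinations

# Hypothetical item bank: difficulties calculated from a previous administration.
bank = {"Q1": 0.90, "Q2": 0.85, "Q3": 0.75, "Q4": 0.70, "Q5": 0.60, "Q6": 0.55}

target = 0.75  # desired average item difficulty
# Choose the four-item subset whose average difficulty is closest to the target.
best = min(combinations(bank, 4),
           key=lambda items: abs(sum(bank[i] for i in items) / 4 - target))
print(sorted(best), round(sum(bank[i] for i in best) / 4, 2))
```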
Although Classical Test Theory is a familiar and comfortable model for us to use, it does have a major limitation that may restrict its utility. Let us now consider another psychometric model and its value, particularly in the area of nursing education.
The Rasch Model
Latent trait refers to a "mathematical model of the relationship of item responses to an underlying dimension" (Guion & Ironson, 1983, p. 55). Unlike Classical Test Theory, latent trait theory is not fixed to a specific sample. One example of the latent trait theory is the Rasch Model.
FIGURE 1: Scale values of test items for major content areas (Circled numbers refer to serial order of test items.)
Like other psychometric models, the Rasch Model does contain assumptions. The one explicit mathematical assumption most people recognize is local independence: one assumes the probability of answering one test item correctly does not depend upon a respondent correctly answering any other item on the test. The assumption may be a major limitation if a series of test items is grouped in a "patient situation," a common format used in nursing tests, and if obtaining a correct response on one item influences the way a person responds to a subsequent item. Local independence also may not exist in a series of test items that involves a matching activity where, for example, the respondent is asked to match words with phrases and there are equal numbers of words and phrases, since each match constrains the responses that remain.
In the Rasch Model we consider the difficulty of a test item. "A more difficult test item is an item upon which every person has less chance of success" (Masters, 1980, p.19). Exclusive of the sampling dependence condition, an item's value in the Rasch Model is similar to item difficulty in Classical Test Theory. The item value is expressed in a generic mathematical unit called "logit" and is defined as "its natural log odds for eliciting failure from persons with zero ability" (Wright & Stone, 1979, p. 17).
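The model's central relationship can be sketched numerically. In the sketch below, person ability and item difficulty, both expressed in logits and both hypothetical values, jointly determine the probability of success:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability that a person at `ability` logits succeeds on an item
    at `difficulty` logits, under the Rasch model."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the odds of success are even.
print(rasch_probability(1.0, 1.0))  # 0.5
# A more difficult item gives every person less chance of success.
print(round(rasch_probability(0.0, 2.0), 2))
```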
The other valuable item statistic in the Rasch Model is the "fit" statistic. Fit of an item may be defined as a measure of the degree to which a test item performed the way it was expected to perform. Similar to item discrimination of Classical Test Theory, the fit statistic tells us whether or not the more able persons are answering a difficult item correctly while the less able persons are missing the item. If a test item "fits" the model, then the expected results are achieved.
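One common fit statistic, the unweighted ("outfit") mean square of the standardized residuals, can be sketched as follows. The abilities, item value, and response patterns are hypothetical; values near 1 indicate that the item performed as the model expects:

```python
import math

def rasch_probability(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

def outfit_mean_square(item_value, abilities, responses):
    """Unweighted mean of squared standardized residuals for one item.
    Values near 1 suggest the item performed as the model expects."""
    z_squared = []
    for theta, x in zip(abilities, responses):
        p = rasch_probability(theta, item_value)
        z_squared.append((x - p) ** 2 / (p * (1 - p)))
    return sum(z_squared) / len(z_squared)

# Hypothetical person abilities (logits) and responses (1 = correct) to one item.
abilities = [-1.0, 0.0, 1.0, 2.0]
fitting = [0, 0, 1, 1]     # the more able succeed: the expected pattern
misfitting = [1, 1, 0, 0]  # the reversed pattern inflates the statistic
print(round(outfit_mean_square(0.5, abilities, fitting), 2))
print(round(outfit_mean_square(0.5, abilities, misfitting), 2))
```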
Having item values and item fit statistics, one then may compare these values with the serial order of the test items. This comparison reveals whether the easier items fall at the beginning of the test or at certain points, and whether misfitting items occur at certain points in the test. If students run out of time, the items at the end of the test may appear more difficult simply because respondents never reached them. One also may consider the calibration order to determine if there is a relationship between the item value and the fit statistic; for example, whether the more difficult or the easier items are the same items that misfit the model, producing the unexpected results. Finally, the fit order of the test items would reveal any patterns in the occurrence of well-fitting or misfitting test items.
Item separation is another statistic generated from the test items. It defines the number of levels into which the items may be divided based on their values. This allows one to compare levels of separation with some other factor such as the content area or cognitive levels of the test items. One then may determine whether or not certain groups of test items clearly are more difficult or easier than other subgroups of test items.
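One common way to compute a separation index divides the "true" spread of the item calibrations (the observed spread, corrected for measurement error) by the average error; a related "strata" value estimates the number of statistically distinct levels. The item values and standard errors below are hypothetical:

```python
# Hypothetical item calibrations (logits) with their standard errors.
item_values = [-1.2, -0.6, -0.1, 0.4, 0.9, 1.5]
standard_errors = [0.30, 0.25, 0.25, 0.25, 0.25, 0.30]

n = len(item_values)
mean = sum(item_values) / n
observed_variance = sum((v - mean) ** 2 for v in item_values) / n
error_variance = sum(se ** 2 for se in standard_errors) / n
true_variance = max(observed_variance - error_variance, 0.0)

# Separation: "true" spread of the items relative to their measurement error.
separation = (true_variance / error_variance) ** 0.5
# Strata: roughly how many statistically distinct difficulty levels exist.
strata = (4 * separation + 1) / 3
print(round(separation, 2), round(strata, 1))
```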
Often in nursing education, we test on a unit topic following a series of lectures on content areas within that unit. Having a value for each test item, one may collect the item values within one content area and compare them with the values of items in other content areas. This would tell us whether the items are comparably difficult across the various content areas. For one nursing exam, the items were analyzed according to the six major content areas that are part of the unit on oxygenation. In Figure 1, the test items are grouped into content areas and plotted according to their scale values. The questions on Shock and Blood Vessel Disruption were slightly easier than the questions for the other four content areas.
Another aspect common in nursing education is the integration of certain threads within our content areas. These threads, whether they be systems theory, pharmacology, or physical assessment, may be examined in a test. In a format similar to Figure 1, one can collect the items related to the integrated threads and examine them with regard to their values.
The Rasch Model is useful in the process of developing a pool of test items or for the generation of equivalent test forms. With latent trait models, "item parameters should remain the same regardless of the subgroup tested" (Guion & Ironson, 1983, p. 61). Through the use of item values and the development and administration of multiple test items for a particular content area, one may obtain a reliable pool of test items. From this pool one may construct equivalent test forms that have comparable item values for the same content area. Here, the true worth of the consistency of item parameters cannot be overestimated.
The Rasch Model also contains statistics for person measures. Typically, we assign person measures that reflect the percentage of test items answered correctly. The Rasch Model differs here. Just as we obtained a value for each test item, so too do we obtain a value, or person measure, for each individual who took the test. Once again, the mathematical unit called "logit" is used. "A person's ability in logits is his natural log odds for succeeding on items of the kind chosen to define the 'zero' point on the scale" (Wright & Stone, 1979, p. 17). The person measure facilitates comparison of individuals or subgroups reflective of their overall ability to answer the test items correctly.
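One standard way to obtain such a person measure from scored responses and known item values is maximum-likelihood estimation. The sketch below uses Newton-Raphson iteration with hypothetical item difficulties; note that the estimate is undefined for a perfect or zero raw score:

```python
import math

def rasch_probability(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, iterations=25):
    """Maximum-likelihood person measure (logits) via Newton-Raphson,
    given scored responses and known item values. Undefined for a
    perfect or zero raw score."""
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1 - p) for p in probs)
        theta += gradient / information
    return theta

# Hypothetical item values; this person answered the three easiest correctly.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
measure = estimate_ability([1, 1, 1, 0, 0], difficulties)
print(round(measure, 2))
```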
Each person also has a fit statistic. Here, "fit" refers to a measure of the degree to which a person responds in the way he or she is expected to respond. One expects a person with more ability to respond correctly to the more difficult items more frequently than a person with lesser ability. Thus, a person would misfit the model and obtain a high positive fit statistic if he or she succeeds on the more difficult items and fails the easier items (Masters, 1980, p. 16). The fit statistic also provides valuable information about individuals or subgroups. If members of a subgroup misfit the model, one has a basis for further examination into the characteristics or experiences common to the group. In nursing education, the obvious subgroups are the small clinical groups that have different instructors and different clinical placements. The extent to which members of a subgroup fit or misfit the model may indicate the extent to which these differences affect the subgroup's test performance.
Another product of the person-measure statistics is the standardized residual. If a person does not fit the model, one may obtain a report of his or her standardized residuals. These values measure the extent to which the person responded to each test item differently from the expected response. For example, if a person was expected to answer an item incorrectly but actually answered it correctly, he or she would receive a positive residual value. The larger the residual value, the more deviant the actual result.
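A standardized residual for each item can be sketched as the observed response minus the expected response, divided by the model standard deviation; the ability and item values below are hypothetical:

```python
import math

def rasch_probability(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

def standardized_residuals(ability, difficulties, responses):
    """Per-item standardized residuals: (observed - expected) / model SD.
    A large positive value marks an unexpected success; a large negative
    value marks an unexpected failure."""
    out = []
    for b, x in zip(difficulties, responses):
        p = rasch_probability(ability, b)
        out.append((x - p) / math.sqrt(p * (1 - p)))
    return out

# Hypothetical case: a lower-ability person unexpectedly answers the
# hardest item correctly and so receives a large positive residual on it.
difficulties = [-1.0, 0.0, 2.5]
residuals = standardized_residuals(-0.5, difficulties, [1, 0, 1])
print([round(z, 2) for z in residuals])
```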
One may analyze the residuals of misfitting individuals or subgroups to determine if there is a common characteristic of the test items producing the unexpected results. Test developers may uncover errors in the wording or construction of the items by this process. If a subgroup of students were exposed strictly to the generic names of medications in their clinical placement and test items contained only the brand names of medications, this subgroup may misfit the model and have residual values for those test items that contain the brand names of medications. One can think of many variables in nursing education whose impact could be assessed or just revealed through a residual analysis.
An educator also could scan the list of test item residuals to determine if any particular items are producing a substantial number of significant residuals irrespective of known subgroups. If large residuals are appearing, one has reason to reexamine the test item and speculate as to the possible causes of the unexpected result.
As with the test items, the model gives a person separation or number of levels the entire group of respondents may be divided into, based on their person measures. The heterogeneity or homogeneity of a group is revealed by the person separation value.
FIGURE 2. Scale map showing positions of people and test items (x represents one person; O represents one test item).
Person and Item Statistics
As mentioned previously, the item value and the person measure both use the mathematical unit of a "logit." Because of this commonality, item values and person measures may be compared on the same linear scale, one that positions both people and test items. With this scale, one may obtain an overview of the group's ability as compared with the difficulty of the test items. If there are items located at the upper end of the scale and beyond the point where any persons are located, then these items are described as more difficult than any person's ability. In Figure 2, items 10 and 28 are located at the upper end of the scale beyond any person position, indicating the two items are more difficult than any person's ability. Mastery of test items is delineated when there are items located at the lower end of the scale and beyond the point where any persons are located; the difficulty of these items is below all of the persons' abilities. In Figure 2, there are 23 test items located at the lower end of the scale below any person position. With criterion-referenced tests, one expects a certain number of test items at the mastery level. The scale map in Figure 2 indicates the test items at the mastery level for the group of students who took the test.
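Because the two sets of measures share the logit scale, a scale map in the spirit of Figure 2 can be produced directly from them. The person measures and item values below are hypothetical; in this sketch, the item at the top of the scale lies beyond every person's position:

```python
# Hypothetical person measures (x) and item values (O) on the shared
# logit scale, printed as a minimal scale map in the spirit of Figure 2.
person_measures = [-0.8, -0.2, 0.1, 0.4, 0.9]
item_values = [-1.6, -0.9, 0.0, 0.5, 1.8]

lines = []
level = 2.0
while level >= -2.0:  # half-logit bins from the top of the scale down
    persons = "x" * sum(1 for m in person_measures if level <= m < level + 0.5)
    items = "O" * sum(1 for v in item_values if level <= v < level + 0.5)
    lines.append(f"{level:+.1f} | {persons:<5} {items}")
    level -= 0.5
print("\n".join(lines))
```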
It is recognized that the Rasch Model is not the ultimate solution to overcome the limitations of Classical Test Theory. It was presented because it offers valuable information to educators who construct and use tests. With Classical Test Theory, the focus is primarily on the test items. The Rasch Model provides data for the test items and for the persons who took the test. These data provide a much more global perspective of what actually occurs when a person responds to a test item. In comparison with Classical Test Theory, the Rasch Model seems to make the difficult task of quantifying the qualitative variations a little bit easier.
- Guion, R. & Ironson, G. (1983). Latent trait theory for organizational research. Organizational Behavior and Human Performance, 31, 54-87.
- Masters, G. (1980). A Rasch model for rating scales (Doctoral dissertation, The University of Chicago). DAI No. 000-000.
- Thurstone, L.L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.
- Wright, B. & Stone, M. (1979). Best Test Design. Chicago: MESA Press.