Journal of Nursing Education

Development and Evaluation of Classroom Tests: A Practical Application

Mary Kay Flynn, MA, RN; Jean L Reese, PhD, RN



This manuscript describes the concerns of a 15-member faculty group in a team-taught undergraduate nursing course regarding the quality of its classroom tests. A process of systematic evaluation spanning several semesters resulted in changes both in testing and, subsequently, in the grading policy. The primary goal of the changes was to strengthen the validity and reliability of tests which, in turn, would increase test fairness for the students. Discussion areas include 1) reliability and validity; 2) difficulty and discrimination indices; 3) effects of poor-item elimination; and 4) determination of cut scores. Explanations of how testing and measurement principles were applied occur in each area.





Testing, measurement, and evaluation play a prominent role in all learning institutions, including schools of nursing. Students as educational consumers are holding educators increasingly accountable for accurate measurement of learning outcomes (Cronbach, 1984). Evaluation based on written tests can have far-reaching consequences for students in terms of success or failure in the nursing program. Consequently, nursing educators have the responsibility for developing testing procedures that fairly evaluate students' achievement and yield reliable scores. However, the development of classroom tests that are both valid and reliable challenges the best of educators. This article reports on the changes in testing implemented in a 15-member faculty group of one team-taught undergraduate nursing course. The primary purpose was to strengthen the validity and reliability of classroom tests.

Concern over the quality of classroom exams prompted the faculty to ask the following questions: 1) Were our tests valid in that they adequately sampled the course content? 2) Were we testing critical knowledge that was commensurate with the ability level of the students? 3) Did a sufficient number of test items reflect cognitive levels other than recall of facts? 4) Were our tests reliable? and 5) Were our inferences based on tests reflected in our grading policies?

These questions stimulated a process of group evaluation and change that spanned several semesters. Changes described herein relate to validity and reliability, omission of poor items, difficulty and discrimination indices, and determination of cut scores for a series of classroom exams.


A primary goal of upgrading the authors' tests was to increase content validity. In the past, each faculty member who lectured submitted a set of items to a test committee responsible for organizing the test. There were no checks and balances for determining whether the questions were: 1) too "teacher specific"; 2) trivial or irrelevant; 3) too easy or too difficult; 4) ambiguous; or 5) testing cognitive levels other than recall of facts. A procedure was implemented whereby all faculty reviewed submitted items related to their specialty.

As part of the review process, faculty attempted to answer each exam item before knowing the "correct" answer, then offered suggestions for items' improvement. The critical question was, "Do these test items provide an accurate measurement of the knowledge presented in the lectures and readings?"

As colleagues pointed out shortcomings of the authors' most prized questions, they discovered how personally attached they were to their items. This type of exchange required the use of diplomatic skills and the specification of criteria for including specific test items.


TABLE 1: KR-20 Reliabilities

The number of items included in each content area was based on the proportion of lecture hours devoted to that topic. This seemed a reasonable way to partition the content areas of the test and it remained the same. In addition, members of the faculty had concerns about cognitive levels being tested within each content area. It was believed that in a practice discipline, such as nursing, the tests should be evaluating students at the application level or higher.

To ensure that items tested more than recall of facts, Bloom's (1956) taxonomy of cognitive learning outcomes was used to develop a table of specifications. This is a two-way table with content areas on one side and levels of learning outcomes along the other. Item writers identified the cognitive level of each submitted item. This exercise encouraged faculty to develop classroom tests that went beyond the knowledge level and ensured that the tests were representative of the objectives. As a result, it was learned that over half of the questions reflected the application level, though very few were at the analysis level or higher. Determining the cognitive level of an item proved difficult because the level changed according to how the lecture content was structured for the students.
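A table of specifications of this kind can be tallied mechanically once each item carries a content area and a cognitive level. The following is only a minimal sketch: the content areas and item tags are hypothetical, and the six level names are those of Bloom's cognitive domain.

```python
from collections import Counter

# Bloom's (1956) cognitive levels, lowest to highest
LEVELS = ["Knowledge", "Comprehension", "Application",
          "Analysis", "Synthesis", "Evaluation"]

# Hypothetical items, each tagged by its writer as (content area, level)
tagged_items = [
    ("Cardiac", "Application"),
    ("Cardiac", "Knowledge"),
    ("Cardiac", "Application"),
    ("Respiratory", "Application"),
    ("Respiratory", "Analysis"),
]

# Two-way table: content areas down the side, cognitive levels across
table = Counter(tagged_items)
areas = sorted({area for area, _ in tagged_items})
for area in areas:
    print(area, [table[(area, level)] for level in LEVELS])

# Share of items written at the application level or higher
cutoff = LEVELS.index("Application")
higher = sum(1 for _, level in tagged_items if LEVELS.index(level) >= cutoff)
share_application_or_higher = higher / len(tagged_items)
print(share_application_or_higher)
```

A tally like this makes it easy to see at a glance which cells of the table are empty before the test is assembled.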

Item writers were also encouraged to construct several questions which related to one situation. This allowed students to concentrate on one stream of thought and decreased student reading time per item. This usually has the effect of decreasing measurement error.

Omission of Poor Items

Since the authors' university has a central computerized examination and testing service, they were able to obtain the statistical measurements on both previous and current exams. Item analysis for every test item was reviewed following each exam. Beginning in the fall of 1985, they started rescoring exams in order to avoid penalizing students for items that tested poorly, to reduce measurement error, and to increase the reliabilities of the tests. Items considered for omission were those in which both the discrimination level was less than 0.20 and the difficulty level was less than 45%. Once these criteria were met, faculty considered the technical quality, content validity, and previous statistics of an item in order to determine whether an item should be omitted or retained.
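The statistical screen described above (both criteria must be met before an item is even considered) can be sketched as follows. The item records and field names are hypothetical, and the screen only flags items; the final decision remains a faculty judgment.

```python
from dataclasses import dataclass

@dataclass
class ItemStats:
    item_id: str
    difficulty: float      # proportion answering correctly (0.0 to 1.0)
    discrimination: float  # discrimination index from the item analysis

def flag_for_review(items, max_discrimination=0.20, max_difficulty=0.45):
    """Return items that meet BOTH statistical criteria for possible omission."""
    return [
        item for item in items
        if item.discrimination < max_discrimination
        and item.difficulty < max_difficulty
    ]

# Hypothetical item analysis results
items = [
    ItemStats("Q1", difficulty=0.38, discrimination=0.05),  # hard and weak
    ItemStats("Q2", difficulty=0.40, discrimination=0.35),  # hard but discriminating
    ItemStats("Q3", difficulty=0.90, discrimination=0.10),  # easy, weak discrimination
]
flagged = [item.item_id for item in flag_for_review(items)]
print(flagged)  # ['Q1']
```

Only Q1 is flagged, because Q2 discriminates adequately and Q3 is not too hard.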

Omission of items does raise concerns regarding test validity. For instance, omission of several questions from one content area could lead to inadequate sampling of content from that area. This concern must constitute an additional factor in the decision process. Thus far, omitted items have come from several content areas so that content validity was not thought to be jeopardized.

The item analysis also helped faculty to determine why items did not test well. Item writers could then choose to delete the item, revise it, or present the content differently for the next semester. The item review process made item writers more accountable for test development and evaluation. This activity also exposed new faculty to the intricacies of test writing and evaluation.


Traditionally, reliability has been defined as the accuracy and consistency of measurement. A test is considered consistent if it ranks the individual in the same position on successive administrations. Reliability is defined more precisely in the new Standards for Educational and Psychological Testing as the "degree to which scores are free from errors of measurement" (APA, 1985, p. 19). These errors come from unwanted variation, which lowers the test's reliability and, in turn, lessens faith in the observed score.

Sources of error include item sampling, anxiety, fear, effort, guessing, and other factors that affect the test taker's performance. Although no test of educational achievement is free from errors of measurement, a fundamental goal in measurement is to reduce these errors to a reasonable minimum.

The reliability coefficient provides the best statistical index for judging the accuracy of measurement and, subsequently, the quality of a test. A true score with no error would have a reliability coefficient of 1.00. Although this level of perfection is never attained, expertly constructed standardized tests often do yield a coefficient of .90. In contrast, classroom tests may show reliability coefficients as low as .50, indicating that as much error as accuracy exists in the measurement. The goal set by our course group was a reliability of .70 or better on each exam.

Test reliability is most frequently estimated by an internal consistency procedure. The Kuder-Richardson formula (KR-20) was adopted, which measures the average correlation of all possible split halves. A KR-20 coefficient is based on the proportion of students scoring correctly on each item and the standard deviation of test scores. A computer facilitates calculation of the coefficient, as the KR-20 is tedious to do by hand.
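For readers who want to see the calculation, the sketch below applies the standard KR-20 formula, (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), to a small matrix of right/wrong item scores. The data are hypothetical, and the use of the population variance (dividing by N) is an assumption; a testing service may compute it differently.

```python
def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])

    # p = proportion answering each item correctly; q = 1 - p
    p = [sum(row[i] for row in responses) / n_students for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)

    # Population variance of the students' total scores
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Hypothetical results: 5 students, 4 items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(responses)
print(round(reliability, 2))  # 0.8
```

The function assumes at least two items and some spread in total scores; if every student earns the same total, the variance is zero and the coefficient is undefined.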

Table 1 depicts the comparison of the KR-20 reliabilities on four exam scorings before and after omission of poor items. The reliabilities ranged from 0.62 to 0.68 on the first scoring, to 0.66 to 0.72 on the second scoring. Generally, the reliability will be higher in a longer test (no items omitted) if the items are of similar difficulty and fatigue is not a factor. However, the KR-20 increased in most of the shortened tests (poor items omitted) because measurement error was decreased. When tests are long (e.g., 80 items in Exam III and 81 items in the Final Exam), it is wise to check the last 5 to 10 questions to determine whether most of the students were able to complete the test in the time allowed. If most of the students got the last 10 questions wrong or omitted them, fatigue may have been a factor and reliability will be reduced.

Reliabilities increased with omission of poor items for all tests except Exam I. It is possible that this decrease occurred because the test was easier when 10 items were omitted. This explanation is consistent with the obtained mean increase in the difficulty index from 66.51 to 75.08 after rescoring, indicating that 75% of the students answered the items correctly. A wide spread of difficulty values (very hard to very easy items) also occurred, which typically causes the distribution of test scores to be concentrated and thereby lowers the reliability.

Difficulty Index

Table 2 depicts the test summary statistics for the four exams administered in the fall of 1985. The difficulty index shows the percent of students who marked an item correctly, and thus reflects how easy an item is for students. A test with moderate difficulty, i.e., one in which 50% to 80% of the class answers the items correctly, can be used to maximally differentiate among students. This range holds the best probability for obtaining a high discrimination index (0.40 or higher). However, the ideal difficulty level would be a point on the difficulty scale midway between chance level difficulty (25% correct for four-alternative multiple-choice items) and zero difficulty (100% correct). This means that the proportion of correct responses to a multiple-choice item would be about 62.5%.
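The arithmetic behind the 62.5% figure, and the difficulty index itself, can be sketched in a few lines. The function names and the sample responses are illustrative only.

```python
def difficulty_index(item_scores):
    """Percent of students who answered the item correctly (0/1 scores)."""
    return 100.0 * sum(item_scores) / len(item_scores)

def ideal_difficulty(n_alternatives):
    """Midpoint between chance-level percent correct and 100% correct."""
    chance = 100.0 / n_alternatives
    return (chance + 100.0) / 2

print(ideal_difficulty(4))             # 62.5 for four-alternative items
print(ideal_difficulty(2))             # 75.0 for true/false items
print(difficulty_index([1, 1, 1, 0]))  # 75.0: three of four students correct
```

By the same midpoint rule, a true/false item, with a 50% chance level, would have an ideal difficulty of 75%.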

For any particular test, there is a relationship between the spread of item difficulties, spread of test scores, and the reliability. In general, the wider the range of item difficulty values, the more concentrated the test scores and the lower the reliability. One of our criteria for rescoring was a difficulty level less than 45%, which indicated that the item was too hard. The authors were less concerned about high difficulty levels (over 80%), since a very easy item could reflect content that most students were expected to know and would be omitted if the item did not discriminate among students. Mean difficulty values ranged from 59% to 66% on the first scoring and from 56% to 75% on the rescoring. Slight increases in the mean difficulty values of the rescored tests indicated a more moderate difficulty level than on the first scoring, and these increases had more of an effect on the reliability than did the mean discrimination index.

Generally, content and phrasing of the item distractors determine item difficulty. During the test evaluation process, faculty were asked to carefully inspect the items in order to determine why the item was either too difficult or too easy. The problem might be due to inadequate instruction, unclear phrasing, or implausible distractors, or the item might simply be a good one answered correctly by well-prepared students. Test developers can make multiple-choice items easier by making the stem very general and the responses more diverse; items are made harder by using more specific stems and more similar responses.

Format of the items has a direct relationship to item difficulty. The authors' test consisted primarily of multiple-choice items with four alternatives, a few true/false, and an occasional matching set. Faculty preferred multiple-choice over true/false because of the similarity of the former to credentialing exams and the lack of faculty familiarity with constructing true/false items. True/false items do have merit and can be highly discriminating, which has the effect of increasing reliability.

Essay or open-ended questions are not used because of the time involved in correcting the large number of papers. The use of complex multiple-choice items, which require the student to choose a combination of answers (e.g., a, b, and d), is often confusing and difficult for most students. Also, students may know two out of the three correct responses and receive no credit for that knowledge. Because of the difficulty in correctly answering this type of question, both the discrimination and reliability are lowered.

Discrimination Index

A discrimination index compares the number of correct responses on an item for those scoring in the top 27% of the test with the number of correct responses for those scoring in the bottom 27%. Some testing centers use the upper and lower fourths. However, after numerous statistical analyses of tests, our institution found that using 27% as the upper and lower cutoff provided a more reliable discrimination index. Because we were concerned with norm-referenced interpretations, it was important that the tests be able to discriminate between the well-prepared and the poorly prepared students. One of the best uses that can be made of the indices of discrimination is the selection of previously high discriminating items for inclusion in a new test.
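The comparison of the top and bottom 27% groups can be sketched as below. This assumes the common upper-lower form of the index, in which the proportion correct in the lower group is subtracted from the proportion correct in the upper group; the article does not spell out the exact formula used by the testing service, and the scores shown are hypothetical.

```python
def discrimination_index(total_scores, item_scores, fraction=0.27):
    """Upper-lower discrimination index for one item.

    total_scores: each student's total test score
    item_scores:  each student's 0/1 score on the item, in the same order
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))

    # Rank students from highest to lowest total score
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = order[:group_size]
    lower = order[-group_size:]

    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical class of 10 students
totals = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
item = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]
d = discrimination_index(totals, item)
print(round(d, 2))  # 0.67
```

With this form of the index, a negative value arises whenever the lower group outperforms the upper group on the item.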

It is difficult for faculty to derive much meaning from discrimination indices when considered alone. Since the discrimination values are sensitive to the kinds of instruction students receive in relation to that test item, the index must be placed in the context of the learning situation. An item that appears highly discriminating with one group of students may be weak or even negative in discrimination with another group. In the authors' case, comparison of the item discrimination index from one semester to the next sometimes revealed diverse values, particularly when content was taught differently because of faculty turnover. To correct this situation, lectures will be presented by tenure track faculty only. This restriction reduces the number of lecturers and, hopefully, will improve the consistency of lecture content.

Small samples also weaken the confidence placed in the discrimination index. Since there were always more than 100 students for any one test, the error encountered was not likely attributable to sample size.


TABLE 2: Test Statistics, Fall 1985

The criterion for reusing an item in a test was a previous discrimination index of 0.30 or higher. The item writer was asked to rework the item before including it in a new test if it did not meet this criterion. New items received the same scrutiny as the reworked items in the test review process. The criterion for omission of an item from the test before rescoring was a discrimination value less than 0.20. A negative discrimination is particularly undesirable. It indicates that students who earned high scores on the test performed more poorly on that item than those who earned low scores on the test. Guidelines for acceptable indices of discrimination may be found in "Essentials of Educational Measurement" (Ebel, 1979, p. 267).

The mean discrimination values for the authors' exams are depicted in Table 2. Values ranged from 0.18 to 0.23 on the first scoring and 0.22 to 0.23 on the rescoring. Since the mean value reflects the entire discrimination distribution, it is more meaningful to look at individual item discrimination values. The low discrimination index for some test items helped to partially explain the lower reliabilities. In reality, it is not unusual for teacher-made tests to have a mean discrimination of 0.20. However, efforts should be made to increase the item discrimination. This may be accomplished by using clear phrasing with more plausible distractors, without making the item more difficult.

For all tests, the mean and median were nearly the same, indicating that the score distribution was not skewed. Scores spread across nearly six standard deviations (STD. DEV.), indicating a fairly normal distribution. The standard error of measurement (SEM) decreased for all exams when the poor items were omitted. The SEM reflects the amount of error on each side of a raw score using raw score units. The SEM provides another indication for the absolute accuracy of the test scores (Ebel, 1979). Given the standard deviation for a set of scores and the reliability for those scores, one can estimate the standard deviation of the errors of measurement. Although the SEM was reviewed in addition to the other test statistics, more focus was placed on the reliability coefficient to provide information as to the accuracy of the tests. Both the SEM and the reliability coefficient have shortcomings. The reliability coefficient not only depends on the quality of the test, but also on the variability of the group being tested. However, the SEM is almost entirely dependent upon the number of items in the test and not upon the test quality. A proposed solution is to make use of both test statistics for grading purposes when a student's score lies on the border of the cut score. Since the authors' cut scores are relatively generous at this time, they chose not to use the SEM.
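The estimate mentioned above, obtained from the standard deviation and the reliability of a set of scores, follows the standard formula SEM = SD * sqrt(1 - reliability). The sketch below uses hypothetical values, not figures from Table 2.

```python
import math

def standard_error_of_measurement(std_dev, reliability):
    """SEM = SD * sqrt(1 - reliability), expressed in raw score units."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return std_dev * math.sqrt(1.0 - reliability)

# Hypothetical exam: standard deviation of 8 points, reliability of 0.75
sem = standard_error_of_measurement(std_dev=8.0, reliability=0.75)

# Band of one SEM on each side of a hypothetical raw score
raw_score = 52
band = (raw_score - sem, raw_score + sem)
print(sem)   # 4.0
print(band)  # (48.0, 56.0)
```

A band of this kind is what a faculty group would consult if it chose to apply the SEM to scores lying near a cut score.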

Determining Cut Scores

Statistical procedures can help determine what score on the test marks achievement of the knowledge level identified by the faculty. However, determining achievement of a knowledge level always requires that faculty compromise in order to arrive at a relatively arbitrary set of standards. When the cut score for passing is set high, a large proportion of students will fail. When the cut score is set too low, students may not have achieved the instructional objectives of the course. This setting of standards involves three factors: previous experience, political pressure (by the educational institution or by the profession itself), and individual judgment (Cronbach, 1984).

All three factors impacted on our decision-making process. The faculty was able to draw on a large experience pool from previous semesters in establishing grading policies. The political pressure felt actually came from the faculty who taught the next nursing course. Too many students were lacking the prerequisite knowledge to succeed in the subsequent course. Ultimately, faculty judgment played the predominant role.

After several discussion periods, faculty agreed to change the method of determining the passing score from percentage of students to be passed to percentage of the total score. Cut scores were chosen according to standard deviation units under the normal curve. In a normal distribution, the range of test scores is somewhere near six standard deviations. Standard scores (T scores) were used instead of raw scores. This provided a common scale for comparing the results from different tests without changing the shape of the distribution.

After much discussion, the faculty chose to use the cut scores shown in Figure 1. Using the standard scores, any score of 29 or below was failing, 30-39 a D, 40-49 a C, 50-59 a B, and any score of 60 or above an A. Since the test reliabilities did not always reach the goal of .70, generous cut scores were used in order to guarantee fairness to students. As the authors' efforts to increase the reliabilities come to fruition, they will consider adjusting the cut scores.
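The conversion from raw scores to standard scores and letter grades can be sketched as follows. It assumes the usual T-score convention (mean of 50, standard deviation of 10), which is consistent with the grade bands above but is not stated explicitly in the text; the handling of fractional T scores at the band edges and the raw scores themselves are also assumptions made for illustration.

```python
import statistics

def t_scores(raw_scores):
    """Convert raw scores to T scores (mean 50, standard deviation 10)."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)  # population SD of the class
    return [50 + 10 * (x - mean) / sd for x in raw_scores]

def letter_grade(t):
    """Grade bands from the text; fractional T scores fall in the band below the next cut."""
    if t >= 60:
        return "A"
    if t >= 50:
        return "B"
    if t >= 40:
        return "C"
    if t >= 30:
        return "D"
    return "F"

# Hypothetical raw scores for five students
raw = [40, 50, 60, 70, 80]
grades = [letter_grade(t) for t in t_scores(raw)]
print(grades)  # ['D', 'C', 'B', 'B', 'A']
```

Because the bands are defined in standard deviation units, a student scoring exactly at the class mean receives a T score of 50 and falls at the bottom of the B range.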


Although team teaching is the most frequently used teaching strategy in an integrated curriculum (Griffith, 1983), it can create many problems for testing. Synthesizing the diverse content and translating this content into discriminating items can prove to be a difficult process. In addition, changes needed to improve classroom exams may be met with resistance by some faculty. Removing elements of threat and instilling the awareness of need must be part of the change process.

The greatest gain in reliability resulted from the peer review process, the first of the changes to be implemented. Test reliabilities ranged from .53 to .65 at that time. Requiring questions to meet certain inclusion criteria also appreciably increased reliability. The rescoring procedure, while resulting in only a modest improvement in reliability, yielded the fairest outcome for the well-prepared students and communicated to them a concern to have fair tests. Tests will never be popular with students. That, in itself, does not relieve the faculty from the responsibility of employing sound testing techniques.

FIGURE 1: Determination of Cut Scores.



  • American Psychological Association, Inc. (1985). Standards for Educational and Psychological Testing. Washington, DC: Author.
  • Bloom, B.S. (1956). Taxonomy of Educational Objectives, Handbook I: Cognitive Domain. New York: David McKay.
  • Cronbach, L.J. (1984). Essentials of Psychological Testing. New York: Harper and Row.
  • Ebel, R.E. (1979). Essentials of Educational Measurement. New Jersey: Prentice-Hall.
  • Griffith, J.W. (1983). Team Teaching: Philosophical Considerations and Pragmatic Consequences. Journal of Nursing Education, 22(8), 342-344.

