Reliability Issues and Evidence
This paper discusses assessment reliability, with emphasis on the benefits of careful assessment design and administration when assessments are used to measure students with disabilities. Our discussion centers on the concept of measurement error, specifically in the context of (a) the process for collecting responses, (b) the scores assigned to observed responses, (c) the decisions made based on these scores, and (d) the reliability of an assessment system. The importance of reliability rests on the need for assurances that assessments are designed and used in ways that minimize unstable response patterns and the corresponding instability in individual and collective examinee scores. Reliable measurement is also a necessary, though not sufficient, condition for validity. Without reliability, it is impossible to determine whether an assessment accurately measures student achievement. The challenge that must be addressed is to offer flexible assessments that can be adapted to different student needs.
Perhaps the most psychometrically technical aspect of assessment, reliability is generally described in terms of score consistency. The Standards for Educational and Psychological Testing define reliability as "the consistency of [such] measurements when the testing procedure is repeated on a population of individuals or groups" (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999, p. 25). Reliability typically refers to the measurement error that is introduced into the "entire measurement process" (p. 27), limits the degree to which generalizations can be made beyond the specific testing event, and quantifies the confidence that can be held in the value assigned to any performance. "Reliability data ultimately bear on the repeatability of the behavior elicited by the test and the consistency of the resultant scores" (p. 31). Specifically, for the purposes of this paper, we are concerned about the reliability (dependability, replicability, etc.) of behavior, scores, and inferences, as well as accounting for types of error.
Error can be classified into two types: (a) systematic and (b) unsystematic (random). Systematic error raises validity issues; unsystematic error raises reliability issues. Reliability is related to measurement error, which "almost always refers to the random component of error" (Feldt & Brennan, 1989, p. 105). Because large-scale assessments involve so many steps in development, implementation, and analysis, unsystematic error enters the process in many different ways. Obviously, the use of performance assessments and testing accommodations can introduce a host of additional sources of error beyond the student and item. The design and development of items and tasks may introduce unsystematic error; for example, performance tasks, while considered comparable, may render alternate forms nonequivalent. Unsystematic error can also result from varied assessment implementation by different teachers, in different classrooms, with different students. Finally, the scoring process itself may introduce unsystematic error (e.g., scoring via raters).
Calculating an index of reliability requires quantifying the measurement error associated with (a) observed behaviors and (b) their associated numeric scores. The situation becomes complex when observed behaviors depend on the sampling of items and the manner in which items "elicit" observed behavior. This is true irrespective of whether an item uses a selected-response (SR) or constructed-response (CR) format. Furthermore, assigning numeric values to observed behaviors (that is, scoring and scaling) affects the reliability of the measurement system. Scoring issues pertain to whether the score is very specific (e.g., using a scale from 1 to 500) or very general (e.g., with three score levels, as in conventional classification standards of "does not meet," "meets," or "exceeds"). Ultimately, we need some indication that careful assessment design (item sampling, administration, and scoring) diminishes error.
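The interaction between score granularity and error can be sketched in a short simulation. The cut scores, scale range, and error magnitude below are hypothetical, chosen only to illustrate the point: the same random error that barely moves a score on a fine-grained 1-500 scale can flip a coarse three-level classification whenever a student's true score sits near a cut score.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

CUTS = (440, 470)  # hypothetical cut scores on a 1-500 scale

def classify(score):
    """Map a fine-grained scale score to the conventional three-level classification."""
    if score < CUTS[0]:
        return "does not meet"
    if score < CUTS[1]:
        return "meets"
    return "exceeds"

# Hypothetical true scores clustered in the region of the cut scores.
true_scores = [random.uniform(400, 500) for _ in range(1000)]
ERROR_SD = 10  # hypothetical magnitude of random measurement error

# Two "parallel" administrations: same true score, independent random error.
form1 = [t + random.gauss(0, ERROR_SD) for t in true_scores]
form2 = [t + random.gauss(0, ERROR_SD) for t in true_scores]

# How often does the three-level decision agree across the two forms?
agreement = sum(classify(a) == classify(b) for a, b in zip(form1, form2)) / len(true_scores)
print(f"classification agreement across forms: {agreement:.1%}")
```

Because every disagreement in the simulation is produced by random error alone, the shortfall from 100% agreement is a direct picture of how unsystematic error degrades the consistency of score-based decisions.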
States have designed a range of approaches to assessment so that all students can participate. Yet this range may also introduce unsystematic error and an array of random "nuisance" factors that threaten the psychometric reliability of assessments. Because a state's assessment comprises this wide variety of approaches, and because it is implemented with diverse populations, multiple types of evidence are required to ensure that reliable measures are obtained.
This discussion first provides a conceptual definition of reliability, then identifies sources of error, and finally describes evidence that focuses on measurement reliability and on measurement designs to attenuate measurement error. Error scores, parallel forms, reliability coefficients, and standard error of measurement are the most important concepts in defining reliability. The sources of error often arise from procedural components of the design and delivery of a large-scale assessment; the impact of that error is then documented through statistical analysis. This combination of procedural and statistical evidence forms the first line of defense in developing a validity argument. All tests must be reliable; their reliability, however, does not guarantee the validity of inferences from the results. It would be impractical to cover thoroughly in this paper most of the relevant technical psychometric issues pertaining to measurement reliability. Readers wishing for more technical documentation should refer to the references provided throughout.
Definition of Reliability
It is difficult to appraise reliability without clear definitions. Most important are the concepts of error scores, parallel forms, reliability coefficients, and standard error of measurement.
One of the most traditional conceptualizations is in terms of the true score: "a personal parameter that remains constant over the time required to take at least several measurements" and "the limit approached by the average of observed scores as the number of these observed scores increases" (Feldt & Brennan, 1989, p. 106). Unfortunately, it is impossible to know a person's true score; it must be estimated from the observed score, which provides imperfect information. Therefore, in addition to the observed score, an error score must be theorized. A very simple concept of observed score, true score, and error score is captured in Equation 1.
observed score = true score + error score. (1)
The observed score is composed of two components: (a) the true score and (b) the error score. Both the true score and error score are unobserved and must be estimated.
The concept of error score is at the heart of reliability. The goal of good measurement design is to minimize the error component. In the simple model above (Equation 1), error is assumed to occur randomly. The importance of random error can be recognized if an assessment is used repeatedly to measure the same individual: the observed score would not be the same on each repeated administration. In fact, scores are more or less variable depending on the reliability of the assessment instrument. The best estimate of an examinee's true score is the average of the observed scores obtained from repeated measures, and the variability around that mean is the theoretical concept of error, also called error variance. As noted earlier, measurement error can take the form of either systematic bias, which bears on construct validity, or random error, which bears on reliability. Random error can never be eliminated completely.
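The logic of Equation 1 and of the repeated-measures thought experiment can be sketched directly. The true score and error magnitude below are invented for illustration; the simulation simply generates many observed scores as true score plus random error and shows that the mean of the observed scores recovers the true score while the variability around that mean recovers the error variance.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is reproducible

TRUE_SCORE = 72.0  # hypothetical examinee true score
ERROR_SD = 5.0     # hypothetical standard deviation of random error

# Equation 1 applied on each repetition: observed = true + random error.
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(500)]

# The mean of repeated observed scores is the best estimate of the true score.
estimate = statistics.mean(observed)

# The variability around that mean estimates the error variance.
error_sd_estimate = statistics.stdev(observed)

print(f"mean of observed scores: {estimate:.1f} (true score: {TRUE_SCORE})")
print(f"estimated error SD: {error_sd_estimate:.1f} (generating SD: {ERROR_SD})")
```

With more repetitions the mean converges toward the true score, which mirrors the definition quoted above: the true score is "the limit approached by the average of observed scores as the number of these observed scores increases."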