U.S. Department of Education: Promoting Educational Excellence for all Americans
OSEP Ideas that Work - U.S. Office of Special Education Programs
Models for Large-Scale Assessment
Instructional Practices

Validity Evidence

In his extensive essay on test validity, Messick (1989) defined validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment" (p. 13). Four essential components of assessment systems need to be considered in making a validity argument: content coverage, response processes, internal structure, and relations to external variables. To put it simply, validity is a judgment about the degree to which each of these components is clearly defined and adequately implemented. Validity is a unitary concept with multifaceted processes of reasoning about a desired interpretation of test scores and subsequent uses of those scores. In this process, we seek answers to two important questions, which are identical whether or not the students tested have disabilities: (1) How valid is our interpretation of a student's test score? and (2) How valid is it to use these scores in an accountability system to make judgments about students’ performance as it relates to a set of content standards?

Validity evidence may be documented at both the item and total test levels. This paper focuses only on documentation of validity evidence at the total test level. At this level, the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999) calls for evidence of the type noted above regarding content coverage, response processes, internal structure, and relations to other variables. Examples for each source of validity evidence are provided using illustrations from some large-scale assessment programs’ technical documentation. We subsequently provide a discussion of each of these sources.

Evidence of Content Coverage

In part, evidence of content coverage is based on judgments about "the adequacy with which the test content represents the content domain" (AERA et al., 1999, p. 11). As a whole, the test comprises sets of items that sample student performance on the intended domains. The expectation is that the items cover the full range of intended domains and that there are a sufficient number of items so that scores credibly represent student knowledge and skills in those areas. Without a sufficient number of items, a potential threat to the validity of the construct exists because the construct may be underrepresented (Messick, 1989).

Once the purpose of the test and intended constructs are determined, test blueprints and specifications serve as the foundation of validity evidence for determining the extent to which the test provides sufficient content coverage. Among other things, Standards for Educational and Psychological Testing (AERA et al., 1999) suggests that specifications should "define the content of the test, the number of items on the test, and the formats of those items" (Standard 3.3, p. 43). Test blueprints that include this information often are released to educators (and the public) via test manuals. Table 1 provides an example of a South Carolina test blueprint for fifth-grade mathematics on a fixed format (pencil and paper) given with or without accommodations for the regular assessment.

What might test blueprints look like for alternate assessments? Those based on standardized tasks or items may include information similar to the regular assessment. For example, administration manuals for Texas’s State-Developed Alternate Assessment II, a pencil and paper test, include test blueprints for reading, writing and math by instructional level (grade or grade band). The blueprint for each subject indicates the number of items used to assess each objective in the Texas Essential Knowledge and Skills (Texas Education Agency, 2004). The manual briefly describes item formats elsewhere. The educator manual for Massachusetts’ portfolio-based alternate assessment, an example with less standardized tasks or items, instructs teachers to include portfolio evidence that relates to specific grade-level standards within the state’s curriculum framework (Massachusetts Department of Education, 2004). Teachers have some flexibility in determining the specific standards for which evidence is provided (e.g., selecting three of the five possible standards). Maximizing the specificity and clarity of instructions to teachers may help standardize the type of evidence for alternate assessments in general and portfolios in particular, thus allowing for more consistent interpretations of scores (as would be desired in large-scale assessment programs).

Table 1

Palmetto Achievement Challenge Tests (PACT) Blueprint for Grade Five Mathematics (South Carolina Department of Education, n.d.)

Distribution of Items

Type of Item          Constructed Response   Multiple Choice   Total on Test
Number of Items       4–5                    32–34             35–39 items
Value of Each Item    2–4 points             1 point
Total on Test         11–13 points           32–34 points      45 points

Distribution of Points by Strand

Strand                        Points per Strand
Numbers & Operations          11–13
Algebra                       7–9
Geometry                      8–10
Measurement                   8–10
Data Analysis & Probability   7–9
TOTALS                        45 points (11–13 constructed response; 32–34 multiple choice)

Note: At least one constructed-response item appears in at least four of the five strands, for a total of 11–13 points. The balance of the points is multiple choice, distributed across the strands according to the points per strand listed above.

When performance assessments such as checklists, rating scales, or portfolios are used, student test scores are based on fewer items than for regular assessments (and therefore on a smaller sample of the intended domain). In these cases, it is especially important that the assessment items or pieces of portfolio evidence be representative of the intended domain (AERA et al., 1999, Standard 4.2) and also specified in the assessment administration instructions. Otherwise, construct underrepresentation is likely to be a threat to validity.

Evidence related to content coverage may become increasingly complex and nuanced, as state assessment systems use universal design principles to shift from a finite series of discrete testing options to a broader continuum consistent with the cascade of options. As such, tables of test specifications may need to be tailored to groups of students who have multiple paths of participation in the cascade of assessment options. These test specifications may need to explicitly describe the populations of students for whom the test is intended as well as their selection criteria. In that case, high-quality items will serve as a foundation for content-related validity evidence at the assessment level. This topic represents an area in which considerable empirical evidence is needed.

Methods for Examining Content Coverage

The Standards for Educational and Psychological Testing (AERA et al., 1999) and other resources (see Downing & Haladyna, 2006) provide overviews of procedures for examining the match between test specifications and the items that comprise the test. The procedures that test developers follow in specifying and generating test items should be well documented (Standard 1.6). In large-scale assessment systems, where multiple forms of each test are needed, item writing and reviewing may continue on an ongoing basis. Trained item reviewers with content area expertise may be given a set of items and asked to indicate which standard the item matches. Comparisons would then be made between item reviewers’ ratings and the item classifications initially made by the test developer. A critical consideration in collecting evidence related to content coverage is the standards upon which the assessment is based. The most tenable judgments of content match will be made if items are compared to specific objectives under a content standard (i.e., the smallest unit of measurement, or "grain size"), whether this match is for the grade-level, modified, or alternate achievement standards. The items in all forms of the alternate assessment need to be aligned with state grade-level content standards.
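The comparison described above between independent reviewers' classifications and the developer's intended item-to-standard assignments can be summarized as a simple agreement rate. A minimal sketch, using invented item IDs and standard codes (the actual review process would involve multiple trained reviewers and more elaborate agreement statistics):

```python
# Hypothetical sketch: comparing one reviewer's standard
# classifications against the test developer's intended assignments.
# Item IDs and standard codes below are invented for illustration.

developer = {"item01": "NO.1", "item02": "NO.2", "item03": "ALG.1",
             "item04": "GEO.2", "item05": "MEAS.1"}

reviewer = {"item01": "NO.1", "item02": "NO.3", "item03": "ALG.1",
            "item04": "GEO.2", "item05": "MEAS.1"}

# Count items for which the reviewer's rating matches the developer's.
matches = sum(developer[i] == reviewer[i] for i in developer)
agreement = matches / len(developer)
print(f"Reviewer-developer agreement: {agreement:.0%}")  # 80%
```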

Evidence of content coverage also may come from methods for investigating the alignment of standards and assessments. For instance, Webb’s (1999) alignment criteria consider not only alignment at the item level but also statistical indicators of the degree to which the test as a whole matches the standards. The range of knowledge correspondence criterion indicates the number of objectives within the content standard with at least one related assessment item; an acceptable range of knowledge exists when at least 50 percent of the objectives for a standard have one or more assessment items (Webb reported that this percentage varied from 0 to 100 percent across standards). Resnick, Rothman, Slattery, and Vranek’s (2003) study of regular assessments in five states used the test blueprint as the basis for the intended content standards and calculated a proportional range statistic similar to Webb’s. They found that the regular assessments in the five states examined tended to have lower range scores than the criterion they established as acceptable. These findings suggest that, as part of their continuous improvement efforts, states should pay particular attention to ensuring that the range of knowledge correspondence for regular assessments falls within acceptable ranges.
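The range of knowledge criterion described above reduces to a simple proportion: the share of a standard's objectives that are matched by at least one item. A minimal sketch, using a hypothetical objective-to-item-count mapping:

```python
# Illustrative sketch of Webb's range-of-knowledge criterion as
# described above: a standard has acceptable range when at least
# 50 percent of its objectives are hit by one or more items.
# The objective names and item counts below are hypothetical.

def range_of_knowledge(items_per_objective):
    """Fraction of a standard's objectives matched by at least one item."""
    hit = sum(1 for n in items_per_objective.values() if n >= 1)
    return hit / len(items_per_objective)

# Six objectives under one content standard; values are item counts.
standard = {"obj1": 3, "obj2": 0, "obj3": 1, "obj4": 2, "obj5": 0, "obj6": 1}

rok = range_of_knowledge(standard)
print(f"Range of knowledge: {rok:.0%}")  # 67%
print("Acceptable" if rok >= 0.5 else "Not acceptable")
```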

The range of knowledge indicator also has been applied to alternate assessments. In an analysis of two states’ alternate assessments, Flowers, Browder, and Ahlgrim-Delzell (2006) found very low levels of range of knowledge match (0 percent to 37 percent of standards met the criterion). In contrast, Roach, Elliott, and Webb (2005) found that the majority of standards on one state’s alternate assessments in reading, language arts, math, science, and social studies met or weakly met the criterion for acceptable range of knowledge. For a variety of reasons, meeting an acceptable range of knowledge correspondence is particularly challenging for alternate assessments. For example, unlike the discrete multiple-choice items on regular assessments, the embedded items that make up portfolio assessments do not readily lend themselves to discrete matching across content objectives. As such, more evidence is needed to establish criteria for acceptable range of knowledge correspondence for alternate assessments, as discussed below.

While Webb’s (1999) range of knowledge criterion indicates the extent to which items match specific objectives within a standard, the Balance of Representation criterion indicates the degree of proportionality with which assessment items are distributed across objectives within a standard. In other words, the criterion indicates the extent to which items are evenly distributed across objectives. Balance of Representation is a calculated index, with 1 indicating a perfect balance and values closer to zero indicating most items were aligned with only a few objectives within the standard. (See Flowers, Browder, Ahlgrim-Delzell, & Spooner, 2006, or Tindal, 2005, for the Balance of Representation formula and calculation instructions.) In an analysis of four states’ mathematics and science assessments, Webb used .7 as the criterion for an acceptable balance, and found a fairly high percentage of standards (71 percent to 100 percent across states and tests) met that criterion. Applying this criterion to one state’s alternate assessment, Roach et al. (2005) found that all of the standards had acceptable results for Balance of Representation. In contrast, Flowers et al. (2006) found that only one of three states’ language arts alternate assessments had any standards that met Webb’s criterion for acceptable or weak balance of representation. The other assessments had 0 percent acceptable balance of representation across the standards. Resnick et al. (2003) also investigated balance using qualitative judgments about "the relative importance of test items [given] to content and skills" (p. 20) in comparison with the importance stated in the standards. Based on responses to a set of open-ended questions, item reviewers rated the balance of sets of items on a scale (good, appropriate, fair or poor). Raters found that very few standards were rated appropriate or higher on language arts and math tests at the elementary, middle and high school levels. 
These findings may be partly attributed to the difficulty of providing an adequate sample of target skills from which to make inferences about the distribution of assessment items across content standards, particularly when the content standards are described at the granular level needed to provide "entry points" for students with the most significant cognitive disabilities (i.e., when a content standard includes multiple objectives that target discrete skills).
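The Balance of Representation index discussed above can be sketched computationally. The function below assumes one commonly reported form of the index (the exact formula appears in Flowers et al., 2006, and Tindal, 2005, cited above), in which the sum runs over the objectives hit by at least one item; the item counts are hypothetical:

```python
# Sketch of one commonly reported form of Webb's Balance of
# Representation index: 1 indicates items spread evenly across the
# objectives hit; values near 0 indicate items clustered on a few
# objectives. Item counts below are hypothetical.

def balance_of_representation(item_counts):
    """Balance index over the objectives hit by at least one item."""
    hits = [n for n in item_counts if n >= 1]
    total = sum(hits)
    o = len(hits)
    return 1 - sum(abs(1 / o - n / total) for n in hits) / 2

even = balance_of_representation([3, 3, 3, 3])   # perfectly even spread
skewed = balance_of_representation([10, 1, 1])   # items concentrated
print(f"Even distribution:   {even:.2f}")    # 1.00
print(f"Skewed distribution: {skewed:.2f}")  # 0.50
```

Against Webb's criterion of .7 mentioned above, the even distribution would be acceptable and the skewed one would not.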

Criteria for Evidence Related to Content Coverage

Regardless of the type of assessment or the achievement standards upon which it is based, similar types of information should be included in test specifications in order to evaluate validity evidence related to content coverage. Standard 3.3 provides a comprehensive list of contents (AERA et al., 1999, p. 43):

The test specifications should be documented, along with their rationale and the process by which they were developed. The test specifications should define the content of the test, the proposed number of items, the item formats, the desired psychometric properties of the items, and the item and section arrangement. They should also specify the amount of time for testing, directions to the test takers, procedures to be used for test administration and scoring, and other relevant information.

While the same kinds of information should be provided across assessments, the criteria established to judge content coverage as acceptable (e.g., sufficient number of items per standard) may vary. For example, the more partitioned a state’s content standards are, the more difficult it may be for assessments to meet Webb’s (1999) suggested Range of Knowledge criterion; using only that criterion as a target for "good" coverage may result in tests that are prohibitively long. The level of specificity of the content standards (e.g., global standards vs. specific grade-level objectives) against which items are aligned also may influence the statistics that are obtained on content coverage. Not until states develop evidence related to content coverage and consistently share that information with the psychometric and special education communities will a sufficient body of evidence exist to establish criteria for what is considered "acceptable" content coverage across the range of assessment options in large-scale assessment systems.