Standards and Assessment Approaches for Students With Disabilities Using a Validity Argument
The purpose of this paper is to illustrate the validation process for large-scale assessments, using the standards-based assessments of two states, with attention to minimizing construct-irrelevant variance and construct underrepresentation. The term "construct-irrelevant variance," as it applies to standards-based assessments, means that the test measures too many variables, many of which are irrelevant to the content standards. The term "construct underrepresentation" indicates that the tasks measured in the assessment fail to include important dimensions or facets of the content standards. The validation process emphasizes the decisions made in designing an assessment and the collection of evidence, both procedural and empirical, to evaluate not just an outcome but also all of the assumptions and decisions made in creating and administering assessments and in scoring students’ performance on specific tasks (items). Procedural evidence focuses on test development, the quality of the items and tasks, the assemblage of "items" into the total test, and the administration and scoring process. Empirical evidence documents content coverage (alignment between the content standards and the assessment), the stability and consistency in sampling behavior over replications, assessment "item" functioning, reliability of judgments and scoring, internal relations among assessment items, response processes, and external relations with other assessments.
In the first section, we describe how to determine the validity of accommodations. Test validity generally refers to the degree to which the inferences about students’ proficiency based on test scores are meaningful, useful, and appropriate. We begin by considering construct-irrelevant variance that might arise from making changes in the general education assessments with the use of accommodations. State standards are the central constructs and they become operationalized through the use of large-scale assessment systems. These assessment systems need to be analyzed carefully to identify the introduction of construct-irrelevant variance into the determination of proficiency. As elaborated by Messick (1989) in his extensive essay on test validity, the validity argument involves systematically collecting and using evidence to evaluate a claim of proficiency based on scores from standards-based assessments.
The second section of the paper compares two states’ standards and alternate assessments to highlight the nature of an "item" or task within an assessment system; the assessment approach in the first state is portfolio-based, and in the second state it is task-based. In each of these systems, we consider the kinds of procedural and empirical evidence that need to be collected to evaluate the validity claim that performance on the large-scale assessment is an adequate indicator of proficiency. We focus in particular on construct underrepresentation in this analysis.
The third section of the paper presents seven principles for developing items and tasks to ensure the focus is on grade-level content standards. These principles should be used to guide the process for developing alternate assessments, whether they are judged against grade-level, modified, or alternate achievement standards. We consider the grade-level focus for developing tasks; the breadth, depth and complexity of the items and tasks; the overlap across participation options; development across grade levels; the need for universal design; and finally, what students can do and its relationship to scoring. These seven principles should allow states to develop an assessment system that is completely inclusive and seamlessly integrates all participation options.
The fourth section of the paper operationalizes these principles using an example of reading assessment in which we make changes to accommodate students with disabilities participating in large-scale assessments judged against grade-level content standards, as well as substantial changes that become part of the alternate assessment. We focus on two types of changes: changes in the supports provided (assistive technologies, prompts and scaffolds), and changes in breadth, depth and complexity. Three important components of the assessment model are drawn together in this paper:
- A validation process is articulated using an argument with claims, assumptions, and evidence to evaluate the inferences that are made from the performance of students on assessment tasks representing selected domains of knowledge and skills (i.e., content standards). The process must begin with content standards that need to be operationalized into tasks (items) used to assess student proficiency. The collective tasks represent an approach to assessment.
- This approach to assessment is then analyzed in its administration for students with disabilities who require appropriate assessment accommodations. If members of the Individualized Education Program (IEP) team deem an alternate assessment necessary, other approaches are then analyzed that rely on indirect teacher judgments (using rating scales and checklists), portfolios, performance events and performance task collections.
- The assessment system as a whole is finally considered as representing a range of options for the participation of individual students, each of whom can access the general education content standards based on their unique needs. Ideally, the validation process is supportive in both process and outcomes, but the results are tentative and require further attention if any changes are made.
In the fifth section, conclusions and recommendations address these three issues. The recommendations make the validity argument explicit in its assessment approach (i.e., the process of operationalizing content standards into tasks or test items used to assess student proficiency); the options in which students participate (i.e., participate in the regular assessment with, or without, accommodations, or participate in an alternate assessment); and the manner in which changes are made (i.e., changes made to a test item, task, or format).
Construct-Irrelevant Variance and the Need for Accommodations
To understand a validity argument, it is essential to have a clear idea of which construct is being tested, because that construct forms the basis of the claim of validity. For example, the standards from Massachusetts and Oregon shown below are paired with a mathematics problem from a fifth-grade state practice test: a story problem presented as a multiple-choice item. It is essential to know whether this test item also contains other constructs that are irrelevant to the mathematical construct being tested.
To answer the question we must consider (1) the construct being assessed; (2) the knowledge and skills reflected in the specific tasks and the manner in which this knowledge and these skills are sampled, formatted and scored; and (3) the use of test scores to make inferences about the teaching and learning process as well as the accountability system (relative to the construct). The validity claim is that the test adequately reflects the domain of knowledge and skills of the standards and can be used as the basis for the inference of proficiency.
Table 1. Construct in an Example of a Mathematics Standard and an Assessment Problem (see footnote 1)
||Add and subtract decimals to hundredths, including money amounts.
||Select and use appropriate operations (addition, subtraction, multiplication and division) to solve problems, including those involving money.
||Tommy bought 4 shirts for $18.95 each and 3 pairs of pants for $21.49 each. What was the total Tommy spent?
||A) $11.33 B) $135.97 C) $132.27 D) $140.27
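The arithmetic the item requires can be checked with a minimal sketch. The source does not state the answer key, so the keyed choice below is only what the computation suggests; the variable names are illustrative and not part of the original item.

```python
from decimal import Decimal

# Quantities and prices as printed in the item text.
shirts = 4 * Decimal("18.95")   # cost of the shirts
pants = 3 * Decimal("21.49")    # cost of the pants
total = shirts + pants          # total amount spent

# Answer choices as printed in the item.
choices = {
    "A": Decimal("11.33"),
    "B": Decimal("135.97"),
    "C": Decimal("132.27"),
    "D": Decimal("140.27"),
}

# Labels of the choices that match the computed total.
keyed = [label for label, value in choices.items() if value == total]
print(total, keyed)
```

Decimal arithmetic is used so that money amounts are compared exactly rather than as binary floating-point values. Note that the solution takes two multiplications followed by an addition, all with decimals to hundredths.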
The validity argument considers whether the task presented on the large-scale assessment appropriately measures the domain of achievement or whether it is misrepresented or underrepresented as described in Table 2.
Table 2. Validation Claim and Questions Supported by Evidence
| Achievement (domain of tasks)
|| Does the math story problem include other constructs or rely on access or prerequisite skills that prevent students from displaying their knowledge and skill?
|| Does the math story problem adequately represent the kind of mathematics operations needed to solve money estimation problems in the presence of suitable distracters (i.e., irrelevant elements of the problem)?
In this simple mathematics problem, reading may be a source of construct-irrelevant variance that impedes our efforts to measure the mathematical knowledge and skills as applied in this limited situation (a printed math story problem). However, if we had used a performance task to measure achievement (an open-ended problem requiring the student to write his or her answer), then writing may have become a source of construct-irrelevant variance. If we had required a demonstration of money estimation in a local store or in the community, a host of other factors that are part of the assessment (the type of store in which we shopped, the presence of others at the checkout, the bills being used, etc.) would then have become sources of construct-irrelevant variance.
Construct-irrelevant variance can arise from several sources, including the unique needs of students with disabilities or groups of individuals and how they participate in large-scale assessment systems. This source of variance is systematic and consistently either disadvantages or advantages individuals or groups. For example, if students are allowed only 60 minutes to complete a math test, students with poor reading skills will be consistently disadvantaged. Or if students are given read-aloud assistance and the tester inadvertently prompts the correct choice by inflection, students taking the test from this person are systematically advantaged. In both examples, math performance is confounded with (influenced by) other characteristics of the measurement process that are irrelevant to the construct being measured.
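The systematic character of this variance can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration: the scoring model, the `reading_load` parameter, and the skill values are invented here and are not drawn from either state's data.

```python
from statistics import mean

def observed_score(math_skill, reading_skill, reading_load=0.5):
    """Hypothetical model: the observed score shrinks as reading skill
    falls short of what the printed item demands (reading_skill < 1.0)."""
    reading_gap = max(0.0, 1.0 - reading_skill)
    return math_skill * (1.0 - reading_load * reading_gap)

# Two hypothetical groups with identical math skill (0-100 scale).
math_skills = [60, 70, 80, 90]
strong_readers = [observed_score(m, reading_skill=1.0) for m in math_skills]
weak_readers = [observed_score(m, reading_skill=0.6) for m in math_skills]

# The gap has the same sign for every matched pair: it is systematic,
# not random noise that averages out.
gaps = [s - w for s, w in zip(strong_readers, weak_readers)]

# Assumed effect of an accommodation that removes the reading demand
# (modeled here simply as reading_load=0.0).
accommodated = [observed_score(m, reading_skill=0.6, reading_load=0.0)
                for m in math_skills]

print(mean(strong_readers), mean(weak_readers), mean(accommodated))
```

In this sketch the two groups differ only in reading skill, yet the weaker readers score lower on every item of matched math skill, and the difference disappears once the reading demand is set to zero. A real validity study would estimate such effects from data rather than assume them.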
As a measure of achievement, the math story problem can also seriously underrepresent the construct by failing to include appropriate operations (addition, subtraction, multiplication or division), steps (making exact change or estimations of change), distracters (elements of the problem that need to be seen as irrelevant), or critical strategies (self-guided actions that the student used but that were not documented).
The validity claim can be threatened by several factors, such as insufficient evidence, and serious social consequences are at stake in making the claim. Misinterpretations could be made (e.g., the student is not proficient in mathematics). Resources could be misdirected (e.g., very complex tasks are used that require intensive manpower to administer and score, and for which reliability-related evidence is found lacking). Tasks could be misrepresented as constructs because measurement specialists, content experts, and special educators fundamentally disagree with (or are uninformed by) each other. Given the limitations of assessments for making inferences about proficiency in cognitive skills using more complex tasks, it is important to emphasize the need for appropriate and credible assessment approaches.
1 Examples excerpted from (1) the Oregon Department of Education’s fifth grade mathematics content standards for computation and estimation, available at: http://www.ode.state.or.us/teachlearn/real/Standards/Default.aspx (accessed March 25, 2006); and (2) the Massachusetts Department of Education’s fourth grade mathematics content standards for number sense and operations, available at: http://www.doe.mass.edu/frameworks/math/2000/num3.html (accessed March 24, 2006).