Including Students with Disabilities in Large-Scale Assessment:
Executive Summary

The Decision Framework

The paper titled "A Decision Framework for IEP Teams Related to Methods for Individual Student Participation in State Accountability Assessments" describes a systematic framework for IEP teams to determine the most suitable way for students with disabilities to participate in the annual statewide assessments consistent with the IDEA statute and regulations. A critical point is made in the paper that IEP teams must determine how students with disabilities participate in statewide assessments for accountability, not whether they participate. Decisions are to be made for each student individually and not linked to a disability category, classroom placement, or the student’s involvement in instruction related to functional or daily living skills. Furthermore, the participation decision is to be made for each academic subject separately (e.g., reading, mathematics).

Although four testing options are currently available and a fifth testing method has been proposed, individual states may adopt and present the methods differently; therefore, IEP teams need to be familiar with the testing methods available for students with disabilities in their state. Their deliberations about testing methods must be based on a systematic decision-making process that takes into account the need for accommodations, for alternate assessments, and for alternate assessments based on alternate achievement standards or assessments based on modified achievement standards if available in the state, relative to the testing method used. The framework draws attention to the requirement that students with disabilities have access to the general curriculum (IDEA, 1997 and 2004) and re-emphasizes an important condition of assessment: Students need the opportunity to learn the material on which they will be tested. By adhering to the seven principles described in the paper referenced above, IEP teams can ensure that students receive instruction based on grade-level academic content, and they can promote instructional practices supported by research.

In their decision-making about how students should participate and their selection of assessment methods, IEP teams are directed to consider the educational needs of each student by answering five questions. The excerpt from Table 2, below, displays how the framework links the five questions to the choice of testing methods. Ultimately, IEP teams need to base their assessment recommendations on students’ responses to special education, interaction with text, instructional supports, and accommodations and assistive technologies used in the administration of an assessment.

Table 2

Decision Framework for Individualized Education Program Teams in Choosing an Assessment Method*

Foundation for Content Assessed: Based on Grade-level Achievement Standards
Testing methods:
  • Regular assessment
  • Regular assessment with accommodations
  • Alternate assessment

Foundation for Content Assessed: Based on Other Achievement Standards
Testing methods:
  • Assessment with modified achievement standards
  • Alternate assessment with alternate achievement standards

Question 1: In what way does the student access the general curriculum?
  • Grade-level achievement standards (regular assessment, regular assessment with accommodations, or alternate assessment): The student shows progress in the full scope and complexity of the grade-level curriculum, although the student may not yet be on grade level.
  • Modified achievement standards: The student can make progress toward, but may not reach, grade-level achievement standards in the same timeframe as other students, and changes in the breadth and depth of the materials taught would facilitate his or her access to the general education curriculum and the grade-level content standards.
  • Alternate achievement standards: Due to significant cognitive disabilities (e.g., in memory, transfer of learning), the student needs extensive prioritization.

Table continues with the remaining questions.

* See Table 2 in "A Decision Framework for IEP Teams Related to Methods for Individual Student Participation in State Accountability Assessments," for the complete table. The excerpt here is provided as an example of the questions that should guide the choice of assessment.
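
To make the structure of the excerpt above concrete, the following sketch (written in Python; the class, answer categories, and mappings are hypothetical paraphrases of Table 2, not part of the framework paper) shows one way the link between a guiding question and the five testing methods could be represented. It is a schematic only; actual decisions rest with each IEP team and with the methods the state has adopted.

    from enum import Enum

    class TestingMethod(Enum):
        """The five testing methods named in the decision framework."""
        REGULAR = "Regular assessment"
        REGULAR_WITH_ACCOMMODATIONS = "Regular assessment with accommodations"
        ALTERNATE_GRADE_LEVEL = "Alternate assessment (grade-level achievement standards)"
        MODIFIED_STANDARDS = "Assessment based on modified achievement standards"
        ALTERNATE_STANDARDS = "Alternate assessment based on alternate achievement standards"

    # Hypothetical answer categories for Question 1 ("In what way does the
    # student access the general curriculum?"), paraphrased from Table 2.
    QUESTION_1_TO_METHODS = {
        "full scope and complexity of the grade-level curriculum": [
            TestingMethod.REGULAR,
            TestingMethod.REGULAR_WITH_ACCOMMODATIONS,
            TestingMethod.ALTERNATE_GRADE_LEVEL,
        ],
        "progress toward grade level with changed breadth and depth": [
            TestingMethod.MODIFIED_STANDARDS,
        ],
        "extensively prioritized content due to significant cognitive disability": [
            TestingMethod.ALTERNATE_STANDARDS,
        ],
    }

    def candidate_methods(question_1_answer: str) -> list[TestingMethod]:
        """Return the testing methods consistent with an IEP team's answer to
        Question 1; the remaining questions (not shown) narrow the choice
        further, and the decision is made separately for each subject."""
        return QUESTION_1_TO_METHODS.get(question_1_answer, [])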

A Model for Including Students With Disabilities in Large-scale Assessment Systems

As discussed in the paper titled "Validity Evidence," the underlying premise of testing the academic achievement of students with disabilities is that such testing can be valid if certain conditions are met satisfactorily. It is important always to frame validity in terms of the following two questions:

  1. How valid is the interpretation of a set of test scores?
  2. How valid is it to use this set of test scores in an accountability system?

The model for inclusion developed in the series of papers has three main components that a state education agency, local school district, and school would implement: (1) a systematic decision-making framework for determining the population of students appropriate for each of the assessment methods; (2) an approach to assessment; and (3) a validity argument that includes specific types of evidence to be collected when making changes in the approach to assessment to ensure full participation of all student populations (see Table 3).

Table 3

A Model for Including Students With Disabilities in Large-scale Assessment Systems

(1) Decision-Making for Participation

Five methods of assessment for students with disabilities:
  • Regular assessment
  • Regular assessment with accommodations
  • Alternate assessment based on grade-level achievement standards
  • Assessment based on modified achievement standards
  • Alternate assessment based on alternate achievement standards

(2) Assessment Approaches

Testing approaches within a statewide assessment system:
  • Multiple choice
  • Short constructed response
  • Rating scales and checklists
  • Portfolios
  • Performance tasks and events

(3) Collection of Evidence to Support Claims and Inferences

Technical evidence used in the validity argument:

Procedural evidence (how assessment decisions and processes are implemented)
  • Test development and administration
  • Alignment
  • Standard setting

Statistical evidence (empirical outcomes that result from implementation)
  • Reliability evidence (e.g., internal consistency, inter-rater agreement)
  • Item statistics (e.g., difficulty and differential item functioning)
  • Validity evidence (e.g., internal structures, response processes, and relationships with other variables)
  • Construct validity evidence (e.g., construct under-representation and construct-irrelevant variance)

The key elements of the model include defining the population of students with disabilities who need to be included (in each of the five methods) in a large-scale assessment system, identifying the testing approach or approaches that have been adopted statewide, and collecting technical evidence supporting the validity argument in relation to any changes made in the testing approach. The model (1) focuses on a total assessment system in which students participate in any number of ways, and (2) is based on an iterative validation process of making claims about assessment approaches and then collecting evidence to support the claims.

In assembling the evidence, a number of specific guidelines are drawn from the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) that address reliability evidence (such as internal consistency and inter-judge agreement) and validity evidence (content-related evidence, response processes, internal structures, and relations to other variables). The specific types of evidence depend on the decision-making framework used to include students with disabilities (column 1 in Table 3) and the approach to assessment (column 2 in Table 3). Students with disabilities are a diverse group; each student needs to be considered individually by his or her IEP team as it recommends appropriate participation in the large-scale assessment program. Likewise, the assessment approach used to measure student performance on grade-level content standards also needs to be considered, as it directs the kind of evidence that can and should be collected, given that each approach makes certain assumptions and relies on certain strategies to measure achievement. In short, the validity evidence collected depends upon how students with disabilities participate and how the state enacts its large-scale assessment.
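
As a minimal illustration of the reliability evidence named above, the following sketch (Python with NumPy; the data, scores, and function names are hypothetical) computes two common indices: coefficient alpha as an estimate of internal consistency and simple percent agreement between two raters scoring the same responses.

    import numpy as np

    def coefficient_alpha(item_scores: np.ndarray) -> float:
        """Internal-consistency (Cronbach's alpha) estimate for a matrix of
        item scores (rows = students, columns = items)."""
        k = item_scores.shape[1]                         # number of items
        item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
        total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def percent_agreement(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
        """Exact agreement between two raters scoring the same set of
        responses (e.g., portfolio entries or performance tasks)."""
        return float(np.mean(rater_a == rater_b))

    # Hypothetical data: 5 students x 4 dichotomously scored items, plus two
    # raters scoring the same 5 constructed responses on a 0-3 scale.
    items = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 1],
                      [1, 1, 1, 1],
                      [0, 0, 0, 1],
                      [1, 1, 1, 0]])
    rater_a = np.array([2, 3, 1, 0, 2])
    rater_b = np.array([2, 3, 2, 0, 2])

    print(f"coefficient alpha: {coefficient_alpha(items):.2f}")
    print(f"inter-rater agreement: {percent_agreement(rater_a, rater_b):.2f}")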

Providing appropriate accommodations to students with disabilities on the regular large-scale assessment still allows educators to make inferences about proficiency on state content standards that are comparable to the inferences made when accommodations are not provided. At some point, however, changes become significant enough to alter the breadth and/or depth of how grade-level content is measured. Such significant changes shift the inferences that are warranted, and the changes become part of alternate assessments judged against different achievement standards. The series of papers takes up two major issues when changes are made: (1) distinguishing between test accommodations that allow comparable inferences from the assessment and changes that result in different inferences, and (2) making changes in breadth and/or depth that maintain links to grade-level standards.

Changes in Testing

Often, changes can be made that involve using supports when administering tests (e.g., assistive technologies, prompts, or scaffolds) to remove construct-irrelevant variance and maintain the meaning of the construct being measured. When such changes are made, they can be considered accommodations and allow educators to make inferences comparable to those for assessments administered without accommodations. A list of such accommodations should be made available in administration manuals to give test users an explanation of each accommodation and the conditions under which it can or should be applied. In addition, technical documentation should be provided on the empirical evidence supporting the effects of using the accommodations. Both types of evidence need to support making the same inferences of proficiency as when no accommodations are present.
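
One simple form the empirical evidence on accommodation effects can take is a comparison of scores from accommodated and standard administrations. The sketch below (Python with NumPy; the accommodation, groups, and scores are hypothetical) computes a standardized mean difference, a single piece of evidence a technical manual might report alongside more rigorous analyses such as differential item functioning studies.

    import numpy as np

    def standardized_mean_difference(accommodated: np.ndarray,
                                     standard: np.ndarray) -> float:
        """Cohen's d comparing scores from accommodated and standard
        administrations, using a pooled standard deviation."""
        n1, n2 = len(accommodated), len(standard)
        pooled_var = ((n1 - 1) * accommodated.var(ddof=1) +
                      (n2 - 1) * standard.var(ddof=1)) / (n1 + n2 - 2)
        return float((accommodated.mean() - standard.mean()) / np.sqrt(pooled_var))

    # Hypothetical scale scores from a read-aloud accommodation condition and
    # from standard administration conditions.
    accommodated = np.array([212.0, 220.0, 205.0, 231.0, 218.0])
    standard = np.array([215.0, 224.0, 209.0, 228.0, 221.0, 214.0])

    print(f"effect size (d): {standardized_mean_difference(accommodated, standard):.2f}")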

When changes to the way tests are administered or taken modify the breadth and/or depth of items, the content of the test itself is being changed. With these kinds of changes, an alternate assessment is being considered, and the critical issue is to determine which achievement standard is being applied. Whether a modified or an alternate achievement standard is used to judge proficiency, the inferences about proficiency are not the same as when the test is given with or without accommodations. Although grade-level content standards are still being used, their breadth and/or depth has been changed enough to warrant constraints on the inferences that can be made. The difference in the inferences between these two achievement standards lies in the procedural and empirical evidence collected. This evidence needs to be provided in both the technical documentation and the reporting systems.

Skill Development in Achievement Testing

Skills and knowledge from content standards typically evolve gradually across grades; as a consequence, it is difficult to develop items or tasks for a given grade ("on-grade items") that are unique and not relevant for adjacent grades ("cross-grade items"). To reflect this progression of skills, different regular assessment test forms can be created specifically for each grade to tap grade-level content standards and then statistically linked through a vertical scaling or linking process. A scaling process is generally one in which raw test scores (usually the total number of correct responses) are transformed into standardized scores with a particular mean and standard deviation. With vertical scaling, common (anchor) items are administered in more than one grade-level test (e.g., an item appears on both the third- and fifth-grade tests) so the score in each grade can be compared to scores from previous and subsequent grades. As a consequence, assessment scores across the grades can be placed on the same scale, and changes in value can be treated as equal intervals. This linking provides a common scale for showing growth across grades and reflects the idea that skills develop in a sequence (e.g., a difficult item in an earlier grade becomes an easy item in a subsequent grade). The items within the scale measure the same construct, and scores are typically used to track yearly progress between adjacent grades.
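
The sketch below (Python with NumPy; the grades, scores, and data are hypothetical) illustrates a simplified, classical version of the linking idea described above: statistics from the common anchor items, as taken by students in adjacent grades, yield a linear transformation that places lower-grade scores on the upper-grade scale. Operational vertical scaling typically relies on item response theory rather than this mean/sigma shortcut.

    import numpy as np

    def linear_linking_constants(anchor_lower: np.ndarray,
                                 anchor_upper: np.ndarray):
        """Slope and intercept that map lower-grade anchor scores onto the
        upper-grade scale: linked_score = slope * score + intercept."""
        slope = anchor_upper.std(ddof=1) / anchor_lower.std(ddof=1)
        intercept = anchor_upper.mean() - slope * anchor_lower.mean()
        return slope, intercept

    # Hypothetical totals on the same eight anchor items: relatively hard for
    # grade 3 students, relatively easy for grade 4 students.
    grade3_anchor = np.array([3, 4, 4, 5, 2, 6, 5, 3, 4, 5], dtype=float)
    grade4_anchor = np.array([5, 6, 7, 6, 4, 8, 7, 5, 6, 7], dtype=float)

    slope, intercept = linear_linking_constants(grade3_anchor, grade4_anchor)

    # Place a grade 3 score on the common vertical scale so growth between
    # adjacent grades can be reported on one scale.
    grade3_score = 4.0
    print(f"grade 3 score {grade3_score} -> {slope * grade3_score + intercept:.1f} on the vertical scale")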

Under certain conditions, these cross-grade items might be acceptable for alternate assessments based on grade-level achievement standards or assessments judged against modified achievement standards. The papers offer a series of questions and criteria that can be used to help gauge the degree to which cross-grade items are suitable. They may be appropriate where assessments are aligned with grade-level content standards, have been linked to cover a common cross-grade core of the curricula, and do not represent a major departure from the construct being assessed (thus providing procedural evidence). Furthermore, statistical evidence needs to be collected to support a vertical scale, comparing the performance of students who take these items on grade level with that of students who take them as cross-grade items.
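
As one simple piece of the statistical evidence mentioned above, the following sketch (Python with NumPy; the response data and tolerance are hypothetical) compares classical item difficulty, the proportion of students answering correctly, for items taken on grade level versus as cross-grade items, and flags items whose difficulty shifts by more than a chosen tolerance.

    import numpy as np

    def item_difficulty(responses: np.ndarray) -> np.ndarray:
        """Proportion correct per item (rows = students, columns = items)."""
        return responses.mean(axis=0)

    def flag_shifted_items(on_grade: np.ndarray,
                           cross_grade: np.ndarray,
                           max_shift: float = 0.20) -> np.ndarray:
        """Indices of items whose difficulty differs between the two groups
        of test takers by more than a (hypothetical) tolerance."""
        shift = np.abs(item_difficulty(on_grade) - item_difficulty(cross_grade))
        return np.flatnonzero(shift > max_shift)

    # Hypothetical 0/1 response matrices for four items shared across grades.
    on_grade = np.array([[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]])
    cross_grade = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0], [0, 0, 1, 1]])

    print("items needing review:", flag_shifted_items(on_grade, cross_grade))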

This same logic of vertical scaling or linking also may be important for assessments judged against modified achievement standards to ensure progressive levels of achievement across grade levels. Because most skills in reading and mathematics reflect a progression or sequence in which proficiency in later skills builds on proficiency in earlier, requisite skills, this sequence may be articulated as part of the validity evidence collected. Both procedural and statistical evidence would nevertheless need to relate explicitly to the grade-level content standards through the changes in breadth and/or depth.

