Validity Evidence

Evidence of Response Processes

Considering again the range of assessment formats available for students with disabilities, response processes are the cognitive behaviors required to respond to an item. In constructed response items, such as those seen on many regular assessments, the student's response process is the primary focus and is intended to reflect a range of cognitive dimensions. However, many assessments also depend on the judgments of teachers (e.g., in assembling a portfolio or completing a rating scale) or raters (e.g., of extended response items, performance assessments, or checklists), and an understanding of those persons' response processes also contributes to the validity evidence for test score inferences. When statements about examinees' or raters' cognitive processes are included in validity arguments, evidence about those processes should be described (Standard 1.8). The sections below will guide the collection of such evidence.

Item Specification and Review Procedures

While the test blueprints described in the previous section are based on content and item type, tables of specifications also may be developed by forming a matrix that crosses content with cognitive demand. Each cell in the table shows the number of items used to assess each topic or strand at each level of cognitive demand. States use various frameworks for describing cognitive demand, each with a different number of levels and accompanying descriptors. For instance, the blueprint for South Carolina's PACT tests in science describes six levels of cognitive demand (South Carolina Department of Education, 2001), while Webb's (1999) alignment criterion for depth of knowledge includes four levels. Haladyna, Downing, and Rodriguez (2002) highlight the need for an empirically based and more universally used taxonomy for classifying items by cognitive demand. Although this suggestion was made in the context of classroom assessment, a well-founded taxonomy would also be useful for validity studies of large-scale assessments.
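For a concrete illustration, the short Python sketch below builds a small table of specifications as a matrix of content strands by cognitive-demand levels. The strand names, the four level labels (loosely following Webb's depth-of-knowledge categories), and the item counts are invented for illustration and are not taken from any state's blueprint.

    # Hypothetical table of specifications: content strands crossed with
    # cognitive-demand levels. All names and item counts are invented.
    import pandas as pd

    levels = ["Recall", "Skill/Concept", "Strategic Thinking", "Extended Thinking"]
    strands = ["Number & Operations", "Algebra", "Geometry", "Measurement"]

    # Each cell holds the number of items planned for that strand at that level.
    item_counts = [
        [6, 5, 3, 1],   # Number & Operations
        [5, 6, 4, 1],   # Algebra
        [4, 4, 3, 1],   # Geometry
        [3, 3, 2, 0],   # Measurement
    ]

    spec_table = pd.DataFrame(item_counts, index=strands, columns=levels)
    spec_table["Total items"] = spec_table.sum(axis=1)
    print(spec_table)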

Response Processes of the Student

Students' response processes on most regular assessment items (selected response, short constructed response) may be considered through typical item review procedures. Test developers may start with a table of specifications, and external reviewers' judgments about content and cognitive demand for each item may be compared with what the test developer intended. As with any other type of expert review process, reviewers should be well trained, and their selection, expertise, training, and rating procedures should be thoroughly documented (Standard 1.7).

While expert ratings of cognitive demand provide some indication of students’ use of lower- and higher-order thinking skills on the assessment, they do not rule out the possibility that the student’s response is based on something other than what was intended. Haladyna (1999) describes a range of procedures designed to determine students’ cognitive processes when responding to paper and pencil test items. One commonly used method is a think-aloud procedure, in which students orally describe everything they think about while working through a problem.

Cognitive processes may be more difficult to assess directly for alternate assessments aligned to alternate achievement standards. Think-aloud procedures are not feasible when students cannot communicate verbally, and interview methods — even with the benefit of assistive devices for students who communicate nonverbally — may not work for this population of students. In these cases, direct observational data collection methods may be needed to document possible cognitive processes. Administration of performance assessments may be videotaped, for example, and analyzed for behavioral evidence of particular processes. Videotaped accounts of instructional activities that yield products included in a portfolio also may provide supportive evidence of the student’s response processes. Obviously, such resource-intensive data collection methods would be appropriate for empirical studies about the assessment rather than a component of the assessment given to all students.

One unique approach to standardizing the cognitive demand in alternate assessment is seen in Pennsylvania's Alternate System of Assessment (PASA Project, 2003), in which a series of otherwise standardized performance tasks is administered in one of three ways, depending upon the level of cognitive demand the teacher determines to be appropriate for a student. Level A is designed for the simplest response processes, usually providing a very simple discrimination context and requiring few steps; Level B poses a slightly more complex problem by providing a richer context (elements of a problem) and requiring more steps; Level C requires the most complex response, with the most context provided and the most extended response. To encourage high expectations for student performance, the level of cognitive demand is incorporated into the student's score.
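PASA's actual scoring rules are not described in this document; purely as a hypothetical sketch, the Python fragment below shows one way a level of cognitive demand could be folded into a task score. The level weights and the scoring rule are invented and should not be read as the PASA procedure.

    # Hypothetical only: weighting a raw task score by an assumed factor for the
    # level of cognitive demand. The weights are invented, not PASA's values.
    LEVEL_WEIGHTS = {"A": 1.0, "B": 1.5, "C": 2.0}

    def weighted_task_score(raw_points: int, max_points: int, level: str) -> float:
        """Scale the proportion of points earned by an assumed level weight."""
        if max_points <= 0:
            raise ValueError("max_points must be positive")
        return (raw_points / max_points) * LEVEL_WEIGHTS[level]

    # The same raw performance earns more credit at a higher demand level.
    print(weighted_task_score(4, 5, "A"))   # 0.8
    print(weighted_task_score(4, 5, "C"))   # 1.6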

Response Processes of Teachers and Raters

When someone other than the student is partially responsible for the responses to an assessment, that person's cognitive behavior also exerts some influence on the student's score. Studies of one state's portfolio-based alternate assessment support this assumption: teacher training and understanding of the portfolio scoring system were associated with strong assessment scores (Browder, Karvonen, Davis, Fallin, & Courtade-Little, 2005; Karvonen, Flowers, Browder, Wakeman, & Algozzine, in press). When a teacher administers a performance assessment, completes a checklist, or assembles a portfolio, the teacher's exact responsibilities should be delineated in the assessment's specifications. The influence of judges and raters attenuates measurement reliability, which in turn affects validity.

When assessment administration procedures are more standardized, there is less opportunity for the teacher’s cognitive processes to influence the student’s score. Clearly written, easy-to-follow scripts may help teachers adhere to a prescribed sequence of minimally intrusive prompting on performance tasks. Another basic method to maximize standardization is to have a neutral monitor present during the administration of performance tasks (South Carolina Department of Education, 2005).

Just as they are used with students, think-aloud procedures may be used with teachers and raters. Teachers may be interviewed about the administration of a performance assessment, or asked to think aloud while completing a checklist or rating scale or while compiling a portfolio-based alternate assessment. Think-aloud procedures also may be used to examine how raters follow the prescribed rating process and to identify places where failure to follow the process raises questions about the validity of the score interpretation (Heller, Sheingold, & Myford, 1998). Post-hoc interrater agreement indices also may be used to assure consistency in rating a performance or scoring a product.
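As a brief illustration of post-hoc agreement indices, the Python sketch below computes exact agreement and Cohen's kappa (plus a quadratically weighted variant) for two hypothetical raters scoring the same ten portfolios on a four-point rubric; the scores are invented.

    # Post-hoc interrater agreement on invented rubric scores from two raters.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rater_1 = np.array([3, 2, 4, 4, 1, 3, 2, 4, 3, 2])
    rater_2 = np.array([3, 2, 4, 3, 1, 3, 2, 4, 2, 2])

    exact_agreement = np.mean(rater_1 == rater_2)              # proportion of exact matches
    kappa = cohen_kappa_score(rater_1, rater_2)                # chance-corrected agreement
    weighted_kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")

    print(f"Exact agreement: {exact_agreement:.2f}")
    print(f"Cohen's kappa:   {kappa:.2f}")
    print(f"Weighted kappa:  {weighted_kappa:.2f}")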

Criteria Related to Evidence Based on Response Processes

Considerations in evaluating criteria related to response processes are similar to those for content coverage. While well-established methods exist to collect this evidence, there is not yet a sufficient body of empirical evidence to provide clear standards for what is acceptable. Early studies on alignment (Flowers et al., 2006; Resnick et al., 2003; Roach et al., 2005; Webb, 1999) provide some possible benchmarks but should not be considered gold standards at this point. States should be aware of methodological problems in considering these sources of evidence that might attenuate the statistics obtained (e.g., judging the alignment on Webb’s depth of knowledge criterion when comparing specific assessment items to vaguely worded content standards).

Evidence of the Internal Structure of the Test

The internal structure of an assessment instrument reflects the dimensionality of the score, or the degree to which the outcome can be explained by the format of the problem (e.g., selected-response multiple-choice items versus constructed-response short answers). The dimensionality of the assessment directly influences the clarity with which a construct is defined or can be inferred from the score. The Standards for Educational and Psychological Testing (AERA et al., 1999, pp. 13–15) call for a study of internal structure as part of test validation. For valid test score interpretations and validity generalization, it is expected that (1) the items show some level of internal consistency (Standard 1.11); (2) the internal structure of the test remains stable across major reporting groups (p. 15); and (3) the internal structure of the test remains stable across alternate (and equivalent) forms of the same test (pp. 51–52). This section addresses these three topics.

Inter-Item Consistency

When an assessment consists of items measuring the same construct or tapping the content standards in the same subject area (such as reading or mathematics), those items are expected to show some level of consistency among themselves. In other words, it is desirable that student responses to various items or parts of the test be logically related and not contradict each other in any substantial way.

Internal consistency can be checked by inspecting the inter-item correlation matrix; it is desirable that all inter-item correlations be positive. Internal consistency is also reflected in overall test reliability indices such as KR-20 or Cronbach's alpha. Phillips (2000) reported that a coefficient alpha of at least 0.85 was generally considered by most assessment experts as adequate for standardized tests such as Texas' high school graduation test. It should be pointed out, however, that performance variability among students taking an alternate or modified assessment is typically not as large as the performance variability among students taking the regular assessment; therefore, the coefficient alpha for such an assessment may fall below the 0.85 threshold.
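As an illustration of these checks, the Python sketch below simulates a set of dichotomously scored items driven by a single ability, inspects the inter-item correlation matrix, and computes coefficient alpha (equivalent to KR-20 for dichotomous items). The simulated data are an assumption used only to make the example self-contained.

    # Internal-consistency check on simulated dichotomous item responses.
    import numpy as np

    rng = np.random.default_rng(0)
    ability = rng.normal(size=200)
    difficulty = np.linspace(-1.5, 1.5, 20)
    # Rasch-like response probabilities for 200 students on 20 items.
    prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    responses = (rng.random((200, 20)) < prob_correct).astype(float)

    item_corr = np.corrcoef(responses, rowvar=False)   # off-diagonal values should be positive

    def cronbach_alpha(scores):
        """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
        k = scores.shape[1]
        item_var = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_var / total_var)

    off_diag = item_corr[~np.eye(item_corr.shape[0], dtype=bool)]
    print(f"Minimum inter-item correlation: {off_diag.min():.3f}")
    print(f"Coefficient alpha: {cronbach_alpha(responses):.3f}")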

Inter-Strand Consistency

An assessment (based on grade-level, modified, or alternate achievement standards) may be designed to measure knowledge and skills in a number of separate content strands. In mathematics, for example, the strands may range from simple computations (operations) to more complex topics such as probability and statistics. In other domains, such as the assessment of writing, a strand may be one aspect of writing (such as expository or persuasive), and various strand scores are sometimes obtained by scoring the same student response using different rubrics. Because these separate strands are parts of a larger construct, it is desirable that they be somewhat related to each other, but not exceedingly so: inter-strand correlations are expected to be positive and moderate. Very high inter-strand correlations are not desirable because the strands may then reflect very similar types of skills or abilities and be overly redundant. Table 2 illustrates strand-level descriptive statistics and between-strand correlations for the South Carolina PACT assessment in grade 8 mathematics in 2003 (South Carolina Department of Education, 2003).

Table 2

Inter-Strand Correlation Matrix for PACT Grade 8 Mathematics Assessments

MATHEMATICS

Strand                        Number of Students   Mean Score   Std. Deviation   Min Points/Items   Max Points/Items
Number & Operations           50,070                8.901       3.849            0                  17
Algebra                       50,070               10.832       4.113            0                  18
Geometry                      50,070                7.238       2.901            0                  16
Measurement                   50,070                2.909       2.168            0                   8
Data Analysis & Probability   50,070                5.158       2.504            0                  13

Between-Strand Pearson Correlation Coefficient Matrix

Strand                        Number & Operations   Algebra   Geometry   Measurement   Data Analysis & Probability
Number & Operations           1.000                 0.764     0.655      0.665         0.661
Algebra                       0.764                 1.000     0.654      0.644         0.661
Geometry                      0.655                 0.654     1.000      0.608         0.607
Measurement                   0.665                 0.644     0.608      1.000         0.606
Data Analysis & Probability   0.661                 0.661     0.607      0.606         1.000
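For illustration, the Python sketch below shows how strand-level descriptive statistics and a between-strand Pearson correlation matrix such as those in Table 2 could be produced from a file of strand scores. The file name and column names are assumptions; the PACT data themselves are not reproduced here.

    # Compute strand descriptives and a between-strand Pearson correlation matrix
    # from a hypothetical file with one row per student and one column per strand.
    import pandas as pd

    strand_cols = ["Number & Operations", "Algebra", "Geometry",
                   "Measurement", "Data Analysis & Probability"]

    scores = pd.read_csv("grade8_math_strand_scores.csv", usecols=strand_cols)

    descriptives = scores.agg(["count", "mean", "std", "min", "max"]).T
    strand_correlations = scores.corr(method="pearson")

    print(descriptives.round(3))
    print(strand_correlations.round(3))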

Unidimensionality

It is of considerable importance to check the number of dimensions (constructs) as operationally measured by the assessment and then reflected in the test data. Subject areas such as reading, mathematics and science (at the lower grade levels) are typically thought of as single constructs; however, student performance on the assessment items may be contingent on other unintended and irrelevant factors. Performance on constructed response items in mathematics, for example, may be dependent partially on reading level. Likewise, to solve a science problem, reading and writing skills as well as some knowledge of mathematics may be essential.

Assessments based on alternate and modified achievement standards typically are given to students with a wide range of disabilities and academic training, so the internal structure of the assessment instrument may be more complex than it is for typical students assessed on grade-level achievement standards under standardized conditions. Performance tasks administered with scaffolded assistance may reveal both the ability of the student and the assessor’s judgment regarding the level of prompting needed by the student in order to demonstrate his or her knowledge, which brings an extra dimension to be considered in any interpretation of the test data.

Interpretation of test scores is more straightforward when the assessment taps one unique construct such as reading or mathematics. Unidimensionality may be checked via a variety of statistical methods, ranging from classical principal component analysis (PCA) or factor analysis (FA) techniques to more modern methods based on item response theory (IRT). The scree test in PCA and FA, for example, has often been used for this purpose. IRT-based techniques include DIMTEST and similar procedures developed by Stout and his associates (Stout, 1987; Nandakumar & Stout, 1993).
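As a simple illustration of a scree-style check, the Python sketch below computes the eigenvalues of the inter-item correlation matrix; a first eigenvalue that is much larger than the rest is consistent with a single dominant dimension. The input file name and layout are assumptions, and this classical check is not a substitute for procedures such as DIMTEST.

    # Scree-style unidimensionality check on a hypothetical matrix of item scores.
    import numpy as np
    import pandas as pd

    responses = pd.read_csv("item_scores.csv")      # rows = students, columns = items
    corr = responses.corr().to_numpy()              # inter-item correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # eigenvalues, largest first

    print("First five eigenvalues:", np.round(eigenvalues[:5], 2))
    print("First/second eigenvalue ratio:", round(eigenvalues[0] / eigenvalues[1], 2))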

Similarity of Factor Structure and Differential Item Functioning Across Accommodated and Nonaccommodated Conditions

Many authors, such as Geisinger (1994), have noted major measurement issues when students are assessed by standardized tests that have been administered under nonstandard conditions. By adhering to standard procedures, errors in scores can more likely be attributed to random or individual errors, rather than systematic administrative errors, and scores can be interpreted similarly for all participants (Geisinger, 1994). Some students with disabilities require accommodations to allow an assessment to tap into a student’s ability on the construct(s) being measured, curtailing or neutralizing the effect of a student’s disability on his or her test result. A true accommodation should allow a student to be assessed in such a way that a disability does not misrepresent the student’s actual level of proficiency. Students ideally should be placed on equal footing and not advantaged or disadvantaged because of a disability, and the interpretations of their scores should be valid for the purpose of the test.

In her landmark writing on the balance between the individual rights of the student with a disability and the integrity of the testing program, Phillips (1994, p. 104) indicated the need to address a number of questions including the following two: (1) Does accommodation change the skill being measured? and (2) Does accommodation change the meaning of the resulting scores?

A host of psychometric issues need to be considered in addressing these two questions. Willingham (1989) pointed out that in order to achieve score comparability, various forms of the assessment need to display similar factor structures, and no differential item functioning (DIF) should exist across student groups. In a long review of psychometric, legal, and social policy issues about test accommodations for examinees with disabilities, Pitoniak and Royer (2001) concluded that further legal decisions would determine assessment accommodation policies and that more research is needed on test comparability.

The PCA, FA, or IRT techniques previously described can be used to examine factor structures, and DIF analysis may be performed to determine whether individual test items function differently across groups. It should be mentioned that evidence regarding similarity in factor structure and DIF is more easily collected when exactly the same assessment is administered under varying conditions, as separate test forms; this is not usually the case where the accommodated form is the same as the regular form. More often than not, there are significantly fewer accommodated than nonaccommodated students, and fewer still when analyses are broken out by type of accommodation, limiting the appropriateness of conducting PCA, FA, IRT, or DIF analyses. In addition, most students receive multiple accommodations in a given administration.
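When sample sizes permit, one commonly used DIF procedure is the Mantel-Haenszel method. The Python sketch below is a minimal version for a single dichotomous item, stratifying on total score; the data layout (a "group" column, a "total" column, and one 0/1 column per item) and the group labels are assumptions.

    # Minimal Mantel-Haenszel common odds ratio for one item, stratified by total score.
    import numpy as np
    import pandas as pd

    def mantel_haenszel_odds_ratio(df, item, reference, focal):
        """Common odds ratio for `item` across total-score strata."""
        num, den = 0.0, 0.0
        for _, stratum in df.groupby("total"):
            ref = stratum.loc[stratum["group"] == reference, item]
            foc = stratum.loc[stratum["group"] == focal, item]
            n = len(ref) + len(foc)
            if len(ref) == 0 or len(foc) == 0:
                continue
            a, b = ref.sum(), (1 - ref).sum()       # reference group: correct, incorrect
            c, d = foc.sum(), (1 - foc).sum()       # focal group: correct, incorrect
            num += a * d / n
            den += b * c / n
        return num / den

    # Hypothetical use with a scored data file:
    # scores = pd.read_csv("scored_items.csv")
    # alpha_mh = mantel_haenszel_odds_ratio(scores, "item_07", "nonaccommodated", "accommodated")
    # delta_mh = -2.35 * np.log(alpha_mh)           # ETS delta metric; values far from 0 suggest DIF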

However, when accommodated and nonaccommodated forms are developed and constructed to reflect exactly the same strands (content standards), PCA or FA may be carried out at the strand level rather than at the item level. This type of analysis was used in two studies based on the South Carolina High School Exit Examination (Huynh, Meyer, & Gallant-Taylor, 2004; Huynh & Barton, 2006). An analysis of the similarity of inter-item and inter-strand correlations may also provide evidence of the similarity across various groups of the construct assessed by the instruments.

Similarity of Factor Structure and Differential Item Functioning Across Major Reporting Groups

When data are available, it may be of interest to determine whether a test’s factor structure is similar across important reporting groups such as students with varying disabilities. An example of this type of analysis may be found in Huynh and Barton (2006), who compared the factor structure of the same test across groups of students with physical, learning and emotional disabilities. It is noted that this type of evidence may be hard to compile with small reporting groups. In this case, perhaps a consensus agreement via "juried" assessment, like the one used in Oregon, may be a good choice (Oregon Department of Education, 2004).

Special Considerations for Tests With Cross-Grade Items: Internal Structure and Differential Item Functioning

The U.S. Department of Education recently published a Notice of Proposed Rulemaking (NPRM) in the Federal Register that would allow states to develop modified achievement standards and use assessments aligned with those modified standards for a group of students with disabilities who can make progress toward, but may not reach, grade-level achievement standards in the same time frame as other students.1 An assessment based on modified achievement standards may be narrower in content breadth and coverage than an assessment based on grade-level achievement standards. An assessment based on modified achievement standards may comprise a number of cross-grade items that were originally designed for students at other grade levels. Cross-grade items may be used under certain conditions to enlarge the item bank for the purpose of test form construction.

In some cases it may be possible to find a section of the assessment based on grade-level achievement standards (GLAS) that is similar to the assessment based on modified achievement standards (MAS) either at the item level (all items are identical) or at the strand level (all strands are the same). In these cases, it may be possible to check the similarity between the GLAS section and the MAS section in terms of factor structure. When enough data are available, it may be possible to check for DIF between students who took the GLAS section and those who took the test with MAS.

Similarity in Types of Errors Made by Students

Evidence of similarity of constructs across major reporting groups may also be collected by analyzing the major types of errors made by students on various assessments administered to these groups. A study along these lines for the South Carolina High School Exit Examination was reported in Barton and Huynh (2003). Overall, the question is, "Are the types of errors made by students taking the regular assessment with accommodations or students taking the test based on modified achievement standards similar to those of general education test takers at a similar grade level, level of mastery, etc.?" Similarity in the errors is a good indicator that the assessment instruments tap similar constructs.
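As one simple way to quantify similarity in error patterns, the Python sketch below compares the distribution of coded error types across two groups with a chi-square test of homogeneity. The error categories and counts are invented and are not taken from Barton and Huynh (2003).

    # Compare invented error-type counts for two groups of test takers.
    import numpy as np
    from scipy.stats import chi2_contingency

    # Columns: computation, misread item, incomplete response, concept error.
    # Rows: general education takers; takers of the assessment based on modified standards.
    counts = np.array([
        [120, 45, 60, 210],
        [ 38, 19, 22,  71],
    ])

    chi2, p_value, dof, _ = chi2_contingency(counts)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")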


1 Retrieved from the World Wide Web on Feb. 8, 2006 at http://www.ed.gov/legislation/FedRegister/proprule/2005-4/121505a.html

 
