Universal Design Applied to Large Scale Assessments
Element #2. Precisely Defined Constructs
An important function of well-designed assessments is that they measure what they actually intend to measure. According to Popham and Lindheim (1980), "a test development project begins with a careful consideration of the skills or attitudinal characteristics proposed for measurement" (p. 3). Just as universally designed architecture removes physical, sensory, and cognitive barriers for all types of people in public and private structures, universally designed assessments remove all non-construct-oriented cognitive, sensory, emotional, and physical barriers. The effect of such barriers on scores is referred to as construct-irrelevant variance, "the degree to which test scores are affected by processes that are extraneous to its intended construct" (AERA, APA, NCME, 1999).
Recently, much controversy has surrounded the question of whether the use of particular accommodations invalidates the measurement of the constructs that a test is designed to measure. For example, different groups may define reading comprehension differently. Some may define it as constructing meaning from written text, while others may have a broader construct targeting comprehension and not how the information is obtained. The argument for the latter is especially provocative for students who are visually impaired. With fewer students learning Braille, more students use technological devices that read text. It could be argued that this is the only way for these students to comprehend meaning from text.
Resolution of such issues is hampered by the lack of a clear, generally accepted definition of the construct the test is designed to measure. Because scores on state tests often influence high-stakes decisions about whether a student will be promoted to the next grade or can graduate from high school, the need for clearly defined constructs is more critical than ever. And, once these constructs are defined, they need to be available to people who make decisions about how tests can be administered.
Another common test construct controversy relates to the reading skills required on mathematics assessments. Several research studies have found that students with reading difficulties scored higher on math tests when questions were read to them (Calhoun, Fuchs, & Hamlett, 2000; Harker & Feldt, 1993; Koretz, 1997; Tindal, Heath, Hollenbeck, Almond, & Harniss, 1998). This finding implies that the reading requirements of a mathematics assessment may prevent students with marginal reading ability from demonstrating their competency in math. However, problem-solving items tend to require substantial reading. Math educators have mixed feelings about these items and their reading loads. Shorrocks-Taylor and Hargreaves (1999) suggested that the language used in questions on tests that assess subjects other than language become as "transparent" or simplified as possible so that the mathematical demands become clear for most of the students tested. Though these researchers found little mention of the language dimension of testing in the assessment literature, this has been a major concern of test developers for some time.
Element #3. Accessible, Non-Biased Items
According to the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999), "the quality of the items is usually ascertained through item review procedures and pilot testing. Items are reviewed for content quality, clarity and lack of ambiguity. Items sometimes are reviewed for sensitivity to gender or cultural issues" (p. 39). According to the National Research Council (1999), bias arises when:
Deficiencies in the test itself result in different meanings for scores earned by members of different identifiable subgroups. For example, a test intended to measure verbal reasoning should include words in general use, not words and expressions associated with, for example, particular cultures or locations, as this might unfairly advantage test takers from these cultural or geographical groups. (p. 78)
Kopriva (2000) describes a process for incorporating accessibility as a primary dimension of test specifications. In this process, the dimensions of breadth, depth, and accessibility are considered collectively in the development of tests and test items, making it possible for accessibility to be woven into the fabric of the test, rather than being added after the fact.
One way to reduce bias is to research whether any items are more difficult for students from particular subpopulations. This can be accomplished through the administration of a field test that can help determine item difficulty and "ability to discriminate among test takers of different standing on the scale" (AERA, APA, NCME, 1999, p. 39). To evaluate the quality of the items, test developers often conduct studies of differential item functioning (DIF). DIF occurs when students equated on relevant ability but representing different groups do not have the same probability of responding correctly to test items. DIF is usually investigated by comparing item difficulty across groups after students have been matched on ability. DIF analysis has traditionally been used to detect the differential function of an item according to group identity (e.g., race, gender, disability). Willingham (1988) refers to "comparable validity," defined as a test's ability to yield comparable scores from person to person, subpopulation to subpopulation, and setting to setting. Item response theory (IRT) is used to predict a student's probability of answering an item correctly.
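The matched-group comparison behind a DIF study can be illustrated with a short sketch. The source does not name a specific statistic; the example below assumes the Mantel-Haenszel common odds ratio, one widely used DIF index, and uses total test score as the matching variable. The function name, the record layout, and the group labels "reference" and "focal" are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Estimate the Mantel-Haenszel common odds ratio for one item.

    records: iterable of (group, matched_score, correct) tuples, where
    group is "reference" or "focal" (assumed labels), matched_score is the
    ability-matching variable (assumed here to be total test score), and
    correct is 1 if the student answered the item correctly, else 0.

    Returns None when the ratio is undefined (zero denominator).
    """
    # For each score level, count:
    # [reference correct, reference incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        offset = 0 if group == "reference" else 2
        strata[score][offset + (0 if correct else 1)] += 1

    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata.values():
        total = ref_right + ref_wrong + foc_right + foc_wrong
        numerator += ref_right * foc_wrong / total
        denominator += ref_wrong * foc_right / total

    if denominator == 0:
        return None
    return numerator / denominator


def build_records(group, score, n_correct, n_incorrect):
    """Helper that expands counts into individual (hypothetical) records."""
    return ([(group, score, 1)] * n_correct) + ([(group, score, 0)] * n_incorrect)


if __name__ == "__main__":
    # Invented field-test counts for a single item, for illustration only.
    data = (
        build_records("reference", 1, 3, 1)
        + build_records("focal", 1, 1, 3)
        + build_records("reference", 2, 2, 2)
        + build_records("focal", 2, 2, 2)
    )
    print(mantel_haenszel_odds_ratio(data))
```

A ratio near 1 suggests that matched students in the two groups have similar odds of answering the item correctly. Values well above or below 1 would flag the item for closer review, for example by a bias review panel; they do not by themselves establish that the item is biased.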
Potentially biasing elements are defined by Popham and Lindheim (1980) as "anything in an item that could potentially advantage or disadvantage any subgroup of examinees within the populations to be tested" (p. 6). Popham (2001, p. 93) lists sample questions that could be asked to avoid potential bias:
- Curricular congruence. Would a student's response to this item, along with others, contribute to a valid determination of whether the student has mastered the specific content standards the item is supposed to be measuring?
- Instructional sensitivity. If a teacher is, with reasonable effectiveness, attempting to promote students' mastery of the content standard that this item is supposed to measure, is it likely that most of the teacher's students will be able to answer the item correctly?
- Out-of-school factors. Is this item essentially free of content that would make a student's socioeconomic status or inherited academic aptitudes the dominant influence on how the student will respond?
- Bias. Is this item free of content that might offend or unfairly penalize students because of personal characteristics such as race, gender, ethnicity, [disability,] or socioeconomic status?
There are gray areas of potential bias affected by varying experience. The distinction between deficiencies due to inexperience resulting from sensory or physical disabilities and deficiencies that are instruction-related is not clear-cut. The test score "may accurately reflect what the test taker knows and can do, but low scores may have resulted in part from not having had the opportunity to learn the material tested as well as from having had the opportunity and having failed to learn" (AERA, APA, NCME, 1999, p. 76).
It is important not to excuse children with sensory deficits from the academic expectations held for all other children. For example, emphasis on verbal communication that is typical in instruction for deaf students may leave less time for attention to academic content standards. Poor performance on assessments may reveal instructional deficiencies, and would not necessarily indicate item bias. The focus for these students should be on changing instructional practice, thereby giving them opportunities to master important skills and knowledge.
Many states have "sensitivity" or "bias" review panels for their assessments, although most of these do not include a disability or limited English proficiency representative. A review panel made up of people who are trained and knowledgeable in the issues for each of the student subgroups should be invited to screen potential items (Allen, Bulla, Goodman, Henderson, Skutchan, Willis, & Scott, in press; Geisinger & Carlson, 1992; Popham, 2001; Popham & Lindheim, 1980). The purpose of sensitivity/bias reviews is to be sure no child has an advantage or disadvantage because of the presentation or content of an item, which would invalidate that item's contribution to a test score. The item pool should be large enough so that a bias review committee has the flexibility to recommend the elimination of items that appear to be biased. Any items that are biased against particular populations should be replaced or changed to eliminate bias. Careful item development and a full bias review improve the validity of test results.
Kopriva (2000, pp. 67-68), in her discussion of accuracy in testing students with limited English proficiency, recommends the following elements of an expanded bias review:
- Bias reviews provide an opportunity for a wider range of special needs educators and other educational stakeholders to have input throughout the test development process.
- In addition to reviewing test items for offensive content, bias reviews provide an excellent opportunity for test developers and publishers to receive guidance about test formats, wording, rubrics, non-text item accessibility, and administration and response conditions.
- Participants should be briefed on all steps taken to ensure accessibility throughout the development and implementation process.
- Publishers need to allocate sufficient time for a thorough, item-by-item review of materials.
- Participants should be able to review mock-ups of some of the assessments to get an idea of layout and presentation, and assess whether these are sufficient.
These elements clearly apply to other populations as well as to English language learners.