Validity Evidence
Evidence Based on Relations to External Variables
The Standards for Educational and Psychological Testing (AERA et al., 1999, pp. 13-15) also calls for validity evidence based on relations to other variables. External evidence for the construct being measured may be found in the relationship between the test and other similar or dissimilar measures. Evidence of this type is sometimes referred to as "convergent" and "divergent," respectively. Applying this type of validity evidence to large-scale assessments in reading and math should result in reading assessment scores that are more closely related to other reading scores than to math scores, and math assessment scores that are more closely related to other math scores than to reading scores.
For example, Table 3 reports correlational data on the relationship between Palmetto Achievement Challenge Tests (PACT) scores and scores on the TerraNova tests for grades three and six. The table is extracted from the 1999 PACT Technical Documentation written by Huynh, Meyer, and Barton (2000). In this table, PACT scores are scale scores and TerraNova subtest scores are normal curve equivalents. The data indicate that PACT English Language Arts (ELA) assessments relate more strongly to the TerraNova reading and language components than to the TerraNova mathematics component. Similarly, PACT mathematics assessments relate more strongly to the TerraNova mathematics component than to the other two TerraNova components. However, the divergent evidence for validity is stronger for PACT mathematics assessments than for ELA assessments.
Table 3
Correlation Between the PACT and the TerraNova for Grades Three and Six

| Grade/Content | TerraNova Reading | TerraNova Language | TerraNova Math |
|---|---|---|---|
| Grade 3 ELA | .776 | .768 | .729 |
| Grade 3 math | .689 | .687 | .803 |
| Grade 6 ELA | .755 | .754 | .741 |
| Grade 6 math | .710 | .705 | .863 |
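The comparison drawn from Table 3 can be made explicit with a small calculation. The Python sketch below copies the correlations from Table 3 and, for each PACT assessment, subtracts the largest divergent correlation from the smallest convergent one. The "gap" summary, the `CONVERGENT` mapping, and the function name are illustrative assumptions and are not taken from the PACT technical documentation.

```python
# Correlations copied from Table 3 (PACT scale scores vs. TerraNova NCEs).
TABLE_3 = {
    ("Grade 3", "ELA"):  {"reading": 0.776, "language": 0.768, "math": 0.729},
    ("Grade 3", "math"): {"reading": 0.689, "language": 0.687, "math": 0.803},
    ("Grade 6", "ELA"):  {"reading": 0.755, "language": 0.754, "math": 0.741},
    ("Grade 6", "math"): {"reading": 0.710, "language": 0.705, "math": 0.863},
}

# Assumption: which TerraNova components count as "similar" measures
# for each PACT content area.
CONVERGENT = {"ELA": ("reading", "language"), "math": ("math",)}


def convergent_divergent_gap(content, row):
    """Smallest convergent correlation minus largest divergent correlation."""
    similar = CONVERGENT[content]
    convergent = [r for name, r in row.items() if name in similar]
    divergent = [r for name, r in row.items() if name not in similar]
    return round(min(convergent) - max(divergent), 3)


gaps = {key: convergent_divergent_gap(key[1], row) for key, row in TABLE_3.items()}

for (grade, content), gap in gaps.items():
    print(f"{grade} {content}: gap = {gap:.3f}")
```

Under these assumptions the gaps are .039 and .013 for Grade 3 and Grade 6 ELA, and .114 and .153 for Grade 3 and Grade 6 mathematics, which is one way to see that the divergent evidence is stronger for the mathematics assessments. A positive gap is only a descriptive summary and is not a test of whether the correlations differ significantly.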
Conclusions and Recommendations
Together with the information about validity evidence at the item level, this paper provides the reader with a set of methods for collecting evidence to evaluate the validity of score interpretations in large-scale assessment systems. The good news is that many of the techniques for examining evidence related to content coverage, response processes, internal structure, and relationships between test scores and other variables are well established in the psychometric literature. The challenge will come in determining what constitutes "good" validity evidence using each of the techniques described in this paper. At this time, there is not a sufficient body of theoretical and empirical evidence to recommend minimally acceptable values of statistical indicators for all of the sources of validity evidence. New small-sample techniques for gathering and assessing validity evidence are needed, especially for assessments that are yet to be designed (e.g., assessments based on modified achievement standards) and for assessments that may be taken by small groups of students. Following are some recommendations for collecting validity evidence:
- States need to collect and document validity evidence for all four general areas described in this paper: content coverage, response processes, internal structure, and relations to other variables. Even if strong validity evidence is collected in one or two areas, the failure to collect evidence across all four areas will weaken arguments that may be made for the validity of score interpretations.
- Validity evidence for content coverage and response processes should be collected with the greatest precision possible. Blueprints for assessments should clearly state the strands or sub-strands within each content area, the types of items used to assess those strands, and the levels of cognitive demand required to respond to the items. Evidence of content coverage will be stronger if links can be made between specific achievement standards (grade level, modified, or alternate) and assessment items. Evidence collected through alignment methods that consider breadth and depth of content coverage will provide a richer understanding of content match than item-level judgments alone. States will need to carefully consider methods to assess response processes to rule out possible problems with construct-irrelevant variance. The response processes of teachers, raters of performance assessments, and students with the most significant cognitive disabilities who complete tasks for performance-based alternate assessments should be considered as well.
- Evidence related to the internal structure of the assessment may include inter-item correlations, inter-strand correlations, summary data from test dimensionality, similarity of the test's internal structure across major reporting groups, and the nature of the errors major reporting groups made on the test.
- Validity evidence collected via relationships between scores on the target assessment and external measures should include relationships with both similar measures (convergent evidence) and dissimilar measures (divergent evidence). Correlational analyses of scores on these measures may be used to make judgments about the quality of evidence from relationships with external measures.
- It will continue to be of great importance to thoroughly document the backgrounds of students who are involved in the field-testing process (Standard 1.5). Disability labels alone may not be good proxies on which to base assumptions about group homogeneity. Careful documentation of testing accommodations will also be needed (Standard 1.13) for subsequent analysis of the potential unintended impact of accommodations (e.g., construct-irrelevant variance).
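The correlational analysis recommended above for convergent and divergent evidence can be sketched with student-level scores. The Python example below computes Pearson correlations between a target reading assessment and two external measures. All score values and variable names are hypothetical and serve only to show the calculation; six students is far too few for a real analysis.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a score list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six students on three measures.
target_reading = [310, 325, 298, 340, 315, 330]   # target assessment (scale scores)
external_reading = [52, 60, 45, 70, 55, 63]       # similar external measure
external_math = [58, 49, 50, 66, 61, 47]          # dissimilar external measure

r_convergent = pearson_r(target_reading, external_reading)
r_divergent = pearson_r(target_reading, external_math)

print(f"convergent r = {r_convergent:.3f}")
print(f"divergent  r = {r_divergent:.3f}")
```

A convergent correlation that is clearly higher than the divergent one is consistent with the intended score interpretation. As noted earlier, there are no agreed minimum values for these indicators, so the comparison supports a judgment and does not settle it.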
In general, well-established data collection guidelines should be followed (Downing & Haladyna, 1997) and validity evidence should be clearly documented. The challenges associated with collecting new types of evidence should not discourage the continued study and collection of item- and test-level validity evidence for all types of assessments.
References
American Educational Research Association (AERA), American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Barton, K., & Huynh, H. (2003). Patterns of errors made by students with disabilities on a reading test with oral reading administration. Educational and Psychological Measurement, 63, 602-614.
Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York, Toronto: Longmans, Green.
Browder, D. M., Karvonen, M., Davis, S., Fallin, K., & Courtade-Little, G. (2005). The impact of teacher training on state alternate assessment scores. Exceptional Children, 71, 267-282.
Downing, S. M., & Haladyna, T. M. (1997). Test item development: Validity evidence from quality assurance procedures. Applied Measurement in Education, 10, 61-82.
Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook on test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Feuer, M. J., Holland, P. W., Green, B. F., Bertenthal, M. W., & Hemphill, F. C. (1999). Uncommon measures: Equivalence and linkage among educational tests. Washington, DC: National Academy Press.
Flowers, C., Browder, D., & Ahlgrim-Delzell, L. (2006). The alignment of three states' alternate assessments to state standards. Exceptional Children, 72, 201-215.
Flowers, C., Browder, D., Ahlgrim-Delzell, L., & Spooner, F. (2006). Promoting the alignment of curriculum, assessment, and instruction. In D. M. Browder & F. Spooner (Eds.), Teaching reading, math, and science to students with significant cognitive disabilities. Baltimore: Brookes.
Geisinger, K. F. (1994). Psychometric issues in testing students with disabilities. Applied Measurement in Education, 7, 121-140.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309-334.
Heller, J. I., Sheingold, K., & Myford, C. M. (1998). Reasoning about evidence in portfolios: Cognitive foundations for valid and reliable assessment. Educational Assessment, 5, 5-40.
Huynh, H., & Barton, K. (2006). Performance of students with disabilities under regular and oral administrations for a high-stakes reading examination. Applied Measurement in Education, 19(1), 21-39.
Huynh, H., Meyer, P., & Barton, K. (2000). Technical documentation for the 1999 Palmetto Achievement Challenge Tests of English Language Arts and Mathematics, Grades Three through Eight. Columbia, SC: South Carolina Department of Education, Office of Assessment. Retrieved July 13, 2005, from http://www.myscschools.com/offices/assessment/Publications/Index_of_Technical_Reports.htm.
Huynh, H., Meyer, P., & Gallant-Taylor, D. (2004). Comparability of student performance between regular and oral administration for a mathematics test. Applied Measurement in Education, 17, 39-57.
Karvonen, M., Flowers, C. P., Browder, D. M., Wakeman, S., & Algozzine, B. (In press). A case study of the influences on alternate assessment outcomes for students with disabilities. Education and Training in Developmental Disabilities.
Massachusetts Department of Education. (2004). 2005 educator's manual for MCAS-Alt. Malden, MA: Author. Retrieved June 25, 2005, from http://www.doe.mass.edu/mcas/alt/05edmanual.pdf.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education.
Minnesota Department of Education. (2005, March). Minnesota Comprehensive Assessments Series II: Test specifications for reading and mathematics. Roseville, MN: Author. Retrieved July 5, 2005, from http://education.state.mn.us/content/087562.pdf.
Nandakumar, R., & Stout, W. (1993). Refinement of Stout's procedures for assessing unidimensionality. Journal of Educational Statistics, 18, 41-68.
Oregon Department of Education. (2004). Juried assessment 2004-2005: Guidelines for using the juried assessment process and guidelines for jurying a modification. Salem, OR: Author. Retrieved June 25, 2005, from http://www.ode.state.or.us/teachlearn/testing/admin/juried/juriedassmtmanual0405.pdf.
PASA Project. (2003). Pennsylvania alternate system of assessment administrator's manual. Pittsburgh, PA: Author. Retrieved June 25, 2005, from http://www.pattan.net/files/instruction/adminmanual.pdf.
Phillips, S. E. (1994). High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education, 7, 93-120.
Phillips, S. E. (2000, April). Legal corner: GI Forum vs. TEA. NCME Newsletter, 8(2), n.p.
Pitoniak, M., & Royer, J. (2001). Testing accommodations for examinees with disabilities: A review of psychometric, legal, and social policy issues. Review of Educational Research, 71, 53-104.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Resnick, L. B., Rothman, R., Slattery, J. B., & Vranek, J. L. (2003). Benchmarking and alignment of standards and testing. Educational Assessment, 9, 1-27.
Roach, A. T., Elliott, S. N., & Webb, N. L. (2005). Alignment of an alternate assessment with state academic standards: Evidence for the content validity of the Wisconsin alternate assessment. The Journal of Special Education, 38, 218-231.
South Carolina Department of Education. (n.d.). Blueprint construction of PACT in mathematics. Columbia, SC: Author. Retrieved June 25, 2005, from http://www.myscschools.com/offices/assessment/Publications/PACTblueprints.htm.
South Carolina Department of Education. (2001). PACT science assessment: A blueprint for success. Columbia, SC: Author. Retrieved June 25, 2005, from http://www.myscschools.com/offices/assessment/Publications/PACTblueprints.htm.
South Carolina Department of Education. (2003). Technical documentation for the 2003 Palmetto Achievement Challenge Tests of English Language Arts, Mathematics, Science, and Social Studies. Retrieved July 13, 2005, from http://www.myscschools.com/offices/assessment/Publications/PACT-Tdoc03.doc.
South Carolina Department of Education. (2005). High school assessment program alternate assessment (HSAP-Alt) test administration manual. Columbia, SC: Author. Retrieved June 25, 2005, from http://www.myscschools.com/offices/assessment/Publications/HSAPAltTAM030805.pdf.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.
Texas Education Agency. (2004, August). State-Developed Alternative Assessment II information booklets. Austin, TX: Author. Retrieved June 25, 2005, from http://www.tea.state.tx.us/student.assessment/resources/guides/sdaa/index.html.
Tindal, G. (2005). Alignment of alternate assessments using the Webb system: An abbreviated version. Washington, DC: Council of Chief State School Officers.
Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four states. (NISE Research Monograph No. 18). Madison, WI: University of Wisconsin-Madison, National Institute for Science Education. (ERIC Document Reproduction Service No. ED440852).
Willingham, W. W. (1989). Standard testing conditions and standard score meaning for handicapped examinees. Applied Measurement in Education, 2, 97-103.
The U.S. Department of Education is reviewing public comments received on the notice of proposed rulemaking regarding modified achievement standards. As this analysis is not completed, the content of this document may not necessarily reflect the final views or policies of the Department concerning modified achievement standards.
This document was produced under U.S. Department of Education Contract No. EDO4CO0025/0002 with the American Institutes for Research. Renee Bradley served as the contracting officer's representative. No official endorsement by the U.S. Department of Education of any product, commodity, service or enterprise mentioned in this report or on Web sites referred to in this report is intended or should be inferred.