==Design and scoring==

=== Design ===
Most commonly, a major academic test includes both human-scored and computer-scored sections. A standardized test can be composed of multiple-choice questions, true-false questions, essay questions, [[authentic assessment]]s, or nearly any other form of assessment. Multiple-choice and true-false items are often chosen for tests that are taken by thousands of people because they can be given and scored inexpensively, quickly, and reliably by using special answer sheets that can be read by a computer or via [[computer-adaptive test]]ing. Some standardized tests have short-answer or essay writing components that are assigned a score by independent evaluators who use [[Rubrics (education)|rubrics]] (rules or guidelines) and benchmark papers (examples of papers for each possible score) to determine the grade to be given to a response.

==== Any subject matter ====
[[File:Road_Test_Scoring_Standards_and_Record_of_Result,_KMVSS_Small_Vehicle_Driver's_Test_20190419.jpg|alt=Poster on a wall, displaying required behaviors and points that will be deducted for errors in English and Chinese|thumb|Poster showing the standards for passing [[Driving test|driving tests]] in Taiwan. Every person who wants a driver's license takes the same test and gets scored in the same way.]]
Not all standardized tests involve answering questions. An authentic assessment of [[Athletic skill|athletic skills]] could take the form of [[running]] for a set amount of time or [[dribbling]] a ball for a certain distance. Healthcare professionals must pass tests proving that they can perform medical procedures. Candidates for driver's licenses must pass a standardized test showing that they can drive a car. The [[Canadian Standardized Test of Fitness]] has been used in medical research to determine how [[Physical fitness|physically fit]] the test takers are.<ref>{{Cite journal|last1=Horowitz|first1=M. R.|last2=Montgomery|first2=D. L.|date=January 1993|title=Physiological profile of fire fighters compared to norms for the Canadian population |journal=Canadian Journal of Public Health |volume=84|issue=1|pages=50–52|issn=0008-4263|pmid=8500058}}</ref><ref>{{Cite book |title=Canadian Standardized Test of Fitness (CSTF): for 15 to 69 years of age: interpretation and counselling manual |date=1987 |publisher=Canadian Society for Exercise Physiology |author1=Canadian Association of Sports Sciences |author2=Fitness Appraisal Certification and Accreditation Program |author3=Canadian Society for Exercise Physiology |author4=Fitness Canada |isbn=0-662-15736-2 |location=Gloucester, Ontario |oclc=16048356}}</ref>

=== Machine and human scoring ===
[[File:Cito Eindtoets Basisonderwijs.JPG|thumb|right|Some standardized testing uses multiple-choice tests, which are relatively inexpensive to score, but any form of assessment can be used.]]
Since the latter part of the 20th century, large-scale standardized testing has been shaped in part by the ease and low cost of grading multiple-choice tests by computer. Most national and international assessments are not fully evaluated by people. Human scorers are used for items that cannot easily be scored by computer, such as essays.
For example, the [[Graduate Record Examinations|Graduate Record Exam]] is a computer-adaptive assessment that requires no scoring by people except for the writing portion.<ref>[http://www.ets.org/portal/site/ets/menuitem.1488512ecfd5b8849a77b13bc3921509/?vgnextoid=302433c7f00c5010VgnVCM10000022f95190RCRD&vgnextchannel=7196e3b5f64f4010VgnVCM10000022f95190RCRD#Scoring_and_Reporting ETS webpage] {{Webarchive|url=https://web.archive.org/web/20090618054925/http://www.ets.org/portal/site/ets/menuitem.1488512ecfd5b8849a77b13bc3921509/?vgnextoid=302433c7f00c5010VgnVCM10000022f95190RCRD&vgnextchannel=7196e3b5f64f4010VgnVCM10000022f95190RCRD#Scoring_and_Reporting |date=2009-06-18 }} about scoring the GRE.</ref> Human scoring is relatively expensive and often variable, which is why computer scoring is preferred when feasible. For example, some critics say that poorly paid employees will score tests badly.<ref name="Houtz">Houtz, Jolayne (August 27, 2000) [http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=4039520&date=20000827&query=wasl+pearson+pay+hour "Temps spend just minutes to score state test A WASL math problem may take 20 seconds; an essay, 2{{frac|1|2}} minutes"] {{Webarchive|url=https://web.archive.org/web/20070310174605/http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=4039520&date=20000827&query=wasl+pearson+pay+hour |date=2007-03-10 }}. ''Seattle Times'' "In a matter of minutes, a $10-an-hour temp assigns a score to your child's test"</ref> [[Inter-rater reliability|Agreement between scorers]] can vary between 60 and 85 percent, depending on the test and the scoring session. For large-scale tests in schools, some test-givers pay to have two or more scorers read each paper; if their scores do not agree, then the paper is passed to additional scorers.<ref name="Houtz" />

Though the process is more difficult than grading multiple-choice tests electronically, essays can also be graded by computer. In other instances, essays and other open-ended responses are graded according to a pre-determined assessment rubric by trained graders. For example, at Pearson, all essay graders have four-year university degrees, and a majority are current or former classroom teachers.<ref>{{Cite news|last=Rich|first=Motoko|date=2015-06-22|title=Grading the Common Core: No Teaching Experience Required|newspaper=The New York Times|url=https://www.nytimes.com/2015/06/23/us/grading-the-common-core-no-teaching-experience-required.html|access-date=2015-10-06|issn=0362-4331}}</ref>

=== Use of rubrics for fairness ===
Using a [[Rubric (academic)|rubric]] is meant to increase fairness when the test taker's performance is evaluated. In standardized testing, measurement error (a consistent pattern of errors and biases in scoring the test) is easy to determine. When the score depends upon the graders' individual preferences, test takers' grades depend upon who grades the test. Standardized tests also reduce grader bias in assessment.
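To make the effect of a rubric concrete, the sketch below applies the same kind of keyword-based rule shown in the comparison table that follows. The keyword list, sample answers, and function name are illustrative assumptions, not part of any actual scoring program: an answer is marked correct only if it mentions at least one required cause, so the score does not depend on which grader applies the rule.

<syntaxhighlight lang="python">
# Minimal sketch of rubric-based scoring (illustrative only): an answer is
# marked correct if it mentions at least one cause required by the rubric.
RUBRIC_KEYWORDS = ["poland", "china", "economic", "depression"]

def score_answer(answer: str) -> int:
    """Return 1 if the answer mentions any required cause, else 0."""
    text = answer.lower()
    return 1 if any(keyword in text for keyword in RUBRIC_KEYWORDS) else 0

answers = [
    "WWII was caused by Hitler and Germany invading Poland in 1939.",
    "WWII was caused by the assassination of Archduke Ferdinand in 1914.",
]
for answer in answers:
    # Prints 1 for the first answer (it mentions Poland) and 0 for the second.
    print(score_answer(answer), answer)
</syntaxhighlight>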
Research shows that teachers create a kind of self-fulfilling prophecy in their assessment of test takers, granting higher scores to those they anticipate will achieve and lower grades to those they expect to fail.<ref>{{cite journal|last1=Jussim|first1=Lee|year=1989|title=Teacher expectations: Self-fulfilling prophecies, perceptual bias, and accuracy|journal=Journal of Personality and Social Psychology|volume=57|issue=3|pages=469–480|doi=10.1037/0022-3514.57.3.469}}</ref> In non-standardized assessment, graders have more individual discretion and therefore are more likely to produce unfair results through [[unconscious bias]].

{| class="wikitable"
|+Sample scoring for the open-ended history question: What caused [[World War II]]?
|-
! Student answers
! Standardized grading
! Non-standardized grading
|-
|
| Grading rubric: Answers must be marked correct if they mention at least one of the following: Germany's invasion of Poland, Japan's invasion of China, or economic issues.
| No grading standards. Each teacher grades however he or she wants to, considering whatever factors the teacher chooses, such as the answer, the amount of effort, the student's academic background, language ability, or attitude.
|-
|''Student #1:'' WWII was caused by Hitler and Germany invading Poland in 1939.
| ''Teacher #1:'' This answer mentions one of the required items, so it is correct.<br />''Teacher #2:'' This answer is correct.
| ''Teacher #1:'' I feel like this answer is good enough, so I'll mark it correct.<br />''Teacher #2:'' This answer is correct, but this good student should be able to do better than that, so I'll only give partial credit.
|-
|''Student #2:'' WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of national socialism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland.
| ''Teacher #1:'' This answer mentions one of the required items, so it is correct.<br />''Teacher #2:'' This answer is correct.
| ''Teacher #1:'' I feel like this answer is correct and complete, so I'll give full credit.<br />''Teacher #2:'' This answer is correct, so I'll give full points.
|-
|''Student #3:'' WWII was caused by the assassination of Archduke Ferdinand in 1914.
| ''Teacher #1:'' This answer does not mention any of the required items. No points.<br />''Teacher #2:'' This answer is wrong. No credit.
| ''Teacher #1:'' This answer is wrong. No points.<br />''Teacher #2:'' This answer is wrong, but this student tried hard and the sentence is grammatically correct, so I'll give one point for effort.
|}

===Using scores for comparisons===
There are two types of [[test score]] interpretations: a [[norm-referenced test|norm-referenced]] score interpretation or a [[criterion-referenced test|criterion-referenced]] score interpretation.<ref name="Allen" />
* '''Norm-referenced score interpretations''' compare test takers to a [[Sampling (statistics)|sample of peers]].<ref name="Allen" /> The goal is to rank test takers as being better or worse than others. Norm-referenced test score interpretations are associated with [[traditional education]]. People who perform better than others pass the test, and people who perform worse than others fail the test.
* '''Criterion-referenced score interpretations''' compare test takers to a criterion (a formal definition of content), regardless of the scores of other examinees.<ref name="Allen" /> These may also be described as [[standards-based assessment]]s, as they are aligned with the [[standards-based education reform]] movement.<ref>Where We Stand: Standards-Based Assessment and Accountability (American Federation of Teachers) [http://www.aft.org/pubs-reports/downloads/teachers/StandAssessRes.pdf/] {{webarchive|url=https://web.archive.org/web/20060824050606/http://www.aft.org/pubs-reports/downloads/teachers/StandAssessRes.pdf/|date=August 24, 2006}}</ref> Criterion-referenced score interpretations are concerned solely with whether this particular student's answer is correct and complete. Under criterion-referenced systems, it is possible for all test takers to pass the test, or for all test takers to fail the test.

Either of these systems can be used in standardized testing. What is important to standardized testing is whether all students are asked equivalent questions, under reasonably equal circumstances, and graded according to the same standards.

[[File:Standard Normal Distribution.png|alt=a generic normal curve, with standard deviations marked|thumb|A norm-referenced test may be designed to find where the test taker falls along a [[normal curve]].]]
A ''normative assessment'' compares each test taker against other test takers. A [[norm-referenced test]] (NRT) is a type of test, [[Educational assessment|assessment]], or [[evaluation]] which yields an estimate of the position of the tested individual in a predefined population. The estimate is derived from the analysis of test scores and other relevant data from a [[Sample (statistics)|sample]] drawn from the population. This type of test identifies whether the test taker performed better or worse than other people taking the same test. An [[IQ test]] is a norm-referenced standardized test. Comparing against others makes norm-referenced standardized tests useful for admissions purposes in higher education, where a school is trying to compare students from across the nation or across the world. The standardization ensures that all of the students are tested equally, and the norm-referencing identifies which performed better or worse. Examples of such international benchmark tests include the Trends in International Mathematics and Science Study ([[Trends in International Mathematics and Science Study|TIMSS]]) and the Progress in International Reading Literacy Study ([[Progress in International Reading Literacy Study|PIRLS]]).

[[File:Ensuring water quality (7831355868).jpg|alt=Technician holds color-coded card with water testing standards|thumb|[[Water testing]] uses criterion-referenced testing, because it is more important to determine whether the local water is safe to drink than to compare it against water from a different place.]]
A [[criterion-referenced test]] (CRT) is a style of test which uses test scores to show how well test takers performed on a given task, not how well they performed compared to other test takers. Most tests and quizzes written by school teachers are criterion-referenced tests. In this case, the objective is simply to see whether the test taker can answer the questions correctly. The test giver is not usually trying to compare each person's result against other test takers.
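The two interpretations described in this section can be illustrated with a short sketch. The scores, the 70-point cutoff, and the function names below are made-up assumptions rather than values from any actual test: the same list of scores is read norm-referenced (as a percentile rank within the group) and criterion-referenced (as pass or fail against a fixed standard).

<syntaxhighlight lang="python">
# Illustrative sketch: the same scores interpreted two ways.
from bisect import bisect_left

scores = [55, 62, 70, 78, 85, 91]  # hypothetical scores for one group

def percentile_rank(score, all_scores):
    """Norm-referenced: percentage of the group scoring below this score."""
    ordered = sorted(all_scores)
    return 100 * bisect_left(ordered, score) / len(ordered)

def meets_criterion(score, cutoff=70):
    """Criterion-referenced: pass or fail against a fixed standard."""
    return score >= cutoff

for s in scores:
    print(s, f"percentile rank: {percentile_rank(s, scores):.0f}",
          "pass" if meets_criterion(s) else "fail")
</syntaxhighlight>

Under the fixed cutoff, every member of the group could pass or every member could fail, whereas the percentile rank always ranks the group's members relative to one another.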