STANDARD SETTING BOGSAT - BUNCH OF GUYS SITTING AT A TABLE

z42\doc\web\2000\07\bogsat.txt

In response to Monty Neill's explanation of how standards are set, this is what Washington does. See if you can keep from laughing in the aisles, or crying.

From Arthur Hu:

The Washington method, as it was described to the legislature, is probably similar to what most other states are doing.

1. The test maker comes up with a test allegedly aligned to the specifications and standards [but it really isn't; the problems are just the hardest ones they think they can throw at 4th graders and get away with. How else can you explain comparing uncommon fractions, similar triangles, and probability as ratio, when those skills are clearly set at a higher grade level in the state documents?]

2. The test is given in a pilot trial to everybody in the state. This happened for 4th grade math in 1997.

3. The questions don't even get a scoring guide until AFTER all 64,000 responses are in. All the scoring people meet in a big gym, then they have a pow-wow to decide, for each question, which responses are worth a 1, 2, 3, or 4. The questions do NOT have correct answers when they are written. Based on this, a scoring "rubric" is created for each question.

4. Then the tests are scored subjectively by people hired for $8-$12 an hour on a temp basis, the same sort of people that might be hired by the Census or H&R Block. Their average math and literacy level is probably somewhere between G5 and G8, yet they score problems that require skills at G10 to college-graduate level to completely comprehend. Scorers are determined to be fit if they agree 80% of the time, and are not off by more than 1 point the rest of the time. So you can be off by as much as 25% (1 point on a 4 point scale), 80% of the time, and 50% off the rest of the time, and that's "accurate". You know, it would be interesting to see how well the scorers would do if they took the WASL.

5. After the questions are scored, a "standards setting committee" made up of "parents, businessmen, teachers, community leaders" is put into a room, and they're shown a sequence of questions in order of difficulty, based on what percentage of students got each question right. Each judge is asked to place a "bookmark" at the question that represents each level, until the judges come to a "consensus". At no point are any of the problems compared against the test specifications or the grade level benchmarks, since it can be safely assumed that the vendor that created the test has already done this. Also, the judges are not told what percentage of students got any problem right, so there is nothing to keep the judges from deciding that a problem that only 5% got right is something that "every student should know and be able to do".

6. The bookmarks set the percentage of problems that must be answered correctly for each level, and that percentage remains fixed. The test construction process is scientifically guaranteed to produce tests of equal difficulty each year afterwards, which allows a valid comparison of progress across years. Of course, this probably doesn't apply to the reading test, which has deliberately been made "more challenging" because too many students were judged to have passed before (about half). It also ignores evidence that test scores tend to inflate because teachers teach to the problems they saw last year that are repeated, and that tests in other states such as Kentucky and Texas were found to get easier each year.

BTW, most experts believe the NCTM standards are a pile of garbage, to put it lightly. Even the revised standards promote bizarre "invented" math methods to give an excuse for not teaching standard methods.

--------------------------------------------------------------------------
To unsubscribe from the ARN-L list, send command SIGNOFF ARN-L to LISTSERV@LISTS.CUA.EDU.

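For readers who want to see the mechanics, the bookmark procedure in steps 5 and 6 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Washington's actual procedure: the item data and the judges' bookmark positions are invented, and modeling the "consensus" as the median of the judges' bookmarks is a guess made for the example.

```python
from statistics import median


def order_by_difficulty(pct_correct):
    """Return item ids from easiest to hardest (highest to lowest percent correct)."""
    return sorted(pct_correct, key=lambda item: pct_correct[item], reverse=True)


def consensus_bookmark(bookmarks):
    """Assumption: the judges' 'consensus' is modeled as the median bookmark."""
    return int(median(bookmarks))


def cut_fraction(bookmark, n_items):
    """Fraction of items a student must answer correctly to reach the level."""
    return bookmark / n_items


# Hypothetical pilot data: item id -> percent of students who got it right.
pct_correct = {"q1": 85, "q2": 40, "q3": 62, "q4": 5, "q5": 23}
ordered = order_by_difficulty(pct_correct)

# Hypothetical bookmarks from five judges: each number is the position
# (1-based) of the last item a student at the level is expected to get right.
# The judges see only the ordering, not the percentages.
judge_bookmarks = [3, 4, 4, 5, 4]
bookmark = consensus_bookmark(judge_bookmarks)

cut = cut_fraction(bookmark, len(ordered))
hardest_required = ordered[bookmark - 1]

print("Item order:", ordered)
print("Consensus bookmark:", bookmark)
print("Fraction correct required:", cut)
print("Hardest required item:", hardest_required,
      "answered correctly by", pct_correct[hardest_required], "percent")
```

In this made-up data the bookmark lands on an item that only 23% of students answered correctly, which is the situation step 5 complains about: nothing in the ordering alone tells the judges how few students got that item right.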
-----Original Message-----
From: Monty Neill [mailto:Mneillft@AOL.COM]
Sent: Thursday, June 15, 2000 9:42 AM
To: ARN-L@LISTS.CUA.EDU
Subject: Re: reporter wants to know...

The publishers of both the Stanford 9 and the Terra Nova (CTBS5) norm referenced tests provide "levels" as well. For math, the levels are based, they claim, on the NCTM standards and probably a few others; for reading they are based primarily on the NAEP frameworks.

Working from the already-developed NRTs, which are designed of course to sort students out from the first to the ninety-ninth percentile, the companies engaged in "levels setting" procedures. I don't know which of several models they actually used, but they all come down to a variation of BOGSAT -- bunch of guys sitting around a table -- that aims to establish a group decision about where cutoff scores should be set to demarcate levels. So, the group has to decide, based on a definition of what one knows at the borders of level 1 and level 2 (based in turn on the standards chosen), how many items a student would get correct to move from level 1 to 2, then 2 to 3, then 3 to 4. Once one knows the number correct on any given one of the tests, then the percentile rank that corresponds to that number correct is also known.

Turns out that on the SAT9, as Deb has said, the cutoff points can be very high. It would appear from this that the percentage of US students who know math well according to the standards the testmakers turned to is quite low (each grade has a different percentile score that equates to a level 1 cutoff, etc.). But the reading standards are different; more students read better. (BTW, this would correspond to the results of international studies, which have US students near the top in reading in grades 3 and 8, while middle of the pack in math at 4 and 8.)

The Terra Nova may have "easier" levels -- Wisconsin uses it as the state test, and got criticized for having too many "proficient" students as compared with the state's NAEP results.
Or it may be that WI did its own BOGSAT and set its own levels on the Terra Nova (I think they did, actually, but am not certain). These examples reveal again the subjectivity involved in setting the levels, which means they can vary a good deal.

Then again, the tests, which are designed to sort students out (and in the most commonly used forms contain only multiple-choice items, though the version of SAT9 used in Boston contains open-ended questions, as does the WI version of CTBS5), are not designed with the standards and levels in mind. So the levels are based on what is on a test designed to sort and rank, rather than on a test constructed directly from the levels. One result is that there might be few items on the tests that correspond to some major areas of the NCTM (or other) standards and more that correspond to other NCTM areas. And the test items on an NRT are selected to mostly sort at the middle of the curve, meaning there are relatively few items that most students get right or that very few get right -- which will also affect the setting of the levels.

One consequence is that the "levels" need to be taken with a pound of salt. Also, as just reported in Boston, on math most students are in level 1, low -- but a student might have scored higher on the SAT9 than over 40% of the kids in the country and still be level 1. Low standards, inadequate test, both, neither?

Monty Neill
FairTest
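
The link described above between a number-correct cutoff and a percentile rank can be illustrated with a small sketch. Everything here is hypothetical: the norm-group scores and the level 2 cutoff are invented, and percentile rank is taken to mean the percentage of the norm group scoring strictly below a given raw score, which is one common convention and not necessarily the one the publishers use.

```python
def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group scoring strictly below raw_score (assumed definition)."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)


def level_for(raw_score, level_2_cutoff):
    """Assign level 1 or 2 from a single hypothetical cutoff."""
    return 2 if raw_score >= level_2_cutoff else 1


# Hypothetical raw scores (number correct) for a ten-student norm group.
norm_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]

# Hypothetical cutoff chosen by the levels-setting group.
level_2_cutoff = 26

cutoff_rank = percentile_rank(level_2_cutoff, norm_scores)

# A student with 25 correct outscores half of this norm group
# and is still placed in level 1.
student_score = 25
student_rank = percentile_rank(student_score, norm_scores)
student_level = level_for(student_score, level_2_cutoff)

print("Percentile rank at the cutoff:", cutoff_rank)
print("Student percentile rank:", student_rank, "level:", student_level)
```

With these invented numbers the cutoff sits at the 60th percentile, so a student can outscore half of the norm group and still be reported as level 1, much like the Boston example.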