The forthcoming Common Core (CC) assessments are the next generation of standardized tests in the US. They will meet the testing-frequency requirements of the most recent version of the Elementary and Secondary Education Act, also known as No Child Left Behind, unless Congress acts to change them, which is most unlikely. Forty-six of the fifty states have signed on to voluntarily administer exams written to the Common Core standards. The Smarter Balanced Assessment Consortium (SBAC) is one of two consortia organizing the design and contracting of the Common Core assessments; SBAC is responsible for about half of the member states, including California.
I have examined the SBAC’s RFPs for the design and delivery of the CC assessments, and the consortium has managed to construct a guide for contractors that even Finnish educators would admire. It is difficult to tell from the website, but it appears that the SBAC employed work groups that engaged school practitioners, or at least retired practitioners, to shape the tasks.
The winning bids for exam design, delivery, and reporting for the SBAC have all gone to Wireless Generation, a company turned down by the New York Department of Education at least in part because of the role of its parent company, News Corp, in mishandling personal data. This actually concerns me less (for now) than the challenge that the private, for-profit Wireless Generation (WG) must meet to deliver on the promise of the Common Core.
I am hopeful that WG can construct a multiple-choice administration tool that is adaptive and requires less of students’ time to assess what multiple-choice tests can assess; namely, what a student does not know. Call me cynical, but less time spent taking multiple-choice tests is a win at this point.
I am also confident that the SBAC’s oversight of the process will inspire WG to construct free-response questions for both language arts and mathematics that look much like the sample assessments already available on the SBAC website. In language arts, these sample items require students to read critically, write with insight, and analyze age-appropriate text with significant depth. In mathematics, the problem solving requires significant critical thinking in addition to strong traditional skill in mathematical manipulation. I am impressed.
The College Board and the International Baccalaureate organizations are two examples of standardized testing administration bodies that have done some things right. Both organizations employ a combination of multiple-choice items and free response in their standardized assessments. The IB goes one step further, collecting sample work from the school year (essays, laboratory reports, etc.) as well. Then both organizations employ armies of trained educators to perform the assessment in a controlled fashion.
The logistics of this task are incredibly challenging. To score the AP exams, tens of thousands of educators gather in gymnasiums and cafeterias around the country to evaluate the nearly 3.7 million AP exams taken by students in a given year.
The IB handles this international challenge a bit differently. In addition to end-of-year exam scoring events, trained IB scorers (all of whom are also IB-trained teachers) are mailed work samples completed during the school year by students at another school (likely in another country) to assess at their leisure – sort of – there is a deadline. In both cases, however, the exams are scored by trained human eyes; and, in particular, the eyes of accomplished, highly trained, professional teachers.
The Common Core assessment is likely to look much more like IB assessment than AP. The SBAC has outlined a formative assessment schedule that will look much like the IB’s internal assessment of student work during the school year. And the extensive amount of writing in the SBAC-outlined assessments will more closely mirror the many papers associated with end-of-year IB exams. To meet the IB testing requirements in my 2,000-student high school (where only about 500 students take IB exams), we employ a full-time IB coordinator and a part-time administrative assistant. The total local personnel cost for this administration (including benefits, insurance, etc.) is probably about $150k. The exams themselves cost the students $75 each. And the annual fee to be an IB school (which mostly goes to test creation, assessment, and reporting) is $10,400. To review… administering exams that closely match those outlined by the SBAC for the Common Core costs, for 500 students, about:
$150k + ($75/exam × 2 exams/student × 500 students) + $10.4k = $150k + $75k + $10.4k = $235,400
…or $235,400 / 500 students ≈ $471 per student
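For readers who like to check the arithmetic, here is a minimal sketch in Python; every figure is simply the estimate quoted above for my own school, not data from SBAC:

```python
# Back-of-the-envelope IB administration cost, using the estimates above.
students = 500               # students taking IB exams at my school
personnel_cost = 150_000     # coordinator + assistant, with benefits
exam_fee = 75                # per exam, paid by students
exams_per_student = 2        # language arts and mathematics
school_fee = 10_400          # annual IB school fee

total = personnel_cost + exam_fee * exams_per_student * students + school_fee
per_student = total / students

print(total)               # 235400
print(round(per_student))  # 471
```

The same three lines make it easy to test other scenarios, e.g. spreading the fixed personnel cost over more exam takers.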
To put this in perspective, the per-pupil cost to administer the NCLB-inspired tests in 2004 was approximately $20 per student. That is more than a 20x difference.
Here is a well-thought-out examination of the cost of standardized testing…
WG will clearly have a significantly larger assessment challenge to meet, and undoubtedly a smaller per-student fee to work with. The CC assessments must meet the requirements of the law; for language arts and mathematics, this means testing each child at least eighteen times over the public school experience (grades 3–11). I love back-of-the-envelope calculations, so hang with me on this one. Of the 19 million test-eligible public school students, about 13 million fall within CC-adopting states served by the SBAC; they will take a total of approximately 26 million end-of-year tests each year (in language arts and mathematics alone), and even more mid-year formative assessments. This is a challenge an order of magnitude greater than any the most experienced test administration organization in the country has ever faced.
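The envelope math above works out as follows (a sketch; the 13 million and 19 million figures are the estimates quoted in this post):

```python
# Estimated annual end-of-year Common Core testing load for SBAC states.
sbac_students = 13_000_000   # of ~19M test-eligible students nationwide
subjects = 2                 # language arts and mathematics
tested_grades = 9            # grades 3 through 11

tests_per_year = sbac_students * subjects
tests_per_student_career = tested_grades * subjects

print(tests_per_year)            # 26000000 end-of-year tests annually
print(tests_per_student_career)  # 18 tests per student, grades 3-11
```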
How will WG accomplish this task? Both the AP and the IB are nonprofit organizations that rely upon armies of educators who travel and work for little pay in exchange for the professional development experience of scoring student exams. The SBAC offers this answer to a related FAQ on its website: “Finally, teachers will score parts of the assessments, including extended response and performance tasks.” If WG can build trust with America’s teachers, and every school were to be involved in the process, this could be a significant growth experience for the country.
Keep in mind that computer-based assessment of student written argument is still impossible. Nobody has cracked this nut, and nobody in edtech is likely to do so for at least another ten years. However, there are computer-based productivity tools that could make this process faster and easier for our nation of educators.
There is a significant opportunity for WG to fulfill the goals of the SBAC and improve the quality of assessment currently happening in the United States. If WG engages educators as allies and partners in constructing and assessing quality exams, we might just make a smarter generation. However, if, in the balancing act that weighs quality against profit, WG elects to cut costs with unproven computer-based assessment techniques, or, worse, by outsourcing exam scoring to non-educators, I fear that the focus on student argument will be lost.
I am hopeful that what we will experience is a net improvement over what has existed for the last ten years. Assessment is an important part of learning. We have far more summative assessment than any of the other industrialized nations right now, and it has not proven successful in increasing student intelligence and critical thinking ability. Few would argue that one significant reason for this failure has been low quality tests coupled with a process that leaves educators out. With deeper assessments that focus on argument, written and assessed in partnership with America’s educators, and aided by advances in technology, we have the opportunity for a renaissance in public education.
EXCELLENT and informative. I still do not think teachers nor parents (who pay the taxes for our public schools) have wrapped themselves around this important step forward. Glad you put it in understandable language.
Jack, Both consortia are shooting for the $20 price point so your title and argument are off base. You’re also mistaken about online writing assessment–I spent much of last year demonstrating the veracity of online scoring, here’s the paper describing the two trials we ran: http://gettingsmart.com/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
Tom,
Thank you for your critique. I like a good debate. I had read the original Shermis and Hamner paper examining the results of the vendor challenge, but I had not yet read your promotional piece, which references the Hewlett automated essay scoring challenge. In order to respond properly to your critique, I re-read the Shermis and Hamner paper and then read the promotional piece you link to above. I remain unconvinced.
Before I make any comments here, let me share my notes from reading both your promotional piece and the academic paper by Shermis and Hamner. They can be found in a shared Google doc here…
https://docs.google.com/document/d/1UwzU3MzDxCz5IzXlyX7O9y-2VCZSAt2ORZTVrVzDi6g/edit?usp=sharing
Additionally, I would like to share with my audience that I have been a public high school teacher for 15 years, and that I am also working with Hapara Inc., a company that makes a teacher dashboard for Google Apps. My comments in no way reflect the opinions of Hapara leadership; I speak for myself. My teaching, however, definitely biases me toward believing that teachers are still needed to analyze student products that require high-level thinking skills. It is also true that I do not stand to gain financially in any way from my Hapara work by arguing the points I make here and in the lengthier notes linked above.
I agree with this point made early in your promotional piece…
“It is our position that the application of machine scoring for student assessment shows important promise, but the only relevant application is to support assessment and instruction, not to supplant recognized best practices among teachers and other expert instructors.”
Automated essay scoring is already to the point that it can be an excellent ancillary tool for both formative and summative assessment, but it is not yet capable of evaluating complex written argument, particularly when it is referential.
I am also in agreement with this summary analysis from the Shermis/Hamner paper that you reference…
“As a general scoring approach, automated essay scoring appears to have developed to the point where it can be reliably applied in both low-stakes assessment (e.g., instructional evaluation of essays) and perhaps as a second scorer for high-stakes testing.”
I am particularly excited by the idea that automated essay scoring could be applied in a formative way throughout the school year to assist teachers in tracking student progress in real time. But, like Shermis and Hamner, I am not confident enough in the results of these studies to believe that automated essay scoring can yet be used as much more than a second scorer on our nationwide summative tests, particularly where student argument is concerned.
There are a couple of important points that I will summarize here for my readers. First, judging the effectiveness of an automated scoring system by simply comparing average scores is dangerous. It is quite possible for the machine to disagree with the human scoring force on 80% of individual samples and still match the human average, so long as the machine scores high as often as it scores low. More appropriate metrics are therefore the statistical correlations that account for this variance.
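To illustrate why averages alone are a poor yardstick, here is a minimal synthetic sketch (the scores are invented for illustration, not data from either paper): a machine can match the human mean exactly while agreeing on no individual essay, and a correlation measure exposes the gap.

```python
# Synthetic illustration: machine scores that match the human average
# while disagreeing on every individual essay.
human   = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
machine = [3, 5, 4, 1, 5, 1, 2, 5, 2, 2]  # same mean, badly shuffled

mean_h = sum(human) / len(human)
mean_m = sum(machine) / len(machine)
print(mean_h, mean_m)  # 3.0 3.0 -- the averages agree perfectly

exact = sum(h == m for h, m in zip(human, machine))
print(exact)  # 0 -- yet not one essay received the same score

# Pearson correlation accounts for the per-essay variance the mean hides.
cov   = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, machine))
var_h = sum((h - mean_h) ** 2 for h in human)
var_m = sum((m - mean_m) ** 2 for m in machine)
r = cov / (var_h * var_m) ** 0.5
print(round(r, 2))  # -0.27 -- weak correlation despite identical means
```

Agreement statistics such as quadratic weighted kappa, used in the Hewlett contest, serve the same purpose: they penalize per-essay disagreement that averages conceal.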
On the metrics used by Shermis and Hamner to account for this variance, scoring systems that used a three- or four-point rubric did quite well. With gross bins like these, it is easier for the algorithms to hit their targets, since they are not assessing argument but simply using statistical proxies. However, with rubrics that contained more points, thus requiring a more nuanced assessment, the automated scoring systems fell apart.
To this same point, the Hewlett contest for automated essay response used rubrics on a two- or three-point scale. Some of the improvements we witnessed in scoring accuracy might be attributable to this changing of the rules. If all the machine has to do is determine yes/no/maybe, then it will be more successful than if it has to determine (in a more complex way) the extent to which Felipe, in his grammatically unsophisticated but analytically complex style, has demonstrated that he can…
“Write informative/explanatory texts to examine and convey complex ideas, concepts, and information clearly and accurately through the effective selection, organization, and analysis of content.”
-CCSS.ELA-Literacy.W.9-10.2
Given this reference, a question arose for me while evaluating both your promotional piece and the Shermis/Hamner paper: were these automated scoring engines chewing on essays written for last-generation assessments or for the coming generation? I suspect that the statistical proxies used by algorithms tuned to essays written for NCLB-inspired tests will fall far short of capturing what we are expecting students to do (with argument and logic, in particular) to meet the standards of the Common Core.
I will conclude first by saying that I appreciate your challenge. And I will concede that the claim implied by my provocative title (that properly assessing student writing on Common Core-inspired summative assessments, in a way that actually meets the standards, will require spending 20x more money per student) is likely high. The number is probably more like 10x, unless, as I pointed out in the original post, the SBAC and PARCC are able to engage schools and teachers in the process, as is done, for example, during the assessment of Regents Exams in the state of New York. By leveraging the existing public education teaching force within their normal work day, we could probably approach the target expense and improve the quality of the assessments more quickly.
Like you, I am hopeful about the promise of automated writing assessment. But I would hate to see the hope of educators across the nation extinguished once again by another round of standardized tests that fail to deliver on the promise of inspiring instruction that truly engages higher-order thinking skills, simply because it was not economically feasible to write and score the tests we need. I think engaging educators in the process in a big way is the solution that makes the most sense in the near future.
Best,
Jack
References
ASAP case study, co-written by Vander Ark and others: ASAP-Case-Study-FINAL.pdf
Shermis and Hamner paper: NCME_2012_Paper3_29_12.pdf
Jack, IB does not have multiple choice except for a few science exams. We do well over 1,000 exams, with an AP doing the work, along with other duties… IB is not-for-profit and its employees make modest salaries, not the millions that Joel Klein gets from Wireless.
To turn a phrase, I know IB and Wireless is not IB.