Ofqual’s Algorithm: The method to madness

You would be hard pressed to find anyone willing to defend Ofqual’s calculated grades a month after they briefly arrested the dreams of many students. And yet the algorithm had its supporters and defenders before Results Day on 13 August. IF Research Intern Anthony J. Piwowarski explores the ways in which a now universally condemned policy was rationalised by those in the know

“This has been a mega mess and should not have happened.” Such was the assessment of Robert Halfon, the Conservative chair of the education select committee, a day after the government scrapped its calculated grades algorithm and awarded A-level students the Centre Assessed Grades (CAGs) submitted by teachers in June.

In the days following the English A-level Results Day on 13 August, the exam algorithm produced by the Office of Qualifications and Examinations Regulation (Ofqual) united unlikely allies in outrage, as almost 40% of grades were adjusted downwards by the algorithm. A-level candidates marched in their hundreds, their aspirations devoured by the chasms opening up between their assessed and awarded grades. Statisticians took the algorithm to task, revealing poor assumptions and weak measures of uncertainty.

All were collectively astounded that the government could defend a system with holes that neatly mapped onto gaping socioeconomic divides.

And yet it did. Gavin Williamson, the Secretary of State for Education, cut no isolated figure when insisting upon the system’s fairness and robustness in the months leading up to Results Day. Before 13 August, he was backed by non-ministerial departments, devolved administrations, exam boards, ministers, officials, newspapers, and some political commentators. Since then, the class of 2020 has brought its future back from the brink, and the government’s statistical and political failures have been thoroughly dissected.

At this moment, it’s perhaps worth thinking about how so many holders of office and authority came to create, implement and rationalise a system subsequently deserted by its own Prime Minister.

The spectre of grade inflation

On 15 August 2020, two days after A-level results were released, Gavin Williamson gave a combative interview to the Times journalist Steven Swinford. Facing intense media and public scrutiny after a calamitous Results Day, Mr Williamson deployed his fiercest rhetoric yet: a U-turn, he claimed, would lead to “rampant grade inflation”.

This exchange would soon become a relic of ministerial failure, rather than a decisive statement of governmental intent. Yet taken out of its context of incompetence, this interview reveals much about the underpinnings of the government’s approach to this year’s grades. Its message was far from unusual, echoing countless op-eds, ministerial statements and think tank reports that have influenced conservative educational policy since 2010.

In 2014 Dominic Cummings, now Chief Adviser to the Prime Minister but then a special adviser to Education Secretary Michael Gove, justified the department’s strident educational reforms on the basis that “the curriculum and exams had been devalued over the quarter century since MPs took control of them”. Exam statistics were thus consciously, and perhaps rightly, reframed: Dennis Sherwood, a regular contributor to the HEPI blog series, notes that until the Cameron government, higher rates of top grades at A-level were lauded as “proof” of better teaching. Reducing grade inflation has since become a guiding principle in circles close to government, underpinning a particular idea of fairness and justice in educational outcomes.

Unsurprisingly, this assumption has framed institutional priorities in the last few months. It is telling that, in its written statement to the Education Select Committee, Ofqual emphasised that it had been directed by government to “ensure that any model did not lead to excessive grade inflation”.

The trade-offs of “fairness”

Subsequent statistical decisions were made on the basis that, above all else, the calculated grades should not be unduly inflated. When early analysis of the data showed that A-level CAGs averaged 12 percentage points higher than 2019 results, Ofqual’s response was swift and decisive: CAGs were removed from the eventual algorithm. This despite the possibility that much of the inflation arose from the cumulative effect of teachers, given rather vague guidance on awarding CAGs, each making reasonable rounding decisions.
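How individually reasonable rounding decisions can aggregate into system-wide inflation is easy to illustrate. The toy simulation below is a sketch with entirely assumed numbers (a 0–100 mark scale, 10-mark grade bands, a 2-mark “benefit of the doubt” margin); none of it reflects Ofqual’s data or teachers’ actual guidance:

```python
import random

# Toy model: students sit on a 0-100 mark scale divided into 10-mark
# grade bands. An "optimistic" teacher bumps any student within 2 marks
# of the next band boundary up a band -- defensible for any one pupil,
# but inflationary once every teacher does it.

def top_grade_share(n_students=100_000, optimistic=False, seed=2020):
    """Share of students landing in the top bands (marks >= 80)."""
    rng = random.Random(seed)
    top = 0
    for _ in range(n_students):
        mark = min(max(rng.gauss(62, 13), 0.0), 99.9)  # assumed attainment spread
        if optimistic and mark % 10 >= 8:
            mark += 2  # round the borderline case up to the next band
        top += mark >= 80
    return top / n_students

honest = top_grade_share(optimistic=False)
generous = top_grade_share(optimistic=True)
print(f"Top-grade share: {honest:.1%} honest vs {generous:.1%} optimistic")
```

On these assumed numbers the “optimistic” rule lifts the top-grade share by a couple of percentage points, even though no single teacher has done anything unreasonable.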

On the other hand, Ofqual’s finding that 8.5% of centres experienced “unstable” outcomes in their 2015 and 2016 GCSE results, meaning that these schools saw larger year-on-year variation, was not deemed sufficient to justify accounting for volatility in the final model. Whilst this decision was likely informed by statistical and practical considerations, these were not specified in Ofqual’s Summer 2020 Guidance Notes. Inaction was instead justified on the basis that 90% of centres were classed as having stable outcomes. Notwithstanding the fact that this analysis was limited to GCSE results, inaction hit hardest at comprehensive schools, whose swings in results often reflected the unusual talent and hard work of a given cohort.

Whilst grade inflation was a sufficient rationale for abandoning the extensive input of teacher-assessed grades, Ofqual deemed the issue of variability marginal, despite poorer, gifted students losing out. In the end, as the government would later find out, it is precisely at these margins that the greatest successes, emotions, and hard work of the exam season are crystallised. The protests articulated this, exposing the mismatch between the government’s deep-seated notions of educational fairness and those of the public.

In the words of Sharon Witherspoon, Vice President of the Royal Statistical Society for Education and Statistical Literacy, “Once you took the decision that you weren’t going to tolerate more than a couple of percentage points of grade inflation… that drove everything else.”

Broken barometers of statistical significance

Value judgements about educational fairness that resonated deeply with government ministers were not the only way in which the algorithm was rationalised. Those who worked on the algorithm subjected it to extensive validation (enough to fill a 319-page document) that, based on internal metrics, seemed to justify its use.

Except these metrics were themselves flawed. Ofqual contextualised the predictive accuracy of its model with reference to a recent (2018) Ofqual study of marking consistency. The experiment measured the probability that an exam script marked by an examiner would be given the same grade by a senior colleague. In other words, it measured how often exam scripts are marked “correctly”, where the correct grade is defined as the one given by the more experienced examiner. On this measure, Ofqual found that in more subjectively assessed subjects its 2020 predictive algorithm was actually more accurate than exam-based assessment: the algorithm correctly predicted 68% of A-level History grades, versus the 56% of exam-marked scripts awarded the “correct” grade. Whilst neither of these numbers is particularly flattering, the proposition that Ofqual’s algorithm was more accurate than exams could certainly be seen as compelling.
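The metric in that study can be made concrete with a small simulation. Everything below is an illustrative assumption rather than Ofqual’s experimental design (seven grade bands, a one-band marking “wobble”); the point is simply what an agreement probability of this kind measures:

```python
import random

GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

def noisy_grade(true_index, spread, rng):
    """An ordinary examiner's grade: the 'definitive' grade plus noise."""
    jitter = rng.randint(-spread, spread)
    return min(max(true_index + jitter, 0), len(GRADES) - 1)

def agreement_rate(n_scripts=50_000, spread=1, seed=13):
    """Estimated probability that an examiner matches the senior examiner.

    The senior examiner's grade is treated as ground truth, mirroring the
    study's definition of a 'correctly' marked script.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_scripts):
        definitive = rng.randint(0, len(GRADES) - 1)
        agree += noisy_grade(definitive, spread, rng) == definitive
    return agree / n_scripts

print(f"Agreement with senior examiner: {agreement_rate():.1%}")
```

On these made-up assumptions, only around two scripts in five receive the “correct” grade; what matters is that the benchmark is an agreement probability between two fallible human judgements, not an absolute measure of marking quality.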

Ofqual’s justification for its algorithm was benchmarked against a broken system, but one whose brokenness is so fundamental that it is unlikely to be resolved given the popular perception that exams are an acceptable means of assessing merit.

Concluding thoughts

The above analysis is a small sample of the ways in which a policy decision with such damaging consequences could be made to seem fair and statistically sound. Policy-making assumptions have a history of their own, and it is worth exploring how the subjective choice to prioritise curbing grade inflation over other conceptions of educational justice became the objective parameters within which the algorithm could be evaluated.

Yet Ofqual’s report into the unreliability of exam-based assessment is perhaps even more disturbing. Whilst protestors rightly rejected this year’s unreliable system, Ofqual’s own research exposing the unreliability of “normal” exam marking has flown under the radar, treated as only incidentally relevant to this year’s debacle. As everyone eagerly returns to summer exams in the coming years, it is worth considering whether the broader system draws its appeal from the mere illusion of merit.

Photo by Mark Claus on Unsplash: https://unsplash.com/@markcnunes
