Exam grades can never be ‘accurate’. But they can be ‘reliable’.

Date: 7th Dec 2020
Author: Dennis Sherwood
Categories: Policy and News

How accurate are GCSE, AS and A level grades?

A reasonable question; and an important one too. We all want exam assessments to be trustworthy, and so measuring the accuracy of grades seems to be a sensible thing to do. 

To me, however, the question is deeply problematic, for the only way that I can think of to decide whether or not any grade is ‘accurate’ is to compare it to a grade that is known to be ‘right’. That ‘right’ grade must itself be determined by an underlying ‘right’ mark, and it’s here we hit the rocks. As everyone knows, and as Ofqual acknowledges, “it is possible for two examiners to give different but appropriate marks to the same answer”. Importantly, neither examiner has made any ‘marking errors’; rather, the two different marks – say, 64 and 66 – are both “appropriate”, and result from legitimate differences in the academic judgement of the two examiners. If it so happens that the two marks are within the same grade width, then the candidate is given, say, grade B as a result of both marks. But if the B/A grade boundary is 65, then one mark results in grade B, the other, grade A. Which is ‘right’? Which is ‘accurate’? 

That’s just one example, but the general point is, I believe, true: because there is no single ‘right’ mark, there is no single ‘right’ grade, and so it is impossible to determine whether or not the grade given to any script is ‘accurate’.

Ofqual fudge this by defining what they sometimes call the ‘definitive’ grade, sometimes the ‘true’ grade, this being the grade resulting from the mark given to a script by a ‘senior examiner’. In practice, no one knows whether any particular script is marked by a ‘senior examiner’ or by an ‘ordinary’ one, and – if by an ‘ordinary’ one – whether that mark happens to result in the same grade as a ‘senior examiner’s’ would. If there is more than one ‘senior examiner’, the definition assumes that all ‘senior examiners’ are ‘of the same mind’; and even if there is only one, it further assumes that this august person maintains a constant standard under all circumstances, so that exactly the same mark would be given to the same script no matter how tired that person might be.

All that makes me uncomfortable with the concept of ‘accuracy’ as applied to exam grades, and rather worried when someone in as high a position as Dr Michelle Meadows, Ofqual’s Executive Director of Strategy, Risk and Research, makes a statement such as this response to a question at the hearing of the Education Select Committee held on 2nd September 2020: “There is a benchmark that is used in assessment evidence that any assessment should be accurate for 90% of students plus or minus one grade”. 

Let me therefore pose a rather different question: “How reliable are exam grades?”.

It might appear that, in essence, this question is the same as the first, and that the replacement of ‘accurate’ by ‘reliable’ is merely the result of thumbing through a thesaurus. But no: whereas ‘accuracy’ requires a prior definition of ‘right’, I consider that ‘reliable’ is an appeal to a ‘second opinion’, in that a ‘reliable’ grade is one that would have a very high probability of being confirmed, and not changed, as a result of a fair re-mark by a different examiner. ‘Reliability’ is therefore very easy, and pragmatic, to test: give the script to another examiner, and see whether or not the re-mark confirms the originally-awarded grade.

This is what Ofqual has done for whole cohorts of 14 subjects, reporting the results in their November 2018 report Marking Consistency Metrics – An update. The re-marking was done by ‘senior examiners’, and so they refer to the ‘accuracy’ of the resulting grades – for (all varieties of) Maths, about 96%; for Economics, 74%; Geography, 65%; History, 56%. The average across the 14 subjects studied by Ofqual is about 75%, a number which my own analysis suggests applies across all subjects, implying that about 1 grade in 4, as awarded, would be changed if re-marked by a ‘senior examiner’ – or, in plainer language, about 1 exam grade in 4 is wrong.
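The arithmetic behind "1 grade in 4" can be sketched in a few lines of Python. This is only an illustration: the percentages are the approximate figures quoted above, and the names `CONFIRMED_PERCENT`, `changed_percent` and `one_in_n` are invented for this sketch.

```python
# Approximate percentages quoted above: the share of grades confirmed when
# re-marked by a 'senior examiner'. The names here are illustrative only.
CONFIRMED_PERCENT = {
    "Maths": 96,
    "Economics": 74,
    "Geography": 65,
    "History": 56,
}


def changed_percent(confirmed: int) -> int:
    """Share of grades (in percent) that would change on re-marking."""
    if not 0 <= confirmed <= 100:
        raise ValueError("a percentage must lie between 0 and 100")
    return 100 - confirmed


def one_in_n(confirmed: int) -> float:
    """Express the changed share as '1 grade in n'."""
    changed = changed_percent(confirmed)
    if changed == 0:
        raise ValueError("no grades change, so '1 in n' is undefined")
    return 100 / changed


if __name__ == "__main__":
    for subject, confirmed in CONFIRMED_PERCENT.items():
        print(f"{subject}: about {changed_percent(confirmed)}% of grades would change")
    # The 14-subject average of about 75% gives the headline figure.
    print(f"Average: about 1 grade in {one_in_n(75):.0f}")
```

On these figures, a 75% confirmation rate corresponds to 25% of grades changing, which is the "1 in 4" above.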

Ofqual call their results measures of ‘accuracy’; my preference is to refer to them as measures of ‘reliability’, where the ‘second opinions’ have been given by a particular category of examiner, a ‘senior examiner’.

But regardless of the semantics of ‘accuracy’ and ‘reliability’, my view is that the fact that 1 exam grade in 4 is wrong, and has been wrong for years, is outrageous. So let me float an idea as to how assessments, as shown on candidates’ certificates, can be as close to 100% reliable as we might wish – where, in this case, I really do mean ‘reliable’, for, as I have argued, to me the concept of ‘accuracy’, as regards exam grades, is meaningless.

To achieve this, two policy changes need to be made: the first relating to the rule for determining grades from the script’s mark; the second relating to the rule for appeals.

To put that into context, the current rule for determining a script’s grade is simple – the script is given a (single) mark, say, 64; that mark is mapped onto a pre-defined grade scale, say, “all marks from 61 to 65 are grade B”; and the grade B is awarded accordingly. The appeals rule is (in principle) simple too – if the script can in fact be re-marked (which is in practice very difficult), then a re-mark, say 66, ‘over-rules’ the original mark, 64, and may or may not result in a change in grade, depending on where the grade boundaries are.
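As a rough illustration, the current grading and appeals rules might be sketched in Python as follows. Only the band "61 to 65 is grade B" comes from the example above; every other boundary, and the names `LOWER_BOUNDS`, `GRADES`, `grade_for` and `appeal_current`, are assumptions made for this sketch.

```python
from bisect import bisect_right

# Hypothetical grade scale. Only "61 to 65 is grade B" is taken from the
# text; the remaining bands are invented for illustration.
LOWER_BOUNDS = [0, 51, 56, 61, 66]  # lowest mark in each band
GRADES = ["E", "D", "C", "B", "A"]


def grade_for(mark: int) -> str:
    """Current rule: map a single mark onto the pre-defined grade scale."""
    if mark < 0:
        raise ValueError("a mark cannot be negative")
    # bisect_right counts how many lower bounds are <= mark.
    return GRADES[bisect_right(LOWER_BOUNDS, mark) - 1]


def appeal_current(original_mark: int, remark: int) -> str:
    """Current appeals rule: the re-mark simply over-rules the original mark."""
    return grade_for(remark)


if __name__ == "__main__":
    print(grade_for(64))           # original mark
    print(appeal_current(64, 66))  # re-mark crosses the B/A boundary
```

With these assumed boundaries, a script marked 64 is graded B, and a re-mark of 66 changes the grade to A, even though both marks may be equally legitimate.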

The fundamental problem with these policies is that the rule for determining the grade fails to recognise that the mark, 64, is ‘fuzzy’, in that different examiners can legitimately give the same script a different mark. The grading policy assumes that 64 is the ‘one-and-only’ mark for that script, which it surely isn’t.

To deliver reliable grades, this rule has to change, and to recognise that marks are ‘fuzzy’. So let’s suppose that the ‘fuzziness’ for the subject is known to be, say, 4 marks either side of any given mark. That means that a script marked 64 might be re-marked anywhere between 64 – 4 = 60 and 64 + 4 = 68 if fairly re-marked by any other examiner (not just a senior one). This measure varies by subject, with Maths being characterised by a smaller number than Economics, which is in turn smaller than the number for History – please do contact me for further details.

So, hold on to your hat and take a deep breath.

Suppose that a script is marked 64, this being more realistically 64 ± 4. Currently, the grade is determined by the original mark 64. But suppose that the grade is determined not by 64, but by the ‘adjusted’ mark 64 + 4 = 68.

And before the word ‘ridiculous’ passes through your mind, consider what might happen when the script is re-marked 66, perhaps on appeal. 

No. The script is not re-graded according to 66 + 4 = 70. 

Why not?

Because the possibility of a re-mark greater than 64 has already been taken into account by awarding the grade based on 64 + 4 = 68.

So only if the re-mark is greater than 64 + 4 = 68, or less than 64 – 4 = 60, would the grade be reviewed. And if the statistics have been done properly, so that the ‘adjustment’ recognises a sensible range of fair re-marks, then the likelihood that an originally-awarded grade will be changed is very low. As a corollary, the originally awarded grade will be highly reliable – and trustworthy.
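The proposed rule can be sketched in the same way. This is a minimal illustration under stated assumptions, not a definitive implementation: the grade scale is hypothetical apart from "61 to 65 is grade B", it ignores any re-calibration of grade boundaries, and the names `grade_for`, `grade_proposed` and `needs_review` are invented here. The sketch only reports whether a re-mark would trigger a review; what the review then decides is not specified above.

```python
from bisect import bisect_right

# Hypothetical grade scale. Only "61 to 65 is grade B" is taken from the
# text; the remaining bands are invented for illustration.
LOWER_BOUNDS = [0, 51, 56, 61, 66]  # lowest mark in each band
GRADES = ["E", "D", "C", "B", "A"]


def grade_for(mark: int) -> str:
    """Map a mark onto the (assumed) grade scale."""
    if mark < 0:
        raise ValueError("a mark cannot be negative")
    return GRADES[bisect_right(LOWER_BOUNDS, mark) - 1]


def grade_proposed(mark: int, fuzziness: int) -> str:
    """Proposed rule: grade the 'adjusted' mark, i.e. mark + fuzziness."""
    return grade_for(mark + fuzziness)


def needs_review(original_mark: int, remark: int, fuzziness: int) -> bool:
    """Proposed appeals rule: review only if the re-mark falls outside
    the range original_mark - fuzziness .. original_mark + fuzziness."""
    return abs(remark - original_mark) > fuzziness


if __name__ == "__main__":
    fuzziness = 4  # the illustrative value used in the text
    print(grade_proposed(64, fuzziness))    # graded on 64 + 4 = 68
    print(needs_review(64, 66, fuzziness))  # inside 60..68
    print(needs_review(64, 69, fuzziness))  # above 68
    print(needs_review(64, 59, fuzziness))  # below 60
```

With a fuzziness of 4, a re-mark of 66 leaves the grade untouched, while a re-mark of 69 or 59 falls outside the 60 to 68 range and would prompt a review.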

And no, this does not drive ‘grade inflation’, for there is no year-on-year effect; there might, however, be a single re-calibration of grade boundaries in the year this policy is introduced.

A variant of this idea is to dispense with grades altogether, and to show on the certificate the mark, and also the measure of that subject’s fuzziness, for example ‘Geography, 64 ± 4’; the policy on appeals would be the same.

Well, if you’ve read this far, you could be furious, intrigued, incredulous, interested, or typing an email suggesting that I be locked up. I’m happy with all of those, for my objective is to encourage you to think about it.

For exam assessments, accuracy is impossible. 

But reliability is within our reach – though only if Ofqual are willing to change their deeply flawed current policies.

Dennis Sherwood runs Silver Bullet Machine, a boutique consultancy firm, and is an education campaigner. Read his previous piece on 2020's centre-assessed grades here.
