Another wrong (about randomisation) educationalist
Jo Boaler, author, Stanford professor (click here), and founder of the educational website YouCubed (click here), is visiting the UK to persuade schools to take up her maths teaching ideas (click here). She objects to rote learning of times tables. According to the Times Educational Supplement (click here) she has said:
“Governments saying everybody has to memorise their times tables to 12 times 12 is absolutely disastrous.”
Blimey! But she has her reasons. She believes forcing weak children to learn tables makes them anxious about maths in general.
“What we know now is that when you give things to kids like a timed multiplication test, about a third of them develop anxiety. For those kids the working memory which holds maths facts is blocked and they can’t access it.”
“Some kids aren’t fast memorisers, and they decide from an early age that they can’t do maths because of the timed maths tests.”
Even bright kids are harmed:
“Other kids may be OK but see maths as a shallow subject which is about recall of facts and disengage. So [tables cause] huge damage”.
It all sounds plausible, her webpages cite whole libraries of academic papers, and she is obviously a charismatic educationalist. Judging by her Twitter feed @joboaler, many teachers adore her.
But others argue that learning tables is a vital early step in getting comfortable with mathematics. Click here for one. They also have theories and academic papers in support.
The research cited by each side is impenetrable to anyone uncommitted to the argument the author is advocating, and I’m certainly not qualified to judge it.
But I am qualified to say that in an area like maths teaching, where factors like innate ability, teacher enthusiasm and parental engagement almost certainly influence results, the only reliable way to judge who is right is a randomised controlled trial. But Jo Boaler cites none, and Google can’t find any.
I am also qualified to state that Jo Boaler doesn’t understand the limitations of observational data and the need for randomised trials in education. In a paper (click here) critiquing the US National Mathematics Advisory Panel, which had advocated teaching tables in 2008, she wrote:
“When comparing teaching approaches to consider which is more effective, random or equal assignment may be thought of as presenting a research ideal. If students are assigned to random or equal groups and given different treatments, and one treatment results in better outcomes, then researchers have a strong case for making causal statements. Experiments such as these have emanated from medical research, and they lend themselves to the controlled conditions of laboratories. However, when researching learning in complicated places such as schools, such models become highly impractical and, some would say, implausible.”
“Researchers in mathematics education do not need to assign students to groups in quasi-experimental studies, taking control of their education, as they can employ statistical methods to control for differences in student characteristics. Using logistic regression analysis, for example, researchers can control for factors such as prior mathematics achievement, gender, and socioeconomic status. It could be argued that researchers cannot control for every variable that may affect a student in a population, but they can control for all those known to be reasonable […]”
That’s wrong Jo Boaler. Other educationalists have expressed similar sentiments, and they are wrong too. Not just a bit wrong, but absolutely 100% wrong. The exact opposite of correct. Score gamma triple minus in the “education intervention evaluation exam”. Go to the back of the class Jo Boaler.
It is the very complexity of education, the many unknown factors that influence outcomes, that justifies randomisation. “Prior mathematics achievement, gender, and socioeconomic status” aren’t the problem. Jo Boaler is right about that; we can measure them and, at least in principle, control for them using logistic regression analysis. But, by definition, no amount of fancy statistics can ever control for unknown factors; they are unknown factors. The only way to have any assurance that they are more or less equal between the two groups under study is to allocate the students at random.
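The point can be made concrete with a toy simulation (entirely my own construction; the “parental engagement” factor and all the numbers are illustrative, nothing to do with any real study). An unmeasured factor balances out automatically under random allocation, but not under self-selection, and no regression on measured covariates can repair the latter because the factor is never observed:

```python
import random
import statistics

random.seed(1)

# Each pupil has an UNKNOWN factor (say, parental engagement) that we
# never measure and therefore can never adjust for statistically.
unknown = [random.gauss(0, 1) for _ in range(10_000)]

# Randomised allocation: shuffle and split in half.
shuffled = unknown[:]
random.shuffle(shuffled)
treat, control = shuffled[:5_000], shuffled[5_000:]
diff_random = abs(statistics.mean(treat) - statistics.mean(control))

# Self-selected allocation: pupils whose families are keener (higher
# unknown factor, plus some noise) opt in to the new teaching method.
keenness = sorted(unknown, key=lambda u: u + random.gauss(0, 1))
opt_out, opt_in = keenness[:5_000], keenness[5_000:]
diff_selected = abs(statistics.mean(opt_in) - statistics.mean(opt_out))

print(f"randomised groups differ by    {diff_random:.3f}")    # near zero
print(f"self-selected groups differ by {diff_selected:.3f}")  # far from zero
```

Run it a few times without the seed: the randomised difference hovers near zero while the self-selected one stays stubbornly large, and since the factor was never recorded, no amount of logistic regression could tell you so.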
This misunderstanding about randomisation causes trouble in two ways. If Jo Boaler is wrong, her ideas are condemning thousands, maybe millions, of children to never learning their tables, and to a lifetime of innumeracy.
But what if she is right? In that case her failure to test her ideas against well-conducted randomised trials allows governments all over the world to continue forcing children to learn their tables by rote and condemn even more to a lifetime of fear of mathematics.
What a pity no-one taught Jo Boaler how to evaluate educational interventions properly.
What happened next
Luke’s version is a story of redemption. A wastrel loses his inheritance “with riotous living”, admits his error, “I […] am no more worthy to be called thy son”, but his father forgives him, kills the fatted calf, and tells his resentful older brother to rejoice, “for this my son was dead, and is alive again; he was lost, and is found”.
In Graham Greene’s Monsignor Quixote* the communist Mayor recounts the parable before dinner. In his version the son from a bourgeois family objects to inherited wealth and, in a Tolstoyan gesture of solidarity with the poor, gives his share away and lives as a peasant until, his courage failing, he returns to his father for forgiveness. But then he is disgusted for a second time, pines for the hard earth floor and broods on the saying of a wise peasant – Lenin’s words – that capitalism is a machine invented by capitalists to keep the working class in subjection. As the travellers enter Botin’s restaurant, the Mayor calls for suckling pig and a bottle of the Marques de Murrietta’s red wine, and Greene exposes his hypocrisy.
“I am surprised that you favour the aristocracy,” says Monsignor Quixote, referring to the wine.
The mayor splutters the conventional communist excuses, and the priest admits he only eats horse steaks at home.
Rudyard Kipling also sends the prodigal son back to poverty for a second time, in this case to escape his stifling family, and especially his sanctimonious elder brother.
The Prodigal Son
Here come I to my own again,
Fed, forgiven and known again,
Claimed by bone of my bone again
And cheered by flesh of my flesh.
The fatted calf is dressed for me,
But the husks have greater zest for me,
I think my pigs will be best for me,
So I’m off to the Yards afresh.
I never was very refined, you see,
(And it weighs on my brother’s mind, you see)
But there’s no reproach among swine, d’you see,
For being a bit of a swine.
So I’m off with wallet and staff to eat
The bread that is three parts chaff to wheat,
But glory be! – there’s a laugh to it,
Which isn’t the case when we dine.
My father glooms and advises me,
My brother sulks and despises me,
And Mother catechises me
Till I want to go out and swear.
And, in spite of the butler’s gravity,
I know that the servants have it I
Am a monster of moral depravity,
And I’m damned if I think it’s fair!
I wasted my substance, I know I did,
On riotous living, so I did,
But there’s nothing on record to show I did
Worse than my betters have done.
They talk of the money I spent out there –
They hint at the pace that I went out there –
But they all forget I was sent out there
Alone as a rich man’s son.
So I was a mark for plunder at once,
And lost my cash (can you wonder?) at once,
But I didn’t give up and knock under at once,
I worked in the Yards, for a spell,
Where I spent my nights and my days with hogs.
And shared their milk and maize with hogs,
Till, I guess, I have learned what pays with hogs
And – I have that knowledge to sell!
So back I go to my job again,
Not so easy to rob again,
Or quite so ready to sob again
On any neck that’s around.
I’m leaving, Pater. Good-bye to you!
God bless you, Mater! I’ll write to you!
I wouldn’t be impolite to you,
But, Brother, you are a hound!
Greene made a political point, but Kipling got to the heart of the story. It’s not sufficient to kill the fatted calf. We need to shed our smug piety at the sinner’s misfortune.
* Graham Greene. Monsignor Quixote. The Bodley Head, London. 1982. p 38.
The gardens and park at Lanhydrock (click here) are superb. But the man on the gate struggled to remember any famous associations with the house itself; Gladstone planted a tree and there were mutterings about Poldark, but that was it. So I skipped the house and, late in the day, discovered this.
A tiny stream flows through and the water looks clean, albeit dark and weedy. I wasn’t concerned by the sign, rather the reverse, but I had no swimmers, young people were around, and while I contemplated, it started to rain and my courage failed me. I’m kicking myself, but it’s there for another day.
Meeting Point by Louis MacNeice and Wincher’s Stance by John Clinch
In 1938 MacNeice, who was still on good terms with his ex-wife Mary, and whose affair with Nancy, the illustrator of I Crossed the Minch, was drawing to a close, met the writer and political activist Eleanor Clark on a US lecture tour. The following year he engineered a job at Cornell University to be with her.
I couldn’t resist juxtaposing the poem of their meeting, presumably at New York docks, with the late John Clinch’s sculpture, Wincher’s Stance, at Glasgow bus station. Both are accessible and deservedly popular.
Time was away and somewhere else,
There were two glasses and two chairs
And two people with the one pulse
(Somebody stopped the moving stairs)
Time was away and somewhere else.
And they were neither up nor down;
The stream’s music did not stop
Flowing through heather, limpid brown,
Although they sat in a coffee shop
And they were neither up nor down.
The bell was silent in the air
Holding its inverted poise –
Between the clang and clang a flower,
A brazen calyx of no noise:
The bell was silent in the air.
The camels crossed the miles of sand
That stretched around the cups and plates;
The desert was their own, they planned
To portion out the stars and dates:
The camels crossed the miles of sand.
Time was away and somewhere else.
The waiter did not come, the clock
Forgot them and the radio waltz
Came out like water from a rock:
Time was away and somewhere else.
Her fingers flicked away the ash
That bloomed again in tropic trees:
Not caring if the markets crash
When they had forests such as these,
Her fingers flicked away the ash.
God or whatever means the Good
Be praised that time can stop like this,
That what the heart has understood
Can verify in the body’s peace
God or whatever means the Good.
Time was away and she was here
And life no longer what it was,
The bell was silent in the air
And all the room one glow because
Time was away and she was here.
— Louis MacNeice
Chapel cliff natural swimming pool
The picturesque Cornish fishing village is no place for swimmers.
Only the brave would get in among the boats, and the small NE-facing beach just outside the harbour is in shade from midday onwards. But take the SW coast path a few hundred yards to Chapel cliff, where at low tide steps lead down to a natural sea-swimming pool.
Luke and I went at high tide. The pool was flooded, and there was a slight swell. Access directly off the rocks would have been tricky, but the steps made it easy. A great swimming spot.
The P4C trial security rating
Last week I criticised a trial which the authors claimed had shown that a programme of philosophy teaching (P4C) in primary schools improved pupils’ literacy and maths (click here). The organisation which ran the trial, the Education Endowment Foundation (EEF), a semi-independent, largely government-funded charity, has now defended their work (click here) without responding to any of the substantive issues, namely imbalance at baseline, negative primary results, >50% attrition on one primary endpoint, and inappropriate cherry-picking among data-driven secondary endpoints. Instead they defend a side issue, the lead researcher Stephen Gorard’s failure to report statistical significance. They also insist that the trial was evaluated independently using the EEF’s padlock rating scheme. With three padlocks out of five awarded by the EEF evaluators, that scheme is supposed to indicate that the results have a moderate degree of security.
The padlock scheme is described here. The criteria, as described by the EEF, are as follows.
1. Design: The quality of the design used to create a comparison group of pupils with which to determine an unbiased measure of the impact on attainment.
2. Power: The minimum detectable effect that the trial was powered to achieve at randomisation, which is heavily influenced by sample size.
3. Attrition: The level of overall drop-out from the evaluation treatment and control groups, which could potentially bias the findings.
4. Balance: The final amount of balance achieved at the baseline on observable characteristics in the primary analysis.
5. Threats to validity: How well-defined and consistently delivered the intervention was, and whether the findings could be explained by anything other than the intervention.
The final padlock rating is derived by rating design, power and attrition, taking the lowest, and adjusting it up or down according to the presence or absence of balance at baseline and other threats to validity. The final rating cannot be higher than the lowest rating for design or power.
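As I read the guide, the whole aggregation rule fits in a few lines. A minimal sketch (the function name, argument names and penalty inputs are my own, not the EEF’s):

```python
def padlock_rating(design, power, attrition, balance_penalty, validity_penalty):
    """All ratings are padlocks 0-5; penalties are padlocks to subtract."""
    # Interim rating: the lowest of the first three criteria.
    interim = min(design, power, attrition)
    # Adjust down for baseline imbalance and other threats to validity.
    final = interim - balance_penalty - validity_penalty
    # The final rating can never exceed the lower of design and power.
    final = min(final, design, power)
    return max(final, 0)

# The EEF evaluators' effective inputs for P4C: design 5, power 3,
# attrition 5, and no penalties.
print(padlock_rating(5, 3, 5, 0, 0))  # 3 padlocks
```

The argument of the rest of this post is simply that the two penalty inputs should not have been zero.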
The EEF evaluator’s rating is given in appendix 2 of the main P4C trial report available here. I’ve reproduced the relevant table here:
Their justification was as follows:
“This evaluation was designed as a randomised controlled trial. The sample size was designed to detect a MDES of less than 0.4, by design, reducing the security rating to 3. At the unit of randomisation (school), there was zero attrition, and extremely low attrition at the pupil level also. The post-tests were administered by the schools by teachers who were aware of the treatment allocation, but with invigilation from the independent evaluators. Balance at baseline was high, and there were no substantial threats to validity.”
Let’s review these judgments.
Design
A cluster randomised trial with 48 schools and 3,159 pupils. Five padlocks is correct.
Power
I don’t know how to judge this without significance testing. But it’s a pretty large trial, albeit one which will lose some power from the cluster design. The EEF evaluators rated it three padlocks, which seems reasonable.
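How much power the clustering eats can be sketched with the standard design-effect formula, DEFF = 1 + (m − 1) × ICC. The intra-cluster correlation below is my assumption for illustration; I have not seen one reported for this trial:

```python
# Back-of-envelope design effect for a cluster randomised trial.
pupils, schools = 3159, 48
m = pupils / schools      # average cluster size
icc = 0.15                # ASSUMED intra-cluster correlation for school attainment
deff = 1 + (m - 1) * icc  # design effect
effective_n = pupils / deff  # equivalent individually-randomised sample size
print(round(m), round(deff, 1), round(effective_n))  # → 66 10.7 295
```

On that assumption, a trial of thousands of pupils behaves more like one with a few hundred independent observations, which is why three padlocks for power seems fair.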
Attrition
The EEF evaluators judged attrition at the cluster level only, which is correct according to the EEF guide, because that was the level at which randomisation occurred. All randomised schools were followed up, so they rated 5 padlocks on this criterion.
The evaluators noted that pupil-level attrition was very low, which suggests they somehow missed the 52% attrition on the KS2 score. But unless they had reduced the attrition rating to less than three padlocks, this would not alter the final rating, because at this point the evaluator is supposed to allocate an interim padlock rating based on the lowest of the above three marks. The EEF guide reads:
“At this point the overall security rating for the evaluation will be determined by the minimum rating across the above three criteria. The minimum of first two criteria (planned design and power) determine the maximum security rating for the evaluation.”
So, however we interpret attrition, the overall security rating at this stage is three padlocks.
The EEF guide then says that this interim rating should be adjusted up or down depending on the final two criteria, balance and other threats to validity.
Balance
The evaluators judged that the groups were well balanced at baseline, on the basis of table 4 (baseline characteristics) in the report. But they ignored, or did not notice, the baseline imbalance of nearly 0.2 SD in KS1 reading and mathematics (total KS1 baseline scores are not reported) and of between 0.05 and 0.1 SD in CAT score. We can forgive the evaluators because these imbalances are not reported in table 4. They appear in the third columns of tables 5, 7 and 11 in the main report.
Since the Key Stage (KS) and Cognitive Ability Test (CAT) scores are the trial’s two primary outcomes, the evaluators made a mistake here, albeit a forgivable one. The EEF guide (Table 3 below) suggests that in the presence of a baseline difference of >0.1 on a key characteristic the evaluators should drop two padlocks.
Three minus two = one. The interim rating should now be one padlock.
Other threats to validity
The EEF guide lists six other potential threats, namely 1. Insufficient description of the intervention, 2. Diffusion (or ‘contamination’), 3. Compensation rivalry or resentful demoralisation, 4. Evaluator or developer bias, 5. Testing bias, and 6. Selection bias. It suggests that adjustment should be made as follows:
“If any of the above issues are identified as a cause for concern some judgement should be used in adjusting the security rating to account for any issues identified. The following are some suggested rules:
- If there is evidence of any one or two threats the rating should drop 1 padlock
- If there is evidence of more than two threats the rating should drop 2 padlocks”
The EEF evaluators did not detect any threats. But in my opinion there are two unambiguous ones: No 1, because it was not made clear what teaching the control pupils got during the P4C lessons, and No 6, because the outcomes on which the conclusions were based were change scores selected post hoc, and susceptible to regression to the mean. A critical reviewer might also argue that the selective choice of outcomes suggests evaluator bias (threat 4). But this seems a bit circular so I’m giving them the benefit of the doubt on that.
Even if the teaching that control pupils got was recorded somewhere else, the problem of selecting change scores post hoc, and their susceptibility to regression to the mean, is a definite threat to validity. So at best the final rating should drop by a further padlock. One minus one = zero. A final padlock rating of zero out of five.
According to the EEF zero padlocks mean the P4C trial “adds little or nothing to the evidence base”.
I’d be delighted to learn if I’ve made a mistake in the above. If not, the EEF may wish to look for new evaluators.
A misleading randomised trial
Last week’s press was full of the news that teaching philosophy to primary school kids helped with their maths and reading. The BBC led with “Philosophy sessions ‘boost primary school results’” (here), the Mirror with “Want to boost your child’s maths and reading skills? Then teach them about trust and kindness” (here), and specialists implied they’d known it all along: “It stands to reason that philosophy benefits learning”, Times Educational Supplement (here). Only the Guardian put even a coded doubt into the final two words of its headline, “Philosophical discussions boost pupils’ maths and literacy progress, study finds” (here).
Sceptics need to take this seriously because, unusually for an education intervention, the evidence comes from a randomised controlled trial (RCT), in this case a trial of “Philosophy for Children” (P4C), a programme of weekly lessons. An outfit called the Society for the Advancement of Philosophical Enquiry and Reflection in Education (SAPERE) charges £4,000 to train and support teachers to deliver P4C in an average school for a year, and claims to have signed up 600 schools in the UK alone. If they could persuade all 17,000 primary schools in the UK to join, they’d make a killing. So let’s take a critical look.
The trial report is here, or for those with access problems Philosophy_for_Children report. It has not been published in a peer-reviewed journal, although there are apparently plans to do so. It was peer reviewed by the organisation that commissioned the study, the Education Endowment Foundation, who judged the findings of an improvement in reading and maths to have a moderate degree of security. The RCT was not registered on a trials database, but the protocol was publicly available here, or for those with access problems Philosophy_for_Children protocol.
Whole schools were randomised, a risky “cluster” design if individual inclusion can be altered by knowledge of the allocation, but one well suited to education interventions; parents or teachers are unlikely to move children just because they are getting, or not getting, a weekly philosophy class, and all pupils do the outcome tests anyway. The 26 intervention schools got P4C implemented for 9- and 10-year-olds for somewhere between one and two years. The 22 control schools got “business as usual”; it’s not clear whether they gave an alternative lesson or sent the kids to the playground. Control schools got P4C at the end of the trial period, so any effects could only be measured up to that time point. 3,159 children were included, and the groups were well balanced at baseline.
The protocol listed two primary outcomes, namely the overall Key Stage 2 (KS2) and overall Cognitive Abilities Test (CAT4) scores at the end of the trial, both reported as means and standard deviations (SD). A high score is good. There were seven planned secondary outcomes including the three components of KS2 (reading, writing and maths), and the four components of the CAT (verbal, non-verbal, quantitative and spatial ability). The plan was to also do subgroup analyses by year group and by whether the pupil was eligible for free school meals or not.
The results are in tables 5 to 19 of the main report. The CAT scores were reported for 2,821 (89%) of the enrolled pupils, but KS2 scores were only available for 1,529 (48%). For some reason the overall KS2, one of the primary outcomes, was not reported at all! The seven secondary subscale outcomes are all reported, although not the subgroup analysis by year group. The subgroup by eligibility for free school meals was only reported for the eligible pupils.
There aren’t actually any differences between the groups in the scores reported. Nearly all slightly favour the control group, but the differences are tiny fractions of a standard deviation. No significance tests are reported, but I guess if they had been done, and corrected for multiple testing, they would all have been non-significant (i.e. P>0.05). A negative trial.
But then the researchers got to work. They noticed that by chance the scores were slightly worse in the treatment group at study entry, so they decided to compare the change in score rather than the absolute scores. This manoeuvre was pre-specified in the protocol for the CAT score, but not for KS2. The authors openly admit that it was data driven.
“By the end the treatment group had narrowed this gap in all three subjects, especially for KS2 scores in reading and maths. For this reason, the key stage results are all presented as gain scores representing progress from KS1 to KS2.”
Unsurprisingly (because random variation tends to regress to the mean*) the results now favoured the intervention group. But it’s still a tiny effect.
Table 5 KS1-2 Reading. The difference in change scores = 0.11 (SD 1.0)
Table 6 KS1-2 Writing. The difference in change scores = 0.03 (SD 1.0)
Table 7 KS1-2 Maths. The difference in change scores = 0.08 (SD 1.0)
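The footnoted point about regression to the mean is easy to demonstrate with a toy simulation (my numbers, nothing to do with the trial’s data). Give every pupil a true ability plus fresh measurement noise at each sitting, apply no treatment whatsoever, and the pupils who happened to score below average at baseline will “gain” on retest while the above-average ones “decline”:

```python
import random
import statistics

random.seed(42)

# NO treatment effect at all: observed score = ability + fresh noise each time.
abilities = [random.gauss(100, 10) for _ in range(2_000)]
baseline = [a + random.gauss(0, 10) for a in abilities]
followup = [a + random.gauss(0, 10) for a in abilities]
gains = [f - b for b, f in zip(baseline, followup)]

# Split pupils by whether they happened to score below average at baseline.
low_start = [g for b, g in zip(baseline, gains) if b < 100]
high_start = [g for b, g in zip(baseline, gains) if b >= 100]

print(statistics.mean(low_start))   # positive: low starters "improve"
print(statistics.mean(high_start))  # negative: high starters "decline"
```

A group that starts lower by chance is therefore guaranteed, on average, a flattering change score. Pre-specifying gain scores is defensible; choosing them after seeing the baseline gap is not.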
Still no tests of statistical significance (the lead author, Stephen Gorard, has some sort of principled objection to them), but their absence does not stop him concluding: “The results in tables 5 and 7 are unlikely to be due to chance.” On this basis the report’s first Key Conclusion, the primary finding of the trial, states:
“There is evidence that P4C had a positive impact on Key Stage 2 attainment. Overall, pupils using the approach made approximately two additional months’ progress on reading and maths.”
This sentence, plastered all over the Education Endowment Foundation website and press releases, led predictably to the breathless headlines.
But it’s wrong. The triallists pre-specified two primary outcomes but only reported one, which showed no difference. They pre-specified seven secondary outcomes, which showed no differences either. However, when they altered their analysis plan after seeing the data, they noticed that two of the secondary outcomes showed a tiny shift in mean change scores favouring the intervention. The effect size was about 10% of a standard deviation, and less than half the participants had the relevant scores measured, but who cares! Without any tests of statistical significance they declared that it was unlikely to have occurred by chance!
In an email to me Stephen Gorard wrote that he had no axe to grind. His research group Research Design & Evaluation (click here) had nothing to do with SAPERE or P4C; they had just been commissioned to evaluate the programme. He likened RD&E to a “taxi for hire”. Indeed so. Taxis get you where you want to go. RD&E gets you the results you want.
* Matthew Inglis @mjinglis (click here) makes the same point more elegantly with this graph of the three scores before and after the intervention.