The P4C trial security rating
Last week I criticised a trial which the authors claimed had shown that a programme of philosophy teaching (P4C) in primary schools improved pupils’ literacy and maths (click here). The organisation which ran the trial, the Education Endowment Foundation (EEF), a semi-independent, largely government-funded charity, has now defended its work (click here) without responding to any of the substantive issues, namely imbalance at baseline, negative primary results, >50% attrition on one primary endpoint, and inappropriate cherry-picking among data-driven secondary endpoints. Instead they defend a side issue, the lead researcher Stephen Gorard’s failure to report statistical significance. They also insist that the trial was evaluated independently using the EEF’s padlock rating scheme. With three padlocks out of five awarded by the EEF evaluators, that scheme is supposed to indicate that the results have a moderate degree of security.
The padlock scheme is described here. The criteria, as described by the EEF, are as follows.
1. Design: The quality of the design used to create a comparison group of pupils with which to determine an unbiased measure of the impact on attainment.
2. Power: The minimum detectable effect that the trial was powered to achieve at randomisation, which is heavily influenced by sample size.
3. Attrition: The level of overall drop-out from the evaluation treatment and control groups, which could potentially bias the findings.
4. Balance: The final amount of balance achieved at the baseline on observable characteristics in the primary analysis.
5. Threats to validity: How well-defined and consistently delivered the intervention was, and whether the findings could be explained by anything other than the intervention.
The final padlock rating is derived by rating design, power and attrition, taking the lowest, and adjusting it up or down according to the presence or absence of balance at baseline and other threats to validity. The final rating cannot be higher than the lowest rating for design or power.
The EEF evaluator’s rating is given in appendix 2 of the main P4C trial report available here. I’ve reproduced the relevant table here:
Their justification was as follows:
“This evaluation was designed as a randomised controlled trial. The sample size was designed to detect a MDES of less than 0.4, by design, reducing the security rating to 3. At the unit of randomisation (school), there was zero attrition, and extremely low attrition at the pupil level also. The post-tests were administered by the schools by teachers who were aware of the treatment allocation, but with invigilation from the independent evaluators. Balance at baseline was high, and there were no substantial threats to validity.”
Let’s review these judgments.
Design
This was a cluster randomised trial with 48 schools and 3,159 pupils. Five padlocks is correct.
Power
I don’t know how to judge this without significance testing. But it’s a pretty large trial, albeit one which will lose some power from the cluster design. The EEF evaluators rated it three padlocks, which seems reasonable.
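To give a feel for how much power a cluster design can cost, here is a rough design-effect calculation. The intra-cluster correlation of 0.15 is my illustrative assumption for attainment outcomes, not a figure from the report:

```python
# Rough design effect for a cluster randomised trial.
# ICC of 0.15 is an assumed, illustrative intra-cluster correlation,
# not a figure taken from the P4C report.
n_pupils = 3159
n_schools = 48
avg_cluster = n_pupils / n_schools            # ~66 pupils per school
icc = 0.15                                    # assumed
design_effect = 1 + (avg_cluster - 1) * icc
effective_n = n_pupils / design_effect
print(round(design_effect, 1), round(effective_n))  # 10.7 295
```

Under that assumption a trial of 3,159 pupils behaves, for power purposes, more like one of about 300 individually randomised pupils, which is why cluster trials need so many schools.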
Attrition
The EEF evaluators judged attrition at the cluster level only, which is correct according to the EEF guide, because that was the level at which randomisation occurred. All randomised schools were followed up, so they rated 5 padlocks on this criterion.
The evaluators noted that pupil level attrition was very low, which suggests they somehow missed the 52% attrition on KS2 score. But unless they had reduced the attrition rating to less than three padlocks this would not alter the final rating. The reason is that at this point the evaluator is supposed to allocate an interim padlock rating based on the lowest of the above three marks. The EEF guide reads:
“At this point the overall security rating for the evaluation will be determined by the minimum rating across the above three criteria. The minimum of first two criteria (planned design and power) determine the maximum security rating for the evaluation.”
So, however we interpret attrition, the overall security rating at this stage is three padlocks.
The EEF guide then says that this interim rating should be adjusted up or down depending on the final two criteria, balance and other threats to validity.
Balance
The evaluators judged that the groups were well balanced at baseline, on the basis of table 4 (baseline characteristics) in the report. But they ignored, or did not notice, the baseline imbalance of nearly 0.2 SD in KS1 reading and mathematics (total KS1 baseline scores are not reported), and of between 0.05 and 0.1 SD in CAT score. We can forgive the evaluators, because these imbalances are not reported in table 4. They appear in the third columns of tables 5, 7 and 11 in the main report.
Since the Key Stage (KS) and Cognitive Ability Test (CAT) scores are the trial’s two primary outcomes, the evaluators made a mistake here, albeit a forgivable one. The EEF guide (Table 3 below) suggests that in the presence of a baseline difference of >0.1 on a key characteristic the evaluators should drop two padlocks.
Three minus two = one. The interim rating should now be one padlock.
Other threats to validity
The EEF guide lists six other potential threats, namely 1. Insufficient description of the intervention, 2. Diffusion (or ‘contamination’), 3. Compensation rivalry or resentful demoralisation, 4. Evaluator or developer bias, 5. Testing bias, and 6. Selection bias. It suggests that adjustment should be made as follows:
“If any of the above issues are identified as a cause for concern some judgement should be used in adjusting the security rating to account for any issues identified. The following are some suggested rules:
- If there is evidence of any one or two threats the rating should drop 1 padlock
- If there is evidence of more than two threats the rating should drop 2 padlocks”
The EEF evaluators did not detect any threats. But in my opinion there are two unambiguous ones: No. 1, because it was not made clear what teaching the control pupils got during the P4C lessons, and No. 6, because the outcomes on which the conclusions were based were change scores selected post hoc, and susceptible to regression to the mean. A critical reviewer might also argue that the selective choice of outcomes suggests evaluator bias (threat 4). But this seems a bit circular, so I’m giving them the benefit of the doubt on that.
Even if the teaching that control pupils got was recorded somewhere else, the problem of selecting change scores post hoc, and their susceptibility to regression to the mean, is a definite threat to validity. So at best the final rating should drop by a further padlock. One minus one = zero. A final padlock rating of zero out of five.
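The whole scoring logic can be sketched as a small function (my own hypothetical code, not the EEF’s; the deduction rules follow the guide as quoted above):

```python
def padlock_rating(design, power, attrition, balance_drop, n_threats):
    """Sketch of the EEF padlock logic as described in the guide.

    design, power, attrition: 0-5 padlock ratings for the first three criteria.
    balance_drop: padlocks deducted for baseline imbalance (e.g. 2 if >0.1
                  on a key characteristic).
    n_threats: number of other threats to validity identified
               (1-2 threats drop 1 padlock, >2 drop 2).
    """
    interim = min(design, power, attrition)
    threat_drop = 0 if n_threats == 0 else (1 if n_threats <= 2 else 2)
    final = interim - balance_drop - threat_drop
    # The final rating cannot exceed the lower of design and power,
    # and cannot fall below zero.
    return max(0, min(final, design, power))

# The EEF evaluators' verdict: design 5, power 3, attrition 5, no deductions.
print(padlock_rating(5, 3, 5, balance_drop=0, n_threats=0))  # 3
# My re-assessment: drop 2 for baseline imbalance, 1 for two threats.
print(padlock_rating(5, 3, 5, balance_drop=2, n_threats=2))  # 0
```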
According to the EEF zero padlocks mean the P4C trial “adds little or nothing to the evidence base”.
I’d be delighted to learn if I’ve made a mistake in the above. If not the EEF may wish to look for new evaluators.
A misleading randomised trial
Last week’s press was full of the news that teaching philosophy to primary school kids helped with their maths and reading. The BBC led with “Philosophy sessions ‘boost primary school results’” (here). The Mirror: “Want to boost your child’s maths and reading skills? Then teach them about trust and kindness” (here). And specialists implied they’d known it all along: “It stands to reason that philosophy benefits learning”, Times Education Supplement (here). Only the Guardian put even a coded doubt into the final two words of its headline, “Philosophical discussions boost pupils’ maths and literacy progress, study finds” (here).
Sceptics need to take this seriously because, unusually for an education intervention, the evidence comes from a randomised controlled trial (RCT): in this case a trial of “Philosophy for Children” (P4C), a programme of weekly lessons. An outfit called the Society for the Advancement of Philosophical Enquiry and Reflection in Education (SAPERE) charges £4,000 to train and support teachers to deliver P4C in an average school for a year, and claims to have signed up 600 schools in the UK alone. If they could persuade all 17,000 primary schools in the UK to join, they’d make a killing. So let’s take a critical look.
The trial report is here, or for those with access problems Philosophy_for_Children report. It has not been published in a peer-reviewed journal, although there are apparently plans to do so. It was peer reviewed by the organisation that commissioned the study, the Education Endowment Foundation, which judged the findings of an improvement in reading and maths to have a moderate degree of security. The RCT was not registered on a trials database, but the protocol was publicly available here, or for those with access problems Philosophy_for_Children protocol.
Whole schools were randomised, a risky “cluster” design if individual inclusion can be altered by knowledge of the allocation, but one well suited to education interventions; parents or teachers are unlikely to move children just because they are getting, or not getting, a weekly philosophy class, and all pupils do the outcome tests anyway. The 26 intervention schools got P4C implemented for 9- and 10-year-olds for somewhere between one and two years. The 22 control schools got “business as usual”; it’s not clear whether they gave an alternative lesson or sent the kids to the playground. Control schools got P4C at the end of the trial period, so any effects could only be measured up to that time point. 3,159 children were included, and the groups were well balanced at baseline.
The protocol listed two primary outcomes, namely the overall Key Stage 2 (KS2) and overall Cognitive Abilities Test (CAT4) scores at the end of the trial, both reported as means and standard deviations (SD). A high score is good. There were seven planned secondary outcomes including the three components of KS2 (reading, writing and maths), and the four components of the CAT (verbal, non-verbal, quantitative and spatial ability). The plan was to also do subgroup analyses by year group and by whether the pupil was eligible for free school meals or not.
The results are in tables 5 to 19 of the main report. The CAT scores were reported for 2,821 (89%) of the enrolled pupils, but KS2 scores were only available for 1,529 (48%). For some reason the overall KS2, one of the primary outcomes, was not reported at all! The seven secondary subscale outcomes are all reported, although not the subgroup analysis by year group. The subgroup by eligibility for free school meals was only reported for the eligible pupils.
There aren’t actually any differences between the groups in the scores reported. Nearly all slightly favour the control group, but the differences are tiny fractions of a standard deviation. No significance tests are reported, but I guess if they had been done, and corrected for multiple testing, they would all have been non significant (i.e. P>0.05). A negative trial.
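To put a number on the multiple-testing point: with two primary and seven secondary outcomes, a simple Bonferroni correction would require each nominal p-value to clear a much stricter threshold. This is illustrative arithmetic only, since no p-values are reported in the trial:

```python
# Bonferroni correction: with nine pre-specified outcomes, each test must
# clear alpha/9 to keep the family-wise error rate at 0.05.
n_tests = 9          # two primary + seven secondary outcomes
alpha = 0.05
threshold = alpha / n_tests
print(round(threshold, 4))  # 0.0056
# Even a nominally "significant" p of 0.03 (hypothetical) would not survive:
print(0.03 < threshold)  # False
```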
But then the researchers got to work. They noticed that by chance the scores were slightly worse in the treatment group at study entry, so they decided to compare the change in score rather than the absolute scores. This manoeuvre was pre-specified in the protocol for the CAT score, but not for KS2. The authors openly admit that it was data driven.
“By the end the treatment group had narrowed this gap in all three subjects, especially for KS2 scores in reading and maths. For this reason, the key stage results are all presented as gain scores representing progress from KS1 to KS2.”
Unsurprisingly (because random variation tends to regress to the mean*) the results now favoured the intervention group. But it’s still a tiny effect.
Table 5 KS1-2 Reading. The difference in change scores = 0.11 (SD 1.0)
Table 6 KS1-2 Writing. The difference in change scores = 0.03 (SD 1.0)
Table 7 KS1-2 Maths. The difference in change scores = 0.08 (SD 1.0)
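Regression to the mean is easy to demonstrate by simulation: give two noisy tests to pupils whose true ability never changes, and the ones who happened to score low the first time will “gain” on average. This is a toy simulation of mine, not the trial’s data:

```python
import random

random.seed(1)

# Two groups of scores for the same pupils: true ability plus independent
# measurement noise at each sitting. Nothing changes between the two tests.
n = 1500
ability = [random.gauss(0, 1) for _ in range(n)]
baseline = [a + random.gauss(0, 0.5) for a in ability]
followup = [a + random.gauss(0, 0.5) for a in ability]

# Select the pupils who scored in the bottom half at baseline...
low = sorted(range(n), key=lambda i: baseline[i])[: n // 2]
mean_gain_low = sum(followup[i] - baseline[i] for i in low) / len(low)
mean_gain_all = sum(followup[i] - baseline[i] for i in range(n)) / n

# ...their mean "gain" is clearly positive, while the overall mean gain
# is near zero, even though no pupil's true ability changed at all.
print(round(mean_gain_low, 2), round(mean_gain_all, 2))
```

A group that starts behind by chance will tend to close part of the gap by chance, which is exactly the pattern the gain scores show.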
Still no tests of statistical significance (the lead author, Stephen Gorard, has some sort of principled objection to them) but their lack does not stop him concluding: “The results in tables 5 and 7 are unlikely to be due to chance.” On this basis the report’s first Key Conclusion, the primary finding of the trial, states:
“There is evidence that P4C had a positive impact on Key Stage 2 attainment. Overall, pupils using the approach made approximately two additional months’ progress on reading and maths.”
This sentence, plastered all over the Education Endowment Foundation website and press releases, led predictably to the breathless headlines.
But it’s wrong. The triallists pre-specified two primary outcomes but only reported one, which showed no difference. They pre-specified seven secondary outcomes which showed no differences either. However when they altered their analysis plan after seeing the data they noticed that two of the secondary outcomes showed a tiny shift in mean change scores favouring the intervention. The effect size was about 10% of a standard deviation, and less than half the participants had the relevant scores measured, but who cares! Without any tests of statistical significance they declared that it was unlikely to have occurred by chance!
In an email to me Stephen Gorard wrote that he had no axe to grind. His research group Research Design & Evaluation (click here) had nothing to do with SAPERE or P4C; they had just been commissioned to evaluate the programme. He likened RD&E to a “taxi for hire”. Indeed so. Taxis get you where you want to go. RD&E gets you the results you want.
* Matthew Inglis @mjinglis (click here) makes the same point more elegantly with this graph of the three scores before and after the intervention.
River Thames & Oxford Canal
Iris Murdoch lived in or near Oxford for over 50 years, and almost all her novels involve some sort of ordeal by water, albeit none in the upper reaches of the Thames. She knew the river and canal well.
On November 3rd 1952, at the height of their brief love affair – he died 16 days later – she walked in Port Meadow with the anthropologist Franz Steiner (more about him here). In June 1953 she moved to a basement flat at 24 Southmoor Road, the house owned by the controversial radiation and cancer epidemiologist Alice Stewart (more about her here), and acquired a canoe; she surely paddled this route. Later still she and her husband John Bayley skinny-dipped near the A34 bypass bridge.
The free car park, between Wolvercote village and Godstow has access to the eastern channel below the weirs. Paddle past the Trout Inn channel to turn up the main stream and enter Godstow lock.
0 miles – Port Meadow car park
0.2 miles – Godstow lock
0.25 miles – Godstow bridge. Scarred by motorboat collisions! Abbey left.
0.4 miles – A34 bypass bridge
Elegy for Iris (p. 1), by John Bayley:
“For years now we’ve usually managed a treat for ourselves on really hot days at home in the summer. We take the car along the bypass road from Oxford, for a mile or two, and twist abruptly off onto the verge – quite a tricky feat with fast-moving traffic just behind. Sometimes there are hoots and shouts from passing cars which have to brake at speed, but by that time we have jolted to a stop on tussocky grass, locked the car, and crept through a gap in the hedge.
I remember the first time we did it, nearly forty-five years ago. We were on bicycles then and there was little traffic on the unimproved road. Nor did we know where the river was exactly; we just thought it must be somewhere there. And with the ardour of comparative youth we wormed our way through the rank grass and sedge until we almost fell into it. Crouching in the shelter of the reeds, we tore our clothes off and slipped in like water rats. A kingfisher flashed past our noses as we lay soundlessly in the dark, sluggish current. A moment after we had crawled out and were drying ourselves on Iris’s half slip a big pleasure boat chugged past within a few feet of the bank. The steersman, wearing a white cap, gazed intently ahead. Tobacco smoke mingled with the watery smell at the foot of the tall reeds.”
1 mile – Kings lock
Just upstream of the lock, “Duke’s cut” leaves right.
1.5 miles – junction with the eastern Thames stream. Keep left. Follow the signpost.
1.7 miles – A40 bridge followed by railway bridge over Duke’s cut lock
1.8 miles – towpath bridge followed by junction with the Oxford canal. Turn right.
1.9 miles – A40 bridge followed by the first of three lifting footbridges
2 miles – A34 bypass bridge.
2.4 miles – lifting footbridge 234. This one had a minstrel on it.
2.5 miles – Wolvercote bridge and lock. Followed immediately by the adjacent Wolvercote footbridge
2.75 miles – Wolvercote Green Field bridge. The Plough left.
3.25 miles – Wolvercote railway bridge
3.5 miles – St Edward’s lifting footbridge. No 238.
4 miles – Elisabeth Jennings Way bridge (No. 238B)
4.2 miles – Frenchay Road bridge
4.4 miles – Aristotle bridge
After a few hundred yards the houses and back gardens of Southmoor Road line the left bank.
No. 48 is about 500 yards after Aristotle bridge as the canal kinks very slightly left, marked by two large weeping willows. We found two old Oxford ladies launching their canoe from an adjacent garden. They didn’t know that Murdoch had once lived nearby.
4.75 miles – Walton Well Road bridge
5 miles – Mount Place footbridge
5.5 miles – Footbridge over Isis lock. Land right and portage over the bank into the Castle Mill stream. Turn right in the pool below the lock to enter the Sheepwash channel, which immediately passes under Rewley Road bridge and a railway bridge, before joining the Thames.
5.75 miles – Footbridge and junction of the Sheepwash channel with the Thames. Turn right upstream. Mark the channel entrance if paddling this trip in reverse.
The riverside path passes over various other side channels
6.5 miles – Medley footbridge. Port Meadow right.
8 miles – At the upper end of Port Meadow take the right channel back to the car park. Or better still the middle one past a boat house, and under a wooden footbridge to the Trout Inn.
Iris loved pubs and must have visited this one. It’s famous now as one of Inspector Morse’s drinking holes. Finally, don’t miss Wolvercote community orchard (click here).
Rivers Lugg & Teme
Park at the B4362 bridge at Mortimer’s Cross layby. Cross the bridge and take the footpath over a stile, across two fields and a low hill into which the river cuts. About half a mile downstream the path descends into a water-meadow, and the river meanders between shingle banks. There are few deep pools – it’s impossible to get out of your depth – but it’s private. We met no-one on a sunny Saturday in June. Grid reference SO428634
The stretch of Lugg upstream of the A4110 bridge (around SO420656) is also allegedly good, but the river in front of the Riverside Inn didn’t look enticing and I couldn’t get access from the north bank.
Just upstream of Leintwardine bridge at the junction of the Teme and Clun is a well-known spot (click here), but I didn’t know about that. The pool below the weir and road bridge looked enticing, but public.
Instead I followed the minor road downstream past “The Sun”, a traditional pub. Immediately after passing a small factory right, a footpath runs to the river. The deep pools in the meanders, easily accessible from shingle banks, are very private. Grid reference SO478736
This stretch is labelled on the Ordnance Survey as Leintwardine Fishery, but we saw no fishermen in June. Downstream of Ludlow there is said to be a good spot at SO532687, accessible from the footpath from the A456, but not personally tested.
Britain’s most popular 20th century poet is long overdue his place in Poets’ Corner. The nincompoops who fret that he liked a bit of porn and wrote some letters that don’t pass muster with today’s thought police, have been pushed aside by admirers of Wedding Wind, Church Going and The Whitsun Weddings. Not to mention the millions who know the first line of This be The Verse.
The unveiling will be on Dec 2 2016. Surely there won’t be a service; not for the man who called religion “that vast moth-eaten musical brocade/Created to pretend we never die”. But which poem will they read? Not one of the “Weddings”. Not High Windows; it’s a little soon for the “F word” in Westminster Abbey. Perhaps Church Going. But I’d vote for the one that ends with Larkin’s most puzzling line; the one read out at funerals, and quoted by Anthony Lane after the Twin Towers fell on 9/11. The line that was almost meant and almost true.
An Arundel Tomb
Side by side, their faces blurred,
The earl and countess lie in stone,
Their proper habits vaguely shown
As jointed armour, stiffened pleat,
And that faint hint of the absurd–
The little dogs under their feet.
Such plainness of the pre-baroque
Hardly involves the eye, until
It meets his left-hand gauntlet, still
Clasped empty in the other; and
One sees, with a sharp tender shock,
His hand withdrawn, holding her hand.
They would not think to lie so long.
Such faithfulness in effigy
Was just a detail friends could see:
A sculptor’s sweet commissioned grace
Thrown off in helping to prolong
The Latin names around the base.
They would not guess how early in
Their supine stationary voyage
The air would change to soundless damage,
Turn the old tenantry away;
How soon succeeding eyes begin
To look, not read. Rigidly they
Persisted, linked, through lengths and breadths
Of time. Snow fell, undated. Light
Each summer thronged the grass. A bright
Litter of birdcalls strewed the same
Bone-riddled ground. And up the paths
The endless altered people came,
Washing at their identity.
Now, helpless in the hollow of
An unarmorial age, a trough
Of smoke in slow suspended skeins
Above their scrap of history,
Only an attitude remains:
Time has transfigured them into
Untruth. The stone finality
They hardly meant has come to be
Their final blazon, and to prove
Our almost-instinct almost true:
What will survive of us is love.
The third recent negative self-hypnosis in labour trial
Self-hypnosis is a popular method of pain relief in labour; it sounds like a good idea: it’s cheap, could probably be taught to many women, and is unlikely to have serious adverse side effects. But until recently there were only poor-quality trials. Now suddenly there is a glut of good ones.
In 2014 we commented on two, a Danish trial (click here) and the Australian HATCH trial (click here). Both were prospectively registered with a predefined primary endpoint (epidural in the Danish trial, epidural or opiates in HATCH), hit their predetermined sample size, and analysed everyone in an unbiased way, by intention to treat. Both were negative.
Now my friend Professor Soo Downe from Preston in Lancashire has reported on a third one, the Self Hypnosis in Pregnancy (SHIP) trial. Click here for the full report.
Again it was beautifully designed and conducted. It was prospectively registered here. (The link states it was retrospective, but this seems to be a fault with the recently updated website; SHIP was registered well before any codes were broken or analysis was done.) The primary outcome was epidural use, and the planned sample size 300 per group. 680 women were eventually randomised (343 to self-hypnosis and 337 to control) and all were followed up. Epidural use was 94/343 (28%) in the self-hypnosis group v. 101/337 (30%) in controls, odds ratio (OR) 0.89, 95% confidence interval (CI) 0.64–1.24, i.e. self-hypnosis does not work.
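The headline result can be checked against the raw counts. Below is a crude odds ratio with a standard log-odds 95% confidence interval; the published figure may be covariate-adjusted, so a small discrepancy from the reported OR of 0.89 is to be expected:

```python
import math

# Crude odds ratio for epidural use, self-hypnosis v. control.
a, b = 94, 343 - 94    # self-hypnosis: epidural yes / no
c, d = 101, 337 - 101  # control:       epidural yes / no

or_ = (a * d) / (b * c)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)       # SE of log(OR)
lo = math.exp(math.log(or_) - 1.96 * se)
hi = math.exp(math.log(or_) + 1.96 * se)
print(f"OR {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # OR 0.88, 95% CI 0.63-1.23
```

Either way the confidence interval comfortably includes 1, which is what makes this a negative result.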
The authors also measured 29 allegedly predetermined secondary outcomes (only 10 were listed on the trial registration site), of which 27 were not statistically significantly different. For some reason they were placed in a supplementary appendix; come on you BJOG editors, get your act together! Some slightly favoured self-hypnosis, e.g. breast feeding 44% v. 39% (OR 1.23, 95% CI 0.82–1.86) or prolonged neonatal admission 6.2% v. 6.6% (OR 0.94, 95% CI 0.50–1.74). Others slightly favoured controls, e.g. Caesarean deliveries 25% v. 23% (OR 1.11, 95% CI 0.78–1.58), and three of the four stillbirths were in the self-hypnosis group. But none were statistically significant. Nor were there any significant differences in anxiety, depression or “impact of events” scores at 2 and 6 weeks postnatally. This is a clearly negative trial.
However, the authors (or BJOG) provided a tweetable abstract:
“Going to 2 prenatal self-hypnosis groups didn’t reduce labour epidural use but did reduce birth fear & anxiety postnatally at < £5 per woman”.
This is misleading. Self-hypnosis did not reduce fear and anxiety postnatally. It may have made a difference to the change in anxiety level between before and after labour, and to the change in fear of birth between the two time points, but these are very odd trial outcome measures; you can’t be anxious or fearful about birth after it has occurred. More importantly, the change measures were not pre-specified among the secondary outcomes; they depend on a low postnatal response rate, which was higher in the intervention group than in the control group; and the baseline scores for both measures were non-significantly higher in the hypnosis group, so some of the change is likely to be due to regression to the mean, i.e. nothing to do with the treatment. At best the change scores are hypothesis-generating for future studies.
Here’s a better tweetable abstract:
SHIP is the 3rd well-designed RCT to show that self-hypnosis is ineffective for pain relief in labour. But it is cheap & harmless.
At BB Sophia, a private maternity hospital
When BB Sophia, Stockholm’s second private maternity hospital, opened last year I applauded the increase in diversity, and anticipated competition driving up standards (click here). However, last week the Swedish TV channel 4 programme, Cold Facts, aired some serious allegations about safety there*.
The trouble stems from a birth on 24 August 2014. Gegie Boden had a difficult delivery, complicated by shoulder dystocia, and collapsed shortly afterwards. Recognition of her collapse was allegedly delayed, perhaps because staff were more concerned about the baby. She was soon transferred to the nearby state-run Karolinska hospital, where she died a week or so later. Doctors interviewed on the programme, none of whom apparently worked at BB Sophia, alleged that intensive care facilities were substandard, and that the rules for the levels of intensive care required in private maternity units had been made less stringent to allow the clinic to open.
BB Sophia’s owners reject these claims and state they not only had adequate facilities for short-term intensive care, but also a formal agreement with the Karolinska to transfer patients needing longer-term care. The Inspektionen för vård och omsorg (IVO), the Inspectorate for Health and Social Care, is investigating, but has not yet reported.
About five mothers die out of the 100,000 or so who give birth in Sweden each year, one of the lowest rates in the world, so one death in a unit delivering 4,000 babies annually, while tragic, is not in itself evidence of poor care. But a TV programme about it, quoting doctors publicly alleging substandard care, and aired before the official report is complete, suggests that private hospitals are under closer scrutiny than their government counterparts.
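The arithmetic behind that point is worth spelling out. Assuming the national rate applies, a unit of BB Sophia’s size would expect a maternal death roughly once every five years, and would see at least one in any given year nearly a fifth of the time, purely by chance:

```python
import math

# With ~5 maternal deaths per 100,000 births nationally, a unit delivering
# 4,000 babies a year expects about 0.2 deaths annually. Under a Poisson
# model, the chance of at least one death in a year of average care:
rate_per_birth = 5 / 100_000
births = 4_000
expected = rate_per_birth * births
p_at_least_one = 1 - math.exp(-expected)
print(round(expected, 2), round(p_at_least_one, 2))  # 0.2 0.18
```

So a single death tells us almost nothing about the unit’s standard of care, which is why the official investigation matters more than the television coverage.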
This may be a good thing. Unlike government hospitals, which are often “too big to fail”, private ones cannot afford to ignore public safety concerns. I hope I’m not naive, but I remain optimistic that independent health care providers will drive up standards in the long run.