“Success for All” literacy programmes
The Education Endowment Foundation trial results were negative
Is the Foundation spinning it as positive?
Success for All is a commercial education programme (click here) which claims to improve literacy in primary schools. Teachers are trained to deliver effective phonics teaching and given structured lesson plans. Heads and managers get help with ability grouping and with encouraging parental involvement.
Sounds good, but does it really make a worthwhile difference? The Education Endowment Foundation (EEF) has just published a randomised trial testing the programme’s effectiveness.
Ripe-tomato.org loves randomised trials of educational innovations, but criticised an earlier Foundation trial (click here) for playing down its predefined primary and secondary endpoints, which had shown no benefit, and for claiming that the treatment had worked on the basis of new, possibly data-driven, endpoints. Let’s look at the Success for All trial (click here for the full report or Success_for_All_Evaluation_Report).
It was a cluster randomised trial comparing 27 intervention schools (874 pupils) with 27 controls (893 pupils). That’s 54 schools and 1767 pupils in total. Randomisation details are not given in the main report but, according to the revised protocol, schools were “allocated in pairs based on a ranking of Key Stage 2 results”. The planned sample size (50 schools, 1250 pupils) was exceeded, and the trial had 80% power at conventional levels of statistical significance (5%) to detect a difference of 0.2 of a standard deviation in mean test scores. This would convert to about three months’ attainment difference. Educationalists generally label such an effect size as at the border between “small” and “medium”. The researchers presumably judged it as the minimum worthwhile difference for a relatively expensive intervention like this. That seems reasonable, although I confess I’m not qualified to judge. The intervention took place in two waves, June 2013–14 and 2014–15, with results collected during the intervention (end of reception class) and at project completion (end of year one class), so the final results were in by August 2016.
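For readers who like to check the arithmetic, here is a rough sketch of how a power calculation of this kind works for a cluster randomised trial. The cluster size and intra-class correlation below are my own illustrative assumptions, not figures taken from the protocol, so this is a back-of-envelope check rather than a reproduction of the evaluators’ calculation.

```python
# Back-of-envelope power check for a cluster randomised trial.
# The ICC below is an assumption for illustration only; it is not
# taken from the Success for All protocol.
from statsmodels.stats.power import TTestIndPower

schools_per_arm = 25      # planned: 50 schools in total
pupils_per_school = 25    # planned: 1250 pupils / 50 schools
icc = 0.025               # assumed intra-class correlation (hypothetical)

# The design effect inflates the variance of the between-arm comparison
# when pupils are clustered within schools.
design_effect = 1 + (pupils_per_school - 1) * icc
effective_n_per_arm = schools_per_arm * pupils_per_school / design_effect

power = TTestIndPower().power(effect_size=0.2,
                              nobs1=effective_n_per_arm,
                              ratio=1.0,
                              alpha=0.05)
print(f"Design effect {design_effect:.2f}, "
      f"effective n per arm {effective_n_per_arm:.0f}, power {power:.2f}")
# With these assumptions the power comes out at roughly 0.8
# for a difference of 0.2 SD.
```

The point of the sketch is simply that the clustering assumption drives everything: a higher intra-class correlation would push the power well below 80% for the same number of schools.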
The trial was retrospectively registered (click here) in June 2016, and the two versions of the protocol on the EEF website (click here) are dated February and August 2016. Nevertheless, the analysis didn’t happen until November 2016, so this late registration may not matter.
The primary outcome on the registry and in both protocols was the same: the six components of the Woodcock Reading Mastery Test score (high is good), three measured at the end of reception and three at the end of year one. Six primary endpoints are too many, but the authors eventually compared just the total Woodcock scores, leaving only two primary endpoints. Seven intervention schools gave up on the programme, but fortunately all but one of them collected outcome data, so they were included in the primary “analysis by intention to treat”. Schools and pupils were well balanced at trial entry (Table 5), but 214 children (12% of the original sample) missed the reception class assessment, and by the year one point 430 children (24%) were lost to follow-up.
Results
The raw mean end-of-year-one score in the intervention schools was 82 v 78 in controls (Table 6), but the standard deviation was huge (about 57 points), so the four-point difference amounted to a tiny effect size of only 0.07 SD (95% CI -0.03 to 0.18, P=0.14). The effect size for the mean Woodcock score at the end of reception (the trial midpoint) was even smaller: 0.04 SD (95% CI -0.06 to 0.14, P=0.42).
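The headline point estimate is easy to check from the raw figures quoted above; the confidence interval is not, because the report’s estimate comes from a multilevel model allowing for clustering within schools. A minimal sketch of the arithmetic:

```python
# Rough reproduction of the headline effect size from the raw summary
# statistics quoted above (Table 6). The report's own estimate comes from
# a multilevel model, so this naive calculation matches the point estimate
# but not the confidence interval.
mean_intervention = 82.0   # mean Woodcock total score, end of year one
mean_control = 78.0
pooled_sd = 57.0           # approximate standard deviation quoted above

effect_size = (mean_intervention - mean_control) / pooled_sd
print(f"Effect size ≈ {effect_size:.2f} SD")   # ≈ 0.07 SD
```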
In summary, the main trial result was negative. The observed differences could easily have occurred by chance, and even if real they would be too small to be educationally meaningful, or to justify either the cost or the teachers’ time and effort.
The authors also looked at a phonics score at the trial end (not pre-specified) and adjusted for different combinations of baseline variables, but the effect sizes were smaller and the nominal P values larger. Curiously, although they had planned to test the Woodcock subscales, they decided not to do so “to avoid multiple testing”. The raw subscale scores differed little (Table 7).
Subgroup analyses
No subgroup analyses are mentioned on the trial registry, but the original protocol planned “exploratory analysis [… e.g.] boys/girls, ethnicity, children of different abilities at baseline, high/low implementation schools”. In the revised protocol they added “the main analysis will be repeated on a subsample […] eligible for free school meals”.
The free school meal subgroup analysis was negative at the end of the trial (effect size 0.12 SD, 95% CI -0.10 to 0.34, P=0.23) but nominally significant at the reception class midpoint (effect size 0.22 SD, 95% CI 0.01 to 0.44, P=0.03). This intermediate benefit would be worthwhile if real, although presumably not worth much if it had disappeared a year later.
The secondary analysis by baseline attainment showed no differences, and the other secondary analyses by gender or ethnicity appear to have been quietly (and wisely) forgotten.
This left a secondary analysis excluding the seven intervention schools that gave up on the programme. Removing those schools, which were so disorganised that they couldn’t follow through on a two-year literacy project, from the intervention group but not from the controls introduces bias in favour of the intervention. Even so, the effect size was tiny and of only borderline statistical significance: 0.10 SD (95% CI -0.01 to 0.22, P=0.05).
The programme cost £46,000 per school in the first year (about £169 per pupil). Costs fell steeply in subsequent years, so that over three years the cost per pupil per year was estimated at £62. The cost of the recipient schools’ own staff training time was excluded from these figures.
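Putting the quoted cost figures together gives a rough idea of what they imply. The school size and three-year total below are my own inferences from the numbers above, not figures taken from the report; they mainly show that the per-pupil cost is spread over the whole school, not just the trial cohort.

```python
# Back-of-envelope check on the cost figures quoted above. The implied
# school size and three-year total are inferences from those figures,
# not numbers reported in the evaluation.
first_year_cost_per_school = 46_000      # £
first_year_cost_per_pupil = 169          # £
three_year_cost_per_pupil_per_year = 62  # £

implied_pupils_per_school = first_year_cost_per_school / first_year_cost_per_pupil
implied_three_year_cost_per_school = (three_year_cost_per_pupil_per_year * 3
                                      * implied_pupils_per_school)

print(f"Implied pupils per school: {implied_pupils_per_school:.0f}")            # ~272
print(f"Implied three-year cost per school: £{implied_three_year_cost_per_school:,.0f}")  # ~£50,600
```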
Here are the Foundation evaluators’ conclusions together with my comments. Note that an effect size of 0.07 is equivalent to one month’s progress.
“Children who took part in Success for All (SfA) made 1 additional month’s progress, on average, after two years compared to children in other schools. [We] are moderately confident that this difference was due to SfA.”
My interpretation. It depends what you mean by “moderately confident”. I’m moderately confident that the intervention is ineffective. The trial hasn’t ruled out a tiny beneficial effect, but the observed difference could well have occurred by chance. The trial has not ruled out a small harmful effect either. No-one would license a new drug on the basis of such a result.
“Children eligible for free school meals (FSM) made 2 additional months’ progress after two years, compared to FSM children in control schools. The smaller number of FSM pupils in the trial limits the security of this result, though combined with other findings in the report it provides some evidence that SfA does improve literacy ability for children eligible for free school meals.”
My interpretation. No they didn’t. The claimed two additional months’ progress (three months in total) comes from the effect size of 0.22 SD (P=0.03) for the interim score at the end of reception. Not only was this a secondary endpoint, albeit pre-specified, but the apparent effect had faded by year one (effect size 0.12 SD, 95% CI -0.10 to 0.34, P=0.23).
The remainder of the Foundation’s summary points are bland but positive, e.g. “Schools that successfully delivered SfA were enthusiastic and valued the programme.”
The trial authors know the results were negative. The full report contains this sentence:
“The only other trials were based in the US and reported a positive effect of the programme, achieving effect sizes in the region of 0.15-0.30. The current trial has been unable to replicate these effects in an English context.”
But a visitor to the Foundation’s website would struggle to find it among the positive spin.
Perhaps it’s churlish to criticise. Randomised trials of educational innovations are few and far between, and I certainly don’t want to discourage them. But if negative trials are spun as positive, disinterested parties will soon disbelieve their results, as they currently do most non-randomised education research. That would be a pity.
This trial cost £1.4M. It was well designed and conducted. The result was negative. It should be reported as such.
Jim Thornton