
The $55M i3 trial of “Reading Recovery”

March 4, 2018

Results were biased by measuring outcomes when the intervention group hit their success target. 

Reading Recovery (RR), a method for helping poor readers, designed on the basis of research in the 1970s by New Zealand psychologist Marie Clay, is a worldwide movement. It is controversial, partly because it is a semi-secret commercial programme for which schools or parents have to pay, and partly because it emphasises “whole language” rather than phonics based methods.

In 2015 RR was evaluated in a large and expensive ($55M) randomised controlled trial in the United States. The report of the first year of the randomised element (click here) is behind a paywall, but a copy is available here i3 reading recovery trial.  The full report of all four randomisation years, as well as the non-randomised scale-up phase, is free (click here) or reading_recovery_final_report.

The trial has been criticised by some educationalists (click here) who appear to be more politically opposed to RR than to have identified genuine flaws in the trial’s methods. They cite four “problems”.

  1. Many low achievers were excluded. Answer: They were excluded from intervention and control groups equally, so this did not bias the results.
  2. The control group received a range of different experiences. Answer: Yes. They got “usual care”, common practice in a pragmatic trial.
  3. The successful completion rate of students in the program was modest. Answer: Yes, but it still appeared to work.
  4. No data supported the claim that Reading Recovery leads to sustained literacy learning gains. Answer: A valid criticism. Long term effects could not be measured in the randomised groups, because controls got RR at the end of the intervention period anyway.

Other critics have repeated the claims that the effect of RR was small and not sustained (click here) or focused on the lack of definition of what the control group got (click here).

The trial also has defenders (e.g. click here). The What Works Clearing House, an independent outfit evaluating evidence-based educational interventions, not only issued a special evaluation (click here) or wwc_may_102814, but gave the trial its strongest possible endorsement: “The research described in this report meets WWC group design standards without reservations.”

I’m a medical doctor with no vested interest – I only learned of the existence of RR a few weeks ago – but I do know a bit about randomised trials. So far as I’m aware, the more fundamental flaws identified in the rest of this post have not previously been described.

Surprisingly for such a large trial, it was unregistered (the word registration or registry does not appear in the report), nor did it have a published protocol (the word protocol appears five times but never as part of “trial protocol”), so readers have to take on trust the statement that the primary and secondary outcomes, and sample size had been pre-specified.

It was conducted “from the 2011-2012 school year through 2014-2015”. 1,254 schools participated.  The research objective was “What is the immediate impact of Reading Recovery on the reading achievement of struggling 1st-grade readers, as compared with business-as-usual literacy instruction?”.

Population – “Struggling 1st-grade readers”. Staff in each school identified the eight pupils with the lowest Observation Survey of Early Literacy Achievement (OS) score.

Intervention – 30 minutes one to one Reading Recovery lessons per day for 12-20 weeks, as a supplement to regular classroom literacy instruction.  These are described as “[…] individualized, short-term, highly responsive instruction […] Lessons attend to phonemic awareness, phonics, vocabulary, fluency, and comprehension [… are] intended to help students develop a set of self-regulated strategies for problem-solving words, self-monitoring, and self-correcting that they can apply to the interpretation of text [and] enabling students to use meaning, structure, letter-sound relationships, and visual cues in their reading and writing.”

The RR lessons went on for at least 12 weeks, following which they stopped when the pupil reached the target OS achievement score of 16, or at 20 weeks, whichever came first. We will return to this point.

Control – regular classroom literacy instruction and access to any literacy supports that were normally provided to low achieving 1st-grade readers by their schools, other than Reading Recovery.  The researchers describe these in detail.

“We obtained information on 1,245 (57 percent) of all students assigned to the control group […]. 39 percent received no supplemental instructional supports; 37 percent participated in some individual or small-group intervention (other than Reading Recovery) provided by a Reading Recovery-trained teacher; 23 percent participated in a literacy intervention that was not delivered by a Reading Recovery teacher; and 8 percent received ELL or special education supports. Seven percent a combination of the services listed above. A majority of control group students (61 percent) did experience some form of supplemental literacy support in addition to regular classroom instruction. Therefore, in this study, we are comparing the effectiveness of Reading Recovery to that of classroom instruction plus a range of other support services that schools provide to struggling readers.”

This is clear. The trial is not, as some had alleged (click here), one of “RR v nothing”, but of “RR v usual care”. However, the fact that the phonics/whole language components of usual care were not defined means the trial was never likely to contribute much to the phonics v whole language debate.

Outcome – The primary outcome was the Iowa Tests of Basic Skills (ITBS) at the end of the 12-20 week intervention.  Secondary outcomes were the Reading Comprehension and Reading Words subscales of the ITBS, and the Observation Survey described above, which had been used to define the trial entry groups and to judge the length of the programme.

Randomisation – A total of 9,784 students were identified by picking the eight children with the lowest OS score per school. 1,254 times 8 = 10,032 but the missing pupils won’t bias the result because they were lost pre-randomisation.

“[…] teachers entered the names of the selected students into an online random assignment tool, noting their English language learner (ELL) status and their baseline OS Text Reading Level (TRL) subtest scores […]. The tool then matched them into pairs by first matching any students with ELL designations, then matching the student with the lowest TRL subtest score with the next-lowest student, and so on. Once the students were matched, a randomizing algorithm then randomly assigned one student in each pair to the treatment group and the other to the control group. The result was recorded in IDEC, and the tool was locked so that randomization in that school could not be redone.”

This is good. But then:

“4,892 students were randomized to treatment, and 4,892 to control. Both pretest and posttest data were available for 7,855 of these students (4,136 treatment; 3,719 control).”

This differential drop out (more drop outs among controls) would normally be a worry because drop outs might be poorly supported students. If you remove more poorly supported students from the control group than the intervention one, the control group would end up with higher scores, even if the intervention had no effect. Fortunately the researchers were aware of the problem and took steps to avoid it.

“Pairs in which either student was missing assessment data were dropped from the RCT, leaving a total of 6,888 students who were able to be matched into pairs with complete data (3,444 matched pairs in 1,122 schools).”

Excellent.  They did the right thing. If either member of a pair failed to complete the outcome measure they dropped both from analysis. As the authors put it, “Because the entire pair was dropped in the event that one student in a pair was missing outcome data, there is no differential attrition overall.” Here is the Consort flow diagram.

Table 2.3 (not shown here) shows the sex, race and baseline reading level of each group were, as would be expected, well balanced.
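The attrition figures quoted above can be tallied in a few lines. This is only an arithmetic sketch using the counts given in the report's text; the variable and function names are my own.

```python
# Counts quoted from the trial report.
randomised = {"treatment": 4892, "control": 4892}
complete = {"treatment": 4136, "control": 3719}  # pretest and posttest both available


def attrition_rate(n_randomised, n_complete):
    """Fraction of randomised pupils without complete pretest and posttest data."""
    return (n_randomised - n_complete) / n_randomised


rates = {arm: attrition_rate(randomised[arm], complete[arm]) for arm in randomised}

# After whole pairs were dropped, 3,444 matched pairs remained.
pairs_analysed = 3444
pupils_analysed = 2 * pairs_analysed
overall_retained = pupils_analysed / sum(randomised.values())

for arm, rate in rates.items():
    print(f"{arm}: {rate:.1%} missing before pairs were dropped")
print(f"analysed: {pupils_analysed} pupils, {overall_retained:.1%} of those randomised")
```

Before pairs were dropped the control arm had lost noticeably more pupils than the treatment arm. Dropping whole pairs makes the loss identical in both arms by construction, at the cost of analysing about 70% of those randomised.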

Now to the fundamental flaw; the timing of the outcome test.

“The precise timing of posttest [the short term outcomes] administration varied, as posttests were administered to both the treatment and the control student in a given matched pair immediately after the treatment student completed his or her 12- to 20-weeks of Reading Recovery lessons. This ensured that the two students experienced their assigned conditions for an identical time period. Typically, posttests were administered roughly halfway through the school year. As the study was designed as a delayed-treatment RCT, control students began receiving their Reading Recovery lessons after the posttests were administered to both treatment and control students in each matched pair.”

This sounds reasonable, but it biases results in favour of RR. Remember RR stopped when the pupil reached a score of 16 on the OS assessment, i.e. at the point when the RR programme was judged to have succeeded, and the pupil could be returned to the normal classroom.  Since there will surely be some day to day variation in scores due to extraneous factors, the RR group are systematically being scored on, or near to, a good day.  Their control partner is measured on the same calendar day, which may or may not be a randomly good day for them*.

And here are the results. Table 2.3 and 2.4 are taken from the full (four year) trial report. For some reason tests of statistical significance were omitted from there – we’ve noticed before (click here) that some educationalists object to them – so on the right is table 2 of the first randomised year only trial report with its P values.

The top row (tables 2.3 and 2.4) shows the mean ITBS score at the end of the RR or control period, i.e. at the point at which RR succeeded, or 20 weeks, whichever was sooner. The percentile ranks show where that score would lie on typical Grade 1 mid year scores. Rank 1 is the worst reading centile, and 100 the best. In table 2 (right) the primary outcome (mean ITBS score) is now placed at the bottom.


The RR pupils had a mean ITBS score of 138.8 (high is good) and controls 135.4.  The mean difference of 3.4 is about half a standard deviation better (favouring RR). In the first year, with a quarter of the sample size, the scores were 139.2 and 135 respectively, also favouring RR by half a standard deviation, and very unlikely to have occurred by chance (P<0.001) (table 2).

But I’m afraid these results are biased. The only difference between the groups should have been the RR programme.  But, as we have seen, there was another difference.  The RR students were measured on a systematically “good day” but the controls on the same calendar day, which may, or may not have been a “good day” for them.  We need not concern ourselves over the possibility of data dredging, P-hacking, repeated looks at the data, or any of the other biases that can creep in to unregistered trials, because the raw data are already clearly biased.

It is extraordinary that the researchers, the What Works Clearing House, and all the trial’s critics missed not only the lack of registration or a published protocol, but also this obvious bias in the timing of the primary and secondary trial outcome measures.

It’s great that educationalists are doing randomised trials, but they need to be done properly.  Critics should study the methods carefully.

Jim Thornton

*FOOTNOTES – If you’re not convinced, consider a hypothetical pair of pupils, and that RR had no effect.  Imagine that both pupils were bumping along with OS scores between say 12 and 15.  Let’s further imagine that on week 15 the control pupil by chance had a good day and hit an OS score of 16, but the RR pupil, also by chance had a bad day and dropped to 11.  These two scores would not be noticed; the trial goes on.  Now imagine that the following week the RR pupil got things sorted out at home, did well and scored 16, while in his turn the control pupil had his share of home troubles and dropped to 11.  Because the RR pupil had hit the magic score of 16 the RR stopped and the triallists measured both pupils.  They are identical, each having had by chance one good day and one bad day, but if we measure them both on the RR pupil’s good day we get an RR data point of 16 and a control data point of 11.

The OS score is not the trial’s primary outcome, but it and the ITBS are both surely correlated with chance “good days”.

For the avoidance of doubt the subset of pupil pairs whose RR member battled on through the full 20 weeks are not biased, because they were each sampled at a point unrelated to either member’s achievement.  But only about half of pupil pairs went on for this long.

I guess that, given the trial dataset, a more expert statistician than I could measure the variation in OS scores, the proportion of pupil pairs who stopped RR for success before 20 weeks, and the correlation between OS and ITBS scores, and use these data to model the bias that was likely to have been introduced by this test-timing method.  It would be an interesting project.
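In the absence of the trial dataset, the mechanism can at least be illustrated with a toy simulation. The sketch below is not a model of the real trial: the mean score, the day-to-day standard deviation, the weekly assessment schedule and the use of the stopping score itself as the outcome are all assumptions chosen for illustration, and the function names are hypothetical. RR is given no effect at all; both pupils in a pair have the same underlying level and differ only by random daily noise.

```python
import random
import statistics


def simulate_pair(rng, target=16.0, mean_score=13.5, day_sd=2.0,
                  first_week=12, last_week=20):
    """Simulate one matched pair under NO true treatment effect.

    Both pupils share the same underlying level; weekly scores differ only by
    independent day-to-day noise. Returns (rr_score, control_score, stopped_early).
    """
    for week in range(first_week, last_week + 1):
        rr_score = rng.gauss(mean_score, day_sd)
        control_score = rng.gauss(mean_score, day_sd)
        if week == last_week:
            # Full 20 weeks: the posttest date no longer depends on either score.
            return rr_score, control_score, False
        if rr_score >= target:
            # The RR pupil hit the target, so both pupils are measured today.
            return rr_score, control_score, True


def simulate_trial(n_pairs=20000, seed=1):
    """Mean RR-minus-control difference, overall and by how the pair ended."""
    rng = random.Random(seed)
    early, full = [], []
    for _ in range(n_pairs):
        rr_score, control_score, stopped_early = simulate_pair(rng)
        (early if stopped_early else full).append(rr_score - control_score)
    return {
        "share_stopped_early": len(early) / n_pairs,
        "mean_diff_all": statistics.fmean(early + full),
        "mean_diff_early": statistics.fmean(early),
        "mean_diff_full": statistics.fmean(full),
    }


result = simulate_trial()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the pairs that ran the full 20 weeks show a difference close to zero, while the pairs stopped for success show a clear advantage for the RR pupil even though the intervention does nothing. The size of the bias in the real trial would depend on the true score variability and on how strongly the ITBS tracks the same good and bad days.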


More unethical neonatal circumcision research in Africa?

February 18, 2018

Did an under-powered poorly-analysed trial give a misleading result, would the results have been generalisable anyway, and did the parents understand what they were consenting their sons for?


The trial, funded by the Bill and Melinda Gates Foundation, was conducted in 2013 in Zimbabwe, “traditionally a non circumcising country” by researchers from Harare and London (2015 report here or Zimbabwe circ trial, registration here or PACTR Registry zimbabwe circ trial).

Participants –  male infants from a Harare polyclinic.

Intervention – circumcision with a new Accucirc device (left above).

Control – circumcision using the standard Mogen clamp (right above).

Primary Outcome – the number of “moderate and severe” adverse events (AEs). The registry plan to include “minor” AEs as well was silently changed in the paper.


Using envelope randomisation in a 2:1 ratio, Accucirc:Mogen, the planned sample size of 100:50 had:

“80% power to detect noninferiority, based on a 2-sided 95% confidence interval (CI) approach, a 2% risk of AE in the Mogen clamp arm, and a noninferiority margin of 6% failure between the 2 arms. A noninferiority margin of 6% was chosen because this was deemed the maximum difference in safety that would be acceptable in terms of public health”.

A six percent absolute difference in moderate or severe adverse events is large for the “minimum clinically important difference” (MCID) for a trial of surgical devices. It implies that parents would choose the new device if the trial could reassure them that it had no more than six percent additional complications!  Did anyone ask them?

The trial ran more smoothly than almost any other in history.

“One hundred fifty male infants aged 6–54 days were circumcised between January and June 2013. All were circumcised according to their allocated intervention (n = 100 AccuCirc; n = 50 Mogen clamp). All participants attended the 3 scheduled follow-up visits on days 2, 7, and 14.”

The result was two moderate or severe adverse events in the Accucirc group and none with the Mogen clamp. One baby suffered excess skin removal which took four months to heal, and another inadequate skin removal requiring further surgery. The authors conclude:

“2.0% higher in the AccuCirc arm compared with the Mogen Clamp arm (95% CI: −0.7 to 4.7). As the 95% CI excludes the noninferiority margin of 6%, the result provides evidence of noninferiority of AccuCirc compared with the Mogen clamp.”

This is wrong. The normal approximation to the binomial’s 95% upper bound for 2/100 events is indeed 4.7%. But that formula is unreliable for small numbers. In that situation the exact confidence interval, for which the 95% upper bound for 2/100 events is 7%, is preferable.

This means that even ignoring the implausible zero envelope loss, the perfect compliance and the 100% follow-up, the correct statistical test alters the results from positive to negative; Accucirc may have too high an excess of adverse events for it to be acceptable.
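The two upper bounds quoted above can be checked with the standard library alone. This is a sketch of a one-sample calculation on the Accucirc arm (2 events in 100), matching the comparison made in this post rather than the authors' own analysis code; the exact bound is the Clopper-Pearson limit, found here by bisection, and the function names are my own.

```python
import math


def wald_upper(events, n, z=1.959964):
    """Upper 95% bound from the normal approximation to the binomial."""
    p = events / n
    return p + z * math.sqrt(p * (1 - p) / n)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_upper(events, n, alpha=0.05, tol=1e-10):
    """Clopper-Pearson upper bound: the p at which P(X <= events) = alpha / 2."""
    if events == n:
        return 1.0
    lo, hi = events / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(events, n, mid) > alpha / 2:
            lo = mid  # observing so few events is still plausible: bound is higher
        else:
            hi = mid
    return (lo + hi) / 2


print(f"normal approximation upper bound: {wald_upper(2, 100):.1%}")
print(f"exact (Clopper-Pearson) upper bound: {exact_upper(2, 100):.1%}")
```

The first figure falls below the 6% noninferiority margin and the second above it, which is the whole disagreement.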

The following wasn’t in their analysis plan, but if the authors had analysed their trial as a conventional superiority one, the estimated relative risk (RR) would have been infinite because of zero events with the Mogen clamp. If they had followed convention and added 0.5 to each cell of the two-by-two table to allow a relative risk to be estimated, they would have RR 2.5 (95% CI 0.12 – 52).  They would be 95% certain that the true effect lay somewhere between Accucirc having 10 times fewer, or 50 times more, adverse events than the Mogen clamp.
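For anyone wanting to check the relative risk, here is a minimal sketch. It assumes the usual continuity correction of adding 0.5 to every cell of the two-by-two table (events and non-events in both arms), which is the version that reproduces the figures quoted above, and a standard log-scale confidence interval. The function name is hypothetical.

```python
import math


def relative_risk_ci(events_a, n_a, events_b, n_b, correction=0.5, z=1.959964):
    """Relative risk of arm A versus arm B with a continuity correction.

    The correction is added to all four cells, so each arm's total grows by
    twice the correction. Returns (rr, lower, upper) for a 95% interval.
    """
    a = events_a + correction
    b = events_b + correction
    total_a = n_a + 2 * correction
    total_b = n_b + 2 * correction
    rr = (a / total_a) / (b / total_b)
    se_log_rr = math.sqrt(1 / a - 1 / total_a + 1 / b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


rr, lower, upper = relative_risk_ci(events_a=2, n_a=100, events_b=0, n_b=50)
print(f"RR {rr:.1f} (95% CI {lower:.2f} to {upper:.0f})")
```

An interval this wide is compatible with almost anything, which is the point: 150 babies and two events cannot say much about relative safety.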


The authors ran the trial to a higher standard than they could implement in practice.

“All infants received vitamin K to minimize bleeding; vitamin K should be routinely administered at birth but was out of stock nationally at the time of the trial and therefore had to be imported specifically for that purpose.”

In other words, the research was conducted with special drug safety cover unavailable in the rest of Zimbabwe. How could they therefore extrapolate from the first sentence of their conclusion: “We safely circumcised 150 infants in a randomized trial of AccuCirc versus Mogen clamp for EIMC in Zimbabwe.” to the second: “The AccuCirc device has the potential to facilitate widespread scale-up of safe EIMC in sub-Saharan Africa.”?


For a trial in Zimbabwe “traditionally a non circumcising country”, the researchers had to both persuade the parents to let their baby undergo circumcision, and then gain informed consent to him participating in the research trial comparing the standard with a new “experimental” method.

“Sensitization on EIMC [Early Infant Male Circumcision] and participant recruitment took place at the antenatal clinic and after delivery in the maternity ward. Educational materials (posters and pamphlets) and demand creation activities (road shows, dramas, group and interpersonal discussions) were used to educate and sensitize the community about the trial.”

Ignore for a moment the ethics of “sensitizing” non-circumcising communities to, and running “demand creation activities” for, neonatal circumcision. If the researchers themselves confuse “sensitization and demand creation” for circumcision, with “sensitisation and demand creation” for the trial, how likely is it that parents of potential participants were clear about the difference?

“To enrol 150 babies in the comparative trial, we approached 1151 parents of newborn male infants, corresponding to a 13% uptake of EIMC. A total 984 (85%) parents declined for their son to participate.” […] “A further 17 male infants were excluded after assessing their eligibility for inclusion (Fig. 1).”

This reads as if all parents of eligible participants consented to the trial.

However, Fig 1 (the trial flow diagram above) shows that only three babies were ineligible for medical reasons, leaving 14 parents who definitely did understand the difference between agreeing to the circumcision and consenting to the trial. We know this because they agreed to the former but declined the latter.

This still leaves the consent rate for the research trial as 150/164 or 91%! That is high for any randomised trial, extraordinarily high for a surgical trial, and suggests that some parents had, like the researchers, indeed muddled consent for circumcision with consent to randomisation.
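The recruitment arithmetic behind that 91% can be laid out step by step. The counts are those quoted from the paper and its Fig 1; the variable names are my own.

```python
approached = 1151                # parents of newborn boys who were approached
declined = 984                   # declined for their son to participate
excluded_after_assessment = 17   # excluded after eligibility assessment
medically_ineligible = 3         # of those 17, excluded for medical reasons

enrolled = approached - declined - excluded_after_assessment
declined_trial_only = excluded_after_assessment - medically_ineligible

# Parents who accepted circumcision and whose sons were medically eligible.
eligible_accepting_circumcision = enrolled + declined_trial_only

circumcision_uptake = enrolled / approached
trial_consent_rate = enrolled / eligible_accepting_circumcision

print(f"enrolled: {enrolled}")
print(f"uptake among parents approached: {circumcision_uptake:.0%}")
print(f"trial consent among eligible, willing parents: {trial_consent_rate:.0%}")
```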

Elsewhere, while trying to explain the low overall circumcision uptake, the authors accidentally acknowledge just such a muddle, although they argue that it caused parents to decline their son being circumcised at all because of reluctance to join the research.

“Zimbabwean parents were informed that the trial was comparing 2 EIMC devices. Parents may therefore, have felt this indicated that the devices were “experimental”; this thought may have exacerbated their fear of harm.”

Subsequent non-randomised rollout of Accucirc by the same authors, in the same clinics, immediately after the trial ended (click here for details) had an even lower total uptake of 500/4617 = 11%,  so the authors’ claim is unlikely. Parents weren’t put off circumcision because they feared randomisation; they disliked it just as much when randomisation was not on offer.

No-one can prove at this distance what parents really understood. But it looks like many parents allowed their sons to join this badly designed and analysed trial thinking that they were consenting to the circumcision. This after nine months of pro-circumcision propaganda – forgive me! – nine months of “sensitization” which had “created a demand”.

The study was approved by the Medical Research Council of Zimbabwe (click here), and the ethics committees of University College London (click here) and the London School of Hygiene and Tropical Medicine (click here).

Jim Thornton


Dr Bawa-Garba and nurse Amaro

February 8, 2018

The chance of two health workers independently committing clinical negligence manslaughter on the same day, on the same child?

A few weeks ago my colleagues and I wrote in the BMJ about the conviction of Dr Bawa-Garba for clinical negligence manslaughter of her patient, six year old Jack Adcock (click here).  In part, we noted:

“There is another strange aspect to the jury’s verdict. One of the nurse co-defendants was also convicted of the same offence at the same trial. Medical negligence manslaughter is extremely uncommon, with only 22 convictions and three guilty pleas since 1795 in the UK – fewer than one every eight years. For comparison the National Lottery has created 4,750 millionaires since 1994, nearly 200 per year. Given that there has never been any suggestion that Dr Bawa-Garba and the nurse colluded, it is an extraordinary coincidence that two people should independently commit such a rare offence on the same day, in the same hospital and while caring for the same child?”

Much else has been written (e.g. click here for a view from the US) about the pressure Bawa-Garba was under that day, the failings of her employing hospital and of her consultant, and whether a jury can really appreciate the difficulty of diagnosing sepsis in a child, but no-one else seems to have noted this coincidence. Was the simultaneous crime really as unlikely as two closely related people separately winning the lottery jackpot, and if so, did the judge inform the jury correctly?

After the event, the probability of two related individuals winning the jackpot is never as low as people imagine. There are many ways to link people; relatives, neighbours, friends, coworkers etc., and the lottery has run every week for 25 years or so. The fact that members of a family from Tipton won three times (click here) does not really “defy the odds of 350 billion to one”.

But in prospect, given that one particular person has experienced a rare event, the probability of a related person experiencing the same event can be defined.  For each National Lottery game the chance of winning the jackpot is about 1 in 14 million. Once John Smith has won, the chance of his friend Sam Brown winning the same jackpot the following week, assuming he buys a ticket, is 1 in 14 million. If ten friends come round to John’s house the week after his big win, each with a ticket in hand, the chance of one of them walking off with the jackpot that very week is 1 in 1.4 million. Not very likely.

Medical negligence manslaughter is rarer than lottery jackpot wins, but the population that can commit it is also smaller than lottery players. To commit it you must both be one of about 400,000 doctors, nurses, health visitors or midwives working in the UK, and also care for a patient who dies; staff who don’t have a patient die may do many careless things that day, but they cannot commit clinical negligence manslaughter. About 1,600 people die every day in the UK, so if they were each looked after by only one clinical carer, 1,600 people could potentially commit clinical negligence manslaughter on a typical day.  If more realistically each person who died had been looked after by ten clinical carers, 16,000 people would be potentially able to commit the offence each day.  Since only one person is found guilty every eight years, the probability that any one of these health workers commits such an offence on a given day is at most 1 in (1,600 × 365 days × 8 years), roughly 1 in 4.6 million, and more realistically 1 in (16,000 × 365 × 8), roughly 1 in 46 million.

We don’t know the order in which the jury decided the guilt of Dr Bawa-Garba and nurse Amaro, but they must have found one guilty first. Given that the case had come to court, the jury should not have been unduly surprised to find one of the people in the dock guilty.

But the jury had also been led to believe that there had been no collusion between Dr Bawa-Garba and nurse Amaro – this was no Morecambe Bay, where a group of midwives had allegedly gone “off the rails” together. The prosecution had argued that Dr Bawa-Garba and nurse Amaro were each independently almost uniquely incompetent and careless.  If so, what is the chance that on the same day in the same hospital and caring for the same child, another person should independently administer care that was, as the judge instructed, so “truly exceptionally bad” as to amount to clinical negligence manslaughter?

If staff commit clinical negligence manslaughter on somewhere between 1 in 4.6 million, and 1 in 46 million patient treatment days where the patient dies, and if say ten other staff cared for that boy on the day in question, the chance in prospect that the second defendant had also committed the same offence was somewhere between 1 in 460,000 and 1 in 4.6 million! i.e. the same order of probability as the chance in prospect, that the week after John Smith won the jackpot, one of his ten friends would also win the jackpot. If someone had told the jury that, would they have still convicted both women?
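The back-of-envelope sums in this post can be reproduced as follows. This is only a sketch of the arithmetic: the inputs are the round numbers used above (1,600 deaths a day, one conviction every eight years, one or ten carers per death, ten other staff caring for the child), and the function names are hypothetical. The exact products come to about 4.67 million and 46.7 million, which the text quotes as 4.6 and 46 million.

```python
LOTTERY_JACKPOT_ODDS = 14_000_000   # about 1 in 14 million per ticket
DEATHS_PER_DAY = 1_600              # approximate UK deaths per day
YEARS_PER_CONVICTION = 8            # about one conviction every eight years
DAYS_PER_CONVICTION = 365 * YEARS_PER_CONVICTION


def lottery_one_in(friends_with_tickets):
    """'1 in N' chance that one of several friends wins a given jackpot."""
    return LOTTERY_JACKPOT_ODDS / friends_with_tickets


def offence_one_in(carers_per_death):
    """'1 in N' chance that a given carer of a dying patient commits the offence."""
    return DEATHS_PER_DAY * carers_per_death * DAYS_PER_CONVICTION


def second_offender_one_in(carers_per_death, other_staff=10):
    """'1 in N' chance that one of the other staff independently did so too."""
    return offence_one_in(carers_per_death) / other_staff


print(f"ten friends, one jackpot: 1 in {lottery_one_in(10):,.0f}")
for carers in (1, 10):
    print(f"{carers} carer(s) per death: 1 in {offence_one_in(carers):,} per carer, "
          f"1 in {second_offender_one_in(carers):,.0f} for a second offender")
```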

There are five possible explanations.

  1. A truly extraordinary thing happened that day, an approximately 1 in a million coincidence.
  2. Dr Bawa-Garba and nurse Amaro colluded.
  3. Something about that Leicester hospital’s children’s ward made clinical negligence manslaughter more likely than usual that day.
  4. Clinical negligence manslaughter is more common than we think, and many doctors and nurses are getting away with it.
  5. Dr Bawa-Garba, or nurse Amaro, or both, were wrongly convicted.

Jim Thornton


Menopausal hormone therapy for health promotion

January 30, 2018

Is the USPSTF wrong?

The US Preventive Services Task Force (USPSTF) has just reviewed its advice for the fourth time in the 15 years since the two huge Women’s Health Initiative (WHI) trials reported.  No-one questions menopausal hormone therapy for symptom control, but for long-term health promotion we need to be sure that it does more good than harm. Previous advice that long term harms outweigh the benefits has been consistent, but needs regular review.

The latest advice (click here), authored by experts selected to avoid any potential conflicts of interest with drug manufacturers (details here), following public consultation, and published in one of the world’s most prestigious medical journals, the Journal of the American Medical Association (JAMA, impact factor 44), is unchanged; don’t use hormone therapy for health promotion.

It is supplemented by a 15 page evidence report (click here) as well as 20 evidence tables, and endorsed by three independent commentaries (click here, here, and here) which provide historical and methodological context. Here are the abstract’s findings, and conclusions and recommendation.

Findings Although the use of hormone therapy to prevent chronic conditions in postmenopausal women is associated with some benefits, there are also well-documented harms. The USPSTF determined that the magnitude of both the benefits and the harms of hormone therapy in postmenopausal women is small to moderate. Therefore, the USPSTF concluded with moderate certainty that combined estrogen and progestin has no net benefit for the primary prevention of chronic conditions for most postmenopausal women with an intact uterus and that estrogen alone has no net benefit for the primary prevention of chronic conditions for most postmenopausal women who have had a hysterectomy.

Conclusions and Recommendation The USPSTF recommends against the use of combined estrogen and progestin for the primary prevention of chronic conditions in postmenopausal women. (D recommendation) The USPSTF recommends against the use of estrogen alone for the primary prevention of chronic conditions in postmenopausal women who have had a hysterectomy. (D recommendation)

This seems clear and measured. But another group of nine self-styled “experts” have written a response in a low impact scientific journal, Menopause (IF 2.7) (click here), titled “Menopausal hormone therapy for primary prevention: why the USPSTF is wrong”. They’ve written the same article in an even less prestigious journal, Climacteric (IF 1.7) (click here), as well. Both journals are behind paywalls so few will read them, and I’m certainly not going to buy access. But in case hormone manufacturers distribute reprints at menopause meetings, or to naive journalists or non-expert doctors, let’s check the Menopause and Climacteric authors’ conflicts of interest.

RD Langer, Professor of Family Medicine, University of Nevada has signed his name to “medical writer” authored articles for Novo Nordisk (click here), received research support from Glaxo Smith Kline and served as a consultant and expert witness for Wyeth.

JA Simon, Professor of Obstetrics and Gynecology, George Washington University has also signed his name to “medical writer” authored articles (click here) and served as a consultant to or on the advisory boards of AbbVie, Allergan, AMAG, Amgen, Inc., Apotex, Inc., Ascend Therapeutics, Azure Biotech, JDS Therapeutics, Merck & Co, Millendo Therapeutics, Noven, Novo Nordisk, Nuelle, Perrigo Company, PLC, Radius Health, Regeneron Pharmaceuticals, Roivant Sciences, Sanofi SA, Sebela Pharmaceuticals, Sermonix Pharmaceuticals, Shionogi, Inc., Sprout Pharmaceuticals, Symbiotec Pharmalab, TherapeuticsMD, and Valeant Pharmaceuticals; and has received grant/research support from AbbVie, Actavis, PLC, Agile Therapeutics, Bayer Healthcare, GlaxoSmithKline, New England Research Institute, Novo Nordisk, Palatin Technologies, Symbio Research, and TherapeuticsMD; and has also served on the speaker’s bureaus of Amgen, Eisai, Merck, Noven Pharmaceuticals, Novo Nordisk, Shionogi, Inc., and Valeant Pharmaceuticals; and is a stockholder in Sermonix Pharmaceuticals.

A Pines, Professor of Internal Medicine, Tel-Aviv University is one of five “prolific partisan editorialists” identified by Athina Tatsioni and colleagues in 2010 (click here). In most of his publications he reports no conflicts of interest. However, on the Climacteric editorial board he acknowledges being a consultant for a local, Israeli distributor that represents about 20 international pharma companies with niche products in all fields of medicine (click here).

RA Lobo, Professor of Obstetrics and Gynecology, Columbia University Medical Center, New York was identified by Adriane Fugh-Berman and her colleagues in 2011 (click here) as one of ten conflicted authors repeatedly using “promotional tone” in articles about hormone therapy.  In 2014 the website of the International Menopause Society reported that he “has provided consulting services for many large pharmaceutical laboratories”, although the link is now dead. He does not report conflicts of interest in his more recent articles, although many are behind paywalls and so difficult to check.

HN Hodis, Professor of Cardiology, University of Southern California has no declared conflicts of interest.

JH Pickar, Professor of Obstetrics and Gynecology, Columbia University Medical Center, New York has signed his name to “medical writer” authored articles funded by Pfizer (click here), received consultant fees from Wyeth/Pfizer, Shionogi Inc., Radius Health, and TherapeuticsMD, and has stock options in TherapeuticsMD.

DF Archer, Professor of Obstetrics and Gynecology, Eastern Virginia Medical School (click here) has received grant support from Actavis and Glenmark, received grant support, honoraria, and travel support from Bayer Healthcare, Endoceutics, Merck, Radius Health, Shionogi, and TherapeuticsMD, received honoraria and travel support from Exeltis/CHEMO France, Pfizer, Sermonix Pharmaceuticals, and TEVA/HR Pharma, and received honoraria and travel support from and has an equity interest in Agile Therapeutics and Innovagyn.

PM Sarrel, Emeritus Professor of Obstetrics, Gynecology, Reproductive Sciences, and Psychiatry, Yale University (click here) has served as a medical consultant for Noven Therapeutics.

WH Utian, Professor Emeritus, Reproductive Biology, Case Western Reserve University, Cleveland (click here) has been a consultant/advisory board member for Bayer, Bionovo, Hygeia (Orcas Therapeutics), Lupin, Merck, Novogyne, and Pharmavite.

Readers might wish to bear the above in mind when they decide whether or not to take these doctors’ advice to go against the USPSTF recommendations to avoid menopausal hormone therapy for health promotion.

Jim Thornton


Different diagnostic tests for women at risk of preterm birth

January 27, 2018

Could they alter the effectiveness of progesterone?

Last week I looked at the idea that putting a plastic ring around that part of the cervix that protrudes into the vagina could prevent preterm birth (click here); it didn’t make much sense. Today I consider the idea that how we diagnose women at risk of preterm birth affects whether treatment works. The two tests are cervical length scanning and vaginal fetal fibronectin (fFN) measurement.

As preterm labour approaches, the uterus contracts, pulling and shortening the cervix – so-called effacement. Although cervical length measurement by ultrasound is tricky, and there are false positives, it’s a reasonable test for predicting preterm birth (click here).

fFN, a protein made by the fetus, lies between the membranes and the uterus, holding them together. Normally it is absent from vaginal secretions, but as the cervix effaces the membranes shear off the lower part of the uterus, and fFN leaks into the vagina. Again there are some false positives, but it is a reasonable test (click here).

Each test is a different way to measure the same thing, the process of cervical effacement that precedes labour.  It may be that, depending on the cut-off values – length of cervix, level of fFN – one is slightly better than the other, but they are based on the same process and identify the same women, those at risk of preterm birth because the cervix is effacing.

It would be strange if treatment for preterm labour worked when the risk had been diagnosed one way, but not the other. Imagine if streptomycin cured tuberculosis diagnosed by X-ray, but not tuberculosis diagnosed by finding M. tuberculosis in sputum! Or if anti-hypertensive drugs prevented strokes when high blood pressure had been measured directly, but not when it had been inferred by retinal fundoscopy! If tests identify the same disease, treatment will work equally well whichever test is used.

But some enthusiasts claim that progesterone works when the risk of preterm birth is based on an ultrasound-detected short cervix, but not when it is based on a vaginal fFN test. They argue as follows:

The best quality trials are overall negative, and dividing them up by singletons or twins, or by progestagen type, doesn’t identify a subgroup where the drug works.  But apparently it does work for the subgroup of women who joined trials on the basis of a short cervix diagnosed by vaginal scan. Many secondary meta-analyses make this claim (e.g. click here, here, and here), such that some people advocate universal cervical length screening followed by treatment with progesterone (click here).

But progesterone doesn’t work, it might even be harmful, for women judged at risk of preterm birth on the basis of a positive vaginal fFN test. Does that really make biological sense? It doesn’t to me.

Jim Thornton



Dying and its aftermath

January 24, 2018

Two poems by Elizabeth Jennings

In the early 1960s, Jennings made a number of suicide attempts and spent time in and out of mental hospitals. She wrote For a Woman with a Fatal Illness, part of the masterpiece Sequence in Hospital (1963), during this period. Absence is from an earlier collection, A Sense of the World (1958).

For a Woman with a Fatal Illness

The verdict has been given and you lie quietly
Beyond hope, hate, revenge, even self-pity.

You accept gratefully the gifts – flowers, fruit –
Clumsily offered now that your visitors too

Know you must certainly die in a matter of months,
They are dumb now, reduced only to gestures,

Helpless before your news, perhaps hating
You because you are the cause of their unease.

I, too, watching from my temporary corner,
Feel impotent and wish for something violent –

Whether as sympathy only, I am not sure –
But something at least to break the terrible tension.

Death has no right to come so quietly.

Absence



I visited the place where we last met.
Nothing was changed, the gardens were well-tended,
The fountains sprayed their usual steady jet;
There was no sign that anything had ended
And nothing to instruct me to forget.

The thoughtless birds that shook out of the trees,
Singing an ecstasy I could not share,
Played cunning in my thoughts. Surely in these
Pleasures there could not be a pain to bear
Or any discord shake the level breeze.

It was because the place was just the same
That made your absence seem a savage force,
For under all the gentleness there came
An earthquake tremor: Fountain, birds and grass
Were shaken by my thinking of your name.



Cerclage pessaries

January 7, 2018

Do they make sense?

The cervix is a tube of fibrous tissue with an important job, to hold the baby in for nine months. If it fails the pregnancy ends in miscarriage or preterm birth.

Here’s how the cervix opens. The muscular upper part of the uterus contracts, effacing (shortening) the tube and then stretching the cervix passively over the baby’s head.


The cerclage pessary is a plastic device inserted in the vagina in such a way that it encircles that part of the cervix which protrudes into the vagina (picture below). The idea is that it can prevent miscarriage or premature birth by holding the cervix closed. Trials have had mixed results (click here), but today I want to ask: is the idea even plausible?

Look at the pictures again, and get your head around the process of effacement.  How could a pessary lying loosely round that part of the cervix which protrudes into the vagina hold the cervix closed? The cervix can’t dilate until it has effaced.  But effacement pulls the cervix out of the pessary. The idea is impossible.

Even if the pessary exerted some pressure on the cervix it would not be able to do so on the very ones that most need it – those where the effacement/shortening has already started, or where the cervix was already shortened by previous surgery such as cone biopsy.

And don’t forget the pessary is a foreign body kept in the vagina for many weeks, sometimes months. Infection is a well-recognised cause of preterm birth. How did the idea that the cerclage pessary might work ever get traction?

Jim Thornton
