Leor Sapir

What Does Quality of Evidence Mean?

Gender clinician Jack Turban misunderstands key concepts in evidence-based medicine.

On Wednesday, the Dartmouth Political Union hosted a debate on sex and gender between MIT philosopher Alex Byrne, University of California at San Francisco psychiatrist Jack Turban, and Aston University emerita neuroscientist Gina Rippon.

An interesting moment came when Byrne asked Turban what he thought of the recently published Cass Review, the 388-page comprehensive report on youth gender medicine, authored by British physician Hilary Cass and her colleagues. Turban claimed that the report found “moderate” quality evidence for “gender-affirming care,” and that, contrary to its reception, the review’s findings did not lend support to restrictions on puberty blockers and other medical interventions for pediatric gender dysphoria.

Turban’s characterization is at odds with that of Cass and her team. Cass’s report, published alongside seven new systematic evidence reviews on several issues associated with youth gender transition, concludes that the evidence for the safety and efficacy of puberty blockers and cross-sex hormones as treatments for gender-related distress in adolescents is “remarkably weak.” Youth gender medicine, Cass writes in the prestigious British Medical Journal, “is built on shaky foundations.”

Here, I want to respond to Turban’s comments about evidence quality in the Cass Review. These issues are technical but important for those following the debate over pediatric gender medicine.

First, some background. Jack Turban is one of the nation’s most prominent defenders of pediatric sex-trait modification (“gender-affirming care”). He has garnered a reputation outside of his circle of followers for pursuing agenda-driven research, evading scientific debate, launching ad hominem attacks on scientific critics, and misrepresenting research findings—including his own.

In a recent deposition in a lawsuit over Idaho’s Vulnerable Child Protection Act, which bans sex-trait modification in minors, Turban demonstrated under oath his lack of understanding of, or failure to be honest about, basic principles of evidence-based medicine (EBM). He seemed unaware, for example, that systematic reviews of evidence are meant not only to assess the available research but also to score the quality of that research. Gordon Guyatt, a professor of health research methods, world-renowned expert in EBM, and a founder of the field, has said that when it comes to systematic reviews, Turban has shown he “does not understand what it’s all about.”

The studies Turban and other gender clinicians cite in support of “gender-affirming care” often suffer from high risk of bias and show inconsistent findings regarding mental-health outcomes. Further, these studies often are conducted by gender clinicians with ideological, professional, and even financial stakes in administering drugs and surgeries to minors.

The benefit of systematic reviews is that they do not take authors’ conclusions at face value. Instead, they allow independent experts in research methods and evidence evaluation to scrutinize studies’ designs and conclusions. The research on youth gender medicine interventions generally lacks adequate follow-up time, has high drop-out rates, fails to control for potential confounding factors, and regards as homogeneous a patient population with significantly different clinical presentations.

Because systematic reviews are EBM’s gold standard for furnishing clinicians and guideline developers with reliable information, it’s necessary to respond directly to Turban’s claim in the Dartmouth debate that the new systematic reviews associated with the Cass report found “moderate quality” evidence that puberty blockers improve mental health. (I will focus here on the puberty blockers review, although the analysis applies to the cross-sex hormone review as well.)

Turban’s claim is false, for three reasons. First, he ignores the crucial distinction in EBM between quality of studies and quality of evidence—an admittedly non-obvious distinction, but one that any competent clinician who opines on EBM issues should comprehend. Second, he fails to distinguish between the mental-health and non-mental-health-related research cited in the report. Third, he ignores the fact that the authors of the systematic reviews used a scoring tool that already sets a lower bar for evaluating research. In effect, the reviewers (and the Cass team) performed affirmative action for youth-gender-medicine research and still found it wanting.

Quality of Studies v. Quality of Evidence. To evaluate the quality of studies on puberty suppression, the authors of the systematic review used a modified version of the Newcastle–Ottawa Quality Assessment Scale (NOS), a tool for evaluating nonrandomized studies. Studies assessed by the scale receive one of three grades: low, moderate, or high. Of the 50 studies on puberty suppression the authors identified as relevant, 24 (including one by Turban) were excluded for being low quality. Of the remaining 26, one was determined to be high quality, and 25 moderate quality. Turban’s confusion is therefore understandable: wouldn’t the finding that most of the research is moderate quality mean that the evidence overall is moderate quality?

Not exactly. In EBM, “quality of study” refers to a given study’s risk of bias. “Bias” is a technical term, which Cochrane defines as “systematic error, or deviation from the truth, in results”; risk of bias is the likelihood that a study’s design or conduct has introduced such error. To give an obvious example, if you want to test the effects of puberty blockers on mental health and give them to patients who are already receiving psychotherapy, any positive outcomes may be attributable to the drugs, the therapy, or some combination of the two. A study design that is incapable of isolating the effects of puberty blockers from confounding variables like psychotherapy is at high risk of bias.

Quality of evidence, on the other hand, refers to the confidence we can have in our estimate of an intervention’s effect, based on the entire body of information. Quality of studies (based on risk of bias) is one factor that determines quality of evidence; others include publication bias (when, for example, a journal declines to publish an unfavorable study); inconsistency (when studies addressing the same question come to significantly different results); indirectness (when the studies do not directly compare interventions of interest in populations of interest, or when they do not report outcomes deemed important for clinical decisions); and imprecision (when studies are subject to random error, often due to small sample sizes).
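The relationship between the two concepts can be sketched in a few lines of Python. This is only an illustration of the general logic used by evidence-rating frameworks such as GRADE, in which a body of randomized trials starts at “high,” a body of nonrandomized studies starts at “low,” and each serious concern lowers the rating. The function name, the numeric concern scores, and the example inputs are hypothetical and are not taken from the University of York reviews; the sketch also leaves out the factors that can raise a rating, such as a very large effect.

```python
# Certainty levels, ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]

# Factors that can lower confidence in a body of evidence.
DOMAINS = (
    "risk_of_bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication_bias",
)


def rate_certainty(randomized: bool, concerns: dict[str, int]) -> str:
    """Rate a whole body of evidence (hypothetical helper).

    concerns maps a domain name to 0 (no concern), 1 (serious),
    or 2 (very serious). Domains not listed count as 0.
    """
    unknown = set(concerns) - set(DOMAINS)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")

    # Randomized trials start at "high"; nonrandomized studies at "low".
    start = LEVELS.index("high") if randomized else LEVELS.index("low")

    # Each concern moves the rating down; it cannot go below "very low".
    steps_down = sum(concerns.get(domain, 0) for domain in DOMAINS)
    return LEVELS[max(0, start - steps_down)]


# Illustrative inputs only: nonrandomized studies with one serious concern.
print(rate_certainty(randomized=False, concerns={"indirectness": 1}))
```

On these assumptions, a body of nonrandomized studies with even one serious concern ends up at “very low,” however many of the individual studies were scored “moderate” on a study-level scale.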

Gender medicine research, and youth gender medicine research in particular, suffers from these problems. To give one example, inapplicability is a form of indirectness in which the subjects of a study are different from the patients to whom an intervention is being offered. The gold standard of research in youth gender medicine is the Dutch study. That study suffers from high risk of bias, but it is also inapplicable to the majority of minors now seeking “gender-affirming care” because it was done on patients with a different clinical presentation than the group responsible for the sudden and dramatic rise in gender dysphoria diagnoses and referrals: teen girls with no prepubertal history of gender issues and with high rates of psychiatric and/or neurocognitive challenges.

Turban’s claim that the systematic reviews on puberty blockers and cross-sex hormones found “moderate” quality evidence is therefore incorrect. The reviews found moderate-quality and a few high-quality studies, but they did not find moderate-quality evidence. In fact, the University of York authors of the systematic reviews did not even evaluate the quality of evidence using widely accepted and standardized tools such as Grading of Recommendations, Assessment, Development, and Evaluations (GRADE). Instead, they summarized their findings in narrative form. “There is a lack of high-quality research assessing puberty suppression in adolescents experiencing gender dysphoria/incongruence,” they wrote. “No conclusions can be drawn about the impact on gender dysphoria, mental and psychosocial health or cognitive development. Bone health and height may be compromised during treatment.”

Quality With Regard to What Outcomes? Turban’s second mistake is to suggest that the “moderate-quality evidence” was about “improvements in mental health.” A look at the chart included in the systematic review on puberty blockers, however, reveals that of the 25 moderate-quality studies, most appear in four columns: puberty suppression (17 studies), physical health (14), bone health (5), and side effects (3). (The counts overlap because most studies examine more than one domain.) Many of the studies do not examine mental-health outcomes.

It’s not possible for me to give a detailed account here of what each of the moderate-quality studies examined, but a few examples should be enough to show why Turban’s suggestion is misleading. One moderate-quality study included in the “puberty suppression” category tested whether histrelin implants (a puberty blocker) are still effective at disrupting the puberty-inducing mechanism of the pituitary gland after one year. Another moderate-quality study, in the “physical health” category, examined the effects on body composition (in terms of height and lean mass) of sudden withdrawal of sex hormones in late-pubertal adolescents. Neither study examined participants’ mental-health outcomes.

Lowering the Bar for “Gender-Affirming Care.” To assess the strength of various studies, the University of York systematic review authors used a scoring tool specifically designed for nonrandomized studies. Such studies already face a higher risk of bias, since their investigators do not randomly assign comparable participants to treatment and control groups. The field of youth gender medicine lacks even a single randomized controlled trial (RCT), the gold standard for testing causal claims about the safety and efficacy of medical interventions.

I asked Yuan Zhang, an assistant clinical professor of health research methods, evidence, and impact at McMaster University, home of EBM, for his impression of the Cass-linked systematic review’s methods. “With regard to the question of the effects of puberty blockers on mental health, even if the University of York team had done a quality of evidence scoring, it would not have been better than very low quality.” Zhang is referring to the lowest score on GRADE. “If you want to produce credible evidence of cause and effect, for instance in order to be able to say that puberty blockers are responsible for improvement in mental health, there is no alternative to a randomized controlled trial.”

Advocates of puberty blockers like Turban argue that conducting an RCT in the gender-medicine context would be unethical, as we already know that puberty blockers are “medically necessary” interventions and that withholding them would cause harm. Of course, this claim assumes the very thing that’s in dispute. Proponents also argue that conducting a double-blinded RCT would be impossible, as there is no way to hide from participants (and their physicians) whether puberty blockers or placebos were being administered. This second objection is more reasonable, but it’s possible to design a non-blinded RCT with active comparators. Non-puberty-suppressed participants can be given antidepressants or psychotherapy, for instance. A comparison group of this kind matters because the passage of time alone may have an effect on mental health (due to a phenomenon known as “regression to the mean”).

As James Cantor, a psychologist and author of important articles and expert reports on gender medicine, told me, “Even if one accepted, for argument’s sake, that RCTs couldn’t be done, it still wouldn’t justify barreling ahead as if they had been done and always showed unmitigated success.” The reason should be obvious: drugs and surgeries pose real and potentially serious risks to a person’s physical and mental health. Because in this case they are being given to adolescents who are physically healthy, the burden is on proponents of hormonal interventions to prove their safety and efficacy.

How do reviewers assess the quality of non-randomized studies, which inherently are more prone to bias? The most common tool is “Risk Of Bias In Non-randomized Studies - of Interventions” (ROBINS-I). It’s not clear why the authors of the Cass systematic reviews chose not to use this tool, but one possible reason is that ROBINS-I is very rigorous in assessing risk of bias in non-randomized research. Applying it to existing gender-medicine research would likely have resulted in all available studies being found to be at “serious” or “critical” risk of bias.

The NOS, which the Cass researchers used, has separate scoring scales for pre-post, cohort, and cross-sectional studies. Pre-post studies examine the effects of an intervention in a single cohort with no comparator group. Cohort studies follow a group of patients over a period of time but also lack adequate controls. Cross-sectional studies capture data at a single point in time, through methods such as surveys or medical-chart reviews.

The only high-quality study of puberty blockers included in the systematic review was a cross-sectional study from the Netherlands. A cross-sectional design is definitionally incapable of ascertaining causal relationships, so how could this study come out above other types of nonrandomized studies? The answer is that the NOS scores each type of study on its own scale. A “high-quality” rating for a cross-sectional study means that the study is high quality for a cross-sectional design, not high quality for nonrandomized research in general.

Turban’s misperceptions about quality in medical research lead to similarly misguided policy conclusions. He claimed in the Dartmouth debate, for example, that moderate-quality evidence “is not particularly unusual in medicine,” adding, “I can’t think of another example in medicine where you have that quality of evidence, and you ban the care. The report also doesn’t say to ban care.”

Turban is correct that this area of medicine has been singled out for special treatment, but not in the way he thinks. Indeed, Hilary Cass, author of the Cass Review, claims that pediatric gender medicine has been “exceptionalised”: too many clinicians in this field have “abandoned normal clinical approaches to holistic assessment” and instead deferred to their patients’ self-diagnoses and desire for medical intervention. No other area of medicine has been allowed to proceed so quickly, with so little evidence, on such vulnerable patients, and with so little follow-up.

Advocates like Turban point out that many medical treatments and protocols in pediatrics are still used despite low-quality evidence. This fact, they claim, shows that gatekeepers are prejudicially motivated to restrict gender transition. An influential Yale report from 2022, for example, cited the recommendation against giving children aspirin for fevers due to risk of developing Reye’s syndrome—a progressive and potentially fatal neurological disease—despite there being only low-quality evidence linking aspirin to Reye’s.

A rule of thumb in EBM is that strong recommendations require strong evidence. In some cases, however, low-quality evidence can justify strong recommendations. Such “discordant recommendations” may be appropriate when the alternative to treatment is death, or when alternative interventions can achieve the same effects with less risk. The aspirin case falls into the second category: the Yale team conveniently neglected to mention that kids can be given Tylenol, which isn’t linked to Reye’s, instead of aspirin.

When Turban says that moderate-quality evidence is “not particularly unusual” in medicine, he is thus misleading his audience on two counts. First, he falsely implies that the quality of evidence (rather than of studies) is moderate, and confuses NOS’s use of “moderate” with the use of this term in GRADE (where quality of evidence is at issue). Second, he suggests that puberty blockers fall under one of the exceptional scenarios in EBM where discordant recommendations are appropriate.

It’s noteworthy that this marks a shift in Turban’s public position, which has been that “the body of research indicates that these interventions result in favorable mental health outcomes.” In his expert witness reports, Turban has claimed that “Existing research shows gender-affirming medical treatments for adolescents with gender dysphoria are consistently linked to improved mental health.” Yet at Dartmouth, he appeared to make a different claim: the evidence is not strong, but it’s common practice in pediatrics to offer medical interventions based on uncertain evidence.

As for banning “care,” Turban is correct that the Cass Review does not recommend a blanket prohibition on puberty blockers. But if Cass’s recommendations were to be implemented in the U.S., most of the kids currently getting them would no longer be eligible, and those who would be eligible would be able to receive them only as part of research. Turban, like other gender clinicians, has conveniently but disingenuously latched on to age restriction laws (“bans”) as a way to avoid acknowledging this important implication.

Advocates of hormonal interventions frame the choice as one between only two alternatives: their own “affirmative” approach or total prohibition. They then use Europeans’ allowance for at least some instances of pubertal suppression as evidence that European countries have rejected the prohibitionist approach, and that, by implication, they agree with advocates’ “affirming” approach.

The only real disagreement between health-care authorities in places like England, Sweden, and Finland, and those in U.S. red states is whether these drugs should be allowed within research settings and administered in exceptional cases. England’s National Health Service has officially ended the routine use of puberty blockers for adolescents with gender dysphoria. Turban, by contrast, has seemed to agree that these drugs should be given out for free, on demand, and without parental consent.

At Dartmouth, Turban warned against “conflating very technical terms from the grading scale, like for medical evidence, with lay terminology saying it’s all low-quality evidence.” I agree. But Turban appears not to understand the technical terms. Perhaps someone should explain them to him in lay terminology.

Editor’s note: When this story was published, the wording in Zhang’s quote was rendered as “low quality.” It should have read “very low quality,” and has been corrected accordingly.

Leor Sapir is a fellow at the Manhattan Institute.

City Journal is a publication of the Manhattan Institute for Policy Research (MI), a leading free-market think tank.