In a piece for PLOS Medicine’s 15th Anniversary, Specialty Consulting Editor Alexander Tsai contextualizes a 2008 study on antidepressants and its impact on the medical and media conversation around mental health and medication.
In 2008, one of the most controversial studies on antidepressant medication treatment was published in PLOS Medicine by Irving Kirsch and colleagues. The central findings from that meta-analysis of 35 studies, which has by now been cited more than 2500 times, were that the efficacy of antidepressant medication treatment does not meet arbitrary thresholds of clinical significance, that it is conditioned on symptom severity, and that this phenomenon results primarily from non-response to placebo among those with severe symptoms. This meta-analysis was based on a dataset largely similar to one used by Kirsch and several of his colleagues in a 2002 publication in the journal Prevention and Treatment. But it was the PLOS Medicine publication that ignited a firestorm of media coverage (e.g., “Antidepressant drugs don’t work—official study”) and dueling commentaries and also kicked off a series of competing meta-analyses that continues to the present day.
The context of the Kirsch and colleagues (2008) study, and the work in the 1990s leading up to its publication, can be characterized by a burgeoning public and scientific interest in the diagnosis and treatment of mental illness. Dr. Allen Frances had begun convening the work groups for the 4th edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM), which would be more data-driven than any previous version of the DSM, coinciding with the World Health Organization’s work on revising the 10th edition of the International Classification of Diseases. Fluoxetine (marketed as Prozac) was the first of the selective serotonin reuptake inhibitors to be approved for the U.S. market, in 1987, and its annual sales peaked at $2.8 billion in 1998. Prozac also colonized American popular culture, propelled by books celebrating its mood-altering powers such as Peter Kramer’s landmark Listening to Prozac (1993) and Elizabeth Wurtzel’s splashy Prozac Nation (1994).
Following the publication of the study by Kirsch and colleagues (2008), both confirmatory and sparring meta-analyses have continued to pepper the literature over the past decade. Meta-analyses by Barbui and colleagues (CMAJ 2008) and Turner and colleagues (NEJM 2008), which were published in January shortly after the meta-analysis by Kirsch and colleagues was accepted for publication, estimated effect sizes of similar magnitude. Fournier and colleagues (JAMA 2010) conducted a mega-analysis (i.e., a meta-analysis of individual patient data) of 6 studies that confirmed the treatment-by-severity interaction. Subsequent meta-analyses by Jakobsen and colleagues (BMC Psych 2017) and by Cipriani and colleagues (Lancet 2018) also yielded consistent effect size estimates. Other meta-analyses of individual patient data by Gibbons and colleagues (Arch Gen Psych 2012), Rabinowitz and colleagues (Br J Psych 2016), Furukawa and colleagues (Acta Psych Scand 2018), and, most recently, Hieronymus and colleagues (Lancet Psych 2019), have estimated similar effect sizes but disconfirmed the treatment-by-severity interaction.
What is remarkable about this literature is the inconsistency of interpretation in the setting of some consistent findings:
| Study | N | Mean difference (HDRS points) | Effect size | Interaction by initial severity |
|-------|---|------------------------------|-------------|---------------------------------|
| Kirsch and colleagues (2008) | 35 studies | 1.8 | 0.32 | Yes |
| Barbui and colleagues (2008) | 40 studies | | 0.31 | |
| Turner and colleagues (2008) | 74 studies | | 0.31 | |
| Fournier and colleagues (2010) | 718 patients | | | Yes |
| Gibbons and colleagues (2012) | 5056 patients | 2.55 | | No |
| Rabinowitz and colleagues (2016) | 10737 patients | 2.05 | | No |
| Jakobsen and colleagues (2017) | 49 studies | 1.94 | 0.23 | |
| Furukawa and colleagues (2018) | 2464 patients | 1.60 | 0.20 | No |
| Cipriani and colleagues (2018) | 432 studies | 1.97 | 0.30 | |
| Henssler and colleagues (2018) | 91 studies | | 0.27 | |
| Hieronymus and colleagues (2019) | 8262 patients | | | No |
No study is perfect, and alternative analytic approaches suggested by robustness tests can likely be grounded in reasonable justifications, but the literature appears to have converged on the finding that antidepressant medication treatment improves mood, in the short term, by an average of about 2 to 3 points on the Hamilton Depression Rating Scale (HDRS). To contextualize this number, consider that the HDRS has a maximum score of 57; a score of 7 is typically used to define remission, and a score of 20 is typically used as the entry criterion for participation in randomized controlled trials and/or to define severe depression. For the typical private practice outpatient with a current major depressive episode, who has an HDRS score of 18.8, a 50% reduction in symptom severity, or 9.4 points, would be required to meet the definition of response (a construct that has been criticized for having a potentially unclear meaning). Two to three points on the HDRS represents about a third of a standard deviation across the range of scores typically encountered among private practice outpatients with current major depressive episodes, is approximately equal to the score of a typical “healthy control,” and is far below the minimal change that would be detectable by clinicians using the Clinical Global Impressions-Improvement scale.
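To make the arithmetic in this paragraph explicit, the following back-of-the-envelope sketch (illustrative only, not part of any of the analyses cited here) shows how the mean differences and standardized effect sizes in the table relate, assuming the standardized effect size is computed as a Cohen's d (mean difference divided by the pooled standard deviation):

```python
def cohens_d(mean_difference, pooled_sd):
    """Standardized mean difference: raw difference divided by pooled SD."""
    return mean_difference / pooled_sd

# Pooled SDs implied by the table's paired values (illustrative):
implied_sd_kirsch = 1.8 / 0.32     # about 5.6 HDRS points
implied_sd_furukawa = 1.60 / 0.20  # 8.0 HDRS points

# The response-threshold arithmetic from the text:
# a 50% reduction from the typical outpatient score of 18.8
response_threshold = 0.5 * 18.8    # 9.4 HDRS points

# A 2-3 point drug-placebo difference is roughly a quarter to a third
# of an SD on these implied scales:
d_low = cohens_d(2.0, implied_sd_furukawa)
d_high = cohens_d(3.0, implied_sd_kirsch)
```

Read against the table, the exercise shows why an average 2 to 3 point improvement and an effect size of roughly 0.2 to 0.3 are two descriptions of the same result.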
Pushback against the findings of Kirsch and colleagues (2008) has typically taken several forms. First, some have disputed the analytic choices made by Kirsch and colleagues (2008). (See, for example, Horder and colleagues [J Psychopharm 2010] and Fountoulakis & Möller [Int J Neuropsychopharm 2011].) However, none of the suggested tweaks fundamentally changes the estimated effect size: 2 to 3 points. Second, some have attempted to better contextualize the estimated effect size for antidepressant medication treatment by comparing it to those of other established treatments. Leucht and colleagues conducted a meta-meta-analysis (Br J Psych 2012, followed by BMC Med 2015) reviewing a wide range of medication classes and finding that effect sizes for psychiatric medications were comparable to those for medications used in other medical conditions (e.g., angiotensin-converting enzyme inhibitors to prevent mortality in congestive heart failure). Third, others have pointed out that the excessive attention given to Kirsch and colleagues (2008) parallels American psychiatry’s more general focus on psychopharmacological rather than psychotherapeutic treatment strategies, and that evidence-based psychotherapies yield benefits of similar magnitude. Fourth, the finding of a treatment-by-severity interaction has been questioned: most of the newer studies have disconfirmed the interaction even while estimating mean differences and effect sizes consistent with those of Kirsch and colleagues (2008). Fifth, the standard of clinical significance has been questioned. Turner and colleagues (NEJM 2008) estimated a nearly identical effect size, but, in a subsequent commentary on the study by Kirsch and colleagues (2008), Turner and Rosenthal (2008) cautioned against the uncritical acceptance of the arbitrary thresholds of clinical significance recommended by the UK’s National Institute for Health and Care Excellence (NICE).
Gibbons and colleagues (Arch Gen Psych 2012) also estimated a similar effect size but de-emphasized this finding, instead highlighting the 10 to 20 percentage-point differences in response and remission. (It should be noted that, where some see “enormous” differences in response rates thought to be of clinical significance [also defined arbitrarily], others see statistical artifact.)
Kirsch and colleagues (2008) have succeeded in stimulating a lively debate. Numerous questions remain for clinicians attempting to synthesize the evidence and make decisions for the benefit of their patients. Where are the long-term studies, and what is known of the harms? Two-thirds of those who use antidepressant medications in the U.S. have taken them for two years or more, and one-fourth have taken them for more than a decade; long-term use appears to be increasingly common. Yet antidepressant medication studies of the kind reviewed here routinely last no more than 8-12 weeks. Nearly all longer-term studies follow the “randomized discontinuation” design, which is well known to provide little in the way of valid data to inform clinical decision making. (This problem has been most thoroughly discussed in the context of bipolar maintenance treatment, by Tsai and colleagues and by Goodwin and colleagues, but has also been described in the context of depression maintenance treatment.) Precious few studies have adopted a parallel design for long-term follow-up: Deshauer and colleagues (2008) screened 2,693 abstracts, identified only six parallel randomized trials lasting 6-8 months, and found none lasting more than 1 year. A decade later, Henssler and colleagues (2018) identified 91 studies lasting 8 weeks and only 2 studies lasting 24 weeks; the estimated effect sizes were, again, consistent across time points and largely consistent with the others reviewed above. Just as little is known about the long-term benefits of antidepressant medication treatment, little is known about its potential long-term risks. Appeals to the infeasibility of conducting long-term studies are reasonable, but one would then expect enthusiasm about long-term treatment to be comparably circumspect.
Alexander C. Tsai is Associate Professor of Psychiatry at the Massachusetts General Hospital and Harvard Medical School, and is a Specialty Consulting Editor for PLOS Medicine.
Image Credit: Prylarer, Pixabay (CC0)