Week 47: Rumination #3: Fools' gold: the widely touted methodological "gold standard" is neither golden nor a standard

This week's post is an abbreviated version of a "rumination" from the 4th edition of Qualitative Research and Evaluation Methods by Michael Quinn Patton, published in mid-November 2014.

It argues for contextual appropriateness when choosing methods and research designs instead of using a hierarchy of evidence which favours a "gold standard" method.

The Wikipedia entry for Randomised Controlled Trials (RCTs), reflecting common usage, designates such designs as the “gold standard” for research. News reports of research findings routinely repeat and reinforce the “gold standard” designation for RCTs. Government agencies and scientific associations that review and rank studies for methodological quality acclaim RCTs as the gold standard.

The gold standard versus methodological appropriateness

A consensus has emerged in evaluation research that evaluators need to know and use a variety of methods in order to address the priority questions of particular stakeholders in specific situations. But researchers and evaluators get caught in contradictory assertions:

(a) select methods appropriate for a specific evaluation purpose and question, and use multiple methods—both quantitative and qualitative—to triangulate and increase the credibility and utility of findings, but

(b) one question is more important than others (the causal attribution question), and one method (RCTs) is superior to all other methods in answering that question.

Thus, we have a problem. The ideal of researchers and evaluators being situationally responsive, methodologically flexible, and sophisticated in using a variety of methods runs headlong into the conflicting ideal that experiments are the gold standard and all other methods are, by comparison, inferior. Who wants to conduct (or fund) a second-rate study if there is an agreed-on gold standard?

The rigidity of a single, fixed standard

The gold standard allusion derives from international finance, in which the rates of exchange among national currencies were fixed to the value of gold. Economic historians share a “remarkable degree of consensus” about the gold standard as the primary cause of the Great Depression.

The gold standard system collapsed in 1971 following the United States’ suspension of convertibility from dollars to gold. The system failed because of its rigidity. And not just the rigidity of the standard itself but also the rigid ideology of the people who believed in it: policymakers across Europe and North America clung to the gold standard despite the huge destruction it was causing. There was a clouded mind-set with a moral and epistemological tinge that kept them advocating the gold standard until political pressure emerging from the disaster became overwhelming.

Treating RCTs as the gold standard is no less rigid. Asserting a gold standard inevitably leads to demands for standardization and uniformity (Timmermans & Berg, 2003). Distinguished evaluation pioneer Eleanor Chelimsky (2007) has offered an illuminative analogy:

It is as if the Department of Defense were to choose a weapon system without regard for the kind of war being fought; the character, history, and technological advancement of the enemy; or the strategic and tactical givens of the military campaign. (p. 14)

The gold standard accolade means that funders and policymakers begin by asking, “How can we do an experimental design?” rather than asking, “Given the state of knowledge and the priority inquiry questions at this time, what is the appropriate design?” Here are examples of the consequences of this rigid mentality:

At an African evaluation conference, a program director came up to me in tears. She directed an empowerment program with women in 30 rural villages. The funder, an international agency, had just told her that to have the funding renewed, she would have to stop working in half the villages (selected randomly by the funder) in order to create a control group going forward. The agency was under pressure for not having enough “gold standard evaluations.” But, she explained, the villages and the women were networked together and were supporting each other. Even if they didn’t get funding, they would continue to support each other. That was the empowerment message. Cutting half of them off made no sense to her. Or to me.
At a World Bank conference on youth service learning, the director of a university program presented her program for an exercise in evaluation design. She explained that she carefully selected 40 students each year and matched them to villages that needed the kind of assistance the students could offer. Matching students and villages was key, she explained. A senior World Bank economist told her and the group to forget matching. He advised an RCT in which she would randomly assign students to villages and then create a control group of qualified students and villages that did nothing to serve as a counterfactual. He said, “That’s the only design we would pay any attention to here. You must have a counterfactual. Your case studies of students and villages are meaningless and useless.” The participants were afterward aghast that he had completely dismissed the heart of the intervention: matching students and villages.
I’ve encountered several organizations, domestic and international, that give bonuses to managers who commission RCTs for evaluation to enhance the organization’s image as a place that emphasizes rigor. The incentives to do experimental designs are substantial and effective. Whether they are appropriate or not is a different question.

Those experiences, multiplied 100 times, are what have generated this rumination.

Evidence-Based Medicine and RCTs

Medicine is often held up as the bastion of RCT research in its commitment to evidence-based medicine. But here again, gold-standard designation has a downside, as observed by the psychologist Gary Klein (2014):

Sure, scientific investigations have done us all a great service by weeding out ineffective remedies. For example, a recent placebo-controlled study found that arthroscopic surgery provided no greater benefit than sham surgery for patients with osteoarthritic knees. But we also are grateful for all the surgical advances of the past few decades (e.g., hip and knee replacements, cataract treatments) that were achieved without randomized controlled trials and placebo conditions. Controlled experiments are therefore not necessary for progress in new types of treatments and they are not sufficient for implementing treatments with individual patients who each have unique profiles.

Worse, reliance on EBM can impede scientific progress. If hospitals and insurance companies mandate EBM, backed up by the threat of lawsuits if adverse outcomes are accompanied by any departure from best practices, physicians will become reluctant to try alternative treatment strategies that have not yet been evaluated using randomized controlled trials. Scientific advancement can become stifled if front-line physicians, who blend medical expertise with respect for research, are prevented from exploration and are discouraged from making discoveries.

RCTs and Bias

RCTs aim to control bias, but implementation problems turn out to be widespread:

Even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials. (Chan, Hrobjartsson, Haahr, Gotzsche, & Altman, 2004, p. 2457)

The result is that “a great many published research findings are false” (Ioannidis, 2005).

Methodological Appropriateness as the Platinum Standard

It may be too much to hope that the gold standard designation will disappear from popular usage. So perhaps we need to up the ante and aim to supplant the gold standard with a new platinum standard: methodological pluralism and appropriateness. To do so, I offer the following seven-point action plan (and resources below):

1. Educate yourself about the strengths and weaknesses of RCTs.

2. Never use the "gold standard" designation yourself.

If it comes up, refer to the “so-called gold standard.”

3. When you encounter someone referring to RCTs as "the gold standard", don’t be shy.

Explain the negative consequences and even dangers of such a rigid pecking order of methods.

4. Understand and be able to articulate the case for methodological pluralism and appropriateness.

That means adapting designs to the existing state of knowledge, the available resources, the intended uses of the inquiry results, and other relevant particulars of the inquiry situation.

5. Promote the platinum standard as higher on the hierarchy of research excellence.

6. Don’t be argumentative and aggressive in challenging gold standard narrow-mindedness.

It’s more likely a matter of ignorance than intolerance. Be kind, sensitive, understanding, and compassionate, and say, “Oh, you haven’t heard. The old RCT gold standard has been supplanted by a new, more enlightened, Knowledge-Age platinum standard.” (Beam wisely.)

7. Repeat Steps 1 to 6 over and over again.

Resources

Approach

Randomised controlled trials: A description of the approach and discussion of when it is appropriate to choose this and how to implement it well, by Angela Ambroz and Marc Shotland from the Abdul Latif Jameel Poverty Action Lab (J-PAL).

Guides

Utilization-focused evaluation: This book focuses on utilization-focused evaluation and includes a discussion of 10 limitations of experimental designs (pp. 447-450).

What counts as good evidence?: This paper, written by Sandra Nutley, Alison Powell and Huw Davies for the Alliance for Useful Evidence, discusses the risks of using a hierarchy of evidence and suggests an alternative in which more complex matrix approaches for identifying evidence quality are more closely linked to the wider range of policy or practice questions being addressed.

Broadening the range of designs and methods for impact evaluations: This working paper, written by Elliot Stern, Nicoletta Stame, John Mayne, Kim Forss, Rick Davies and Barbara Befani for the UK Department for International Development (DFID), describes a range of alternatives to RCTs and outlines when they might be appropriate.

Overview

What scientific idea is ready for retirement: Large randomized controlled trials: This comment, written by Dean Ornish and published on the Edge.org blog What scientific idea is ready for retirement, argues that larger studies do not always equate to more rigorous or definitive results and that randomized control trials (RCTs) may in fact introduce their own biases.

Scientifically Based Evaluation Methods: This comment from the American Evaluation Association (AEA) argues that randomized control group trials (RCTs) are not the only studies capable of generating understandings of causality and that alternative and mixed methods are also rigorous and scientific.

The importance of a methodologically diverse approach to impact evaluation: This statement from the European Evaluation Society (EES) argues that randomised control trials are not necessarily the best way to ensure a rigorous or scientific impact evaluation (IE) for development and development aid. The paper contends that multi-method approaches to IE are more effective than any single method.

Other References

Bernard, H. R. (2013). Social research methods: Qualitative and quantitative approaches (2nd ed.). Thousand Oaks, CA: Sage.

Chan, A. W., Hrobjartsson, A., Haahr, M. T., Gotzsche, P. C., & Altman, D. G. (2004). Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. Journal of the American Medical Association, 291, 2457–2465.

Chelimsky, E. (2007). Factors influencing the choice of methods in federal evaluation practice. New Directions for Evaluation, 2007(113), 13–33.

Devereux, S., & Roelen, K. (2013). Evaluating outside the box: Mixing methods in analysing social protection programmes. London, England: Institute of Development Studies, Center for Social Protection.

Eichengreen, B., & Temin, P. (1997). The gold standard and the Great Depression (NBER Working Paper No. 6060). Cambridge, MA: National Bureau of Economic Research.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8). http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

Klein, G. (2014). Evidence-based medicine. http://edge.org/responses/what-scientific-idea-is-ready-for-retirement

(More resources and suggested reading can be found in the longer, published rumination.)