Week 47: Rumination #3: Fools' gold: the widely touted methodological "gold standard" is neither golden nor a standard

4th December 2014, by MQuinnP

This week's post is an abbreviated version of a "rumination" from the 4th edition of Qualitative Research and Evaluation Methods by Michael Quinn Patton, published in mid-November 2014. It argues for contextual appropriateness when choosing methods and research designs rather than a hierarchy of evidence that favours a single "gold standard" method.

The Wikipedia entry for Randomised Controlled Trials (RCTs), reflecting common usage, designates such designs as the “gold standard” for research. News reports of research findings routinely repeat and reinforce the “gold standard” designation for RCTs. Government agencies and scientific associations that review and rank studies for methodological quality acclaim RCTs as the gold standard. 

The Gold Standard Versus Methodological Appropriateness

A consensus has emerged in evaluation research that evaluators need to know and use a variety of methods in order to address the priority questions of particular stakeholders in specific situations. But researchers and evaluators get caught in contradictory assertions: 

(a) select methods appropriate for a specific evaluation purpose and question, and use multiple methods—both quantitative and qualitative—to triangulate and increase the credibility and utility of findings, but 

(b) treat one question (causal attribution) as more important than all others, and one method (RCTs) as superior to all other methods in answering it.

Thus, we have a problem. The ideal of researchers and evaluators being situationally responsive, methodologically flexible, and sophisticated in using a variety of methods runs headlong into the conflicting ideal that experiments are the gold standard and all other methods are, by comparison, inferior. Who wants to conduct (or fund) a second-rate study if there is an agreed-on gold standard?

The Rigidity of a Single, Fixed Standard

The gold standard allusion derives from international finance, in which the rates of exchange among national currencies were fixed to the value of gold. Economic historians share a “remarkable degree of consensus” about the gold standard as the primary cause of the Great Depression.

The gold standard system collapsed in 1971 following the United States’ suspension of convertibility from dollars to gold. The system failed because of its rigidity. And not just the rigidity of the standard itself but also the rigid ideology of the people who believed in it: policymakers across Europe and North America clung to the gold standard despite the huge destruction it was causing. There was a clouded mind-set with a moral and epistemological tinge that kept them advocating the gold standard until political pressure emerging from the disaster became overwhelming.

Treating RCTs as the gold standard is no less rigid. Asserting a gold standard inevitably leads to demands for standardization and uniformity (Timmermans & Berg, 2003). Distinguished evaluation pioneer Eleanor Chelimsky (2007) has offered an illuminative analogy:

It is as if the Department of Defense were to choose a weapon system without regard for the kind of war being fought; the character, history, and technological advancement of the enemy; or the strategic and tactical givens of the military campaign. (p. 14)

The gold standard accolade means that funders and policymakers begin by asking, “How can we do an experimental design?” rather than asking, “Given the state of knowledge and the priority inquiry questions at this time, what is the appropriate design?” Here are examples of the consequences of this rigid mentality:

  • At an African evaluation conference, a program director came up to me in tears. She directed an empowerment program with women in 30 rural villages. The funder, an international agency, had just told her that to have the funding renewed, she would have to stop working in half the villages (selected randomly by the funder) in order to create a control group going forward. The agency was under pressure for not having enough “gold standard evaluations.” But, she explained, the villages and the women were networked together and were supporting each other. Even if they didn’t get funding, they would continue to support each other. That was the empowerment message. Cutting half of them off made no sense to her. Or to me.
  • At a World Bank conference on youth service learning, the director of a university service-learning programme took part in an exercise in evaluation design. She explained that she carefully selected 40 students each year and matched them to villages that needed the kind of assistance the students could offer. Matching students and villages was key, she explained. A senior World Bank economist told her and the group to forget matching. He advised an RCT in which she would randomly assign students to villages and then create a control group of qualified students and villages that did nothing, to serve as a counterfactual. He said, “That’s the only design we would pay any attention to here. You must have a counterfactual. Your case studies of students and villages are meaningless and useless.” The participants were afterward aghast that he had completely dismissed the heart of the intervention: matching students and villages.
  • I’ve encountered several organizations, domestic and international, that give bonuses to managers who commission RCTs for evaluation to enhance the organization’s image as a place that emphasizes rigor. The incentives to do experimental designs are substantial and effective. Whether they are appropriate or not is a different question. 

Those experiences, multiplied 100 times, are what have generated this rumination.

Evidence-Based Medicine and RCTs

Medicine is often held up as the bastion of RCT research in its commitment to evidence-based medicine. But here again, gold standard designation has a downside, as observed by the psychologist Gary Klein (2014):

Sure, scientific investigations have done us all a great service by weeding out ineffective remedies. For example, a recent placebo-controlled study found that arthroscopic surgery provided no greater benefit than sham surgery for patients with osteoarthritic knees. But we also are grateful for all the surgical advances of the past few decades (e.g., hip and knee replacements, cataract treatments) that were achieved without randomized controlled trials and placebo conditions. Controlled experiments are therefore not necessary for progress in new types of treatments and they are not sufficient for implementing treatments with individual patients who each have unique profiles.

Worse, reliance on EBM can impede scientific progress. If hospitals and insurance companies mandate EBM, backed up by the threat of lawsuits if adverse outcomes are accompanied by any departure from best practices, physicians will become reluctant to try alternative treatment strategies that have not yet been evaluated using randomized controlled trials. Scientific advancement can become stifled if front-line physicians, who blend medical expertise with respect for research, are prevented from exploration and are discouraged from making discoveries.

RCTs and Bias

RCTs aim to control bias, but implementation problems turn out to be widespread:

Even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials. (Chan, Hrobjartsson, Haahr, Gotzsche, & Altman, 2004, p. 2457)

The result is that “a great many published research findings are false” (Ioannidis, 2005).
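The scale of the selective-reporting problem is easy to illustrate with a short simulation (a hedged sketch added here, not part of the original rumination): suppose a trial with no true effect measures ten independent outcomes and reports a "significant" finding whenever any outcome clears p < 0.05. Under the null hypothesis each p-value is uniform on (0, 1), so the chance of at least one spurious hit is 1 − 0.95¹⁰ ≈ 0.40.

```python
import random

random.seed(1)

def trial_reports_positive(n_outcomes=10, alpha=0.05):
    # Under the null hypothesis, every outcome's p-value is uniform on (0, 1).
    p_values = [random.random() for _ in range(n_outcomes)]
    # Selective reporting: the trial is published as "positive" if
    # any single outcome clears the significance threshold.
    return min(p_values) < alpha

n_trials = 10_000
false_positives = sum(trial_reports_positive() for _ in range(n_trials))
print(f"Share of null trials reporting a 'significant' result: "
      f"{false_positives / n_trials:.2f}")  # theory: 1 - 0.95**10 = 0.40
```

Randomization controls confounding, but it does nothing to prevent this kind of post hoc outcome selection, which is the point of the Chan et al. (2004) finding quoted above.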

Methodological Appropriateness as the Platinum Standard

It may be too much to hope that the gold standard designation will disappear from popular usage. So perhaps we need to up the ante and aim to supplant the gold standard with a new platinum standard: methodological pluralism and appropriateness. To do so, I offer the following seven-point action plan (and resources below):

1. Educate yourself about the strengths and weaknesses of RCTs.

2. Never use the "gold standard" designation yourself. If it comes up, refer to the “so-called gold standard.”

3. When you encounter someone referring to RCTs as "the gold standard", don’t be shy. Explain the negative consequences and even dangers of such a rigid pecking order of methods.

4. Understand and be able to articulate the case for methodological pluralism and appropriateness, to wit, adapting designs to the existing state of knowledge, the available resources, the intended uses of the inquiry results, and other relevant particulars of the inquiry situation. 

5. Promote the platinum standard as higher on the hierarchy of research excellence. 

6. Don’t be argumentative and aggressive in challenging gold standard narrow-mindedness. It’s more likely a matter of ignorance than intolerance. Be kind, sensitive, understanding, and compassionate, and say, “Oh, you haven’t heard. The old RCT gold standard has been supplanted by a new, more enlightened, Knowledge-Age platinum standard.” (Beam wisely.)

7. Repeat Steps 1 to 6 over and over again.


Evaluation Approach

Randomised Controlled Trials - a description of the approach and discussion of when it is appropriate to choose this and how to implement it well, by Angela Ambroz and Marc Shotland from the Abdul Latif Jameel Poverty Action Lab (J-PAL).

Read more



Utilization-Focused Evaluation 

The focus of this book is on Utilization-Focused Evaluation and includes a discussion of 10 limitations of experimental designs (pp. 447–450). Read more


What counts as good evidence? 

This paper, written by Sandra Nutley, Alison Powell and Huw Davies for the Alliance for Useful Evidence, discusses the risks of using a hierarchy of evidence and suggests an alternative: matrix approaches that link judgements of evidence quality more closely to the wider range of policy and practice questions being addressed. Read more


Broadening the range of designs and methods for impact evaluations

This working paper, written by Elliot Stern, Nicoletta Stame, John Mayne, Kim Forss, Rick Davies, and Barbara Befani for the UK Department for International Development (DFID), describes a range of alternatives to RCTs and outlines when they might be appropriate. Read more



What scientific idea is ready for retirement: Large randomized controlled trials

This comment, written by Dean Ornish and published on the Edge.org blog What scientific idea is ready for retirement, argues that larger studies do not always equate to more rigorous or definitive results and that randomised controlled trials (RCTs) may in fact introduce their own biases. Read more


Scientifically Based Evaluation Methods

This comment from the American Evaluation Association (AEA) argues that randomised controlled trials (RCTs) are not the only studies capable of generating understandings of causality and that alternative and mixed methods are also rigorous and scientific. Read more


The importance of a methodologically diverse approach to impact evaluation

This statement from the European Evaluation Society (EES) argues that randomised controlled trials are not necessarily the best way to ensure a rigorous or scientific impact evaluation (IE) for development and development aid. The paper contends that multi-method approaches to IE are more effective than any single method. Read more


Other References

Bernard, H. R. (2013). Social research methods: Qualitative and quantitative approaches (2nd ed.). Thousand Oaks, CA: Sage.

Chan, A. W., Hrobjartsson, A., Haahr, M. T., Gotzsche, P. C., & Altman, D. G. (2004). Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. Journal of the American Medical Association, 291, 2457–2465.

Chelimsky, E. (2007). Factors influencing the choice of methods in federal evaluation practice. New Directions for Evaluation, 2007(113), 13–33.

Devereux, S., & Roelen, K. (2013). Evaluating outside the box: Mixing methods in analysing social protection programmes. London, England: Institute of Development Studies, Center for Social Protection.

Eichengreen, B., & Temin, P. (1997). The gold standard and the Great Depression (NBER Working Paper No. 6060). Cambridge, MA: National Bureau of Economic Research.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8). http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

Klein, G. (2014). Evidence-based medicine. http://edge.org/responses/what-scientific-idea-is-ready-for-retirement

(More resources and suggested reading can be found in the longer, published rumination).


Image: "Pyrite 2" by Barta IV

A special thanks to this page's contributor: Founder and Director, Utilization-Focused Evaluation, United States of America.


Michael O'Donnell

Great post, and a nice deconstruction of the Gold Standard analogy! At Bond, we developed the NGO Evidence Principles to promote a pluralist approach, while also trying to set some common quality standards that can apply across methods. We're in the process of refining these based on feedback received to date (particularly around our "Appropriateness" indicators), and are always keen to hear feedback from others.

Michael Quinn Patton

Thanks for posting the Evidence Principles. Very useful and insightful. The new edition of Qualitative Research and Evaluation Methods has a major discussion of principles-focused evaluation (see the book's index). Thus, the challenge, having articulated principles, is to follow them and evaluate their use.


Simon Hearn

Thanks for sharing these ruminations. I like the 7 point action plan, although I would suggest a small amendment to point 1 in the action plan: Educate yourself about the strengths and weaknesses of all methods you are considering to use.

I've met a couple of people recently who seemed to make a lot of sense to me when talking about RCTs. The first explained the principle of 'equipoise' which I found quite helpful in deciding whether randomisation was appropriate and ethical. I understood equipoise to mean that there is real debate and no decisive evidence about whether the intervention works or not. If there is no equipoise then randomisation is unethical because either you are administering an intervention that you know does not work, or you are withholding an intervention that you know does work.

The second person qualified the usual 'gold-standard' sentiment by adding that RCTs are the gold standard for maximising internal validity of an effect estimate. I'm not sure how you would react to that assertion?

Just to add to the links above, readers may also enjoy this blog from cartoonist evaluator Chris Lysy on the same topic: http://freshspectrum.com/6-rct-randomista/

Michael Quinn Patton

Simon, thanks for sharing your reaction. I want to be clear that I am not anti-RCT. I have conducted RCTs. I am against designating any method the "gold standard." This goes for the assertion that RCTs are the gold standard for internal validity. A well-designed, in-depth, detailed case study can have as great or greater internal validity within a particular context, especially for capturing and understanding contextual factors. To ignore such factors, as the focus on internal validity in RCTs tends to do, risks making internal validity findings distortions of reality. Context is critical for internal and external validity.

RCTs offer one narrow way of approaching internal validity. A mixed-methods, triangulated approach is another way which, in my judgement, is stronger.
