I'm doing an impact evaluation: What evidence do I need? (#AES17 presentation slides)

Are quantitative or qualitative methods better for undertaking impact evaluations? What about true experiments? Is contribution analysis the new 'state of the art' in impact evaluation or should I just do a survey and use statistical methods to create comparison groups?

Planning an impact evaluation occurs within the constraints of a specific context. Since method choices must always be context specific, debates in the professional literature about impact methods can at best provide only partial guidance to evaluation practitioners. The way to break out of this methods impasse is to focus on the evidentiary requirements for assessing causal impacts.

Session details:

This session presented a brief summary of the literature on the philosophy and principles of causal analysis, and related these to some common evaluation models. A framework for applying three key evidentiary criteria was examined, and participants were guided through examples of how to apply these criteria in typical everyday situations.

In this innovative skills-building session participants:
  • Were exposed to a new way of thinking about impact evaluation
  • Were made familiar with the evidentiary requirements for undertaking impact evaluations
  • Gained practical experience in the application of evidentiary criteria for causal analysis
  • Developed their skills in critiquing impact evaluation reports.
The session involved participants in practical exercises designed to illustrate the application of key ideas and concepts. Throughout the session the presenter encouraged participants to transfer their understanding to their own contexts, with an emphasis on practical applications. Participants were provided with a range of resource materials to support their evaluation practice back in the workplace.


Scott Bayley, Principal Specialist, Performance Management and Results, DFAT.

Scott Bayley is the Principal Specialist Performance Management and Results in DFAT’s international aid program. He is responsible for leading organisational change to achieve the continuous improvement of performance management practices and enhance development outcomes. 

Ruth Nicholls, Adviser, Department of the Prime Minister and Cabinet.

Ruth works in evaluation policy and advice for the Indigenous Affairs Group of the Department of the Prime Minister and Cabinet.

Annotated reference list

Scott Bayley has also supplied an annotated reference list for this presentation:

  • Agodini, 2004, ‘Are Experiments the Only Option? A Look at Dropout Prevention Programs,’ The Review of Economics and Statistics, Vol. 86, No. 1, Pages 180-194.   (the paper argues that unobserved factors often exert powerful influences on outcomes and these factors are often difficult to control for using statistical methods).

  • Anderson, 2010, Proven Programs are the exception, not the rule, blog post: http://blog.givewell.org/2008/12/18/guest-post-proven-programs-are-the-exception-not-the-rule/   (The author argues that examples of proven effectiveness are rare. Their scarcity stems from two main factors: 1) the vast majority of social programs and services have not yet been rigorously evaluated, and 2) of those that have been rigorously evaluated, most, including those backed by expert opinion and less-rigorous studies, turn out to produce small or no effects, and, in some cases negative effects).

  • Asher, 1983, Causal Modeling, Sage.   (an introduction to causal/statistical modeling)

  • Asian Development Bank, 2006, Impact Evaluation: Methodological and Operational Issues. (a brief introduction, interesting discussion of common objections to impact studies)

  • Barlow & Hersen, 1989, Single Case Experimental Designs, Pergamon.   (an under-utilized approach in my opinion)

  • Bamberger, et al 2006, RealWorld Evaluation, Sage.   (an excellent overview of how to undertake evaluations of development programs while facing various types of constraints, also includes a discussion of the most commonly used designs for evaluating the impact of development programs)

  • Becker, 2000, Discussion Notes: Causality, http://web.uccs.edu/lbecker/Psy590/cause.htm   (a brief summary of different philosophical perspectives on causality)

  • Block, 1999, Flawless consulting, Pfeiffer.   (highly recommended, excellent chapter on working with resistant clients)

  • Bloom, Michalopoulos, Hill, & Lei, 2002, Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? (the authors say no) Available free on the net at: http://aspe.hhs.gov/pic/reports/acf/7541.pdf

  • Boruch, 2005, Randomized Experiments for planning and evaluation: A practical guide, Sage.   (a good introduction to the topic)

  • Brady, 2002, Models of Causal Inference: Going Beyond the Neyman-Rubin-Holland Theory, Paper Presented at the Annual Meetings of the Political Methodology Group, University of Washington, Seattle, Washington.   (paper reviews four of the more common theories of causality)

  • Brinkerhoff, 1991, Improving Development Program Performance: Guidelines for Managers, Lynne Rienner.   (includes a discussion of the most common causes of performance problems in development programs, recommended)

  • Campbell & Stanley, 1963, Experimental and Quasi-experimental Designs for Research, Rand McNally.   (the all-time classic text, discusses the strengths and weaknesses of various research designs for assessing program impacts, highly recommended)

  • Carter, Klein and Day, 1992, How organisations measure success: the use of performance indicators in government, Routledge.   (offers a useful typology of different types of performance indicators and how they can be used/misused)

  • Coalition for Evidence-Based Policy, 2009, Which comparison-group (quasi-experimental) study designs are most likely to produce valid estimates of a program’s impact?   (excellent brief summary of the research evidence). Available free on the net at: http://coalition4evidence.org/wp-content/uploads/2014/01/Validity-of-comparison-group-designs-updated-January-2014.pdf

  • Cook, 2000, ‘The false choice between theory-based evaluation and experimentation’, in Rogers et al (eds) Program Theory in Evaluation: Challenges and Opportunities, Jossey-Bass.   (excellent discussion of the main limitations of theory based methods for impact assessment)

  • Cook & Campbell, 1979, Quasi-experimentation, Houghton Mifflin.   (excellent, contains a useful review of different theories of causality and how to test for causal relationships as well as the application of quasi-experiments for impact evaluations)

  • Cook, Shadish & Wong, 2008, 'Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons', Journal of Policy Analysis and Management, 27, 4, 724-750.   (this reference recommends regression-discontinuity designs, matching geographically local groups on pre-treatment outcome measures, and modeling a known selection process)

  • Cracknell, B. E. 2000. Evaluating Development Aid: Issues, Problems and Solutions. Sage Publications, New Delhi.   (interesting discussions of evaluating for learning vs for accountability plus the politics of evaluation)

  • Cracknell, B. E. 2001, ‘Knowing is all: Or is it? Some reflections on why the acquisition of knowledge focusing particularly on evaluation activities, does not always lead to action’. Public Administration and Development, 31, 371-379.

  • Davidson, E. J. 2004, Evaluation methodology basics: The nuts and bolts of sound evaluation, Sage.   (suggests 8 techniques for causal inference)

  • Davidson, E.J. 2000. 'Ascertaining causation in theory-based evaluation'. In P.J. Rogers,  T.A. Hacsi, A. Petrosino, and T.A. Huebner (eds.), "Program theory in evaluation: challenges and opportunities". New Directions in Evaluation, Number 87:17-26, San Francisco, CA.   (provides an overview of various methods)

  • Davis, 1985, The Logic of Causal Order, Sage.   (worth a quick read)

  • Deaton, A. 2010, What Can We Learn From Randomized Control Trials? Chicago: Milton Friedman Institute http://mfi.uchicago.edu/events/20100225_randomizedtrials/index.shtml

  • Dept of Finance, 1987, Evaluating Government Programs, Australian Government Publishing Service.   (introductory, includes a useful table comparing different types of research designs)

  • Dept of Finance and Administration, 2006, Handbook of Cost-Benefit Analysis, Australian Government Publishing Service.   (an easy to read introduction). This book is available free on the net at: http://www.finance.gov.au/FinFramework/fc_2006_01.html

  • Donaldson, Christie & Mark, 2009, What Counts as Credible Evidence in Applied Research and Evaluation Practice?, Sage.   (offers a range of perspectives)

  • Epstein & Klerman, 2012, “When is a Program Ready for Rigorous Impact Evaluation?” Evaluation Review, vol 36, pp. 373-399.   (when it has a plausible theory of change in place and its implementation has been assessed as sound)

  • European Evaluation Society, 2007, Statement: The Importance of a Methodologically Diverse Approach to Impact Evaluation.

  • Freedman and Collier, 2009, Statistical Models and Causal Inference: A Dialogue with the Social Sciences, Cambridge University Press    (argues that statistical techniques are seldom an adequate substitute for substantive knowledge of the topic, having a good research design, relevant data and undertaking empirical testing in diverse settings).

  • Gates & Dyson, 2017, “Implications of the Changing Conversation About Causality for Evaluators”, American Journal of Evaluation, vol. 38, no. 1   (an introductory overview of issues for consideration plus six guidelines for evaluators seeking to make causal claims)

  • Gertler, P. et al 2010, Impact Evaluation in Practice, World Bank. It is available free on the net at: http://documents.worldbank.org/curated/en/2011/01/13871146/impact-evalua...

  • Glazerman, Levy & Myers, 2002, Nonexperimental Replications of Social Experiments: A Systematic Review, Mathematica Policy Research Inc.   (this research paper concludes that more often than not, statistical models do a poor job of estimating program impacts, highly recommended). This report is available free on the net at: http://www.mathematica-mpr.com/publications/PDFs/nonexperimentalreps.pdf

  • Glennerster & Takavarasha, 2013, Running Randomized Evaluations: A Practical Guide, Princeton University Press. (a comprehensive and handy guide to running randomized impact evaluations of social programs)

  • Gilovich, 1991, How we know what isn’t so: The fallibility of human reason in everyday life, Free Press, New York. (demonstrates how cognitive, social and motivational processes distort our thoughts, beliefs, judgments and decisions)

  • gsocialchange, 2017, How do you know whether your intervention had an effect?   (a website with a range of resources) https://sites.google.com/site/gsocialchange/cause

  • Guba and Lincoln, 1989, Fourth Generation Evaluation, Sage.   (the authors argue that 'cause and effect' do not exist except by imputation, a constructivist perspective)

  • Hatry, 1986, Practical Program Evaluation for State and Local Governments, Urban Institute Press.   (a good introduction, includes a review of the circumstances in which it is feasible to use experimental designs)

  • Holland, P. 1986. 'Statistics and Causal Inference'. Journal of the American Statistical Association. Vol. 81 pp. 945-960.

  • Judd & Kenny, 1981, Estimating the Effects of Social Interventions, Cambridge.   (heavy emphasis on statistical applications, for the enthusiast)

  • Kahneman, D. 2013, Thinking, Fast and Slow; Farrar, Straus and Giroux.   (discusses why cognitive biases are common across all aspects of our lives, recommended)

  • Kenny, 2004, Correlation and Causality.   (a very technical book about analysing causal impacts using statistical models). This book is available free at: https://web.archive.org/web/20060928063320/http://davidakenny.net/doc/cc_v1.pdf

  • LaLonde, 1986, “Evaluating the Econometric Evaluations of Training Programs with Experimental Data”, American Economic Review, vol 76, pp. 604-620.   (a classic article explaining why the econometric analysis of correlational research designs usually fails to achieve accurate estimates of program impact)

  • Langbein, 1980, Discovering Whether Programs Work, Goodyear.   (good but technical)

  • Larson, 1980, Why government programs fail, Praeger.   (The reasons: a faulty theory of change/strategy; poor implementation; a changing external environment; or the evaluation itself is faulty)

  • Light, P. 2014, A Cascade of Failures: Why Government Fails and How to Stop It, Center for Effective Public Management at Brookings.   (an insightful analysis of government failures in the USA. The reasons: poor policy; inadequate resources; culture; structure; lack of leadership)

  • Mark & Reichardt, 2004, ‘Quasi-experimental and correlational designs: Methods for the real world when random assignment isn’t feasible’. In Sansone, Morf and Panter, (eds), Handbook of methods in social psychology, (pp. 265-286), Sage.   (useful introductory overview, recommended)

  • Mayne, J. 2008, Contribution analysis: An approach to exploring cause and effect, ILAC Brief 16.   (an increasingly popular approach based on program theory, with which it shares similar strengths and weaknesses)

  • McMillan, 2007, Randomized Field Trials and Internal Validity: Not So Fast My Friend,   (good overview of the limitations). Available free at: https://web.archive.org/web/20160508125451/http://pareonline.net/pdf/v12n15.pdf

  • Michalopoulos, 2004, ‘Can Propensity-Score Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs?’ The Review of Economics and Statistics, Vol. 86, No. 1, Pages 156-179.   (the answer: occasionally, but not consistently)

  • Miles and Huberman, 1994, Qualitative data analysis, Sage.   (contains examples of undertaking causal analysis with qualitative data)

  • Mohr, 1995, Impact Analysis for Program Evaluation, Sage.   (an advanced discussion of research designs and impact analysis)

  • Mohr, 1999, 'The Qualitative Method of Impact Analysis', American Journal of Evaluation, 20, 1, pp. 69-84.   (useful introduction to the topic)

  • Murnane and Willett, 2010, Methods Matter: Improving Causal Inference in Educational and Social Science Research, Oxford University Press.   (not overly technical and includes many examples)

  • National Institute for Health Research, 2016, Assessing claims about treatment effects: Key concepts that people need to understand.   (a useful summary). Available on the net at: http://www.testingtreatments.org/key-concepts-for-assessing-claims-about-treatment-effects/?nabm=1

  • Network of Networks on Impact Evaluation, 2009, Impact Evaluations and Development – NONIE Guidance on Impact Evaluations.   (a review of the methods commonly used by development agencies). Available free on the net at: http://www.worldbank.org/ieg/nonie/guidance.html

  • Nisbett and Ross, 1985, Human Inference: Strategies and Shortcomings of Social Judgments, Prentice-Hall.   (explains why we all struggle to accurately perceive causal relationships, basically people are terrible at this due to various ‘involuntary’ cognitive biases)

  • Norad, 2008, The Challenge of Assessing Aid Impact: A Review of Norwegian Evaluation Practice,   (provides a number of examples of problematic impact evaluations along with various lessons for better practice). Available free on the net at: http://www.norad.no/default.asp?V_ITEM_ID=12314

  • Nutt, 2002, Why Decisions Fail, Berrett-Koehler Publishers.   (interesting review of why strategic decisions often fail, e.g. lack of consultation, poor analysis, faulty implementation)

  • Olsen & Orr, 2016, “On The ‘Where’ of Social Experiments: Selecting More Representative Samples to Inform Policy”, New Directions in Evaluation, no 152.   (useful suggestions for improving the external validity of experiments through better sampling)

  • Patton, 1987, How to use qualitative methods in evaluation, Sage.   (excellent discussion of combining qualitative and quantitative methods)

  • Patton, 1990, Qualitative Evaluation and Research Methods, Sage.   (good all round reference, helpful description of different types of purposeful sampling)

  • Pawson, R. 2002, 'Evidence based policy: The promise of realist synthesis'. Evaluation, 8(3), 340-358.

  • Pearl, 2000, Causality: Models, Reasoning, and Inference, Cambridge University Press.   (advanced technical examination of causal/statistical modeling)

  • Peck, (ed) 2016, “Social Experiments in Practice: The What, Why, When, Where and How of Experimental Design and Analysis”, New Directions in Evaluation, no 152.   (a good overview of various issues from an econometric perspective)

  • Peck, L. 2017. When is Randomization Right for Evaluation? (offers principles for when experiments are appropriate). See: http://abtassociates.com/Perspectives/March/When-Is-Randomization-Right-...

  • Perrin, Burt. 1998. 'Effective Use and Misuse of Performance Measurement'. American Journal of Evaluation. Vol. 19 (1):367-379.   (highly recommended)

  • Perrin, Burt. 1999. 'Performance Measurement: Does the Reality Match the Rhetoric? A Rejoinder to Bernstein and Winston'. American Journal of Evaluation. Vol. 20(1).

  • Posavac & Carey, 2002, Program Evaluation: Methods and Case Studies, Prentice Hall.   (good all round text, includes a summary of the types of evaluation questions that can be answered by particular types of research designs)

  • Pritchett & Sandefur 2013, Context Matters for Size: Why External Validity Claims and Development Practice Don’t Mix, Working paper 336, Center for Global Development.   (this paper argues that impact evaluation findings are context dependent and hence we need to be very careful when seeking to generalize/apply findings from one context to another, even when using RCTs)

  • Ramalingam B. 2011, Learning how to learn: eight lessons for impact evaluations that make a difference, ODI, London.

  • Reichardt, C. 2000, ‘A typology of strategies for ruling out threats to validity’. In Bickman (ed) Research Design: Donald Campbell's’ legacy, Sage.   (for the enthusiast, very insightful)

  • Reynolds & West 1987, 'A multiplist strategy for strengthening nonequivalent control group designs', Evaluation Review, 11, 6, 691-714.   (an excellent example of how to fix up a weak research design by adding additional features thereby improving your overall assessment of the program's impact, highly recommended)

  • Rogers et al 2000, Program Theory in Evaluation: Challenges and Opportunities, New Directions for Evaluation, No. 87, Jossey-Bass.   (a series of papers on the strengths and weaknesses of using program theory to assist with causal analysis)

  • Roodman, 2008, Through the Looking Glass, and What OLS Found There: On Growth, Foreign Aid, and Reverse Causality, Working Paper 137, Center for Global Development.   (discussion of assessing the impact of foreign aid)

  • Rossi, Lipsey & Freeman 2003, Evaluation – A Systematic Approach, Sage.   (recommended, includes an excellent discussion of different types of research designs and when to use each of them)

  • Rothman & Greenland, 2005, 'Causation and Causal Inference in Epidemiology', American Journal of Public Health, Vol 95, No. S1.

  • Rubin, 2008, “For Objective Causal Inference, Design Trumps Analysis”, Annals of Applied Statistics, vol 2, pp. 808-840.   (the title says it all)

  • Scriven, M. 1976, ‘Maximizing the power of causal investigations: The modus operandi method’. In Gene V. Glass (ed.) Evaluation studies review annual, Volume 1, 101-118, Beverly Hills, CA: Sage Publications.

  • Shadish, Clark, & Steiner, 2008, ‘Can nonrandomized experiments yield accurate results? A randomized experiment comparing random and nonrandom assignments’, Journal of the American Statistical Association, 103(484), pp. 1334-1344.   (yes, provided that all key variables are observed and we have good covariates to facilitate adjustment)

  • Shadish, Cook and Campbell, 2002, Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton Mifflin.   (advanced classic text)

  • Shadish, Cook and Leviton, 1991, Foundations of Program Evaluation, Sage.   (the final chapter contains an excellent summary of evaluation theory in relation to program design, evaluation practice, and theory of use)

  • Spector, 1981, Research Designs, Sage.   (a basic introduction)

  • Stame, N. 2010, ‘What Doesn’t Work? Three Failures, Many Answers’, Evaluation, 16, 4, 371-387.   (useful review of debates on impact evaluation methodology)

  • Stern, E. 2012, Broadening the Range of Designs and Methods for Impact Evaluations, DFID   (useful discussion of key issues). Available on the net at: http://www.oecd.org/dataoecd/0/16/50399683.pdf

  • Stern E. 2015, Impact Evaluation: A Guide for Commissioners and Managers, Bond   (a useful non technical overview). Available on the net at: http://www.bond.org.uk/data/files/Impact_Evaluation_Guide_0515.pdf

  • Treasury Board of Canada, no date, Program Evaluation Methods: Measurement and Attribution of Program Results,   (useful overview).  It is available free on the net at:  http://www.tbs-sct.gc.ca/eval/pubs/meth/pem-mep_e.pdf

  • Trochim, 1984, Research designs for program evaluation: The regression discontinuity approach, Sage.   (excellent method for evaluating impacts where entry into the program depends upon meeting a numerical eligibility criterion, e.g. income less than X, academic grades more than Y)

  • Trochim, 1989. 'Outcome Pattern Matching and Program Theory'. Evaluation and Program Planning. Vol. 12:355-366.

  • United Nations, 2013, Impact Evaluations in UN Agency Evaluation Systems - Guidance on Selection Planning and Management. Available on the web at:  http://www.uneval.org/document/download/1880

  • Weyrauch, V. and Langou, G. D. 2011, Sound Expectations: From Impact Evaluations to Policy Change, 3ie Working Paper 12. London: 3iE It is available free on the net at: http://www.3ieimpact.org/3ie_working_papers.html.

  • World Bank, 2011, Impact Evaluation in Practice. It is available free on the net at: https://openknowledge.worldbank.org/bitstream/handle/10986/2550/599980PU...

  • World Bank (Independent Evaluation Group) 2006, Conducting Quality Impact Evaluations Under Budget, Time and Data Constraints, author.   (this text is a highly summarized version of Bamberger’s book). It is available free on the net at: http://www.worldbank.org/ieg/ecd/conduct_qual_impact_eval.html

  • World Bank (Independent Evaluation Group) no date, Impact Evaluation - The Experience of the Independent Evaluation Group of the World Bank, author. It is available free on the net at: http://lnweb18.worldbank.org/oed/oeddoclib.nsf/DocUNIDViewForJavaSearch/...$file/impact_evaluation.pdf

  • Yin, 2000, 'Rival Explanations as an Alternative to Reforms as Experiments', in Bickman (ed) Validity and Social Experimentation, Sage.   (good review of how to identify and test rival explanations when evaluating reforms or complex social change)

  • Yin, 2003, Applications of Case Study Research, Sage.   (very good reference, includes advice on undertaking causal analysis using case studies)

  • Young, J. and Mendizabal, E. 2009, Helping researchers become policy entrepreneurs, ODI Briefing Papers 53. London: ODI http://www.odi.org.uk/resources/download/1127.pdf   (guidance on the research:policy interface)
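
Several entries in the list above (e.g. Trochim 1984; Cook, Shadish & Wong 2008) recommend regression-discontinuity designs, where program entry depends on a sharp numerical eligibility rule. The core logic can be sketched with simulated data; everything here (cutoff, sample size, effect size, variable names) is hypothetical and invented purely for illustration, not drawn from any study cited above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: students scoring below a cutoff of 50 on an
# entry test receive tutoring (the "program"); the outcome is a later
# test score. We build in a true program effect of 5 points.
n = 2000
grades = rng.uniform(0, 100, n)
cutoff = 50.0
treated = grades < cutoff            # eligibility is a sharp numerical rule
true_effect = 5.0
outcome = 0.4 * grades + true_effect * treated + rng.normal(0.0, 3.0, n)

# Sharp regression-discontinuity estimate: fit separate linear trends on
# each side of the cutoff within a bandwidth, then compare the two fitted
# values at the cutoff itself. The jump is the estimated program effect.
bw = 15.0
below = treated & (grades > cutoff - bw)
above = ~treated & (grades < cutoff + bw)

fit_lo = np.polyfit(grades[below], outcome[below], 1)
fit_hi = np.polyfit(grades[above], outcome[above], 1)
effect = np.polyval(fit_lo, cutoff) - np.polyval(fit_hi, cutoff)
print(f"Estimated effect at the cutoff: {effect:.2f}")
```

The intuition behind the design's strong rating in the references above: units just either side of the cutoff are nearly identical except for program receipt, so the discontinuity in outcomes at the cutoff is plausibly attributable to the program rather than to unobserved differences.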

Scott Bayley

6 September 2017