Randomized controlled trials (RCTs), or randomized impact evaluations, are a type of impact evaluation that uses randomized access to a social program to limit bias and generate an internally valid impact estimate.
An RCT randomizes who receives a program (or service, or pill) – the treatment group – and who does not – the control group. It then compares outcomes between the two groups; this comparison gives us the impact of the program. RCTs do not necessarily require a “no treatment” control – randomization can just as easily be used to compare different versions of the same program, or different programs tackling the same problem.
In this way, the control mimics the counterfactual. The counterfactual is defined as what would have happened to the same individuals at the same time had the program not been implemented. It is, by definition, impossible to observe – it’s an alternative universe! RCTs work by creating a group that can mimic it.
Many evaluations compare groups that are quite different from the group receiving the program. For example: if we compare outcomes for women who take up microcredit to those who do not, it could be that the women who chose not to take up microcredit differ in important ways that would affect the outcomes – they might be less motivated, or less aware of financial products.
Using a randomization approach means that a target population is first identified by the program implementer, and then program access is randomized within that population.
Instead of randomizing individuals, randomization can be done at cluster levels, such as villages, or schools, or health clinics. These are known as cluster randomized control trials.
There are two main reasons to randomize at a level larger than the individual. First, it can address contamination: where treated individuals mix and chat and potentially “share” treatment with individuals in the control group. This would “contaminate” our impact, and our control group would no longer be a good comparison. Randomizing at the village level may minimize the risk of this happening. Second, we might want to randomize at the level that the intervention would actually be implemented: for example, an intervention which provides electrification to schools. It is logistically impractical – if not impossible – to randomize electricity access over schoolchildren.
When randomizing at the cluster level, the unit of randomization is the unit at which we will randomly roll out the program; i.e. the cluster (in our example above, a school). The unit of analysis, defined as the unit at which we will collect data and compare outcomes, is usually the individual - for example, individual students’ test scores. This distinction will become important when we calculate the sample size needed. Among other things, sample size is affected by the intra-cluster correlation (ICC), which refers to how similar or dissimilar individuals within a cluster are. The ICC will determine how many individuals per cluster, and how many clusters, you will need to sample.
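To make the ICC’s effect on sample size concrete, the standard “design effect” approximation can be sketched in a few lines. This is a textbook rule of thumb, not a formula taken from this document, and the numbers below are purely illustrative:

```python
import math

def design_effect(cluster_size: int, icc: float) -> float:
    """DEFF = 1 + (m - 1) * ICC, where m is the number of individuals
    sampled per cluster. The more similar individuals within a cluster
    are (higher ICC), the more each extra individual duplicates
    information we already have."""
    return 1 + (cluster_size - 1) * icc

def clusters_needed(n_individual: int, cluster_size: int, icc: float) -> int:
    """Clusters per arm needed to match the power of a simple random
    sample of n_individual people."""
    n_clustered = n_individual * design_effect(cluster_size, icc)
    return math.ceil(n_clustered / cluster_size)

# Illustrative: if 400 students drawn at random would suffice, then with
# 20 students per school and an ICC of 0.15, DEFF = 1 + 19 * 0.15 = 3.85,
# so we need 400 * 3.85 = 1540 students, i.e. 77 schools.
```

The sketch shows why cluster trials need more total participants than individually randomized ones: with a positive ICC, each additional student in an already-sampled school adds less new information than a student from a fresh school.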
Consider this hypothetical example: an NGO, iPads 4 All (i4A), plans to distribute iPads to low-income children in a developing country. i4A wants to evaluate the impact that an iPad has on children’s education, health and future income levels. It’s likely that they will never have enough iPads to cover all the children that “deserve” one. Instead of ad hoc distribution to the children who express interest, or are nearby, or who the government determines as “neediest”, an RCT would randomize their access.
If they randomize at the individual level, they would put all the eligible children’s names into a bowl, or a list on a computer, and run a lottery. Some children would get an iPad. Some would not. If they randomize at the school level, they would do this for the school names and some schools would receive iPads. In a phase-in/pipeline design, the individuals or schools who did not receive an iPad initially would be placed in a queue to receive it if the study found the iPads to be effective and funds were available.
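The lottery just described can be sketched in a few lines of code. The helper name and school list below are hypothetical; the mechanics are identical whether the units are children (individual randomization) or schools (cluster randomization):

```python
import random

def assign_lottery(units, n_treated, seed=42):
    """Randomly pick n_treated units for treatment; the rest are control.

    A fixed seed makes the draw reproducible and auditable, which is
    useful when a partner wants to verify how assignment was done."""
    rng = random.Random(seed)
    shuffled = units[:]                 # copy, so the original order is kept
    rng.shuffle(shuffled)
    treatment = set(shuffled[:n_treated])
    return [(u, "treatment" if u in treatment else "control") for u in units]

# Hypothetical cluster-level lottery over ten schools, half treated:
schools = [f"school_{i}" for i in range(10)]
assignments = assign_lottery(schools, n_treated=5)
```

In practice the same draw is often done in Excel or Stata, as noted later in this document; the point is only that the lottery is mechanical and documented, not ad hoc.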
Beyond this simplified example, the RCT methodology can be adapted to a wide variety of contexts.
As with all human subjects research, RCTs are subject to rigorous ethical reviews to ensure that no human subjects are harmed during the research process.
Steps of an RCT
- An optional prelude is a needs assessment, which can provide information on the context and its constraints. For example: a needs assessment could tell us how many children have received their full immunization course in rural Rajasthan. It could lead us to specify a hypothesis, or key evaluation question.
- A program theory is developed (alternatively, a logic model). This program theory describes the program, unpacking the pathways of its impact, and articulates all the risks and assumptions which could hamper a successful program. It is also useful, at this stage, to think of the indicators which could be collected at each step of the way.
- A baseline survey is conducted of the entire target sample. Data are collected on the relevant indicators.
- The sample is randomized into different groups. Randomization can be done using software such as Excel or Stata. To check that randomization has “succeeded”, verify that the groups are equivalent in terms of baseline indicators and contextual variables that might be important: they should be statistically indistinguishable – that is, the same average income, the same average health level, and so on.
- The program or intervention is implemented in the treatment group.
- During the program, it is strongly advisable to collect data on the program’s implementation. These data serve three purposes. First, they provide monitoring, which benefits the implementing organization’s operations and efficiency. Second, they provide intermediate indicators which allow evaluators to unpack the “black box” of impact (and follow along the theory of change). In other words, these intermediate indicators help us answer why a program had the effect it did. Third, and most importantly, monitoring is necessary to verify that the intervention is being adequately implemented in the treatment group(s), and that the control group is not being contaminated (receiving the intervention through some other means).
- Following the program’s implementation, and depending on the context of the evaluation (e.g. some indicators are quick to respond, others slow), there is an endline, or follow-up, survey. Ideally, this survey will share many questions and characteristics with the baseline survey.
- Outcomes are then compared between treatment and control groups to derive the impact estimate. Results are reported to the implementing partner.
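The randomization and balance-check steps above can be sketched as follows. The income data are simulated, and the 0.25 rule of thumb for the normalized difference is a common convention assumed here, not something stated in this document:

```python
import random
import statistics

def normalized_difference(x_t, x_c):
    """Standardized mean difference between treatment and control for one
    baseline indicator. Values near zero suggest balance; a common rule
    of thumb flags absolute values above roughly 0.25."""
    pooled_sd = ((statistics.variance(x_t) + statistics.variance(x_c)) / 2) ** 0.5
    return (statistics.mean(x_t) - statistics.mean(x_c)) / pooled_sd

# Hypothetical baseline data for 500 households.
rng = random.Random(0)
incomes = [rng.gauss(100, 15) for _ in range(500)]

# Randomize half into treatment, half into control.
ids = list(range(500))
rng.shuffle(ids)
treat, control = set(ids[:250]), set(ids[250:])

# Balance check: with successful randomization this should be near zero.
diff = normalized_difference([incomes[i] for i in treat],
                             [incomes[i] for i in control])
```

In a real evaluation this check is repeated for every important baseline indicator (income, health status, education, and so on), not just one.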
The RCT approach is flexible enough to accommodate a variety of contexts and sectors. It can be used in education, health, environment, and so on. With a little imagination, randomization can be adapted to a number of different circumstances. Constraints and what RCTs cannot do will be discussed below. For now, here is a short gallery of examples of what RCTs can do.
In a microfinance study by the Abdul Latif Jameel Poverty Action Lab (J-PAL), a large Indian microfinance institution, Spandana, identified 104 low-income neighborhoods in Hyderabad, India, which were potential locations to open a branch office. Prior to opening the branch offices, 52 neighborhoods were randomly selected to have an office open in 2005 – this became the treatment group. The remaining 52 neighborhoods remained “control” (receiving an office in the following years). Households were then interviewed 15-18 months after the introduction of microfinance in the treatment areas.
A study in Bihar and Rajasthan, India, examined several treatments to address low literacy levels of children. One intervention focused on offering mothers literacy classes, assuming that more educated mothers would be more effective at helping children at home. A second intervention provided a guide to mothers on at-home activities which could enrich the learning environment for their children at home. A third intervention combined these two: mothers received both the mother literacy classes and the at-home activities guide. A comparison group received none of these services.
A remedial tutoring program in India used a rotation design. A rotation design refers to a situation where, of two groups, one is treatment and one is control – and then the roles switch, with the previously treated becoming control and the previously control becoming treated. In practice, the NGO Pratham identified 77 schools in Mumbai and 124 in Vadodara. Pratham’s intervention was a remedial tutor (called a “balsakhi”, or “child’s friend”) who would meet with 15-20 students who were falling behind in their grades.
Randomization was “rotated” in that, in 2001, half of the schools received a tutor for grade 3, and the other half received one for grade 4. In 2002, the schools received a tutor for the previously untreated grade. In this way, the impact of treatment could be determined by comparing grade 3 students in schools who received a grade 3 tutor to grade 3 students in schools who received a grade 4 tutor.
Often, budget constraints prohibit a full-scale roll-out of a program, so it is instead rolled out in stages. These staggered roll-outs can be leveraged for randomized impact evaluations by simply selecting, by lottery, the areas which will receive the service first.
J-PAL’s deworming study used random phase-in. Between 1998 and 2001, mass deworming was rolled out in 75 schools in western Kenya by the NGO International Child Support Africa. The 75 schools were placed in a lottery, with 25 schools receiving deworming in 1998, 25 in 1999, and the remaining 25 in 2001. In this way, in 1998, the 50 non-dewormed schools served as a control group for the 25 dewormed schools.
In many situations, it is politically, ethically or administratively untenable to deny services to a control group. In some of these cases, an encouragement design can be used – randomly-selected individuals will receive a promotional script or advertisement alerting them to this already-available service. In these cases, control group individuals still have access to the same service; however, they will not receive the same reminders to use it. By the same token, treatment individuals can still refuse service (as in most interventions).
A J-PAL study in Tangiers, Morocco, worked with a local utility company – Amendis – which was already distributing drinking water (though take-up was less than 100%). The program provided a subsidized, interest-free loan to install a water connection. Amendis made this loan available to all eligible households; however, for the evaluation, a random subset of those households received a door-to-door awareness campaign and were offered assistance with filling out the application. This promotion was the “encouragement” which pushed selected households (treatment) to sign up for the loan more often than those households which did not receive the promotion (control). In this way, the researchers were able to determine the impact of new Amendis water connections on households.
In the end, because take-up of water connections was higher in the “encouraged” (i.e. treatment) group than the non-encouraged (i.e. control) group, these two groups could be compared. And since encouragement was randomly assigned, any difference in outcomes could be attributed to the difference in the take-up rates of water connections.
Sometimes randomization can occur within a “bubble” of eligibility. For example, a J-PAL study in South Africa worked with an anonymous microfinance lender to identify 787 rejected loan applicants who had been deemed “potentially creditworthy” by the institution. (Applicants had been either automatically approved or rejected under the bank’s normal application process.) Within this sample of 787, this “bubble”, a randomly-selected subset of rejected applicants were assigned to be given a “second look” by one of the lending institution’s financial officers. These officers were not required to approve these individuals for loans, but they were encouraged to. (Thus, we can see that “take-up” in this case related to the financial officers approving applicants for loans.)
Mapping the approach in terms of tasks and options
RCTs share, with other impact evaluation methodologies, a number of the same tasks and options. For example, by definition, they must specify key evaluation questions. These questions could be things like: will deworming pills lead to increased school attendance? Will they lead to improved educational outcomes as well? Does access to microfinance lead to greater business investments? Is iron-fortified salt an effective way of decreasing anemia rates in the rural population?
Furthermore, data collection and data analysis are integral parts of the RCT approach. A deep understanding of the sample is essential: who is the target population? Is the selected sample representative of the larger population? Following randomization of program access, are the treatment and control groups comparable along important indicators? Thinking deeply about indicators is also important: for example, how will women’s empowerment be measured? Cognitive ability? Financial literacy? How will data on these indicators be collected?
Advice on choosing this approach
It is important to remember that, while RCTs can be a rigorous way to measure impact in certain circumstances, they are only one part of a wider array of evaluation tools. That is, they can be a useful addition to any portfolio of methods, but they are unlikely to be able to answer every question. In this section, we will describe some of the binding constraints which would prevent an evaluator from choosing the RCT approach.
Binding constraints: Sample size
One of the major constraints to any quantitative impact evaluation – not just RCTs – is sample size. In the case of RCTs, we are concerned with sample size along two dimensions: the unit of analysis, and the unit of randomization. Both the unit of analysis and the unit of randomization are integral in determining statistical significance and statistical power.
Statistical significance refers to the probability that the results we observe are not purely due to chance. Conventions in the literature hold that significance levels of 90% – preferably 95% – are sufficient. This means that, 10% or 5% of the time (respectively), the results we observe arise by chance.
Statistical power, instead, refers to the probability of detecting an impact when there is one. The inverse question is: how likely are we to miss an impact when it occurs (thus generating a “false negative”)? A number of factors determine statistical power: the sample size, the minimum detectable effect size (i.e. how sensitive the test must be), the outcome variable’s underlying variance, the proportions assigned to treatment and control, and – if it is a cluster RCT – the intra-cluster correlation. By convention, 80% is considered a sufficient level of power.
There is an argument that, when power is low, it is preferable not to conduct an impact evaluation at all – otherwise resources will be wasted, resources which could be better used elsewhere (in conducting a good process evaluation, for example).
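To illustrate how these ingredients combine, here is a textbook approximation (an assumption of this sketch, not a formula from this document) for the sample size per arm of a simple two-arm, individually randomized trial with equal group sizes:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(mde, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided difference-in-means test:

        n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / mde)^2

    where mde is the minimum detectable effect and sd the outcome's
    standard deviation. Ignores clustering and unequal allocation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd / mde) ** 2)

# Detecting a 0.2-standard-deviation effect at the conventional 5%
# significance level and 80% power:
n = sample_size_per_arm(mde=0.2, sd=1.0)   # roughly 393 per arm
```

Note how the minimum detectable effect enters as a square: halving the effect you need to detect quadruples the required sample, a point that recurs below when take-up falls short of expectations.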
Binding constraints: Retrospective vs. Prospective
By design, RCTs cannot determine impacts of currently existing projects, that is, of programs that have already launched and did not, by chance, randomly deliver their services. (Most programs are, indeed, not delivered randomly – notable exceptions being Mexico’s PROGRESA and reservations for women and caste minorities under India’s 73rd amendment.) Given that randomization occurs at the moment of implementation, and randomization is integral to the RCT approach, they can only be planned ex ante – not ex post. Thus, for existing programs, RCTs can only be applied to either roll-outs of the program into new areas or additions to the program (e.g. new products).
Advice when using this approach
There are a number of issues that may arise during the implementation of even the best-designed RCT. It is important, then, to be prepared and include plans to mitigate or control various risks.
Take-up rates can sometimes be lower than expected, and this can shrink your effect size (and, following that, your statistical power). It is worth noting that the required sample size grows with the square of the drop in effect size: a 50% drop in effect size requires a four-fold increase in sample size to achieve the same power.
For this reason, it is advisable to adequately anticipate – and, if anything, underestimate – take-up rates of the program. Choosing a conservative, even pessimistic, estimate for this may reward you with higher power down the line.
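This back-of-the-envelope relationship can be written out directly: the intention-to-treat effect scales with take-up, and required sample size scales with one over the effect size squared, so the sample-size penalty is the square of the take-up shortfall. A hypothetical sketch:

```python
def sample_size_multiplier(expected_takeup, actual_takeup):
    """Factor by which the required sample size grows when take-up falls
    from expected_takeup to actual_takeup. A rule of thumb, assuming the
    intention-to-treat effect scales linearly with take-up and required
    n scales with 1 / effect^2; not an exact power calculation."""
    return (expected_takeup / actual_takeup) ** 2

# Planning for full take-up but achieving 50% halves the effect size,
# so keeping the same power requires four times the sample:
multiplier = sample_size_multiplier(1.0, 0.5)
```

This is why the conservative take-up estimates recommended above pay off: budgeting the sample for pessimistic take-up protects power if reality falls short.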
Another issue which can compromise an RCT’s estimates is non-compliance by program participants. That is, while individuals may be assigned to treatment or control, these assignments are rarely required or controlled. Consider a microfinance program, which opens branches in randomly-selected “treatment” neighborhoods and does not do so in “control” neighborhoods. Individuals living in the latter may simply journey to the “treatment” neighborhoods in order to open an account at the microfinance branch office. In this case, the control group no longer serves as a true counterfactual.
Non-compliance, then, can threaten the integrity of randomization if individuals are able to self-select into groups. While non-compliance can never totally be eliminated, it can be minimized. One method is to choose a unit of randomization large enough such that the two groups are unlikely to mix. For example, in the microfinance example, if “treatment” and “control” neighborhoods were also reasonably far apart, we might expect non-compliance to remain low.
Note, however, that when control group individuals take up the program and treatment individuals do not, this resembles the encouragement design.
Attrition occurs when parts of your sample are no longer available for follow-up – for example, because they have moved away. If attrition is related to treatment status, we call this “differential attrition”. This is especially concerning because it essentially un-randomizes your sample: people are self-selecting out of one group or the other. It is important to note that, even when the rates of attrition look the same in both groups, differential attrition may still be occurring if the reasons people leave the treatment or control groups are related to the treatment.
In the microfinance example, differential attrition could occur if some households in treatment neighborhoods obtain loans, grow their businesses, and become wealthy enough to leave the neighborhood – out of our sample. If this was the case, we would not be able to include them in our analysis, and thus our remaining “treatment group” would look a little poorer than it should (since all the wealthy households have moved away!). It is therefore very important to follow up with households, especially in the case of differential attrition.
Non-differential attrition occurs when attrition from the treatment or control groups occurs for reasons unrelated to the treatment: people may move away, die, or otherwise drop out of our sample, and it has nothing to do with whether they are in treatment or control. In this case, we would only worry if non-differential attrition erodes our sample size such that issues of statistical significance or power crop up.
Conducting a baseline survey
In theory, if randomization has been successfully implemented, an endline survey is sufficient to determine an internally valid impact estimate. However, baseline surveys – beyond providing empirical assurance that randomization has generated balanced treatment and control groups – provide an additional benefit in the form of increased power. In general, more frequent data collection (such as a baseline, midline, and endline) can give us the same power with a smaller sample size. Baseline results also allow us to measure heterogeneous effects (i.e. subgroup analysis) when the subgroups are defined by variables that could change over time. For example, a baseline allows us to measure the test-score impact of an education innovation on the subgroup of children who scored poorly on the exam at baseline; without the baseline, we would not be able to identify which children these were.
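The power gain from a baseline can be approximated with a standard ANCOVA-style rule of thumb (an assumption of this sketch, not a formula from the text): controlling for a baseline measure that correlates rho with the endline outcome reduces residual variance by a factor of (1 − rho²), so the required sample size falls roughly proportionally.

```python
import math

def sample_size_with_baseline(n_post_only, rho):
    """Approximate n needed when the analysis controls for a baseline
    measure with correlation rho to the endline outcome. Residual
    variance shrinks by (1 - rho^2), and required n shrinks in
    proportion. A common approximation, ignoring clustering."""
    return math.ceil(n_post_only * (1 - rho ** 2))

# Illustrative: if an endline-only design needs 393 people per arm and
# baseline and endline outcomes correlate at 0.5, the baseline-adjusted
# design needs only about three-quarters of that sample.
n = sample_size_with_baseline(393, 0.5)
```

The more persistent the outcome (higher rho), the bigger the saving, which is why baselines are especially valuable for slow-moving outcomes like test scores.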
Comparing multiple treatments
If we want to detect the difference between two variations of a program, then we will need more power – and, consequently, a larger sample size. If we simply want to compare having a program to not having a program, then less power (and thus a smaller sample size, relatively) is sufficient.
The Abdul Latif Jameel Poverty Action Lab (J-PAL) offers a weeklong Executive Education course throughout the world and throughout the year. This course explores impact evaluation, focusing on when and how to use randomized impact evaluations. A free, archived version of the course can be found online at MIT’s Open CourseWare.