Propensity scores

Synonyms:
Propensity score matching

Propensity score matching (PSM) is a quasi-experimental method used to estimate the difference in outcomes between beneficiaries and non-beneficiaries that is attributable to a particular program.

PSM reduces the selection bias that may be present in non-experimental data. Selection bias exists when units (e.g. individuals, villages, schools) cannot or have not been randomly assigned to a particular program, and those units which choose or are eligible to participate are systematically different from those that are not.

A propensity score is the estimated probability that a unit is exposed to the program; it is constructed using the unit’s observed characteristics. The propensity scores of all units in the sample, both beneficiaries and non-beneficiaries, are used to create a comparison group against which the program’s impact can be measured. By comparing units that did not participate in the program with participating units that otherwise share the same observed characteristics, PSM reduces bias in observational studies and estimates the causal effect of a program on an outcome or outcomes.
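
In symbols, the definition above can be written as follows. The notation is an assumption made for illustration and does not come from this page: T_i is a participation indicator, X_i the vector of observed characteristics, and Y_i(1), Y_i(0) the outcomes of unit i with and without the program. The second line is the effect on participants, which is the quantity that matching beneficiaries to similar non-beneficiaries is usually taken to estimate.

```latex
e(X_i) = \Pr(T_i = 1 \mid X_i)
\qquad\text{(propensity score)}

\tau_{\mathrm{ATT}} = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid T_i = 1 \,\right]
\qquad\text{(average effect on participants)}
```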

PSM consists of four phases: estimating the probability of participation, i.e. the propensity score, for each unit in the sample; selecting a matching algorithm that is used to match beneficiaries with non-beneficiaries in order to construct a comparison group; checking for balance in the characteristics of the treatment and comparison groups; and estimating the program effect and interpreting the results.

  • Estimating the propensity score: The propensity scores are constructed using a logit or probit regression to estimate the probability of a unit’s exposure to the program, conditional on a set of observable characteristics that may affect participation in the program. In order for the propensity scores to correctly estimate the probability of participation, the characteristics included in the propensity score estimation should be well-considered and as exhaustive as possible. However, it is very important that characteristics that may have been affected by the treatment are not included. For this reason, it is best to use baseline data to estimate the propensity scores, if available. Once all relevant covariates are selected for inclusion, a logit or a probit regression is performed and the predicted probabilities are obtained.
  • Select a matching algorithm: Once the propensity scores are estimated, units in the treatment group (beneficiaries) are then matched with non-beneficiaries with similar propensity scores or probability of participating in the program. There are a number of matching algorithms that can be employed. The most common matching algorithms used in PSM include:
    • Nearest-neighbour matching: Each program beneficiary is matched to the non-beneficiary unit with the closest propensity score. Non-beneficiaries for which there are no beneficiaries with a sufficiently similar score are discarded from the sample; the same is true for beneficiaries for which there is no similar non-beneficiary. A variation of nearest-neighbour matching matches multiple (for example, three or five) non-beneficiaries to a single beneficiary.
    • Radius matching (i.e. ‘caliper’ matching): A maximum propensity score radius (a ‘caliper’) is established, and all non-beneficiaries within the given radius of a beneficiary are matched to that beneficiary.
    • Kernel matching: For each treated subject, a weighted average of the outcomes of all non-beneficiaries is derived. The weights are based on the distance between each non-beneficiary’s propensity score and that of the treated subject, with the highest weight given to non-beneficiaries whose scores are closest to the treated unit’s.
  • Check for balance: Once units are matched, the characteristics of the constructed treatment and comparison groups should not be significantly different; i.e., the matched units in the treatment and comparison groups should be statistically comparable. Balance is generally tested using a t-test to compare the means of all covariates included in the propensity score in order to determine if the means are statistically similar in the treatment and comparison groups. If balance is not achieved, i.e. the means of the covariates are statistically different, a different matching method or specification should be used until the sample is sufficiently balanced.
  • Estimating the program effect and interpreting results: Following the estimation of propensity scores, the implementation of a matching algorithm, and the achievement of balance, the intervention’s impact may be estimated by averaging the differences in outcome between each treated unit and its neighbour or neighbours from the constructed comparison group. The difference in averages of the subjects who participated in the intervention and those who did not can then be interpreted as the impact of the program.
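
The four phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the data are simulated with a true program effect of 2.0, the logit is fitted by plain gradient ascent rather than a statistics package, matching is one-to-one nearest neighbour with replacement and an illustrative caliper of 0.05, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logit(X, t, lr=0.5, n_iter=300):
    """Phase 1a: fit a logit model by gradient ascent; returns [intercept, b1, ...]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            err = ti - sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict(X, beta):
    """Phase 1b: predicted participation probabilities (the propensity scores)."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi))) for xi in X]

def match_nearest(scores, t, caliper=0.05):
    """Phase 2: one-to-one nearest-neighbour matching with replacement.
    Treated units with no control within the caliper are dropped."""
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i, ti in enumerate(t):
        if ti == 1:
            j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
            if abs(scores[j] - scores[i]) <= caliper:
                pairs.append((i, j))
    return pairs

def welch_t(a, b):
    """Phase 3: Welch t-statistic for a difference in covariate means."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def att(pairs, y):
    """Phase 4: average outcome difference across matched pairs."""
    return mean(y[i] - y[j] for i, j in pairs)

# Simulated data (assumption): two covariates drive both participation and outcome.
random.seed(0)
n = 600
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
t = [1 if random.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0 for x1, x2 in X]
y = [2.0 * ti + 1.5 * x1 - 1.0 * x2 + random.gauss(0, 1) for ti, (x1, x2) in zip(t, X)]

scores = predict(X, fit_logit(X, t))
pairs = match_nearest(scores, t)

naive = mean(yi for yi, ti in zip(y, t) if ti == 1) - mean(yi for yi, ti in zip(y, t) if ti == 0)
t_before = welch_t([x[0] for x, ti in zip(X, t) if ti == 1],
                   [x[0] for x, ti in zip(X, t) if ti == 0])
t_after = welch_t([X[i][0] for i, _ in pairs], [X[j][0] for _, j in pairs])
att_hat = att(pairs, y)

print(f"naive difference in means: {naive:.2f}")
print(f"t-statistic on first covariate, before / after matching: {t_before:.2f} / {t_after:.2f}")
print(f"matched estimate of the program effect: {att_hat:.2f} (simulated truth: 2.0)")
```

Because the simulated covariates raise both participation and the outcome, the naive comparison overstates the effect, while the matched estimate should land much closer to the simulated value. The t-statistic after matching is only a rough diagnostic here, since matching with replacement reuses some comparison units.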

Examples

Jalan and Ravallion (2003) conducted an impact evaluation that measured the effect of access to piped water on the incidence and duration of diarrhea among children less than 5 years of age in 16 states in India. The study utilized a household survey conducted in 1993-94 and used PSM to create comparable treatment and comparison households from within the larger sample. The authors argued that impact estimates based on the full sample are subject to selection bias because not all characteristics which influence both child health and water source selection are observable or included in the survey. They also claimed that the inclusion of variables that do not necessarily predict outcomes reduces bias in estimates of causal effects. For the purposes of the study, pre-exposure variables (e.g. state of residence, composition of household, assets, religion, access to public goods and village characteristics) were incorporated into a propensity score through a logit model. The Five-Nearest-Neighbor matching method was used to create the sample for analysis, amounting to 33,000 observations. Approximately 650 households were excluded after the matching process due to the unavailability of sufficiently similar households. The authors concluded that access to piped water reduces disease prevalence by 21% and illness duration by 29%.

Source: Jalan J, and Ravallion M (2003).

Advice for choosing this method

  • Propensity score matching requires statistical computations and is best conducted using statistical programs such as Stata or SPSS. It may be useful to involve an experienced statistician, depending on levels of staff knowledge. 
  • PSM demands a deep understanding of the observable covariates that drive participation in an intervention and requires that there is substantial overlap between the propensity scores of those subjects or units which have benefited from the program and those that have not; this is called the ‘common support’. If either of these two factors is lacking, PSM is not a suitable methodology for estimating causal effects of an intervention.
  • PSM also requires a large sample size in order to gain statistically reliable results. This is true for many causal inference methodologies but is particularly true for PSM due to the tendency to discard many observations which do not fall under the common support.
  • PSM is not a panacea – because it matches only on observed information, it may not eliminate bias from unobserved differences between treatment and comparison groups.
  • It is important to understand the trade-offs between reducing bias and reducing standard errors that arise when choosing the specifications of the matching algorithms. For example, when choosing the caliper size for radius matching, if the caliper is too large there is a risk that very dissimilar individuals will be matched, while if it is too small the matched sample may become too small to obtain statistically convincing results. Similarly, for nearest-neighbour matching, choosing multiple neighbours reduces standard errors relative to single-neighbour matching, because more comparison observations are used, but increases bias, because the additional neighbours are on average poorer matches. Such trade-offs exist in each matching algorithm.
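
The caliper trade-off can be made concrete with a small sketch. The propensity scores below are hypothetical draws (treated units skewed towards high scores, comparison units towards low scores), and the three caliper widths are arbitrary illustrative values rather than recommendations.

```python
import random

def radius_match(treated, controls, caliper):
    """For each treated score, collect all control scores within the caliper."""
    matches = {}
    for i, s in enumerate(treated):
        near = [c for c in controls if abs(c - s) <= caliper]
        if near:  # treated units with no control in range are dropped
            matches[i] = near
    return matches

def summarise(treated, controls, caliper):
    """Return (treated units kept, mean score gap across all matched pairs)."""
    matches = radius_match(treated, controls, caliper)
    gaps = [abs(c - treated[i]) for i, near in matches.items() for c in near]
    mean_gap = sum(gaps) / len(gaps) if gaps else float("nan")
    return len(matches), mean_gap

random.seed(1)
# Hypothetical propensity scores, not taken from any real evaluation
treated = [random.betavariate(4, 2) for _ in range(200)]
controls = [random.betavariate(2, 4) for _ in range(200)]

results = {c: summarise(treated, controls, c) for c in (0.001, 0.01, 0.1)}
for caliper, (kept, gap) in results.items():
    print(f"caliper={caliper}: treated units kept={kept}, mean score gap={gap:.4f}")
```

A wider caliper keeps more treated units (smaller standard errors) but the matched units are further apart in propensity score (more risk of bias); a narrower caliper does the opposite.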

Advice for using this method

  • Deep understanding of the context is required by the evaluator in selecting observable covariates to include in the propensity score – rarely will a comprehensive list exist. Useful criteria to consider include the explicit criteria used in determining participation in the intervention (i.e. project or program eligibility), as well as factors associated with self-selection (e.g. the subject’s distance from a project location by foot). Covariates that are affected by the intervention should not be included in the propensity score.
  • No matching algorithm is superior in every context; each involves a trade-off between efficiency and bias. Balance statistics resulting from multiple matching algorithms should be examined to determine which method achieves the best balance.
  • Increase the reliability of the evaluation by trying multiple matching algorithms and choosing the one that produces the best balance statistics.
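
Comparing balance across several matching specifications might look like the sketch below. Several things here are assumptions for illustration: the data are simulated, the propensity scores are treated as already estimated, and balance is summarised by the standardised mean difference (a common alternative to the t-test described earlier) rather than by any statistic prescribed on this page.

```python
import math
import random
from statistics import mean, stdev

def k_nearest(scores, treated, controls, k):
    """Match each treated unit to its k closest controls (with replacement)."""
    return {i: sorted(controls, key=lambda c: abs(scores[c] - scores[i]))[:k]
            for i in treated}

def radius(scores, treated, controls, caliper):
    """Match each treated unit to every control within the caliper."""
    matches = {}
    for i in treated:
        near = [c for c in controls if abs(scores[c] - scores[i]) <= caliper]
        if near:  # treated units without any control in range are dropped
            matches[i] = near
    return matches

def smd(x_treated, x_comparison):
    """Standardised mean difference, scaled by the treated-group std. deviation."""
    return (mean(x_treated) - mean(x_comparison)) / stdev(x_treated)

def worst_imbalance(matches, X):
    """Largest absolute SMD over all covariates for one matched sample."""
    worst = 0.0
    for j in range(len(X[0])):
        xt = [X[i][j] for i in matches]
        # each treated unit is compared with the average of its matched controls
        xc = [mean(X[c][j] for c in cs) for cs in matches.values()]
        worst = max(worst, abs(smd(xt, xc)))
    return worst

random.seed(2)
n = 400
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
# Assumption: these stand in for propensity scores already estimated by a logit.
scores = [1 / (1 + math.exp(-(0.8 * a - 0.5 * b))) for a, b in X]
t = [1 if random.random() < s else 0 for s in scores]
treated = [i for i in range(n) if t[i] == 1]
controls = [i for i in range(n) if t[i] == 0]

unmatched = max(abs(smd([X[i][j] for i in treated], [X[c][j] for c in controls]))
                for j in range(2))
specs = {
    "1 nearest neighbour": k_nearest(scores, treated, controls, 1),
    "5 nearest neighbours": k_nearest(scores, treated, controls, 5),
    "radius 0.02": radius(scores, treated, controls, 0.02),
}
balance = {name: worst_imbalance(m, X) for name, m in specs.items()}
best = min(balance, key=balance.get)

print(f"unmatched sample: worst |SMD| = {unmatched:.3f}")
for name, value in balance.items():
    print(f"{name}: worst |SMD| = {value:.3f}")
print("best balance:", best)
```

Which specification wins depends on the data, so the comparison should be rerun for each evaluation rather than fixing one algorithm in advance.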

Resources

Examples

Dehejia R and Wahba S. (2002). “Propensity Score-Matching Methods for Nonexperimental Causal Studies”, Review of Economics and Statistics, Volume 84, Number 1, pp. 151-161.

Heckman J, H. Ichimura, and P. Todd (1997). “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program”, The Review of Economic Studies, Volume 64, Number 4, pp. 605-654.

Heckman J, H. Ichimura, and P. Todd, (1998), “Matching as an Econometric Evaluation Estimator”. Retrieved from https://www.researchgate.net/publication/4783246_Matching_As_An_Econometric...

Heckman J, H. Ichimura, J. Smith, and P. Todd (1998). “Characterizing Selection Bias Using Experimental Data”, Econometrica, Vol. 66, No. 5 (Sep., 1998), pp. 1017-1098. Retrieved from https://www.jstor.org/stable/2999630

Jalan J, and Ravallion M (2003). “Estimating the Benefit Incidence of an Antipoverty Program by Propensity Score Matching”. Journal of Business & Economic Statistics, American Statistical Association, vol. 21(1), pages 19-30, January. Retrieved from http://fmwww.bc.edu/RePEc/es2000/0873.pdf

Rosenbaum P and Rubin D. (1983) “The Central Role of the Propensity Score in Observational Studies for Causal Effects”, Biometrika, Vol. 70, No. 1, pp. 41-55

Rosenbaum P and Rubin D. (1985). “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score”, The American Statistician, Volume 39, Number 1, pp. 33-38.
