Propensity Scores

Synonyms: Propensity Score Matching

Propensity-score matching (PSM) is a quasi-experimental option used to estimate the difference in outcomes between beneficiaries and non-beneficiaries that is attributable to a particular program.

PSM reduces the selection bias that may be present in non-experimental data. Selection bias exists when units (e.g. individuals, villages, schools) cannot be or have not been randomly assigned to a particular program, and the units that choose or are eligible to participate are systematically different from those that do not participate.

A propensity score is the estimated probability that a unit is exposed to the program, given the unit’s observed characteristics. The propensity scores of all units in the sample, both beneficiaries and non-beneficiaries, are used to construct a comparison group against which the program’s impact can be measured. By comparing participating units with non-participating units that otherwise share the same observed characteristics, PSM reduces bias in observational studies and, provided no unobserved characteristics drive both participation and outcomes, yields an estimate of the causal effect of a program on an outcome or outcomes.
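The definition above can be written compactly. The notation is introduced here for illustration only and does not appear in the original text: T_i = 1 marks a beneficiary, X_i is the vector of observed characteristics, and beta is the coefficient vector of a logit model, which is one common specification.

```latex
e(X_i) = \Pr(T_i = 1 \mid X_i),
\qquad
\hat{e}(X_i) = \frac{\exp(X_i'\hat{\beta})}{1 + \exp(X_i'\hat{\beta})}
\quad \text{(fitted logit)}
```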

PSM consists of four phases: estimating the probability of participation, i.e. the propensity score, for each unit in the sample; selecting a matching algorithm that is used to match beneficiaries with non-beneficiaries in order to construct a comparison group; checking for balance in the characteristics of the treatment and comparison groups; and estimating the program effect and interpreting the results.

  • Estimating the Propensity Score: The propensity scores are constructed using a logit or probit regression to estimate the probability of a unit’s exposure to the program, conditional on a set of observable characteristics that may affect participation in the program. In order for the propensity scores to correctly estimate the probability of participation, the characteristics included in the propensity score estimation should be well-considered and as exhaustive as possible. However, it is very important that characteristics which may have been affected by the treatment are not included. For this reason, it is best to use baseline data to estimate the propensity scores, if available. Once all relevant covariates are selected for inclusion, a logit or a probit regression is performed and the predicted probabilities are obtained.
  • Selecting a Matching Algorithm: Once the propensity scores are estimated, units in the treatment group (beneficiaries) are matched with non-beneficiaries that have similar propensity scores, i.e. a similar probability of participating in the program. A number of matching algorithms can be employed. The most common matching algorithms used in PSM include:

o    Nearest-neighbor matching: Each program beneficiary is matched to the non-beneficiary unit with the closest propensity score. Non-beneficiaries for which there are no beneficiaries with a sufficiently similar score are discarded from the sample; the same is true for beneficiaries for which there is no similar non-beneficiary. A variation of nearest-neighbor matching matches multiple (for example, three or five) non-beneficiaries to a single beneficiary.

o    Radius matching (also called ‘caliper’ matching): A maximum propensity score radius – a ‘caliper’ – is established, and all non-beneficiaries within the given radius of a beneficiary are matched to that beneficiary.

o    Kernel matching: For each treated subject, a weighted average of the outcomes of all non-beneficiaries is derived. The weights are based on the distance between each non-beneficiary’s propensity score and that of the treated subject, with the highest weight given to those with scores closest to the treated unit.

  • Checking for Balance: Once units are matched, the characteristics of the constructed treatment and comparison groups should not be significantly different; i.e., the matched units in the treatment and comparison groups should be statistically comparable. Balance is generally tested using a t-test that compares the means of all covariates included in the propensity score, to determine whether the means are statistically similar in the treatment and comparison groups. If balance is not achieved, i.e. the means of the covariates are statistically different, a different matching option or specification should be used until the sample is sufficiently balanced.
  • Estimating the Program Effect and Interpreting Results: Following the estimation of propensity scores, the implementation of a matching algorithm, and the achievement of balance, the intervention’s impact may be estimated by averaging the differences in outcome between each treated unit and its neighbor or neighbors from the constructed comparison group. The difference in averages of the subjects who participated in the intervention and those who did not can then be interpreted as the impact of the program.
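The four phases above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a recommended implementation. The data are simulated with a known program effect of 2.0. The logit model is fitted with a hand-written gradient ascent rather than a statistical package. Matching is one-to-one nearest-neighbor with replacement. The function names and the 0.05 caliper are hypothetical choices made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, t, lr=0.5, steps=300):
    """Fit a logit model by gradient ascent; returns [intercept, b1, b2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, ti in zip(X, t):
            err = ti - sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi)))
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        w = [b + lr * g / len(X) for b, g in zip(w, grad)]
    return w

def propensity_scores(w, X):
    """Predicted probability of participation for every unit."""
    return [sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi))) for xi in X]

def nearest_neighbor_att(ps, t, y, caliper=0.05):
    """Match each treated unit to its closest control (with replacement)
    and average the outcome differences over the matched pairs."""
    controls = [i for i, ti in enumerate(t) if ti == 0]
    gaps = []
    for i, ti in enumerate(t):
        if ti == 1:
            j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
            if abs(ps[j] - ps[i]) <= caliper:  # discard treated units with no close match
                gaps.append(y[i] - y[j])
    return sum(gaps) / len(gaps), len(gaps)

# Simulated (hypothetical) data: participation and the outcome both depend
# on x1 and x2, so a naive comparison of means is biased.
random.seed(0)
X, t, y = [], [], []
for _ in range(600):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    ti = 1 if random.random() < sigmoid(0.8 * x1 + 0.5 * x2) else 0
    X.append((x1, x2))
    t.append(ti)
    y.append(2.0 * ti + 2.0 * x1 + 1.0 * x2 + random.gauss(0, 1))

weights = fit_logit(X, t)              # phase 1: estimate the propensity score
ps = propensity_scores(weights, X)
treated_y = [yi for yi, ti in zip(y, t) if ti == 1]
control_y = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
att, n_matched = nearest_neighbor_att(ps, t, y)  # phases 2 and 4: match, then estimate

print(f"naive difference in means: {naive:.2f}")
print(f"matched estimate: {att:.2f} using {n_matched} treated units")
```

The balance check (phase 3) is omitted here for brevity. In practice a statistical package would be used for the regression, and the matched estimate would be reported with a standard error.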

Example

Jalan and Ravallion conducted an impact evaluation that measured the effect of access to piped water on the incidence and duration of diarrhea among children less than 5 years of age in 16 states in India. The study utilized a household survey conducted in 1993-94 and used PSM to create comparable treatment and comparison households from within the larger sample. The authors argued that impact estimates based on the full sample are subject to selection bias because not all characteristics which influence both child health and water source selection are observable or included in the survey. They also claimed that the inclusion of variables that do not necessarily predict outcomes reduces bias in estimates of causal effects. For the purposes of the study, pre-exposure variables (e.g. state of residence, composition of household, assets, religion, access to public goods and village characteristics) were incorporated into a propensity score through a logit model. The five-nearest-neighbor matching option was used to create the sample for analysis, amounting to 33,000 observations. Approximately 650 households were excluded after the matching process due to the unavailability of sufficiently similar households. The authors concluded that access to piped water reduces disease prevalence by 21% and illness duration by 29%.

Source: Jalan J, and Ravallion M (2003) “Does Piped Water Reduce Diarrhea for Children in Rural India?”.

Advice

Advice for CHOOSING this option

  • Propensity score matching requires statistical computations and is best conducted using statistical programs such as Stata or SPSS. It may be useful to involve an experienced statistician, depending on levels of staff knowledge. 
  • PSM demands a deep understanding of the observable covariates that drive participation in an intervention and requires substantial overlap between the propensity scores of the subjects or units which have benefited from the program and those which have not; this region of overlap is called the ‘common support’. If either of these two factors is lacking, PSM is not a suitable methodology for estimating causal effects of an intervention.
  • PSM also requires a large sample size in order to gain statistically reliable results. This is true for many causal inference methodologies but is particularly true for PSM due to the tendency to discard many observations which do not fall under the common support.
  • PSM is not a panacea – because it matches only on observed information, it may not eliminate bias from unobserved differences between treatment and comparison groups.
  • It is important to understand the trade-offs between reducing bias and reducing standard errors that arise when choosing the specifications of the matching algorithms. For example, when choosing the caliper size for radius matching, if the caliper is too large there is a risk that very dissimilar individuals will be matched, while if the caliper is too small the matched sample may become too small to obtain statistically convincing results. Similarly, for nearest-neighbor matching, choosing multiple neighbors reduces standard errors relative to single-neighbor matching, because more comparison observations are used, but increases bias, because the additional neighbors are on average poorer matches. Such trade-offs exist in each matching algorithm.
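The caliper trade-off described above can be illustrated with a small sketch. The propensity scores and caliper widths below are invented for the example, and `radius_match_summary` is a hypothetical helper, not part of any statistical package. For each caliper it reports how many treated units find at least one match, and how far apart the worst accepted match is.

```python
def radius_match_summary(treated_ps, control_ps, caliper):
    """Return (treated units with at least one match, largest accepted distance)."""
    matched, worst = 0, 0.0
    for p in treated_ps:
        distances = [abs(p - q) for q in control_ps if abs(p - q) <= caliper]
        if distances:
            matched += 1
            worst = max(worst, max(distances))
    return matched, worst

# Hypothetical propensity scores for four beneficiaries and four non-beneficiaries.
treated_ps = [0.30, 0.50, 0.70, 0.90]
control_ps = [0.28, 0.45, 0.58, 0.63]

summaries = {c: radius_match_summary(treated_ps, control_ps, c)
             for c in (0.03, 0.075, 0.30)}
for caliper, (n_matched, worst) in summaries.items():
    print(f"caliper={caliper}: {n_matched} treated matched, worst distance={worst:.2f}")
```

A narrow caliper keeps only close matches but leaves most treated units unmatched. A wide caliper retains every treated unit at the cost of accepting much more distant comparisons.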

Advice for USING this option

  • Deep understanding of the context is required by the evaluator in selecting observable covariates to include in the propensity score – rarely will a comprehensive list exist. Useful criteria to consider include the explicit criteria used in determining participation in the intervention (i.e. project or program eligibility), as well as factors associated with self-selection (e.g. the subject’s distance from a project location by foot). Covariates that are affected by the intervention should not be included in the propensity score.
  • No matching algorithm is superior in every context; each involves a trade-off between efficiency and bias. Increase the reliability of the evaluation by running multiple matching algorithms, examining the balance statistics that each produces, and choosing the algorithm that achieves the best balance.
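Comparing balance statistics across matching algorithms might look like the sketch below. The covariate values, the two candidate comparison groups and the rule of ranking algorithms by their largest absolute t-statistic are all assumptions made for illustration. Other balance measures, such as standardized differences in means, could be substituted.

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Welch t-statistic for the difference in means between two samples."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)

def worst_imbalance(treated, comparison):
    """Largest absolute t-statistic across all covariates."""
    return max(abs(t_statistic(treated[k], comparison[k])) for k in treated)

# Hypothetical covariate values for the treated group and for the comparison
# groups produced by two different matching algorithms.
treated = {"age": [30, 32, 34, 36], "household_size": [4, 5, 5, 6]}
candidates = {
    "nearest_neighbor": {"age": [31, 32, 34, 35], "household_size": [4, 5, 6, 5]},
    "radius": {"age": [24, 26, 28, 30], "household_size": [3, 4, 4, 5]},
}

scores = {name: worst_imbalance(treated, group) for name, group in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("best-balanced algorithm:", best)
```

With real data the t-statistics would be compared against a critical value, and sample sizes would be far larger than the four units per group used here.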

Resources

Sources

Dehejia R and Wahba S (2002). “Propensity Score-Matching Methods for Nonexperimental Causal Studies”, Review of Economics and Statistics, Volume 84, Number 1, pp. 151-161.

Heckman J, H. Ichimura, and P. Todd (1997). “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program”, The Review of Economic Studies, Volume 64, Number 4, pp. 605-654.

Heckman J, H. Ichimura, and P. Todd, (1998), “Matching as an Econometric Evaluation Estimator”. Retrieved from www.uh.edu/~adkugler/Heckmanetal.pdf

Heckman J, H. Ichimura, J. Smith, and P. Todd (1998). “Characterizing Selection Bias Using Experimental Data”, Econometrica, Vol. 66, No. 5, pp. 1017-1098. Retrieved from http://www.jstor.org/stable/2999630

Jalan J, and Ravallion M (2003). “Estimating the Benefit Incidence of an Antipoverty Program by Propensity Score Matching”, Journal of Business & Economic Statistics, American Statistical Association, vol. 21(1), pp. 19-30, January. Retrieved from http://fmwww.bc.edu/RePEc/es2000/0873.pdf

Rosenbaum P and Rubin D. (1983) “The Central Role of the Propensity Score in Observational Studies for Causal Effects”, Biometrika, Vol. 70, No. 1, pp. 41-55

Rosenbaum P and Rubin D (1985). “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score”, The American Statistician, Volume 39, Number 1, pp. 33-38.

 

Updated: 5th November 2014 - 4:37am
A special thanks to this page's contributors
Author
Oxford University.
Oxford.
Reviewer
International Initiative for Impact Evaluation.
Arlington.