Data Cleaning

Synonyms: Data cleansing, datawash, data scrubbing

Data cleaning is the detection and removal (or correction) of errors and inconsistencies in a data set or database that result from corruption or inaccurate entry of the data. Incomplete, inaccurate or irrelevant data are identified and then replaced, modified or deleted.

Incorrect or inconsistent data can lead to false conclusions, so data cleaning can be an important step in data analysis. However, data cleaning carries risks of its own, including the loss of important information or the removal of valid data.

A large variety of tools is available to support data cleaning. In addition, many statistical programs have built-in data validation that can pick up some errors automatically, for example invalid variable codes.
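As a rough illustration of what automatic validation of variable codes can look like, the sketch below checks each record against a list of valid codes. It is a minimal example in plain Python, not the method of any particular statistical program; the `CODEBOOK` contents, the variable names and the `find_invalid_codes` helper are all hypothetical.

```python
# Hypothetical codebook: variable name -> set of valid codes.
# The variables and codes are assumptions made for this example.
CODEBOOK = {
    "gender": {0, 1, 2},    # 0 = missing, 1 = male, 2 = female
    "employed": {0, 1, 2},  # 0 = missing, 1 = yes, 2 = no
}


def find_invalid_codes(records, codebook):
    """Return (row_index, variable, value) for each value not in the codebook."""
    problems = []
    for i, record in enumerate(records):
        for variable, valid_codes in codebook.items():
            value = record.get(variable)  # None if the variable is absent
            if value not in valid_codes:
                problems.append((i, variable, value))
    return problems


records = [
    {"gender": 1, "employed": 1},
    {"gender": 12, "employed": 2},  # 12 is not a valid gender code
    {"gender": 2},                  # "employed" was never recorded
]

invalid = find_invalid_codes(records, CODEBOOK)
print(invalid)
```

A check like this only reports problems. Deciding whether to correct or remove each flagged value remains a separate step.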

Advice

Advice for USING this option

  • Back up your data before starting your data cleaning process. 
  • Create a list of all variables, variable labels and variable codes.
  • Decide which variables are crucial to the analysis and must have values for a response to count as complete. Survey responses often come back with missing data for certain questions and variables. If a value is missing on a crucial variable, the data from that survey response will not be useful.
  • Look for coding errors
    • A variable such as gender will in most cases have the possible codes 1 = male, 2 = female and 0 = missing, so a code of 12 would be an error
    • Other errors might include missing data values
    • Running frequencies (a count of how often each code occurs) can help to identify errors
  • Look for outliers
    • Outliers can hide or create statistical significance, so it is important to identify them
    • Creating a bar graph or similar plot is one way to identify outliers quickly
  • Check for logical consistency of answers
    • Cross-tabulating pairs of variables is one way of rooting out inconsistencies
  • Decide how to deal with incorrect or missing values. Some options are:
    • Removing responses with missing or incorrect values
    • Correcting missing or incorrect data if the correct value is known
    • Going back to the data source and filling in the missing values
    • Setting values to an average or other statistical value
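The checks in the list above can be sketched in a few lines of plain Python. This is an illustrative example only: the survey records, the variable names (`attended`, `rating`) and their codes are invented, and the 1.5 × IQR rule used to flag outliers is a common rule of thumb substituted here for inspecting a graph by eye.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical survey responses; every name and value is illustrative.
# gender: 1 = male, 2 = female, 0 = missing
# attended: 1 = yes, 2 = no; rating: 1-5, or 0 = not applicable
responses = [
    {"id": 1, "gender": 1, "age": 34, "attended": 1, "rating": 4},
    {"id": 2, "gender": 2, "age": 29, "attended": 1, "rating": 5},
    {"id": 3, "gender": 12, "age": 41, "attended": 2, "rating": 0},
    {"id": 4, "gender": 1, "age": 430, "attended": 1, "rating": 3},
    {"id": 5, "gender": 2, "age": 38, "attended": 2, "rating": 4},
    {"id": 6, "gender": 2, "age": None, "attended": 1, "rating": 2},
    {"id": 7, "gender": 1, "age": 36, "attended": 2, "rating": 0},
]

# 1. Coding errors: a frequency count exposes codes outside the valid set.
gender_counts = Counter(r["gender"] for r in responses)
bad_code_ids = [r["id"] for r in responses if r["gender"] not in {0, 1, 2}]

# 2. Outliers: flag ages beyond 1.5 * IQR from the quartiles.
ages = [r["age"] for r in responses if r["age"] is not None]
q1, _, q3 = quantiles(ages, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outlier_ids = [
    r["id"] for r in responses
    if r["age"] is not None and not (low <= r["age"] <= high)
]

# 3. Logical consistency: cross-tabulate attendance against "gave a rating".
crosstab = Counter((r["attended"], r["rating"] > 0) for r in responses)
inconsistent_ids = [
    r["id"] for r in responses if r["attended"] == 2 and r["rating"] > 0
]

# 4. Handle problems: drop flagged responses, then set a missing age
#    to the median of the ages that were not flagged as outliers.
flagged = set(bad_code_ids) | set(outlier_ids) | set(inconsistent_ids)
typical_age = median(
    r["age"] for r in responses
    if r["age"] is not None and r["id"] not in outlier_ids
)
cleaned = [
    {**r, "age": typical_age if r["age"] is None else r["age"]}
    for r in responses
    if r["id"] not in flagged
]

print(gender_counts, crosstab, sorted(flagged), cleaned, sep="\n")
```

The sketch builds a new `cleaned` list rather than modifying `responses`, in keeping with the advice to back up the data first. In practice each flagged record would normally be checked against the data source before being dropped or changed, since automatic removal risks discarding valid data.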

Resources

Guides

  • Data Cleaning 101 outlines a step-by-step data cleaning process for verifying that data values are correct or, at the very least, conform to a set of rules.

Sources

Rahm, E., & Do, H. H. (n.d.). Data cleaning: Problems and current approaches. University of Leipzig, Germany. Retrieved from http://wwwiti.cs.uni-magdeburg.de/iti_db/lehre/dw/paper/data_cleaning.pdf

Wikipedia (2012). Data cleansing. Retrieved from http://en.wikipedia.org/wiki/Data_cleansing

Updated: 11th August 2014 - 10:51am