Data cleaning is the detection and removal (or correction) of errors and inconsistencies in a data set or database that arise from corrupted or inaccurately entered data. Incomplete, inaccurate or irrelevant data are identified and then replaced, modified or deleted.
Incorrect or inconsistent data can create a number of problems that lead to false conclusions, so data cleaning can be an important element of data analysis. However, data cleaning is not without risks, including the loss of important information or valid data.
A large variety of tools is available to support data cleaning. Additionally, many statistical programs have data validation built in, which can pick up some errors automatically, for example invalid variable codes.
Advice for USING this option
- Back up your data before starting your data cleaning process.
- Create a list of all variables, variable labels and variable codes.
- Decide which variables are crucial to the analysis and must have values for a response to count as complete. Survey responses will often come back with missing data for certain questions and variables. If data are missing for a crucial variable, the response from that survey will not be usable.
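The screening step above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation; the variable names (`age`, `gender`) and the data are made up:

```python
# Hypothetical sketch: drop survey responses that are missing any
# crucial variable. Treat None and empty strings as missing.
CRUCIAL = {"age", "gender"}

def complete_responses(responses):
    """Keep only responses with non-missing values for every crucial variable."""
    return [r for r in responses
            if all(r.get(var) not in (None, "") for var in CRUCIAL)]

responses = [
    {"id": 1, "age": 34, "gender": 2},
    {"id": 2, "age": None, "gender": 1},  # missing crucial value -> dropped
    {"id": 3, "age": 51, "gender": ""},   # empty crucial value -> dropped
]
print(complete_responses(responses))  # only response 1 survives
```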
- Look for coding errors
- A variable such as gender will in most cases have the possible codes 1 = male, 2 = female and 0 = missing, so in this case a code of 12 would be an error
- Other errors might include missing data values
- A frequency test can help to identify errors
- Look for outliers
- Outliers can hide or create statistical significance and are important to identify
- Creating a bar graph or similar is one way to quickly identify outliers
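A numeric check complements the visual one. This sketch uses the common 1.5 × IQR rule as one possible outlier criterion; the data are made up:

```python
# Flag values lying more than 1.5 interquartile ranges outside the quartiles.
import statistics

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q = statistics.quantiles(values, n=4)  # [Q1, median, Q3]
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

ages = [23, 25, 27, 29, 30, 31, 33, 35, 240]  # 240 is almost surely an entry error
print(iqr_outliers(ages))  # -> [240]
```

Whether a flagged value is an error or a genuine extreme case still needs a judgement call; the rule only draws attention to it.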
- Check for logical consistency of answers
- Cross-tabulating pairs of variables is one way of rooting out inconsistencies
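Cross-tabulation can be sketched with a counter over pairs of answers. The variables and codes below (`smokes`, `cigs_per_day`) are hypothetical examples of two answers that must agree:

```python
# Tabulate combinations of two related answers; impossible cells
# (e.g. a non-smoker reporting cigarettes per day) indicate inconsistency.
from collections import Counter

responses = [
    {"smokes": 0, "cigs_per_day": 0},
    {"smokes": 1, "cigs_per_day": 10},
    {"smokes": 0, "cigs_per_day": 5},  # logically inconsistent pair
]

crosstab = Counter((r["smokes"], r["cigs_per_day"] > 0) for r in responses)
print(crosstab)

# (smokes == 0, reports cigarettes) is an impossible combination
inconsistent = crosstab[(0, True)]
print(inconsistent)  # -> 1
```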
- Decide how to deal with incorrect or missing values. Some options are:
- Removing responses with missing or incorrect values
- Correcting incorrect data where the correct value is known
- Going back to the data source and filling in the missing values
- Setting missing values to an average or other statistical value
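The last option above, replacing missing values with an average (mean imputation), can be sketched as follows. The data are illustrative; note that imputation can bias variance estimates, so it should be used with care:

```python
# Replace missing (None) values with the mean of the observed values.
import statistics

incomes = [42000, None, 38000, 51000, None, 45000]

observed = [v for v in incomes if v is not None]
mean = statistics.fmean(observed)  # mean of the non-missing values

cleaned = [v if v is not None else mean for v in incomes]
print(cleaned)  # missing entries replaced by 44000.0
```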
- Google Refine: Tool of the Year for Evaluators: provides an overview of Google Refine, a downloadable desktop application that can be used to calculate frequencies, multi-tabulate data from large datasets and clean up your data. (AEA)
- Data Cleaning: Problems and Current Approaches: explains the main problems that data cleaning can correct and provides an overview of the available solutions for implementing it. (University of Leipzig)
- Data Cleaning 101: outlines a step-by-step process for verifying that data values are correct or, at the very least, conform to a set of rules.
Rahm, E., & Do, H. H. (n.d.). Data cleaning: Problems and current approaches. University of Leipzig, Germany. Retrieved from http://wwwiti.cs.uni-magdeburg.de/iti_db/lehre/dw/paper/data_cleaning.pdf
Wikipedia. (2012). Data cleansing. Retrieved from http://en.wikipedia.org/wiki/Data_cleansing