Data cleaning

Synonyms: data cleansing, datawash, data scrubbing

Data cleaning involves the detection and removal (or correction) of errors and inconsistencies in a data set or database due to data corruption or inaccurate entry. 

Incomplete, inaccurate or irrelevant data is identified and then either replaced, modified or deleted. 

Incorrect or inconsistent data can lead to false conclusions, so data cleaning can be an important part of data analysis. However, data cleaning carries risks of its own, including the loss of important information or valid data.

A wide variety of tools is available to support data cleaning. In addition, many statistical programs have built-in data validation that can pick up some errors automatically, such as invalid variable codes.
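
As a rough illustration of this kind of automatic check, the Python sketch below counts how often each code occurs for one variable and flags any code that is not in an allowed set. The codebook, the function name find_invalid_codes and the sample responses are all made up for this example; real statistical packages implement validation in their own ways.

```python
from collections import Counter

# Hypothetical codebook: the allowed codes for each variable (an assumption for this example).
CODEBOOK = {"gender": {0: "missing", 1: "male", 2: "female"}}


def find_invalid_codes(responses, variable, codebook=CODEBOOK):
    """Return {code: count} for codes of `variable` that are not in the codebook."""
    allowed = codebook[variable]
    # Frequency count of every code that appears for this variable.
    counts = Counter(r.get(variable) for r in responses)
    return {code: n for code, n in counts.items() if code not in allowed}


# Made-up survey responses; the code 12 stands in for a data-entry error.
responses = [
    {"id": 1, "gender": 1},
    {"id": 2, "gender": 2},
    {"id": 3, "gender": 12},
    {"id": 4, "gender": 0},
    {"id": 5, "gender": 2},
]

invalid = find_invalid_codes(responses, "gender")
print(invalid)  # {12: 1}
```

A check like this only reports that a code is invalid. Deciding whether to correct or remove the affected response is still a judgment for the analyst.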

Advice for using this method

  • Back up your data before starting your data cleaning process. 
  • Create a list of all variables, variable labels and variable codes.
  • Decide which variables are crucial to the analysis and must have values for a response to count as complete. Survey responses often come back with missing data for certain questions and variables. If a value is missing for a crucial variable, that response may not be usable in the analysis.
  • Look for coding errors
    • A variable such as gender will usually have a small set of valid codes, for example 1 = male, 2 = female, 0 = missing. In that case a code of 12 would be an error
    • Other errors might include missing data values
    • A frequency count of the codes for each variable can help to identify errors
  • Look for outliers
    • Outliers can hide or create statistical significance, so it is important to identify them
    • Plotting the values, for example in a bar graph or similar, is one way to quickly identify outliers
  • Check for logical consistency of answers
    • Cross-tabulating pairs of variables is one way of rooting out inconsistencies
  • Decide how to deal with incorrect or missing values. Some methods are:
    • Removing responses with missing or incorrect values
    • Correcting missing or incorrect data if the correct value is known
    • Going back to the data source and filling in the missing values
    • Setting values to an average or other statistical value
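
Several of the steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended procedure: the survey rows, the variable names and the choice of the interquartile-range (IQR) rule for outliers are all assumptions made for the example, and whether to drop or fill a value depends on the analysis.

```python
from collections import Counter
from statistics import mean, quantiles

# Made-up survey rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "employed": "yes", "hours_worked": 40},
    {"id": 2, "age": 29, "employed": "no", "hours_worked": 0},
    {"id": 3, "age": 41, "employed": "no", "hours_worked": 35},    # logically inconsistent
    {"id": 4, "age": 380, "employed": "yes", "hours_worked": 38},  # probably a typing error
    {"id": 5, "age": None, "employed": "yes", "hours_worked": 42},
    {"id": 6, "age": 37, "employed": "yes", "hours_worked": None},
    {"id": 7, "age": 45, "employed": "yes", "hours_worked": 36},
]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Look for outliers on the observed (non-missing) ages.
ages = [r["age"] for r in rows if r["age"] is not None]
outliers = iqr_outliers(ages)

# Check logical consistency: cross-tabulate employment status against
# whether any working hours were reported.
table = Counter(
    (r["employed"], r["hours_worked"] > 0)
    for r in rows
    if r["hours_worked"] is not None
)
inconsistent = table[("no", True)]  # not employed, yet hours reported

# Treat age as the crucial variable: remove responses where it is
# missing or flagged as an outlier.
kept = [r for r in rows if r["age"] is not None and r["age"] not in outliers]

# Set the remaining missing hours to the mean of the observed values.
observed = [r["hours_worked"] for r in kept if r["hours_worked"] is not None]
fill = mean(observed)
cleaned = [
    {**r, "hours_worked": fill if r["hours_worked"] is None else r["hours_worked"]}
    for r in kept
]

print(outliers)      # [380]
print(inconsistent)  # 1
print([r["id"] for r in cleaned])  # [1, 2, 3, 6, 7]
```

Note that the sketch leaves the inconsistent response (id 3) in place. The cross-tabulation only reveals the problem; resolving it usually means going back to the data source.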

Resources

Rahm, E., & Do, H. H. (n.d.). Data cleaning: Problems and current approaches. University of Leipzig, Germany. Retrieved from http://wwwiti.cs.uni-magdeburg.de/iti_db/lehre/dw/paper/data_cleaning.pdf

Wikipedia (2012). Data cleansing. Retrieved from http://en.wikipedia.org/wiki/Data_cleansing
