Variable Management
Data cleaning practices to help you manage data and ensure data integrity and reproducibility. Covers handling missing values, categorical and dummy variables, "specify" variables, skip logic, storage types, date variables, and logic tests for survey data.
- Coding best practices for data cleaning, such as recoding missing values and converting date formats, minimize manual errors and enhance efficiency.
- Proper handling of missing values, categorical variables, and date variables is crucial for ensuring data quality and consistency in survey data analysis.
- Logic tests and defensive programming techniques help identify and address data inconsistencies early, ensuring data integrity and reproducibility.
Raw survey data is often relatively clean at the variable level because survey software generates the data with various constraints on how the data are formatted. For example, multiple response variables only allow certain responses, which the analyst knows from the survey instrument. These responses do not change in content unless the survey does. For other types of variables, most computer-assisted personal interviewing software such as SurveyCTO allows constraints to be built into the survey instrument. IPA has several open-source tools to support data quality during data collection, such as the data management system, which checks for data quality concerns, and ipaclean, which provides a set of tools for data cleaning and validation.
Much of the cleaning work required for survey data involves standardizing variable formats to make them usable for analysis. This includes ensuring that missing and non-response values appear as missing in statistical software, that categorical formats are useful, and that variables have the correct storage type and orientation and follow consistent missingness patterns. These are substantive coding tasks and benefit from consistent, automated approaches. This section is primarily concerned with doing those tasks safely and quickly with minimal manual input or transcription.
Handling missing values
For several reasons, variables will have missing values for certain observations. The data should reflect why these values are missing, particularly for numeric variables. For instance, you may want to know whether someone did not answer a question about business revenues because the question was skipped—they were not business owners—or because they could not remember.
Missing values on import
When importing from other data formats, many statistical packages will create variables and observations that contain only missing values. Often, clearing those values immediately after import is useful. This can be done manually, or through the missing command, which can be installed from SSC. One approach is as follows:
*Import data starting on the row of variable names: B1
import excel using "${raw}/rawdata.xlsx", cellrange(B1) first clear allstring

*Remove rows and columns with only missing values
missing dropobs, force
missing dropvars, force // usually don't use force, but fine in this case

*Ensure non-missing IDs
isid id
Sometimes missing values appear as non-empty values in other data management software (SQL uses NULL; R uses two types, NA with four subtypes and NaN; etc.). Check data to ensure those values are removed or properly converted to missing after importing.
*Check string variables for anything that equals "NULL"
ds, has(type string) // only search strings
foreach var in `r(varlist)' {
    replace `var' = "" if `var' == "NULL"
}
For numeric variables, this can be done in one line using recode. For example, imagine all -99 values should be missing values:
*Recode -99 to missing for all values
recode * (-99=.) // assumes all variables in memory are numeric
Extended missing values in Stata
Stata has special codes for numeric missing values. For numeric variables, missing values are considered to be greater in value than all other numbers, and they themselves have an order of magnitude. The magnitude of missing values increases across the alphabet, with the standard missing value . coming before .a: . < .a < .b < .c < … < .z. The values .a, .b, etc. are called "extended missing values." Note that extended missing values cannot be stored in string variables. Instead, all string variables with a missing value are shown as "" (called "blank").
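This ordering can be confirmed directly in Stata:

di . < .a     // displays 1: .a sorts above the standard missing value
di .z > 1e300 // displays 1: every missing value exceeds any real number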
Ensure extended missing values consistently represent missing values in your data. For instance, if -99 and -999 refer to "don't know" in two different waves of the survey, they should be standardized. In that case, you will want to replace all -99 and -999 values with .d. It can be helpful to use the recode command to efficiently replace those values. The following standards for extended missing values are recommended:
Type of Missing | Extended Missing |
---|---|
Don’t know | .d |
Other | .o |
Not applicable | .n |
Refusal | .r |
Skip | .s |
Version differences | .v |
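As a minimal sketch of such a recode, assuming hypothetical variables q1-q3 in which -99 and -999 were used for "don't know" and -77 for "refusal":

*Map raw nonresponse codes to the standard extended missing values
recode q1 q2 q3 (-99 -999 = .d) (-77 = .r)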
Some responses may require the enumerator to switch from the standard missing value codes (for example, if a response is restricted to be positive, -99 may not be allowed). Other times, enumerators may enter the wrong missing value by mistake, such as -999 instead of -99. As part of the code to relabel missing values by type, you can include code that searches for these changes. The following code replaces multiple values and looks for positive versions of the missing value codes among outliers.
*Assign numerical codes
loc idk  -99 99 999 // numerical code for "Don't Know"
loc rf   -77 77 75  // numerical code for "Refuse"
loc na   -88        // numerical code for "Not Applicable"
loc oth  -66        // numerical code for "Other"
loc skip -70        // numerical code for "Skip"

*Replace values for each type of refusal across numerical variables
qui ds, not(type string)
local numvars `r(varlist)'
foreach var of local numvars {

    ** Replace missing values entered as positive for unlabeled numeric variables
    if mi("`: value label `var''") { // no labels from SurveyCTO import code

        *Skip variables that already take values less than 0 (codes could be entered directly)
        qui sum `var'
        if `r(min)' >= 0 {

            *Now check if the variable has positive versions of the missing codes
            qui sum `var' if inlist(`var', 99, 88, 77, 66)
            if `r(N)' > 0 {

                *Only change a value if it is the outlier (max or min) of the variable
                foreach val of numlist 99 88 77 66 {
                    qui count if `var' == `val'
                    di "`var' has `r(N)' cases of `val'"
                    qui sum `var'
                    qui replace `var' = -`val' if `var' == `val' & (`r(max)' == `val' | `r(min)' == `val')
                } // end foreach val of numlist 99 88 77 66
            } // end if positive versions found
        } // end if variable takes no values below 0
    } // end if mi("`: value label `var''")

    ** Relabel based on missing patterns
    foreach x of local idk {
        replace `var' = .d if `var' == `x' // Don't know
    } // end foreach x of local idk
    foreach x of local na {
        replace `var' = .n if `var' == `x' // N/A
    } // end foreach x of local na
    foreach x of local oth {
        replace `var' = .o if `var' == `x' // Other
    } // end foreach x of local oth
    foreach x of local rf {
        replace `var' = .r if `var' == `x' // Refuse
    } // end foreach x of local rf
    foreach x of local skip {
        replace `var' = .s if `var' == `x' // Skip
    } // end foreach x of local skip
} // end foreach var of local numvars
Check outliers and replacements manually, and search for modal responses in numeric variables where missing value codes will not appear as outliers. Accurate responses can overlap with missing response codes. Automated replacement code, such as the code above, should only be run after confirming that all inverted values are correct. Even better, defensive coding with assert or the pause command can catch these cases. The best solution is often to make sure that the data generating process guards against this type of messiness. Programming confirmation questions into the survey ("Did the respondent really answer 99, or is this a missing value?") can help accomplish this with more accuracy.
Categorical and dummy variables
Categorical variables are ones that have no obvious ordering to the responses. For example, the question "What crop do you grow?" could have the following answers: soy bean, maize, cassava, ground nut, and yams. It can be helpful to have a numeric value attached to each response, but there is no clear ordering here. The encode command assigns a numeric value to each response in alphabetical order and keeps the original response as the value label.
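As a minimal sketch, assuming a hypothetical string variable crop holding those responses:

encode crop, gen(crop_code) // alphabetical: cassava=1, ground nut=2, maize=3, ...
tab crop_code               // displays the original responses as value labels
tab crop_code, nolabel      // displays the underlying numeric codes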
Dummy variables are one of the most common variable types you will use. These are also referred to as Boolean or indicator variables. They typically take on the values 0 and 1. However, your variable could, and often should, have missing values where applicable. For example, an indicator for whether someone has a credit score could be defined as 1 for yes, 0 for no score, and missing meaning there was no credit information available about that individual. Note that 0 and missing have different meanings, and you should be careful about whether and which values are missing. In Stata, you have several options for creating a dummy variable. Two examples of creating this "Has a credit score" dummy variable are below.
- The first and recommended method is to start with an entirely missing variable and then replace with 1's and 0's for each condition.
gen has_score = .
replace has_score = 1 if credit_score != .
replace has_score = 0 if (credit_score == . & info_available == 1)
It is important to note that 0 is defined only for those who are missing a credit score but for whom credit information is available; the observations that remain missing are those missing a credit score because no credit information was available. You do not want to lose this information through poor coding that claims all of these people have no score.
- The second way is more concise but can be trickier.
gen has_score = (credit_score != .) if (info_available == 1)
This method does the above in one step. The first half creates a 1/0 dummy by assigning 1 to any observation that meets the condition in the first parentheses and 0 to any that does not. The if and the second set of parentheses then assign missing values: if the second condition (info_available == 1) is false, that observation will be missing. Either method is acceptable; just be careful to take the 0's and missing values into account.
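Whichever method you use, a quick tabulation of the result (here, the has_score variable created above) confirms that the counts of 0, 1, and missing match expectations:

tab has_score, missing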
“Specify” Variables
Survey questions may have a “specify” option, in which the respondent explains their answer or gives an alternate answer not available on the list of choices for that question. These are often triggered by an “other” response. In the raw dataset, alongside the original variable will be a “specify” variable that shows the comments for that question. You (or someone else familiar with the survey) will sometimes need to read through these specify answers and then recode the original variable accordingly. (“Specify” variables are also called other/specify variables, “other” variables, and free-text variables.)
For example, suppose a question asked for someone's favorite color, giving the options of blue, red, yellow, green, and other. If someone answered "other" and then wrote "sky blue" for their answer, you would want to recode the original variable for favorite color to say "blue" instead of "other." However, if someone wrote "purple" you could leave their response as is (or, if enough people wrote purple, you could add another category to the variable).
Particularly for large surveys, this can be a hassle. One helpful approach is to do the following:
- First clean each string variable so that similar answers will show the same value. Use string functions like lower(), trim(), and itrim() to convert answers like "PUR pLE", " Purple", and "purple" to all be "purple".
- For each specify variable, collapse the dataset into unique answers (for example, if three people wrote "purple" the collapsed dataset would only show "purple" once).
- Store those unique answers into a spreadsheet with another column that shows what variable the answer corresponds to. Then, leave one column blank, which you will eventually fill in with the value from the original variable that the response corresponds to (if it should be recoded). For instance, next to "purple" you would put nothing, but next to "sky blue" you would put 1 if 1 corresponded to the "blue" answer option.
- Write code to merge the manual corrections from the excel file back into the do file, instead of running this as a series of replace or if `var' == "sky blue" | `var' == "ocean blue" | ... commands in the do file.
Make sure you save do files and documents so that this process can be replicated and understood by someone else in the future.
This data flow would look like the following. First, the specify responses are cleaned and saved so that the excel sheet can be modified.
/* MR 10/25/2019:
The variable q_oth is the "specify" variable corresponding to
the variable q. If q == 99, then q_oth has a string value input
by the enumerator.
First, I standardize string responses of the q_oth variable and
then output an excel document with each unique response.
*/
*String cleaning
gen q_oth_cl = q_oth // create a clean copy to preserve the raw data
replace q_oth_cl = lower(q_oth_cl)
replace q_oth_cl = strtrim(q_oth_cl) // only trim external spaces so "sky blue" does not become "skyblue"

*Save excel file
tempvar map // create column header
gen `map' = ""
lab var `map' "Mapping"
lab var q_oth_cl "Other values"

*Save to temporary file folder in the project folder
preserve
keep q_oth_cl `map'
duplicates drop q_oth_cl, force
export excel q_oth_cl `map' using "${temp}q_oth.xlsx", firstrow(varl) replace
restore
This results in a table that looks like this:
Other | Mapping |
---|---|
sky blue | 1 |
ocean blue | 1 |
navy | 1 |
depends on the day | -66 |
purple |
The RA would fill out the mapping column based on the allowed values in the survey: the "Mapping" column takes 1 for "sky blue" if the survey uses 1 for "blue." If the data uses string values at this point in the cleaning process, the "Mapping" column could instead be filled with "blue." Ensure that this process is reproducible and rule based; inconsistent mapping of values does not create clean data. After this table is completed, it can be merged back into the file to save the values.
The code to complete that looks like this:
/* MR 10/25/19:
For question "q", specified other values were cleaned according to the
following rules:
-Any color response with more than 1% of the sample was
added as a category
-Specified colors that are a subset of an existing option (e.g. "sky blue") were mapped to that option
-Non-colors were replaced as missing
-Extended missing values were used if these should have been
an extended missing value captured by the survey.
These values were saved in a file in the Project Folder at:
../08_Analysis&Results/01_Cleaning/05_Temp/q_oth_mapped.xlsx
They will then be merged in and replaced to the individual variables.
*/
*Load in data and save a tempfile
preserve
import excel using "${temp}q_oth_mapped.xlsx", first clear
keep if !mi(Mapping) // only merge on mapped values
ren Othervalues q_oth_cl // rename back to q_oth_cl for the merge
tempfile q_oth_mapping
save `q_oth_mapping'
restore

*Merge on file
mmerge q_oth_cl using `q_oth_mapping', t(n:1) missing(nomatch)
/*
Alternatively, merge to the subset of responses with non-missing values and
append these files afterwards using "merge m:1 q_oth_cl using `q_oth_mapping'"
*/
assert _merge == -1 | _merge == 1 | _merge == 3 // missing, no coding, or coded
*Replace values
levelsof q // first collect every level of q and replace
loc levels `r(levels)'
foreach level of local levels {
    replace q = `level' if Mapping == `level'
}

*Replace extended values manually, mapping to IPA standard missing values
replace q = .d if Mapping == -66 // don't know
replace q = .r if Mapping == -77 // refusal
replace q = .n if Mapping == -88 // not applicable

*Finally check that everything was captured
assert q != 99 if _merge == 3 // this assumes 99 == other in the survey AND that all values were coded
drop _merge q_oth_cl Mapping // remove extraneous variables
Skip Logic
Surveys will likely have skip logic. For instance, if a respondent answers that they have zero goats, the survey may instruct the surveyor to (or the program may automatically) skip to questions about sheep instead of questions about the quality of goat cheese the respondent makes from their goats. When a survey is bench tested and piloted, you should have tested that these skips worked by developing a series of checks based on a close reading of the survey instrument itself. However, skips may not be passed to the final dataset by some programs, such as SurveyCTO. This can make it hard to check if there were any errors in the survey coding.
Defining additional missing values for skips and for questions that were not asked (such as those in long repeat groups) can be helpful for two reasons:

- Skip values that take the extended missing value .s can be identified using == .s without capturing other types of general missing values.
- Excluding skip values from the general missing value . can help to identify errors in cleaning later on, as Stata will not impute .s normally. Failures in skip logic can then be identified as part of the cleaning process.

If some of the skips did not work or allowed for some entry error among respondents, document the issues by outputting a list of the problematic observations into a spreadsheet and mention it to the PIs.
The following code assigns skip values and then confirms that the skips were successful during the survey implementation. It can be very helpful to use the assert command to check this. In addition, ensure that the observations who answered those questions are marked in the data by a dummy variable named in a consistent manner.
/*
In this example, there is a module that asks about business profits
only if the respondent has a business. The question that starts
a set of questions, b_prof_s*, on business profits is b_prof_yn.
All questions should be skipped if b_prof_yn == 0, but the
variables b_prof_s* exist if any respondent has a business.
First we assign the skip missing value to all observations if they
do not have a value. Then we run an assert to confirm skips worked
as intended. If they did not, the user is warned and
a dataset is saved.
*/
/* First identify if the respondent has a business and
fill skip values */
unab bus_items : b_prof_s* // save all business profits questions
foreach var of local bus_items {
/* create skip pattern; note that `var' == ., not mi(`var')
to ensure extended missing values are not overwritten */
replace `var' = .s if `var' == . & b_prof_yn == 0
}
** Now check to confirm that skips worked
foreach var of local bus_items {

    // don't use capture unless you control for every outcome
    cap assert `var' == .s if b_prof_yn == 0

    *Tag variables if the assert fails; otherwise report success, and
    *exit with an error if a different error than the assert failing occurs
    if _rc == 9 gen `var'_nos = `var' != .s & b_prof_yn == 0
    else if !_rc di "No errors in `var'"
    else exit _rc
} // end foreach var of local bus_items
** Export a list of each variable and whether it was skipped
/* Formatting could be done differently here; the below
outputs an excel sheet that preserves all other answers
and is in the wide format.
*/
preserve

*Save ID and relevant variables
keep id key startdate b_prof*

*Keep relevant observations
qui ds b_prof_*_nos
egen tokeep = rowmax(`r(varlist)')
keep if tokeep == 1
drop tokeep

*Order by variable and missing
foreach var of local bus_items {
    order `var' `var'_nos
}

*Save file
export excel using "${temp}business_skip_errors.xlsx", first(var) replace
restore
Storage Types in Stata
Statistical software requires a storage type to determine how and what data to store for each variable or value. The storage type determines whether the software treats a variable as text or a number, and how much information is stored in each variable, such as how many digits of precision are required or whether the variable is just 0s and 1s. Some data types such as dates have specific metadata attached (January 1, 1960 was a Friday) that relate to storage type.
Variables are stored in two broad categories: string (text) or numeric. For tasks in analysis such as regression, Stata requires categorical variables to be stored as numeric variables, not string variables. This also reduces the storage size of variables. As a rule of thumb in Stata, ordinal and categorical variables should be stored as numeric variables with labeled values. Only free text or IDs should ever be stored as string variables. Labeling categorical variables is preferred and should be treated as a part of data management.
Storage types such as string or numeric are the variable's type, which is different from its format. Variable formats affect how Stata displays values of variables to the user and are loosely related to the storage type; a string cannot be displayed with significant digits, for example.
Numeric variables are stored as byte, int, long, float, or double. Float and double are the two types that can hold non-integer numbers (decimals) and are the most common. More details on how numerical formats may affect datasets are available in this guide article. Numerical and string formats can be changed using the recast command, or by specifying a storage type with the generate command.
String variable storage types are identified by their character length (str4 holds 4 characters, str7 holds 7 characters, etc.). Variables are stored as strings if they contain any nonnumeric character (this includes commas and periods if they are imported as such). See help data_types for more information about variable types. Useful commands to move between string and numeric variables are destring and tostring. The command destring turns a string variable into a numeric one (the values must contain only numeric characters). The command tostring changes a numeric variable into a string variable. To convert strings to labeled numeric formats and vice versa, see the encode and decode commands (or the sencode and sdecode user-written commands).
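Minimal sketches of these four commands, using hypothetical variable names:

destring amount_str, gen(amount) // "1234" becomes 1234; errors if nonnumeric text remains
tostring id_num, gen(id_str)     // 42 becomes "42"
encode region, gen(region_code)  // string to labeled numeric
decode region_code, gen(region2) // labeled numeric back to string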
A variable's format controls how the data is displayed. This can be used to format numeric variables to display with commas or a specific number of decimal points. For details on the corresponding formats for each variable type and how to format variables, type help format. In short, it's important to match the displayed format to the content, especially for outputs, so that the content can be interpreted by humans.
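For example, a short sketch of setting a display format on a hypothetical variable:

gen double revenue = 1234567.891
format revenue %15.2fc // displayed as 1,234,567.89; the stored value is unchanged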
Date variables
Dates are especially complex to work with in Stata. Stata stores dates as a numeric variable that captures either the number of days, months, quarters, or years since January 1, 1960. It is important to know that dates can also have a time component (datetimes), and are then stored as the number of milliseconds since January 1, 1960. These formats are numeric, but once they are stored as a date in Stata, several special Stata functions can be applied to them to help with calculations that relate to dates and time. For more information see h datetime translation.
SurveyCTO defaults to storing date variables as strings when importing them into Stata. It's advantageous to convert these variables to a proper date format, as that allows for various logical and mathematical calculations. For example, you could count the number of days between a survey start date and submission date, or the number of minutes between when the survey was started and completed. The SurveyCTO datetime metadata (starttime, endtime, submissiondate) are already stored as date variables: starttime and endtime are formatted as datetimes (%tc) and submissiondate is formatted as a date (%td).
To convert imported dates to be stored as a date type in Stata, the functions date() and mdy() will convert non-date variables to date-formatted variables. The functions clock() and mdyhms() will convert non-date variables to datetime-formatted variables. The date() and clock() functions are useful if your dates are stored as a string variable, whereas the mdy() and mdyhms() functions are useful if your dates are stored as numeric variables. These commands will create a variable that is the number of days, or the number of milliseconds, since January 1, 1960.
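As a minimal sketch, assuming svy_date is a string like "04/23/2021" and svy_start is a string like "04/23/2021 14:05:00":

gen survey_day = date(svy_date, "MDY") // days since January 1, 1960
format survey_day %td
gen double survey_start = clock(svy_start, "MDY hms") // milliseconds; use double to avoid rounding
format survey_start %tc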
For a set of SurveyCTO datetimes using the same format, the following code will convert all of the variables to Stata date formats and check for data quality:
*Define list of date variables
ds *_date // this should match your naming convention
loc date_vars `r(varlist)'

*Save tempvar so indifferent to var length
tempvar temp

*Create double to store datetime to avoid rounding
gen double `temp' = .

*For each date variable, convert to datetime and run checks
foreach var of local date_vars {

    *Display progress
    di "Working on `var'"

    *Skip if the date is already formatted as any date
    /*
    Note: This could also standardize formats using
    the following code:
    if regexm("`: format `var''", "%tc") {
        replace `var' = cofd(`var')
        format `var' %tc_CCYY_NN_DD
    }
    else if regexm("`: format `var''", "%t") continue
    */
    if regexm("`: format `var''", "%t") continue

    *Convert from string to date
    /*
    Note: This specifies a format that all variables share.
    See h datetime in Stata for an overview on how to define
    datetime formats for the clock() and date() functions.
    */
    loc varl : var label `var' // save var label
    replace `temp' = . // clear tempvar
    replace `temp' = clock(`var', "MDYhms")

    *Drop the variable as cannot replace values and convert type
    drop `var'
    gen double `var' = `temp'
    format `var' %tcCCYY_NN_DD__HH:MM:SS // choose a preferred format
    lab var `var' "`varl'" // reapply the saved var label

    *Check within range
    loc survey_start = clock("Jan. 01 1960, 12:00:00 AM", "MDYhms")
    loc survey_end = clock("Feb. 29 2020, 11:59:59 PM", "MDYhms")
    assert inrange(`var', `survey_start', `survey_end')

    *Check non-missing for calculate fields
    /*
    This may not be the case for all projects if some
    calculate fields are conditional on being read
    */
    assert !mi(`var')
} // end foreach var of local date_vars
Some of these checks will duplicate checks conducted as part of IPA’s data management system. It doesn’t hurt to run these confirmations, but we recommend using the data management system to be aware of any problems as soon as possible.
If you would like to convert your daily variable into a monthly or yearly variable, you can use the mofd() and yofd() functions. You can find descriptions of all the functions that convert between the date count types (daily, monthly, quarterly, or yearly) by typing help date() and clicking on "Date and time functions" to go to the Stata manual.
Once you have the date in the right type, you can work on the formatting of the variable. The formatting is how you make the date look like a date rather than the underlying numeric value. You can read how to format your dates in any way imaginable by typing help datetime display formats. Note that changing the format of a date variable does not change its underlying value.
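For instance, reformatting the hypothetical survey_day variable from above changes only its display:

format survey_day %tdMonth_dd,_CCYY // displays as "February 29, 2020"
format survey_day %td               // back to the default "29feb2020"; the stored number never changes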
Logic tests
Clean data should be free of mistakes such as out-of-range values. Yet the thousands of lines of survey, database management, cleaning, and analysis code that make up projects often contain a number of mistakes. These mistakes are a natural part of any large project. Instead of focusing on having mistake-free code in all cases, it's often more effective to write code that tests whether the data match expectations. This approach does not substitute for code review and being careful, but it catches the mistakes that inevitably slip through.
Commands like assert in Stata and stopifnot() in R allow the programmer to break the code if some logic test is not met. Logic tests check whether the values observed in the data make sense given other values. For instance, can someone who says they were born in 2000 really have been a member of their current community for thirty years? These tests are used to ensure that the clean data is truly clean, not just that the programmer thinks the data are clean.
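A minimal sketch of that test, assuming hypothetical variables birth_year and years_in_community from a 2020 survey:

*Community tenure cannot exceed (approximate) age
assert years_in_community <= 2020 - birth_year if !mi(years_in_community, birth_year)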
The list of logic tests to run depends on the individual dataset. Check for things that would be problematic for data analysis, such as missing values in variables you expect to always have a value. These checks should also cover problems that would clearly indicate something is wrong with the data, such as responses to a Likert-scale survey question that weren't in the list of programmed responses. If any of those tests fail, flag the values in the data and bring them up with the PIs.
Defensive Programming
In computer science, programming that ensures a program can function under unanticipated conditions is called defensive programming. It can be useful to bring this mentality into social science research. In many cases, defensive programming is natural with the type of data we work with: many variables have characteristics that we expect and can easily check. For example, in a survey of employed adults in the US with a working age of 18, no respondent should have an age below 18.
We can check that this is the case in Stata:
assert age >= 18 & !mi(age) // missing (.) counts as larger than any number, so age >= 18 alone would pass for missing values; !mi(age) catches them
Building these tests into your data flow should be a natural part of data cleaning, and the approach extends to other operations like merges. If every household but one received both an agricultural and a household survey, each household survey should merge to the plot dataset except for one observation. The merge can be checked to confirm that this is the case:
*Check merge success rate
merge 1:1 id using "plots.dta" // note: merge has a built-in assert option
count if _merge != 3 // check how many merges did not have an observation in both
assert `r(N)' == 1
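When every observation should match, the built-in assert() option mentioned in the comment performs the check in one step:

merge 1:1 id using "plots.dta", assert(match) // errors immediately if any observation fails to match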
These tests should be built into cleaning code, as they are computationally cheap unless data is large. They should be executed every time the code runs, and should test major modifications that affect important variables.
Logic Tests for Survey Data
Survey data has the benefit of coming from a survey that you, as an analyst, have probably programmed. The survey software and import process often control for many types of errors. However, data management errors can also occur due to the particular forms that survey data takes. These errors are common in survey data management:
- Missing values occur in variables that the programmer assumes always have real values
- Missing values are inconsistently defined in the survey and are not fully replaced (e.g. “Not Applicable” is -99, -98, and 77)
- Variable or value options change between survey versions and aren’t reconciled
- Variables with the same name are overwritten as part of a merge. See h mmerge for a user-written command that stores this information.
- Observations are not unique and some actions such as sorting become irreproducible
Testing for these errors as part of the cleaning process is strongly recommended.
Logic Tests in Administrative Data
There are often broader concerns for administrative data since analysts do not directly control the data generation and management process. This means that data definitions and formats may change or be inconsistent over longitudinal data.
Some additional concerns to focus on in administrative data are:
- Ensure that variables remain consistent over repeated deliveries
- Data translation and missingness standards of the storage system may create non-missing values in statistical software (e.g. SQL treats missing as "NULL")
Variable consistency can be handled using value labeling in Stata. For example, to ensure that data remain the same over the course of data collection, it can be useful to check that categorical variables map to the expected values in a previously created value label. The noextend option of the encode commands allows you to confirm that no values exist in the variable other than those in the supplied label. The following code accomplishes this:
*Create a local list of variables to encode
loc str_var var1 var2 // hypothetical names; list your string variables here

*Encode values and confirm expected
foreach var of local str_var {

    *Encode variables
    sencode `var', label(`var'_label) replace noextend // type h sencode to see options
} // end foreach var of local str_var