Raw Data Management
Best practices for importing and manipulating raw data, covering data structures, unique identifiers, PII handling, and file format conversions for both survey data and administrative data.
- Data management requires careful consideration of data structures, wide vs. long format, unique identifiers, and modifications to ensure reproducibility and data integrity.
- Implementing proper PII handling procedures, including early de-identification of data and encryption, is crucial for ensuring data security and compliance with research protocols.
When we refer to “raw data” we mean data in its original form, as initially received from the data source. Raw data come in the format of the data collection instrument used to generate them, be it a survey form, administrative database, computer logs, a customer relationship management system, or other source. These formats usually reflect what was best for capturing the data, not for processing it. Converting from the source format to one usable by statistical software often requires changing file formats, changing data formats, and general error correction. This section of the cleaning guide covers that process, primarily focused on getting data from SurveyCTO to a preferred structure in Stata.
Survey data
Most IPA research projects include survey data. Survey data has the advantage of a consistent format and a known structure. SurveyCTO data, especially, has tools that aid import into Stata. Deciding the data structure during survey programming provides important benefits for raw data management.
Survey data’s structure also carries unique challenges for data management: analysts must standardize formats of survey data and responses for analysis, add metadata to surveys, and complete corrections. Finally, since raw survey data often holds many sources of personally identifying information, de-identification is often important and requires specific focus. Changes in survey version content may complicate this process and require multiple imports. This section provides information on these tasks. This section does not provide information on how to program surveys to make data management easy.
IPA has developed tools to support survey implementation and data quality assurance. IPA’s data management system (DMS) provides Excel and Stata-based resources to handle data quality checks. All IPA projects require the DMS. It provides information on data quality and will resolve duplicates before data cleaning begins, among many other features.
Administrative data
Administrative data often undergoes preprocessing for functional reasons. Working with SQL databases and other data management software provides additional challenges as data fields may change definitions over time and data structure may not be easily amenable to transfer or import into statistical software such as Stata or R.
No single process fits all administrative data management needs. Automating importing and restructuring, as well as checking data quality, are common analyst challenges before traditional cleaning begins. This section provides some information on how to handle these tasks in automatable fashions, as well as ways to think about data storage and unique identification. Other management tasks can be found in the data aggregation section.
Larger datasets amplify these problems. Big data requires special data management techniques when data size reaches the limits of statistical software capabilities, processing power, or storage format. Excel, for example, can only store 1,048,576 rows in .xlsx format and 65,536 rows in .xls. This guide does not cover these challenges in detail, but many resources exist for processing big data in Stata—see NBER and Statalist for tips.
Importing into Stata
Data comes in many forms, from raw text (.txt) files to multi-sheet Excel (.xls and .xlsx) files. Importing data into Stata is necessary if the data is not already in Stata format (.dta file). In general, you should be able to use Stata’s functions and loops to efficiently import data. Stata can import most data formats. If your project’s data was collected using a platform like SurveyCTO, the raw data will come in .csv format and the SurveyCTO server will provide a do file that imports the raw .csv data into the Stata .dta format.
Importing different file types
In Stata 13 and beyond, the import command can import CSV files, Excel files, and more, depending on the option used (delimited for CSV, excel for Excel), and export does the same for exporting. The import excel and import delimited commands provide a number of options that allow a great deal of control over importing, including where in the workbook data should be imported from and how data should be saved. See help import for details on these options.
If you are new to using import or are importing a file type you have not seen before, it can be helpful to use the drop-down menu by clicking “File>Import” and then selecting the appropriate file type. Once you do this, you can copy the specific command syntax directly from the command prompt or review window in Stata into your do-file.
The insheet command remains an alternative for .csv, .tsv, and .txt files. It performs differently than import delimited and can be useful for some forms of data, but import delimited should be used preferentially. One thing to note: when using insheet, it is often a good idea to use the names option so that your dataset takes the same variable names as those at the top of the raw dataset.
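For example, a minimal import of a comma-delimited file might look like the following (the file path and options are illustrative):

*Import a .csv whose first row holds variable names (path is an example)
import delimited using "$raw/household_survey.csv", varnames(1) bindquote(strict) clear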
Importing multiple files at once
A useful tool for importing multiple files within a folder is the dir extended macro function (see help extended_fcn in Stata). This function allows you to store the names of all files in a folder in a local so you can loop through them for importing. See example code of this process below.
/* This stores all files with the extension .xlsx in the
"$raw" data folder into a local "files" */
local files: dir "$raw" files "*.xlsx", respectcase

/* Loop through the files to import, clean the file name,
and save as a dta */
foreach file in `files' {

    *Show your progress of which file you are working on
    di in red "working on `file'"

    *Import each file
    import excel using "$raw/`file'", clear firstrow

    **Quality Checks (Optional)

    *Assert you have the correct number of observations
    qui count
    // If you know the amount of expected observations
    assert `r(N)' == [number_of_expected_observations]

    /*Check that what variables you think should be unique
    identifiers are indeed unique. */
    isid unique_id_var

    *Check for expected/necessary variables
    confirm var expected_var_names

    * Edit filename
    /*The filenames in the local "files" include the
    extension (in this case .xlsx). So, I remove these and
    make a new clean file name to save the files as.
    You can edit the filenames however you see fit. */
    local cleanfilename = subinstr("`file'", ".xlsx", "", .)

    *Save the file with the new clean file name as a dta file
    save "filepath/dtafiles/`cleanfilename'_raw.dta", replace
}
Note that the new .dta files are not saved in the same folder as your raw Excel, CSV, or other source files. Since the imported data are no longer raw, they should be saved in a temporary or intermediate data folder.
It can be helpful to set up a “dta” or “temp” folder for these intermediate data files before you start. You can create the folder directly in your script using mkdir "filepath". If the directory already exists, mkdir will throw an error. One solution is to use the capture command and type cap mkdir "filepath", which suppresses the error; however, we recommend avoiding capture in most situations.
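If you prefer to avoid capture, one alternative is to check whether the folder exists before creating it. A minimal sketch, assuming an illustrative path stored in a global:

*Create the temp folder only if it does not already exist
mata : st_local("dir_exists", strofreal(direxists("${proj}/dta_temp")))
if "`dir_exists'" == "0" {
    mkdir "${proj}/dta_temp"
}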
Data structure and reshaping
Wide and Long Data
One way to describe a dataset is whether it is stored in a “long” or “wide” format. Long data keeps repeated values as separate observations (rows):
Household ID | Member ID | Treatment |
---|---|---|
HH001 | 01 | T |
HH001 | 02 | C |
Wide data describes data that stores similar values as variables (columns). The same example could be displayed wide by expanding the Member ID column to be a suffix of the Treatment column, like the following:
Household ID | Treatment_01 | Treatment_02 |
---|---|---|
HH001 | T | C |
There is no single correct choice for how data should be stored; you should decide on the format that makes the data most usable for the analysis being conducted. As a rule of thumb, in a clean dataset each observation is a row, each variable is a column, and each type of observation is a separate dataset (Wickham, 2014). However, the unit of analysis may change across analyses: one analysis could be conducted at the person-year level, another at the person level. The context of the data being analyzed determines what format the data should be stored in; either format works as long as an ID variable exists for each level of the data (e.g. a long dataset may have a household ID and a household member ID). For more guidance on how Stata treats “wide” and “long” datasets, as well as how to change between them (reshape the data), type help reshape in Stata.
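As a minimal sketch, the household example above could be moved between the two formats as follows (variable names are illustrative):

*Wide to long: Treatment_1, Treatment_2, ... become one row per household member
reshape long Treatment_, i(household_id) j(member_id)

*Long back to wide: one row per household
reshape wide Treatment_, i(household_id) j(member_id)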
When cleaning data, it’s generally a good idea to decide how the data will be stored based on how it will be cleaned. For example, if one goal of cleaning is to produce an income variable at the household-level, but income is collected at the respondent-day-level, it may be a good idea to clean the income module separately and keep that dataset long. Then, the income data can be collapsed to the household-level and merged onto the household-level dataset later in the variable construction process. How, or if, that happens is up to you as the analyst.
There is one point at which the data has a known unit of analysis: during high frequency checks and backchecks. Data is always at the unit of the survey submission when doing quality control. One observation is one survey submission. For multi-day surveys, a survey submission may be a different level of data than the survey.
SurveyCTO Data
When downloading from SurveyCTO, you can choose whether to export the data in wide or long format. If the data are exported long, SurveyCTO also produces a dataset that links the long-format datasets on their unique IDs.
Keeping the data in long format means that SurveyCTO will automatically organize observations at the largest unit level (e.g. household-level) and will save all sub-unit level data into separate datasets for each repeat group (e.g. person-level, plot-level, etc.). This means you will have a larger number of datasets to work with (and perhaps a more involved merging process if you want to compile all the data), but each dataset will be at the unit of the question, so may be more intuitive to work with.
Wide data saves you the process of merging manually. If any repeat groups exist, sub-unit level data will automatically be reshaped to fit the main dataset. For example, names of individual household members collected by the variable name will be reshaped as name_1 (for the first member), name_2 (for the second member), etc. It is easier to run quality checks on wide data, as these data will contain all data collected by the survey. However, these datasets can also grow very large very quickly and become unwieldy to work with. To load wide data into Stata it may be necessary to increase the number of variables Stata can read in a single dataset (up to 32,767 for Stata/SE) by typing set maxvar 32767.
Alternative to reshape
To convert between wide and long data, Stata uses the reshape command. Reshaping is a very computationally intensive command: with a large dataset, reshape can take an excessively long time or even break the current Stata session. An alternative is to code a reshape manually using expand and replace, which runs much faster. It also builds an understanding of how a reshape transforms your data structure, and variable labels can be modified with more control when done manually.
The following code reshapes a wide dataset by person to a long dataset by person-month. The variable that we want to reshape is income.
/*Count all of the income variables and create
that many copies of each observation*/
ds income*
local copies : word count `r(varlist)'
expand `copies'

/*Then create a list of all of the vars with each stub
and manually expand*/
gen yearmonth = . // create an empty var that will hold the sub-group identifier (this is the j var in reshape)

foreach var in income {

    *Save all of the income variables
    ds `var'_*
    local reshapevarlist `r(varlist)'

    /*remove the stub so that we can have the yearmonth
    or j identifier for each var alone while maintaining their order*/
    local monthyear = subinstr("`reshapevarlist'", "`var'_", "", .)

    *create an empty version of the stub, this will become the long var
    gen `var' = .

    *Loop through each value
    forvalues x = 1/`copies' {

        *Replace the variable from the list we defined earlier in the loop
        /* See "h mod" to understand how mod works,
        but in short "if mod(_n, `copies') == `x'"
        will only replace the variable for the nth observation
        in each group defined by
        an ID variable (e.g. the nth or last row created in the expand)
        */
        local currvar : word `x' of `reshapevarlist'
        replace `var' = `currvar' if mod(_n, `copies') == `x'

        *Generate the identifier variable
        local yearmonth : word `x' of `monthyear'
        replace yearmonth = `yearmonth' if mod(_n, `copies') == `x'

        *Drop the wide variable
        drop `currvar'

    } // end forval x = 1/`copies'
} // end foreach var in income
Adding or replacing data
Raw data may not contain the full or correct set of data. There are many reasons why this may be the case: survey responses may need to be corrected by enumerators, translations for responses need to be added, admin data may have entry errors, etc. It is relatively simple to make changes in data using statistical software, but it is important to make changes in a systematic way to ensure all modifications are reproducible.
There are at least two common situations in IPA projects where having a standardized data flow for modifying or adding data will increase data quality: replacements and translations. This article suggests best practices to add data collected outside of a survey form.
Replacements
As you are collecting data, there will inevitably be errors in your data that need to be manually corrected. It is important to always maintain the raw dataset with the original collected data. Once you have confirmed that a value in your dataset is incorrect and needs to be changed, this replacement should be made and saved in a new dataset, before you have done any other necessary cleaning.
When making a replacement, confirm that you are using a truly unique value to identify the observation. For example, the key variable should be used when making replacements in SurveyCTO data, since there can be duplicates in your ID variable, or you may need to make a replacement in the ID variable itself.
For every replacement you make to your dataset, you must record:
- Who made the original error
- Who confirmed it was an error
- The original value
- The new value
- The reason for the change
One way to make replacements is to use the replace command, conditioning on the observation’s key value:

*enum confirmed they added an extra zero to income
replace income = 1000 if key == "uuid:2b2763e1-71b6-4e1e-8023-c15cdf7fa39d"
If you are making multiple replacements, this method can create long do files and make it difficult to keep track of which replacements have been made. It can also lead to PII appearing in your do files if you are making replacements on PII or other sensitive data. To avoid having to encrypt your do files, consider using the user-written commands readreplace or ipacheckreadreplace (ipacheckreadreplace is a wrapper for readreplace). Both commands use an Excel file as an input sheet, where all replacements, notes, original values, and new values are logged. ipacheckreadreplace has a template replacements Excel file that you can download when you run the ipacheck command.
If you are using IPA’s Data Management System, this code snippet is included in the master_check do file. You can also use the ipacheckreadreplace command in your own code with this format:
using "hfc_replacements.xlsm", ///
ipacheckreadreplace "key") ///
id(variable("variable") ///
value("value") ///
"newvalue") ///
newvalue("action") ///
action("comments") ///
comments("sheetname") ///
sheet("replacements_log.xlsx") logusing(
The Excel template already uses these column names; if you are using your own file or your own column names, you must change the options to match. ipacheckreadreplace also creates a replacements log, another Excel file that lists all the replacements, notes, and values, as well as a note specifying whether each replacement was successful. A replacement will only be successful in ipacheckreadreplace if the unique ID variable and the original value match what was entered in hfc_replacements.xlsx.
Translation
Sometimes open survey responses need to be translated for deliverables or to support an analyst who isn’t fluent in the survey language. Translating these data within statistical software can result in long scripts with a large potential for errors and for containing PII. This can be avoided by using an Excel-based workflow with an encrypted translation file:
- For each variable that needs to be translated, save the values that need translation to an excel sheet with an empty column (variable in the Stata .dta) for the language the responses need to be translated into.
- Translate the responses using a standardized procedure, e.g. double-entry with another person breaking ties and rules on when to drop comments with PII.
- Write code to merge the translations from the Excel file back into the dataset, instead of running a series of replace commands that are prone to error.
This workflow is shown below:
/* MR 10/25/2019:
Create a file to store translations for question q.
*/
*Generate translated variable name
gen q_en = ""
local vlab : variable label q // extract variable label to copy
label variable q_en "`vlab', English" // label with translation info

*Save excel file for coding with only those questions that need translating
export excel id q q_en using "${temp}/q_en.xlsx" if !mi(q), firstrow(variables) replace
This results in a table that looks like this:
id | q | q_en |
---|---|---|
01-0001 | Le dio sus herramientas agrícolas a su primo | |
18-0007 | El ID de esta respuesta debe ser 18-0008, no 18-0007 | |
The responses would be translated and the file saved with a different name. We recommend saving the file as “[filename]_translated_[date]_[initials]”. We also recommend double entry for all manual additions, comparing differences between the two sets of added responses. The completed file would have a value of q_en for every row, as below.
id | q | q_en |
---|---|---|
01-0001 | Le dio sus herramientas agrícolas a su primo. | This respondent gave their farming tools to their cousin |
18-0007 | El ID de esta respuesta debe ser 18-0008, no 18-0007 | The ID of this response should be 18-0008, not 18-0007 |
Once these translations are completed, they would be merged onto the dataset. The code to do that looks like this:
/* MR 01/24/20:
Merge on the translated data from the saved excel file.
MR translated these data on January 24, 2020 and double entered them.
MR's RM double checked conflicting translated entries.
If PII was removed as part of the translation process, this will be
marked with a dummy* and corrected in the response of both languages.
*Note: no responses needed PII removal, so this code does not
create the dummy.
*/
*Load in data and save a tempfile
preserve

import excel using "${temp}/q_en_20200124_MR.xlsx", first clear

*Do some cleaning to ensure the excel file matches what is expected
missings dropvars, force // remove extra vars
missings dropobs, force // remove empty observations
confirm variable id q q_en // check have variables

*Save tempfile to merge
tempfile q_en
save `q_en'

restore

*Merge on file
mmerge id using `q_en', t(1:1) uname(check_)

*Check file
assert _merge == 1 | _merge == 3 // in master only or translated
assert q == check_q if _merge == 3 // ensure no changes to comment field
assert _merge == 3 if !mi(q) // ensure all are translated

*Manually replace translation after checks
replace q_en = check_q_en

*Housecleaning
drop _merge check_q check_q_en
This process can be repeated for any number of variables. It can also be extended to remove PII as part of the translation process. In that case, make sure to maintain a raw version of the dataset that is encrypted, and mark which values were changed to remove PII in the translated dataset.
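For example, the export step could be wrapped in a loop over all open-response variables. A sketch, assuming an illustrative local translate_vars that holds those variable names:

local translate_vars q r s
foreach v of local translate_vars {
    gen `v'_en = ""
    local vlab : variable label `v'
    label variable `v'_en "`vlab', English"
    export excel id `v' `v'_en using "${temp}/`v'_en.xlsx" if !mi(`v'), firstrow(variables) replace
}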
SurveyCTO split
Some surveys questions have the option of “check all that apply,” meaning that they allow for multiple responses. If you uncheck the “export select_multiple responses as a series of 1/0 columns” in SurveyCTO desktop, SurveyCTO will export these variables as a string containing all possible answers in the form of a space-separated list (e.g., a variable may have the string “A C E” to indicate responses of A, C, and E).
For analysis purposes, it’s usually best to split these variables into binary variables, one for each possible choice (A, B, C, D, or E), as well as one for each combination of choices. The IPA-written stringdum command can help convert these variables into a more useful form, including separate dummies for each option, variables that count the number of times each option is selected, and several options for naming conventions. The split command can also convert these variables by parsing the variable on spaces: it creates a new variable after each space (see help split). If this is not sufficient, Stata has a number of string functions (help string functions). You can also use the regexm function, which uses regular expressions. Below is an example script using both split and regexm to clean a select_multiple question manually.
* Create a local of all string variables
qui ds _all, has(type string)
foreach var in `r(varlist)' {
    local stringvars `stringvars' `var'
}

/* Exclude from stringvars vars with names that will be too long if we append
four characters ("_num") to them; first check using tab that they are not
select_mult. If the length of the variable name is greater than 27 characters,
1. manually verify that it is not a select_mult variable
2. add it to a special local
*/
foreach var of varlist _all {
    if strlen("`var'") > 27 {
        tab `var'
        local too_long_vars `too_long_vars' `var'
    }
}

* Remove the too_long_vars from the string variable list
local stringvars: list stringvars - too_long_vars

/* For each string variable,
1. create a tempvar,
2. an indicator for which have a number-space-number combo and
3. add them to a select multiple local
*/
foreach var in `stringvars' {
    tempvar `var'_num
    gen ``var'_num' = (regexm(`var', "[0-9] [0-9]"))
    qui sum ``var'_num'
    if r(max) == 1 {
        *add the variable to a list of select_multiple variables
        local select_mult `select_mult' `var'
    }
}

/* Exclude from the `select_mult' local variables which have been
erroneously captured by the process above; irregular responses to
the "survey_note" variable are captured by this process */
local exclude_notes "survey_notes"
local select_mult: list select_mult - exclude_notes

/*For several otherwise numeric variables,
the option "Not asked in this version" remains; remove it*/
foreach var in `select_mult' {
    *no observations; safe to use this as a numeric indicator for .v
    tab `var' if `var'=="-444"
    replace `var'="-444" if `var'=="Not asked in this version"

    /* split and de-string vars in the select_mult list,
    which now have only numeric characters */
    split `var', gen(`var'_) destring
}

*Re-code missing values in the now-numeric variables
qui ds _all, has(type numeric)
foreach var in `r(varlist)' {
    replace `var'=.o if `var'==-777
    replace `var'=.d if `var'==-888
    replace `var'=.r if `var'==-999
    replace `var'=.v if `var'==-444
}
Unique identifiers
Unique IDs are critical to a well-managed dataset, particularly when it comes time to merge across datasets or sort values in a consistent manner. Use the isid command in Stata to check that the ID is indeed unique before you begin any sort of data management. Uniqueness matters not just for describing the data, but because it affects how Stata manages data: if data are sorted on a non-unique identifier, observations that share a value take a random order. This randomness is controlled by a separate random seed in Stata and can be made reproducible using set sortseed or sort, stable. Those approaches are not a substitute for a unique ID; reproducibility is not a substitute for uniqueness.
To illustrate this, you can run the following code in Stata:
*Use system dataset
set sortseed 1
sysuse auto, clear

*sort by foreign and save order
sort foreign
gen sort_order = _n

*sort randomly so Stata will sort by foreign again
/* sort first checks if it's in order, and doesn't do anything if
the data are in the expected order
*/
gen rand = runiform()
sort rand
drop rand

*Check if foreign has a unique order
sort foreign
assert _n == sort_order // check if the current order == saved order

73 contradictions in 74 observations
assertion is false
r(9);
Duplicates
The ID variable should describe data at the unit of analysis. The unit of analysis depends on what the dataset is describing. For example, a survey module that describes yield by plot should not be unique at the household level, but should be at the household-plot level; that is equivalent to saying that isid householdid plotid should not return an error in Stata. If you do receive an error, use the duplicates command to investigate why you have duplicate observations (help duplicates). Although duplicate surveys should be resolved in IPA’s data management system, duplicate observations may remain in the raw data for a variety of reasons.
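A short sketch of that investigation, assuming illustrative household and plot ID variables:

*isid householdid plotid would error here if the pair is not unique, so investigate:
duplicates report householdid plotid          // how many copies of each combination exist
duplicates tag householdid plotid, gen(dup)   // flag the duplicated observations
browse if dup > 0                             // inspect them before deciding on a rule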
If they are true duplicates across all variables and it is clear why these duplicates were created, you can remove the duplicate copies. If they are not duplicates across all variables, you will need to find out why there are multiple observations and develop a rule for choosing which one to keep. Selected values should be reproducible and follow a consistent rule. In general, observations that are more recent and more informative (i.e., have fewer missing values) are better, although for some projects it is better to keep the earlier observation. Check with your PI, the project manager, and other staff involved in the survey wave about how to create the rule or rules. Once you come to a decision, document what was decided, why, when, and by whom for future reference.
Duplicates may not exist only in an ID variable: some respondents may take a survey twice and receive two different IDs. See help duplicates for guidance on commands that can be useful here. Apart from dropping duplicates in terms of all variables using duplicates drop, check for duplicates in terms of things you suspect may identify the same person (e.g., name, address, phone number, birthday, and combinations of the above) using either duplicates or isid.
After cleaning each dataset individually, check that the unique IDs are assigned correctly across datasets. For instance, if you have a baseline survey and an endline survey, after merging them based on unique ID, check that the names and birthdate variables from each survey match to the other survey.
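A sketch of that cross-round check, with illustrative file and variable names:

*Rename identifying fields in the endline so they can be compared after merging
use "${clean}/endline.dta", clear
rename (name birthdate) (name_el birthdate_el)
tempfile endline
save `endline'

use "${clean}/baseline.dta", clear
merge 1:1 unique_id using `endline', keepusing(name_el birthdate_el)

*Respondents found in both rounds should have matching names and birthdates
assert name == name_el & birthdate == birthdate_el if _merge == 3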
ID formats
IDs may stop being unique because of how Stata manages numeric formats. If a numeric ID has more digits than its storage type can hold exactly (about 16 for doubles), Stata will begin to round the trailing digits, and IDs that differ only in those trailing digits will no longer be unique. Store IDs as strings, and include a letter so that the ID cannot be turned into a numeric variable and subsequently rounded by compress or destring. Alternatively, keep ID values at the minimum length needed; it is rarely necessary for an ID to be that long.
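A quick illustration of the rounding problem, using made-up values:

clear
set obs 2
gen double id_num = 1e17 + _n      // 18-digit IDs exceed what a double stores exactly
format id_num %20.0f
duplicates report id_num           // both observations now share one rounded value

*IDs built as strings are never rounded
gen str20 id_str = "R" + string(_n, "%017.0f")
isid id_str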
De-identifying data
Personally Identifiable Information – also known as PII – is any datapoint or combination of datapoints that allows someone to identify an individual or household with a reasonable degree of certainty.
Examples of individual data points of PII include names, GPS coordinates, national identification numbers and addresses. Depending on the context, certain combinations of demographic data points qualify as PII so long as they can identify an individual or household with a reasonable degree of certainty. For example, the combination of village name, birth date, gender might be identifiable in small communities. Requirements and recommendations around PII apply equally to PII that consist of singular data points and those that consist of combinations of data points.
All PII must remain encrypted in storage and be securely transmitted between devices. The only people who can access PII data must be both on the IRB-approved research protocol and referenced in the informed consent. For more information, see IPA’s protocols surrounding PII.
When to remove PII
After you have the data, it is best practice to remove PII as soon as possible; the earlier you can de-identify, the better. Unless necessary, avoid working with data containing PII as you move forward with analysis.
Some aspects of cleaning and analysis will require PII. For instance, if you are trying to remove duplicate observations and need to determine whether two respondents have the same name, you will need the name variables at that stage. As always, keep the data encrypted with Boxcryptor at that point, and make sure do-files/scripts are encrypted as well if they include PII.
After you no longer need PII, strip it from the main dataset. The best practice for de-identifying data is to:
- Create a first .do file that imports the data, standardizes duplicate observations, completes corrections, and removes PII.
- Create a second .do file that performs the data cleaning on the limited or fully de-identified dataset. If need be, you can save this version in an unencrypted folder so that other project members can access the data.
This results in a data flow where all data modification happens after the data is deidentified.
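A minimal sketch of the first do file under this flow, with illustrative paths and variable names:

use "${encrypted}/survey_raw.dta", clear

*Resolve duplicates and apply confirmed corrections here, then strip PII fields
drop respondent_name phone_number gps_latitude gps_longitude home_address

save "${deidentified}/survey_clean_noPII.dta", replace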