Dataset Documentation
Standards for documenting datasets, focusing on Stata commands to name and label variables and attach notes to datasets in memory. Covers variable naming conventions, labeling systems, value labels, and metadata management for research data.
- High-quality research relies on code that anyone involved in the project can understand and use.
- Proper dataset documentation starts with consistent naming and labeling of data.
- Code needs internal documentation to explain what problem each script solves, why you chose the approach, and how each section contributes to resolving the problem.
Features of well-documented datasets
Documented datasets in Stata should have the following characteristics:
- Variable names should follow patterns for both interpretability and ease in programming tasks.
- All variables should have descriptive labels.
- All values of categorical variables should have labels and be checked for consistency when you assign labels.
- All datasets should only contain variables needed as part of the dataflow.
- Datasets should have internal notes describing additional information necessary to use the data, such as the name of the item in the questionnaire.
Write this documentation along with project manuals, codebooks, and readmes that describe other aspects of the data generating process. This includes why you made decisions to create variables, how datasets relate and change through the dataflow, and how you define variables used in the analysis.
Variable names
Variable names often come from the prior method of data storage. These names may not work well for coding in statistical software due to length, formatting, or clarity. Standardizing these names to usable and interpretable formats is one of the first steps to ensure datasets are easy and intuitive to use.
If you use SurveyCTO data, the SurveyCTO import template automatically generates variable names and labels for you upon import. If you use other data or skip this template, you will need to rename your variables.
Naming Standards
Variable names should balance interpretability by users with the utilities that statistical software provides. For example, in Stata, several commands allow using wildcard commands such as *
and ?
to stand in for patterns in the data. This allows you to modify data systematically. If all income variables start with the inc_
prefix, you can modify every variable at once:
all income variables
*Call ds inc_*
Balance the following considerations while naming variables:
- Group variable names that describe outcome categories. For example, prefix all variables that count yield with
y_
. - Group types of variables like comment fields with a unique substring such as
_note
. - Create names that have substantive meaning and are easy to type. For example, use
inc_bus_
andinc_ag
instead ofsection1b_
andsection2a_
. - For indicator variables, name the variable what the value “1” indicates rather than the category. For example, if a variable takes 1 when a respondent is female and 0 when male, name the variable
female
notgender
. - Create unique and consistent naming patterns across all datasets used in the project:
- Two datasets with different units of analysis should not use the uninformative variable name
id
- Two datasets that describe income at various levels should use the same prefix to describe the same construct. For example, plot-level income could be
inc_ag_plot1
and baseline household income could beinc_ag*_bl
.
- Two datasets with different units of analysis should not use the uninformative variable name
In wide data, it’s also important to consider how statistical software performs. Often times patterning variable names is necessary for tasks like reshaping to work smoothly. In wide-data stored in Stata, reshape
uses a stub in the variable name to identify the value the long dataset would take for each group. Ensuring that variables are named consistently (e.g. baseline variables are suffixed by _1
, midline by _2
, and endline by _3
) can make it easier to reshape datasets.
Renaming in Stata
Stata’s rename
command is used to change variable names. While it is possible to rename multiple variables with these commands, it can often be easier to rename many variables from an external file such as an .xls. However, rename
allows for some operators to rename multiple variables that share patterns. These commands can be very powerful and can sometimes capture variables that you do not intend to rename. See h rename group
for a full description of renaming commands. Some options that are relevant for survey data follow:
Extended command | Function | Example |
---|---|---|
* |
Any number of characters | ren year_* * removes the prefix year_ from all variable names that start with year_ |
? |
Exactly one character | ren monday_? day_?_1 would change monday_a to day_a_1 |
# |
One or more digit (numeric only) | ren age# age(##) renames all numeric variables to use a minimum of two digits for number suffixes (e.g. age_1 becomes age_01) |
, renumber |
Reorders names to increase by 1 | ren survey_# survey_#, renumber reorders all variables prefixed with survey_ that end with a number so that they increase by 1 |
Also see renvars
(net search renvars
to install) a user written command that can help with complicated renaming tasks.
Renaming from External Files
If you are renaming/labeling a lot of variables it can be cleaner to put them in an excel file and import from there, rather than writing it all in your do file. For an example of how to efficiently rename variables from a .xlsx file, see the following:
codebook file
**A. Import /*
This file contains the master name and labels, as well as the
survey-wise name and labels
*/
"${raw}/variable_codebook.xlsx", firstrow clear
import excel
** B. Make locals with common names and corresponding survey namessort common_varname // sort in unique order
*Init project specific locals empty// project-specific variable names
loc survey_names // corresponding common variable names
loc common_names
all value of the excel file
* Loop through forvalues i = 1/`=_N' {
name and labels in order from the excel sheet
*Save in `i'
loc survey_name = varname in `i'
loc common_name = common_varname in `i'
loc common_varl = common_varlab
*Fill locals to add information to project`proj_names' `proj_name'
loc proj_names `common_names' `common_name'
loc common_names "`common_label' `"`common_varl'"'"
loc common_label
}// end forvalue i == 1/`N'
**C. Run some checkslength
*Check renaming lists are same assert "`:word count `proj_names''" == "`:word count `common_names''"
list of locals for logc
*Save macro list
**D. IMPORT THE DATASETS IN A LOOP AND RENAME
*Clean locals// init empty
loc common_name // init empty
loc common_varl
data
*Load raw use "``project'_directory'/`project'`input_dataset_suffix'.dta", clear
variable names to remain
*Loop through forvalues i = 1(1)`: word count `proj_names'' {
list to rename variables
*Collect names from `:word `i' of `proj_names''
loc proj_name `:word `i' of `common_names''
loc common_name `:word `i' of `common_label''
loc common_varl
label from common names
*Rename and rename `proj_name' `common_name'
var `common_name' "`common_varl'"
lab
}// end forvalues i = 1/`: word '
Variable labels
Stata variables have both names and labels. Variable names are what Stata uses to define a column. Variable labels provide additional information that you can display to the analyst. Names should follow patterns that make programming easy. For example, you could code all consumption questions as cons_1 - cons_20
and call them with ds cons_? cons_??
. Use variable labels as descriptors that say exactly what the variable is about. You can pull the exact question text from the survey, or use a paraphrased version if the text is lengthy.
Systematizing labels
Variable labels provide information about variable names that you often define for programmatic reasons.
- Give all variables labels, and give all multiple choice variables value labels.
- Keep the labeling system internally consistent.
- Make it easy to connect the variable in the dataset with the question on the questionnaire. Most analysis happens with the questionnaire in hand.
One format for defining variable labels for survey data includes both the question number in the questionnaire and a description of the contents in the variable. The basic format for that system is:
Variable name: descriptive name that uses prefixes or suffixes to provide patterns
Variable label: [question_number] descriptive label
This style is implemented below:
variable labels variables
*Define label var child_15 "[QA.101] Has children under 15"
label var child_15B "[QA.102a] Number boys under 15"
label var child_15B_S "[QA.102b] Number boys in school"
label var child_15G "[QA.103a] Number girls under 15"
label var child_15G_S "[QA.103b] Number girls in school"
Note that this code aligns the variable names and the variable labels in the text. This makes it easy to read the labeling as a programmer.
Labels from SurveyCTO
SurveyCTO automatically labels variables using questions from the survey instrument. However, Stata allows labels a maximum of only 80 characters, which means SurveyCTO imports often truncate the labels. For tips on how to attach information longer than 80 characters see the variable notes guide.
Stata Storage of Variable Labels
Stata can use value label data using the extended macro functions (see h extended_fcn
). The following code calls a variable label and assigns it to a local.
variable label of variable "var"
*Call local vlab : variable label var
This information can be searched conditionally. If, for example, you wanted to only apply a function to variables in the “QA” section of the survey defined in “Systematizing Labels” section in this article, you could check to see if the label starts with “[QA.”:
foreach var of varlist _all {
if regexm("`: variable label var'", "^\[QA") {
do something]
[
}// end if regexm("`: variable label var'", "^\[QA\.") {
}// end foreach var in `r(varlist)
Variable metadata
Sometimes you want to attach information or other labeling that is longer than Stata allows (labels are capped at 80 characters). If this is the case, you can store the full desired label into the variable notes or characteristics. Both notes and characteristics can describe variables or the dta file.
Notes
One variable can have multiple notes. Add notes into variables by typing note VARIABLE : "Note"
. For example, for variable VARIABLE
the note note
was added. To display the notes stored in one variable just type “notes VARIABLE”. Stata also stores notes as locals and you can call them using `VARIABLE[note1]'
. Stata numbers these notes based on the order in which it receives them. You can modify and delete note ordering with the note
command (see help notes
in Stata).
Survey CTO includes the full text of the question from the survey instrument as variable notes (as well as the truncated questions as variable labels) as part of the import do file. These notes will always be in the downloaded language. They will not contain filled values for the respondent that are produced as the result of calculate fields.
If you have changed or converted variable labels as part of a data transformation, you can convert notes into labels by looping through variables and using the stored local for notes:
variable in the varlist VARIABLES
*Loop through each foreach var of varlist VARIABLES {
label var `var' ``var'[note1]'
}
The char
command
Additional information can be added using characteristics, which function similarly to notes. The Stata manual describes characteristics as “an arcane feature of Stata [that] are great use to Stata programmers.” Many commands use and define specific named characteristics to attach metadata. Characteristics (type help char
in Stata) can describe variables and the dataset itself.
The main difference between characteristics and notes is that the char
command requires a name for each characteristic. Whereas note VARIABLE : "Note"
creates the next sequential note (1 if the first note, 2 if the second, etc.), char
explicitly requires a character name, “charname” in the following code: char define VARIABLE[charname] "Note"
. This can be useful for saving labels in multiple languages. These characteristics can then be called by name, instead of an arbitrarily assigned number.
For example, a data flow could take labels in each language from a SurveyCTO form and assign them as characteristics to each variable produced by the survey in the following:
*Import SurveyCTOusing "Baseline Household Survey.xlsx", first clear
import excel
*Keep variables with labelskeep type name label labelbangla relevance
ren label labelenglish
ren label* * // rename to remove "label_" prefix from all variables
not exported to Stata
*Remove variables drop if inlist(type, "begin group", ///
"end group", "image", "begin repeat", "end repeat")
drop type
keep variables with non_missing
*Only ds name, not // get list of label variables
`r(varlist)' relevance
loc languages egen has_lab = rownonmiss(`languages'), strok
// keep only rows with has_lab
keep if has_lab >= 1 & !mi(has_lab)
*Save variables and language names// init empty
loc varnames `=_N' {
forval i = 1(1)name = name[`i'] // save name of variable
loc
as locals
*Save labels // init counter at start
loc j = 1 foreach language of local languages {
`name'_`j' = `language'[`i']
loc
loc ++j
}// end foreach language of local languages
save local of names to add question text to
*`varnames' `name'
loc varnames
}// end forval i = 1(1)_N
data
*Load use survey.dta, clear
*Loop through names to add characteristicsforeach name of local varnames {
ds `name'* // collect names that are inclusive of repeat groups
`r(varlist)'
loc varl
foreach var of local varl {
variable or the repeat group
*Confirm only the assert "`var'" == "`name'" | regexm("`var'","^`name'[0-9][0-9]?$")
cap if _rc continue // skip if "`name'" is a prefix
as a named language
*add characteristic
loc j = 1foreach language of local languages {
char define `var'[`language'] "``name'_`j''"
loc ++j
}// end foreach language of local languages
}// end foreach var of local varl
}// end foreach name of local names
Value labels
For categorical variables the raw data often shows the string values for the selected response. For instance, you may see “male” and “female” as possible responses to the variable gender. When doing calculations, however, you need these variables to be numeric (in the float or long format) if they are not already imported this way. Preserve the extra information the strings capture by using “value labels.” A value label such as gender would assign “female” to 0 and “male” to 1 and display female and male to the analyst. See help label
for how to do this in Stata. Label values for two important reasons:
- it provides information to the analyst that reduces mistakes in coding or analyzing data
- many programs use information on whether a variable has value labels to identify it as a categorical variable, as opposed to a continuous numeric variable.
Encoding String Values in Stata
The quickest way to change string variables to numeric variables with value labels is the encode
command. encode
automatically converts the string variable into a numeric variable and assigns the numbers 1-x (where x is the number of unique answer choices) to the alphabetized list of the answer choices (ordered 0-9, followed by a-z). Because this happens automatically based on alphabetical order, you may need to recode or label them manually if you want value labels to match some existing assignment.
Stata stores value labels independently from the variables, so manage value labels separately from variables as they can contain PII. Deleting all variables that have a value label and saving the dataset removes the value label from the .dta file. To see which labels Stata currently defines and their content, use the label list
command. The return values of label list
and label dir
also store helpful summary information. You can modify or delete these labels to combine using the options of label define function:
*Drop old labelslabel drop ex1 ex2 ex3
label
*Define label define yesno 1 "No" 3 "Yes"
label list yesno
label to correct the error
*Modify label define yesno 1 "No" 2 "Yes", modify
values to the label defined above
*Add extended label define yesno .n "No response" 3 "Maybe", add
label to all of the variables it should apply to
*Apply the
loc dummy_vars ex1 ex2 ex3label values `dummy_vars' yesno
Formatting Labeling in Stata
It can be useful to change the delimiter to a semicolon so that a single command can take up several rows in your text editor, making it easier to read labeling. This is especially useful when multiple values are labeled. See help delimit
to learn about delimiters in Stata. An example would be:
for labeling
* Set delimiter
#delimit ;
label def female
"[0] Male"
0 "[1] Female"
1
;
label def region
"[1] Northern"
1 "[2] Southern"
2 "[3] Western"
3 "[4] Eastern"
4 "[5] Central"
5
;
cr
#delimit
label values female female
label values region region
Note how the labels have the corresponding value as well as the description in the value label. This is not strictly necessary, but can be useful if you want values to display alongside labels in outputs.
Defensive Workflow for Encoding Values
One way to ensure that data is encoded in an expected way is to check that values are only encoded from a pre-specified list of value labels that you defined. The user written command sencode
(install using ssc install sencode
) can help support this. sencode
labels the variable according to the values that you’ve predefined and then adds additional values in order from the highest value if it encounters values that you haven’t defined. An example data flow follows:
*Ensure sencode is installed
cap which sencodeif _rc ssc install sencode
data
*Load sysuse auto, clear
keep if _n <= 10 // Keep the first ten observations of the sample
# del ;
/*
This label is named by the variable name, an "_", and then "label"
so that we can loop over the labels.
*/
label define make_label
"AMC Concord"
1 "AMC Pacer"
2 "AMC Spirit"
3 "Buick Century"
4 "Buick Electra"
5 "Buick LeSabre"
6 "Buick Opel"
7 "Buick Regal"
8 "Buick Riviera"
9 "Buick Skylark"
10
;
cr
# del
local list of variables to encode
*Create a
loc str_var make
values and confirm expected
*Encode foreach var of local str_var {
*Encode variables/*
type h sencode to see options, the noextend option returns an error
if the existing label doesn't capture all values.
*/
`var', label(`var'_label) replace noextend
sencode
}// end foreach v of local str_var
values
*Display the labeled and unlabeled tab make
tab make, nol
Dataset Management
Keep only necessary variables. Order those variables in an understandable way, and name and label them. Also put them in the correct storage format for analysis. The clearest way to do this may vary, especially with variable order. The order that questions appear in the survey works well. Always put unique identifiers first.
Any script that saves data should have code that identifies the variables saved, orders them, and describes them for readers. This ensures that a reader can look at the code and understand what it produces without running a do file. An example codeblock for the end of a do file follows. Note that comments describe values and the file ends with some commented marker.
clean vars
**B. Sort and isid hhid // confirm Household ID is unique
sort hhid // sort in a unique order
local of variables
*Create a ///
loc vars /// ID Variables
hhid enum_id cluster survey_date form_id /// File source variables
/// Treatment assignment
treatment scto_rand /// Baseline demos
bl_hhh_age bl_hhh_female bl_hhh_educ ///
bl_hh_size /// Consumption
bl_cons_veg_* bl_cons_meat_* bl_cons_purch_* ///
bl_cons_alc /// Loan information
bl_loan_size bl_loan_exp_pay_m* bl_loan_miss_m* // Lender Fees
bl_msf bl_otaf
values
*Keep necessary qui ds `vars', not
assert `: word count `r(varlist)'' == 0 // check no variables dropped
keep `vars'
*Order ID firstorder `vars'
close
**C. Save and data to the data folder
*Save save "${data}01a_baseline.dta", replace
log
*Close the log c
**EOF**