Data Publication Preparation
This reference guide documents standardized procedures for preparing research materials—including data, code, and documentation—for public sharing in data repositories.
- Data publication requires systematic removal of personally identifiable information (PII) while maintaining analytical integrity.
- Complete documentation includes datasets, code, survey instruments, user-written commands, readme files, and codebooks.
- Materials must be cleaned, tested for reproducibility, and properly documented before public sharing.
Purpose and Scope
Data repositories allow researchers to share study materials publicly, raising research reliability by enabling verification and reuse of data. This guide documents standardized procedures for preparing materials for public sharing.
Required Materials for Publication
We strongly recommend sharing the following materials:
| File Type | Description | Required? |
|---|---|---|
| Datasets | Full set of collected variables (survey, administrative data, etc.), excluding PII | Yes |
| Data documentation | Context, notes on data-cleaning process, methodological details | Yes |
| User-written commands | ZIP file of all custom commands/programs | If applicable |
| Analysis code | Code required to reproduce publication results | If applicable |
| ReadMe file | Instructions for replication, including software requirements and command lists | Yes |
| Codebook | Variable labels, names, and descriptions | Yes |
| Survey instruments | All questionnaires and data collection tools | Yes |
Organizing Materials
Establish a clear directory structure:
study-materials/
├── data/
│ ├── raw/
│ ├── intermediate/
│ └── final/
├── code/
│ ├── cleaning/
│ ├── analysis/
│ └── figures-tables/
├── documentation/
│ ├── surveys/
│ ├── codebooks/
│ └── readme/
└── user-commands/
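The layout above can also be created from within Stata, so that every collaborator starts from the same tree. The following is a minimal sketch, assuming it is run from the folder that should contain study-materials/. `capture` suppresses the error `mkdir` raises when a folder already exists.

```stata
* Create the study-materials directory tree shown above.
* Assumption: the current working directory is where the tree should live.
local root "study-materials"
capture mkdir "`root'"

* Top-level folders
foreach top in data code documentation user-commands {
    capture mkdir "`root'/`top'"
}

* Subfolders: parents are created first, then their children
foreach sub in raw intermediate final {
    capture mkdir "`root'/data/`sub'"
}
foreach sub in cleaning analysis figures-tables {
    capture mkdir "`root'/code/`sub'"
}
foreach sub in surveys codebooks readme {
    capture mkdir "`root'/documentation/`sub'"
}
```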
Data Preparation
Variable Naming and Labeling
Variable names should be:
- Short (no more than 32 characters, which is Stata's limit for variable names)
- Descriptive
- Linked to survey questions where applicable
- Free of spaces and special characters
Variable labels must:
- Stay within Stata’s 80-character limit
- Reference survey question numbers
- Be interpretable without additional context
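The naming and labeling rules above can be applied and spot-checked in a few lines of Stata. This is a sketch on a toy dataset: the raw name `q1` is hypothetical, and the check that labels begin with "Q" plus a number is one assumed way of testing the question-number convention, not a fixed standard.

```stata
* Toy dataset; q1 stands in for a raw survey variable
clear
set obs 3
generate byte q1 = 1

* Short, descriptive name linked to the survey question
rename q1 gender_q1
label variable gender_q1 "Q1. Gender of respondent"

* Flag variables with no label, or a label that lacks a question number
foreach v of varlist _all {
    local lbl : variable label `v'
    if `"`lbl'"' == "" {
        display as error "`v': no variable label"
    }
    else if !regexm(`"`lbl'"', "^Q[0-9]+") {
        display as text "`v': label does not reference a question number"
    }
}
```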
Example:
Variable name: gender_q1
Variable label: Q1. Gender of respondent
Categorical Variables with “Other” Options
Many surveys include categorical variables with “Other, specify” options. Review these carefully to:
- Recode responses that match existing categories
- Create new categories if multiple responses suggest a pattern
- Document any recoding decisions
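The review steps above might look like the following in Stata. This is an illustrative sketch: the variables `occupation` and `occupation_oth`, the code 96 for "Other, specify", and all category values are hypothetical.

```stata
* Toy data: 96 = "Other, specify", with the free text in occupation_oth
clear
input byte occupation str20 occupation_oth
1  ""
96 "Farmer "
96 "taxi driver"
96 "Taxi Driver"
96 "beekeeper"
2  ""
end
label define occ_lbl 1 "Farmer" 2 "Trader" 96 "Other, specify"
label values occupation occ_lbl

* Keep the original coding so the recode is documented and reversible
generate byte occupation_raw = occupation
label variable occupation_raw "Occupation as originally recorded"

* Normalize the free text, then inspect it before recoding anything
generate str20 oth_clean = lower(strtrim(occupation_oth))
tabulate oth_clean if occupation == 96

* Responses that match an existing category
replace occupation = 1 if occupation == 96 & oth_clean == "farmer"

* Repeated responses that justify a new category
label define occ_lbl 3 "Taxi driver", add
replace occupation = 3 if occupation == 96 & oth_clean == "taxi driver"

drop oth_clean
```

Responses that fit no pattern (here, "beekeeper") stay coded as 96.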
Quality Checks Before Sharing
Verify that:
- All variables have descriptive labels
- Value labels exist for all categorical variables
- Missing value codes are consistent and documented
- Outliers have been reviewed and documented
- Variable types are appropriate (numeric vs. string)
- Date variables are in standard formats
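Several of these checks can be partly automated. The sketch below uses the `auto` example dataset that ships with Stata as a stand-in for a study dataset. It only lists candidates for review and does not replace reading the output.

```stata
sysuse auto, clear              // example dataset shipped with Stata

* Variables lacking a variable label
ds, not(varlabel)
display "Unlabeled variables: `r(varlist)'"

* Numeric variables with no value label attached
* (review which of these are actually categorical)
ds, has(type numeric)
local numeric `r(varlist)'
ds `numeric', not(vallabel)
display "Numeric, no value label: `r(varlist)'"

* String variables (check whether any should be numeric)
ds, has(type string)

* Missing-value counts per variable
misstable summarize

* Ranges and distinct values, useful for spotting outliers
codebook, compact
```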
Removing Personally Identifiable Information
Before sharing data publicly, all personally identifiable information (PII) must be removed or anonymized. PII includes direct identifiers (names, contact information, precise locations) and indirect identifiers (combinations of variables that could identify individuals).
For detailed information on identifying and removing PII, including HIPAA guidelines, detection strategies, and anonymization techniques, see the Data Security Protocol: Personally Identifiable Information section of IPA’s Data Security Hub.
Quick PII Removal Checklist
Always remove or anonymize:
- Names (participants, relatives, staff, enumerators)
- Contact information (phone, email, address)
- Geographic units with fewer than 20,000 people
- Financial identifiers (bank accounts, credit cards)
- Government IDs and medical identifiers
Check for PII in:
- Variable names and labels
- String variables and free-text responses
- Code comments and file paths
- Survey instruments and documentation
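A keyword search over variable names and labels is one way to start the checklist above. The following sketch uses toy data in which every variable name and value is hypothetical, and the keyword list is an assumption to adapt per study. It will not catch PII inside free-text responses, which still need manual review.

```stata
* Toy data; all names and values are made up for illustration
clear
input str20 resp_name str12 phone byte age float gps_lat
"Example Person" "000-0000" 34 0
end
label variable resp_name "Respondent name"
label variable phone     "Phone number"
label variable age       "Age in years"
label variable gps_lat   "GPS latitude of household"

* Search variable names and labels for common PII keywords
lookfor name phone email address gps birth
local pii_candidates `r(varlist)'
display "Review before dropping: `pii_candidates'"

* After manual review, drop confirmed identifiers explicitly
drop resp_name phone gps_lat
```

Dropping by explicit name rather than by the search result keeps the decision documented in the code.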
Code Documentation
Essential Code Components
Every code file should include:
- Header with study information:
/*******************************************************************************
* Study: [Full study name]
* Purpose: [What this code does]
* Last updated: [Date]
* Stata version: [Version number]
*******************************************************************************/
* Basic settings
clear all
set more off
set maxvar 10000
* Set working directory
cd "INSERT WORKING DIRECTORY HERE"
- Clear comments explaining:
  - Major sections and subsections
  - Complex operations
  - Non-obvious decisions
- Output linked to publications:
* Table 1: Baseline characteristics
* (eststo and esttab come from the user-written estout package)
eststo clear
eststo: reg outcome treatment controls
esttab using "output/table1_baseline.tex", replace ///
    title("Table 1: Baseline Characteristics")
Master Do-File
Create a master file that runs all code in the correct sequence:
/*******************************************************************************
* Master do-file for [Study Name]
* Runs all code to reproduce paper results
*******************************************************************************/
cd "INSERT WORKING DIRECTORY HERE"
* Data preparation
do "code/01_clean_data.do"
do "code/02_construct_variables.do"
* Analysis
do "code/03_main_analysis.do"
do "code/04_robustness_checks.do"
* Output
do "code/05_tables.do"
do "code/06_figures.do"
User-Written Commands
Include all custom commands (.ado files) used in your analysis:
- Identify them:
which outreg2
- Collect from:
C:\Users\[username]\ado\plus\ or \ado\personal\
- Package as: ZIP file with list in ReadMe
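The identification step can be looped over every command the analysis depends on. In this sketch the command names are examples only, so substitute the list your do-files actually use.

```stata
* Show where Stata looks for ado-files
* (PLUS and PERSONAL hold user-written commands)
sysdir

* List user-written packages installed on this machine
ado dir

* Confirm each required command is available and print its location.
* The names below are placeholders for your own list.
foreach cmd in outreg2 estout {
    capture which `cmd'
    if _rc {
        display as error "`cmd' is not installed"
    }
    else {
        which `cmd'          // prints the path of the .ado file to copy
    }
}
```

One option for the replication package is to add `adopath ++ "user-commands"` near the top of the master do-file, so that Stata searches the bundled copies first.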
Code Verification
Before sharing, verify that:
- Code runs without errors from a fresh Stata session
- Generated tables match publication results
- All file paths work on a different computer
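A few settings at the top of the master do-file make these checks easier to pass. This is a sketch under stated assumptions: the version number follows the Stata 17 example used in the ReadMe template, the seed value is arbitrary, and the log file name is hypothetical.

```stata
* Settings that make runs comparable across machines
version 17                      // assumption: results were produced in Stata 17
clear all
set more off

* Single root path; every other path in the project is relative to it.
* Defaults to the current folder; replace with the package folder if needed.
global root "`c(pwd)'"
cd "$root"

* Log the full run as evidence that the code executes end to end
capture log close
log using "replication_log.txt", text replace

* Fix the seed if any step uses random numbers
set seed 12345

* ... the do-files are called here, using relative paths ...

log close
```

Because only `$root` is machine-specific, moving the package to another computer requires changing at most one line.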
Documentation Files
ReadMe File Structure
# [Study Title]
## Citation
[Full citation for publication]
## Authors
[List of authors and affiliations]
## Files Included
- Data files with descriptions
- Code files with purposes
- Survey instruments
- User-written commands
## Software Requirements
- Software version (e.g., Stata 17.0)
- Required user-written commands
## Replication Instructions
1. Unzip files maintaining directory structure
2. Update working directory in master.do
3. Run master.do
## Notes and Discrepancies
[Document any differences between paper and reproducible results]
## Contact
[Contact information for questions]
Metadata Requirements
Provide complete metadata including:
- Study title and description
- Authors and affiliations
- Geographic and temporal coverage
- Unit of observation and sample size
- Key variables and collection methods
- Related publications and funding source
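Some of these metadata fields can be verified against the data and stored inside the dataset itself. The sketch below again uses Stata's bundled `auto` dataset as a stand-in. The title and note text are placeholders, not real study details.

```stata
sysuse auto, clear              // stand-in for the final dataset

* Unit of observation: confirm the ID uniquely identifies rows
isid make

* Sample size
count
display "Observations: " r(N)

* Embed descriptive metadata in the dataset
label data "Hypothetical study title, final analysis data"
notes: Unit of observation: one row per car model
notes: Collection method and coverage: fill in from study records

* Review what a repository user will see
describe, short
notes
```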
Pre-Publication Checklist
Before sharing materials publicly:
Data and Documentation:
Code and Reproducibility:
Metadata: