Data Publication Preparation
This reference guide documents standardized procedures for preparing research materials—including data, code, and documentation—for public sharing in data repositories.
- Data publication requires systematic removal of personally identifiable information (PII) while maintaining analytical integrity.
- Complete documentation includes datasets, code, survey instruments, user-written commands, readme files, and codebooks.
- Materials must be cleaned, tested for reproducibility, and properly documented before public sharing.
Purpose and Scope
Data repositories allow researchers to share study materials publicly, raising research reliability by enabling verification and reuse of data. This guide documents standardized procedures for preparing materials for public sharing.
Required Materials for Publication
We strongly recommend sharing the following materials:
| File Type | Description | Required? |
|---|---|---|
| Datasets | Full set of collected variables (survey, administrative data, etc.), excluding PII | Yes |
| Data documentation | Context, notes on data-cleaning process, methodological details | Yes |
| User-written commands | ZIP file of all custom commands/programs | If applicable |
| Analysis code | Code required to reproduce publication results | If applicable |
| ReadMe file | Instructions for replication, including software requirements and command lists | Yes |
| Codebook | Variable labels, names, and descriptions | Yes |
| Survey instruments | All questionnaires and data collection tools | Yes |
We recommend sharing materials in the following formats to encourage accessibility and long-term usability:
| Material Type | Recommended Formats |
|---|---|
| Data | .csv, .json, .dta (with version specified), .parquet (with compression method specified) |
| Code | .do, .py, .r; use notebooks (.ipynb, .Rmd, .qmd) sparingly |
| Documentation | .pdf, .md, .txt |
| Codebooks | .pdf, .xlsx, .csv |
| Surveys | .pdf, .docx, .xlsx |
Levels of Material Sharing
Replication packages can be shared at different levels of completeness:
| Level | Materials Included | Enables |
|---|---|---|
| Minimum | Data and code underlying published results, ReadMe explaining file relationships, minimal study-level metadata | Verifying that tables in the published article can be reproduced by running the code; limited reuse |
| Recommended | Minimum level, plus full set of collected variables (excluding PII), detailed data documentation, survey instruments | Full potential for reuse and secondary analysis; better understanding of study context for systematic reviews and policy application |
| Exceptional | Recommended level, plus cleaning and variable construction code | Start-to-finish reproducibility |
Organizing Materials
Establish a clear directory structure:
study-materials/
├── data/
│ ├── raw/
│ ├── intermediate/
│ └── final/
├── code/
│ ├── cleaning/
│ ├── analysis/
│ └── figures-tables/
├── documentation/
│ ├── surveys/
│ ├── codebooks/
│ └── readme/
└── user-commands/
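The layout above can be created in one step at the start of a project. The following is a minimal sketch in Python (one of the code formats listed earlier); the function name `create_structure` is illustrative and not part of any IPA tool.

```python
from pathlib import Path

# Folder layout copied from the tree above.
SUBFOLDERS = [
    "data/raw",
    "data/intermediate",
    "data/final",
    "code/cleaning",
    "code/analysis",
    "code/figures-tables",
    "documentation/surveys",
    "documentation/codebooks",
    "documentation/readme",
    "user-commands",
]


def create_structure(root):
    """Create the folder tree under `root` and return the created paths."""
    root = Path(root)
    created = []
    for sub in SUBFOLDERS:
        path = root / sub
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_structure("study-materials")` from the project's parent folder builds the tree; running it again leaves existing folders and files untouched.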
Data Preparation
Variable Naming and Labeling
Variable names should be:
- Short (no more than 32 characters, Stata's limit for variable names)
- Descriptive
- Linked to survey questions where applicable
- Free of spaces and special characters
Variable labels must:
- Stay within Stata’s 80-character limit
- Reference survey question numbers
- Be interpretable without additional context
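These naming and labeling rules can be checked mechanically before release. The sketch below is one possible check in Python, not a prescribed tool: the two length limits are Stata's, while the function name and the assumption that labels open with a question number such as "Q1." are illustrative.

```python
import re

MAX_NAME_LENGTH = 32   # Stata's limit on variable names
MAX_LABEL_LENGTH = 80  # Stata's limit on variable labels

# Letters, digits, and underscores only; must not start with a digit.
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
# Assumed convention: labels open with a question number such as "Q1." or "Q12a."
LABEL_QUESTION_PATTERN = re.compile(r"^Q\d+[a-z]?\.\s")


def check_variable(name, label):
    """Return a list of problems for one variable (empty if none are found)."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"{name}: name exceeds {MAX_NAME_LENGTH} characters")
    if not NAME_PATTERN.match(name):
        problems.append(f"{name}: name contains spaces or special characters")
    if not label.strip():
        problems.append(f"{name}: label is empty")
    elif not LABEL_QUESTION_PATTERN.match(label):
        problems.append(f"{name}: label does not start with a question number")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(f"{name}: label exceeds {MAX_LABEL_LENGTH} characters")
    return problems
```

Constructed variables that do not correspond to a survey question will trip the question-number check, so treat that message as a prompt for review rather than an error. Whether a name is descriptive, or a label interpretable on its own, still needs a human reader.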
Example:
Variable name: gender_q1
Variable label: Q1. Gender of respondent
Categorical Variables with “Other” Options
Many surveys include categorical variables with “Other, specify” options. Review these carefully to:
- Recode responses that match existing categories
- Create new categories if multiple responses suggest a pattern
- Document any recoding decisions
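One way to keep these decisions documented is to hold them in a single mapping that both performs the recode and serves as the record of what was changed. The Python sketch below illustrates the idea; the occupation categories, the code 96 for "Other", and the function name are all hypothetical.

```python
# Hypothetical code used for "Other, specify" in an occupation question.
OTHER_CODE = 96

# Free-text answers mapped to the category they belong to. Keeping the mapping
# in one place doubles as documentation of every recoding decision.
RECODE_MAP = {
    "farmer": 1,          # matches existing category 1 (agriculture)
    "farming": 1,
    "shop owner": 2,      # matches existing category 2 (retail)
    "motorbike taxi": 5,  # new category 5, created because the answer recurs
    "boda boda": 5,
}


def recode_other(code, other_text):
    """Return (new_code, note) for one response."""
    if code != OTHER_CODE:
        return code, "unchanged"
    key = other_text.strip().lower()
    if key in RECODE_MAP:
        new_code = RECODE_MAP[key]
        return new_code, f"recoded '{key}' from {OTHER_CODE} to {new_code}"
    # Anything not in the mapping stays in the "Other" category.
    return OTHER_CODE, "left as other"
```

The notes returned for each response can be written to a log that accompanies the data documentation, and any new category added this way also needs a value label and a codebook entry.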
Quality Checks Before Sharing
Verify that:
- All variables have descriptive labels
- Value labels exist for all categorical variables
- Missing value codes are consistent and documented
- Outliers have been reviewed and documented
- Variable types are appropriate (numeric vs. string)
- Date variables are in standard formats
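Several of these checks can be automated against a machine-readable codebook. The following Python sketch assumes the data have been exported to CSV and read as dictionaries of strings (for example with `csv.DictReader`); the codebook structure, the missing-value codes, and the choice of YYYY-MM-DD as the standard date format are assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical codebook: each variable's type, plus value labels for categoricals.
CODEBOOK = {
    "gender_q1": {"type": "categorical", "values": {"1": "Female", "2": "Male"}},
    "age_q2": {"type": "numeric"},
    "interview_date": {"type": "date"},
}
# Assumed study-wide missing-value codes.
MISSING_CODES = {"-99": "Refused", "-88": "Don't know"}

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def is_number(value):
    """True if value can be parsed as a number."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def check_rows(rows, codebook=CODEBOOK):
    """Return a list of problems found in rows (dicts mapping names to strings)."""
    problems = []
    columns = rows[0].keys() if rows else []
    for name in columns:
        if name not in codebook:
            problems.append(f"{name}: not described in the codebook")
    for number, row in enumerate(rows, start=1):
        for name, value in row.items():
            spec = codebook.get(name)
            # Skip undocumented columns (already reported), blanks, and missing codes.
            if spec is None or value == "" or value in MISSING_CODES:
                continue
            if spec["type"] == "categorical" and value not in spec["values"]:
                problems.append(f"row {number}: {name}={value} has no value label")
            elif spec["type"] == "numeric" and not is_number(value):
                problems.append(f"row {number}: {name}={value} is not numeric")
            elif spec["type"] == "date" and not is_iso_date(value):
                problems.append(f"row {number}: {name}={value} is not YYYY-MM-DD")
    return problems
```

This sketch does not cover outlier review, which depends on the study and is better handled by inspecting distributions directly.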
Removing Personally Identifiable Information
Before sharing data publicly, all personally identifiable information (PII) must be removed or anonymized. PII includes direct identifiers (names, contact information, precise locations) and indirect identifiers (combinations of variables that could identify individuals).
For detailed information on identifying and removing PII, including HIPAA guidelines, detection strategies, and anonymization techniques, see the Data Security Protocol: Personally Identifiable Information section of IPA’s Data Security Hub.
Quick PII Removal Checklist
Always remove or anonymize:
- Names (participants, relatives, staff, enumerators)
- Contact information (phone, email, address)
- Geographic units with fewer than 20,000 inhabitants
- Financial identifiers (bank accounts, credit cards)
- Government IDs and medical identifiers
Check for PII in:
- Variable names and labels
- String variables and free-text responses
- Code comments and file paths
- Survey instruments and documentation
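A simple pattern scan can help surface candidates for manual review in the places listed above. The Python sketch below is a rough heuristic rather than a complete detector: the regular expressions, the keyword list, and the function names are illustrative assumptions, and a clean result does not show that a file is free of PII.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Runs of digits that may be separated by spaces, dots, dashes, or parentheses.
PHONE_CANDIDATE = re.compile(r"\+?\d[\d\s().-]*\d")
MIN_PHONE_DIGITS = 9  # assumed threshold; shorter runs are often dates or codes

# Assumed keywords that suggest a variable holds an identifier.
NAME_HINTS = ("name", "phone", "email", "address", "gps", "account", "national_id")


def scan_text(text):
    """Return a sorted list of PII types that appear to be present in text."""
    found = set()
    if EMAIL.search(text):
        found.add("email")
    for match in PHONE_CANDIDATE.finditer(text):
        digits = sum(ch.isdigit() for ch in match.group())
        if digits >= MIN_PHONE_DIGITS:
            found.add("phone")
    return sorted(found)


def flag_variable_names(names):
    """Return the variable names that look like they hold identifiers."""
    flagged = []
    for name in names:
        lowered = name.lower()
        if lowered.endswith("_pii") or any(hint in lowered for hint in NAME_HINTS):
            flagged.append(name)
    return flagged
```

`scan_text` can be applied to each string value, free-text response, code comment, and file path, and `flag_variable_names` to the list of variable names. Names of people and places cannot be found this way and still require reading the free-text fields.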
Code Documentation
Essential Code Components
Every code file should include:
- Header with study information:
/*******************************************************************************
* Study: [Full study name]
* Purpose: [What this code does]
* Last updated: [Date]
* Stata version: [Version number]
*******************************************************************************/
* Basic settings
clear all
set more off
set maxvar 10000
* Set working directory
cd "INSERT WORKING DIRECTORY HERE"
- Clear comments explaining:
  - Major sections and subsections
  - Complex operations
  - Non-obvious decisions
- Output linked to publications:
* Table 1: Baseline characteristics
eststo clear
eststo: reg outcome treatment controls
esttab using "output/table1_baseline.tex", replace ///
title("Table 1: Baseline Characteristics")
Master Do-File
Create a master file that runs all code in the correct sequence:
/*******************************************************************************
* Master do-file for [Study Name]
* Runs all code to reproduce paper results
*******************************************************************************/
cd "INSERT WORKING DIRECTORY HERE"
* Data preparation
do "code/01_clean_data.do"
do "code/02_construct_variables.do"
* Analysis
do "code/03_main_analysis.do"
do "code/04_robustness_checks.do"
* Output
do "code/05_tables.do"
do "code/06_figures.do"
User-Written Commands
Include all custom commands (.ado files) used in your analysis:
- Identify them:
which outreg2
- Collect from:
C:\Users\[username]\ado\plus\ or \ado\personal\
- Package as: ZIP file with list in ReadMe
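The packaging step can be scripted so that the ZIP file and the list for the ReadMe come from the same source. The Python sketch below assumes the custom commands have already been located in a single folder; the function name and the default of collecting only `.ado` files are illustrative choices.

```python
import zipfile
from pathlib import Path


def package_ado_files(ado_dir, zip_path, suffixes=(".ado",)):
    """Zip files with the given suffixes under ado_dir; return the archived names."""
    ado_dir = Path(ado_dir)
    files = sorted(
        p for p in ado_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
    names = [p.relative_to(ado_dir).as_posix() for p in files]
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, name in zip(files, names):
            # Store paths relative to ado_dir so no user name ends up in the archive.
            archive.write(path, arcname=name)
    return names
```

The returned list can be pasted into the ReadMe's list of required user-written commands. Passing `suffixes=(".ado", ".sthlp")` would also include Stata help files, if they should be distributed with the commands.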
Code Verification
Before sharing, verify that:
- Code runs without errors from a fresh Stata session
- Generated tables match publication results
- All file paths work on a different computer
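Hard-coded absolute paths are a common reason that code fails on another computer, and they can also expose user names. As a rough aid, the Python sketch below lists lines in code files that appear to contain an absolute path; the patterns and the function name are assumptions and will not catch every case.

```python
import re
from pathlib import Path

# Windows drive paths (C:\ or C:/) and common macOS/Linux home prefixes.
# The word boundary keeps URLs such as https:// from matching.
ABSOLUTE_PATH = re.compile(r"\b[A-Za-z]:[\\/]|/Users/|/home/")
CODE_SUFFIXES = {".do", ".py", ".r"}


def find_absolute_paths(code_dir):
    """Return (file name, line number, line) for each suspicious line."""
    hits = []
    for path in sorted(Path(code_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in CODE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if ABSOLUTE_PATH.search(line):
                hits.append((path.name, number, line.strip()))
    return hits
```

In a well-organized package the only expected hit is the single working-directory line in the master file, and in the shared version that line should hold a placeholder rather than a real path.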
Documentation Files
ReadMe File Structure
# [Study Title]
## Citation
[Full citation for publication]
## Authors
[List of authors and affiliations]
## Files Included
- Data files with descriptions
- Code files with purposes
- Survey instruments
- User-written commands
## Software Requirements
- Software version (e.g., Stata 17.0)
- Required user-written commands
## Replication Instructions
1. Unzip files maintaining directory structure
2. Update working directory in master.do
3. Run master.do
## Notes and Discrepancies
[Document any differences between paper and reproducible results]
## Contact
[Contact information for questions]
Metadata Requirements
Provide complete metadata including:
- Study title and description
- Authors and affiliations
- Geographic and temporal coverage
- Unit of observation and sample size
- Key variables and collection methods
- Related publications and funding source
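Keeping the metadata in a structured file makes it easy to confirm that nothing on this list is missing before upload. The Python sketch below shows one way to do that; the field names are hypothetical, and the target repository's own metadata schema takes precedence.

```python
import json

# Hypothetical field names mirroring the list above.
REQUIRED_FIELDS = [
    "title",
    "description",
    "authors",
    "geographic_coverage",
    "temporal_coverage",
    "unit_of_observation",
    "sample_size",
    "key_variables",
    "collection_methods",
    "related_publications",
    "funding_source",
]


def missing_metadata(metadata):
    """Return the required fields that are absent or left empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or value == "" or value == [] or value == {}:
            missing.append(field)
    return missing


def to_json(metadata):
    """Serialize metadata to JSON, refusing to do so while fields are missing."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return json.dumps(metadata, indent=2, ensure_ascii=False)
```

Authors and affiliations can be stored together, for example as a list of entries under `authors`.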
Pre-Publication Checklist
Before sharing materials publicly:
Data and Documentation:
Code and Reproducibility:
Metadata:
Best Practices Throughout the Project Lifecycle
Following these practices during your project makes data curation and publication easier later.
Before Data Collection
- Register a pre-analysis plan for the study on a trial registry such as the AEA RCT Registry
- Use the _pii suffix in variable names for survey questions that will collect personally identifiable information; this simplifies later anonymization
- Set up IPA’s folder structure from the start to facilitate file management and data flow
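The payoff of the `_pii` suffix convention is that identifying columns can be dropped by rule instead of by hand. The Python sketch below illustrates this on CSV text; the function name and the sample column names are hypothetical.

```python
import csv
import io

PII_SUFFIX = "_pii"


def drop_pii_columns(csv_text):
    """Return csv_text with every column whose name ends in _pii removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if not name.endswith(PII_SUFFIX)]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, input with the columns `id`, `name_pii`, and `gender_q1` comes back with only `id` and `gender_q1`. Dropping suffixed columns handles direct identifiers that were flagged at design time; indirect identifiers and free-text fields still need separate review.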
After Data Collection
- Anonymize data before writing cleaning code—if PII variables are used in cleaning and shaping code, that code will break during curation and may need to be excluded from the replication package
- Follow coding best practices when writing analysis code to ensure it runs smoothly without breaking
- Avoid post-publication code changes that could influence outputs and create discrepancies between code results and reported findings