Data Publication Preparation

This reference guide documents standardized procedures for preparing research materials—including data, code, and documentation—for public sharing in data repositories.

Tip: Key Takeaways
  • Data publication requires systematic removal of personally identifiable information (PII) while maintaining analytical integrity.
  • Complete documentation includes datasets, code, survey instruments, user-written commands, readme files, and codebooks.
  • Materials must be cleaned, tested for reproducibility, and properly documented before public sharing.

Purpose and Scope

Data repositories allow researchers to share study materials publicly, strengthening research reliability by enabling verification and reuse of data. This guide documents standardized procedures for preparing those materials for public sharing.

Required Materials for Publication

We strongly recommend sharing the following materials:

| File Type | Description | Required? |
|---|---|---|
| Datasets | Full set of collected variables (survey, administrative data, etc.), excluding PII | Yes |
| Data documentation | Context, notes on the data-cleaning process, methodological details | Yes |
| User-written commands | ZIP file of all custom commands/programs | If applicable |
| Analysis code | Code required to reproduce publication results | If applicable |
| ReadMe file | Instructions for replication, including software requirements and command lists | Yes |
| Codebook | Variable names, labels, and descriptions | Yes |
| Survey instruments | All questionnaires and data collection tools | Yes |

We recommend sharing materials in the following formats to encourage accessibility and long-term usability:

| Material Type | Recommended Formats |
|---|---|
| Data | .csv, .json, .dta (with version specified), .parquet (with compression method specified) |
| Code | .do, .py, .r; use notebooks (.ipynb, .Rmd, .qmd) sparingly |
| Documentation | .pdf, .md, .txt |
| Codebooks | .pdf, .xlsx, .csv |
| Surveys | .pdf, .docx, .xlsx |
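
For example, a final Stata dataset can be shared in both an open format and a version-specified native format (a minimal sketch; file names and the target version are illustrative):

* Export the final analysis dataset in open and Stata-native formats
use "data/final/analysis_data.dta", clear

* Plain-text copy for maximum accessibility
export delimited using "data/final/analysis_data.csv", replace

* Stata copy readable by older versions (version 13 chosen here as an example)
saveold "data/final/analysis_data_v13.dta", version(13) replace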

Levels of Material Sharing

Replication packages can be shared at different levels of completeness:

| Level | Materials Included | Enables |
|---|---|---|
| Minimum | Data and code underlying published results, ReadMe explaining file relationships, minimal study-level metadata | Verifying that tables in the published article can be reproduced by running the code; limited reuse |
| Recommended | Minimum level, plus the full set of collected variables (excluding PII), detailed data documentation, and survey instruments | Full potential for reuse and secondary analysis; better understanding of study context for systematic reviews and policy application |
| Exceptional | Recommended level, plus cleaning and variable construction code | Start-to-finish reproducibility |

Organizing Materials

Establish a clear directory structure:

study-materials/
├── data/
│   ├── raw/
│   ├── intermediate/
│   └── final/
├── code/
│   ├── cleaning/
│   ├── analysis/
│   └── figures-tables/
├── documentation/
│   ├── surveys/
│   ├── codebooks/
│   └── readme/
└── user-commands/

Data Preparation

Variable Naming and Labeling

Variable names should be:

  • Short (no more than 32 characters, Stata's maximum)
  • Descriptive
  • Linked to survey questions where applicable
  • Free of spaces and special characters

Variable labels must:

  • Stay within Stata’s 80-character limit
  • Reference survey question numbers
  • Be interpretable without additional context

Example:

Variable name: gender_q1
Variable label: Q1. Gender of respondent
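
In Stata, the renaming and labeling might look like this (a minimal sketch that assumes the raw export calls the question q1 and codes responses as 1 and 2):

* Rename the raw variable and attach a label that references the survey question
rename q1 gender_q1
label variable gender_q1 "Q1. Gender of respondent"

* Define and attach value labels so categories are self-describing
label define gender_lbl 1 "Female" 2 "Male"
label values gender_q1 gender_lbl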

Categorical Variables with “Other” Options

Many surveys include categorical variables with “Other, specify” options. Review these carefully to:

  1. Recode responses that match existing categories
  2. Create new categories if multiple responses suggest a pattern
  3. Document any recoding decisions
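
For instance, suppose an occupation question stores "Other" as code -96 with free text in occupation_other (variable names and codes here are hypothetical):

* Recode "Other, specify" responses that match an existing category
replace occupation = 2 if occupation == -96 & ///
    regexm(lower(occupation_other), "teach")        // 2 = existing "Teacher" category

* Create a new category when several responses suggest a pattern
replace occupation = 7 if occupation == -96 & ///
    regexm(lower(occupation_other), "driver")
label define occupation_lbl 7 "Driver", add         // assumes occupation_lbl already exists

* Record each recoding decision in the data documentation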

Quality Checks Before Sharing

Verify that:

  • All variables have descriptive labels
  • Value labels exist for all categorical variables
  • Missing value codes are consistent and documented
  • Outliers have been reviewed and documented
  • Variable types are appropriate (numeric vs. string)
  • Date variables are in standard formats
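
A minimal Stata sketch covering several of these checks (dataset and variable names are illustrative):

use "data/final/analysis_data.dta", clear

isid respondent_id              // the unit of observation is uniquely identified
codebook, compact               // spot unlabeled variables and implausible ranges

* Confirm a categorical variable takes only documented values (here 1-3 or missing)
assert inlist(treatment_arm, 1, 2, 3) | missing(treatment_arm)

* Confirm dates are stored as Stata dates, not strings, and displayed consistently
confirm numeric variable interview_date
format interview_date %td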

Removing Personally Identifiable Information

Before sharing data publicly, all personally identifiable information (PII) must be removed or anonymized. PII includes direct identifiers (names, contact information, precise locations) and indirect identifiers (combinations of variables that could identify individuals).

Important: Comprehensive PII Guidance

For detailed information on identifying and removing PII, including HIPAA guidelines, detection strategies, and anonymization techniques, see the Data Security Protocol: Personally Identifiable Information section of IPA’s Data Security Hub.

Quick PII Removal Checklist

Always remove or anonymize:

  • Names (participants, relatives, staff, enumerators)
  • Contact information (phone, email, address)
  • Geographic identifiers for areas with populations under 20,000
  • Financial identifiers (bank accounts, credit cards)
  • Government IDs and medical identifiers

Check for PII in:

  • Variable names and labels
  • String variables and free-text responses
  • Code comments and file paths
  • Survey instruments and documentation
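
A hedged Stata sketch of this scan (variable names are illustrative; always review the full codebook and documentation as well):

* Flag variables whose names or labels mention likely identifiers
lookfor name phone email address gps

* Drop direct identifiers found in the review
drop respondent_name phone_number home_address gps_latitude gps_longitude

* Inspect free-text responses for stray names or locations before sharing
list other_specify if other_specify != ""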

Code Documentation

Essential Code Components

Every code file should include:

  1. Header with study information:
/*******************************************************************************
* Study: [Full study name]
* Purpose: [What this code does]
* Last updated: [Date]
* Stata version: [Version number]
*******************************************************************************/

* Basic settings
clear all
set more off
set maxvar 10000

* Set working directory
cd "INSERT WORKING DIRECTORY HERE"
  2. Clear comments explaining:

    • Major sections and subsections
    • Complex operations
    • Non-obvious decisions
  3. Output linked to publications:

* Table 1: Baseline characteristics
eststo clear
eststo: reg outcome treatment controls
esttab using "output/table1_baseline.tex", replace ///
    title("Table 1: Baseline Characteristics")

Master Do-File

Create a master file that runs all code in the correct sequence:

/*******************************************************************************
* Master do-file for [Study Name]
* Runs all code to reproduce paper results
*******************************************************************************/

cd "INSERT WORKING DIRECTORY HERE"

* Data preparation
do "code/01_clean_data.do"
do "code/02_construct_variables.do"

* Analysis
do "code/03_main_analysis.do"
do "code/04_robustness_checks.do"

* Output
do "code/05_tables.do"
do "code/06_figures.do"

User-Written Commands

Include all custom commands (.ado files) used in your analysis:

  • Identify them with Stata's which command (e.g., which outreg2)
  • Collect the .ado files from C:\Users\[username]\ado\plus\ or \ado\personal\
  • Package them as a ZIP file, with the full list of commands noted in the ReadMe

Code Verification

Before sharing, verify that:

  1. Code runs without errors from a fresh Stata session
  2. Generated tables match publication results
  3. All file paths work on a different computer
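
One way to test this is to run the entire package from a clean session and capture a log (a sketch that assumes master.do sits at the package root, as in the ReadMe instructions):

clear all
set more off
capture log close
log using "replication_run.log", replace text

do "master.do"        // should complete end to end without errors

log close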

Documentation Files

ReadMe File Structure

# [Study Title]

## Citation
[Full citation for publication]

## Authors
[List of authors and affiliations]

## Files Included
- Data files with descriptions
- Code files with purposes
- Survey instruments
- User-written commands

## Software Requirements
- Software version (e.g., Stata 17.0)
- Required user-written commands

## Replication Instructions
1. Unzip files maintaining directory structure
2. Update working directory in master.do
3. Run master.do

## Notes and Discrepancies
[Document any differences between paper and reproducible results]

## Contact
[Contact information for questions]

Metadata Requirements

Provide complete metadata including:

  • Study title and description
  • Authors and affiliations
  • Geographic and temporal coverage
  • Unit of observation and sample size
  • Key variables and collection methods
  • Related publications and funding source

Pre-Publication Checklist

Before sharing materials publicly:

Data and Documentation:

  • All PII has been removed or anonymized
  • All variables have descriptive names, labels, and value labels
  • Missing value codes, outliers, and recoding decisions are documented
  • Codebook, survey instruments, and data documentation are included

Code and Reproducibility:

  • Code runs without errors from a fresh session
  • Generated tables and figures match published results
  • File paths work on a different computer
  • User-written commands are packaged and listed in the ReadMe

Metadata:

  • Study title, description, authors, and affiliations are complete
  • Geographic and temporal coverage, unit of observation, and sample size are recorded
  • Related publications and funding sources are listed

Best Practices Throughout the Project Lifecycle

Following these practices during your project makes data curation and publication easier later.

Before Data Collection

  • Register a pre-analysis plan for the study on a trial registry such as the AEA RCT Registry
  • Use a _pii suffix in variable names for survey questions that will collect personally identifiable information; this simplifies later anonymization (see the sketch after this list)
  • Set up IPA’s folder structure from the start to facilitate file management and data flow
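
A minimal sketch of how the _pii suffix pays off once data arrive (file and variable names are illustrative):

* At collection, identifying fields carry the suffix, e.g.
*   respondent_name_pii, phone_number_pii, gps_latitude_pii

* Before any cleaning code is written, de-identification is a single step
use "data/raw/survey_raw.dta", clear
drop *_pii
save "data/intermediate/survey_deidentified.dta", replace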

After Data Collection

  • Anonymize data before writing cleaning code—if PII variables are used in cleaning and shaping code, that code will break during curation and may need to be excluded from the replication package
  • Follow coding best practices when writing analysis code to ensure it runs smoothly without breaking
  • Avoid post-publication code changes that could influence outputs and create discrepancies between code results and reported findings

Sharing Procedures

Internal Archiving

Minimum requirements before project close include a de-identified dataset, survey instruments, and a ReadMe file. Recommended additions are analysis code and variable construction code to enable full reproducibility of results.

Public Repository Options

Public data repositories provide specialized infrastructure for sharing research materials with the broader scientific community. These platforms offer standardized systems for depositing, documenting, and preserving datasets beyond internal archiving.

Benefits:

  • Persistent DOI for citation
  • Long-term preservation
  • Increased discoverability
  • Standard metadata formats

Note: IPA's Data Repository

IPA maintains a data repository using Harvard’s Dataverse at dataverse.harvard.edu/dataverse/IPA. For step-by-step instructions on curating and uploading a replication package, see How to Publish a Replication Package to Dataverse.
