Data Aggregation

This guide explains two primary methods for combining datasets: appending datasets with the same variables but different observations, and merging datasets with the same observations but different variables. Learn best practices for exact matching and alternatives for fuzzy matching when precise identifiers are unavailable.

Key Takeaways

There are two main approaches for data aggregation: Appending—for datasets with same variables but different observations—and Merging—for datasets with same observations but different variables
When exact matching is impossible, there are imperfect fuzzy matching alternatives.

When to Use Each Method

Use appending when: You have datasets with the same variables but different observations. Examples include combining adult and child survey data, or merging male and female participant files.

Use merging when: You have datasets with the same observations but different variables. Examples include combining survey responses split across multiple files by the survey program.

Appending Data

Appending combines files that contain the same variables but different observations to create a complete dataset. When appending datasets, pay attention to variable names and types.

Key Considerations for Appending

Variable names must match exactly: only variables with identical names will combine successfully.
Missing variables create gaps: if one dataset contains a variable that another lacks, the joint dataset will include that variable but observations from datasets without it will show as missing.
Variable types must be compatible: variables should be the same general type, either numeric or string:
- Mismatched types will generate errors and prompt you to use the force option
- The force option is not recommended as it keeps the master data type and marks using observations as missing
- Resolve type conflicts before appending
String length handling: combining string variables of different lengths results in a string of the longer length.
Numeric precision: when combining different numeric types, Stata keeps the more precise type and converts lower precision variables to higher precision.

Merging Data

Merging combines two or more datasets that contain the same observations but different variables. This situation often occurs when survey programs split variables across multiple datasets.

Prerequisites for Merging

Ensure both datasets have a unique ID
Specify the correct merge type: one-to-one or one-to-many
Check that datasets do not share variable names, except for ID variables

Handling Variable Name Conflicts

When datasets contain variables with identical names, Stata defaults to keeping the master data values. You can modify this behavior using update or replace options, but renaming one variable and keeping both is often preferable.

Many-to-Many Merge Warning

Avoid many-to-many merges as they represent poor practice. Many users expect this merge type to create all pair-wise combinations of matching observations, but it actually pairs datasets based on observation order within each ID. Use joinby if you need all pair-wise combinations.

Post-Merge Validation

After merging, validate results by checking the _merge variable. Each match type receives a numeric code. Use tab _merge to verify that results match expectations: number of matches, master-only observations, using-only observations, updated missing values, and conflicting nonmissing values.

Consider adding assertions after merging to ensure correct execution. Alternative merge commands like safemerge and mmerge provide additional safety features.

Useful Merge Options

Helpful options include assert, keep, keepusing, gen, and nogen. Consult help merge for detailed explanations.

Additional Resources

The IPA Stata beginner’s training manual provides step-by-step merge guidance. The IPA high intermediate Stata training includes a module on merging with common pitfall discussions.

Fuzzy Matching

When datasets lack unique IDs, you must find alternative linking methods. String matching on names or other variables is common but often inaccurate due to misspellings. For machine-entered data, removing common mismatch sources like spaces and inconsistent capitalization may help. Manual matching works for small datasets, but large datasets often require fuzzy matching solutions.

Fuzzy matching finds approximately similar strings rather than exact matches. These algorithms provide match probabilities and only apply when exact matching is impossible. Consult with your manager before using fuzzy matching, as its match rate is typically lower than exact matching. Exact matching remains preferable when possible.

Fuzzy Matching Process

Fuzzy matching typically involves three steps:

String cleaning: standardize spaces, capitalization, and remove special characters. This may include removing common phrases like titles in names.
Probabilistic matching: algorithms estimate match probabilities between observations in different datasets.
Match review: review matches to determine the success threshold, then manually verify remaining matches.

Stata Commands for Fuzzy Matching

Several user-written commands support fuzzy matching in Stata. Install them using ssc install [command]:

`reclink`

Specify one or more variables for similarity assessment instead of a single unique ID. Numeric ages of twenty-five and twenty-six match more than thirty-five and sixty-five. String matching works even with imperfectly clean variables.

Known issues:

The idusing() variable should never exist in the master dataset
reclink does not merge shared variable values from the using dataset without warning
Datasets should not share variable names except for matching variables
When idusing() and idmaster() use identical variable names, matches may fail
Consider using temporary files when changing variable names for matching

Alternative: reclink2 improves on reclink and adds many-to-one matching. Install with net install dm0082.

`matchit`

Matches single variables to generate probability scores. Match multiple columns sequentially and average or weight probabilities for multi-column matching.

Features multiple matching algorithms explained in the help file. All algorithms provide multiple likely matches for second-best guesses. Merge results using joinby or merge, but matchit does not function as a merge command.

`strgroup`

Install with net install strgroup. Calculates string differences and uses user-specified thresholds to create groups and matches.

Works within a single dataset. Data preparation typically involves joining a master variable to ID variables in the using dataset.

Additional Resources

Numerous other commands and approaches exist for this problem. The economics literature has evaluated these methods for historical record linkage. Abramitzky et al. test multiple matching protocols and find that automated comparison techniques compare favorably to human matching in various circumstances. Stata resources accompany the linked paper.

Reuse

CC BY 4.0