Shell Scripts
Learn to create reusable shell scripts that automate complex workflows. Discover how to write, execute, and debug shell scripts, making your computational tasks reproducible and shareable. Master the fundamentals of shell scripting for research automation.
This page is adapted from the Software Carpentry Shell Novice lesson, Copyright (c) The Carpentries. The original material is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Changes made: Content has been modified and expanded by Innovations for Poverty Action (IPA) to include IPA-specific examples, multi-shell syntax (Bash, PowerShell, NuShell), and context relevant to research data management.
Original citation: Gabriel A. Devenyi (Ed.), Gerard Capes (Ed.), Colin Morris (Ed.), Will Pitchers (Ed.), Greg Wilson, Gerard Capes, Gabriel A. Devenyi, Christina Koch, Raniere Silva, Ashwin Srinath, et al. (2019, July). swcarpentry/shell-novice: Software Carpentry: the UNIX shell, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3266823
- Write a shell script that runs a command or series of commands for a fixed set of files.
- Run a shell script from the command line.
- Write a shell script that operates on a set of files defined by the user on the command line.
- Create pipelines that include shell scripts you, and others, have written.
We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.
Not only will writing shell scripts make your work faster (you won’t have to retype the same commands over and over again), it will also make your work more accurate (fewer chances for typos) and more reproducible. If you come back to your work later (or if someone else finds your work and wants to build on it) you will be able to reproduce the same results simply by running your script, rather than having to remember or retype a long list of commands.
You might wonder why we are using shell scripts instead of a more traditional programming language like Python, Stata, or R.
The shell is particularly well suited to automating workflows. Other sections of this website explore programming languages for data analysis, but to get the most out of those languages you will often need the shell to manage files, run programs, and automate tasks. Learning to write and understand shell scripts is a valuable skill that complements your programming knowledge. Plus, we don’t want to get bogged down in programming language syntax when our goal is to learn to use the shell rather than a full programming language.
Script File Conventions
Each shell has different conventions for scripts:
| Shell | File Extension | Run Command | Shebang (optional) |
|---|---|---|---|
| Bash | .sh | bash script.sh | #!/bin/bash |
| PowerShell | .ps1 | .\script.ps1 | (not used) |
| NuShell | .nu | nu script.nu or source script.nu | #!/usr/bin/env nu |
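The shebang column matters only when a script is run directly by its path rather than passed to an interpreter. The Bash sketch below illustrates both ways of running a script; the file name hello.sh and the temporary directory are made up for the illustration and are not part of the lesson data.

```shell
# Create a tiny throwaway script in a temporary directory
# (hello.sh is a hypothetical name used only for this sketch)
workdir=$(mktemp -d)
cat > "$workdir/hello.sh" <<'EOF'
#!/bin/bash
echo "hello from a script"
EOF

# Option 1: pass the file to bash explicitly; no execute permission is needed
out_explicit=$(bash "$workdir/hello.sh")
echo "$out_explicit"

# Option 2: mark the file executable, then run it by its path;
# the shebang line tells the system which interpreter to use
chmod +x "$workdir/hello.sh"
out_direct=$("$workdir/hello.sh")
echo "$out_direct"
```

Both calls should print the same line. The rest of this page uses the explicit `bash script.sh` form, which works whether or not the file is executable.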
On Windows, you may need to adjust your execution policy to run PowerShell scripts. If you get an error about scripts being disabled, run this command once:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
Building a Validation Script
Now let’s help Amara create a validation script for her survey data. We’ll build it step by step, starting simple and adding features as we go.
Step 1: A Simple Script to Count Records
Let’s start by creating a script that counts the records in a CSV file. Each version below first navigates to the survey-data directory and then opens a new script file in an editor.
Create a file called count_records.sh:
cd exercise-data/survey-data
nano count_records.sh
Add this content:
#!/bin/bash
# Count records in a CSV file (excluding header)
tail -n +2 hh_baseline_001.csv | wc -l
Run it:
bash count_records.sh
20
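To see why this pipeline excludes the header, it can help to try it on a very small file. The following is a minimal sketch using made-up rows written to a temporary file, not the lesson's survey data.

```shell
# Build a small sample CSV (made-up rows, not the lesson data)
sample=$(mktemp)
printf 'hhid,village\nHH001,Kisumu\nHH002,Siaya\nHH003,Busia\n' > "$sample"

# tail -n +2 prints from line 2 onward, which drops the header row
tail -n +2 "$sample"

# wc -l counts the remaining lines; tr strips the padding some systems add
record_count=$(tail -n +2 "$sample" | wc -l | tr -d ' ')
echo "records: $record_count"
rm -f "$sample"
```

Note that `wc -l` counts newline characters, so a file whose last row lacks a trailing newline would be reported as one record short.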
Create a file called count_records.ps1:
cd exercise-data\survey-data
notepad count_records.ps1
Add this content:
# Count records in a CSV file (excluding header)
(Import-Csv hh_baseline_001.csv | Measure-Object).Count
Run it:
.\count_records.ps1
20
Create a file called count_records.nu:
cd exercise-data/survey-data
# Use your preferred editor
Add this content:
# Count records in a CSV file (excluding header)
open hh_baseline_001.csv | length
Run it:
nu count_records.nu
20
Step 2: Adding Command-Line Arguments
Our script only works with one file. Let’s make it flexible by accepting the filename as an argument:
Update count_records.sh:
#!/bin/bash
# Count records in a CSV file (excluding header)
# Usage: bash count_records.sh <filename>
tail -n +2 "$1" | wc -l
In Bash, $1 refers to the first argument passed to the script.
Run it with different files:
bash count_records.sh hh_baseline_001.csv
bash count_records.sh hh_baseline_005.csv
20
10
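Besides $1, Bash provides $2, $3, and so on for later arguments, $# for the number of arguments, and "$@" for all of them. The sketch below uses a hypothetical function named show_args to display them; positional parameters behave the same way inside a function as inside a script file.

```shell
# show_args is a hypothetical helper; inside it, $1, $#, and "$@"
# behave the same way they do inside a script file
show_args() {
    echo "count: $#"
    echo "first: $1"
    # "$@" expands to every argument, each kept as a separate word
    for arg in "$@"; do
        echo "arg: $arg"
    done
}

args_output=$(show_args hh_baseline_001.csv "file with spaces.csv")
echo "$args_output"
```

The second argument contains spaces but still arrives as a single argument because it was quoted when the function was called.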
Update count_records.ps1:
# Count records in a CSV file (excluding header)
# Usage: .\count_records.ps1 <filename>
param(
    [Parameter(Mandatory=$true)]
    [string]$File
)
(Import-Csv $File | Measure-Object).Count
PowerShell uses the param() block to define named parameters.
Run it:
.\count_records.ps1 -File hh_baseline_001.csv
.\count_records.ps1 hh_baseline_005.csv
20
10
Update count_records.nu:
# Count records in a CSV file (excluding header)
# Usage: nu count_records.nu <filename>
def main [file: string] {
    open $file | length
}
NuShell uses def main to define the script entry point with typed parameters.
Run it:
nu count_records.nu hh_baseline_001.csv
nu count_records.nu hh_baseline_005.csv
20
10
Step 3: Building the Full Validation Script
Now let’s build a complete validation script that checks for:
- File existence
- Correct header columns
- Missing required fields (hhid, consent)
Create validate_survey.sh:
#!/bin/bash
# validate_survey.sh - Validate survey CSV files
# Usage: bash validate_survey.sh <file.csv>
EXPECTED_HEADER="hhid,survey_date,village,treatment_arm,consent,age,education_years"
# Check for correct number of arguments
if [ $# -ne 1 ]; then
    echo "Usage: bash validate_survey.sh <file.csv>"
    exit 1
fi
FILE="$1"
# Check if file exists
if [ ! -f "$FILE" ]; then
    echo "ERROR: File not found: $FILE"
    exit 2
fi
echo "Validating: $FILE"
echo "----------------------------------------"
# Check header
HEADER=$(head -n 1 "$FILE")
if [ "$HEADER" = "$EXPECTED_HEADER" ]; then
    echo "[PASS] Header matches expected columns"
else
    echo "[FAIL] Header mismatch"
fi
# Count total rows
TOTAL_ROWS=$(tail -n +2 "$FILE" | wc -l)
echo "[INFO] Total records: $TOTAL_ROWS"
# Count rows with missing hhid
MISSING_HHID=$(cut -d',' -f1 "$FILE" | tail -n +2 | grep -c "^$")
if [ "$MISSING_HHID" -gt 0 ]; then
    echo "[FAIL] Missing hhid: $MISSING_HHID rows"
else
    echo "[PASS] All rows have hhid"
fi
# Count rows with missing consent
MISSING_CONSENT=$(cut -d',' -f5 "$FILE" | tail -n +2 | grep -c "^$")
if [ "$MISSING_CONSENT" -gt 0 ]; then
    echo "[FAIL] Missing consent: $MISSING_CONSENT rows"
else
    echo "[PASS] All rows have consent"
fi
echo "----------------------------------------"
# Summary
if [ "$MISSING_HHID" -eq 0 ] && [ "$MISSING_CONSENT" -eq 0 ]; then
    echo "RESULT: PASS"
else
    echo "RESULT: FAIL"
fi
Create validate_survey.ps1:
# validate_survey.ps1 - Validate survey CSV files
# Usage: .\validate_survey.ps1 <file.csv>
param(
    [Parameter(Mandatory=$true, Position=0)]
    [string]$File
)
$ExpectedHeader = "hhid,survey_date,village,treatment_arm,consent,age,education_years"
# Check if file exists
if (-not (Test-Path $File)) {
    Write-Host "ERROR: File not found: $File" -ForegroundColor Red
    exit 2
}
Write-Host "Validating: $File"
Write-Host "----------------------------------------"
# Read only the first line; this stays a whole string even for a one-line file
$Header = Get-Content $File -TotalCount 1
if ($Header -eq $ExpectedHeader) {
    Write-Host "[PASS] Header matches expected columns" -ForegroundColor Green
} else {
    Write-Host "[FAIL] Header mismatch" -ForegroundColor Red
}
# Import CSV for data analysis; @() keeps a single-row result as an array so .Count works
$Data = @(Import-Csv $File)
$TotalRows = $Data.Count
Write-Host "[INFO] Total records: $TotalRows"
# Check for missing hhid (@() ensures .Count is a number even for zero or one match)
$MissingHhid = @($Data | Where-Object { $_.hhid -eq '' }).Count
if ($MissingHhid -gt 0) {
    Write-Host "[FAIL] Missing hhid: $MissingHhid rows" -ForegroundColor Red
} else {
    Write-Host "[PASS] All rows have hhid" -ForegroundColor Green
}
# Check for missing consent
$MissingConsent = @($Data | Where-Object { $_.consent -eq '' }).Count
if ($MissingConsent -gt 0) {
    Write-Host "[FAIL] Missing consent: $MissingConsent rows" -ForegroundColor Red
} else {
    Write-Host "[PASS] All rows have consent" -ForegroundColor Green
}
Write-Host "----------------------------------------"
# Summary
if ($MissingHhid -eq 0 -and $MissingConsent -eq 0) {
    Write-Host "RESULT: PASS" -ForegroundColor Green
} else {
    Write-Host "RESULT: FAIL" -ForegroundColor Red
}
Create validate_survey.nu:
# validate_survey.nu - Validate survey CSV files
# Usage: nu validate_survey.nu <file.csv>
def main [file: string] {
    let expected_columns = ["hhid", "survey_date", "village", "treatment_arm",
                            "consent", "age", "education_years"]
    # Check if file exists
    if not ($file | path exists) {
        print $"ERROR: File not found: ($file)"
        exit 2
    }
    print $"Validating: ($file)"
    print "----------------------------------------"
    # Read the CSV file
    let data = open $file
    let actual_columns = $data | columns
    # Check header
    if $actual_columns == $expected_columns {
        print "[PASS] Header matches expected columns"
    } else {
        print "[FAIL] Header mismatch"
    }
    # Count total rows
    let total_rows = $data | length
    print $"[INFO] Total records: ($total_rows)"
    # Check for missing hhid
    let missing_hhid = $data | where { |row|
        ($row.hhid | is-empty) or ($row.hhid | into string | str trim | is-empty)
    } | length
    if $missing_hhid > 0 {
        print $"[FAIL] Missing hhid: ($missing_hhid) rows"
    } else {
        print "[PASS] All rows have hhid"
    }
    # Check for missing consent
    let missing_consent = $data | where { |row|
        ($row.consent | is-empty) or ($row.consent | into string | str trim | is-empty)
    } | length
    if $missing_consent > 0 {
        print $"[FAIL] Missing consent: ($missing_consent) rows"
    } else {
        print "[PASS] All rows have consent"
    }
    print "----------------------------------------"
    # Summary
    if $missing_hhid == 0 and $missing_consent == 0 {
        print "RESULT: PASS"
    } else {
        print "RESULT: FAIL"
    }
}
Running the Validation Script
Let’s test our scripts on different files:
# Test on a clean file
bash validate_survey.sh hh_baseline_001.csv
Validating: hh_baseline_001.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 20
[PASS] All rows have hhid
[PASS] All rows have consent
----------------------------------------
RESULT: PASS
# Test on a file with issues
bash validate_survey.sh hh_baseline_009.csv
Validating: hh_baseline_009.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 10
[FAIL] Missing hhid: 3 rows
[PASS] All rows have consent
----------------------------------------
RESULT: FAIL
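The missing-field counts reported by the Bash script come from a three-stage pipeline: cut selects one column, tail drops the header, and grep -c counts the empty lines that remain. A minimal sketch on made-up data (a two-column file, so consent is column 2 here rather than column 5) shows each count:

```shell
# Made-up sample: two rows lack an hhid and one row lacks a consent value
sample=$(mktemp)
printf 'hhid,consent\nHH001,yes\n,yes\nHH003,\n,no\n' > "$sample"

# cut keeps one column, tail drops the header, grep -c counts empty lines
missing_hhid=$(cut -d',' -f1 "$sample" | tail -n +2 | grep -c '^$')
missing_consent=$(cut -d',' -f2 "$sample" | tail -n +2 | grep -c '^$')

echo "missing hhid: $missing_hhid"
echo "missing consent: $missing_consent"
rm -f "$sample"
```

Because cut splits on every comma, this approach assumes that no field contains a quoted comma; a value such as "Kisumu, West" would shift the later columns.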
# Test on a clean file
.\validate_survey.ps1 hh_baseline_001.csv
Validating: hh_baseline_001.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 20
[PASS] All rows have hhid
[PASS] All rows have consent
----------------------------------------
RESULT: PASS
# Test on a file with issues
.\validate_survey.ps1 hh_baseline_009.csv
Validating: hh_baseline_009.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 10
[FAIL] Missing hhid: 3 rows
[PASS] All rows have consent
----------------------------------------
RESULT: FAIL
# Test on a clean file
nu validate_survey.nu hh_baseline_001.csv
Validating: hh_baseline_001.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 20
[PASS] All rows have hhid
[PASS] All rows have consent
----------------------------------------
RESULT: PASS
# Test on a file with issues
nu validate_survey.nu hh_baseline_009.csv
Validating: hh_baseline_009.csv
----------------------------------------
[PASS] Header matches expected columns
[INFO] Total records: 10
[FAIL] Missing hhid: 3 rows
[PASS] All rows have consent
----------------------------------------
RESULT: FAIL
Running Scripts in a Loop
Now Amara can validate all her files at once by combining scripts with loops:
Bash:
for file in hh_baseline_*.csv; do
    bash validate_survey.sh "$file"
    echo ""
done
PowerShell:
Get-ChildItem hh_baseline_*.csv | ForEach-Object {
    .\validate_survey.ps1 $_.Name
    Write-Output ""
}
NuShell:
ls hh_baseline_*.csv | each { |file|
    nu validate_survey.nu $file.name
    print ""
}
Reference Scripts
The complete validation scripts are included in the survey-data directory for reference:
- validate_survey.sh - Bash version
- validate_survey.ps1 - PowerShell version
- validate_survey.nu - NuShell version
You can examine these scripts to see the complete implementation, including additional error handling.
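A script's output can also feed a pipeline, which makes it possible to summarize a whole batch of files. The sketch below is one possible approach, not part of the reference scripts: it pipes each run through grep to tally the files that report RESULT: FAIL. So that it runs anywhere, it creates two made-up CSV files and a minimal stand-in validator called check.sh (a hypothetical name); with the lesson files you would call validate_survey.sh instead.

```shell
# Stand-in setup: two made-up CSV files and a minimal stand-in validator
# (check.sh is hypothetical, not the lesson's validate_survey.sh)
workdir=$(mktemp -d)
printf 'hhid,consent\nHH001,yes\n' > "$workdir/hh_sample_001.csv"
printf 'hhid,consent\n,yes\n' > "$workdir/hh_sample_002.csv"
cat > "$workdir/check.sh" <<'EOF'
#!/bin/bash
# Report FAIL if any data row has an empty first column
if cut -d',' -f1 "$1" | tail -n +2 | grep -q '^$'; then
    echo "RESULT: FAIL"
else
    echo "RESULT: PASS"
fi
EOF

# Run the validator on every file and tally the files that report FAIL
failed=0
total=0
for file in "$workdir"/hh_sample_*.csv; do
    total=$((total + 1))
    # The script's output is piped to grep, so the script is one pipeline stage
    if bash "$workdir/check.sh" "$file" | grep -q '^RESULT: FAIL'; then
        echo "needs attention: $(basename "$file")"
        failed=$((failed + 1))
    fi
done
echo "$failed of $total files failed validation"
```

Matching on the RESULT line is a design choice for this sketch: the Bash validator shown earlier only uses a non-zero exit code for a wrong argument count or a missing file, so its exit status alone does not distinguish PASS from FAIL.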
Key Points
- Save commands in files (shell scripts) for reuse:
  - Bash: .sh files, run with bash script.sh
  - PowerShell: .ps1 files, run with .\script.ps1
  - NuShell: .nu files, run with nu script.nu or source script.nu
- Command-line arguments differ by shell:
  - Bash: $1, $2, $@ (all arguments)
  - PowerShell: Use param() block with named parameters
  - NuShell: Use def main [arg: type] with typed parameters
- Scripts make your work reproducible—you can run the same validation on new data.
- Combine scripts with loops to process multiple files automatically.
- Place variables in quotes if the values might have spaces in them.
- PowerShell requires appropriate execution policy settings to run scripts.
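The point about quoting can be seen directly in Bash by counting how many arguments a command receives. This is a small sketch with a made-up file name and a hypothetical helper function, count_args:

```shell
# A made-up file name that contains spaces
name="baseline round 1.csv"

# count_args is a hypothetical helper that reports how many arguments it received
count_args() { echo "$#"; }

# Unquoted: Bash splits the value at each space into separate arguments
unquoted=$(count_args $name)

# Quoted: the value is passed through as a single argument
quoted=$(count_args "$name")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

In Bash this prints 3 for the unquoted call and 1 for the quoted call, which is why the scripts on this page write "$1" and "$FILE" rather than $1 and $FILE.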