Data Processing in Python

Learn to load, explore, visualize, and clean data using pandas and matplotlib. This tutorial covers loading tabular data into DataFrames, exploratory data analysis, creating visualizations, and handling missing values.

Recognition and Attribution

This page modifies content from the Software Carpentry Python Novice Gapminder lesson, specifically Episodes 7 (Reading Tabular Data into DataFrames), 8 (Pandas DataFrames), and 9 (Plotting).

In this tutorial, you will learn the fundamentals of data processing in Python: loading data with pandas, exploring it with summary statistics, selection, and grouping, creating visualizations, and handling missing values.

Pro Tip

When you see a piece of code you don’t understand, it’s okay to Google it or ask an LLM to explain it to you.
Pros do this all the time. There is only so much information you can keep in your head at once.
Of course, when using AI, be sure to verify its answers.

Learning Objectives

By the end of this tutorial, you will be able to:

Load tabular data into pandas DataFrames
Explore data structure and calculate summary statistics
Select and filter data using various methods
Create visualizations to understand data patterns
Identify and handle missing values appropriately

Introduction

Data processing is a fundamental skill for data analysis and research. Python’s pandas library provides powerful tools for working with tabular data, similar to spreadsheets but with much more flexibility and power. In this tutorial, we’ll use real-world data from Gapminder, which contains economic and health indicators for countries around the world.

Tip

This tutorial assumes you’ve completed the Coding in Python tutorial and are familiar with Python basics like variables, data types, and functions.

Loading Data into DataFrames

The first step in data processing is loading your data into a format you can work with. In Python, we use the pandas library to work with tabular data.

Importing Libraries

Let’s start by importing the libraries we’ll need:

import pandas as pd
from causaldata import gapminder

The pandas library (abbreviated as pd by convention) is the primary tool for data manipulation in Python. The causaldata package provides easy access to several datasets commonly used in statistics and research, including Gapminder.

Loading the Gapminder Dataset

Now let’s load the Gapminder data:

# Load the data
df = gapminder.load_pandas().data

Here, we’ve created a variable called df (short for “DataFrame”) that contains our data. A DataFrame is pandas’ primary data structure for storing tabular data.

Note

If you have data in a CSV file instead, you can load it using pd.read_csv('filename.csv'). You can also specify which column should be used as the index using the index_col parameter, like this: pd.read_csv('filename.csv', index_col='country').
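As a minimal sketch of the note above, the example below reads the same CSV text twice, once with the default index and once with index_col. The CSV content, country names, and values are made up for illustration, and io.StringIO stands in for a real file so the example runs without one.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk (values are made up).
csv_text = """country,year,lifeExp
Atlantis,2002,70.1
Atlantis,2007,71.3
Lemuria,2007,65.0
"""

# read_csv accepts a file path or a file-like object such as StringIO.
by_number = pd.read_csv(io.StringIO(csv_text))
by_country = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(by_number.index.tolist())              # [0, 1, 2]
print(by_country.index.tolist())             # ['Atlantis', 'Atlantis', 'Lemuria']
print(by_country.loc['Lemuria', 'lifeExp'])  # 65.0
```

With index_col='country', the country names become row labels and are no longer an ordinary column.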

First Look at the Data

Let’s explore what we’ve loaded. The .head() method shows the first few rows:

df.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

We can see that our DataFrame contains information about countries, including their continent, year, life expectancy, population, and GDP per capita.

To see the last few rows instead, use .tail():

df.tail()

	country	continent	year	lifeExp	pop	gdpPercap
1699	Zimbabwe	Africa	1987	62.351	9216418	706.157306
1700	Zimbabwe	Africa	1992	60.377	10704340	693.420786
1701	Zimbabwe	Africa	1997	46.809	11404948	792.449960
1702	Zimbabwe	Africa	2002	39.989	11926563	672.038623
1703	Zimbabwe	Africa	2007	43.487	12311143	469.709298

Understanding DataFrame Structure

To get a comprehensive overview of our data’s structure, we use .info():

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

This tells us:

The number of rows (entries) and columns
The names of each column
The data type of each column (e.g., object for text, int64 for integers, float64 for decimals)
Whether there are any missing values (compare each column’s Non-Null Count with the total number of entries)
How much memory the DataFrame uses

We can also access specific information:

# Get column names
print("Columns:", df.columns.tolist())

# Get the shape (rows, columns)
print("Shape:", df.shape)

# Get data types
print("\nData types:")
print(df.dtypes)

Columns: ['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap']
Shape: (1704, 6)

Data types:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

Exploratory Data Analysis

Now that we’ve loaded our data, let’s explore it more deeply. Exploratory Data Analysis (EDA) helps us understand what’s in our data and identify patterns, outliers, or issues.

DataFrame Anatomy

A DataFrame is a 2-dimensional table with rows and columns. Each column is actually a Series (pandas’ 1-dimensional data structure). Think of it like a spreadsheet where each column can contain different types of data.
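To make the DataFrame/Series relationship concrete, here is a small sketch on a made-up table (the names and values are illustrative, not Gapminder data). It shows that single brackets return a Series while double brackets return a one-column DataFrame.

```python
import pandas as pd

# Small made-up table (values are illustrative only).
toy = pd.DataFrame({
    'country': ['Atlantis', 'Lemuria'],
    'lifeExp': [70.1, 65.0],
    'pop': [1000, 2500],
})

column = toy['lifeExp']        # single brackets -> Series
table = toy[['lifeExp']]       # double brackets -> one-column DataFrame

print(type(column).__name__)   # Series
print(type(table).__name__)    # DataFrame
print(toy.dtypes)              # each column carries its own dtype
```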

Summary Statistics

The .describe() method provides summary statistics for numerical columns:

df.describe()

	year	lifeExp	pop	gdpPercap
count	1704.00000	1704.000000	1.704000e+03	1704.000000
mean	1979.50000	59.474439	2.960121e+07	7215.327081
std	17.26533	12.917107	1.061579e+08	9857.454543
min	1952.00000	23.599000	6.001100e+04	241.165876
25%	1965.75000	48.198000	2.793664e+06	1202.060309
50%	1979.50000	60.712500	7.023596e+06	3531.846988
75%	1993.25000	70.845500	1.958522e+07	9325.462346
max	2007.00000	82.603000	1.318683e+09	113523.132900

This gives us the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.

For individual columns, we can calculate specific statistics:

print("Mean life expectancy:", df['lifeExp'].mean())
print("Median GDP per capita:", df['gdpPercap'].median())
print("Maximum population:", df['pop'].max())
print("Minimum year:", df['year'].min())

Mean life expectancy: 59.474439366197174
Median GDP per capita: 3531.8469885
Maximum population: 1318683096
Minimum year: 1952
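By default, .describe() summarizes only the numeric columns of a mixed DataFrame. For text columns, counting values is often more informative. The sketch below uses a small made-up table (the names are assumptions for illustration) to show .value_counts() and .nunique().

```python
import pandas as pd

# Made-up rows; only the text columns matter here.
toy = pd.DataFrame({
    'country': ['Atlantis', 'Atlantis', 'Lemuria', 'Mu'],
    'continent': ['Europe', 'Europe', 'Asia', 'Asia'],
})

print(toy['continent'].value_counts())  # number of rows per continent
print(toy['country'].nunique())         # 3 distinct countries
print(toy['country'].unique())          # the distinct values themselves
```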

Selecting Data

Pandas provides two main indexers for selecting data from DataFrames:

.loc[] - Label-based selection (uses row and column names)
.iloc[] - Position-based selection (uses integer positions)

Selecting Columns

To select a single column:

# Select the lifeExp column
life_expectancy = df['lifeExp']
print(life_expectancy.head())

0    28.801
1    30.332
2    31.997
3    34.020
4    36.088
Name: lifeExp, dtype: float64

To select multiple columns:

# Select country, year, and life expectancy
subset = df[['country', 'year', 'lifeExp']]
print(subset.head())

       country  year  lifeExp
0  Afghanistan  1952   28.801
1  Afghanistan  1957   30.332
2  Afghanistan  1962   31.997
3  Afghanistan  1967   34.020
4  Afghanistan  1972   36.088

Selecting Rows

To select rows by position using .iloc[]:

# First 5 rows
df.iloc[0:5]

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

To select rows by label/index using .loc[]:

# Rows 0 through 5
df.loc[0:5]

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
5	Afghanistan	Asia	1977	38.438	14880372	786.113360

Important

Important difference: .loc[] slices are inclusive on both ends (includes index 5), while .iloc[] follows Python convention and excludes the end (stops before index 5).
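In the Gapminder DataFrame the labels happen to equal the positions, which hides the difference between the two indexers. The sketch below uses a made-up DataFrame whose labels start at 10 (an assumption chosen for illustration) so that the two can be told apart.

```python
import pandas as pd

# Made-up values; the index labels deliberately do not start at 0,
# as happens after filtering a larger DataFrame.
toy = pd.DataFrame({'lifeExp': [70.0, 71.0, 72.0, 73.0]},
                   index=[10, 11, 12, 13])

by_position = toy.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = toy.loc[10:12]     # labels 10, 11 and 12 (end included)

print(by_position.index.tolist())  # [10, 11]
print(by_label.index.tolist())     # [10, 11, 12]
print(0 in toy.index)              # False: no row is labelled 0
```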

Selecting Specific Values

To select a specific value:

# Life expectancy for the first row
first_life_exp = df.loc[0, 'lifeExp']
print(f"First life expectancy value: {first_life_exp}")

First life expectancy value: 28.801

Boolean Filtering

One of the most powerful features of pandas is the ability to filter data using Boolean conditions. This is sometimes called “Boolean masking.”

# Create a Boolean mask for data from 2007
mask_2007 = df['year'] == 2007

# Use the mask to filter the data
data_2007 = df[mask_2007]
print(f"Rows in 2007: {len(data_2007)}")
print(data_2007.head())

Rows in 2007: 142
        country continent  year  lifeExp       pop     gdpPercap
11  Afghanistan      Asia  2007   43.828  31889923    974.580338
23      Albania    Europe  2007   76.423   3600523   5937.029526
35      Algeria    Africa  2007   72.301  33333216   6223.367465
47       Angola    Africa  2007   42.731  12420476   4797.231267
59    Argentina  Americas  2007   75.320  40301927  12779.379640

You can combine multiple conditions:

# High GDP countries in 2007
high_gdp_2007 = df[(df['year'] == 2007) & (df['gdpPercap'] > 30000)]
print(f"High GDP countries in 2007: {len(high_gdp_2007)}")
print(high_gdp_2007[['country', 'gdpPercap']])

High GDP countries in 2007: 20
               country    gdpPercap
71           Australia  34435.36744
83             Austria  36126.49270
119            Belgium  33692.60508
251             Canada  36319.23501
419            Denmark  35278.41874
527            Finland  33207.08440
539             France  30470.01670
575            Germany  32170.37442
671   Hong Kong, China  39724.97867
695            Iceland  36180.78919
755            Ireland  40675.99635
803              Japan  31656.06806
863             Kuwait  47306.98978
1091       Netherlands  36797.93332
1151            Norway  49357.19017
1367         Singapore  47143.17964
1475            Sweden  33859.74835
1487       Switzerland  37506.41907
1607    United Kingdom  33203.26128
1619     United States  42951.65309

Tip

When combining conditions, use & for “and” and | for “or”. Make sure to put parentheses around each condition!
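The parentheses matter because Python evaluates & and | before comparisons such as == and >. The sketch below, on a small made-up table (values are illustrative), shows “or” with |, “not” with ~, and .isin() as a compact alternative to several | conditions.

```python
import pandas as pd

toy = pd.DataFrame({
    'continent': ['Asia', 'Europe', 'Africa', 'Europe'],
    'year': [2007, 2007, 2002, 2002],
    'lifeExp': [65.0, 78.0, 52.0, 76.0],
})

# "or": rows from Europe or from 2007
either = toy[(toy['continent'] == 'Europe') | (toy['year'] == 2007)]

# "not": rows that are not from Europe
not_europe = toy[~(toy['continent'] == 'Europe')]

# isin() replaces a chain of | conditions on the same column
asia_or_africa = toy[toy['continent'].isin(['Asia', 'Africa'])]

print(len(either), len(not_europe), len(asia_or_africa))  # 3 2 2
```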

Grouping and Aggregation

Often, we want to compute statistics for different groups in our data. The .groupby() method is perfect for this:

# Average life expectancy by continent
continent_life_exp = df.groupby('continent')['lifeExp'].mean()
print(continent_life_exp)

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

In the example above, we grouped the DataFrame rows by the values in the continent column, selected the lifeExp column within each group, and computed its mean for each group, all in one line of code!

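You can also calculate several statistics at once by passing .agg() a dictionary that maps each column to one or more functions. Note that pop is summed over every survey year here, so the totals are not head counts for any single year: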

# Multiple statistics by continent
continent_stats = df.groupby('continent').agg({
    'lifeExp': ['mean', 'min', 'max'],
    'gdpPercap': 'mean',
    'pop': 'sum'
})
print(continent_stats)

             lifeExp                     gdpPercap          pop
                mean     min     max          mean          sum
continent                                                      
Africa     48.865330  23.599  76.442   2193.754578   6187585961
Americas   64.658737  37.579  80.653   7136.110356   7351438499
Asia       60.064903  28.801  82.603   7902.150428  30507333901
Europe     71.903686  43.585  81.757  14469.475533   6181115304
Oceania    74.326208  69.120  81.235  18621.609223    212992136

Data Visualization

A picture is worth a thousand words. Visualizations help us understand patterns and relationships in data that might not be obvious from tables of numbers. Python provides several libraries for creating visualizations, with matplotlib and seaborn being the most popular.

Setting Up

Let’s import the visualization libraries:

import matplotlib.pyplot as plt
import seaborn as sns

# Set a nice style
sns.set_style("whitegrid")

Note

From our imports, plt is now our reference (a handle, if you will) to the tools contained in matplotlib.pyplot. The same is true for sns and seaborn.

Line Plots

Line plots are great for showing trends over time. Let’s look at how life expectancy has changed for a specific country:

# Filter data for United States
us_data = df[df['country'] == 'United States']

# Create line plot
plt.figure(figsize=(10, 6))
plt.plot(us_data['year'], us_data['lifeExp'], marker='o')
plt.xlabel('Year')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy in the United States Over Time')
plt.grid(True)
plt.show()

We can compare multiple countries:

# Compare several countries
countries_to_compare = ['United States', 'China', 'India', 'Brazil']

plt.figure(figsize=(12, 6))
for country in countries_to_compare:
    country_data = df[df['country'] == country]
    plt.plot(country_data['year'], country_data['lifeExp'],
             marker='o', label=country)

plt.xlabel('Year')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy Trends: Selected Countries')
plt.legend()
plt.grid(True)
plt.show()

Note

The code above uses a for-loop to repeat the same action multiple times. The for country in countries_to_compare: line tells Python to go through each country name in our list, one at a time. For each country, Python executes the indented code below it: first filtering the data to get only that country’s rows, then plotting a line for that country. This is much more concise than writing the same filtering and plotting code four separate times, once for each country. The loop automatically stops after processing the last country in the list.

Scatter Plots

Scatter plots help us see relationships between two variables. Let’s explore the relationship between GDP per capita and life expectancy:

# Scatter plot for 2007 data
data_2007 = df[df['year'] == 2007]

plt.figure(figsize=(10, 6))
plt.scatter(data_2007['gdpPercap'], data_2007['lifeExp'], alpha=0.6)
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Life Expectancy (years)')
plt.title('GDP per Capita vs Life Expectancy (2007)')
plt.xscale('log')  # Use log scale for GDP
plt.grid(True)
plt.show()

We can enhance this by coloring points by continent and sizing them by population:

plt.figure(figsize=(12, 7))
sns.scatterplot(data=data_2007, x='gdpPercap', y='lifeExp',
                hue='continent', size='pop', sizes=(20, 500),
                alpha=0.7)
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Life Expectancy (years)')
plt.title('GDP per Capita vs Life Expectancy by Continent (2007)')
plt.xscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Bar Charts

Bar charts are useful for comparing values across categories:

# Average life expectancy by continent in 2007
continent_2007 = data_2007.groupby('continent')['lifeExp'].mean().sort_values()

plt.figure(figsize=(10, 6))
continent_2007.plot(kind='bar', color='steelblue')
plt.xlabel('Continent')
plt.ylabel('Average Life Expectancy (years)')
plt.title('Average Life Expectancy by Continent (2007)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Using Pandas Built-in Plotting

Pandas Series and DataFrames have built-in plotting methods that can be more convenient:

# Average life expectancy over time
yearly_avg = df.groupby('year')['lifeExp'].mean()

plt.figure(figsize=(10, 6))
yearly_avg.plot(marker='o')
plt.xlabel('Year')
plt.ylabel('Average Life Expectancy (years)')
plt.title('Global Average Life Expectancy Over Time')
plt.grid(True)
plt.show()

Saving Figures

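To save a figure, call plt.savefig() before plt.show(). In a script, plt.show() typically closes the figure once its window is dismissed, so a savefig() call placed after it can write a blank image: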

plt.figure(figsize=(10, 6))
plt.plot(us_data['year'], us_data['lifeExp'], marker='o')
plt.xlabel('Year')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy in the United States')
plt.savefig('us_life_expectancy.png', dpi=300, bbox_inches='tight')
plt.show()

Running this code will save the plot as a PNG file on your computer called us_life_expectancy.png.
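An alternative sketch, assuming you prefer matplotlib’s object-oriented interface: keep a reference to the Figure and save through it, so the result does not depend on which figure is currently active. The data, the file name example_plot.png, and the choice of the non-interactive Agg backend are assumptions made so the example runs anywhere.

```python
from pathlib import Path

import matplotlib
matplotlib.use('Agg')             # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt

years = [1997, 2002, 2007]        # made-up data
values = [70.0, 71.5, 72.8]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, values, marker='o')
ax.set_xlabel('Year')
ax.set_ylabel('Life Expectancy (years)')

# Saving through the Figure object targets this figure explicitly
out = Path('example_plot.png')    # hypothetical output file name
fig.savefig(out, dpi=150, bbox_inches='tight')
plt.close(fig)                    # free the figure once it is saved

print(out.exists())
```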

Dealing with Missingness

Missing data is a common challenge in data analysis. Understanding how to identify and handle missing values is crucial for producing reliable results.

Why Missing Data Matters

Missing data can:

Reduce the statistical power of your analysis
Introduce bias if the missingness is systematic
Cause errors in calculations if not handled properly

Identifying Missing Values

In pandas, missing values are typically represented as NaN (Not a Number). Let’s check if our Gapminder data has any missing values:

# Check for missing values in each column
print("Missing values per column:")
print(df.isnull().sum())

Missing values per column:
country      0
continent    0
year         0
lifeExp      0
pop          0
gdpPercap    0
dtype: int64

Good news! The Gapminder dataset we’re using has no missing values. However, let’s create a sample dataset with missing values to learn how to handle them:

# Create a sample dataset with missing values
import numpy as np

sample_data = df[df['country'].isin(['United States', 'Canada', 'Mexico'])].copy()
sample_data.loc[2, 'lifeExp'] = np.nan
sample_data.loc[5, 'gdpPercap'] = np.nan
sample_data.loc[8, 'pop'] = np.nan

print("Missing values in sample data:")
print(sample_data.isnull().sum())

Missing values in sample data:
country      3
continent    3
year         3
lifeExp      3
pop          3
gdpPercap    3
dtype: int64

We can also display the specific rows that have missing values:

# Show rows with any missing values
print("\nRows with missing values:")
print(sample_data[sample_data.isnull().any(axis=1)][['country', 'year', 'lifeExp', 'gdpPercap', 'pop']])


Rows with missing values:
  country  year  lifeExp  gdpPercap  pop
2     NaN   NaN      NaN        NaN  NaN
5     NaN   NaN      NaN        NaN  NaN
8     NaN   NaN      NaN        NaN  NaN

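These are the rows labelled 2, 5, and 8, but note what happened: sample_data kept the index labels it had in df (Canada’s 2007 row, for example, is labelled 251), so labels 2, 5, and 8 did not exist in it. Assigning through .loc[] to a label that does not exist adds a new row, which is why each of the three rows is NaN in every column and sample_data now has 39 rows instead of 36. The handling techniques below work the same way on these rows.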
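Because .loc[] works with index labels, assigning to a label that is not in the index appends a new row. To blank out a value in a row that already exists, look up that row’s label by position first. A minimal sketch on a made-up DataFrame (the values and the labels 240 to 242 are illustrative assumptions, not Gapminder data) contrasts the two behaviours:

```python
import numpy as np
import pandas as pd

# Made-up values; labels 240-242 mimic a filtered DataFrame
# that kept the index of the DataFrame it came from.
toy = pd.DataFrame({'lifeExp': [70.0, 71.0, 72.0]}, index=[240, 241, 242])

by_label = toy.copy()
by_label.loc[2, 'lifeExp'] = np.nan           # label 2 is absent -> a new row is added
print(len(by_label))                          # 4

by_position = toy.copy()
third_label = by_position.index[2]            # label of the third row (242)
by_position.loc[third_label, 'lifeExp'] = np.nan
print(len(by_position))                       # 3
print(by_position['lifeExp'].isnull().sum())  # 1
```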

Handling Missing Values

There are several strategies for dealing with missing data:

1. Dropping Missing Values

The simplest approach is to remove rows or columns with missing values:

# Drop any rows with missing values
cleaned_data = sample_data.dropna()
print(f"Original rows: {len(sample_data)}, After dropping: {len(cleaned_data)}")

Original rows: 39, After dropping: 36

You can also drop rows only if specific columns have missing values:

# Drop rows only if lifeExp is missing
cleaned_life = sample_data.dropna(subset=['lifeExp'])
print(f"After dropping rows with missing lifeExp: {len(cleaned_life)}")

After dropping rows with missing lifeExp: 36

Warning

Be careful when dropping data! If you have many missing values, you might lose a lot of information. Always check how much data you’re removing.
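One way to check this before dropping anything is to measure the missingness first. The sketch below uses a small made-up DataFrame (values are illustrative) to show the fraction of missing values per column, how many rows .dropna() would remove, and the gentler how='all' option.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'lifeExp': [70.0, np.nan, 72.0, 73.0],
    'gdpPercap': [np.nan, np.nan, 900.0, 950.0],
})

# Fraction of missing values per column (the mean of True/False flags)
print(toy.isnull().mean())

# Rows that dropna() would remove, counted before removing them
n_dropped = len(toy) - len(toy.dropna())
print(f"dropna() would remove {n_dropped} of {len(toy)} rows")

# how='all' drops a row only if every column in it is missing
print(len(toy.dropna(how='all')))
```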

2. Filling Missing Values

Instead of dropping data, you can fill missing values with a specific value:

# Fill missing values with the column mean
sample_filled = sample_data.copy()
sample_filled['lifeExp'] = sample_filled['lifeExp'].fillna(sample_data['lifeExp'].mean())
sample_filled['gdpPercap'] = sample_filled['gdpPercap'].fillna(sample_data['gdpPercap'].mean())
sample_filled['pop'] = sample_filled['pop'].fillna(sample_data['pop'].mean())

print("After filling with mean:")
print(sample_filled[sample_data.isnull().any(axis=1)][['country', 'year', 'lifeExp', 'gdpPercap', 'pop']])

After filling with mean:
  country  year    lifeExp    gdpPercap           pop
2     NaN   NaN  71.263361  18798.67012  1.070295e+08
5     NaN   NaN  71.263361  18798.67012  1.070295e+08
8     NaN   NaN  71.263361  18798.67012  1.070295e+08

Other filling strategies include:

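# Each call below returns a new Series and leaves sample_data unchanged;
# assign the result back to a column if you want to keep it

# Fill with a specific value
sample_data["lifeExp"].fillna(0)

# Forward fill (use the last valid value)
sample_data["lifeExp"].ffill()

# Backward fill (use the next valid value)
sample_data["lifeExp"].bfill()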

3. Interpolation

For time series data (data where we can view changes in a variable over time), interpolation can provide more reasonable estimates. The sample we have chosen to illustrate missingness is too small to demonstrate this adequately, but the code would look like this:

# Interpolate missing values
sample_interpolated = sample_data.copy()
sample_interpolated['lifeExp'] = sample_interpolated['lifeExp'].interpolate()
sample_interpolated['gdpPercap'] = sample_interpolated['gdpPercap'].interpolate()
sample_interpolated['pop'] = sample_interpolated['pop'].interpolate()

print("After interpolation:")
print(sample_interpolated[sample_data.isnull().any(axis=1)][['country', 'year', 'lifeExp', 'gdpPercap', 'pop']])
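To see what interpolation does on data where it is meaningful, here is a minimal sketch on a made-up life expectancy series for a single country (the values are assumptions for illustration). By default, .interpolate() uses the linear method and treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Made-up series for one country, indexed by year, with gaps
life = pd.Series([70.0, np.nan, 72.0, np.nan, np.nan, 75.0],
                 index=[1952, 1957, 1962, 1967, 1972, 1977])

# Each gap is filled on a straight line between its valid neighbours
filled = life.interpolate()
print(filled)
```

When a table holds several countries, interpolate within each country separately so that values from one country do not influence the next.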

Choosing the Right Strategy

Which strategy to use depends on:

Why the data is missing: Random missingness vs systematic patterns
How much data is missing: Small amounts can often be dropped
The nature of the variable: Time series might benefit from interpolation
Your analysis goals: Some methods may introduce less bias than others

Tip

Important note: Many pandas statistical functions automatically ignore NaN values. For example, df['lifeExp'].mean() will calculate the mean of only the non-missing values.
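A short sketch of this behaviour on a made-up Series: the skipna parameter controls whether NaN is ignored, and .count() reports how many values were actually used.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

print(values.mean())                 # 2.0 (NaN is skipped)
print(values.mean(skipna=False))     # nan
print(values.count(), len(values))   # 2 3 (count() excludes NaN)
print(np.mean(values.to_numpy()))    # nan: plain NumPy does not skip NaN
```

Comparing .count() with len() is a quick way to tell how many observations a statistic is really based on.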

Practice Exercises

Let’s practice what we’ve learned with some exercises.

Pro Tip

Try to write the code by yourself before revealing the solution!

Exercise 1: Loading and Exploring

Using the Gapminder dataset:

Load the data
Display information about the DataFrame structure
Find which year has data for the most countries
Calculate the overall average life expectancy across all countries and years

Show the code

# 1. Load the data
from causaldata import gapminder
df = gapminder.load_pandas().data

# 2. Display information
df.info()

# 3. Find year with most countries
countries_per_year = df.groupby('year')['country'].nunique()
max_year = countries_per_year.idxmax()
print(f"Year with most countries: {max_year} ({countries_per_year[max_year]} countries)")

# 4. Calculate overall average life expectancy
overall_avg = df['lifeExp'].mean()
print(f"Overall average life expectancy: {overall_avg:.2f} years")

Exercise 2: Filtering and Aggregation

Filter the data to include only African countries in the year 2007
Calculate the average GDP per capita for these countries
Find the African country with the highest life expectancy in 2007
Count how many African countries had a life expectancy above 50 years in 2007

Show the code

# 1. Filter for African countries in 2007
africa_2007 = df[(df['continent'] == 'Africa') & (df['year'] == 2007)]

# 2. Calculate average GDP per capita
avg_gdp = africa_2007['gdpPercap'].mean()
print(f"Average GDP per capita in Africa (2007): ${avg_gdp:.2f}")

# 3. Find country with highest life expectancy
max_life_idx = africa_2007['lifeExp'].idxmax()
max_life_country = africa_2007.loc[max_life_idx, 'country']
max_life_value = africa_2007.loc[max_life_idx, 'lifeExp']
print(f"Highest life expectancy: {max_life_country} ({max_life_value:.2f} years)")

# 4. Count countries with life expectancy > 50
count_above_50 = (africa_2007['lifeExp'] > 50).sum()
total_countries = len(africa_2007)
print(f"Countries with life expectancy > 50: {count_above_50} out of {total_countries}")

Exercise 3: Visualization

Create a line plot showing how the average global GDP per capita changed over time
Create a scatter plot comparing population and life expectancy for the year 2007
Create a bar chart showing the average life expectancy for each continent in 2007

Show the code

import matplotlib.pyplot as plt

# 1. Line plot of average global GDP over time
avg_gdp_by_year = df.groupby('year')['gdpPercap'].mean()
plt.figure(figsize=(10, 6))
plt.plot(avg_gdp_by_year.index, avg_gdp_by_year.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Average GDP per Capita (USD)')
plt.title('Global Average GDP per Capita Over Time')
plt.grid(True)
plt.show()

# 2. Scatter plot of population vs life expectancy (2007)
data_2007 = df[df['year'] == 2007]
plt.figure(figsize=(10, 6))
plt.scatter(data_2007['pop'], data_2007['lifeExp'], alpha=0.6)
plt.xlabel('Population')
plt.ylabel('Life Expectancy (years)')
plt.title('Population vs Life Expectancy (2007)')
plt.xscale('log')  # Log scale for population
plt.grid(True)
plt.show()

# 3. Bar chart of average life expectancy by continent (2007)
continent_life = data_2007.groupby('continent')['lifeExp'].mean().sort_values()
plt.figure(figsize=(10, 6))
continent_life.plot(kind='bar', color='steelblue')
plt.xlabel('Continent')
plt.ylabel('Average Life Expectancy (years)')
plt.title('Average Life Expectancy by Continent (2007)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Exercise 4: Working with Missing Data

Create a subset of data and practice handling missing values:

Create a subset with data from European countries in 2007
Randomly set 10% of the life expectancy values to NaN
Count how many missing values you created
Fill the missing values using the mean life expectancy
Compare the original and filled values

Show the code

import numpy as np

# 1. Create subset
europe_2007 = df[(df['continent'] == 'Europe') & (df['year'] == 2007)].copy()

# 2. Randomly set 10% to NaN
np.random.seed(42)  # For reproducibility
n_missing = int(len(europe_2007) * 0.1)
missing_indices = np.random.choice(europe_2007.index, n_missing, replace=False)
europe_2007.loc[missing_indices, 'lifeExp'] = np.nan

# 3. Count missing values
n_missing_actual = europe_2007['lifeExp'].isnull().sum()
print(f"Missing values created: {n_missing_actual}")

# 4. Fill with mean
mean_life = europe_2007['lifeExp'].mean()
europe_filled = europe_2007.copy()
europe_filled['lifeExp'] = europe_filled['lifeExp'].fillna(mean_life)

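# 5. Compare (the original values are still in df, because europe_2007 is a copy)
print("\nComparison of original and filled values:")
comparison = europe_filled.loc[missing_indices, ['country', 'lifeExp']].copy()
comparison.columns = ['country', 'filled_lifeExp']
comparison['original_lifeExp'] = df.loc[missing_indices, 'lifeExp']
print(comparison)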
print(f"\nMean value used for filling: {mean_life:.2f}")

Key Points

Use pandas to load and manipulate tabular data in Python
DataFrames are the primary structure for working with data tables
Use .head(), .info(), and .describe() to explore data structure and content
Select data using .loc[] (label-based) or .iloc[] (position-based)
Filter data using Boolean conditions for targeted analysis
Use .groupby() for calculating statistics across groups
Create visualizations with matplotlib and seaborn to understand patterns
Identify missing values with .isnull() and handle them appropriately
Choose missing data strategies based on why data is missing and analysis goals

What’s Next?

Now that you understand data processing fundamentals, you can explore:

Advanced data manipulation techniques (merging, reshaping, pivoting)
More sophisticated visualizations (heat maps, violin plots, pair plots)
Statistical analysis and hypothesis testing
Machine learning with scikit-learn
Working with time series data
Data cleaning and preprocessing techniques

Continue practicing these fundamentals as they form the foundation for all data analysis work in Python!
