Python

Python is a high-level, general-purpose programming language that is widely used in data science, machine learning, and web development. It has a large standard library and a vibrant community that provides a wide range of libraries and tools for various applications. As such, Python provides a general-purpose ecosystem that can be used for a wide range of applications.

How to install Python?

There are many ways to install Python. We recommend using Python in a virtual environment to avoid conflicts with other Python installations on your system.

We recommend using uv as a a simple way to create and manage Python virtual environments.

You can manage the python packages that are installed in the virtual environment using a pyproject.toml file. See the pyproject.toml example in this repository for an example of how to manage Python packages. To add package dependencies to the virtual environment, using uv, you can run:

First, install uv using winget (Windows) or brew (MacOS/Linux):

# Install uv
winget install astral-sh.uv

# Install uv
brew install uv

# Install uv
brew install uv

Add libraries to the virtual environment using uv add ...:

> uv add jupyterlab pandas matplotlib seaborn

Coding Conventions

We highly recommend working with a virtual environment to manage Python dependencies. The pyproject.toml is the preferred way to keep track of python dependencies as well as project-specific python conventions.

We recommend using Ruff to enforce linting and formatting rules. In most cases you can use the default linting and formatting rules provided by ruff. However, you can customize the rules by modifying the [tool.ruff] section of the pyproject.toml file in the root of your project. for more about the configuration options, see the Ruff documentation.

If you are working in a virtual environment created in this repository, you automatically have access to Ruff via just lint-py and just fmt-python commands to lint and format your code.

For more inspiration, see the GitLab Data Team’s Python Guide and Google’s Python Style Guide.

Example Usage

In the example below, we show how Python can be used to explore and visualize a dataset.

Important Note

To follow along, you will need to work in a jupyter notebook with the right libraries installed in your environment. Don’t worry if you cannot do this now; we just want to show you what is possible here. We will revisit this example in Processing Data in Python.

Let’s load an example World Bank data via Gapminder using the causaldata package.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from causaldata import gapminder

Load the Gapminder data as a pandas DataFrame:

df = gapminder.load_pandas().data

We can check the dimensions of the DataFrame using df.info():

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

Let’s take a look at the first few rows of the DataFrame using df.head():

df.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

Take a look at the relationship between GDP per Capita and Life Expectancy:

sns.scatterplot(x="gdpPercap", y="lifeExp", hue="continent", data=df).set(
    xscale="log", ylabel="Life Expectancy", xlabel="GDP per Capita"
)

Separate the data by year, focusing on 1957 and 2007:

sns.relplot(
    data=df.where(df["year"].isin([1957, 2007])),
    x="gdpPercap",
    y="lifeExp",
    col="year",
    hue="continent",
    col_wrap=1,
    kind="scatter",
    palette="muted",
).set(xscale="log", ylabel="Life Expectancy", xlabel="GDP per Capita")

How to install Python?

Coding Conventions

Example Usage

Learning Resources