Python

Getting started with Python for data analysis and research, including installation, setup, troubleshooting, and best practices.

Python is a high-level, general-purpose programming language that is widely used in data science, machine learning, and web development. It has a large standard library and a vibrant community that provides a wide range of libraries and tools for various applications. This guide covers Python installation, package management, environment setup, troubleshooting common issues, and best practices for data analysis and research.

How to install Python?

There are many ways to install Python. This guide recommends using Python in a virtual environment to avoid conflicts with other Python installations on your system.

The recommended tool is uv, a simple way to create and manage Python virtual environments.

Installing uv

First, install uv using winget on Windows or Homebrew (brew) on macOS or Linux:

# Install uv on Windows
winget install astral-sh.uv

# Install uv on macOS or Linux
brew install uv

Installing Python Packages

You can manage the Python packages installed in the virtual environment with a pyproject.toml file. The pyproject.toml file in this repository shows one way to do this.

Choose one of the following methods to install packages:

With uv (recommended), add libraries to the virtual environment using uv add:

uv add jupyterlab pandas matplotlib seaborn causaldata

This will install:

  • jupyterlab: Interactive development environment for data science
  • pandas: Data manipulation and analysis
  • matplotlib: Plotting and visualization library
  • seaborn: Statistical data visualization
  • causaldata: Example datasets for causal inference

If you prefer pip (Python’s standard package manager):

pip install jupyterlab pandas matplotlib seaborn causaldata

If you use Anaconda or Miniconda:

conda install jupyterlab pandas matplotlib seaborn
pip install causaldata

Note: causaldata is not available in conda channels, so use pip for that package.

Using Virtual Environments

Using a virtual environment keeps your project packages separate and avoids conflicts. For more guidance, see the virtual environment guide.

Create and activate a virtual environment with uv:

# Create virtual environment
uv venv myproject-env

# Activate it

On Windows:

myproject-env\Scripts\activate

On macOS/Linux:

source myproject-env/bin/activate

Alternatively, create and activate a virtual environment using Python’s built-in venv module:

# Create virtual environment
python -m venv myproject-env

# Activate it

On Windows:

myproject-env\Scripts\activate

On macOS/Linux:

source myproject-env/bin/activate
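The two activation commands differ only in where each platform keeps the activate script. The helper below is a small sketch of that layout; the function name activation_command is made up for illustration and is not part of uv or venv.

```python
import sys
from pathlib import PurePosixPath, PureWindowsPath


def activation_command(env_dir: str, platform: str = sys.platform) -> str:
    """Return the shell command that activates the environment in env_dir."""
    if platform.startswith("win"):
        # Windows environments keep the activation script under Scripts\
        return str(PureWindowsPath(env_dir) / "Scripts" / "activate")
    # macOS and Linux environments keep it under bin/ and need `source`
    return "source " + str(PurePosixPath(env_dir) / "bin" / "activate")


print(activation_command("myproject-env", "win32"))  # myproject-env\Scripts\activate
print(activation_command("myproject-env", "linux"))  # source myproject-env/bin/activate
```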


Version Requirements

To ensure compatibility with the examples and tools used throughout this guide, you will need:

  • Python: 3.8 or higher
  • pandas: 1.3 or higher
  • seaborn: 0.12 or higher (required for seaborn.objects interface)
  • matplotlib: 3.4 or higher
  • numpy: 1.20 or higher
  • jupyter: Latest version recommended

The seaborn.objects interface, used in many visualization examples in this guide, requires seaborn version 0.12 or higher. If you encounter errors related to seaborn.objects, make sure you have the correct version installed.
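One way to compare what is installed against these minimums is sketched below, using only the standard library. The helper names (parse_version, meets_minimum, check_requirements) are illustrative, and the version parsing is deliberately simple: it compares the leading numeric parts only and ignores pre-release suffixes. For stricter comparisons, the third-party packaging library is a common choice.

```python
import re
from importlib import metadata

# Minimum versions taken from the list above.
MINIMUM_VERSIONS = {
    "pandas": "1.3",
    "seaborn": "0.12",
    "matplotlib": "3.4",
    "numpy": "1.20",
}


def parse_version(version: str) -> tuple:
    """Turn '0.12.2' into (0, 12, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that 1.20 counts as newer than 1.3."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUM_VERSIONS) -> dict:
    """Report the installed version of each package, or that it is missing."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = f"{installed} (ok)"
        else:
            report[name] = f"{installed} (needs >= {minimum})"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.3" sorts after "1.20", which would give the wrong answer for numpy.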

Verify Your Installation

After installing Python and the required packages, you should verify that everything is working correctly.

Check Package Versions

Open Python and run the following to check that all packages are installed and verify their versions:

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.objects as so  # fails on seaborn versions older than 0.12

print(f"pandas version: {pd.__version__}")
print(f"seaborn version: {sns.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"numpy version: {np.__version__}")

You should see version numbers printed without errors.

Test seaborn.objects

Make sure seaborn.objects is available by creating a simple plot:

import seaborn as sns
import seaborn.objects as so

# Load built-in dataset
penguins = sns.load_dataset("penguins")

# Create a simple plot
(
    so.Plot(penguins, x="bill_length_mm", y="bill_depth_mm")
    .add(so.Dot())
)

If this runs without errors (and, in a notebook, displays a scatter plot), the installation is complete.

Troubleshooting

Problem: “No module named ‘seaborn.objects’”

Your seaborn version is older than 0.12, the release that introduced seaborn.objects. Update it:

pip install --upgrade seaborn
# or
uv pip install --upgrade seaborn
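Before and after upgrading, you can check whether the module is visible to the interpreter you are actually running. The sketch below uses importlib from the standard library; the helper name module_available is illustrative.

```python
from importlib import util


def module_available(name: str) -> bool:
    """Return True if the module can be located in the current environment."""
    try:
        return util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example seaborn) is itself missing.
        return False


for name in ("seaborn", "seaborn.objects"):
    print(name, "found" if module_available(name) else "missing")
```

If seaborn is found but seaborn.objects is missing, the installed version predates 0.12. If both are missing, seaborn is not installed in this environment at all.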

Problem: Plots not showing

In Jupyter notebooks: Plots should display automatically.

In Python scripts: Add plt.show() at the end:

import matplotlib.pyplot as plt
# ... your plotting code ...
plt.show()

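Or save the plot to a file. For a seaborn.objects plot stored in a variable named plot, call its save method; for other matplotlib figures, call plt.savefig("filename.png") before plt.show():

plot.save("filename.png")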

Problem: Import errors

Make sure you’re using the correct Python environment. Check with:

# On Windows
where python

# On macOS/Linux
which python

If this shows an unexpected Python installation, make sure the virtual environment is activated.
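The same check can be made from inside Python, which helps when a notebook or IDE picks its own interpreter. This is a minimal sketch using the standard library; the helper name in_virtualenv is illustrative, and the test covers environments created by venv or uv, not conda environments.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Environment:", sys.prefix)
print("Inside a virtual environment:", in_virtualenv())
```

Run this in the same place where the import fails (terminal, notebook cell, or IDE console) to see which interpreter that context is using.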

Problem: Permission errors during installation

Try installing with the --user flag:

pip install --user package_name

Or use a virtual environment (recommended) to avoid permission issues.

Setting Up Your Working Environment

Choose one of the following environments for working with Python:

VS Code with Python Extension

  1. Install VS Code
  2. Install the Python extension
  3. Create a Python file (.py) or Jupyter notebook (.ipynb)

VS Code provides excellent support for both scripts and notebooks with features like debugging, linting, and code completion.

Positron

Positron is a new IDE specifically designed for data science:

  1. Download from positron.posit.co
  2. Install and open
  3. Create a Python file or notebook

Positron combines the best features of traditional IDEs with notebook-style interactive computing.

Coding Conventions

We highly recommend working with a virtual environment to manage Python dependencies. The pyproject.toml file is the preferred place to track Python dependencies and project-specific Python conventions.

We recommend using Ruff to enforce linting and formatting rules. In most cases you can use Ruff's default linting and formatting rules. However, you can customize them by modifying the [tool.ruff] section of the pyproject.toml file in the root of your project. For more about the configuration options, see the Ruff documentation.

If you are working in a virtual environment created in this repository, Ruff is already available through the just lint-py and just fmt-python commands, which lint and format your code.

For more inspiration, see the GitLab Data Team’s Python Guide and Google’s Python Style Guide.

Example Usage

The example below shows how to use Python to explore and visualize a dataset.

To follow along, you will need to work in a Jupyter notebook with the right libraries installed in your environment. Don’t worry if you cannot do this now; we just want to show you what is possible here. We will revisit this example in Processing Data in Python.

The following example loads World Bank data from Gapminder using the causaldata package.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from causaldata import gapminder

Load the Gapminder data as a pandas DataFrame:

df = gapminder.load_pandas().data

We can check the number of rows, the column types, and the non-null counts of the DataFrame using df.info():

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

Let’s take a look at the first few rows of the DataFrame using df.head():

df.head()
       country  continent  year  lifeExp       pop   gdpPercap
0  Afghanistan       Asia  1952   28.801   8425333  779.445314
1  Afghanistan       Asia  1957   30.332   9240934  820.853030
2  Afghanistan       Asia  1962   31.997  10267083  853.100710
3  Afghanistan       Asia  1967   34.020  11537966  836.197138
4  Afghanistan       Asia  1972   36.088  13079460  739.981106

Take a look at the relationship between GDP per Capita and Life Expectancy:

sns.scatterplot(data=df, x="gdpPercap", y="lifeExp", hue="continent").set(
    xscale="log", ylabel="Life Expectancy", xlabel="GDP per Capita"
)

Separate the data by year, focusing on 1957 and 2007:

sns.relplot(
    data=df[df["year"].isin([1957, 2007])],
    x="gdpPercap",
    y="lifeExp",
    col="year",
    hue="continent",
    col_wrap=1,
    kind="scatter",
    palette="muted",
).set(xscale="log", ylabel="Life Expectancy", xlabel="GDP per Capita")

Learning Resources
