Pandas: Series and DataFrames — Python Module 8 | LeadAfrik Public Economics Hub

Pandas is the library you will use for most day-to-day data analysis in Python. It builds on NumPy with two data structures, Series (a 1D labelled array) and DataFrame (a 2D table with named columns), plus a large set of methods for loading, cleaning, transforming, and analysing data.

Series — a 1D array with an index

python

import pandas as pd

rates = pd.Series([0.07, 0.10, 0.12, 0.15, 0.08],
                  index=['2020', '2021', '2022', '2023', '2024'],
                  name='CBR')
rates.mean()          # ≈ 0.104
rates['2023']         # 0.15 (lookup by index label)
rates[rates > 0.10]   # boolean mask: keeps 2022 and 2023

DataFrame — the workhorse

A DataFrame is a 2D table where each column is a Series with a shared index. Almost all real analysis lives in DataFrames.

python

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})

df.head()           # first 5 rows (this table has only 4, so all of them)
df.shape            # (4, 3)
df.dtypes           # column types
df.describe()       # summary stats
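
To make the "each column is a Series with a shared index" point concrete, the short sketch below rebuilds the same table and checks the claim directly. It is an illustration only; the variable names assets and rebuilt are introduced here and are not part of the lesson's code.

```python
import pandas as pd

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})

# Pulling out one column gives a Series named after that column
assets = df['assets_bn']
print(type(assets).__name__)            # Series
print(assets.name)                      # assets_bn

# The column carries the same row index as the DataFrame it came from
print(assets.index.equals(df.index))    # True

# Going the other way: Series that share an index combine into a DataFrame
rebuilt = pd.DataFrame({'bank': df['bank'], 'assets_bn': assets})
print(rebuilt.shape)                    # (4, 2)
```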

Selecting columns and rows

python

df['bank']                    # one column → Series
df[['bank', 'assets_bn']]     # multiple columns → DataFrame

# loc: label-based
df.loc[0]                     # row whose index label is 0
df.loc[df['assets_bn'] > 1000, 'bank']   # bank names where assets exceed 1000

# iloc: position-based
df.iloc[0]                    # first row by position
df.iloc[0:2, 0:2]             # first 2 rows, first 2 cols

loc vs iloc — the distinction that catches everyone

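loc uses labels (the index). iloc uses positions (0, 1, 2, ...). df.loc[0] works only if the index contains the label 0; otherwise it raises a KeyError. df.iloc[0] works on any non-empty DataFrame, because it simply returns the first row. With the default index (0, 1, 2, ...) both return the same row, which is why the difference is easy to miss. When in doubt, use iloc for positional access and loc for everything else.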
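
The difference only becomes visible once the index is no longer 0, 1, 2, .... The sketch below assumes the same bank figures as above, re-indexed by bank name; the variable names (banks, by_label, and so on) are made up for this illustration.

```python
import pandas as pd

# Same bank table as before, but indexed by bank name instead of 0, 1, 2, ...
banks = pd.DataFrame(
    {'assets_bn': [1500, 1700, 600, 700]},
    index=['KCB', 'Equity', 'Coop', 'NCBA'],
)

first_by_position = banks.iloc[0]     # first row, whatever its label is
first_by_label = banks.loc['KCB']     # the row labelled 'KCB'

# The label 0 does not exist in this index, so loc raises KeyError
try:
    banks.loc[0]
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True

# Slices differ too: loc includes the end label, iloc excludes the end position
by_label = banks.loc['KCB':'Coop']    # 3 rows: KCB, Equity, Coop
by_position = banks.iloc[0:2]         # 2 rows: KCB, Equity

print(loc_zero_failed, len(by_label), len(by_position))
```

Note the slicing behaviour in particular: a loc slice keeps both endpoints, while an iloc slice follows the usual Python rule of excluding the end.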

Loading data

python

# CSV
df = pd.read_csv('rates.csv')

# Excel (.xlsx files need an engine such as openpyxl installed)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# SQL
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df = pd.read_sql('SELECT * FROM rates', conn)
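
The snippets above assume that rates.csv, data.xlsx, and mydb.sqlite already exist. If you want to try the loaders without any files, the following is a minimal sketch that uses an in-memory CSV string and an in-memory SQLite database instead. The column names year and cbr and the table name rates are assumptions chosen for this example; the values are the first three from the Series at the top of the page.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for rates.csv: read_csv accepts any file-like object
csv_text = "year,cbr\n2020,0.07\n2021,0.10\n2022,0.12\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
from_csv.to_sql('rates', conn, index=False)   # write the table
from_sql = pd.read_sql('SELECT * FROM rates WHERE cbr > 0.08', conn)
conn.close()

print(from_csv.dtypes)   # pandas infers a type for each column
print(from_sql)          # only the rows that passed the SQL filter
```

Checking dtypes straight after loading is a useful habit: numbers that arrive as text are one of the most common sources of confusing results later on.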

The Kenyan datasets in your practice environment

The practice environment has three pre-loaded DataFrames: bankrates (21 months of commercial bank lending and deposit rates), pension (14 half-years of pension industry asset allocation), and mpesa (Kenyan mobile-money volumes). Use them in every subsequent exercise.

Exercise

Using the pre-loaded bankrates DataFrame, print its first 5 rows.