Module 08 of 12 · 60 min read · Beginner

Pandas: Series and DataFrames

Creating, indexing, selecting, and the loc/iloc distinction that catches every beginner.


Learning objectives

By the end of this module, you should be able to:

  • Create Series and DataFrames from Python data structures and from CSV/Excel/SQL sources
  • Use df.head(), df.shape, df.dtypes, df.describe() as the first-look inspection pattern
  • Distinguish .loc (label-based) from .iloc (position-based) selection
  • Subset rows and columns confidently with boolean masks and df[[col1, col2]] patterns

Pandas is what you actually use for data analysis. It builds on NumPy with two new data structures — Series (a 1D labelled array) and DataFrame (a 2D table with named columns) — plus a vast set of methods for loading, cleaning, transforming, and analysing data.

Series — a 1D array with an index

python
import pandas as pd

rates = pd.Series([0.07, 0.10, 0.12, 0.15, 0.08],
                  index=['2020', '2021', '2022', '2023', '2024'],
                  name='CBR')
rates.mean()          # 0.104
rates['2023']         # 0.15 (lookup by index label)
rates[rates > 0.10]   # boolean mask: keeps only the values above 0.10
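
The last line above filters with a boolean mask. As a minimal sketch of what happens in that step, the example below builds the same Series from a dict (the keys become the index) and stores the mask in a variable so it can be inspected. The variable names mask and high are illustrative only.

```python
import pandas as pd

# Same values as the example above, built from a dict: keys become the index.
rates = pd.Series(
    {'2020': 0.07, '2021': 0.10, '2022': 0.12, '2023': 0.15, '2024': 0.08},
    name='CBR',
)

# The comparison returns a Series of True/False values with the same index.
mask = rates > 0.10

# Indexing with the mask keeps only the entries where the mask is True.
high = rates[mask]

print(mask.tolist())         # [False, False, True, True, False]
print(high.index.tolist())   # ['2022', '2023']
```

Note that 0.10 itself is excluded, because the comparison is strictly greater than.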

DataFrame — the workhorse

A DataFrame is a 2D table where each column is a Series with a shared index. Almost all real analysis lives in DataFrames.

python
df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})
df.head()      # first 5 rows (all 4 here, since the table is shorter)
df.shape       # (4, 3): rows, columns
df.dtypes      # column types
df.describe()  # summary stats for the numeric columns
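
To make the first-look pattern concrete, here is a small sketch that runs the inspection calls on the same table and pulls out a few values you might check after a load. The variable names (n_rows, numeric_cols, summary) are illustrative, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = df.shape

# select_dtypes picks columns by type; 'number' matches ints and floats.
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# describe() summarises the numeric columns by default; 'bank' is left out.
summary = df.describe()

print(n_rows, n_cols)                     # 4 3
print(numeric_cols)                       # ['assets_bn', 'tier']
print(summary.loc['mean', 'assets_bn'])   # 1125.0
```

The result of describe() is itself a DataFrame, so individual statistics can be read back with .loc as shown in the last line.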

Selecting columns and rows

python
df['bank'] # one column → Series
df[['bank', 'assets_bn']] # multiple columns → DataFrame
# loc: label-based
df.loc[0] # row whose index label is 0 (the first row here, because the index is the default 0, 1, 2, ...)
df.loc[df['assets_bn'] > 1000, 'bank']
# iloc: position-based
df.iloc[0] # first row by position
df.iloc[0:2, 0:2] # first 2 rows, first 2 cols
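
Boolean masks can also be combined. The sketch below, using the same table, shows two conditions joined with & and a mask paired with a list of columns inside loc. The thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
mid_sized = df[(df['assets_bn'] > 650) & (df['assets_bn'] < 1600)]
print(mid_sized['bank'].tolist())   # ['KCB', 'NCBA']

# loc accepts a row mask and a list of column labels in one call.
subset = df.loc[df['assets_bn'] > 1000, ['bank', 'assets_bn']]
print(subset.shape)                 # (2, 2)
```

Python's own and / or keywords do not work on Series; use & and | instead.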

loc vs iloc — the distinction that catches everyone

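loc selects by label (the values in the index). iloc selects by integer position (0, 1, 2, ...). df.loc[0] works only if the index contains the label 0; otherwise it raises a KeyError. df.iloc[0] returns the first row of any non-empty DataFrame, whatever its labels. Slices differ too: loc includes the end label, while iloc excludes the end position. When in doubt, use iloc for positional access and loc for everything else.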
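
The difference only becomes visible once the index is no longer the default 0, 1, 2, .... As a rough sketch, the example below makes the bank name the index (an assumption for illustration; the earlier examples keep the default index) and then tries both accessors.

```python
import pandas as pd

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
}).set_index('bank')   # the index labels are now bank names, not 0..3

first_by_position = df.iloc[0]   # first row, whatever its label
kcb_by_label = df.loc['KCB']     # the row labelled 'KCB'

# There is no label 0 any more, so loc[0] fails with a KeyError.
try:
    df.loc[0]
    loc_zero_worked = True
except KeyError:
    loc_zero_worked = False

# Label slices include the end label; position slices exclude the end.
by_label = df.loc['KCB':'Coop']   # KCB, Equity, Coop
by_position = df.iloc[0:2]        # KCB, Equity

print(loc_zero_worked)                   # False
print(len(by_label), len(by_position))   # 3 2
```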

Loading data

python
# CSV
df = pd.read_csv('rates.csv')
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# SQL
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df = pd.read_sql('SELECT * FROM rates', conn)
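
The snippets above need files on disk. To try the same calls without any files, one option is to read CSV text from memory and use an in-memory SQLite database, as in this sketch. The column names year and cbr and the table name rates are made up for the example, and the values are reused from the Series section.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts any file-like object, so a string buffer stands in for a file.
csv_text = "year,cbr\n2020,0.07\n2021,0.10\n2022,0.12\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# ':memory:' creates a temporary SQLite database that disappears on close.
conn = sqlite3.connect(':memory:')
from_csv.to_sql('rates', conn, index=False)   # write the table
from_sql = pd.read_sql('SELECT * FROM rates WHERE cbr > 0.07', conn)
conn.close()

print(from_csv.shape)              # (3, 2)
print(from_sql['year'].tolist())   # [2021, 2022]
```

Filtering in the SQL query, as the WHERE clause does here, means only the matching rows are loaded into the DataFrame.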

The Kenyan datasets in your practice environment

The practice environment has three pre-loaded DataFrames: bankrates (21 months of commercial bank lending and deposit rates), pension (14 half-years of pension industry asset allocation), and mpesa (Kenyan mobile-money volumes). Use them in every subsequent exercise.

Exercise

Using the pre-loaded bankrates DataFrame, print its first 5 rows.

Key takeaways

  • A DataFrame is a collection of equal-length columns sharing one index; each column is a Series under the hood
  • .loc uses labels (the index); .iloc uses 0-based positions. Knowing which is which prevents most subsetting bugs
  • df.head(), df.shape, df.dtypes, df.describe() — run these four after every load
  • pd.read_csv covers most everyday imports; read_excel and read_sql handle spreadsheets and databases

Further reading

  1. Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter
     Wes McKinney · O'Reilly · 2022. Written by the creator of pandas. The definitive reference.
LeadAfrik · Public Economics Hub