Pandas is the library you will actually use for day-to-day data analysis. It builds on NumPy and adds two data structures, Series (a 1D labelled array) and DataFrame (a 2D table with named columns), plus a large set of methods for loading, cleaning, transforming, and analysing data.
Series — a 1D array with an index
import pandas as pd

rates = pd.Series(
    [0.07, 0.10, 0.12, 0.15, 0.08],
    index=['2020', '2021', '2022', '2023', '2024'],
    name='CBR',
)

rates.mean()         # 0.104
rates['2023']        # 0.15
rates[rates > 0.10]
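As a minimal sketch of what "a 1D array with an index" means in practice, the snippet below rebuilds the same rates Series and pulls its two parts apart. The variable names values, labels, and above_10 are illustrative only.

```python
import pandas as pd

rates = pd.Series(
    [0.07, 0.10, 0.12, 0.15, 0.08],
    index=['2020', '2021', '2022', '2023', '2024'],
    name='CBR',
)

values = rates.to_numpy()    # the underlying NumPy array of numbers
labels = list(rates.index)   # the label attached to each number

# Boolean filtering keeps the labels of the entries that pass the test.
above_10 = rates[rates > 0.10]
print(above_10)
```

Note that the filter is strict, so the 2021 value of exactly 0.10 is not included.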
DataFrame — the workhorse
A DataFrame is a 2D table where each column is a Series with a shared index. Almost all real analysis lives in DataFrames.
df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
    'tier': [1, 1, 1, 1],
})

df.head()      # first 5 rows (this table only has 4, so all of them)
df.shape       # (4, 3)
df.dtypes      # column types
df.describe()  # summary stats for the numeric columns
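To make the "each column is a Series with a shared index" point concrete, here is a small sketch using the same illustrative bank table. The share column is a hypothetical addition for demonstration, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'bank': ['KCB', 'Equity', 'Coop', 'NCBA'],
    'assets_bn': [1500, 1700, 600, 700],
})

assets = df['assets_bn']                     # a single column is a Series
same_index = assets.index.equals(df.index)   # it carries the table's index

# Because every column shares that index, column arithmetic lines up row by row.
df['share'] = df['assets_bn'] / df['assets_bn'].sum()
print(df)
```

Assigning to df['share'] adds a third column in place, which is the usual way derived columns are built.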
Selecting columns and rows
df['bank']                  # one column → Series
df[['bank', 'assets_bn']]   # multiple columns → DataFrame

# loc: label-based
df.loc[0]                                # row labelled 0 (the first row here)
df.loc[df['assets_bn'] > 1000, 'bank']   # bank names where assets exceed 1000

# iloc: position-based
df.iloc[0]          # first row by position
df.iloc[0:2, 0:2]   # first 2 rows, first 2 cols
loc vs iloc — the distinction that catches everyone
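loc selects by label (the values in the index). iloc selects by integer position (0, 1, 2, ...). df.loc[0] works only if the index contains the label 0, which holds for the default index but may not after you filter rows or set a different index. df.iloc[0] returns the first row of any non-empty DataFrame, whatever its labels. When in doubt, use iloc for positional access and loc for everything else.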
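The difference is easiest to see once the index is no longer 0, 1, 2, .... The sketch below reuses the illustrative bank table but indexes it by bank name; the variable names are assumptions made for this example.

```python
import pandas as pd

# Same illustrative figures as before, indexed by bank name instead of 0..3.
banks = pd.DataFrame(
    {'assets_bn': [1500, 1700, 600, 700], 'tier': [1, 1, 1, 1]},
    index=['KCB', 'Equity', 'Coop', 'NCBA'],
)

first_by_position = banks.iloc[0]       # row at position 0 (KCB)
equity_by_label = banks.loc['Equity']   # row whose label is 'Equity'

# The index no longer contains the label 0, so loc[0] raises KeyError.
try:
    banks.loc[0]
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True

# Slices differ too: loc includes the end label, iloc excludes the end position.
by_label = banks.loc['KCB':'Coop']   # 3 rows: KCB, Equity, Coop
by_position = banks.iloc[0:2]        # 2 rows: KCB, Equity
```

The inclusive label slice is a common source of off-by-one surprises when switching between the two accessors.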
Loading data
# CSV
df = pd.read_csv('rates.csv')

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# SQL
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df = pd.read_sql('SELECT * FROM rates', conn)
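The file names above will not exist on your machine, so here is a self-contained sketch that exercises the same calls without touching disk. It assumes a hypothetical CSV string (reusing three of the rates from the Series example) and an in-memory SQLite database in place of rates.csv and mydb.sqlite.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = "year,rate\n2020,0.07\n2021,0.10\n2022,0.12\n"

# read_csv accepts a file path or any file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Write the table to an in-memory SQLite database, then query it back.
conn = sqlite3.connect(':memory:')
from_csv.to_sql('rates', conn, index=False)
from_sql = pd.read_sql('SELECT * FROM rates WHERE rate > 0.08', conn)
conn.close()

print(from_sql)
```

Pushing the filter into the SQL query means only the matching rows are loaded into the DataFrame.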
The Kenyan datasets in your practice environment
The practice environment has three pre-loaded DataFrames: bankrates (21 months of commercial bank lending and deposit rates), pension (14 half-years of pension industry asset allocation), and mpesa (Kenyan mobile-money volumes). Use them in every subsequent exercise.
Exercise
Using the pre-loaded bankrates DataFrame, print its first 5 rows.