Data frames and tibbles — R Module 4 | LeadAfrik Public Economics Hub

Data frames are the central data structure for analysis in R. A data frame is a list of equal-length vectors, one per column, displayed as a table. Most functions you'll meet, such as read.csv, lm, summary, and ggplot, take or return data frames.

Creating a data frame

df <- data.frame(
    bank = c("KCB", "Equity", "Coop", "NCBA"),
    assets_bn = c(1500, 1700, 600, 700),
    tier = c(1, 1, 1, 1)
)
head(df)
nrow(df); ncol(df)
str(df)        # structure: types and a sample
summary(df)    # summary statistics
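
The "list of equal-length vectors" description can be checked directly. The following is a minimal sketch that rebuilds the same df and inspects it with ordinary list functions; it is meant as an illustration rather than a required step.

```r
# Same data frame as above, rebuilt so the sketch runs on its own
df <- data.frame(
    bank = c("KCB", "Equity", "Coop", "NCBA"),
    assets_bn = c(1500, 1700, 600, 700),
    tier = c(1, 1, 1, 1)
)

is.list(df)             # TRUE: a data frame is a list underneath
length(df)              # 3: the list length is the number of columns
sapply(df, length)      # each column has the same length, equal to nrow(df)
df[["assets_bn"]]       # list-style extraction returns the column vector
```

This is why list tools such as sapply and [[ ]] work on data frames column by column.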

Selecting and filtering

df$bank                 # column as vector
df[, "bank"]            # same result (a vector)
df[df$assets_bn > 1000, ]   # filter rows
df[1, ]                 # first row
df[, c("bank", "assets_bn")]  # subset columns
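
Two details of bracket subsetting are easy to miss: conditions can be combined, and selecting a single column collapses the result to a vector unless told otherwise. A short sketch, assuming R 4.0 or later (so that bank is stored as character) and using the name big purely for illustration:

```r
df <- data.frame(
    bank = c("KCB", "Equity", "Coop", "NCBA"),
    assets_bn = c(1500, 1700, 600, 700),
    tier = c(1, 1, 1, 1)
)

# Combine conditions with & (and) or | (or); keep the trailing comma
big <- df[df$assets_bn > 1000 & df$tier == 1, ]
big$bank                            # "KCB" "Equity"

# A single column selected with [ , ] collapses to a vector...
class(df[, "bank"])                 # "character" (R 4.0+)
# ...unless drop = FALSE is given, which keeps a one-column data frame
class(df[, "bank", drop = FALSE])   # "data.frame"

# subset() is a base R shorthand that is convenient for interactive use
subset(df, assets_bn < 1000, select = c(bank, assets_bn))
```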

Tibbles — the tidyverse upgrade

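A tibble is a modern take on the data frame with more compact printing, more predictable subsetting ([ always returns a tibble, and $ does not partially match column names), and stricter rules (for example, only length-1 vectors are recycled). Most tidyverse functions return tibbles. A tibble is also a data.frame, so the two are interchangeable for almost all purposes.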

library(tibble)
tb <- tibble(
    bank = c("KCB", "Equity"),
    assets = c(1500, 1700)
)
tb     # prints with column types and dimensions

# Tibbles never auto-convert characters to factors (data.frame did by default before R 4.0)
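
The subsetting differences between the two classes are easiest to see side by side. A minimal sketch, assuming the tibble package is installed and R 4.0 or later; the objects here are small illustrative copies of the example above:

```r
library(tibble)

df <- data.frame(bank = c("KCB", "Equity"), assets = c(1500, 1700))
tb <- as_tibble(df)

# Single-column [ , ] subsetting: data.frame drops to a vector, tibble does not
class(df[, "bank"])     # "character"
class(tb[, "bank"])     # "tbl_df" "tbl" "data.frame"

# $ partial matching: data.frame matches "ass" to "assets"
df$ass                  # 1500 1700
# A tibble returns NULL and warns about the unknown column
suppressWarnings(tb$ass)

# To pull a vector out of a tibble, use $ with the full name or [[ ]]
tb[["assets"]]          # 1500 1700
```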

Reading and writing

df <- read.csv("rates.csv")
df <- read.csv("rates.csv", stringsAsFactors = FALSE)  # pre-4.0 idiom; FALSE is the default since R 4.0

library(readr)
df <- read_csv("rates.csv")  # tidyverse version, faster, returns tibble

write_csv(df, "output.csv")
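
A write-then-read round trip is a quick way to see what readr does with column types. The sketch below assumes the readr package is installed; the temporary file path and the two-row data frame are illustrative stand-ins, not the course's rates.csv.

```r
library(readr)

# Small illustrative data frame
df <- data.frame(
    bank = c("KCB", "Equity"),
    assets_bn = c(1500, 1700)
)

# Write to a temporary file so nothing on disk is overwritten
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Declaring col_types replaces type guessing: a value that does not parse
# as the declared type becomes NA and is listed by problems()
back <- read_csv(
    path,
    col_types = cols(
        bank = col_character(),
        assets_bn = col_double()
    )
)
back
```

Stating the expected types up front is optional, but it documents what the file should contain and makes unexpected values visible at import time.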

Always inspect after reading

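After reading a file, always run str(df), summary(df), and head(df). One stray text value in a numeric column (for example "n/a") can cause the whole column to be imported as character. The import itself raises no error, and numeric operations on that column will then fail or give misleading results.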
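
The effect of a stray text value can be reproduced without any file. In this sketch the CSV content, the "n/a" marker, and the names raw, rates, and fixed are all made up for illustration, and R 4.0 or later is assumed (older versions would import the column as a factor instead).

```r
# Made-up CSV content with one stray text value in a numeric column
raw <- "bank,rate\nKCB,12.5\nEquity,n/a\nCoop,13.1"

rates <- read.csv(text = raw)
str(rates)                  # rate is chr, not num; no error was raised
is.numeric(rates$rate)      # FALSE

# One possible fix: declare the stray marker as a missing value on import
fixed <- read.csv(text = raw, na.strings = c("NA", "n/a"))
str(fixed)                  # rate is now num, with one NA
mean(fixed$rate, na.rm = TRUE)   # 12.8
```

Running str() immediately after the import is what reveals the problem; summary() would show the column as character rather than reporting a minimum, mean, and maximum.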

Exercise

Print the first 5 rows of the pre-loaded bankrates data frame.